Inspiration
At a previous company, it was frustrating to see our game engine constrain what players could do by modeling their behaviour explicitly. Why do we restrict ourselves to such rigid models of storytelling?
What it does
It's like generative-AI fanfic: videos are generated as you write the story.
How we built it
Frontend: TypeScript
Backend: Rust + AWS Lambda
Challenges we ran into
We built a cache for diffusion models. It effectively caches the flow fields of the diffusion process for a given prompt. When a similar prompt comes in, I figure out which cached latent space to reuse and which intermediate layer to start from, matching the cache's scheduler and trajectories to what the new trajectory should be (I'm using Kalman filters for this). Not only does this improve performance, it also improves consistency across scenes, since I can start from a layer that already has some structural components (this part is less reliable).
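The trajectory-matching step above can be sketched with a scalar Kalman filter. This is a minimal illustration, not the actual Inkspell implementation: `kalmanBlend`, the variance values, the one-dimensional stand-in for a latent, and treating the cached flow-field step as the process model are all assumptions.

```typescript
// Hedged sketch: blend a cached diffusion trajectory toward the new
// prompt's observed trajectory with a scalar Kalman filter per step.
// All names and parameter values are illustrative assumptions.
function kalmanBlend(
  cached: number[],
  target: number[],
  processVar = 1e-3,
  obsVar = 1e-2,
): number[] {
  let x = cached[0]; // state estimate (scalar stand-in for a latent)
  let p = 1.0; // estimate variance
  const out = [x];
  for (let i = 1; i < cached.length; i++) {
    // Predict: advance the state by the cached flow-field step.
    x += cached[i] - cached[i - 1];
    p += processVar;
    // Update: correct toward the new prompt's trajectory point.
    const k = p / (p + obsVar); // Kalman gain
    x += k * (target[i] - x);
    p *= 1 - k;
    out.push(x);
  }
  return out;
}
```

The gain `k` controls how strongly each step is pulled toward the fresh trajectory versus the cached structure; when cache and target agree, the blend reproduces the cache exactly.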
Accomplishments that we're proud of
The model does a good job with consistency, but it's unclear how many artifacts tuning the cached flow field actually introduces. Is it even numerically stable? No idea. This is very clearly an engineer's perspective on a problem that researchers seem to solve with better one-step models.
What we learned
Bin-packing GPU requests is not as trivial as I thought it would be; I fall back to Luma when I can't hit the utilization target that makes running our own GPUs viable.
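A minimal sketch of that routing decision, assuming requests are characterized only by VRAM footprint: first-fit decreasing is a standard bin-packing heuristic, not necessarily what Inkspell uses, and `packRequests` with its parameters is hypothetical.

```typescript
// Hedged sketch: first-fit-decreasing packing of GPU requests by VRAM
// footprint. Requests that fit on no local GPU are returned separately
// as fallback candidates (e.g. routed to a hosted API like Luma).
function packRequests(
  requestsGb: number[],
  gpuCapacityGb: number,
  numGpus: number,
): { gpus: number[][]; fallback: number[] } {
  const gpus: number[][] = Array.from({ length: numGpus }, () => []);
  const free: number[] = new Array(numGpus).fill(gpuCapacityGb);
  const fallback: number[] = [];
  // Place largest requests first; they are the hardest to fit.
  for (const req of [...requestsGb].sort((a, b) => b - a)) {
    const i = free.findIndex((f) => req <= f);
    if (i === -1) {
      fallback.push(req); // no local GPU has room: fall back
    } else {
      gpus[i].push(req);
      free[i] -= req;
    }
  }
  return { gpus, fallback };
}
```

Even this toy version shows why the problem is fiddly: a mid-sized request can be unplaceable while plenty of total capacity remains fragmented across GPUs.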
What's next for Inkspell
I want to add a rendered environment for some game modes where you can walk around. I haven't seen many good text-to-Gaussian-splatting models, so maybe I'll try that out.
- Lip syncing
- Sound
- Better consistency
I don't want to add image support because it's very disturbing to see text-to-video model companies adding "kiss your crush" features.
I briefly tried adding guardrails, but they were too aggressive. I'd need to rethink the approach, since AWS guardrails seem effectively useless.