Inspiration

At a previous company, it was frustrating to watch our game engine constrain what players could do by modeling user behaviour explicitly. Why restrict ourselves to such rigid models of storytelling?

What it does

It's like genai fanfic: videos get generated as you write the story.

How we built it

Frontend: TypeScript
Backend: Rust + Lambda

Challenges we ran into

I built a cache for diffusion models. It effectively caches the flow fields of the diffusion process for a prompt. When I see a similar prompt, I figure out which cached latent I need and the intermediate layer to start from, then match the scheduler and trajectories of the cache to what the new trajectory should be (I'm using Kalman filters for this). Not only does this improve performance, it also improves consistency across scenes when I can start from a layer that already has some structural components (this part is not as reliable).
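The shape of that cache can be sketched roughly like this. Everything here is a simplified stand-in, not the actual implementation: prompts are assumed to be embedded into vectors, latents are flat numpy arrays, and the Kalman update is reduced to a fixed-gain blend instead of full covariance bookkeeping.

```python
# Sketch of a flow-field cache for diffusion trajectories (illustrative only).
# Assumptions: `emb` is a prompt embedding vector, a "trajectory" is the list
# of intermediate latents produced for a prompt, one per denoising step.
import numpy as np

class FlowCache:
    def __init__(self, sim_threshold=0.9):
        self.entries = []  # (prompt_embedding, [latent at each step])
        self.sim_threshold = sim_threshold

    def put(self, emb, trajectory):
        self.entries.append((emb, trajectory))

    def lookup(self, emb):
        """Return the cached trajectory of the most similar prompt, or None."""
        best, best_sim = None, self.sim_threshold
        for cached_emb, traj in self.entries:
            sim = emb @ cached_emb / (
                np.linalg.norm(emb) * np.linalg.norm(cached_emb)
            )
            if sim > best_sim:
                best, best_sim = traj, sim
        return best

def fuse(cached_latent, fresh_latent, gain=0.5):
    """Kalman-style update: treat the cached latent as the prediction and the
    latent computed for the new prompt as the measurement, then blend them.
    A fixed gain stands in for the real filter's covariance-derived gain."""
    return cached_latent + gain * (fresh_latent - cached_latent)
```

On a cache hit you would resume denoising from a fused intermediate latent instead of starting from pure noise, which is where both the speedup and the cross-scene structural consistency come from.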

Accomplishments that we're proud of

The model does a good job with consistency, but it's unclear how many artifacts tuning the cached flow field actually creates. Is it even numerically stable? No idea. This is very clearly an engineer's take on a problem that researchers seem to be solving with better one-step models.

What we learned

Bin-packing GPU requests is not as trivial as I thought it would be, and I fall back to Luma when I can't hit the utilization target that makes running my own GPUs viable.
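The packing-plus-fallback idea can be sketched with a first-fit-decreasing heuristic. This is a toy model, not the real scheduler: requests are assumed to be VRAM sizes in GiB, each GPU is one bin, and the overflow list stands in for requests that get routed to the hosted Luma API.

```python
# Illustrative first-fit-decreasing bin-packing with a fallback path.
# Assumptions: `requests_gib` is a list of VRAM demands, `gpu_vram_gib` is the
# capacity of one GPU, and anything unplaceable goes to a fallback provider.
def pack_requests(requests_gib, gpu_vram_gib=24, max_gpus=4):
    gpus = []                 # remaining free VRAM per opened GPU
    placed, fallback = [], []
    for req in sorted(requests_gib, reverse=True):  # largest first
        for i, free in enumerate(gpus):
            if req <= free:                # first GPU it fits on
                gpus[i] -= req
                placed.append((req, i))
                break
        else:
            # Open a new GPU only if allowed and the request can fit at all;
            # otherwise punt to the fallback provider (Luma in this writeup).
            if len(gpus) < max_gpus and req <= gpu_vram_gib:
                gpus.append(gpu_vram_gib - req)
                placed.append((req, len(gpus) - 1))
            else:
                fallback.append(req)
    return placed, fallback
```

Even this toy version shows the awkward part: the viability target couples packing density to how many GPUs you're willing to keep warm, so the fallback threshold is a business decision as much as an algorithmic one.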

What's next for Inkspell

I want to add a rendered environment for some game modes where you can walk around. I haven't seen too many nice text-to-Gaussian-splatting models, so maybe I'll try that out.

  • Lip syncing
  • Sound
  • Better consistency

I don't want to add image support because it's very disturbing to see text-to-video model companies adding "kiss your crush" features.

I briefly tried adding guardrails, but they were too aggressive. I'd need to rethink the approach, since AWS guardrails seem effectively useless.
