Introducing Genie 3, our state-of-the-art world model that generates interactive worlds from text, enabling real-time interaction at 24 fps with minutes-long consistency at 720p. 🧵👇
We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇
How should we leverage internet videos for learning visual correspondence?
In our latest work we introduce SiamMAE: Siamese Masked Autoencoders for self-supervised representation learning from videos.
web: siam-mae-video.github.io
paper: siam-mae-video.github.io/resources/pape… 👇🧵
3/ One emergent capability I find remarkable is long-term consistency, especially because we don’t use any explicit 3D representations or priors. Simply training the model to generate the next frame autoregressively teaches it to maintain physical consistency across time.
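To make "generate the next frame autoregressively" concrete, here is a minimal sketch of a rollout loop. It is not the Genie 3 implementation: `rollout`, `toy_predictor`, the `window` size, and the flat-list frame format are all hypothetical stand-ins chosen for illustration. It only shows the general pattern in which each new frame is predicted from previously generated frames and then fed back in as context.

```python
from typing import Callable, List, Sequence

# Toy stand-in for an image: a flat list of pixel values (assumption).
Frame = List[float]


def rollout(
    predict_next: Callable[[Sequence[Frame]], Frame],
    context: Sequence[Frame],
    num_steps: int,
    window: int = 4,
) -> List[Frame]:
    """Generate frames one at a time, feeding each output back as input."""
    frames = list(context)
    for _ in range(num_steps):
        # The predictor only sees the most recent `window` frames.
        next_frame = predict_next(frames[-window:])
        frames.append(next_frame)
    return frames


def toy_predictor(history: Sequence[Frame]) -> Frame:
    """Hypothetical 'model': shift the last frame one pixel to the right,
    wrapping around. A real world model would be a learned network."""
    last = history[-1]
    return last[-1:] + last[:-1]


# Start from a single 4-pixel frame and generate 4 more frames.
video = rollout(toy_predictor, [[1.0, 0.0, 0.0, 0.0]], num_steps=4)
```

Nothing in the loop stores a 3D scene. Whatever consistency appears has to come from the predictor and the frames it conditions on, which is the point the post makes.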
1/ Can we replicate the success of large-scale pre-training → task-specific fine-tuning for robotics?
This is hard because robots differ in action/observation spaces, morphology, and learning speed!
We introduce MetaMorph🧵👇
Paper: arxiv.org/abs/2203.11931
Code: github.com/agrimgupta92/m…
When will every pixel be generated? Every 2 years, AI systems can generate 10x more pixels. At this rate of progress, we will have AI-generated TV episodes by 2029 and movies by 2031.
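The extrapolation above is plain exponential growth, and a short sketch shows the arithmetic. The only number taken from the post is the rate (10x every 2 years). The function names and any pixel counts you pass in are hypothetical, since the post gives no baseline figures.

```python
import math


def growth_multiplier(years: float, factor: float = 10.0,
                      period_years: float = 2.0) -> float:
    """How many times more pixels can be generated after `years`,
    assuming a `factor`-fold increase every `period_years`."""
    return factor ** (years / period_years)


def years_to_reach(current_pixels: float, target_pixels: float,
                   factor: float = 10.0, period_years: float = 2.0) -> float:
    """Years needed to grow from `current_pixels` to `target_pixels`
    at the assumed growth rate. Both inputs are illustrative only."""
    if current_pixels <= 0 or target_pixels <= 0:
        raise ValueError("pixel counts must be positive")
    if target_pixels <= current_pixels:
        return 0.0
    ratio = target_pixels / current_pixels
    return period_years * math.log10(ratio) / math.log10(factor)


# A 10x larger pixel budget takes one period (2 years) to reach;
# a 100x larger budget takes two periods (4 years).
two_year_gap = years_to_reach(1.0, 10.0)
four_year_gap = years_to_reach(1.0, 100.0)
```

At this rate, the two-year gap between the 2029 and 2031 predictions corresponds to one 10x step in the number of pixels generated.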
4/ Finally, I think future iterations of models like Genie 3 will have a significant impact on accelerating robotics and real-world AI. Here's a glimpse of what that could look like: an agent pursuing a goal (go to tomatoes) in an environment generated by our model.
1/ Can we build video prediction models using masked visual pretraining with Transformers?
We present MaskViT: a simple & parameter-efficient method to generate high-resolution videos in real time.
Paper: arxiv.org/abs/2206.11894
Web: maskedvit.github.io 🧵👇
Introducing Veo 2, our new, state-of-the-art video model (with better understanding of real-world physics & movement, up to 4K resolution). You can join the waitlist on VideoFX. Our new and improved Imagen 3 model also achieves SOTA results, and is coming today to 100+ countries.
1/ Excited to share that our work on Deep Evolutionary Reinforcement Learning (DERL), a framework for large-scale evolution of embodied agents in physically realistic environments, is now published in @NatureComms
Paper: nature.com/articles/s4146…
Video: youtube.com/watch?v=zltE0w…