After a wonderful 4-year journey at Google Research, I am starting a new chapter of my career at Nvidia Research!
Yin Cui
- Our team at NVIDIA is hiring research interns and full-timers! We focus on generative AI for image/video/3D and are looking for candidates with research experience in vision-language models and transformers. If you are interested, please contact me at [email protected].
- Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at: describe-anything.github.io
- Introducing Video-Audio-Text Transformer (VATT)! VATT is a conv-free Transformer trained from scratch on unlabeled raw video, audio waveform and text, achieving fine-tuning accuracies of 82.1% on Kinetics-400, 39.4% on AudioSet and 78.7% on ImageNet. arxiv.org/abs/2104.11178
- Our team is actively recruiting at various seniority levels. We’re looking for candidates with deep expertise in video generative models, LLMs, VLMs, large-scale model training, or data processing. Join us in shaping the next generation of Cosmos models for Physical AI!
  Quoted post: Introducing #NVIDIACosmos, the world foundation model platform built to advance physical #AI. Learn how, through integrations with @nvidiaomniverse, developers can create physics-based, geospatially accurate scenarios. Watch the #CES2025 demo ➡️ nvda.ws/42gViEY
- Thanks to @_tingliu and @AndreasPSteiner, we have GSAM pre-trained models released under ViT/MLP-Mixer: github.com/google-researc… PyTorch code: github.com/juntang-zhuang… Feel free to try GSAM on your favorite models!
  Quoted post: Surrogate Gap Minimization Improves Sharpness-Aware Training abs: arxiv.org/abs/2203.08065 Empirically, GSAM consistently improves generalization (e.g., +3.2% over SAM and +5.4% over AdamW on ImageNet top-1 accuracy for ViT-B/32)
- Can we use free-form text to detect any object, especially long-tailed objects? Yes! We train Mask R-CNN by distilling from CLIP to enable zero-shot detection. The model achieves higher AP compared to its supervised counterpart on rare classes. arxiv.org/abs/2104.13921
- Looking for a PhD Student Researcher. The topic and time are flexible. If you are interested, please feel free to contact me at [email protected] and apply via: careers.google.com/jobs/results/1…
- Is self-improvement exclusive to RL? Can we use supervised learning to match LLMs trained with SOTA RL algorithms? In Negative-aware Fine-Tuning (NFT), we introduce a purely supervised learning method to enhance LLMs' math reasoning with no external teachers. NFT matches or
- Can we use the audio and motion modalities to improve open-vocabulary video classification? We equip CLIP with cross-modal fusion to leverage multimodal information. Our method, MOV, achieves SOTA results on UCF and HMDB zero-shot action recognition. arxiv.org/abs/2207.07646
- Can we directly build upon a frozen vision and language model (VLM) to detect objects described by text? Yes! Our open-vocabulary detector F-VLM is simpler to train than its closed-vocabulary counterparts and achieves SoTA performance on LVIS. arxiv.org/abs/2209.15639
- Is the video playing forward or backward? None of the current AI models can answer this simple question correctly.
- Our team is hiring research scientists and interns to advance generative AI and democratize content creation!
  Quoted post: Very proud of the team (@chenhsuan_lin @mli0603 Thomas Müller, Alex Evans) that brought this invention to life, which Time Magazine now recognizes as one of the best inventions of 2023. We are hiring researchers of different seniority to join our mission to democratize content
- We released Cosmos-Reason1 code, model, and part of the data! We also updated our paper to include a section about our RL infra: arxiv.org/abs/2503.15558
  - Code: github.com/nvidia-cosmos/…
  - Model and Data: huggingface.co/collections/nv…
  - Blog: developer.nvidia.com/blog/curating-…
  Quoted post: Is the video playing forward or backward? None of the current AI models can answer this simple question correctly.
  Link (arxiv.org): Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning. Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and...