Yin Cui (@YinCuiCV) / X

443 posts

Research Scientist @NVIDIA | Formerly @Google, @Cornell | Views are my own

Mountain View, CA

Joined October 2012

Yin Cui
@YinCuiCV
Jul 4, 2023
After a wonderful 4-year journey at Google Research, I am starting a new chapter of my career at Nvidia Research!
49K
Yin Cui
@YinCuiCV
Jul 10, 2023
Our team at NVIDIA is hiring research interns and full-timers! We focus on generative AI for image/video/3D and are looking for candidates with research experience in vision-language models and transformers. If you are interested, please contact me at [email protected].
62K
Yin Cui
@YinCuiCV
Apr 23, 2025
Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at: describe-anything.github.io
35K
Yin Cui
@YinCuiCV
Apr 23, 2021
Introducing Video-Audio-Text Transformer (VATT)! VATT is a conv-free Transformer trained from scratch on unlabeled raw video, audio waveform and text, achieving fine-tuning accuracies of 82.1% on Kinetics-400, 39.4% on AudioSet and 78.7% on ImageNet. arxiv.org/abs/2104.11178
Yin Cui
@YinCuiCV
Jan 25, 2025
Our team is actively recruiting at various seniority levels. We’re looking for candidates with deep expertise in video generative models, LLMs, VLMs, large-scale model training, or data processing. Join us in shaping the next generation of Cosmos models for Physical AI!
NVIDIA
@nvidia
Jan 16, 2025
Introducing #NVIDIACosmos, the world foundation model platform built to advance physical #AI. Learn how, through integrations with @nvidiaomniverse, developers can create physics-based, geospatially accurate scenarios. Watch the #CES2025 demo ➡️ nvda.ws/42gViEY
44K
Yin Cui
@YinCuiCV
Jun 10, 2022
Thanks to @_tingliu and @AndreasPSteiner, we have GSAM pre-trained models released under ViT/MLP-Mixer: github.com/google-researc… PyTorch code: github.com/juntang-zhuang… Feel free to try GSAM on your favorite models!
AK
@_akhaliq
Mar 16, 2022
Surrogate Gap Minimization Improves Sharpness-Aware Training abs: arxiv.org/abs/2203.08065 Empirically, GSAM consistently improves generalization (e.g., +3.2% over SAM and +5.4% over AdamW on ImageNet top-1 accuracy for ViT-B/32)
GitHub - google-research/vision_transformer
From github.com
Yin Cui
@YinCuiCV
Apr 29, 2021
Can we use free-form text to detect any object, especially long-tailed objects? Yes! We train Mask R-CNN by distilling from CLIP to enable zero-shot detection. The model achieves higher AP compared to its supervised counterpart on rare classes. arxiv.org/abs/2104.13921
Yin Cui
@YinCuiCV
Jan 24, 2022
Looking for a PhD Student Researcher. The topic and time are flexible. If you are interested, please feel free to contact me at [email protected] and apply via: careers.google.com/jobs/results/1…
Yin Cui
@YinCuiCV
May 28, 2025
Is self-improvement exclusive to RL? Can we use supervised learning to match LLMs trained with SOTA RL algorithms? In Negative-aware Fine-Tuning (NFT), we introduce a purely supervised learning method to enhance LLMs' math reasoning with no external teachers. NFT matches or
22K
Yin Cui
@YinCuiCV
Jul 18, 2022
Can we use audio and motion modalities to improve open-vocabulary video classification? We equip CLIP with cross-modal fusion to leverage multimodal information. Our method MOV achieves SOTA results on UCF and HMDB zero-shot action recognition. arxiv.org/abs/2207.07646
Yin Cui
@YinCuiCV
Oct 3, 2022
Can we directly build upon a frozen vision and language model (VLM) to detect objects described by text? Yes! Our open-vocabulary detector F-VLM is simpler to train than closed-vocabulary counterparts and achieves SoTA performance on LVIS. arxiv.org/abs/2209.15639
Yin Cui
@YinCuiCV
Mar 24, 2025
Is the video playing forward or backward? None of the current AI models can answer this simple question correctly.
34K
Yin Cui
@YinCuiCV
Oct 25, 2023
Our team is hiring research scientists and interns to advance generative AI and democratize content creation!
Ming-Yu Liu
@liu_mingyu
Oct 25, 2023
Very proud of the team (@chenhsuan_lin @mli0603 Thomas Müller, Alex Evans) that brought this invention to life, which Time Magazine now recognizes as one of the best inventions of 2023. We are hiring researchers of different seniority to join our mission to democratize content
41K
Yin Cui
@YinCuiCV
May 20, 2025
We released Cosmos-Reason1 code, model, and part of the data! We also updated our paper to include a section about our RL infra: arxiv.org/abs/2503.15558 - Code: github.com/nvidia-cosmos/… - Model and Data: huggingface.co/collections/nv… - Blog: developer.nvidia.com/blog/curating-…
Yin Cui
@YinCuiCV
Mar 24, 2025
Is the video playing forward or backward? None of the current AI models can answer this simple question correctly.
arxiv.org
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and...
13K