Yin Cui
443 posts
Yin Cui
@YinCuiCV
Research Scientist @NVIDIA | Formerly @Google, @Cornell | Views are my own
Mountain View, CA
ycui.me
Joined October 2012
719 Following · 6,940 Followers
  • Yin Cui
    @YinCuiCV
    Jul 4, 2023
    After a wonderful 4-year journey at Google Research, I am starting a new chapter of my career at Nvidia Research!
  • Yin Cui
    @YinCuiCV
    Jul 10, 2023
    Our team at NVIDIA is hiring research interns and full-timers! We focus on generative AI for image/video/3D and are looking for candidates with research experience in vision-language models and transformers. If you are interested, please contact me at [email protected].
  • Yin Cui
    @YinCuiCV
    Apr 23, 2025
    Introducing the Describe Anything Model (DAM), a powerful Multimodal LLM that generates detailed descriptions for user-specified regions in images or videos using points, boxes, scribbles, or masks. Open-source code, models, demo, data, and benchmark at: describe-anything.github.io
  • Yin Cui
    @YinCuiCV
    Apr 23, 2021
    Introducing Video-Audio-Text Transformer (VATT)! VATT is a conv-free Transformer trained from scratch on unlabeled raw video, audio waveform and text, achieving fine-tuning accuracies of 82.1% on Kinetics-400, 39.4% on AudioSet and 78.7% on ImageNet. arxiv.org/abs/2104.11178
  • Yin Cui
    @YinCuiCV
    Jan 25, 2025
    Our team is actively recruiting at various seniority levels. We’re looking for candidates with deep expertise in video generative models, LLMs, VLMs, large-scale model training, or data processing. Join us in shaping the next generation of Cosmos models for Physical AI!
    NVIDIA
    @nvidia
    Jan 16, 2025
    Introducing #NVIDIACosmos, the world foundation model platform built to advance physical #AI. Learn how, through integrations with @nvidiaomniverse, developers can create physics-based, geospatially accurate scenarios. Watch the #CES2025 demo ➡️ nvda.ws/42gViEY
  • Yin Cui
    @YinCuiCV
    Jun 10, 2022
    Thanks to @_tingliu and @AndreasPSteiner, we have GSAM pre-trained models released under ViT/MLP-Mixer: github.com/google-researc… PyTorch code: github.com/juntang-zhuang… Feel free to try GSAM on your favorite models!
    AK
    @_akhaliq
    Mar 16, 2022
    Surrogate Gap Minimization Improves Sharpness-Aware Training abs: arxiv.org/abs/2203.08065 Empirically, GSAM consistently improves generalization (e.g., +3.2% over SAM and +5.4% over AdamW on ImageNet top-1 accuracy for ViT-B/32)
    GitHub - google-research/vision_transformer
  • Yin Cui
    @YinCuiCV
    Apr 29, 2021
    Can we use free-form text to detect any object, especially long-tailed objects? Yes! We train Mask R-CNN by distilling from CLIP to enable zero-shot detection. On rare classes, the model achieves higher AP than its supervised counterpart. arxiv.org/abs/2104.13921
  • Yin Cui
    @YinCuiCV
    Jan 24, 2022
    Looking for a PhD Student Researcher. The topic and time are flexible. If you are interested, please feel free to contact me at [email protected] and apply via: careers.google.com/jobs/results/1…
  • Yin Cui
    @YinCuiCV
    May 28, 2025
    Is self-improvement exclusive to RL? Can we use supervised learning to match LLMs trained with SOTA RL algorithms? In Negative-aware Fine-Tuning (NFT), we introduce a purely supervised learning method to enhance LLMs' math reasoning with no external teachers. NFT matches or
  • Yin Cui
    @YinCuiCV
    Jul 18, 2022
    Can we use audio and motion modalities to improve open-vocabulary video classification? We equip CLIP with cross-modal fusion to leverage multimodal information. Our method, MOV, achieves SOTA results on UCF and HMDB zero-shot action recognition. arxiv.org/abs/2207.07646
  • Yin Cui
    @YinCuiCV
    Oct 3, 2022
    Can we directly build upon a frozen vision and language model (VLM) to detect objects described by text? Yes! Our open-vocabulary detector F-VLM is simpler to train than its closed-vocabulary counterparts and achieves SoTA performance on LVIS. arxiv.org/abs/2209.15639
  • Yin Cui
    @YinCuiCV
    Mar 24, 2025
    Is the video playing forward or backward? None of the current AI models can answer this simple question correctly.
  • Yin Cui
    @YinCuiCV
    Oct 25, 2023
    Our team is hiring research scientists and interns to advance generative AI and democratize content creation!
    Ming-Yu Liu
    @liu_mingyu
    Oct 25, 2023
    Very proud of the team (@chenhsuan_lin @mli0603 Thomas Müller, Alex Evans) that brought this invention to life, which Time Magazine now recognizes as one of the best inventions of 2023. We are hiring researchers of different seniority to join our mission to democratize content
  • Yin Cui
    @YinCuiCV
    May 20, 2025
    We released Cosmos-Reason1 code, model, and part of the data! We also updated our paper to include a section about our RL infra: arxiv.org/abs/2503.15558 - Code: github.com/nvidia-cosmos/… - Model and Data: huggingface.co/collections/nv… - Blog: developer.nvidia.com/blog/curating-…
    Yin Cui
    @YinCuiCV
    Mar 24, 2025
    Is the video playing forward or backward? None of the current AI models can answer this simple question correctly.
    arxiv.org
    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
    Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and...