Ronghang Hu

Member of Technical Staff, xAI
Email: [email protected]

Google Scholar · GitHub · LinkedIn

About Me / Bio

  • Ronghang Hu is a member of technical staff at xAI, focusing on pushing the frontier of multimodal AI.
  • Previously, Ronghang Hu was a research scientist at Meta FAIR (formerly Facebook AI Research), where he worked on the Segment Anything series of projects to build strong visual perception models and was a core contributor to SAM 2 and SAM 3. He obtained his Ph.D. in Computer Science from the University of California, Berkeley in 2020 and his B.Eng. from Tsinghua University in 2015.

Experiences

  • xAI (Palo Alto, CA; 11/2025 — present)
    Member of Technical Staff
  • Meta FAIR (Menlo Park, CA; 06/2020 — 11/2025)
    Research Scientist
  • Facebook AI Research (Menlo Park, CA; 05/2019 — 08/2019)
    Research Intern
  • Facebook AI Research (Seattle, WA; 05/2017 — 08/2017)
    Research Intern

Education

  • University of California, Berkeley (Berkeley, CA; 08/2015 — 05/2020)
    Ph.D. and M.S. in Computer Science
  • Tsinghua University (Beijing, China; 08/2011 — 07/2015)
    B.Eng. in Electronic Information Science and Technology

Selected Projects


SAM 3: Segment Anything with Concepts

N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. S. Coll-Vinent, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, J. Lei, T. Ma, B. Guo, A. Kalla, M. Marks, J. Greer, M. Wang, P. Sun, R. Rädle, T. Afouras, E. Mavroudi, K. Xu, T.-H. Wu, Y. Zhou, L. Momeni, R. Hazra, S. Ding, S. Vaze, F. Porcher, F. Li, S. Li, A. Kamath, H. K. Cheng, P. Dollár, N. Ravi, K. Saenko, P. Zhang, C. Feichtenhofer
arXiv preprint arXiv:2511.16719, 2025
(PDF, Project, Code, Demo, Blog)

  • Segment Anything Model 3 (SAM 3) is a unified foundation model for promptable segmentation in images and videos. It can detect, segment, and track objects using text or visual prompts such as points, boxes, and masks. Compared to its predecessor SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short text phrase or exemplars.
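The sketch below is purely conceptual and does not use the released SAM 3 code or API; it only contrasts the shape of a SAM/SAM 2-style visual prompt (one object) with a SAM 3-style concept prompt (a noun phrase and/or exemplars selecting all matching instances). The class and field names are illustrative placeholders.

```python
# Conceptual sketch (NOT the released SAM 3 API): what a "concept prompt" adds
# over a classic single-object visual prompt.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class VisualPrompt:
    """SAM / SAM 2-style prompt: points, a box, or a mask selecting ONE object."""
    points: List[Tuple[int, int]] = field(default_factory=list)
    labels: List[int] = field(default_factory=list)   # 1 = foreground, 0 = background
    box: Optional[Tuple[int, int, int, int]] = None

@dataclass
class ConceptPrompt:
    """SAM 3-style prompt: a short noun phrase and/or image exemplars; the model
    is expected to return masks for ALL instances of the concept."""
    phrase: Optional[str] = None                                        # e.g. "yellow school bus"
    exemplar_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)

if __name__ == "__main__":
    single_object = VisualPrompt(points=[(320, 240)], labels=[1])
    all_instances = ConceptPrompt(phrase="striped cat")
    print(single_object, all_instances, sep="\n")
```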

SAM 2: Segment Anything in Images and Videos

N. Ravi*,†, V. Gabeur*, Y.-T. Hu*, R. Hu*, C. Ryali*, T. Ma*, H. Khedr*, R. Rädle*, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, C. Feichtenhofer*,† (*: equal technical contribution, †: equal advising)
International Conference on Learning Representations (ICLR), 2025 — Outstanding Paper Honorable Mentions
(PDF, Project, Code, Demo, Dataset, Blog)

  • Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos. We extend SAM to video by considering images as a video with a single frame. The model design is a simple transformer architecture with streaming memory for real-time video processing.
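A minimal single-image usage sketch with the released sam2 package is shown below; the checkpoint path and config file name are assumptions and should be checked against the official repository. The video predictor follows a similar pattern, propagating prompts across frames with streaming memory.

```python
# Single-image prediction sketch with the sam2 package; checkpoint path and
# config name below are assumptions, not guaranteed to match your install.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"   # assumed local checkpoint path
model_cfg = "sam2_hiera_l.yaml"                  # assumed config file name
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint, device="cpu"))

image = np.array(Image.open("photo.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # A single foreground click; an image is handled as a one-frame video internally.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
print(masks.shape, scores)
```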

Scaling Language-Image Pre-training via Masking

Y. Li*, H. Fan*, R. Hu*, C. Feichtenhofer†, K. He† (*: equal technical contribution, †: equal advising)
Computer Vision and Pattern Recognition (CVPR), 2023
(PDF, Code)

  • We present Fast Language-Image Pre-training (FLIP), which randomly masks out and removes a large portion of image patches during training. This yields a ~3.7x speedup over the original CLIP and improves accuracy on a wide variety of downstream tasks while using the same training data.
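A self-contained sketch of the core idea, not the released training code: randomly drop a large fraction of image patch tokens before they enter the image encoder, so each contrastive training step processes far fewer tokens.

```python
# Illustrative sketch of FLIP-style random patch masking (standalone, PyTorch).
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """patch_tokens: (B, N, D) -> kept tokens of shape (B, N*(1-mask_ratio), D)."""
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patch_tokens.device)   # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]             # lowest-score patches survive
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    return torch.gather(patch_tokens, dim=1, index=keep_idx)

if __name__ == "__main__":
    tokens = torch.randn(8, 196, 768)                # e.g. ViT-B/16 on 224x224 images
    kept = random_mask_patches(tokens, mask_ratio=0.5)
    print(kept.shape)                                # torch.Size([8, 98, 768])
```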

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, S. Xie
Computer Vision and Pattern Recognition (CVPR), 2023
(PDF, Code)

  • We propose a fully convolutional masked autoencoder framework (FCMAE) and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition; this co-design of self-supervised learning and architecture yields the ConvNeXt V2 model family.
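A short sketch of the GRN layer as formulated in the paper (channels-last tensors; illustrative rather than the official implementation): aggregate a per-channel global response, normalize it across channels, and use it to recalibrate the features with a residual path.

```python
# Sketch of Global Response Normalization (GRN) from ConvNeXt V2, channels-last.
import torch
import torch.nn as nn

class GRN(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, H, W, C). Global aggregation: per-channel L2 norm over space.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)
        # Divisive normalization across channels -> relative channel importance.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        # Feature calibration plus a residual path, so GRN starts near identity.
        return self.gamma * (x * nx) + self.beta + x

if __name__ == "__main__":
    layer = GRN(dim=96)
    print(layer(torch.randn(2, 56, 56, 96)).shape)   # torch.Size([2, 56, 56, 96])
```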

FLAVA: A Foundational Language And Vision Alignment Model

A. Singh*, R. Hu*, V. Goswami*, G. Couairon, W. Galuba, M. Rohrbach, D. Kiela (*: equal contribution)
Computer Vision and Pattern Recognition (CVPR), 2022
(PDF, Project Page)

  • We propose FLAVA, a foundational model that performs well across a wide array of 35 tasks spanning all three target modalities: 1) vision, 2) language, and 3) vision & language, and develop an efficient joint pretraining approach over both unimodal and multimodal data.
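A hedged usage sketch through the Hugging Face transformers integration of FLAVA; the facebook/flava-full checkpoint name and the output attribute names are taken from the public documentation and should be verified against the installed transformers version.

```python
# Usage sketch via Hugging Face transformers; checkpoint and output field names
# are assumptions based on the public FLAVA model card and docs.
import torch
from PIL import Image
from transformers import FlavaModel, FlavaProcessor

processor = FlavaProcessor.from_pretrained("facebook/flava-full")
model = FlavaModel.from_pretrained("facebook/flava-full")

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(text=["two cats on a couch"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# FLAVA exposes separate unimodal representations and a fused multimodal one.
print(outputs.image_embeddings.shape,
      outputs.text_embeddings.shape,
      outputs.multimodal_embeddings.shape)
```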

UniT: Multimodal Multitask Learning with a Unified Transformer

R. Hu, A. Singh
International Conference on Computer Vision (ICCV), 2021
(PDF, Project Page)

  • We build UniT, a unified transformer encoder-decoder model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to language understanding and multimodal reasoning.
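The snippet below is a conceptual sketch, not the released UniT code: a shared transformer decoder consumes encoder features together with task-specific query embeddings, and a small per-task head produces the outputs. The task names and output sizes are illustrative only.

```python
# Conceptual multitask sketch: shared decoder, per-task queries and heads.
import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    def __init__(self, task_num_outputs: dict, d_model: int = 256, num_queries: int = 100):
        super().__init__()
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # One learned query set and one output head per task.
        self.queries = nn.ParameterDict({
            t: nn.Parameter(torch.randn(num_queries, d_model))
            for t in task_num_outputs})
        self.heads = nn.ModuleDict({
            t: nn.Linear(d_model, n) for t, n in task_num_outputs.items()})

    def forward(self, encoded: torch.Tensor, task: str) -> torch.Tensor:
        # `encoded` stands in for image and/or text encoder features (omitted here).
        q = self.queries[task].unsqueeze(0).expand(encoded.size(0), -1, -1)
        return self.heads[task](self.decoder(q, encoded))

if __name__ == "__main__":
    # Illustrative output sizes: 91 detection classes, 3129 VQA answer choices.
    model = UniTSketch(task_num_outputs={"detection": 91, "vqa": 3129})
    feats = torch.randn(2, 50, 256)            # stand-in encoder features
    print(model(feats, "detection").shape)     # torch.Size([2, 100, 91])
```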