David Fan (@DavidJFan) / X

173 posts

David Fan

@DavidJFan

AMI Labs | ex-Meta FAIR | @Princeton CS '19
Building the next revolution of AI models that understand the real world.

New York City

scholar.google.com/citations?user…

Joined June 2013

Pinned
David Fan
@DavidJFan
Mar 4
[1/9] What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models that input and output all modalities, from scratch. Here is what we learned about visual representations,
arxiv.org
Beyond Language Modeling: An Exploration of Multimodal Pretraining
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque....
51K
David Fan
@DavidJFan
Apr 2, 2025
Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
86K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[7/8] This side project started in October when @TongPetersb, @_amirbar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we
20K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[8/8] For more details, please check out the paper and website. Paper: arxiv.org/abs/2504.01017 Website: davidfan.io/webssl/ It was a great privilege and so much fun to work with @TongPetersb, @JiachenAI, @koustuvsinha, @liuzhuang1234, @endernewton, Michael Rabbat, Nicolas
arxiv.org
Scaling Language-Free Visual Representation Learning
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is...
2.9K
David Fan
@DavidJFan
Apr 15, 2025
Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert LLMs into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and
Peter Tong
@TongPetersb
Apr 15, 2025
We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and
GitHub - facebookresearch/metamorph: Code for MetaMorph Multimodal Understanding and Generation via...
From github.com
3K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[1/8] We trained SSL (Web-DINO) and CLIP on the same web-scale dataset (MC-2B images/image-text pairs) and evaluated fairly on VQA by freezing the vision encoders. Our apples-to-apples comparisons enable novel insights into the scaling behavior of SSL in this new data regime.
2.9K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[2/8] Web-DINO (SSL) scales better with model size from 1B - 7B params than CLIP, especially on OCR/Chart VQA – proving that vision-only models can also perform well on tasks that were traditionally dominated by CLIP.
2.8K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[3/8] Why can SSL learn text-sensitive features purely from images? Hypothesis: Web images contain text. Observation: Training Web-DINO on text-heavy image subsets drastically improves OCR/Chart VQA performance, matching CLIP trained on the full data.
2.4K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[4/8] Web-SSL also performs well on classic vision tasks and is competitive with the original DINOv2. This means we are getting closer to developing vision encoders that excel at both pure vision and multimodal capabilities!
2.3K
David Fan
@DavidJFan
Dec 21, 2024
Piggy-backing off @liuzhuang1234 and @TongPetersb to share some thoughts! [1/5] I think understanding and generation are like two sides of the same coin. Although our results suggest that developing a degree of understanding is prerequisite for (efficiently learning)
Zhuang Liu
@liuzhuang1234
Dec 19, 2024
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
5.5K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[5/8] The observed scaling behavior is not unique to DINOv2; other visual SSL methods such as MAE also show similar potential!
2.3K
David Fan
@DavidJFan
Apr 2, 2025
Replying to @DavidJFan
[6/8] Developing higher resolution SSL models is a promising direction. High resolution versions of Web-DINO 7B can nearly match the performance of SOTA off-the-shelf CLIP-family models such as SigLIP 2, despite seeing only images.
2.5K
David Fan
@DavidJFan
May 9, 2025
Welcome Rob! So blessed to have you steer the ship! See you around the office :)
Rob Fergus
@rob_fergus
May 8, 2025
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
916
David Fan
@DavidJFan
May 2, 2025
Replying to @DavidJFan
Web-SSL model weights are now available on GitHub and HuggingFace! You may use your favorite Transformers library API calls or load the model with native PyTorch - up to your preference. For more usage details, please see github.com/facebookresear… HuggingFace collection:
GitHub - facebookresearch/webssl: Code for "Scaling Language-Free Visual Representation Learning"...
From github.com
745