David Fan
AMI Labs
173 posts
@DavidJFan
AMI Labs | ex-Meta FAIR | @Princeton CS '19 Building the next revolution of AI models that understand the real world.
New York City
scholar.google.com/citations?user…
Joined June 2013
487 Following · 1,733 Followers
  • Pinned
    David Fan
    AMI Labs
    @DavidJFan
    Mar 4
    [1/9] What happens when you treat vision as a first-class citizen during multimodal pretraining? To find out, we studied the design space of training Transfusion-style models that input and output all modalities, from scratch. Here is what we learned about visual representations,
    arxiv.org
    Beyond Language Modeling: An Exploration of Multimodal Pretraining
    The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque....
    51K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Can visual SSL match CLIP on VQA? Yes! We show with controlled experiments that visual SSL can be competitive even on OCR/Chart VQA, as demonstrated by our new Web-SSL model family (1B-7B params) which is trained purely on web images – without any language supervision.
    86K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [7/8] This side project started in October when @TongPetersb, @_amirbar, and I were thinking about the rise of CLIP as a popular vision encoder for MLLMs. The community often assumes that language supervision is the primary reason for CLIP's strong performance. However, we
    20K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [8/8] For more details, please check out the paper and website. Paper: arxiv.org/abs/2504.01017 Website: davidfan.io/webssl/ It was a great privilege and so much fun to work with @TongPetersb, @JiachenAI, @koustuvsinha, @liuzhuang1234, @endernewton, Michael Rabbat, Nicolas
    arxiv.org
    Scaling Language-Free Visual Representation Learning
    Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is...
    2.9K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 15, 2025
    Excited to release the training code for MetaMorph! MetaMorph offers a simple yet effective way to convert LLMs into a multimodal LLM that not only takes multimodal inputs, but also generates multimodal outputs via AR prediction. This confers the ability to “think visually”, and
    Peter Tong
    AMI Labs
    @TongPetersb
    Apr 15, 2025
    We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and
    GitHub - facebookresearch/metamorph: Code for MetaMorph Multimodal Understanding and Generation via...
    From github.com
    3K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [1/8] We trained SSL (Web-DINO) and CLIP on the same web-scale dataset (MC-2B images/image-text pairs) and evaluated fairly on VQA by freezing the vision encoders. Our apples-to-apples comparisons enable novel insights into the scaling behavior of SSL in this new data regime.
    2.9K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [2/8] Web-DINO (SSL) scales better with model size than CLIP from 1B to 7B params, especially on OCR/Chart VQA – proving that vision-only models can also perform well on tasks that were traditionally dominated by CLIP.
    2.8K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [3/8] Why can SSL learn text-sensitive features purely from images? Hypothesis: Web images contain text. Observation: Training Web-DINO on text-heavy image subsets drastically improves OCR/Chart VQA performance, matching CLIP trained on the full data.
    2.4K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [4/8] Web-SSL also performs well on classic vision tasks and is competitive with the original DINOv2. This means we are getting closer to developing vision encoders that excel at both pure vision and multimodal capabilities!
    2.3K
  • David Fan
    AMI Labs
    @DavidJFan
    Dec 21, 2024
    Piggybacking off @liuzhuang1234 and @TongPetersb to share some thoughts! [1/5] I think understanding and generation are like two sides of the same coin. Although our results suggest that developing a degree of understanding is a prerequisite for (efficiently learning)
    Zhuang Liu
    @liuzhuang1234
    Dec 19, 2024
    How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
    5.5K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [5/8] The observed scaling behavior is not unique to DINOv2; other visual SSL methods such as MAE also show similar potential!
    2.3K
  • David Fan
    AMI Labs
    @DavidJFan
    Apr 2, 2025
    Replying to @DavidJFan
    [6/8] Developing higher-resolution SSL models is a promising direction. High-resolution versions of Web-DINO 7B can nearly match the performance of SOTA off-the-shelf CLIP-family models such as SigLIP 2, despite seeing only images.
    2.5K
  • David Fan
    AMI Labs
    @DavidJFan
    May 9, 2025
    Welcome Rob! So blessed to have you steer the ship! See you around the office :)
    Rob Fergus
    @rob_fergus
    May 8, 2025
    1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
    916
  • David Fan
    AMI Labs
    @DavidJFan
    May 2, 2025
    Replying to @DavidJFan
    Web-SSL model weights are now available on GitHub and HuggingFace! You may use your favorite Transformers library API calls or load the model with native PyTorch - up to your preference. For more usage details, please see github.com/facebookresear… HuggingFace collection:
    GitHub - facebookresearch/webssl: Code for "Scaling Language-Free Visual Representation Learning"...
    From github.com
    745
