Peter Tong
AMI Labs
305 posts
@TongPetersb
Berkeley '23, CS PhD student at NYU Courant, advised by Professor @ylecun and Professor @sainingxie
tsb0601.github.io
Joined July 2022
258 Following · 3,872 Followers
  • Pinned
    Peter Tong
    AMI Labs
    @TongPetersb
    Apr 16
    I defended my thesis today! Sincere thanks to my advisors @sainingxie @ylecun and committee members: @mengyer @YiMaTweets @LukeZettlemoyer @liuzhuang1234. I could not have wished for a better PhD life, and I want to thank everyone who was part of this journey. Slides Link:
    176K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Oct 14, 2025
    The work opened my eyes. Since my PhD, I've been studying visual representations for understanding and generation. I long thought pretrained vision encoders (CLIP, DINO, etc.) produced features too semantic for generation/reconstruction, but that's not true! These features
    Saining Xie
    AMI Labs
    @sainingxie
    Oct 14, 2025
    three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
    59K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Dec 19, 2024
    This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,
    Zhuang Liu
    @liuzhuang1234
    Dec 19, 2024
    How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
    130K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Apr 2, 2025
    Vision models have been smaller than language models; what if we scale them up? Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
    121K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Apr 15, 2025
    We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and
    Zhuang Liu
    @liuzhuang1234
    Dec 19, 2024
    How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
    GitHub - facebookresearch/metamorph: Code for MetaMorph Multimodal Understanding and Generation via...
    From github.com
    27K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Apr 24, 2025
    We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: huggingface.co/collections/fa… Github: github.com/facebookresear… Huge credit to @DavidJFan for putting these models together!
    Peter Tong
    AMI Labs
    @TongPetersb
    Apr 2, 2025
    Vision models have been smaller than language models; what if we scale them up? Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
    Web-SSL - a facebook Collection
    From huggingface.co
    59K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Jun 28, 2024
    In Cambrian-1, we tried both JAX and TorchXLA, and eventually found TorchXLA easier than JAX from scratch. We are working to release some tutorials within the next week or so. There are many amazing torch codebases and we hope to help others run and develop those on TPU.
    Tanishq Mathew Abraham, Ph.D.
    @iScienceLuvr
    Jun 28, 2024
    This is very interesting: The Cambrian-1 models were trained on preemptible TPU-v4s using FSDP with PyTorch XLA! I haven't seen many examples of PyTorch XLA FSDP being used for large-scale real-world usecases so this intrigued me. Of course, it didn't work out of the box.
    37K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Aug 11, 2025
    Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench (arxiv.org/abs/2406.16860) and Blink (arxiv.org/abs/2404.12390), which repurpose core vision tasks into VQA format. These benchmarks do help
    Martin Ziqiao Ma
    @ziqiao_ma
    Aug 10, 2025
    So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it. Most benchmarks have overwhelmingly focused on reasoning and digital domains, without
    arxiv.org
    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
    We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for...
    13K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Jun 26, 2024
    TLDR: We study benchmarks, data, vision, connectors, and recipes (anything other than LLMs in MLLM), and obtain very competitive performance. We hope our project can be a cornerstone for future MLLM research. Data & Model: huggingface.co/nyu-visionx Code: github.com/cambrian-mllm/…
    Saining Xie
    AMI Labs
    @sainingxie
    Jun 26, 2024
    Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
    nyu-visionx (VISIONx @ NYU)
    From huggingface.co
    15K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Jul 11, 2023
    Text-to-image systems like #midjourney, #dalle2, and #stablediffusion produce beautiful images but are hard to evaluate. Our new system, #MultiMon, automatically finds systematic failures of these systems using language models, without knowing what to look for beforehand [1/7]
    14K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Apr 2, 2025
    Replying to @TongPetersb
    [8/8] See more details: Paper: arxiv.org/abs/2504.01017 Project Page: davidfan.io/webssl/ We hope to release Web-SSL models soon. Thanks to @DavidJFan for co-leading this project and believing in these crazy experiments from the beginning. Grateful to work with @JiachenAI,
    arxiv.org
    Scaling Language-Free Visual Representation Learning
    Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is...
    2.5K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Apr 2, 2025
    Replying to @TongPetersb
    [3/8] Model Scaling: As we scale up ViT from 1B to 7B, our visual SSL continues improving up to 7B parameters, while CLIP plateaus. At 5B+ parameters, Web-DINO matches CLIP performance on average VQA tasks despite using no language supervision.
    3.5K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Dec 19, 2024
    Replying to @TongPetersb
    For more technical details, check out the paper and website: Paper: arxiv.org/abs/2412.14164 Website: tsb0601.github.io/metamorph/ This has been an incredibly fun internship! I’m deeply grateful to all my mentors and advisors (@liuzhuang1234, @DavidJFan, @JiachenAI, @YoungXiong1,
    arxiv.org
    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
    In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified...
    19K
  • Peter Tong
    AMI Labs
    @TongPetersb
    Apr 2, 2025
    Replying to @TongPetersb
    [7/8] We adopt the Platonic Representation Hypothesis to measure alignment between our Web-SSL models and LLMs like LLaMA. We found increasing model size and data quantity lead to stronger alignment. This provides insight into how vision-only models develop multimodal
    2.8K
