Peter Tong (@TongPetersb) / X

Peter Tong

@TongPetersb

Berkeley '23, CS PhD student at NYU Courant, advised by Professor @ylecun and Professor @sainingxie

Joined July 2022

Pinned
Peter Tong
@TongPetersb
Apr 16
I defended my thesis today! Sincere thanks to my advisors @sainingxie @ylecun and committee members: @mengyer @YiMaTweets @LukeZettlemoyer @liuzhuang1234. I could not have wished for a better PhD life, and I want to thank everyone who was part of this journey. Slides Link:
Peter Tong
@TongPetersb
Oct 14, 2025
The work opened my eyes. Since my PhD, I've been studying visual representations for understanding and generation. I long thought pretrained vision encoders (CLIP, DINO, etc.) produced features too semantic for generation/reconstruction, but that's not true! These features
Saining Xie
@sainingxie
Oct 14, 2025
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
Peter Tong
@TongPetersb
Dec 19, 2024
This project really changed how I think about multimodal models and LLMs. I used to believe that multimodal (visual) prediction required significant changes to the model and heavy pretraining, like Chameleon. But surprisingly, the opposite is true! In large autoregressive models,
Zhuang Liu
@liuzhuang1234
Dec 19, 2024
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
Peter Tong
@TongPetersb
Apr 2, 2025
Vision models have been smaller than language models; what if we scale them up? Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
Peter Tong
@TongPetersb
Apr 15, 2025
We're open-sourcing the training code for MetaMorph! MetaMorph offers a lightweight framework for turning LLMs into unified multimodal models: (multimodal) tokens -> transformers -> diffusion -> pixel! This is our best take on unified modeling as of November 2024, and
Zhuang Liu
@liuzhuang1234
Dec 19, 2024
How far is an LLM from not only understanding but also generating visually? Not very far! Introducing MetaMorph---a multimodal understanding and generation model. In MetaMorph, understanding and generation benefit each other. Very moderate generation data is needed to elicit
GitHub - facebookresearch/metamorph: Code for MetaMorph Multimodal Understanding and Generation via...
From github.com
Peter Tong
@TongPetersb
Apr 24, 2025
We are open-sourcing all the models in Web-SSL, from ViT-L to ViT-7B! It was super fun to train and play with these massive ViTs. Models: huggingface.co/collections/fa… Github: github.com/facebookresear… Huge credit to @DavidJFan for putting these models together!
Peter Tong
@TongPetersb
Apr 2, 2025
Vision models have been smaller than language models; what if we scale them up? Introducing Web-SSL: A family of billion-scale SSL vision models (up to 7B parameters) trained on billions of images without language supervision, using VQA to evaluate the learned representation.
Web-SSL - a facebook Collection
From huggingface.co
Peter Tong
@TongPetersb
Jun 28, 2024
In Cambrian-1, we tried both JAX and TorchXLA, and we eventually found TorchXLA easier than JAX from scratch. We are working to release some tutorials within the next week or so. There are many amazing torch codebases and we hope to help others run and develop those on TPU.
Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
Jun 28, 2024
This is very interesting: The Cambrian-1 models were trained on preemptible TPU-v4s using FSDP with PyTorch XLA! I haven't seen many examples of PyTorch XLA FSDP being used for large-scale real-world usecases so this intrigued me. Of course, it didn't work out of the box.
Peter Tong
@TongPetersb
Aug 11, 2025
Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench (arxiv.org/abs/2406.16860) and Blink (arxiv.org/abs/2404.12390), which repurpose core vision tasks into VQA format. These benchmarks do help
Martin Ziqiao Ma
@ziqiao_ma
Aug 10, 2025
So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it. Most benchmarks have overwhelmingly focused on reasoning and digital domains, without
arxiv.org
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for...
Peter Tong
@TongPetersb
Jun 26, 2024
TLDR: We study benchmarks, data, vision, connectors, and recipes (anything other than LLMs in MLLM), and obtain very competitive performance. We hope our project can be a cornerstone for future MLLM research. Data & Model: huggingface.co/nyu-visionx Code: github.com/cambrian-mllm/…
Saining Xie
@sainingxie
Jun 26, 2024
Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
nyu-visionx (VISIONx @ NYU)
From huggingface.co
Peter Tong
@TongPetersb
Jul 11, 2023
Text-to-image systems like #midjourney, #dalle2, and #stablediffusion produce beautiful images, but are hard to evaluate. Our new system, #MultiMon, automatically finds systematic failures of these systems using language models, without knowing what to look for beforehand [1/7]
Peter Tong
@TongPetersb
Apr 2, 2025
Replying to @TongPetersb
[8/8] See more details: Paper: arxiv.org/abs/2504.01017 Project Page: davidfan.io/webssl/ We hope to release Web-SSL models soon. Thanks to @DavidJFan for co-leading this project and believing in these crazy experiments from the beginning. Grateful to work with @JiachenAI,
arxiv.org
Scaling Language-Free Visual Representation Learning
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is...
Peter Tong
@TongPetersb
Apr 2, 2025
Replying to @TongPetersb
[3/8] Model Scaling: As we scale up ViT from 1B to 7B, our visual SSL continues improving up to 7B parameters, while CLIP plateaus. At 5B+ parameters, Web-DINO matches CLIP performance on average VQA tasks despite using no language supervision.
Peter Tong
@TongPetersb
Dec 19, 2024
Replying to @TongPetersb
For more technical details, check out the paper and website: Paper: arxiv.org/abs/2412.14164 Website: tsb0601.github.io/metamorph/ This has been an incredibly fun internship! I’m deeply grateful to all my mentors and advisors (@liuzhuang1234, @DavidJFan, @JiachenAI, @YoungXiong1,
arxiv.org
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified...
Peter Tong
@TongPetersb
Apr 2, 2025
Replying to @TongPetersb
[7/8] We adopt the Platonic Representation Hypothesis to measure alignment between our Web-SSL models and LLMs like LLaMA. We found increasing model size and data quantity lead to stronger alignment. This provides insight into how vision-only models develop multimodal