Peter Tong

Hi, I am Peter Tong; I also go by Shengbang Tong (童晟邦). I am a third-year CS PhD student at NYU Courant, advised by Professor Yann LeCun and Professor Saining Xie. My work is funded by the OpenAI Superalignment Fellowship (2024-2025) and Meta (2025-2026). I graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors), and Statistics (Honors). I am from Nanjing, China, and Melbourne, Australia.

Publications

RAE

Representation Autoencoders (RAE)

We show that pretrained representation encoders (DINO, SigLIP, MAE) paired with trained decoders serve as better latent spaces for diffusion than VAEs, enabling faster convergence, richer semantics, and higher generation quality. We develop the method thoroughly in the ImageNet setting, then demonstrate that it scales effectively to large-scale text-to-image generation.

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Shengbang Tong*, Boyang Zheng*, Ziteng Wang*, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
Technical Report, Jan 2026

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
ICLR 2026

Web-SSL

Scaling Language-Free Visual Representation Learning

ICCV 2025 Highlight

We introduce Visual SSL 2.0: scaling models and data to the billion scale and adding VQA to the evaluation suite. Vision-only models scale with model size and data size, eventually catching up to and surpassing CLIP models.

MetaMorph

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

ICCV 2025

Visual understanding and visual generation are mutually beneficial in unified models! However, visual understanding data is much more effective than visual generation data. Capabilities of the underlying LLM, such as implicit reasoning, can also transfer to the unified model.

Cambrian

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

NeurIPS 2024 Oral

We provide a vision-centric exploration, or cookbook, for MLLMs, systematically studying visual representations, vision-language connectors, instruction tuning data, training recipes, and evaluation protocols. We propose new vision-centric benchmarks, a spatial-aware connector, and a pipeline for collecting and curating instruction data, and we release competitive 8B, 13B, and 34B models on par with GPT-4V and Gemini.

MMVP

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

CVPR 2024 Oral

Is vision good enough for language? Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify 'CLIP-blind pairs': images that CLIP perceives as similar despite their clear visual differences. From these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark.

MultiMon

Mass-Producing Failures of Multimodal Systems with Language Models

Shengbang Tong*, Erik Jones*, Jacob Steinhardt
NeurIPS 2023

Deployed multimodal systems can fail in ways that evaluators did not anticipate. To find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures.

© 2026 Peter Tong. Last updated: January 2026.