Hi, I am Peter Tong; I also go by Shengbang Tong (童晟邦). I am a third-year PhD student at NYU Courant CS, advised by Professor Yann LeCun and Professor Saining Xie. I am funded by the OpenAI Superalignment Fellowship (2024-2025) and Meta (2025-2026). I graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors), and Statistics (Honors). I am from Nanjing, China and Melbourne, Australia.
Hard-earned lessons from thousands of hours debugging TPUs (v3-v6e): static shapes, TorchXLA pitfalls, SPMD sharding, and data storage strategies for academic ML research.
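To make the static-shapes point concrete: XLA compiles a fresh graph for every new tensor shape it sees, so variable-length batches cause constant recompilation. Below is a minimal, illustrative sketch (the bucket sizes and toy model are placeholders, not code from the post) that pads inputs to a few fixed bucket lengths so TorchXLA only ever compiles a handful of graphs.

```python
# Illustrative sketch (not from the post): pad variable-length batches to a
# small set of fixed "bucket" lengths so XLA reuses a few compiled graphs
# instead of recompiling on every new sequence length.
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

BUCKETS = (128, 256, 512)  # hypothetical fixed lengths

def pad_to_bucket(batch: torch.Tensor, pad_value: int = 0) -> torch.Tensor:
    """Right-pad the last (sequence) dimension to the smallest bucket that fits."""
    seq_len = batch.shape[-1]
    target = next(b for b in BUCKETS if b >= seq_len)
    return F.pad(batch, (0, target - seq_len), value=pad_value)

device = xm.xla_device()
model = torch.nn.Embedding(1000, 64).to(device)  # stand-in model

# Two batches with different raw lengths map to the same static shape (128),
# so the second step reuses the first step's compiled graph.
for raw_len in (100, 117):
    batch = torch.randint(0, 1000, (8, raw_len))
    out = model(pad_to_bucket(batch).to(device))
    xm.mark_step()  # cut the lazy graph at a stable boundary
```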
We show that pretrained representation encoders (DINO, SigLIP, MAE) paired with trained decoders serve as better latent spaces for diffusion than VAEs, enabling faster convergence, richer semantics, and higher generation quality. We thoroughly develop the method in the ImageNet setting, then demonstrate that it scales effectively to large-scale text-to-image generation.
We introduce Visual SSL 2.0: scaling models and data to the billion scale and adding VQA to the evaluation suite. Vision-only models scale with model size and data size, eventually catching up with or surpassing CLIP models.
Visual understanding and visual generation are mutually beneficial in unified models! However, visual understanding data is much more effective than visual generation data. LLM capabilities such as implicit reasoning can also transfer to unified models!
We provide a vision-centric exploration, or cookbook, of MLLMs, systematically studying visual representations, vision-language connectors, instruction-tuning data, training recipes, and evaluation protocols. We propose new vision-centric benchmarks, a spatially aware connector, and a pipeline for collecting and curating instruction data, and we release highly competitive 8B, 13B, and 34B models on par with GPT-4V and Gemini.
Is vision good enough for language? Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify 'CLIP-blind pairs' - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark.
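A rough sketch of how such pairs can be mined, pairing CLIP against a vision-only encoder (DINOv2 here): two images count as a CLIP-blind pair when their CLIP embeddings are nearly identical while the vision-only encoder still tells them apart. The checkpoints and thresholds below are illustrative placeholders, not the exact benchmark settings.

```python
# Hedged sketch of mining "CLIP-blind pairs": image pairs whose CLIP embeddings
# are nearly identical even though a vision-only encoder (DINOv2 here)
# separates them. Checkpoints and thresholds are illustrative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino = AutoModel.from_pretrained("facebook/dinov2-base")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def embed(images):
    clip_feats = clip.get_image_features(**clip_proc(images=images, return_tensors="pt"))
    dino_feats = dino(**dino_proc(images=images, return_tensors="pt")).last_hidden_state[:, 0]
    return (torch.nn.functional.normalize(clip_feats, dim=-1),
            torch.nn.functional.normalize(dino_feats, dim=-1))

def is_clip_blind(img_a: Image.Image, img_b: Image.Image,
                  clip_thresh: float = 0.95, dino_thresh: float = 0.6) -> bool:
    """True if CLIP sees the two images as near-duplicates but DINOv2 does not."""
    c, d = embed([img_a, img_b])
    return (c[0] @ c[1]).item() > clip_thresh and (d[0] @ d[1]).item() < dino_thresh
```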
Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MULTIMON, a system that automatically identifies systematic failures.