Shilong Zhang (张士龙)

Shilong is a final-year Ph.D. student (2023–present) at MMLab@The University of Hong Kong (HKU), advised by Prof. Ping Luo. Previously, he worked with Kai Chen and contributed as a core developer to MMDetection and MMCV.

He received his Bachelor's degree from the University of Science and Technology of China (USTC) in 2019, where he was recognized as an Outstanding Graduate (top 4%).

Research

My research agenda connects visual perception/understanding with generation, aiming to build unified models that truly understand the world rather than merely translating between modalities.

  • 2020–2022 · Object Detection
    My research began with object detection, focusing on efficient architectures and training.
  • 2022–2023 · Multimodal Understanding
    With the rise of large language models around 2022, I recognized that multimodal large language models would be central to the future of AI, which led me to explore vision–language models. However, I soon realized a fundamental limitation of dominant VLM paradigms: projecting vision into language space does not genuinely enable models to learn visual knowledge.
  • 2023–2025 · Generative Models
    This insight motivated my transition to vision generation, where models are forced to internalize visual structure, semantics, and dynamics directly from data—much closer to how humans learn by continuously predicting visual content.
  • 2025 · Unification
    By 2025, after I had accumulated substantial experience in generative modeling, it became clear to both the community and me that understanding and generation should be unified. I therefore focus on what I consider the most urgent problem in unified modeling: a single, principled visual encoder that supports both understanding and generation. This led to PS-VAE, a key step toward a unified encoder that enables understanding, generation, and editing within a shared representation space.
  • Next · Video Unified Models
    Looking forward, I believe images alone are insufficient for learning rich, grounded world knowledge. I am particularly excited about video-based unified models, which capture dynamics, causality, and long-term structure, and which I see as a potential next major leap beyond LLMs.

If your team shares this perspective, I would be glad to connect and discuss potential collaborations.

Publications

Only first-/co-first-author papers are listed. Full list: Google Scholar.

[1] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

We systematically adapt understanding-oriented encoder features for generation and editing by jointly regularizing semantics and pixel reconstruction, compressing both into a compact 96-channel latent (16×16 downsampling). This points to the potential of a unified encoder that supports understanding, generation, and editing within a single model backbone.
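
As a rough illustration of this joint objective, here is a minimal PyTorch sketch. The names throughout (JointlyRegularizedEncoder, sem_head, training_losses, the teacher width) are hypothetical, not the actual PS-VAE implementation; it only shows how a pixel-reconstruction loss can be combined with a semantic-alignment loss against a frozen understanding-oriented encoder:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointlyRegularizedEncoder(nn.Module):
        # Hypothetical stand-in for the PS-VAE objective; the real
        # architecture and loss weighting are described in the paper.
        def __init__(self, encoder, decoder, teacher,
                     latent_ch=96, teacher_ch=1024):
            super().__init__()
            self.encoder = encoder         # image -> (B, 96, H/16, W/16)
            self.decoder = decoder         # latent -> (B, 3, H, W)
            self.teacher = teacher.eval()  # frozen understanding encoder
            for p in self.teacher.parameters():
                p.requires_grad_(False)
            # 1x1 projection from the compact latent to the teacher width.
            self.sem_head = nn.Conv2d(latent_ch, teacher_ch, kernel_size=1)

        def training_losses(self, images, w_sem=1.0):
            z = self.encoder(images)
            recon = self.decoder(z)
            # Reconstruction term: the latent must stay decodable to pixels.
            loss_pix = F.l1_loss(recon, images)
            # Semantic term: the latent must stay aligned with the frozen
            # encoder's features (assumed here to share the 1/16 grid).
            with torch.no_grad():
                target = self.teacher(images)  # (B, teacher_ch, H/16, W/16)
            pred = self.sem_head(z)
            loss_sem = 1.0 - F.cosine_similarity(pred, target, dim=1).mean()
            return loss_pix + w_sem * loss_sem

Balancing the two terms reflects the paper's premise that both semantics and reconstruction matter for generation and editing.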

[2] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

Shilong Zhang*, Wenbo Li*, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
AAAI 2026 (* Equal contribution)

(a) We divide generation into a prompt-fidelity stage at low resolution and a quality-enhancement stage at high resolution, dramatically reducing the DiT's computational load. (b) Users can preview the initial low-resolution output and adjust the prompt before committing to full-resolution generation, significantly reducing wait times and improving commercial viability.
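
A minimal sketch of this two-stage flow, with illustrative names and resolutions (generate, stage1, stage2, and preview are not the actual FlashVideo interface): prompt iteration only ever pays for the cheap low-resolution stage, and the expensive high-resolution stage runs once, on an accepted draft.

    def generate(prompt, stage1, stage2, preview):
        """Two-stage generate-then-enhance loop (illustrative sketch,
        not the actual FlashVideo code). stage1/stage2 are callables
        that sample video; preview shows the draft to the user and
        returns a possibly revised prompt."""
        while True:
            # Stage 1: low-resolution pass optimized for prompt fidelity;
            # this is where the DiT compute savings come from.
            draft = stage1(prompt, resolution=(270, 480))
            revised = preview(draft, prompt)
            if revised == prompt:   # user accepts the draft
                break
            prompt = revised        # revise the prompt and retry cheaply
        # Stage 2: enhance the accepted draft to full resolution,
        # flowing fidelity from the draft into fine detail.
        return stage2(prompt, init_video=draft, resolution=(1080, 1920))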

[6] Dense Distinct Query for End-to-End Object Detection

Shilong Zhang*, Xinjiang Wang*, Jiaqi Wang, Jiangmiao Pang, Chengqi Lyu, Wenwei Zhang, Ping Luo, Kai Chen
CVPR 2023 (* Equal contribution)

DDQ-DETR achieves 52.1 AP on the MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors under the same setting.

[9] Scale-equalizing Pyramid Convolution for Object Detection

Xinjiang Wang*, Shilong Zhang*, Zhuoran Yu, Litong Feng, Wayne Zhang
CVPR 2020 (* Equal contribution)

We propose a scale-equalizing pyramid convolution (SEPC) that relaxes the discrepancy between the feature pyramid and the Gaussian pyramid. The module boosts performance by about 3.5 mAP on single-stage object detectors with negligible extra inference time.
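
To make the idea concrete, here is a simplified PyTorch sketch of a convolution shared across pyramid levels (the class and its structure are illustrative; the published SEPC module differs in detail, e.g. in how deformable convolution and normalization are applied across levels):

    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidConv(nn.Module):
        """Illustrative 3-tap convolution over the scale dimension of a
        feature pyramid, with one shared kernel set for all levels."""
        def __init__(self, channels):
            super().__init__()
            # One kernel per relative level: finer, same, coarser.
            self.from_finer = nn.Conv2d(channels, channels, 3,
                                        stride=2, padding=1)
            self.same_level = nn.Conv2d(channels, channels, 3, padding=1)
            self.from_coarser = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, feats):
            # feats: list of (B, C, H_i, W_i) maps ordered fine -> coarse,
            # each level half the resolution of the previous one.
            outs = []
            for i, x in enumerate(feats):
                y = self.same_level(x)
                if i > 0:
                    # stride-2 conv brings the finer level down to this size
                    y = y + self.from_finer(feats[i - 1])
                if i < len(feats) - 1:
                    # conv the coarser level, then upsample it back up
                    z = self.from_coarser(feats[i + 1])
                    y = y + F.interpolate(z, size=x.shape[-2:],
                                          mode="nearest")
                outs.append(y)
            return outs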