I’m now a Principal Researcher at Tencent Singapore. Before that, I was a Research Scientist at ByteDance.
I received my Ph.D. and B.E. degrees from Huazhong University of Science and Technology (HUST) in 2020 and 2015 respectively, advised by Prof. Wenyu Liu and Prof. Xinggang Wang. I worked as a visiting student (2018-2019) in the IFP group at the University of Illinois at Urbana-Champaign (UIUC), advised by Prof. Thomas S. Huang, Prof. Yunchao Wei and Prof. Humphrey Shi.
I am currently working on Multimodal Interaction, specializing in multimodal agents, multimodal pretraining, and efficient model architectures.
[Hiring!] Our team is actively recruiting scientists, engineers, and interns at all levels in China and Singapore, with a particular focus on GUI Agents/MLLMs. Feel free to contact me if you are interested.
We propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. With just a 0.077% increase in total FLOPs and no need for additional annotated data, it can significantly improve the performance of visual models.
We systematically compare SAIL’s properties—including scalability, cross-modal information flow patterns, and visual representation capabilities—with those of modular MLLMs.
We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency.
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation by training on a combination of 1.5M labeled images and 62M+ unlabeled images.
We propose a Semantic Prediction Guidance (SPG) module that learns to re-weight local features under the guidance of pixel-wise semantic predictions.
We propose a Scale-Adaptive Network (SAN) consisting of multiple branches, each responsible for segmenting objects within a certain range of scales.
We identify several useful properties, including feature resolution, global context information, and edge details, and perform rigorous analyses to reveal how to leverage them to benefit the human parsing task.
We propose to train a semantic segmentation network starting from the discriminative regions and progressively expanding the pixel-level supervision via seeded region growing.
Last updated on March 09, 2022. Thanks to Jon Barron for this minimalist website template.