I'm a Member of Technical Staff on the Superintelligence team at Microsoft AI, working on machine learning systems and AI infrastructure for efficient multimodal training and inference.
In Spring 2024, I was a Research Intern on the Creative Vision team at Snap Research, where I proposed BitsFusion, a 1.99-bit quantization method for text-to-image generative models.
In 2022, I was a Research Intern at the Media Lab, Tencent America, exploring the efficiency and robustness of Learned Image Compression and Transformer models.
In 2019, I was a full-time Algorithm Engineer at JD, working on face verification and recognition.
In 2018, I spent a wonderful time at Baidu as a Research and Development Intern and a member of the PaddlePaddle team (20.4k stars now), helping to initiate the deep learning inference framework Paddle-Lite (6.6k stars now), which was featured at the NeurIPS Expo, Baidu Create, and Wave Summit+.
In addition to my academic work, I am passionate about Basketball, DOTA/DOTA2, and World of Warcraft. I love Tracy McGrady, Stephen Curry, Lionel Messi, and PIS (YaphetS).
My research focuses on Efficient AI: developing techniques that make deep learning models resource-efficient without compromising their accuracy or performance. I design compression methods, such as pruning, quantization, and low-rank approximation, that reduce the size and complexity of deep learning models, enabling their deployment on resource-constrained devices like mobile phones and embedded systems.
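To make these directions concrete, below is a minimal NumPy sketch of two of the compression techniques mentioned above: magnitude pruning and symmetric 8-bit quantization. It is a toy illustration under simplified assumptions (a single random weight matrix, one per-tensor scale), not code from any paper or system listed on this page; the function names are invented for the example.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor 8-bit quantization: w ~= scale * q, q in [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to floating point for inspection."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

w_pruned = magnitude_prune(w, sparsity=0.5)  # half of the weights become exact zeros
q, scale = quantize_int8(w_pruned)           # int8 storage is 4x smaller than float32
w_hat = dequantize(q, scale)

print(f"sparsity: {np.mean(w_pruned == 0):.2f}")
print(f"max abs quantization error: {np.abs(w_hat - w_pruned).max():.4f}")
```

In practice these steps are applied per layer and followed by calibration or fine-tuning to recover accuracy; the point here is only the core mechanics of dropping small weights and storing the rest at low precision.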
Specifically, my research areas include:
Efficient AI: Model Compression, Token Compression, Efficient Reasoning.
Machine Learning Systems: Large-scale AI Infra (Training, Inference).
Algorithm-hardware Co-design for AI Model Acceleration.
Text-to-Image/Video Diffusion Models, Large Language Models, Multimodal LLMs.
Previous "Efficient Deep Learning Reading Group" Sessions:
01/2026: One paper is accepted by Transactions on Machine Learning Research (TMLR 2026).
12/2025: I'm excited to join Microsoft AI (MAI) as a Member of Technical Staff!
09/2025: Two papers are accepted by The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).
08/2025: One paper is accepted by Transactions on Machine Learning Research (TMLR 2025).
05/2025: One paper is accepted by Transactions on Machine Learning Research (TMLR 2025).
04/2025: We are so excited that our work DFloat11: Lossless Compression for LLMs is covered by 新智元 and 机器之心.
04/2025: We are so excited that our survey Stop Overthinking is covered by 新智元.
03/2025: We are so excited to release our survey, Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models. We believe efficient reasoning is a very promising research direction!
02/2025: Three papers are accepted by CVPR 2025. One of the papers, TopV, marks my first experience as an advisor, and I'm so proud of this work. Incredible work by Cheng!
10/2024: Invited to deliver a guest lecture, "Model Compression: Pruning, Quantization, and Recent Advances," at Texas A&M University for CSCE 689 Special Topics: Generative AI.
10/2024: I'm glad to join the Department of Computer Science at Rice University as a Postdoctoral Associate.
09/2024: One paper is accepted by The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024).
09/2024: One paper is accepted as Findings by The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).
09/2024: One paper is accepted by 30th Asia and South Pacific Design Automation Conference (ASP-DAC 2025).
07/2024: One paper is accepted by The 35th British Machine Vision Conference (BMVC 2024).
07/2024: One paper is accepted by European Conference on Computer Vision (ECCV 2024).
05/2024: One paper is accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS).
04/2024: I’m glad to receive the Paul Panayotatos Scholarship at Rutgers University.
03/2024: One paper is accepted by IEEE/ACM Design Automation Conference (DAC 2024).
02/2024: I'm glad to join the Creative Vision team at Snap Research as a Research Intern. I love the beautiful beaches of Santa Monica and the vibrant life of Los Angeles.
12/2023: Two papers are accepted as posters by the Data Compression Conference (DCC 2024).
10/2023: One paper is accepted by The International Symposium on High-Performance Computer Architecture (HPCA 2024).
09/2023: One paper is accepted by IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2023).
07/2023: Invited to deliver a talk, "Efficient Diffusion Models and Large Language Models: Quantization, Pruning, and LoRA" (Video).
Paddle-Lite (Paddle-Mobile)
Initial contributors: Yang Sui, Ruilong Liu, Jiaying Zhao, Wang Liu, Yonghui Li.
[Baidu] The authors contributed almost equally to this work. Paddle-Lite GitHub (6.4k stars)
Paddle-Lite is an updated version of Paddle-Mobile, an open-source deep learning framework designed to make it easy to perform inference on mobile, embedded, and IoT devices. It is compatible with PaddlePaddle and with pre-trained models from other sources, and has been featured at the NeurIPS Expo, Baidu Create, and Wave Summit+.
Publicity Chair, International Workshop on DL-Hardware Co-Design for Generative AI Acceleration (DCgAA 2024) @ DAC 2024
Program Committee Member and Reviewer:
NeurIPS'22, 23, 24
ICLR'24
ICML'22, 23, 24
CVPR'22, 23, 24
ICCV'23
ECCV'22, 24
KDD'23
AAAI'22, 23, 24, 25
IROS'23
TNNLS
Teaching Experiences
Teaching Assistant at Rutgers University
14:332:351 - Programming Methodology II, Fall 2020
Instructor: Prof. Saman Zonouz   
14:332:351 - Programming Methodology II, Fall 2023
Instructor: Prof. Yao Liu   
Supervised Students
Daniel Gu, Undergraduate student at Rice University
Topic: Quantization of LLMs.
Eric Chien, Master's student at Rice University
Topic: Multimodal LLMs.
Cheng Yang, Ph.D. student at Rutgers University
Topic: Token pruning in Multimodal LLMs.
[CVPR 2025] TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model.
Wenjin Zhang, Ph.D. student at Rutgers University
Topic: Quantization in Model Compression.
Justin Ding, Master's student at Rutgers University
Topic: Pruning in Model Compression ("Graduate Special Problem").
Linqi Xiao, Master's student at Rutgers University
Topic: Error Correction Coding ("Graduate Special Problem").
Srinihar Bondalapati, Master's student at Rutgers University
Topic: Quantization in Model Compression.
Yue Wang, Master's student at Rutgers University
Topic: Dataset Distillation and Pruning in Model Compression ("Graduate Special Problem").
Veena Vrushi, Undergraduate student at Rutgers University
Topic: Deep Learning ("Project SUPER Research Program").
Vijay Maddila, Graduate student at Rutgers University
Topic: Large Language Models.
Ayan Patel, High school student at High Technology High School
Topic: Deep Learning.
Zhiyu Chen, Master's student at Rutgers University
Topic: Large Language Models ("Graduate Special Problem").
Talks
Efficient Diffusion Models and Large Language Models: Quantization, Pruning, and LoRA (Video).
July 2023.
Model Compression: Pruning, Quantization, and Recent Advances.
Texas A&M University, CSCE 689 Special Topics: Generative AI, Instructor: Dr. Zhengzhong Tu, October, 2024.
Efficient Multimodal LLMs and Efficient Large Reasoning Models.
Rice University, COMP 652 001: Natural Language Processing, Instructor: Dr. Hanjie Chen, April 2025.
Efficient MLLMs and LRMs: Token Pruning in Multimodal Large Language Models and Efficient Reasoning in Large Reasoning Models.
Texas A&M University, CSCE 753 Computer Vision and Robot Perception, Instructor: Dr. Zhengzhong Tu, April, 2025.
Honors & Awards
Doctor of Philosophy (Ph.D.)
Graduate Academic Achievement Awards at Rutgers University
Paul Panayotatos Scholarship at Rutgers University, 2024
Best Paper Runner-Up Award at DCAA Workshop @ AAAI, 2023
Postgraduate Recommendation (top 10% in EE at Jilin University), 2016
Outstanding Graduate, 2016
Second Prize Scholarship, 2014, 2015, 2016
Hobbies & Interests
In addition to my academic work, I am a fan of Basketball, Soccer, Formula 1, and Snooker. I love Tracy McGrady, Stephen Curry, Lionel Messi, and PIS (YaphetS).
I'm an experienced player of DOTA/DOTA2, World of Warcraft, and Warcraft III, playing a Druid (Balance) and a DH (Havoc Demon Hunter) in WoW, and NE (Night Elf) in Warcraft III. I would like to record some impressive moments:
World of Warcraft
World top-10 Balance Druid DPS on H4 (Flamebender Ka'graz) in Blackrock Foundry, as reported on WCL in 2015.
Raid Leader for defeating the Heroic Highmaul raid in Warlords of Draenor in 2014. I gathered fourteen friends, and together we conquered Heroic Highmaul; it was an incredibly impressive experience that we will never forget.
DOTA
Member of the school DOTA team (1/5) at Jilin City No.1 High School, 2011
Hearthstone
Legend Ladder Rank: 147, 2016
Diablo III
Ladder Rank: 698, Witch Doctor, Season 9, 2017
I take great pleasure in listening to music with wonderful rhythm, especially R&B, as well as classical music by Chopin, Bach, and Paganini.
Kim Tae-yeon was my idol during high school, offering me a strong example and encouragement during my most difficult times.
Xiaolan (a name inspired by "Detective Conan"), my adorable grey-and-white cat with sparkling eyes, who appears in my NeurIPS'21 paper "CHIP", is playful, affectionate, and loves to cuddle. Say hi, Xiaolan!