Zonglin Di

Ph.D. Candidate · Computer Science & Engineering · UC Santa Cruz

Vision-Language Models (On-device Unified Model) · Desktop Use Agents · Trustworthy AI · Weakly / Self-supervised Learning

I am a Ph.D. candidate in Computer Science & Engineering at UC Santa Cruz, advised by Prof. Yang Liu. My research focuses on vision-language models (on-device unified models), desktop use agents, and trustworthy AI (machine unlearning, learning with noisy labels, and federated learning).

I am currently a research intern at Adobe Research, distilling cloud-scale unified multimodal models into on-device students. Previously, I interned at Apple (proactive AI and streaming video-language understanding) and twice more at Adobe Research (RLHF for LLMs; vision-language representation learning). Before UCSC, I received my M.S. from UC San Diego, where I worked with Prof. Xiaolong Wang on self-supervised learning, and my B.S. from Tongji University.

Publications

* equal contribution

CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon E. Froehlich

CVPR 2026 · VLM

A benchmark that tests whether vision-language models can plan indoor navigation routes conditioned on a user's physical capabilities, revealing how far current VLMs are from capability-aware, accessibility-conscious navigation.

Label Smoothing Improves Machine Unlearning

Zonglin Di, Zhaowei Zhu, Jinghan Jia*, Jiancheng Liu*, Bo Jiang, Zafar Takhirov, Yuanshun Yao, Sijia Liu, Yang Liu

ICLR 2026 · Trustworthy AI · Machine Unlearning

We reveal a connection between label smoothing and machine unlearning: a simple negative label-smoothing recipe makes models forget target data more thoroughly while preserving utility on the data that should be remembered.

DiffTell: A Comprehensive Dataset for Image Difference Captioning

Zonglin Di, Jing Shi, Yifei Fan, Hao Tan, Alexander Black, John Collomosse, Yang Liu

ICCV 2025 · VLM

A large-scale, comprehensive, quality-controlled dataset for image difference captioning.
Finetuning mainstream VLMs (Qwen, InternVL, LLaVA, …) on DiffTell substantially improves their ability to describe what changed between two images.

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Kaizhi Zheng*, Kaiwen Zhou*, Jing Gu*, Yue Fan*, Jialu Wang*, Zonglin Di, Xuehai He, Xin Eric Wang

NeSy 2025 · VLM · Agentic AI

A neuro-symbolic framework that couples large language models with symbolic commonsense reasoning so conversational embodied agents can follow dialogue, ground instructions, and execute household tasks reliably.

Adversarial Machine Unlearning

Zonglin Di*, Sixie Yu*, Yevgeniy Vorobeychik, Yang Liu

ICLR 2025 · Trustworthy AI · Machine Unlearning

Frames machine unlearning as a Stackelberg game between an unlearner and a membership-inference auditor; the resulting algorithm produces models that forget more convincingly under audit while keeping performance on retained data.

Federated Learning with Local Openset Noisy Labels

Zonglin Di, Zhaowei Zhu, Xiaoxiao Li, Yang Liu

ECCV 2024 · Computer Vision · Trustworthy AI

Studies the realistic federated setting where each client's data contains openset label noise (mislabeled samples from unknown classes), and proposes a theoretically grounded method to learn robustly under it.

A Simple and Provable Approach for Learning on Noisy Labeled Multi-modal Medical Images

Nan Wang, Zonglin Di, Houlin He, Qingchao Jiang, Xiaoxiao Li

ACM MM 2024 · Computer Vision · Trustworthy AI

A simple method with provable guarantees for training on multi-modal medical images whose labels are noisy — a common situation in clinical data collection.

Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning

Yunchao Zhang, Zonglin Di, Kaiwen Zhou, Cihang Xie, Xin Eric Wang

NAACL 2024 · Trustworthy AI

Shows that federated training of vision-and-language navigation agents is vulnerable to poisoning attacks that steer agents along attacker-chosen routes, and studies defenses for building robust embodied agents.

T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation

Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, Xin Eric Wang

ACL 2023 Findings · VLM · Trustworthy AI

Adapts the Implicit Association Test from psychology to text-to-image models, providing a principled way to quantify valence and stereotypical biases in generated images.

Test-Time Personalization with a Transformer for Human Pose Estimation

Miao Hao*, Yizhuo Li*, Zonglin Di*, Nitesh B. Gundavarapu, Xiaolong Wang

NeurIPS 2021 · Computer Vision

Personalizes a human pose estimator to each subject at test time using self-supervised objectives — reconstructing another image of the same person without any annotations — yielding consistent accuracy gains.

Leveraging Crowdsourced GPS Data for Road Extraction from Aerial Imagery

Tao Sun, Zonglin Di, Pengyu Che, Chun Liu, Yin Wang

CVPR 2019 · Computer Vision

Fuses crowdsourced GPS trajectories with satellite imagery for road extraction, improving accuracy by 5% and boosting generalization to unseen areas by 40% over imagery-only baselines.

Preprints

Improving the Capability of Visual Language Models with Occlusion Reasoning

Jinrui Yang*, Zonglin Di*, Ohi Dibua, Qing Liu, Seun Adekunle, Daniil Pakhomov, Darshan Ganesh Prasad, Cihang Xie, Yuyin Zhou

In submission · VLM

Identifies occlusion understanding as a critical blind spot in current VLMs' spatial reasoning, and develops methods that teach models to reason about what is hidden behind what.

Advancing Machine Unlearning Evaluation Requires Rethinking Retraining

Chris Liu*, Zonglin Di*, Jeffrey Flanigan, Yang Liu

In submission · Machine Unlearning

Argues that the standard "retrain-from-scratch" gold standard for evaluating machine unlearning is flawed, and proposes a more reliable evaluation protocol.

Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

Minghao Liu*, Zonglin Di*, et al.

In submission

A framework for automatically constructing high-quality datasets with large language models, reducing the manual cost of dataset curation.

Earlier publications

Zonglin Di, Qi Kang, Daogang Peng, Mengchu Zhou. "Density Peak based Pre-clustering Support Vector Machine for Multi-class Imbalanced Classification." IEEE SMC 2019.
Zonglin Di, Xiaoliang Gong, Jingyu Shi, Hosameldin O. A. Ahmed, Asoke K. Nandi. "Internet Addiction Disorder Detection of Chinese College Students Using Several Personality Questionnaire Data and Support Vector Machine." Addictive Behaviors Reports, Elsevier, 2019.
Tao Sun, Zonglin Di, Yin Wang. "Combining Satellite Imagery and GPS Data for Road Extraction." ACM SIGSPATIAL GeoAI 2018.
Zonglin Di, Siya Yao, Qi Kang, Mengchu Zhou. "AP-SVM: An Adaptive Clustering-based Support Vector Machine Algorithm for Imbalance Classification." IEEE SMC 2018.
Zonglin Di, Xiaoliang Gong, Jingyu Shi, Hosameldin O. A. Ahmed, Asoke K. Nandi. "Detection of Internet Addiction Disorder Based on Personality Questionnaires of Chinese College Students and SVMs." ICSP-BMEI 2017.
Xiaoliang Gong, Bozhong Long, Kun Fang, Zonglin Di, Yichu Hou, Lei Cao. "A Prediction Based on Clustering and Personality Questionnaire Data for IGD Risk." ICNC-FSKD 2016.

Ongoing Projects

SkillsBench — benchmarking how well agent skills work

The first benchmark for evaluating how effectively LLM agents use Skills — modular packages of procedural knowledge attached at inference time.
86 expert-curated tasks across 11 domains with deterministic verifiers, each evaluated with no Skills, curated Skills, and self-generated Skills.
Curated Skills raise average pass rate by 16.2 points (with wide domain variance), while self-generated Skills provide no benefit on average; smaller models with Skills can match larger models without them.

ClawsBench — capability & safety of LLM productivity agents

Evaluates LLM productivity agents across five high-fidelity, production-conformant mock services (Gmail, Slack, Calendar, Docs, Drive), with deterministic state-based scoring that penalizes irreversible harmful actions.
Releases 7,800+ scored agent trajectories and shows that capability and safety do not track together: top models reach 39–64% task success but 7–33% unsafe-action rates.

Resource2Skill — distilling human-created resources into agent skills

A framework that auto-distills multimodal resources (tutorial videos, code repos, articles, reference artifacts) into 3,800+ executable, retrievable agent skills organized as a hierarchical multimodal wiki across seven authoring domains (Web, PowerPoint, Excel, Blender, audio, UE5, CAD).
Two-stage hierarchy-then-LM retrieval and an MCP-based execution runtime let agents compose skills and operate real tools, lifting average task quality by 11.9 points and outperforming Claude Code / Codex baselines on 26 of 28 model–domain cells.

Dynamic MCTS tree with decoupled roles for LLM agents

Decouples the LLM agent into Planner, Actor, and Critic roles with a dynamic MCTS tree, addressing sparse rewards and inaccurate process rewards in long-horizon GUI tasks.

Rethinking LLM evaluation

Analyzes the limitations of Elo-based evaluation for LLMs and develops a robust evaluation framework resilient to dataset-size variation and evaluation noise across chat, reasoning, and agent tasks.

Spatial reasoning for VLMs

Identifies critical limitations in VLMs' spatial reasoning and develops methods to enhance occlusion understanding and video-based spatial reasoning.

Improving GRPO sparse reward with an auxiliary reward

Addresses the sparse-reward problem in GRPO with an auxiliary reward formalized in an unbiased, semi-parametric way.
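
As a rough illustration of the sparse-reward problem mentioned above (not the project's method, whose auxiliary reward is not described here), the sketch below computes GRPO-style group-relative advantages. It assumes the common formulation in which each reward is standardized against the mean and standard deviation of its own sampling group. The function name `group_relative_advantages` and the `eps` constant are hypothetical choices for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse case: every sampled response fails, so all rewards are identical.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed case: a single success among failures separates the responses.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

print(all_fail)
print(one_success)
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is the kind of gap an auxiliary reward would aim to fill.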

Experience

Internships

Jun 2026 – present

Adobe Research, San Jose, CA — Research Intern

Distilling a cloud-scale unified multimodal model for image and text understanding and generation into an on-device student model.
Benchmarking six backbone architectures (dense, modality-split FFN, two-tower, MoE routing, mixture-of-blocks) at iso-compute to find the most distillation-friendly design.
Developing a cloud-to-device weight-inheritance scheme that initializes the student from the teacher, and exploring memory-efficient tokenization with few-step diffusion distillation.

Jun 2025 – Sep 2025

Apple, Sunnyvale, CA — Research Intern

Gathered data to support the development of proactive AI, human comprehension, and models of user behavior.
Established a development pipeline for proactive AI and compared the performance of open-source and proprietary vision-language models.
Researched streaming online video-language understanding models.

Jun 2024 – Nov 2024

Adobe Research, San Jose, CA — Research Intern

Built a new RLHF pipeline for LLMs based on DPO, SimPO, SPIN, and online RLHF.
Investigated how to improve LLM generalization using limited high-quality data.

Jun 2023 – Nov 2023

Adobe Research, San Jose, CA — Research Intern

Researched vision-language representation learning to improve image difference captioning.
Finetuned mainstream vision-language models (Qwen, InternVL, LLaVA, …) on 100+ A100 GPUs.
Built DiffTell, a large-scale image difference captioning dataset, later published at ICCV 2025.

Research

Jun 2022 – present

UC Santa Cruz — Research Assistant (Prof. Yang Liu)

Proposed a theoretically grounded method for openset noisy labels in federated learning (ECCV 2024).
Improved machine unlearning from a Stackelberg-game perspective (ICLR 2025) and a label-smoothing perspective (ICLR 2026).

Jun 2021 – Jun 2022

UBC — Research Assistant (Prof. Xiaoxiao Li)

Worked on medical image registration using self-supervised methods and reinforcement learning.
Studied unsupervised and weakly supervised learning in federated settings, leading to a paper at ACM MM 2024.

Sep 2020 – Jun 2022

UC San Diego — Research Assistant (Prof. Xiaolong Wang)

Improved human pose estimation with test-time training and self-supervised learning, reconstructing another image of the same subject at test time without annotations (NeurIPS 2021).

Jun 2018 – Sep 2020

Tongji University — Research Assistant (Prof. Yin Wang)

Ran large-scale distributed road-extraction training on 40+ GPUs to study the limits of weakly-supervised semantic segmentation.
Fused deep learning with GPS data to improve road extraction, gaining 5% in accuracy and 40% in generalization to new areas (CVPR 2019).

Education

2022 – 2026 (expected)

University of California, Santa Cruz

Ph.D. in Computer Science and Engineering · Advisor: Prof. Yang Liu

2020 – 2022

University of California, San Diego

M.S. in Electrical and Computer Engineering · Advisor: Prof. Xiaolong Wang

2014 – 2019

Tongji University

B.S. in Electrical and Computer Engineering · Advisor: Prof. Yin Wang

Awards & Service

Sep 2022

Chancellor's Fellowship, UC Santa Cruz (one of 2 awarded among all applicants)

Nov 2019

2nd Prize, 30th National "Challenge Cup" Academic Science and Technology Competition, Ministry of Education of China

Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AISTATS, AAAI, ACL, EMNLP, WACV, ACCV, ACM SIGSPATIAL.