Zonglin Di

Ph.D. Candidate · Computer Science & Engineering · UC Santa Cruz

Vision-Language Models (On-device Unified Model) · Desktop Use Agents · Trustworthy AI · Weakly / Self-supervised Learning

I am a Ph.D. candidate in Computer Science & Engineering at UC Santa Cruz, advised by Prof. Yang Liu. My research focuses on vision-language models (on-device unified models), desktop use agents, and trustworthy AI (machine unlearning, learning with noisy labels, and federated learning).

I am currently a research intern at Adobe Research, distilling cloud-scale unified multimodal models into on-device students. Previously, I interned at Apple (proactive AI and streaming video-language understanding) and twice more at Adobe Research (RLHF for LLMs; vision-language representation learning). Before UCSC, I received my M.S. from UC San Diego, where I worked with Prof. Xiaolong Wang on self-supervised learning, and my B.S. from Tongji University.

Publications

* equal contribution

CapNav
Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon E. Froehlich
CVPR 2026 · VLM
A benchmark that tests whether vision-language models can plan indoor navigation routes conditioned on a user's physical capabilities, revealing how far current VLMs are from capability-aware, accessibility-conscious navigation.
Label Smoothing Improves Machine Unlearning
Zonglin Di, Zhaowei Zhu, Jinghan Jia*, Jiancheng Liu*, Bo Jiang, Zafar Takhirov, Yuanshun Yao, Sijia Liu, Yang Liu
ICLR 2026 · Trustworthy AI · Machine Unlearning
We reveal a connection between label smoothing and machine unlearning: a simple negative label-smoothing recipe makes models forget target data more thoroughly while preserving utility on the data that should be remembered.
DiffTell
Zonglin Di, Jing Shi, Yifei Fan, Hao Tan, Alexander Black, John Collomosse, Yang Liu
ICCV 2025 · VLM
  • A large-scale, comprehensive, quality-controlled dataset for image difference captioning.
  • Finetuning mainstream VLMs (Qwen, InternVL, LLaVA, …) on DiffTell substantially improves their ability to describe what changed between two images.
JARVIS
Kaizhi Zheng*, Kaiwen Zhou*, Jing Gu*, Yue Fan*, Jialu Wang*, Zonglin Di, Xuehai He, Xin Eric Wang
NeSy 2025 · VLM · Agentic AI
A neuro-symbolic framework that couples large language models with symbolic commonsense reasoning so conversational embodied agents can follow dialogue, ground instructions, and execute household tasks reliably.
Adversarial Machine Unlearning
Zonglin Di*, Sixie Yu*, Yevgeniy Vorobeychik, Yang Liu
ICLR 2025 · Trustworthy AI · Machine Unlearning
Frames machine unlearning as a Stackelberg game between an unlearner and a membership-inference auditor; the resulting algorithm produces models that forget more convincingly under audit while keeping performance on retained data.
Federated Learning with Local Openset Noisy Labels
Zonglin Di, Zhaowei Zhu, Xiaoxiao Li, Yang Liu
ECCV 2024 · Computer Vision · Trustworthy AI
Studies the realistic federated setting where each client's data contains openset label noise (mislabeled samples from unknown classes), and proposes a theoretically grounded method to learn robustly under it.
Noisy Labeled Multi-modal Medical Images
Nan Wang, Zonglin Di, Houlin He, Qingchao Jiang, Xiaoxiao Li
ACM MM 2024 · Computer Vision · Trustworthy AI
A simple method with provable guarantees for training on multi-modal medical images whose labels are noisy, a common situation in clinical data collection.
Navigation as Attackers Wish?
Yunchao Zhang, Zonglin Di, Kaiwen Zhou, Cihang Xie, Xin Eric Wang
NAACL 2024 · Trustworthy AI
Shows that federated training of vision-and-language navigation agents is vulnerable to poisoning attacks that steer agents along attacker-chosen routes, and studies defenses for building robust embodied agents.
T2IAT
Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, Xin Eric Wang
ACL 2023 Findings · VLM · Trustworthy AI
Adapts the Implicit Association Test from psychology to text-to-image models, providing a principled way to quantify valence and stereotypical biases in generated images.
Test-Time Personalization for Pose Estimation
Miao Hao*, Yizhuo Li*, Zonglin Di*, Nitesh B. Gundavarapu, Xiaolong Wang
NeurIPS 2021 · Computer Vision
Personalizes a human pose estimator to each subject at test time using self-supervised objectives (reconstructing another image of the same person without any annotations), yielding consistent accuracy gains.
Road Extraction from Aerial Imagery
Tao Sun, Zonglin Di, Pengyu Che, Chun Liu, Yin Wang
CVPR 2019 · Computer Vision
Fuses crowdsourced GPS trajectories with satellite imagery for road extraction, improving accuracy by 5% and boosting generalization to unseen areas by 40% over imagery-only baselines.
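As a rough illustration of the negative label-smoothing idea mentioned under "Label Smoothing Improves Machine Unlearning", the sketch below applies the generic label-smoothing formula with a negative rate. It is not the paper's full recipe: the helper name `smooth_labels`, the rate `alpha`, and the four-class example are all assumptions made here for illustration.

```python
def smooth_labels(one_hot, alpha):
    """Generic label smoothing: (1 - alpha) * y + alpha / K for K classes.

    `alpha` is a hypothetical smoothing rate; a negative value is what
    "negative label smoothing" refers to in this sketch.
    """
    k = len(one_hot)
    return [(1.0 - alpha) * y + alpha / k for y in one_hot]


one_hot = [0.0, 1.0, 0.0, 0.0]           # true class is index 1
positive = smooth_labels(one_hot, 0.2)   # standard smoothing softens the target
negative = smooth_labels(one_hot, -0.2)  # negative rate sharpens it beyond one-hot

print(positive)
print(negative)
```

With a positive rate the target for the true class drops below 1 and the other classes receive a small positive share; with a negative rate the true-class target exceeds 1 and the other classes go slightly negative, while the entries still sum to 1.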

Preprints

Improving the Capability of Visual Language Models with Occlusion Reasoning
Jinrui Yang*, Zonglin Di*, Ohi Dibua, Qing Liu, Seun Adekunle, Daniil Pakhomov, Darshan Ganesh Prasad, Cihang Xie, Yuyin Zhou
In submission · VLM
Identifies occlusion understanding as a critical blind spot in current VLMs' spatial reasoning, and develops methods that teach models to reason about what is hidden behind what.
Advancing Machine Unlearning Evaluation Requires Rethinking Retraining
Chris Liu*, Zonglin Di*, Jeffrey Flanigan, Yang Liu
In submission · Machine Unlearning
Argues that the standard "retrain-from-scratch" gold standard for evaluating machine unlearning is flawed, and proposes a more reliable evaluation protocol.
Minghao Liu*, Zonglin Di*, et al.
In submission
A framework for automatically constructing high-quality datasets with large language models, reducing the manual cost of dataset curation.

Ongoing Projects

SkillsBench: benchmarking how well agent skills work
  • The first benchmark for evaluating how effectively LLM agents use Skills, modular packages of procedural knowledge attached at inference time.
  • 86 expert-curated tasks across 11 domains with deterministic verifiers, each evaluated with no Skills, curated Skills, and self-generated Skills.
  • Curated Skills raise average pass rate by 16.2 points (with wide domain variance), while self-generated Skills provide no benefit on average; smaller models with Skills can match larger models without them.
ClawsBench: capability & safety of LLM productivity agents
  • Evaluates LLM productivity agents across five high-fidelity, production-conformant mock services (Gmail, Slack, Calendar, Docs, Drive), with deterministic state-based scoring that penalizes irreversible harmful actions.
  • Releases 7,800+ scored agent trajectories and shows that capability and safety do not track together: top models reach 39–64% task success but 7–33% unsafe-action rates.
Resource2Skill: distilling human-created resources into agent skills
  • A framework that auto-distills multimodal resources (tutorial videos, code repos, articles, reference artifacts) into 3,800+ executable, retrievable agent skills organized as a hierarchical multimodal wiki across seven authoring domains (Web, PowerPoint, Excel, Blender, audio, UE5, CAD).
  • A two-stage hierarchy-then-LM retrieval and an MCP-based execution runtime let agents compose skills and operate real tools, lifting average task quality by 11.9 points and outperforming Claude Code / Codex baselines on 26 of 28 model–domain cells.
Dynamic MCTS tree with decoupled roles for LLM agents
Decouples the LLM agent into Planner, Actor, and Critic roles over a dynamic MCTS tree, addressing sparse rewards and inaccurate process rewards in long-horizon GUI tasks.
Rethinking LLM evaluation
Analyzes the limitations of Elo-based evaluation for LLMs and develops a robust evaluation framework resilient to dataset-size variation and evaluation noise across chat, reasoning, and agent tasks.
Spatial reasoning for VLMs
Identifies critical limitations in VLMs' spatial reasoning and develops methods to enhance occlusion understanding and video-based spatial reasoning.
Improving GRPO sparse reward with an auxiliary reward
Addresses the sparse-reward problem in GRPO with an auxiliary reward formalized in an unbiased, semi-parametric way.
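To make the sparse-reward issue in GRPO concrete, here is a minimal sketch of the group-relative advantage that GRPO is commonly described as using: each sampled response's reward is normalized by the mean and standard deviation of its group. It does not implement the auxiliary reward from this project. The function name `group_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small `eps` for
    numerical stability; implementations differ on these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse reward: every sampled response fails, so all advantages are zero
# and the group contributes no learning signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# A single success in the group is enough to produce a non-zero signal.
one_success = group_advantages([0.0, 0.0, 0.0, 1.0])

print(all_fail)
print(one_success)
```

When all rewards in a group are identical, every advantage is zero, which is one way to see why sparse rewards stall training and why an additional reward signal can help.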

Experience

Internships

Jun 2026 – present
Adobe Research, San Jose, CA · Research Intern
  • Distilling a cloud-scale unified multimodal model for image and text understanding and generation into an on-device student model.
  • Benchmarking six backbone architectures (dense, modality-split FFN, two-tower, MoE routing, mixture-of-blocks) at iso-compute to find the most distillation-friendly design.
  • Developing a cloud-to-device weight-inheritance scheme that initializes the student from the teacher, and exploring memory-efficient tokenization with few-step diffusion distillation.
Jun 2025 – Sep 2025
Apple, Sunnyvale, CA · Research Intern
  • Gathered data to support the development of proactive AI, human comprehension, and models of user behavior.
  • Established a development pipeline for proactive AI and compared the performance of open-source and proprietary vision-language models.
  • Researched streaming online video-language understanding models.
Jun 2024 – Nov 2024
Adobe Research, San Jose, CA · Research Intern
  • Built a new RLHF pipeline for LLMs based on DPO, SimPO, SPIN, and online RLHF.
  • Investigated how to improve LLM generalization using limited high-quality data.
Jun 2023 – Nov 2023
Adobe Research, San Jose, CA · Research Intern
  • Researched vision-language representation learning to improve image difference captioning.
  • Finetuned mainstream vision-language models (Qwen, InternVL, LLaVA, …) on 100+ A100 GPUs.
  • Built DiffTell, a large-scale image difference captioning dataset, later published at ICCV 2025.

Research

Jun 2022 – present
UC Santa Cruz · Research Assistant (Prof. Yang Liu)
  • Proposed a theoretically grounded method for openset noisy labels in federated learning (ECCV 2024).
  • Improved machine unlearning from a Stackelberg-game perspective (ICLR 2025) and a label-smoothing perspective (ICLR 2026).
Jun 2021 – Jun 2022
UBC · Research Assistant (Prof. Xiaoxiao Li)
  • Worked on medical image registration using self-supervised methods and reinforcement learning.
  • Studied un/weakly-supervised learning in federated settings, leading to a paper at ACM MM 2024.
Sep 2020 – Jun 2022
UC San Diego · Research Assistant (Prof. Xiaolong Wang)
  • Improved human pose estimation with test-time training and self-supervised learning, reconstructing another image of the same subject at test time without annotations (NeurIPS 2021).
Jun 2018 – Sep 2020
Tongji University · Research Assistant (Prof. Yin Wang)
  • Ran large-scale distributed road-extraction training on 40+ GPUs to study the limits of weakly-supervised semantic segmentation.
  • Fused deep learning with GPS data to improve road extraction, gaining 5% in accuracy and 40% in generalization to new areas (CVPR 2019).

Education

2022 – 2026 (expected)
Ph.D. in Computer Science and Engineering · Advisor: Prof. Yang Liu
2020 – 2022
M.S. in Electrical and Computer Engineering · Advisor: Prof. Xiaolong Wang
2014 – 2019
B.S. in Electrical and Computer Engineering · Advisor: Prof. Yin Wang

Awards & Service

Sep 2022
Chancellor's Fellowship, UC Santa Cruz (2 awarded among all applicants)
Nov 2019
2nd Prize, 30th National "Challenge Cup" Academic Science and Technology Competition, Ministry of Education of China

Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, AISTATS, AAAI, ACL, EMNLP, WACV, ACCV, ACM SIGSPATIAL.