I am a Research Scientist at NVIDIA Research, pursuing research on Adaptive Physical Intelligence, with a focus on developing efficient, adaptive AI systems spanning vision-language-action (VLA) models, world modeling, embodied reasoning, and physical AI.
I received my Ph.D. from National Taiwan University (NTU) in Jul. 2023, supervised by Prof. Yu-Chiang Frank Wang. Previously, I was a research intern at NVIDIA Research (Feb. 2023 - Aug. 2023), working on efficient model personalization and vision-language models. I was also a Ph.D. program researcher at ASUS AICS from Sep. 2020 to Oct. 2022, specializing in visual transfer learning.
[Nov. 2025] Our papers "SANTA" (mitigating hallucinations in video LLMs), "TA-Prompting" (video temporal understanding), and "VADER" (video anomaly understanding) are accepted at WACV 2026.
[Sep. 2025] Our paper "ThinkAct" is accepted at NeurIPS 2025.
[Jun. 2025] A co-authored paper, "LongSplat", is accepted at ICCV 2025.
[Feb. 2025] Our paper "VideoMage" is accepted at CVPR 2025.
[Jul. 2024] Our papers "Receler" and "Select and Distill" are accepted at ECCV 2024.
My research goal is to advance research in Embodied and Physical AI, developing fast-adapting, self-evolving AI agents that seamlessly integrate dynamics, reasoning, and action in physical environments. I focus on vision-language-action models that enable intelligent agents to understand and interact with the world through multimodal reasoning, on world modeling that provides predictive understanding of dynamic environments, and on embodied reasoning that bridges abstract cognition with physical reality. I am driven by the vision that AI should not merely process information, but should adaptively learn from and intelligently respond to the rich complexity of physical experience, ultimately creating more capable and contextually aware artificial agents. Full list of publications here.
GR00T N1.6: An Improved Open Foundation Model for Generalist Humanoid Robots
GR00T Team
NVIDIA Tech Blog, 2025
blog / code
NVIDIA Isaac GR00T N1.6 is an open vision-language-action (VLA) model for generalized humanoid robot skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments.
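To make the input/output interface of such a model concrete, the following is a minimal, hypothetical sketch of how a generic VLA policy of this kind might be queried. The class and method names (DummyVLAPolicy, get_action) and the action-chunk shape are illustrative assumptions, not the official Isaac GR00T N1.6 API.

```python
# Hypothetical sketch of a generic VLA inference call (not the Isaac GR00T API):
# the policy consumes camera images plus a language instruction and returns a
# chunk of low-level robot actions.
import numpy as np

class DummyVLAPolicy:
    """Stand-in for a VLA checkpoint; returns placeholder action chunks."""
    def __init__(self, action_dim: int = 7, chunk_len: int = 16):
        self.action_dim = action_dim    # e.g., joint-space command dimension
        self.chunk_len = chunk_len      # number of future steps predicted at once

    def get_action(self, images: np.ndarray, instruction: str) -> np.ndarray:
        # A real model would encode the images and instruction with a
        # vision-language backbone and decode an action chunk; here we
        # simply return zero-valued placeholder commands.
        return np.zeros((self.chunk_len, self.action_dim), dtype=np.float32)

policy = DummyVLAPolicy()
frame = np.zeros((224, 224, 3), dtype=np.uint8)            # one camera observation
actions = policy.get_action(frame[None], "pick up the red cube")
print(actions.shape)  # (16, 7): a chunk of joint-space commands to execute
```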