Sanjay Haresh

I am a Senior Machine Learning Researcher at Qualcomm AI Research working in Roland Memisevic's group on the intersection of large vision language models and robotics. Before that, I completed my MSc. (Thesis) at Simon Fraser University (SFU), where I was advised by Prof. Manolis Savva .

Prior to SFU, I worked at Retrocausal for 2 years as a Research Engineer (Computer Vision) under the supervision of Dr. Quoc Huy Tran and Dr. Zeeshan Zia.

In 2019, I completed my undergrad Computer Science from FAST-NUCES Karachi, Pakistan, where I worked on Class Imbalance under the guidance of Prof. Tahir Syed.

Email / CV / Google Scholar / Twitter

Publications and Preprints

Papers are in reverse chronological order. '*' denotes equal contribution.

	Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Leonid Sigal, Roland Memisevic NeurIPS, 2025 project page / Openreview We present Qualcomm Interactive Cooking, a dataset and a benchmark, to enable development of live step-by-step task guidance by situated AI assistants.
	Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action Models Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, Pietro Mazzaglia, CoRL, 2025 arXiv We introduce Oat-VLA, a VLA with Object-Agent-centric tokenization which drastically reduces the number of vision tokens enabling ~2x faster training than OpenVLA while outperforming it on real-world tasks.
	ClevrSkills: Compositional Language And Visual Reasoning in Robotics Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic NeurIPS (Datasets and Benchmarks Track), 2024 project page / arXiv We present a benchmark for compositional learning in robotics. The benchmark consists of 33 manipulation tasks over 3 levels of compositionality. We also open-source a large dataset of ground-truth trajectories generated using oracle solvers.
	Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives Kristen Grauman, ... Sanjay Haresh, Yongsen Mao, Manolis Savva, ... CVPR, 2024 project page / arXiv We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge to push the frontier of first-person video understanding of skilled human activity.
	Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel Chang, Manolis Savva CVPR, 2024 project page / arXiv We present the Habitat Synthetic Scene Dataset, a dataset of 211 high-quality 3D scenes, and use it to investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find and navigate to objects.
	Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges Sanjay Haresh, Xiaohao Sun, Hanxiao Jiang, Angel Chang, Manolis Savva 3DV, 2022 project page / arXiv We canonicalize the task of reconstruction 3D human object from videos and benchmark 5 families of methods on the task.
	Timestamp-Supervised Action Segmentation with Graph Convolutional Networks Hamza Khan Sanjay Haresh, Awais Ahmed, Shakeeb Siddiqui, Andrey Konin , M. Zeeshan Zia, Quoc-Huy Tran IROS, 2022 project page / arXiv We leverage graph convolutional networks to propagate timestamp labels to the whole video resulting in a 97% reduction of required labels.
	Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering Sanjay Haresh, Sateesh Kumar, Awais Ahmed, Andrey Konin , M. Zeeshan Zia, Quoc-Huy Tran CVPR, 2022 project page / arXiv We proposed temporal optimal transport for jointly learning representations and performing online clustering in an unsupervised manner.
	Learning by Aligning Video in Time Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram N. Syed, Andrey Konin , M. Zeeshan Zia, Quoc-Huy Tran CVPR, 2021 project page / arXiv Good frame representations can be learned by learning global alignment across pairs of videos via differentiable dynamic time warping.
	Towards Anomaly Detection in Dashcam Videos Sanjay Haresh, Sateesh Kumar, M. Zeeshan Zia Quoc-Huy Tran IV, 2020 talk / arXiv We curated a large dataset of dashcam videos for road anomalies understanding. We proposed an object-object interaction reasoning approach for detecting anomalies without additional supervision.

Website layout is from Jon Barron