Hi 👋, I am a Research Scientist at Apple MLR. I completed my PhD in the Department of Computer Science at the University of Maryland, College Park, where I was advised by Prof. Dinesh Manocha. My research focuses on advancing multimodal foundation models, with an emphasis on multimodal reasoning, agentic capabilities, fine-grained perception, and robustness across visual, audio, and language modalities.
During my PhD, I worked closely with several industry research groups. I was a Research Intern at Apple Machine Learning Research, hosted by Chun-Liang Li and Karren Yang. I spent the summer of 2024 at Meta Reality Labs as a Research Scientist Intern, hosted by Ruohan Gao. Previously, I was a Student Researcher at Google Research, where I worked with Avisek Lahiri and Vivek Kwatra on speech-driven facial synthesis on the Talking Heads team. I was also a PhD Research Intern at Adobe Research, working with Joseph K J on multimodal audio generation. In addition, I have collaborated with Prof. Kristen Grauman, Prof. Salman Khan, and Prof. Mohamed Elhoseiny.
Before starting my PhD, I worked as a Machine Learning Scientist with the Camera and Video AI team at ShareChat, India. I was also a Visiting Researcher at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata, under Prof. Ujjwal Bhattacharya. Earlier, I was a Senior Research Engineer with the Vision Intelligence Group at Samsung R&D Institute Bangalore, where I worked on AI-powered perception and vision systems for consumer devices.
I received my MTech in Computer Science & Engineering from IIIT Hyderabad, where I was advised by Prof. C V Jawahar. During my undergraduate studies, I worked as a research intern with Prof. Pabitra Mitra at IIT Kharagpur and at the CVPR Unit at ISI Kolkata.
Feel free to reach out if you're interested in research collaboration!
Oct 2021 - Paper on audio-visual summarization accepted at BMVC 2021.
Sep 2021 - Blog on Video Quality Enhancement released at Tech @ ShareChat.
July 2021 - Paper on reflection removal accepted at ICCV 2021.
June 2021 - Joined ShareChat Data Science team.
May 2021 - Paper on audio-visual joint segmentation accepted at ICIP 2021.
Dec 2018 - Accepted Samsung Research offer. Joining in June 2019.
Sep 2018 - Received Dean's Merit List Award for academic excellence at IIIT Hyderabad.
Oct 2017 - Our work on a multi-scale, low-latency face detection framework received Best Paper Award at NGCT-2017.
Selected publications
I am broadly interested in problems at the intersection of Computer Vision, Computer Audition, and Machine Learning, with the goal of building AI systems that can perceive, reason, and interact with complex real-world environments. My research focuses on multimodal learning (Vision + X), particularly for generative modeling and cross-modal understanding with minimal supervision.
In the past, I have also worked on problems in computational photography, including image reflection removal, intrinsic image decomposition, inverse rendering, and video quality assessment.
Representative papers are highlighted below. For a complete list of publications, please refer to my
Google Scholar.
AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
Sanjoy Chowdhury, Karren D. Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli
Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Paper /
Project Page (Coming soon)
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
Paper /
Project Page /
Poster /
Huggingface /
Kaggle /
Code
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury*, Hanan Gani*, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2025
Paper /
Project Page /
Code
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2025
Paper /
Project Page /
Code
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception