Hi 👋, I am a Research Scientist at Apple MLR. I completed my PhD in the Department of Computer Science at the University of Maryland, College Park, where I was advised by Prof. Dinesh Manocha. My research focuses on advancing multimodal foundation models, with an emphasis on multimodal reasoning, agentic capabilities, fine-grained perception, and robustness across visual, audio, and language modalities.
During my PhD, I worked closely with several industry research groups. I was a Research Intern at Apple Machine Learning Research, hosted by Chun-Liang Li and Karren Yang. I spent the summer of 2024 at Meta Reality Labs as a Research Scientist Intern, hosted by Ruohan Gao. Previously, I was a Student Researcher at Google Research, where I worked with Avisek Lahiri and Vivek Kwatra on speech-driven facial synthesis on the Talking Heads team. I was also a PhD Research Intern at Adobe Research, working with Joseph K J on multimodal audio generation. In addition, I have collaborated with Prof. Kristen Grauman, Prof. Salman Khan, and Prof. Mohamed Elhoseiny.
Before starting my PhD, I worked as a Machine Learning Scientist with the Camera and Video AI team at ShareChat, India. I was also a Visiting Researcher at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata, under Prof. Ujjwal Bhattacharya. Earlier, I was a Senior Research Engineer with the Vision Intelligence Group at Samsung R&D Institute Bangalore, where I worked on AI-powered perception and vision systems for consumer devices.
I received my MTech in Computer Science & Engineering from IIIT Hyderabad, where I was advised by Prof. C V Jawahar. During my undergraduate studies, I worked as a research intern with Prof. Pabitra Mitra at IIT Kharagpur and at the CVPR Unit at ISI Kolkata.
Feel free to reach out if you're interested in research collaboration!
Oct 2021 - Paper on audio-visual summarization accepted at BMVC 2021.
Sep 2021 - Blog on Video Quality Enhancement released at Tech @ ShareChat.
July 2021 - Paper on reflection removal accepted at ICCV 2021.
June 2021 - Joined ShareChat Data Science team.
May 2021 - Paper on audio-visual joint segmentation accepted at ICIP 2021.
Dec 2018 - Accepted Samsung Research offer. Joining in June 2019.
Sep 2018 - Received Dean's Merit List Award for academic excellence at IIIT Hyderabad.
Oct 2017 - Our work on a multi-scale, low-latency face detection framework received Best Paper Award at NGCT-2017.
Selected publications
I am broadly interested in problems at the intersection of Computer Vision, Computer Audition, and Machine Learning, with the goal of building AI systems that can perceive, reason, and interact with complex real-world environments. My research focuses on multimodal learning (Vision + X), particularly for generative modeling and cross-modal understanding with minimal supervision.
In the past, I have also worked on problems in computational photography, including image reflection removal, intrinsic image decomposition, inverse rendering, and video quality assessment.
Representative papers are highlighted below. For a complete list of publications, please refer to my
Google Scholar.
AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
Sanjoy Chowdhury, Karren D. Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli
Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Paper /
Project Page (Coming soon)
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
Paper /
Project Page /
Poster /
Huggingface /
Kaggle /
Code
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury*, Hanan Gani*, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2025
Paper /
Project Page /
Code
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2025
Paper /
Project Page /
Code
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception