Ali Vosoughi

PhD Candidate, University of Rochester

World models, multimodal agents, and agentic understanding and generation of 3D scenes.

Ali Vosoughi’s research centers on multimodal foundation models, spanning world models, agentic systems, and 3D scene understanding and generation. His experience includes work with Apple, Microsoft Research, Smule, Bosch AI Research, and DARPA.

654 citations · h-index 12 · US Patent US20250124292A1 · ICASSP 2026

Multimodal Agents · 3D Scene Understanding · World Models · Agentic Generation

📧 ali.vosoughi@rochester.edu

📍 CS Department, Wegmans Hall 3211

🍎 Apple

Machine Learning Intern
Agentic Multimodal AI

🎵 Smule AI

Research Scientist Intern
Spatial Audio Generation

🏢 Microsoft Research

Research Intern
Audiovisual LLM and Video Understanding

🚗 Bosch AI Research

Research Intern
Audio LLM and Counterfactual Learning

🛡️ DARPA PTG

Graduate Researcher
Autonomous Multimodal Perception and AR

🏆

AAAI 2026 Best Demonstration Award Runner-up
Caption Anything in Video (Spatiotemporal Multimodal Prompting)

📹

Video Understanding with LLMs
Comprehensive survey with 241+ citations (IEEE TCSVT 2025)

🔬

PW-VQA
Causal debiasing for visual question answering with 50+ citations (IEEE TMM 2024)

🏆

First counterfactual audio methods
ICASSP 2024 + US Patent US20250124292A1 (published Jan 2025)

🔊

PromptReverb
First text-to-spatial-audio generation at 48kHz (ICASSP 2026)

🎬

AVVA
Unified audiovisual foundation model with LLM curation (EUSIPCO 2025)

🤝

Autonomous multimodal copilot
Real-time audiovisual AR demonstrations (DARPA)

📊

VERIFY benchmark
Reasoning verification framework for multimodal LLMs

🧠

Video LMM Post-Training
Deep dive into video reasoning with large multimodal models

📦

AVE-2 Dataset
Open audiovisual benchmark for cross-modal event understanding

Recent News & Updates

02/2026

🏆 AAAI 2026 Best Demonstration Award Runner-up: Caption Anything in Video (Spatiotemporal Video Understanding and Multimodal Prompting)

01/2026

📄 ICASSP 2026 paper accepted: PromptReverb (Text-to-Spatial-Audio Generation at 48kHz)

12/2025

📄 NeurIPS 2025 paper accepted: MMPerspective (Multimodal LLM Reasoning, Video and Visual Perception)

09/2025

✅ Completed research internship at Smule AI (Spatial Audio Generation and Synthesis)

06/2025

🎵 Started research internship at Smule AI (Spatial Audio Generation and Immersive Computing)

03/2025

📊 Published VERIFY benchmark (Multimodal Reasoning Verification for Video and Vision LLMs)

10/2024

🎤 Presented at SANE 2024, DeepMind Boston (Audio Understanding, Video LLMs, and Spatial Audio)

10/2024

📄 ACM Multimedia 2024: EAGLE (Egocentric Video Understanding and Language Generation)

08/2024

💼 Research presentation at Microsoft Research, Seattle (Audiovisual LLM, Video and Audio Understanding)

03/2024

📄 NAACL 2024: OSCaR (Video Object State Captioning, Autonomous Video Perception)

02/2024

📄 IEEE Transactions on Multimedia 2024: PW-VQA (Causal Visual Question Answering, Video Reasoning)

08/2023

🎯 Two ICCV 2023 papers accepted (Audiovisual Sound Separation and Autonomous AR Perception System)

04/2023

🏢 Started internship at Bosch Center for AI (Audio Language Models and Counterfactual Reasoning)

Publications

PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026
[Paper][Website]

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Under review (2026)
[Paper][Website][🤗 Hugging Face]

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
European Signal Processing Conference (EUSIPCO) 2025
[Paper][Website]

EAGLE: Egocentric AGgregated Language-video Engine
ACM International Conference on Multimedia (ACM MM) 2024
[Paper]

PW-VQA: Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA
IEEE Transactions on Multimedia (TMM) 2024
[Paper][Code][Website]

OSCaR: Object State Captioning and State Change Representation
North American Chapter of the Association for Computational Linguistics (NAACL) 2024
[Paper][Code]

Video Understanding with Large Language Models: A Survey
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2025
[Paper][Code]

Learning Audio Concepts from Counterfactual Natural Language
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
[Paper][Code][Patent]

AVSA-Sep: Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper]

MISAR: A Multimodal Instructional System with Augmented Reality
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper][Code][Video]

Relation Discovery in Nonlinearly Related Large-scale Settings
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
[Paper][Code]

Leveraging Pre-Images to Discover Nonlinear Relationships in Multivariate Environments
European Signal Processing Conference (EUSIPCO) 2021
[Paper]

Large-scale Nonlinear Granger Causality for Inferring Directed Dependence from Short Multivariate Time-series Data
Scientific Reports, Nature Publishing Group (Nature) 2021
[Paper][Code]
