Ali Vosoughi, PhD candidate in multimodal AI at the University of Rochester
Ali Vosoughi
PhD Candidate, University of Rochester
World models, multimodal agents, and agentic understanding and generation of 3D scenes.
Ali Vosoughi's research centers on multimodal foundation models, spanning world models, agentic systems, and 3D scene understanding and generation. His experience includes work with Apple, Microsoft Research, Smule, Bosch AI Research, and DARPA.
654 citations · h-index 12 · US Patent US20250124292A1 · ICASSP 2026
Multimodal Agents · 3D Scene Understanding · World Models · Agentic Generation
📧 ali.vosoughi@rochester.edu
📍 CS Department, Wegmans Hall 3211
🍎 Apple
Machine Learning Intern
Agentic Multimodal AI
🎡 Smule AI
Research Scientist Intern
Spatial Audio Generation
🏒 Microsoft Research
Research Intern
Audiovisual LLM and Video Understanding
🚗 Bosch AI Research
Research Intern
Audio LLM and Counterfactual Learning
🛡️ DARPA PTG
Graduate Researcher
Autonomous Multimodal Perception and AR
🏆
AAAI 2026 Best Demonstration Award Runner-up
Caption Anything in Video (Spatiotemporal Multimodal Prompting)
📹
Video Understanding with LLMs
Comprehensive survey with 241+ citations (IEEE TCSVT 2025)
🔬
PW-VQA
Causal debiasing for visual question answering with 50+ citations (IEEE TMM 2024)
🏆
First counterfactual audio methods
ICASSP 2024 + US Patent US20250124292A1 (published Jan 2025)
🔊
PromptReverb
First text-to-spatial-audio generation at 48 kHz (ICASSP 2026)
🎬
AVVA
Unified audiovisual foundation model with LLM curation (EUSIPCO 2025)
🀝
Autonomous multimodal copilot
Real-time audiovisual AR demonstrations (DARPA)
📊
VERIFY benchmark
Reasoning verification framework for multimodal LLMs
🧠
Video LMM Post-Training
Deep dive into video reasoning with large multimodal models
📦
AVE-2 Dataset
Open audiovisual benchmark for cross-modal event understanding

Recent News & Updates

01/2026
📄 ICASSP 2026 paper accepted: PromptReverb (Text-to-Spatial-Audio Generation at 48 kHz)
12/2025
📄 NeurIPS 2025 paper accepted: MMPerspective (Multimodal LLM Reasoning, Video and Visual Perception)
09/2025
✅ Completed research internship at Smule AI (Spatial Audio Generation and Synthesis)
06/2025
🎡 Started research internship at Smule AI (Spatial Audio Generation and Immersive Computing)
10/2024
🎀 Presented at SANE 2024, DeepMind Boston (Audio Understanding, Video LLMs, and Spatial Audio)
10/2024
📄 ACM Multimedia 2024: EAGLE (Egocentric Video Understanding and Language Generation)
08/2024
💼 Research presentation at Microsoft Research, Seattle (Audiovisual LLM, Video and Audio Understanding)
03/2024
📄 NAACL 2024: OSCaR (Video Object State Captioning, Autonomous Video Perception)
02/2024
📄 IEEE Transactions on Multimedia 2024: PW-VQA (Causal Visual Question Answering, Video Reasoning)
08/2023
🎯 Two ICCV 2023 papers accepted (Audiovisual Sound Separation and Autonomous AR Perception System)

Publications

PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026
[Paper][Website]

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Under Review, 2026
[Paper][Website][🤗 Hugging Face]

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
European Signal Processing Conference (EUSIPCO) 2025
[Paper][Website]

EAGLE: Egocentric AGgregated Language-video Engine
ACM International Conference on Multimedia (ACM MM) 2024
[Paper]

PW-VQA: Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA
IEEE Transactions on Multimedia (TMM) 2024
[Paper][Code][Website]

OSCaR: Object State Captioning and State Change Representation
North American Chapter of the Association for Computational Linguistics (NAACL) 2024
[Paper][Code]

Video Understanding with Large Language Models: A Survey
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2025
[Paper][Code]

Learning Audio Concepts from Counterfactual Natural Language
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
[Paper][Code][Patent]

AVSA-Sep: Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper]

MISAR: A Multimodal Instructional System with Augmented Reality
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper][Code][Video]

Relation Discovery in Nonlinearly Related Large-scale Settings
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
[Paper][Code]

Leveraging Pre-Images to Discover Nonlinear Relationships in Multivariate Environments
European Signal Processing Conference (EUSIPCO) 2021
[Paper]

Large-scale Nonlinear Granger Causality for Inferring Directed Dependence from Short Multivariate Time-series Data
Scientific Reports (Nature Publishing Group) 2021
[Paper][Code]

