🎓 PhD researcher at Prof. Chenliang Xu's Lab | Unified multimodal reasoning, understanding, and generation | alivosoughi.com
👋 Welcome to Ali Vosoughi's code repository. I've gathered some of my work here so you can browse the code and find what you need.
🧠 My interest is in systems that go beyond human capabilities: not just through intelligence, but by combining visual, auditory, and other signals (EEG, behavior, haptics, lidar, IR, ultrasound) that evolution left out of us. Look at bats, for example: they perceive signals we can't! 🦇
🔬 Currently we're limited to the audio, visual, and semantic modalities. Today we connect video and images with language models, and the centrality of language has made multimodal research extremely popular. Consider work on 3D perspective understanding 🏗️, video understanding with LLMs 🎥, visual question answering ❓, and visual reasoning 🧩. But these lines of work do not fill the audio gap at all.
🎵 For examples of multimodal audio work, see SoundCLIP 🔊, counterfactual audio learning 🎯, and a multimodal instructional system with augmented reality 🎼. Audio reasoning can also be supported by the semantic domain, audio and video can help each other in source separation, and the semantic domain can bridge vision, language, and audio.
🚀 Such systems can ultimately understand and generate unified outputs across vision, language, speech, audio, and video: not the way humans do, but in ways that get us to our goals faster.

