Ali Vosoughi

🎓 PhD researcher at Prof. Chenliang Xu's Lab | Unified multimodal reasoning, understanding, and generation | alivosoughi.com


👋 Welcome to Ali Vosoughi's code repositories. I share some of my work here so that you can browse the code more easily and find what you need.

🧠 My interest is in systems that go beyond human capabilities, not just through intelligence but by combining visual, auditory, and other signals (EEG, behavior, haptics, lidar, IR, ultrasound) that evolution forgot to give us. Look at bats, for example: they sense signals we can't! 🦇

🔬 Today's systems are still largely limited to the audio, visual, and semantic modalities. We connect video and images with language models, and the importance of language has made multimodal research extremely popular: consider work on 3D perspective understanding 🏗️, video understanding with LLMs 🎥, visual question answering ❓, and visual reasoning 🧩. But these works still leave the audio gap unfilled.

🎵 Several lines of multimodal audio work address this gap: SoundCLIP 🔊, counterfactual audio learning 🎯, and a multimodal instructional system with augmented reality 🎼. Audio reasoning can be supported by the semantic domain, audio and video can help each other in source separation, and the semantic domain can connect vision, language, and audio.

🚀 Ultimately, such systems can understand and generate unified outputs across vision, language, speech, audio, and video, not the way humans do, but in ways that get us to our goals faster.

Pinned

  1. yunlong10/Awesome-LLMs-for-Video-Understanding (Public)

     🔥🔥🔥 [IEEE TCSVT] Latest papers, code, and datasets on Vid-LLMs.

     3k stars · 136 forks

  2. yunlong10/CAT-V (Public)

     [AAAI 26 Demo] Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal Prompting

     Python · 63 stars · 4 forks

  3. SoundCLIP (Public)

     Audio-Visual Event Evaluation (AVE-2) dataset

     HTML · 1 star

  4. PW-VQA (Public)

     🔥🔥🔥 Possible Worlds Visual Question Answering

     Python · 11 stars · 1 fork

  5. nguyennm1024/OSCaR (Public)

     🔥🔥🔥 Object State Description & Change Detection

     10 stars

  6. counterfactual-audio (Public)

     🔥🔥🔥 ICASSP 2024: CLAP pretraining

     3 stars