What can you build?
Vision Agents makes it simple to prototype and scale a wide range of AI-powered video apps, including:
- Coaching & Training — live sports coaching, guided workouts
- Collaboration — meeting assistants, note-taking, transcription
- Automation & Robotics — IoT control, surveillance, manufacturing workflows
- Video AI — video avatars, character agents
Get Started
- Installation: Install Vision Agents and set up your first project
- Voice Agents: Build real-time voice agents with AI
- Video Agents: Create AI-powered video applications
- Integrations: Connect with popular AI providers
Built-in AI integrations
Out of the box, Vision Agents supports 23+ providers across the AI stack:
- LLMs: OpenAI, Gemini, xAI, OpenRouter (Anthropic, GPT, Gemini & more)
- Realtime APIs: OpenAI (WebRTC), Gemini, AWS Bedrock, Qwen
- Speech-to-Text: Deepgram, Fast-Whisper, Wizper, Fish Audio
- Text-to-Speech: ElevenLabs, Cartesia, AWS Polly, Inworld, Kokoro
- Turn Detection: Smart Turn, Vogent
- Video Processing: Ultralytics (YOLO), Moondream, Roboflow, Decart, HeyGen
- Memory & Context: In-memory, Stream Chat
With BaseProcessor or VideoProcessorMixin, you can plug in custom computer-vision models. See Create Your Own Plugin for details.
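
To make the idea of a pluggable processor concrete, here is a minimal, library-independent sketch of the general pattern: a base class defines a per-frame hook, and a custom model implements it. Everything in it is an assumption for illustration. `HypotheticalProcessor`, `BrightnessProcessor`, `run_pipeline`, and the `process` method are invented names, not the Vision Agents API; the real `BaseProcessor` and `VideoProcessorMixin` interfaces are described in Create Your Own Plugin.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

# A grayscale frame as rows of 0-255 pixel values (illustrative only).
Frame = List[List[int]]


class HypotheticalProcessor(ABC):
    """Stand-in base class for this sketch; NOT the Vision Agents API."""

    @abstractmethod
    def process(self, frame: Frame) -> Dict[str, float]:
        """Return a dict of annotations for a single frame."""


class BrightnessProcessor(HypotheticalProcessor):
    """Toy 'model' that reports the mean pixel brightness of a frame."""

    def process(self, frame: Frame) -> Dict[str, float]:
        pixels = [p for row in frame for p in row]
        if not pixels:
            # An empty frame has no pixels to average.
            return {"mean_brightness": 0.0}
        return {"mean_brightness": sum(pixels) / len(pixels)}


def run_pipeline(processors: List[HypotheticalProcessor], frame: Frame) -> Dict[str, float]:
    """Run each processor on the frame and merge their annotations."""
    annotations: Dict[str, float] = {}
    for processor in processors:
        annotations.update(processor.process(frame))
    return annotations


if __name__ == "__main__":
    frame = [[0, 255], [255, 0]]
    print(run_pipeline([BrightnessProcessor()], frame))
```

In a real plugin, the toy brightness calculation would be replaced by a call into a computer-vision model, and the base class would be the one the framework provides.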

