Build low-latency voice and video AI agents using any model. Vision Agents is an open-source, edge-agnostic Python framework with 25+ integrations, production-ready deployment, and Stream’s global edge network for sub-500ms latency.

Quickstart

Build your first agent in under 5 minutes

GitHub

Star the project and explore examples

What You Can Build

AI Golf Coach

YOLO pose detection watches your swing via camera while Gemini gives real-time coaching feedback.

Phone Support Agent

Twilio-powered agent answers inbound calls with RAG-backed knowledge bases via TurboPuffer.

Smart Security Camera

Face recognition and package detection with YOLO, sending automated alerts in real time.

Live Sports Commentator

Roboflow object detection tracks the players and the ball while an LLM delivers play-by-play commentary.

Live Video Restyler

Camera feed transformed into narrated stories with Decart video style transfer.

Interactive Avatar

HeyGen avatars that see, hear, and respond with real-time voice and video.

Capabilities

  • 25+ integrations — OpenAI, Gemini, Anthropic, Deepgram, ElevenLabs, YOLO, and more
  • Two modes — Realtime APIs (WebRTC/WebSocket) or custom STT → LLM → TTS pipelines
  • Video processing — Run YOLO, Roboflow, or custom models on every frame
  • Phone support — Twilio integration for voice calls with bi-directional audio
  • RAG — TurboPuffer vector search and Gemini FileSearch for knowledge retrieval
  • Production ready — HTTP server, Prometheus metrics, Docker and Kubernetes deployment
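
To make the second mode concrete, here is a minimal sketch of the STT → LLM → TTS flow in plain Python. It does not use the Vision Agents API: the `VoicePipeline` class, the three stage signatures, and the stub functions are all hypothetical stand-ins, chosen only to show how audio passes through three swappable stages.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures for illustration only; NOT the Vision Agents API.
SpeechToText = Callable[[bytes], str]   # audio in, transcript out
LanguageModel = Callable[[str], str]    # transcript in, reply text out
TextToSpeech = Callable[[str], bytes]   # reply text in, audio out


@dataclass
class VoicePipeline:
    """Chains three independent stages: STT -> LLM -> TTS."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # 1. transcribe the caller's audio
        reply = self.llm(transcript)     # 2. generate a text reply
        return self.tts(reply)           # 3. synthesize the reply as audio


# Stub stages so the sketch runs without any provider account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.respond(b"hello")
print(audio_out)
```

Because each stage is just a callable, any one provider can be replaced without touching the other two. A Realtime API, by contrast, handles all three steps behind a single streaming connection.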

Next Steps

Quickstart

Install and build your first agent

Integrations

Browse 25+ supported AI providers

Guides

Deploy to production with Docker and metrics

Try Stream Video

Get 333,000 free participant minutes