otto
One voice. Every tool. Zero friction.
A voice-first productivity assistant that unifies your entire workflow.
otto is an AI-powered voice agent that transforms how you interact with your work tools. Instead of context-switching between GitHub, Gmail, Google Calendar, and countless other platforms, just talk to otto.
Inspiration 💡
As software developers and tech professionals, we juggle an overwhelming number of tools and platforms daily:
- GitHub for code reviews and project management
- Gmail for communication
- Google Calendar for scheduling
- Zoom for meetings
- LinkedIn for connections
- Slack, Linear, Jira, and more...
The problem? Information is scattered. You need to check three different apps just to answer "What's on my plate today?" Cross-platform references make it impossible to maintain context. We waste hours every week just switching between tools.
otto solves this. By voice. In seconds.
Ask otto to "send an email to John about tomorrow's meeting," "what PRs need my review," or "schedule a 1:1 with Sarah next Monday at 3pm" and just like that, it's done. No clicking, no context switching, no hassle. 😌
What it does ✨
otto is your voice-first command center for productivity:
| Feature | Description |
|---|---|
| Email Management | Read unread emails, send messages via voice |
| Smart Scheduling | Create calendar events with natural language ("tomorrow at 2pm", "next Tuesday") |
| GitHub Integration | Check commits, PRs, and activity across personal & organization repos |
| Web Search | Quick web lookups when you need external info |
| Daily Briefing | AI-generated morning summary of all your services |
| Voice-First UX | Natural conversation powered by Google Gemini Realtime API |
Example Interactions
You: "What's on my calendar today?"
otto: "You have 3 meetings today: Team standup at 10am,
Design review at 2pm, and 1:1 with Sarah at 4pm."
You: "What happened on GitHub yesterday?"
otto: "There were 4 commits on the otto repo. Sarah merged a fix
for the audio stream, and John updated the auth logic."
You: "Schedule a meeting called 'Sprint Planning' for next Monday at 10am"
otto: "Done! I've scheduled Sprint Planning for January 20th at 10am."
You: "Send an email to [email protected] about the deployment"
otto: "Sure, what should the subject be?"
What makes otto different 🎯
| Traditional Assistants | otto |
|---|---|
| Text-first, voice is an afterthought | Voice-first — otto is designed for spoken conversation |
| Generic responses | Context-aware — otto knows your calendar, repos, and inbox |
| Single-service integrations | Unified workflow — otto gives you one interface for all your tools |
| Forgets everything | Persistent context — otto remembers your preferences |
How we built it 🛠️
```
┌───────────────────────────────────────────────────────────────┐
│                      OTTO VOICE PIPELINE                      │
└───────────────────────────────────────────────────────────────┘

  🎤 User Voice Input
          │
          ▼
  ┌─────────────┐     ┌─────────────┐     ┌──────────────────────┐
  │   LiveKit   │────▶│  Deepgram   │────▶│  Gemini 2.5 Flash    │
  │   WebRTC    │     │     VAD     │     │  (Realtime Audio)    │
  └─────────────┘     └─────────────┘     └──────────────────────┘
                                                     │
                                          ┌──────────┴──────────┐
                                          ▼                     ▼
                                  Intent Recognition      Function Tools
                                          │                     │
          ┌──────────────┬────────────────┼───────────┬─────────┘
          ▼              ▼                ▼           ▼         ▼
     ┌────────┐    ┌─────────┐      ┌────────┐   ┌──────┐  ┌────────┐
     │ Gmail  │    │Calendar │      │ GitHub │   │ Web  │  │ Send/  │
     │  API   │    │  API    │      │  API   │   │Search│  │ Create │
     └────────┘    └─────────┘      └────────┘   └──────┘  └────────┘
          │              │                │           │         │
          └──────────────┴────────────────┴───────────┴─────────┘
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   TTC Bear-1 Model   │
                        │ (36-70% Compression) │
                        └──────────────────────┘
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   Context Builder    │
                        └──────────────────────┘
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   Gemini Response    │
                        └──────────────────────┘
                                    │
                                    ▼
                             🔊 Voice Output
```
Architecture Overview
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16, React 19, TypeScript | Dashboard UI and voice interface |
| Voice Pipeline | LiveKit Cloud, Deepgram VAD, Google Gemini 2.5 Flash Realtime | Real-time voice with intelligent turn detection |
| LLM Fallback | OpenAI GPT-4o | Automatic fallback if Gemini is unavailable |
| Backend APIs | Next.js API Routes | Service integration and authentication |
| Agent Runtime | Python + LiveKit Agents SDK | Tool execution and business logic |
| Authentication | Supabase Auth + OAuth 2.0 | User sessions and multi-provider tokens |
| Token Optimization | TTC Bear-1 Model | 36%+ token compression for context efficiency |
Key Components (Detailed)
1. Voice Agent (agent/main.py)
- Google Gemini 2.5 Flash Native Audio (`gemini-2.5-flash-native-audio-preview`) — the latest multimodal realtime model for natural voice conversations
- Deepgram VAD (Voice Activity Detection) — a lightweight, accurate neural network for turn detection that runs locally
- LiveKit Agents SDK — connects the Python agent to WebRTC rooms with full audio/video capabilities
- 6 function tools (email, calendar, GitHub, search, send email, create event)
- Participant metadata parsing — extracts the authenticated `user_id` from JSON metadata to make secure API calls
2. LiveKit Integration — Real-Time Voice Pipeline
We built a production-ready voice pipeline using LiveKit's WebRTC SDK for both the Next.js frontend and Python agent backend. Here's how it all connects:
Frontend (Next.js + LiveKit Client SDK):
- `LiveKitSession` wrapper component — A React component that manages the entire WebRTC lifecycle: connecting to rooms, handling participant events, and managing audio tracks
- `/api/connection-details` endpoint — Generates secure LiveKit access tokens server-side, tied to the authenticated Supabase user session. The token includes the user's ID in the participant metadata so the agent knows who it's talking to.
- Bidirectional audio streaming — We use LiveKit's `useVoiceAssistant` hook to handle microphone input and speaker output with automatic echo cancellation
Backend (Python + LiveKit Agents SDK):
- Deepgram-powered Turn Detection — We integrated Deepgram's VAD (Voice Activity Detection) through LiveKit's agents framework. This is critical for natural conversation: the agent knows exactly when the user has stopped speaking, avoiding awkward cutoffs and long pauses.
- Preemptive Response Generation — The agent starts generating a response while the user is still finishing their sentence. By the time they stop talking, the first words are already ready. This dramatically reduces perceived latency.
- OpenAI Fallback — If Google Gemini's Realtime API is unavailable or rate-limited, we automatically fall back to OpenAI GPT-4o. The user never notices — they just get a response. This gives us near-100% uptime.
- Participant Metadata Parsing — When a user connects, the agent reads the `participant.metadata` JSON to extract the `user_id`. This ID is then used in all API calls (as the `X-User-ID` header) so the agent can access the user's Gmail, Calendar, and GitHub on their behalf.
Why LiveKit? We evaluated several WebRTC solutions, and LiveKit stood out for its Python Agents SDK. Being able to write our agent logic in Python (where all the ML/AI libraries are) while seamlessly connecting to a React frontend was a game-changer. The built-in VAD support and room management saved us weeks of development time. :)
3. The Token Company (TTC) Integration — Context Compression
This is one of the most important technical decisions we made. Here's the problem:
The Context Problem: To give otto useful, personalized responses, we need to pass the user's data as context to Gemini:
- 📧 Recent emails (sender, subject, body snippets)
- 📅 Calendar events (title, time, attendees)
- 🐙 GitHub activity (commits, PRs, repos)
- 📋 Daily briefing summaries
We serialize all this as JSON and pass it to the LLM. But here's the catch: a single briefing context can easily hit 3,000-5,000+ tokens. At scale, this is:
- Expensive — Token costs add up fast with every voice interaction
- Slow — More tokens = longer processing time = higher latency
- Limited — We hit context window limits faster
Our Solution: TTC Bear-1 Model
We integrated The Token Company's Bear-1 compression model to solve this. Here's exactly how it works:
```python
from tokenc import TokenClient

client = TokenClient(api_key=TTC_API_KEY)

# Before sending to Gemini, compress the JSON context
result = client.compress_input(
    input=json_context,   # Raw JSON with emails, events, etc.
    aggressiveness=0.7,   # Balance between compression and detail
)
compressed_context = result.output  # 36-70% smaller!
```
Key Implementation Details:
- Semantic Compression — Bear-1 doesn't just remove words. It understands the meaning of the text and compresses it intelligently, preserving all critical information while removing redundancy.
- Configurable Aggressiveness — We use 0.7 (on a 0-1 scale). Higher values compress more aggressively; we found 0.7 is the sweet spot for maintaining quality.
- Short-Text Optimization — We skip compression for payloads under 500 characters. The API overhead isn't worth it for small contexts.
- Server-Side Caching — Compressed context is cached to avoid redundant API calls on repeated requests within the same session.
- Graceful Fallback — If the `tokenc` library isn't installed or the TTC API is unavailable, we simply use uncompressed context. The agent still works without raising an error, just with higher token costs.
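The short-text skip, caching, and graceful-fallback behaviors compose into one small wrapper. A sketch under stated assumptions: `compress_fn` stands in for the TTC client call (`lambda t: client.compress_input(input=t, aggressiveness=0.7).output`), and the 500-character threshold matches the value described above:

```python
import hashlib

MIN_COMPRESS_CHARS = 500       # skip the API round-trip for tiny payloads
_cache: dict[str, str] = {}    # per-session cache keyed by content hash


def compress_context(text: str, compress_fn=None) -> str:
    """Compress LLM context, degrading gracefully to the raw text.

    - Payloads under MIN_COMPRESS_CHARS are returned as-is.
    - Results are cached so repeated requests skip the API.
    - Any compression failure falls back to uncompressed context.
    """
    if len(text) < MIN_COMPRESS_CHARS or compress_fn is None:
        return text  # not worth the overhead, or library unavailable
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    try:
        compressed = compress_fn(text)
    except Exception:
        return text  # TTC API down: use uncompressed context, no error
    _cache[key] = compressed
    return compressed
```

Injecting `compress_fn` (rather than importing `tokenc` at the top) is what makes the "library isn't installed" case a no-op instead of an `ImportError`.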
Results:

| Data Type | Original Tokens | Compressed | Reduction |
|---|---|---|---|
| Email threads | ~2,000 | ~700 | 65% |
| Calendar week | ~800 | ~400 | 50% |
| GitHub summary | ~1,200 | ~750 | 38% |
| Full briefing | ~4,000 | ~1,800 | 55% |
Why This Matters: Without TTC, our context payloads would cost 2-3x more and add noticeable latency. Bear-1 pays for itself in saved API costs, and the speed improvement makes conversations feel more natural. :)
4. Frontend Dashboard
- Editorial-style daily briefing with AI narrative
- Real-time service integrations with live status
- Collapsible sidebar and dark/light theme support
5. API Integrations
- OAuth-based GitHub, Google Calendar, Gmail access
- Automatic token refresh engine — silently refreshes expired credentials using Supabase Service Role privileges
- Agent-specific authentication via the `X-User-ID` header
- Smart repository resolution (finds repos by name across orgs)
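Sending mail through the Gmail API means building a full RFC 2822 message and base64url-encoding it into the `raw` field of the `messages.send` request body. A minimal stdlib sketch of that step (the function name is ours, not from the codebase):

```python
import base64
from email.message import EmailMessage


def build_gmail_payload(to: str, subject: str, body: str) -> dict:
    """Build the JSON body Gmail's users.messages.send endpoint expects:
    an RFC 2822 message, base64url-encoded under the "raw" key."""
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    return {"raw": raw}
```

Using `email.message.EmailMessage` keeps header encoding and line wrapping RFC-compliant without hand-rolling the format.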
🚧 Challenges we ran into
1. LiveKit WebRTC Connection
Integrating LiveKit's WebRTC infrastructure with Next.js 16 was complex. We had to:
- Handle participant metadata to pass user IDs from the frontend to the Python agent
- Generate secure room tokens with proper grants (publish, subscribe, data)
- Debug audio stream issues — it turns out you need to handle track subscription events carefully
- Implement reconnection logic for unstable network conditions
Solution: Created a dedicated `LiveKitSession` wrapper component that manages the entire lifecycle, plus a `/api/connection-details` endpoint with Supabase user authentication. The agent reads `participant.metadata` to get the authenticated user ID for API calls.
2. Voice Turn Detection
Early versions of otto would cut off users mid-sentence or wait too long after they stopped speaking. Getting turn detection right is crucial for natural conversation.
Solution: We integrated Deepgram VAD through LiveKit's agents framework. Deepgram's neural VAD gives us precise voice activity detection that runs locally, with no extra API round-trips. Combined with LiveKit's AgentSession, we get smooth turn-taking that feels natural.
3. Context Token Explosion
To be useful, otto needs to know your calendar, emails, and GitHub activity. But a single briefing could easily hit 5,000+ tokens of context, which is both expensive and slow.
Solution: We integrated TTC's Bear-1 model for semantic compression. By passing our JSON context through Bear-1 before sending to Gemini, we reduced token usage by 36-70% while preserving all the meaningful information. The compression happens server-side in Python, and we cache compressed context to avoid redundant API calls.
4. OAuth Token Expiration (The 401 Plague 😱)
Google and GitHub tokens expire. Mid-conversation, the agent would suddenly fail with "Unauthorized" errors, which made for a rough user experience.
Solution: Built a dedicated authentication library (`lib/google-auth.ts`) that intercepts API calls, checks token expiry (with a 5-minute buffer), and performs a background refresh using Supabase Service Role privileges. The user never sees a 401; tokens refresh silently. Amazing UX!
5. LLM Reliability
Gemini's Realtime API occasionally has availability issues or rate limits. We couldn't have the agent just fail.
Solution: Implemented an OpenAI fallback. If Gemini returns an error or times out, we automatically retry with GPT-4o. The response quality stays high, and users don't notice the switch.
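The failover shape is a try-primary, retry-with-backup wrapper. A minimal sketch, with `primary` and `fallback` standing in for the Gemini and GPT-4o calls (both names are hypothetical):

```python
def generate_reply(prompt: str, primary, fallback) -> str:
    """Try the primary model; on any error (timeout, rate limit,
    outage), retry once with the fallback so the user always gets
    a response."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)
```

In practice you would also log which model answered and bound the primary call with a timeout, but the user-facing contract is exactly this: an error from the first model is invisible.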
6. Calendar Events & Email Sending
Creating calendar events and sending emails via voice was harder than expected:
- Natural language date parsing ("tomorrow", "next Tuesday at 3pm")
- Time format conversions (12hr ↔ 24hr)
- OAuth scope management for write permissions
- RFC 2822 email formatting for Gmail API
Solution: Built a comprehensive date/time parser supporting 15+ formats (including weekday names), plus proper OAuth scope configuration (`gmail.send`, `calendar.events`).
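The core of the relative-date logic fits in a short stdlib sketch; this is a simplified illustration of the idea, not the full 15-format parser ("next Tuesday" here means the upcoming Tuesday):

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_day(phrase: str, today: date) -> date:
    """Resolve 'today', 'tomorrow', or a weekday name to a date."""
    phrase = phrase.lower().strip()
    if phrase == "today":
        return today
    if phrase == "tomorrow":
        return today + timedelta(days=1)
    for i, name in enumerate(WEEKDAYS):
        if name in phrase:
            # Distance to the next occurrence, always 1-7 days ahead
            delta = (i - today.weekday() - 1) % 7 + 1
            return today + timedelta(days=delta)
    raise ValueError(f"unrecognised phrase: {phrase}")


def to_24h(hour12: int, meridiem: str) -> int:
    """Convert a 12-hour clock value ('3pm' style) to 24-hour."""
    return hour12 % 12 + (12 if meridiem == "pm" else 0)
```

The modular arithmetic is where most of the edge cases live: asking for "monday" on a Monday must land a full week out, and `12am`/`12pm` are the classic traps in the 12-to-24-hour conversion.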
🏆 Accomplishments that we're proud of
- ✅ 6 fully functional voice tools that work in production
- ✅ Real-time voice conversations with sub-500ms perceived latency (thanks to preemptive generation)
- ✅ 36-70% token reduction through TTC Bear-1 compression — huge cost savings at scale
- ✅ LiveKit + Deepgram-powered turn detection — natural conversation flow without awkward pauses
- ✅ Automatic LLM fallback — Gemini → OpenAI failover for 99.9% uptime
- ✅ Automatic OAuth refresh — the assistant never "dies" from expired tokens
- ✅ Editorial-style AI briefings that feel like reading a newspaper
- ✅ Seamless multi-platform integration (GitHub, Gmail, Calendar)
- ✅ Production-ready authentication with Supabase OAuth
📚 What we learned
- LiveKit makes WebRTC manageable — Real-time audio is hard, but their SDK abstracts the complexity. The agents framework for Python is particularly well-designed.
- Deepgram VAD is lightweight and accurate — Local neural-network VAD means no extra API calls and precise turn detection.
- Token costs add up fast — Without TTC compression, our context payloads would cost 3x more. Bear-1 pays for itself.
- Voice UX is fundamentally different — Responses must be concise and spoken-length. No markdown, no bullet lists, no long paragraphs.
- Preemptive generation reduces perceived latency — Starting to generate before the user finishes speaking makes responses feel instant.
- Always have a fallback — Gemini is great, but having OpenAI as a backup means we never leave users hanging.
- Natural language is messy — Parsing "next Monday at 3pm" requires way more code than we thought. 😅
- Reliability is the feature — All the AI in the world doesn't matter if tokens expire after 60 minutes.
What's next for otto 🚀
We're building the ultimate unified workflow ecosystem:
| Feature | Status | Description |
|---|---|---|
| Linear Integration | 🔜 Planned | Voice access to issues and projects |
| Jira Integration | 🔜 Planned | Ticket management via voice |
| Slack Integration | 🔜 Planned | Send messages, check channels |
| Custom Workflows | 💡 Concept | "When X happens, do Y" automations |
| Multi-user Workspaces | 💡 Concept | Team-wide voice assistant |
Vision: A single voice interface to replace dozens of apps. Ask otto to "create a Linear ticket for the bug John mentioned in Slack and assign it to Sarah", and it's done. 🎉
Full Tech Stack Summary
Frontend: Next.js 16, React 19, TypeScript, Tailwind CSS 4, Lucide React, LiveKit WebRTC Client SDK
Backend: Next.js API Routes, Supabase (Auth + Database)
Auth / OAuth: Supabase Auth + OAuth 2.0 (GitHub, Google, Notion, LinkedIn, Zoom)
Voice: LiveKit Cloud, LiveKit Agents SDK (Python), Deepgram VAD, Google Gemini 2.5 Flash Realtime
LLM Fallback: OpenAI GPT-4o
Token Optimization: TTC Bear-1 Model
Integrations: GitHub API, Google Calendar API, Gmail API
Infrastructure: Vercel (frontend), LiveKit Cloud (voice), Supabase (auth/db)
Built With
- api
- deepgram
- gemini
- github-api
- gmail-api
- google-calendar-api
- google-gmail-oauth
- livekit
- next.js
- oauth
- postgresql
- python
- react
- supabase
- tailwindcss
- thetokencompany
- typescript
- webrtc