otto
One voice. Every tool. Zero friction.
A voice-first productivity assistant that unifies your entire workflow.
otto is an AI-powered voice agent that transforms how you interact with your work tools. Instead of context-switching between GitHub, Gmail, Google Calendar, and countless other platforms, just talk to otto.
Inspiration 💡
As software developers and tech professionals, we juggle an overwhelming number of tools and platforms daily:
- GitHub for code reviews and project management
- Gmail for communication
- Google Calendar for scheduling
- Zoom for meetings
- LinkedIn for connections
- Slack, Linear, Jira, and more...
The problem? Information is scattered. You need to check three different apps just to answer "What's on my plate today?" Cross-platform references make it impossible to maintain context. We waste hours every week just switching between tools.
otto solves this. By voice. In seconds.
Ask otto to "send an email to John about tomorrow's meeting," "what PRs need my review," or "schedule a 1:1 with Sarah next Monday at 3pm" and just like that, it's done. No clicking, no context switching, no hassle. 😌
What it does ✨
otto is your voice-first command center for productivity:
| Feature | Description |
|---|---|
| Email Management | Read unread emails, send messages via voice |
| Smart Scheduling | Create calendar events with natural language ("tomorrow at 2pm", "next Tuesday") |
| GitHub Integration | Check commits, PRs, and activity across personal & organization repos |
| Web Search | Quick web lookups when you need external info |
| Daily Briefing | AI-generated morning summary of all your services |
| Voice-First UX | Natural conversation powered by Google Gemini Realtime API |
Example Interactions
You: "What's on my calendar today?"
otto: "You have 3 meetings today: Team standup at 10am,
Design review at 2pm, and 1:1 with Sarah at 4pm."
You: "What happened on GitHub yesterday?"
otto: "There were 4 commits on the otto repo. Sarah merged a fix
for the audio stream, and John updated the auth logic."
You: "Schedule a meeting called 'Sprint Planning' for next Monday at 10am"
otto: "Done! I've scheduled Sprint Planning for January 20th at 10am."
You: "Send an email to [email protected] about the deployment"
otto: "Sure, what should the subject be?"
What makes otto different 🎯
| Traditional Assistants | otto |
|---|---|
| Text-first, voice is an afterthought | Voice-first — otto is designed for spoken conversation |
| Generic responses | Context-aware — otto knows your calendar, repos, and inbox |
| Single-service integrations | Unified workflow — otto gives you one interface for all your tools |
| Forgets everything | Persistent context — otto remembers your preferences |
How we built it 🛠️
```
┌───────────────────────────────────────────────────────────────┐
│                      OTTO VOICE PIPELINE                      │
└───────────────────────────────────────────────────────────────┘

  🎤 User Voice Input
          │
          ▼
  ┌─────────────┐     ┌─────────────┐     ┌──────────────────────┐
  │   LiveKit   │────▶│  Deepgram   │────▶│  Gemini 2.5 Flash    │
  │   WebRTC    │     │     VAD     │     │  (Realtime Audio)    │
  └─────────────┘     └─────────────┘     └──────────────────────┘
                                                     │
                                          ┌──────────┴──────────┐
                                          ▼                     ▼
                                  Intent Recognition      Function Tools
                                          │                     │
          ┌──────────────┬────────────────┼───────────┬─────────┘
          ▼              ▼                ▼           ▼         ▼
     ┌────────┐    ┌─────────┐      ┌────────┐   ┌──────┐  ┌────────┐
     │ Gmail  │    │Calendar │      │ GitHub │   │ Web  │  │ Send/  │
     │  API   │    │  API    │      │  API   │   │Search│  │ Create │
     └────────┘    └─────────┘      └────────┘   └──────┘  └────────┘
          │              │                │           │         │
          └──────────────┴────────────────┴───────────┴─────────┘
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   TTC Bear-1 Model   │
                        │ (36-70% Compression) │
                        └──────────────────────┘
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   Context Builder    │
                        └──────────────────────┘
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   Gemini Response    │
                        └──────────────────────┘
                                    │
                                    ▼
                             🔊 Voice Output
```
Architecture Overview
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16, React 19, TypeScript | Dashboard UI and voice interface |
| Voice Pipeline | LiveKit Cloud, Deepgram VAD, Google Gemini 2.5 Flash Realtime | Real-time voice with intelligent turn detection |
| LLM Fallback | OpenAI GPT-4o | Automatic fallback if Gemini is unavailable |
| Backend APIs | Next.js API Routes | Service integration and authentication |
| Agent Runtime | Python + LiveKit Agents SDK | Tool execution and business logic |
| Authentication | Supabase Auth + OAuth 2.0 | User sessions and multi-provider tokens |
| Token Optimization | TTC Bear-1 Model | 36%+ token compression for context efficiency |
Key Components (Detailed)
1. Voice Agent (agent/main.py)
- Google Gemini 2.5 Flash Native Audio (`gemini-2.5-flash-native-audio-preview`) — the latest multimodal realtime model for natural voice conversations
- Deepgram VAD (Voice Activity Detection) — a lightweight, accurate neural network for turn detection that runs locally
- LiveKit Agents SDK — connects the Python agent to WebRTC rooms with full audio/video capabilities
- 6 function tools (email, calendar, GitHub, search, send email, create event)
- Participant metadata parsing — extracts the authenticated `user_id` from JSON metadata to make secure API calls
2. LiveKit Integration — Real-Time Voice Pipeline
We built a production-ready voice pipeline using LiveKit's WebRTC SDK for both the Next.js frontend and Python agent backend. Here's how it all connects:
Frontend (Next.js + LiveKit Client SDK):
- `LiveKitSession` wrapper component — A React component that manages the entire WebRTC lifecycle: connecting to rooms, handling participant events, and managing audio tracks
- `/api/connection-details` endpoint — Generates secure LiveKit access tokens server-side, tied to the authenticated Supabase user session. The token includes the user's ID in the participant metadata so the agent knows who it's talking to.
- Bidirectional audio streaming — We use LiveKit's `useVoiceAssistant` hook to handle microphone input and speaker output with automatic echo cancellation
Backend (Python + LiveKit Agents SDK):
- Deepgram-powered Turn Detection — We integrated Deepgram's VAD (Voice Activity Detection) through LiveKit's agents framework. This is critical for natural conversation: the agent knows exactly when the user has stopped speaking, avoiding awkward cutoffs and long pauses.
- Preemptive Response Generation — The agent starts generating a response while the user is still finishing their sentence. By the time they stop talking, the first words are already ready. This dramatically reduces perceived latency.
- OpenAI Fallback — If Google Gemini's Realtime API is unavailable or rate-limited, we automatically fall back to OpenAI GPT-4o. The user never notices — they just get a response. This gives us near-100% uptime.
- Participant Metadata Parsing — When a user connects, the agent reads the `participant.metadata` JSON to extract the `user_id`. This ID is then used in all API calls (as the `X-User-ID` header) so the agent can access the user's Gmail, Calendar, and GitHub on their behalf.
Why LiveKit? We evaluated several WebRTC solutions, and LiveKit stood out for its Python Agents SDK. Being able to write our agent logic in Python (where all the ML/AI libraries are) while seamlessly connecting to a React frontend was a game-changer. The built-in VAD support and room management saved us weeks of development time. :)
3. The Token Company (TTC) Integration — Context Compression
This is one of the most important technical decisions we made. Here's the problem:
The Context Problem: To give otto useful, personalized responses, we need to pass the user's data as context to Gemini:
- 📧 Recent emails (sender, subject, body snippets)
- 📅 Calendar events (title, time, attendees)
- 🐙 GitHub activity (commits, PRs, repos)
- 📋 Daily briefing summaries
We serialize all this as JSON and pass it to the LLM. But here's the catch: a single briefing context can easily hit 3,000-5,000+ tokens. At scale, this is:
- Expensive — Token costs add up fast with every voice interaction
- Slow — More tokens = longer processing time = higher latency
- Limited — We hit context window limits faster
Our Solution: TTC Bear-1 Model
We integrated The Token Company's Bear-1 compression model to solve this. Here's exactly how it works:
```python
from tokenc import TokenClient

client = TokenClient(api_key=TTC_API_KEY)

# Before sending to Gemini, compress the JSON context
result = client.compress_input(
    input=json_context,   # Raw JSON with emails, events, etc.
    aggressiveness=0.7,   # Balance between compression and detail
)
compressed_context = result.output  # 36-70% smaller!
```
Key Implementation Details:
- Semantic Compression — Bear-1 doesn't just remove words. It understands the meaning of the text and compresses it intelligently, preserving all critical information while removing redundancy.
- Configurable Aggressiveness — We use 0.7 (on a 0-1 scale). Higher values compress more aggressively; we found 0.7 is the sweet spot for maintaining quality.
- Short-Text Optimization — We skip compression for payloads under 500 characters. The API overhead isn't worth it for small contexts.
- Server-Side Caching — Compressed context is cached to avoid redundant API calls on repeated requests within the same session.
- Graceful Fallback — If the `tokenc` library isn't installed or the TTC API is unavailable, we simply use uncompressed context. The agent still works without raising an error, just with higher token costs.
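The short-text skip, caching, and graceful-fallback behaviors compose into one small wrapper. A sketch under stated assumptions: `compress_fn` stands in for the TTC client call (`lambda t: client.compress_input(input=t, aggressiveness=0.7).output`), and the 500-character threshold matches the value described above:

```python
import hashlib

MIN_COMPRESS_CHARS = 500       # skip the API round-trip for tiny payloads
_cache: dict[str, str] = {}    # per-session cache keyed by content hash


def compress_context(text: str, compress_fn=None) -> str:
    """Compress LLM context, degrading gracefully to the raw text.

    - Payloads under MIN_COMPRESS_CHARS are returned as-is.
    - Results are cached so repeated requests skip the API.
    - Any compression failure falls back to uncompressed context.
    """
    if len(text) < MIN_COMPRESS_CHARS or compress_fn is None:
        return text  # not worth the overhead, or library unavailable
    key = hashlib.sha256(text.encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    try:
        compressed = compress_fn(text)
    except Exception:
        return text  # TTC API down: use uncompressed context, no error
    _cache[key] = compressed
    return compressed
```

Injecting `compress_fn` (rather than importing `tokenc` at the top) is what makes the "library isn't installed" case a no-op instead of an `ImportError`.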
Results:

| Data Type | Original Tokens | Compressed | Reduction |
|---|---|---|---|
| Email threads | ~2,000 | ~700 | 65% |
| Calendar week | ~800 | ~400 | 50% |
| GitHub summary | ~1,200 | ~750 | 38% |
| Full briefing | ~4,000 | ~1,800 | 55% |
Why This Matters: Without TTC, our context payloads would cost 2-3x more and add noticeable latency. Bear-1 pays for itself in saved API costs, and the speed improvement makes conversations feel more natural. :)
4. Frontend Dashboard
- Editorial-style daily briefing with AI narrative
- Real-time service integrations with live status
- Collapsible sidebar and dark/light theme support
5. API Integrations
- OAuth-based GitHub, Google Calendar, Gmail access
- Automatic token refresh engine — silently refreshes expired credentials using Supabase Service Role privileges
- Agent-specific authentication via the `X-User-ID` header
- Smart repository resolution (finds repos by name across orgs)
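Sending mail through the Gmail API means building a full RFC 2822 message and base64url-encoding it into the `raw` field of the `messages.send` request body. A minimal stdlib sketch of that step (the function name is ours, not from the codebase):

```python
import base64
from email.message import EmailMessage


def build_gmail_payload(to: str, subject: str, body: str) -> dict:
    """Build the JSON body Gmail's users.messages.send endpoint expects:
    an RFC 2822 message, base64url-encoded under the "raw" key."""
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    return {"raw": raw}
```

Using `email.message.EmailMessage` keeps header encoding and line wrapping RFC-compliant without hand-rolling the format.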
🚧 Challenges we ran into
1. LiveKit WebRTC Connection
Integrating LiveKit's WebRTC infrastructure with Next.js 16 was complex. We had to:
- Handle participant metadata to pass user IDs from the frontend to the Python agent
- Generate secure room tokens with proper grants (publish, subscribe, data)
- Debug audio stream issues — it turns out you need to handle track subscription events carefully
- Implement reconnection logic for unstable network conditions
Solution: Created a dedicated `LiveKitSession` wrapper component that manages the entire lifecycle, plus a `/api/connection-details` endpoint with Supabase user authentication. The agent reads `participant.metadata` to get the authenticated user ID for API calls.
2. Voice Turn Detection
Early versions of otto would cut off users mid-sentence or wait too long after they stopped speaking. Getting turn detection right is crucial for natural conversation.
Solution: We integrated Deepgram VAD through LiveKit's agents framework. Deepgram's neural VAD gives us precise voice activity detection that runs locally, with no extra API round-trips. Combined with LiveKit's AgentSession, we get smooth turn-taking that feels natural.
3. Context Token Explosion
To be useful, otto needs to know your calendar, emails, and GitHub activity. But a single briefing could easily hit 5,000+ tokens of context, which is both expensive and slow.
Solution: We integrated TTC's Bear-1 model for semantic compression. By passing our JSON context through Bear-1 before sending to Gemini, we reduced token usage by 36-70% while preserving all the meaningful information. The compression happens server-side in Python, and we cache compressed context to avoid redundant API calls.
4. OAuth Token Expiration (The 401 Plague 😱)
Google and GitHub tokens expire. Mid-conversation, the agent would suddenly fail with "Unauthorized" errors, which made for a rough user experience.
Solution: Built a dedicated authentication library (`lib/google-auth.ts`) that intercepts API calls, checks token expiry (with a 5-minute buffer), and performs a background refresh using Supabase Service Role privileges. The user never sees a 401; tokens refresh silently. Amazing UX!
5. LLM Reliability
Gemini's Realtime API occasionally has availability issues or rate limits. We couldn't have the agent just fail.
Solution: Implemented an OpenAI fallback. If Gemini returns an error or times out, we automatically retry with GPT-4o. The response quality stays high, and users don't notice the switch.
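The failover shape is a try-primary, retry-with-backup wrapper. A minimal sketch, with `primary` and `fallback` standing in for the Gemini and GPT-4o calls (both names are hypothetical):

```python
def generate_reply(prompt: str, primary, fallback) -> str:
    """Try the primary model; on any error (timeout, rate limit,
    outage), retry once with the fallback so the user always gets
    a response."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)
```

In practice you would also log which model answered and bound the primary call with a timeout, but the user-facing contract is exactly this: an error from the first model is invisible.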
6. Calendar Events & Email Sending
Creating calendar events and sending emails via voice was harder than expected:
- Natural language date parsing ("tomorrow", "next Tuesday at 3pm")
- Time format conversions (12hr ↔ 24hr)
- OAuth scope management for write permissions
- RFC 2822 email formatting for Gmail API
Solution: Built a comprehensive date/time parser supporting 15+ formats (including weekday names), plus proper OAuth scope configuration (`gmail.send`, `calendar.events`).
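The core of the relative-date logic fits in a short stdlib sketch; this is a simplified illustration of the idea, not the full 15-format parser ("next Tuesday" here means the upcoming Tuesday):

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_day(phrase: str, today: date) -> date:
    """Resolve 'today', 'tomorrow', or a weekday name to a date."""
    phrase = phrase.lower().strip()
    if phrase == "today":
        return today
    if phrase == "tomorrow":
        return today + timedelta(days=1)
    for i, name in enumerate(WEEKDAYS):
        if name in phrase:
            # Distance to the next occurrence, always 1-7 days ahead
            delta = (i - today.weekday() - 1) % 7 + 1
            return today + timedelta(days=delta)
    raise ValueError(f"unrecognised phrase: {phrase}")


def to_24h(hour12: int, meridiem: str) -> int:
    """Convert a 12-hour clock value ('3pm' style) to 24-hour."""
    return hour12 % 12 + (12 if meridiem == "pm" else 0)
```

The modular arithmetic is where most of the edge cases live: asking for "monday" on a Monday must land a full week out, and `12am`/`12pm` are the classic traps in the 12-to-24-hour conversion.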
🏆 Accomplishments that we're proud of
- ✅ 6 fully functional voice tools that work in production
- ✅ Real-time voice conversations with sub-500ms perceived latency (thanks to preemptive generation)
- ✅ 36-70% token reduction through TTC Bear-1 compression — huge cost savings at scale
- ✅ LiveKit + Deepgram-powered turn detection — natural conversation flow without awkward pauses
- ✅ Automatic LLM fallback — Gemini → OpenAI failover for 99.9% uptime
- ✅ Automatic OAuth refresh — the assistant never "dies" from expired tokens
- ✅ Editorial-style AI briefings that feel like reading a newspaper
- ✅ Seamless multi-platform integration (GitHub, Gmail, Calendar)
- ✅ Production-ready authentication with Supabase OAuth
📚 What we learned
- LiveKit makes WebRTC manageable — Real-time audio is hard, but their SDK abstracts the complexity. The agents framework for Python is particularly well-designed.
- Deepgram VAD is lightweight and accurate — Local neural-network VAD means no extra API calls and precise turn detection.
- Token costs add up fast — Without TTC compression, our context payloads would cost 3x more. Bear-1 pays for itself.
- Voice UX is fundamentally different — Responses must be concise and spoken-length. No markdown, no bullet lists, no long paragraphs.
- Preemptive generation reduces perceived latency — Starting to generate before the user finishes speaking makes responses feel instant.
- Always have a fallback — Gemini is great, but having OpenAI as a backup means we never leave users hanging.
- Natural language is messy — Parsing "next Monday at 3pm" requires way more code than we thought. 😅
- Reliability is the feature — All the AI in the world doesn't matter if tokens expire after 60 minutes.
What's next for otto 🚀
We're building the ultimate unified workflow ecosystem:
| Feature | Status | Description |
|---|---|---|
| Linear Integration | 🔜 Planned | Voice access to issues and projects |
| Jira Integration | 🔜 Planned | Ticket management via voice |
| Slack Integration | 🔜 Planned | Send messages, check channels |
| Custom Workflows | 💡 Concept | "When X happens, do Y" automations |
| Multi-user Workspaces | 💡 Concept | Team-wide voice assistant |
Vision: A single voice interface to replace dozens of apps. Ask otto to "create a Linear ticket for the bug John mentioned in Slack and assign it to Sarah", and it's done. 🎉
Full Tech Stack Summary
Frontend: Next.js 16, React 19, TypeScript, Tailwind CSS 4, Lucide React, LiveKit WebRTC Client SDK
Backend: Next.js API Routes, Supabase (Auth + Database)
Auth / OAuth: Supabase Auth + OAuth 2.0 (GitHub, Google, Notion, LinkedIn, Zoom)
Voice: LiveKit Cloud, LiveKit Agents SDK (Python), Deepgram VAD, Google Gemini 2.5 Flash Realtime
LLM Fallback: OpenAI GPT-4o
Token Optimization: TTC Bear-1 Model
Integrations: GitHub API, Google Calendar API, Gmail API
Infrastructure: Vercel (frontend), LiveKit Cloud (voice), Supabase (auth/db)
Built With
- api
- deepgram
- gemini
- github-api
- gmail-api
- google-calendar-api
- google-gmail-oauth
- livekit
- next.js
- oauth
- postgresql
- python
- react
- supabase
- tailwindcss
- thetokencompany
- typescript
- webrtc