Inspiration
Modern professionals lose focus and time constantly switching between devices, apps, and screens. We were inspired by the question: what if you could control your entire digital ecosystem without ever touching a device? Controlla was born from the desire to remove friction and let people stay present while still staying productive.
What it does
Controlla is a pair of AI-powered smart glasses that acts as a unified digital controller across your phone, laptop, and tablet. Users verbally instruct Controlla to send emails, reply to messages, and handle complex tasks, completely hands-free and context-aware.
How we built it
We split the system into three real-time components: the phone client, the central brain, and the device bridges.
Phone client
- Streams camera frames and raw audio over WebSockets
- Displays live assistant responses and task progress
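A minimal sketch of how the backend could accept that stream, assuming a FastAPI WebSocket endpoint and a simple `{"type": ..., "data": ...}` message shape (the path and field names here are illustrative, not our exact protocol):

```python
import base64

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/phone")
async def phone_stream(ws: WebSocket):
    """Receive interleaved audio chunks and camera frames from the phone client."""
    await ws.accept()
    latest_frame = None
    while True:
        msg = await ws.receive_json()
        if msg["type"] == "frame":
            # Keep only the most recent camera frame (base64-encoded JPEG)
            # so the brain always reasons over what the user is seeing now.
            latest_frame = base64.b64decode(msg["data"])
        elif msg["type"] == "audio":
            # Audio chunks would be forwarded to the transcription pipeline here.
            pass
```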
Backend “Brain”
- Uses OpenAI Realtime for low-latency speech-to-text
- Sends transcripts + latest camera frame to a reasoning model
- The reasoning model outputs strict JSON (see the schema sketch below):
  - Either a normal conversational reply
  - Or a structured device task with a clear goal
- Handles routing between users and their connected devices
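To keep that output machine-routable, we validate it before anything touches a device. A hedged sketch of what that schema could look like with Pydantic; the field names are illustrative, not our exact contract:

```python
from typing import Literal, Optional

from pydantic import BaseModel

class BrainOutput(BaseModel):
    kind: Literal["reply", "device_task"]
    message: Optional[str] = None   # spoken response when kind == "reply"
    goal: Optional[str] = None      # plain-language goal when kind == "device_task"
    device: Optional[Literal["laptop", "phone", "tablet"]] = None

# Validate the model's raw JSON before routing it to a device bridge.
raw = '{"kind": "device_task", "goal": "open Terminal", "device": "laptop"}'
task = BrainOutput.model_validate_json(raw)
```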
Decision & Action System
- Gemini powers our Local Action Model (LAM) that:
  - Interprets screenshots
  - Chooses the next single UI action
  - Validates results after every step
- The laptop bridge executes actions using OS automation
- After each action:
  - Takes a screenshot
  - Sends it back for verification
- This creates a closed feedback loop so the agent can self-correct
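A simplified sketch of that loop as it runs on the laptop bridge; `choose_action` and `execute` are placeholders for the Gemini LAM call and the AppleScript/cliclick executor, not real library functions:

```python
import pyautogui

def run_task(goal: str, max_steps: int = 20) -> bool:
    """Screenshot -> one action -> screenshot again, until the goal is met."""
    for _ in range(max_steps):
        shot = pyautogui.screenshot()        # current screen state
        action = choose_action(goal, shot)   # placeholder: LAM picks one UI action
        if action["type"] == "done":
            return True                      # model judged the goal complete
        execute(action)                      # placeholder: click/type via OS automation
        # The next iteration re-screenshots, so the model can verify this step
        # and self-correct if the UI did not change as expected.
    return False
```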
We intentionally kept actions non-destructive and added:
- Explicit user confirmation
- Visible task progress
- Safety constraints on what the agent can execute
Challenges we ran into
macOS permissions: macOS locked down keyboard control, so we had to switch from PyAutoGUI to AppleScript + cliclick and manually grant Accessibility access.
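In practice the bridge just shells out to both tools, roughly like this (simplified; real input needs quoting and escaping):

```python
import subprocess

def type_text(text: str) -> None:
    # Keystrokes go through AppleScript's System Events.
    script = f'tell application "System Events" to keystroke "{text}"'
    subprocess.run(["osascript", "-e", script], check=True)

def click(x: int, y: int) -> None:
    # Clicks go through the cliclick CLI: "c:x,y" is a single left click.
    subprocess.run(["cliclick", f"c:{x},{y}"], check=True)
```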
Retina scaling: Retina screenshots are 2× resolution, so we built a coordinate conversion layer to map model clicks to real screen space.
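The conversion itself is small: compare the screenshot's pixel width with the logical size PyAutoGUI reports, then divide the model's coordinates by that factor. Roughly:

```python
import pyautogui

def to_screen_coords(px: int, py: int) -> tuple[int, int]:
    """Map pixel coordinates from a Retina screenshot to logical screen points."""
    shot = pyautogui.screenshot()
    logical_w, _ = pyautogui.size()   # e.g. 1440 points wide
    scale = shot.width / logical_w    # 2.0 on a Retina display
    return int(px / scale), int(py / scale)
```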
Image format: PyAutoGUI returns RGBA, but the model wants JPEG, so we convert everything to RGB before sending.
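The fix is a one-line Pillow conversion before encoding (the JPEG quality value here is arbitrary):

```python
import io

import pyautogui

def screenshot_jpeg() -> bytes:
    shot = pyautogui.screenshot().convert("RGB")  # drop the alpha channel
    buf = io.BytesIO()
    shot.save(buf, format="JPEG", quality=85)
    return buf.getvalue()
```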
iPhone camera/mic: iOS Safari only allows camera/mic over HTTPS, so we used ngrok to tunnel localhost securely.
Model self-correction: The agent kept running python instead of python3, so we hard-coded recovery rules into the prompt.
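The recovery rules are just extra lines appended to the system prompt; an illustrative (not verbatim) example:

```python
# Illustrative prompt fragment, not our exact wording.
RECOVERY_RULES = """
- On macOS, always run `python3`, never `python`.
- If a command fails, read the error in the next screenshot and retry with a corrected command.
- Never repeat the exact same failed action more than twice.
"""
```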
Accomplishments that we're proud of
Voice -> Computer control
You can literally talk to your phone and watch your Mac respond in real time.
“Hey Wink, open Terminal” actually works end-to-end.

Real-time multimodal AI
Live audio streaming for transcription + camera frames for visual context.
The assistant can see what you see and respond naturally.

Built our own LAM
A vision-based model that decides what to click and type from screenshots.
Handles multi-step tasks and Retina coordinate scaling correctly.

True cross-device system
Phone and laptop connect over separate WebSockets.
Backend routes commands between them in real time with auto-reconnect.

Smart glasses-style UX
Wake phrase activation, conversation timeouts, and smooth TTS on iOS Safari.
Feels like talking to an actual assistant, not an app.

macOS automation that actually works
AppleScript + cliclick + proper permissions for reliable control.
No flaky demos, no fake clicks.
What we learned
- Setting up secure tunneling for mobile access to local development servers.
- Designing real-time WebSocket systems for audio, video, and command streaming.
- Navigating macOS automation permissions and system security constraints.
- Handling coordinate mismatches caused by Retina displays.
- Enforcing strict JSON schemas for reliable model outputs.
- Prompt engineering for structured reasoning and error recovery.
- Managing latency in real-time voice and vision pipelines.
- Dealing with browser limitations for camera, microphone, and TTS.
- Building stable cross-device systems with reconnection logic.
What's next for Controlla
- Support for more devices (tablets, smart TVs, IoT).
- Smarter action planning with longer memory and context retention.
- User-defined custom commands and macros.
- Stronger safety controls and permission scopes per app.
- Lower-latency streaming for faster response times.
- Multi-user support for shared device control.
- Training a more robust LAM for complex multi-step workflows.
Built With
- elevenlabs
- fastapi
- gemini
- openai
- python
- websockets


