Inspiration

Modern professionals lose focus and time constantly switching between devices, apps, and screens. We were inspired by the question: what if you could control your entire digital ecosystem without ever touching a device? Controlla was born from the desire to remove friction and let people stay present while still staying productive.

What it does

Controlla is AI-powered smart glasses that act as a unified digital controller across your phone, laptop, and tablet. Users verbally instruct Controlla to send emails, reply to messages, and handle complex tasks: completely hands-free and context-aware.

How we built it

We split the system into three real-time components: the phone client, the central brain, and the device bridges.

  • Phone client

    • Streams camera frames and raw audio over WebSockets
    • Displays live assistant responses and task progress
  • Backend “Brain”

    • Uses OpenAI Realtime for low-latency speech-to-text
    • Sends transcripts + the latest camera frame to a reasoning model
    • The reasoning model outputs JSON: either a normal conversational reply, or a structured device task with a clear goal (see the sketch after this list)
    • Handles routing between users and their connected devices
  • Decision & Action System

    • Gemini powers our Local Action Model (LAM), which:
      • Interprets screenshots
      • Chooses the next single UI action
      • Validates results after every step
    • The laptop bridge executes actions using OS automation
    • After each action, the bridge takes a screenshot and sends it back for verification
    • This creates a closed feedback loop so the agent can self-correct
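
For illustration, here's a minimal sketch of how the backend can branch on that JSON. The field names (type, text, goal) are ours for this example, not necessarily the exact schema we shipped:

```python
import json

def handle_brain_output(raw: str) -> dict:
    """Route the reasoning model's JSON output: either a spoken reply for the
    phone or a structured task for the laptop bridge."""
    msg = json.loads(raw)

    if msg.get("type") == "reply":
        # Normal conversational answer: stream straight back to the phone for TTS.
        return {"route": "phone", "text": msg["text"]}

    if msg.get("type") == "device_task":
        # Structured task with a clear goal: hand it to the laptop bridge, which
        # drives the LAM's screenshot -> action -> verify loop until done.
        return {"route": "laptop", "goal": msg["goal"]}

    raise ValueError(f"unexpected message type: {msg.get('type')!r}")

# The two shapes we ask the model to emit, roughly:
#   {"type": "reply", "text": "It's 3 pm in Tokyo."}
#   {"type": "device_task", "goal": "Open Terminal and list the home directory"}
```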

We intentionally kept actions non-destructive and added:

  • Explicit user confirmation
  • Visible task progress
  • Safety constraints on what the agent can execute
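
A simplified sketch of how that gating can look in the bridge (the allowlist, confirm callback, and run_action stub are illustrative, not our exact implementation):

```python
# Actions the agent may run; anything else is rejected before it reaches the OS.
SAFE_ACTIONS = {"click", "type_text", "open_app", "screenshot"}

def run_action(action: dict) -> None:
    """Placeholder for the real dispatcher (AppleScript / cliclick underneath)."""
    print(f"executing {action['name']} with {action.get('args', {})}")

def execute_guarded(action: dict, confirm) -> bool:
    """Run a single LAM-chosen action only if it is allowlisted and confirmed."""
    if action["name"] not in SAFE_ACTIONS:
        print(f"blocked non-allowlisted action: {action['name']}")
        return False
    # Explicit user confirmation before anything touches the machine.
    if not confirm(f"Run {action['name']} with {action.get('args', {})}?"):
        print("user declined; skipping")
        return False
    run_action(action)
    return True
```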

Challenges we ran into

macOS permissions: macOS locks down programmatic keyboard and mouse control, so we switched from PyAutoGUI to AppleScript + cliclick and manually granted Accessibility access.
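
Roughly, the bridge shells out to osascript and cliclick instead of synthesizing events itself (a minimal sketch, with error handling omitted):

```python
import subprocess

def type_text(text: str) -> None:
    # Keystrokes go through System Events, which only works once the host app
    # has been granted Accessibility access in System Settings.
    escaped = text.replace('"', '\\"')
    script = f'tell application "System Events" to keystroke "{escaped}"'
    subprocess.run(["osascript", "-e", script], check=True)

def click(x: int, y: int) -> None:
    # cliclick (installable via Homebrew) clicks at logical screen coordinates.
    subprocess.run(["cliclick", f"c:{x},{y}"], check=True)
```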

Retina scaling: Retina screenshots are 2× resolution, so we built a coordinate conversion layer to map model clicks to real screen space.
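
The conversion itself is just a ratio between the screenshot's pixel size and the logical screen size. A minimal sketch using pyautogui for the logical size:

```python
import pyautogui

def to_screen_coords(px: int, py: int, img_w: int, img_h: int) -> tuple[int, int]:
    """Map pixel coordinates from a (possibly 2x Retina) screenshot back to the
    logical coordinates that cliclick and AppleScript expect."""
    logical_w, logical_h = pyautogui.size()   # e.g. 1440 x 900 on a 2880 x 1800 panel
    scale_x = img_w / logical_w               # 2.0 on Retina, 1.0 otherwise
    scale_y = img_h / logical_h
    return round(px / scale_x), round(py / scale_y)
```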

Image format: PyAutoGUI returns RGBA, but the model wants JPEG, so we convert everything to RGB before sending.
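
The fix is a one-line Pillow conversion before encoding (sketch; the buffer handling may differ slightly from our code):

```python
import io
import pyautogui

def screenshot_jpeg(quality: int = 80) -> bytes:
    img = pyautogui.screenshot()        # a Pillow Image; RGBA on our setup
    if img.mode != "RGB":
        img = img.convert("RGB")        # JPEG has no alpha channel
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```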

iPhone camera/mic: iOS Safari only allows camera/mic over HTTPS, so we used ngrok to tunnel localhost securely.
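
The same tunnel can be opened programmatically; a sketch using pyngrok (an assumption here, since the plain ngrok CLI works the same way):

```python
from pyngrok import ngrok  # thin Python wrapper around the ngrok binary

# Expose the local dev server over HTTPS so iOS Safari will grant
# getUserMedia (camera/microphone) access to the page.
tunnel = ngrok.connect(8000, "http")
print("public HTTPS URL:", tunnel.public_url)
```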

Model self-correction: The agent kept running python instead of python3, so we hard-coded recovery rules into the prompt.
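
Those recovery rules are just literal instructions appended to the LAM's system prompt, along these lines (paraphrased, not our exact prompt):

```python
RECOVERY_RULES = """
When running shell commands on macOS:
- Always invoke python3, never python; the bare python binary may not exist.
- If a command fails, read the error in the next screenshot and retry with
  exactly one corrected command before asking the user for help.
"""
```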

Accomplishments that we're proud of

  • Voice -> Computer control
    You can literally talk to your phone and watch your Mac respond in real time.
    “Hey Wink, open Terminal” actually works end-to-end.

  • Real-time multimodal AI
    Live audio streaming for transcription + camera frames for visual context.
    The assistant can see what you see and respond naturally.

  • Built our own LAM
    A vision-based model that decides what to click and type from screenshots.
    Handles multi-step tasks and Retina coordinate scaling correctly.

  • True cross-device system
    Phone and laptop connect over separate WebSockets.
    Backend routes commands between them in real time with auto-reconnect.

  • Smart glasses-style UX
    Wake phrase activation, conversation timeouts, and smooth TTS on iOS Safari.
    Feels like talking to an actual assistant, not an app.

  • macOS automation that actually works
    AppleScript + cliclick + proper permissions for reliable control.
    No flaky demos, no fake clicks.

What we learned

  • Setting up secure tunneling for mobile access to local development servers.
  • Designing real-time WebSocket systems for audio, video, and command streaming.
  • Navigating macOS automation permissions and system security constraints.
  • Handling coordinate mismatches caused by Retina displays.
  • Enforcing strict JSON schemas for reliable model outputs.
  • Prompt engineering for structured reasoning and error recovery.
  • Managing latency in real-time voice and vision pipelines.
  • Dealing with browser limitations for camera, microphone, and TTS.
  • Building stable cross-device systems with reconnection logic.

What's next for Controlla

  • Support for more devices (tablets, smart TVs, IoT).
  • Smarter action planning with longer memory and context retention.
  • User-defined custom commands and macros.
  • Stronger safety controls and permission scopes per app.
  • Lower-latency streaming for faster response times.
  • Multi-user support for shared device control.
  • Training a more robust LAM for complex multi-step workflows.
