Inspiration

Modern professionals lose focus and time constantly switching between devices, apps, and screens. We were inspired by the question: what if you could control your entire digital ecosystem without ever touching a device? Controlla was born from the desire to remove friction and let people stay present while still staying productive.

What it does

Controlla is AI-powered smart glasses that act as a unified digital controller across your phone, laptop, and tablet. Users verbally instruct Controlla to send emails, reply to messages, and handle complex tasks: completely hands-free and context-aware.

How we built it

We split the system into three real-time components: the phone client, the central brain, and the device bridges.

  • Phone client

    • Streams camera frames and raw audio over WebSockets
    • Displays live assistant responses and task progress
  • Backend “Brain”

    • Uses OpenAI Realtime for low-latency speech-to-text
    • Sends transcripts + the latest camera frame to a reasoning model
    • The reasoning model outputs JSON: either a normal conversational reply, or a structured device task with a clear goal (see the sketch after this list)
    • Handles routing between users and their connected devices
  • Decision & Action System

    • Gemini powers our Local Action Model (LAM), which:
      • Interprets screenshots
      • Chooses the next single UI action
      • Validates results after every step
    • The laptop bridge executes actions using OS automation
    • After each action, the bridge takes a screenshot and sends it back for verification
    • This creates a closed feedback loop so the agent can self-correct
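
For illustration, here's a minimal sketch of how the backend can branch on that JSON. The field names (type, text, goal) are ours for this example, not necessarily the exact schema we shipped:

```python
import json

def handle_brain_output(raw: str) -> dict:
    """Route the reasoning model's JSON output: either a spoken reply for the
    phone or a structured task for the laptop bridge."""
    msg = json.loads(raw)

    if msg.get("type") == "reply":
        # Normal conversational answer: stream straight back to the phone for TTS.
        return {"route": "phone", "text": msg["text"]}

    if msg.get("type") == "device_task":
        # Structured task with a clear goal: hand it to the laptop bridge, which
        # drives the LAM's screenshot -> action -> verify loop until done.
        return {"route": "laptop", "goal": msg["goal"]}

    raise ValueError(f"unexpected message type: {msg.get('type')!r}")

# The two shapes we ask the model to emit, roughly:
#   {"type": "reply", "text": "It's 3 pm in Tokyo."}
#   {"type": "device_task", "goal": "Open Terminal and list the home directory"}
```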

We intentionally kept actions non-destructive and added:

  • Explicit user confirmation
  • Visible task progress
  • Safety constraints on what the agent can execute
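
A simplified sketch of how that gating can look in the bridge (the allowlist, confirm callback, and run_action stub are illustrative, not our exact implementation):

```python
# Actions the agent may run; anything else is rejected before it reaches the OS.
SAFE_ACTIONS = {"click", "type_text", "open_app", "screenshot"}

def run_action(action: dict) -> None:
    """Placeholder for the real dispatcher (AppleScript / cliclick underneath)."""
    print(f"executing {action['name']} with {action.get('args', {})}")

def execute_guarded(action: dict, confirm) -> bool:
    """Run a single LAM-chosen action only if it is allowlisted and confirmed."""
    if action["name"] not in SAFE_ACTIONS:
        print(f"blocked non-allowlisted action: {action['name']}")
        return False
    # Explicit user confirmation before anything touches the machine.
    if not confirm(f"Run {action['name']} with {action.get('args', {})}?"):
        print("user declined; skipping")
        return False
    run_action(action)
    return True
```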

Challenges we ran into

macOS permissions: macOS locks down programmatic keyboard and mouse control, so we switched from PyAutoGUI to AppleScript + cliclick and manually granted Accessibility access.
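
Roughly, the bridge shells out to osascript and cliclick instead of synthesizing events itself (a minimal sketch, with error handling omitted):

```python
import subprocess

def type_text(text: str) -> None:
    # Keystrokes go through System Events, which only works once the host app
    # has been granted Accessibility access in System Settings.
    escaped = text.replace('"', '\\"')
    script = f'tell application "System Events" to keystroke "{escaped}"'
    subprocess.run(["osascript", "-e", script], check=True)

def click(x: int, y: int) -> None:
    # cliclick (installable via Homebrew) clicks at logical screen coordinates.
    subprocess.run(["cliclick", f"c:{x},{y}"], check=True)
```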

Retina scaling: Retina screenshots are 2× resolution, so we built a coordinate conversion layer to map model clicks to real screen space.
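
The conversion itself is just a ratio between the screenshot's pixel size and the logical screen size. A minimal sketch using pyautogui for the logical size:

```python
import pyautogui

def to_screen_coords(px: int, py: int, img_w: int, img_h: int) -> tuple[int, int]:
    """Map pixel coordinates from a (possibly 2x Retina) screenshot back to the
    logical coordinates that cliclick and AppleScript expect."""
    logical_w, logical_h = pyautogui.size()   # e.g. 1440 x 900 on a 2880 x 1800 panel
    scale_x = img_w / logical_w               # 2.0 on Retina, 1.0 otherwise
    scale_y = img_h / logical_h
    return round(px / scale_x), round(py / scale_y)
```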

Image format: PyAutoGUI returns RGBA, but the model wants JPEG, so we convert everything to RGB before sending.
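
The fix is a one-line Pillow conversion before encoding (sketch; the buffer handling may differ slightly from our code):

```python
import io
import pyautogui

def screenshot_jpeg(quality: int = 80) -> bytes:
    img = pyautogui.screenshot()        # a Pillow Image; RGBA on our setup
    if img.mode != "RGB":
        img = img.convert("RGB")        # JPEG has no alpha channel
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```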

iPhone camera/mic: iOS Safari only allows camera/mic over HTTPS, so we used ngrok to tunnel localhost securely.
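
The same tunnel can be opened programmatically; a sketch using pyngrok (an assumption here, since the plain ngrok CLI works the same way):

```python
from pyngrok import ngrok  # thin Python wrapper around the ngrok binary

# Expose the local dev server over HTTPS so iOS Safari will grant
# getUserMedia (camera/microphone) access to the page.
tunnel = ngrok.connect(8000, "http")
print("public HTTPS URL:", tunnel.public_url)
```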

Model self-correction: The agent kept running python instead of python3, so we hard-coded recovery rules into the prompt.
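
Those recovery rules are just literal instructions appended to the LAM's system prompt, along these lines (paraphrased, not our exact prompt):

```python
RECOVERY_RULES = """
When running shell commands on macOS:
- Always invoke python3, never python; the bare python binary may not exist.
- If a command fails, read the error in the next screenshot and retry with
  exactly one corrected command before asking the user for help.
"""
```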

Accomplishments that we're proud of

  • Voice -> Computer control
    You can literally talk to your phone and watch your Mac respond in real time.
    “Hey Wink, open Terminal” actually works end-to-end.

  • Real-time multimodal AI
    Live audio streaming for transcription + camera frames for visual context.
    The assistant can see what you see and respond naturally.

  • Built our own LAM
    A vision-based model that decides what to click and type from screenshots.
    Handles multi-step tasks and Retina coordinate scaling correctly.

  • True cross-device system
    Phone and laptop connect over separate WebSockets.
    Backend routes commands between them in real time with auto-reconnect.

  • Smart glasses-style UX
    Wake phrase activation, conversation timeouts, and smooth TTS on iOS Safari.
    Feels like talking to an actual assistant, not an app.

  • macOS automation that actually works
    AppleScript + cliclick + proper permissions for reliable control.
    No flaky demos, no fake clicks.

What we learned

  • Setting up secure tunneling for mobile access to local development servers.
  • Designing real-time WebSocket systems for audio, video, and command streaming.
  • Navigating macOS automation permissions and system security constraints.
  • Handling coordinate mismatches caused by Retina displays.
  • Enforcing strict JSON schemas for reliable model outputs.
  • Prompt engineering for structured reasoning and error recovery.
  • Managing latency in real-time voice and vision pipelines.
  • Dealing with browser limitations for camera, microphone, and TTS.
  • Building stable cross-device systems with reconnection logic.

What's next for Controlla

  • Support for more devices (tablets, smart TVs, IoT).
  • Smarter action planning with longer memory and context retention.
  • User-defined custom commands and macros.
  • Stronger safety controls and permission scopes per app.
  • Lower-latency streaming for faster response times.
  • Multi-user support for shared device control.
  • Training a more robust LAM for complex multi-step workflows.
