Inspiration

Ever found yourself away from your computer when you suddenly needed to access a file, push a local commit to GitHub, or run a quick script, and wished you could just text someone to do it for you? That's why we built Doeve (the do-everything agent), your personal agent for doing everything on your computer.

What It Does

Doeve lets you text or voice-message any task or command you want completed on your desktop. It can send you photos and updates as it works, showing progress in real time. You can interrupt, change, or add new tasks, all from your iMessage chat.

How We Built It

We built Doeve to make your computer something you can talk or text with. It combines a fast backend, a lightweight iMessage bridge, and a modular AI orchestration layer for voice, text, and computer control. Every interface, whether iMessage, CLI, or voice, is independent but shares a common core.

The core orchestration engine manages context, tool calls, and updates, and supports two execution modes:

  • Native mode for lightweight, direct execution
  • LangChain/LangGraph mode for complex workflows with tool chaining, routing, and memory

LangChain organizes how LLMs interact with tools, while LangGraph manages state machines, retries, and branching logic, enabling long-running, multi-step automations with error recovery.
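
Here's a minimal sketch of the routing idea in LangGraph; the node names and the classify() heuristic are illustrative, not our production logic:

```python
# Minimal LangGraph routing sketch: short commands take the fast path,
# everything else goes through the multi-step planner. The classify()
# heuristic and node names here are illustrative.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    task: str
    result: str

def classify(state: State) -> str:
    # Hypothetical complexity check: short commands run directly.
    return "direct" if len(state["task"].split()) < 6 else "plan"

def direct(state: State) -> State:
    return {**state, "result": f"ran directly: {state['task']}"}

def plan(state: State) -> State:
    return {**state, "result": f"multi-step plan for: {state['task']}"}

graph = StateGraph(State)
graph.add_node("router", lambda s: s)   # entry node only routes
graph.add_node("direct", direct)
graph.add_node("plan", plan)
graph.set_entry_point("router")
graph.add_conditional_edges("router", classify, {"direct": "direct", "plan": "plan"})
graph.add_edge("direct", END)
graph.add_edge("plan", END)

app = graph.compile()
print(app.invoke({"task": "open notion", "result": ""}))
```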

iMessage Bridge

Built with Node.js and Express using the @photon-ai/imessage-kit SDK, the bridge connects directly to the macOS Messages database and emits clean, structured events for every new message.

The Photon iMessage Kit SDK connected iMessage to the AI backend safely and reliably. It hid the low-level details of the macOS Messages database behind a clean API for reading messages and sending responses with attachments. Its real-time message watching and error handling made iteration and debugging fast, helping us test and deploy quickly.

When a message arrives, the bridge packages it with sender, group info, and attachments, then sends it to the backend over a secure API. The backend processes it, handles optional audio transcription, and queues a task for the automation agent.
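
On the backend side, the contract can be as simple as a typed payload and an async queue. A minimal sketch (the field names and endpoint path are our illustration, not the SDK's actual schema):

```python
# Sketch of the bridge -> backend handoff. Field names and the
# /messages path are illustrative, not the SDK's real schema.
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

class InboundMessage(BaseModel):
    sender: str
    chat_id: str
    is_group: bool = False
    body: str = ""
    attachments: list[str] = []   # file paths saved by the bridge

app = FastAPI()
task_queue: asyncio.Queue = asyncio.Queue()

@app.post("/messages")
async def receive_message(msg: InboundMessage):
    # Queue the task; an agent worker consumes it asynchronously.
    await task_queue.put(msg)
    return {"queued": True}
```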

As the agent runs, updates and screenshots are sent back through the same SDK, so progress appears right in the user's iMessage thread. It feels like texting a real assistant: say "open Notion," "summarize this PDF," or "record my screen," and watch the results arrive live.

Backend and Automation

The backend uses FastAPI for async performance and real-time streaming. It manages messages, job queuing, and multi-session tracking.

For automation, it calls Claude 3.5 Sonnet's vision-based computer use API, which lets the agent see the screen, recognize UI elements, and take actions like clicking or typing.
Claude's computer use gave us direct mouse and keyboard control along with screenshot access in a simple, predictable way. GUI automation felt natural and debuggable because every action was clearly visible, reliable, and reversible. The vision features combined with desktop control let us handle complex real-world tasks, like navigating apps, selecting menus, or dragging files, without manual intervention. For us, this made autonomous desktop control practical for the first time.
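
The agent loop itself is compact. A hedged sketch using the Anthropic Python SDK's computer-use beta, where perform() is a hypothetical stand-in for the real mouse/keyboard/screenshot layer:

```python
# Sketch of a Claude computer-use loop with the Anthropic Python SDK.
# perform() is a hypothetical executor standing in for the real
# mouse/keyboard/screenshot layer.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

def perform(action: dict) -> str:
    # Map Claude's action (e.g. {"action": "left_click",
    # "coordinate": [640, 480]} or {"action": "screenshot"}) to real
    # input events, returning text output or a base64 screenshot.
    raise NotImplementedError

messages = [{"role": "user", "content": "Open Notion and create a note."}]
while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[{"type": "computer_20241022", "name": "computer",
                "display_width_px": 1440, "display_height_px": 900}],
        messages=messages,
        betas=["computer-use-2024-10-22"],
    )
    tool_uses = [b for b in response.content if b.type == "tool_use"]
    if not tool_uses:
        break  # no more actions requested; the final text is the answer
    messages.append({"role": "assistant", "content": response.content})
    results = [{"type": "tool_result", "tool_use_id": t.id,
                "content": perform(t.input)} for t in tool_uses]
    messages.append({"role": "user", "content": results})
```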

The backend also supports live WebSocket updates, structured logging, and interruption handling: if you text "stop," it halts cleanly.
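
Interruption can be as simple as a per-session flag the agent checks between actions; a sketch (the names are ours):

```python
# Hypothetical per-session interruption: the agent loop checks a flag
# between actions, so a "stop" text halts cleanly mid-task.
import asyncio

stop_flags: dict[str, asyncio.Event] = {}

async def run_agent(session_id: str, steps):
    flag = stop_flags.setdefault(session_id, asyncio.Event())
    flag.clear()
    for step in steps:
        if flag.is_set():
            return "stopped"   # halt cleanly between actions
        await step()           # one screenshot/click/type cycle
    return "done"

def handle_text(session_id: str, body: str):
    if body.strip().lower() == "stop" and session_id in stop_flags:
        stop_flags[session_id].set()
```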

Audio Transcription and Voice Input

To make Doeve fully voice-capable, we integrated ElevenLabs Scribe. When an iMessage contains audio, the backend detects it, sends it to ElevenLabs for transcription, and replaces the message body with the transcribed text so it enters the same flow. If transcription fails, the user gets a helpful reply suggesting they resend as text.
We chose ElevenLabs for its accuracy across accents and its low latency, which makes voice control feel instant.
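
The transcription step is a single HTTP request; a sketch over httpx (double-check the current ElevenLabs speech-to-text parameters before relying on this):

```python
import os
import httpx

def transcribe(audio_path: str) -> str:
    """Sketch of an ElevenLabs Scribe call over plain HTTP; consult the
    current API docs for the exact parameters."""
    with open(audio_path, "rb") as f:
        resp = httpx.post(
            "https://api.elevenlabs.io/v1/speech-to-text",
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            files={"file": f},
            data={"model_id": "scribe_v1"},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()["text"]
```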

Why We Made These Choices

  • Claude computer use: direct mouse and keyboard control plus screenshots, for reliable, layout-agnostic GUI automation
  • LangChain and LangGraph: goes beyond simple prompt-response, with graph-style state machines for routing and multi-step planning
  • iMessage Kit: stable, real-time macOS bridge with structured events and clean abstractions
  • FastAPI: async and scalable, with WebSocket support
  • ElevenLabs transcription: accurate and fast enough for real-time voice
  • Monorepo: modular design for rapid iteration
  • Multiple LLMs: Claude 3.5 Sonnet for reasoning and vision, Gemini for cost efficiency, OpenAI for compatibility

LangChain and LangGraph

LangChain and LangGraph gave us an agent that goes far beyond a single prompt and answer. LangGraph's graph-style state machine lets us route tasks dynamically based on complexity: simple queries take short paths, while complex jobs follow multi-step plans.
The middleware design made it straightforward to add tool error handling, retry logic, and observability. Although some experimental components required careful testing and safe fallback behavior, the modular structure kept development clean and maintainable.
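
Retry logic falls out naturally from conditional edges; a toy sketch (the fake tool and three-attempt cap are illustrative):

```python
# Toy sketch of retries as a LangGraph conditional edge: the graph
# loops back into the tool node until it succeeds or hits a cap.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    attempts: int
    ok: bool

def call_tool(state: State) -> State:
    attempts = state["attempts"] + 1
    return {"attempts": attempts, "ok": attempts >= 2}  # fake: succeeds on try 2

def should_retry(state: State) -> str:
    return "done" if state["ok"] or state["attempts"] >= 3 else "retry"

graph = StateGraph(State)
graph.add_node("tool", call_tool)
graph.set_entry_point("tool")
graph.add_conditional_edges("tool", should_retry, {"retry": "tool", "done": END})

print(graph.compile().invoke({"attempts": 0, "ok": False}))
# -> {'attempts': 2, 'ok': True}
```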

Development Workflow

We used modular testing, dependency injection, and environment-based configuration. All actions and workflows can be recorded automatically for debugging and demos.
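
Environment-based configuration in this stack is typically a few lines of pydantic-settings; a sketch with illustrative variable names, not our exact config:

```python
# Sketch of environment-based config via pydantic-settings; the
# variable names are illustrative.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    anthropic_api_key: str
    elevenlabs_api_key: str = ""
    bridge_url: str = "http://localhost:3000"
    record_sessions: bool = False  # auto-record runs for debugging/demos

settings = Settings()  # field values are read from environment variables
```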

The Result

The result is a multimodal agent platform that lets you control your computer through text or voice. You can message "open Notion" or "run my script," or send a voice note; Doeve handles everything and replies right in your iMessage thread.

Challenges We Ran Into

We initially tried PyAutoGUI and similar screen-control tools but hit major accuracy issues, and reasoning models behind lower-cost APIs performed worse overall. Switching to Claude's vision-based control and screen mapping gave the best results.

Accomplishments We’re Proud Of

  • Built a fully working autonomous desktop agent that you can text
  • Integrated Claude computer use to achieve stable, vision-based GUI automation
  • Connected iMessage to the AI backend through the Photon SDK for seamless communication
  • Designed a modular orchestration system with LangGraph routing, retries, and observability
  • Made automation feel human and debuggable, with clear step-by-step feedback in real time
  • Built a production-ready prototype in under 36 hours while managing real system constraints
  • Collaborated across backend, AI, and product layers to ship a cohesive experience
  • Pushed through setbacks and shipped despite sleep deprivation and a missing teammate

What We Learned

You don't need to reinvent every wheel. We tried manually parsing chat.db with Python scripts before realizing the iMessage Kit SDK handled it far better. Lesson learned: use good abstractions and build on solid foundations.
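
For flavor, here's roughly what the manual route looks like; chat.db is just SQLite, but quirks like Apple's 2001 epoch in nanoseconds (and a schema that shifts between macOS versions) add up fast:

```python
# The manual approach we abandoned: query chat.db directly. Note the
# Apple epoch conversion (nanoseconds since 2001-01-01); the schema
# also varies across macOS versions.
import sqlite3
from pathlib import Path

conn = sqlite3.connect(Path.home() / "Library/Messages/chat.db")
rows = conn.execute("""
    SELECT datetime(date / 1000000000 + strftime('%s', '2001-01-01'),
                    'unixepoch', 'localtime') AS sent,
           text
    FROM message
    ORDER BY date DESC
    LIMIT 5
""").fetchall()
for sent, text in rows:
    print(sent, text)
```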

What’s Next for Doeve

We want Doeve to eventually do everything, from booking dentist appointments and rescheduling meetings to replying to emails. We want to push it further and build the ultimate do-anything interface you can text anytime.

Built With

  • anthropic-api
  • claude-3.5-sonnet
  • elevenlabs
  • fastapi
  • httpx
  • imessage-sdk
  • langchain
  • langgraph
  • node.js
  • opencv
  • photon
  • pyautogui
  • pydantic
  • pyobjc
  • python
  • typescript
  • uvicorn
  • websocket