General Agent (CU) -- Gemini API Developer Competition 2025
Inspiration
We were inspired by the vision of a truly general-purpose computer agent -- one that doesn't rely on application-specific APIs or scripting, but instead sees the screen like a human and interacts through mouse clicks and keyboard input. When Google released the Gemini Computer Use API with vision-based desktop control, we saw the opportunity to build a multi-agent system that could autonomously perform complex engineering design tasks on a real Linux desktop. The idea of an agent that could research dimensions online, generate professional reports, and then build 3D models in FreeCAD -- all by visually navigating the GUI -- felt like a meaningful step toward general computer use.
What it does
General Agent (CU) is a multi-agent system that autonomously operates an Ubuntu Linux desktop to perform engineering design tasks. It uses Google Gemini's Computer Use API (vision model) to see the screen, reason about what to do, and control the mouse and keyboard to drive applications like FreeCAD and Google Chrome -- all without any application-specific APIs or scripting.
The system includes four key capabilities:
CAD Agent: Designs 3D parts in FreeCAD by visually navigating menus, drawing sketches, applying dimensional constraints, and performing Part Design operations (Pad, Pocket, Thickness, Fillet, Chamfer, etc.) -- just like a human would. It can also execute Python macros directly in FreeCAD's embedded console for precision geometry, with full error capture and reporting.
Research Agent: Opens a browser via Playwright, searches DuckDuckGo, reads multiple pages, and extracts structured data points with confidence scores and source URLs.
Documentation Agent: Converts raw research data into professionally formatted Word (.docx) and PDF reports with tables, citations, and executive summaries.
Multi-Agent Pipeline: A Planner (powered by Gemini 3.1 Pro Preview) routes complex requests through multiple agents. For example, "Make a bracket for an M6 bolt" triggers Research (find M6 bolt clearance hole specs from engineering websites) -> Documentation (save a professional report) -> CAD (build a 3D L-bracket with correctly sized bolt holes using the researched dimensions).
The system also features a Skill Learning Pipeline that processes YouTube FreeCAD tutorial videos into structured YAML skill files, enabling the agent to learn new CAD techniques from video demonstrations.
Design Philosophy
Through extensive testing, we discovered that less instruction = better performance with vision-based computer use models. The CAD agent uses a minimal system instruction (~130 lines) that teaches basic desktop navigation and FreeCAD menu usage. All task-specific intelligence comes from the Planner, which generates detailed action plans with step-by-step workflows tailored to each shape.
This "minimal instruction + smart planning" approach outperforms longer, more detailed system prompts because the model can focus on what it sees rather than reconciling conflicting instructions.
Key Insight: Model Sophistication Matters
Testing confirmed that complex engineering workflows are absolutely possible with Computer Use -- but the quality of the output depends heavily on the model's reasoning capability, not just its ability to see and click. The vision model (Gemini 3 Flash Preview) can see the screen perfectly and click accurately, but CAD design requires reasoning about 3D geometry from 2D screenshots -- understanding which face is the "top face," how a pocket changes the shape, and generating correct Python API code. This is fundamentally a reasoning task layered on top of a vision task.
Our two-model architecture anticipates this: the Planner (running on Gemini 3.1 Pro Preview, the most capable reasoning model available) generates detailed step-by-step workflows with correct FreeCAD API examples and face-finding patterns. The CAD agent (running on Flash for Computer Use) follows those instructions. When a reasoning-class model like 3.1 Pro gains Computer Use support, it will unlock complex multi-feature designs in a single session.
How we built it
The core engine is a shared Agentic Loop that implements a multi-turn screenshot-action cycle: capture screenshot -> send to Gemini -> receive function calls -> execute via desktop/browser executor -> repeat. This loop is shared across all agents.
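The loop's structure can be sketched as follows. This is an illustrative skeleton, not the project's actual code: `capture`, `send`, and `execute` are hypothetical stand-ins for the real screenshot, Gemini API, and executor functions.

```python
# Minimal sketch of the shared agentic loop. The injected callables stand in
# for the real screenshot capture, Gemini API call, and desktop/browser
# executor; names here are illustrative, not the project's actual API.

def run_agent_loop(capture, send, execute, max_turns=50):
    """Run the screenshot -> model -> action cycle until the model stops
    returning function calls or the turn budget is exhausted."""
    history = []
    for turn in range(max_turns):
        screenshot = capture()                      # grab current screen state
        function_calls = send(history, screenshot)  # model decides next actions
        if not function_calls:                      # no calls -> task finished
            break
        for call in function_calls:
            result = execute(call)                  # click/type via executor
            history.append((call, result))          # keep call/response pairing
    return history
```

Injecting the three callables keeps the same loop reusable across the CAD, Research, and Documentation agents, which only differ in how they capture state and execute actions.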
For desktop control, we use xdotool on X11 for mouse/keyboard automation and scrot + PIL for screenshot capture. The VM runs Ubuntu with XFCE desktop at 1280x800 resolution, with screenshots resized to 1440x900 (matching Gemini's recommended resolution with the same 16:10 aspect ratio, so no distortion occurs).
The CAD Agent uses a hybrid approach: menu-driven GUI interaction for navigation (always clicking FreeCAD's menu bar -- large text targets -- instead of tiny toolbar icons, which dramatically improves click accuracy) combined with Python macro execution for precise geometry. The macro engine writes code to a file, wraps it in try/except error capture, dynamically locates the FreeCAD window via xdotool, and pastes the run command into the Python console via xclip. Errors are captured to a log file and returned to the agent so it can self-correct.
The Planner (powered by Gemini 3.1 Pro Preview, text-only) does more than route requests -- it generates detailed action plans tailored to each shape type. It classifies shapes, extracts dimensions, normalizes parameter names from inconsistent research data, and creates step-by-step FreeCAD workflows using the LLM. Complex shapes are decomposed into sequences of simple operations. For hollow shapes, we discovered that the Thickness tool (one-click face hollowing, ~24 turns) massively outperforms Pocket workflows (sketch-on-face, 65+ turns with low success rate).
The Research Agent uses Playwright to control a Chromium browser, with Gemini's vision model navigating DuckDuckGo search results (chosen for reliability -- no CAPTCHAs or cookie consent walls), clicking links, and extracting structured data.
The Skill Learning Pipeline uses yt-dlp for video download, OpenCV MOG2 background subtraction for keyframe extraction (with FreeCAD 3D viewport masking to ignore rotation), Whisper ASR as a transcription fallback, and Gemini Vision for action labeling and quality filtering.
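The keyframe idea can be shown with a simplified frame-differencing sketch. The real pipeline uses OpenCV's MOG2 background subtractor; this stand-in uses plain lists of grayscale values and a mask set for the 3D viewport region, purely to illustrate the "mask the viewport, flag big changes" approach:

```python
def extract_keyframes(frames, mask=None, threshold=10.0):
    """Flag frames whose mean absolute pixel change (outside the masked
    viewport region) exceeds a threshold. Simplified stand-in for the MOG2
    background subtraction used in the real pipeline; frames are 2D lists
    of grayscale values, mask is a set of (row, col) indices to ignore."""
    mask = mask or set()
    keyframes, prev = [], None
    for idx, frame in enumerate(frames):
        if prev is not None:
            diffs = [
                abs(frame[r][c] - prev[r][c])
                for r in range(len(frame))
                for c in range(len(frame[r]))
                if (r, c) not in mask
            ]
            if diffs and sum(diffs) / len(diffs) > threshold:
                keyframes.append(idx)
        prev = frame
    return keyframes
```

Masking the 3D viewport is what lets the pipeline ignore camera rotation and react only to genuine UI actions like dialog openings and menu clicks.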
We built all coordinate handling on Gemini's normalized 0-1000 grid, with executors converting to actual screen pixels. The system includes a one-command GCP deployment script that provisions a complete Ubuntu VM with XFCE, Xvfb virtual display, VNC server, FreeCAD, and all dependencies.
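The coordinate conversion itself is simple; a minimal sketch of the executor-side mapping (function name is illustrative):

```python
def denormalize(x_norm: int, y_norm: int, screen_w: int = 1280, screen_h: int = 800):
    """Convert Gemini's normalized 0-1000 coordinates to actual screen pixels.
    The VM runs at 1280x800 while screenshots are sent at 1440x900; both are
    16:10, so the normalized coordinates map back without distortion."""
    return round(x_norm / 1000 * screen_w), round(y_norm / 1000 * screen_h)
```

Keeping the model on the normalized grid means the executor is the only place that needs to know the VM's real resolution.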
Key technologies: Google Gemini 3 Flash Preview (gemini-3-flash-preview) for Computer Use, Google Gemini 3.1 Pro Preview (gemini-3.1-pro-preview) for planning, xdotool, xclip, scrot, Playwright, FreeCAD 1.0, OpenCV, yt-dlp, Whisper ASR, fpdf2, python-docx, PIL.
Challenges we ran into
Gemini API conversation history management was our biggest challenge. Gemini requires strict role alternation -- consecutive messages from the same role cause 400 INVALID_ARGUMENT errors. Every function call needs a matching response, and the model's internal thought signatures cannot be fabricated. We implemented a robust error recovery system: on 400 errors, history is completely reset to the initial prompt plus a fresh screenshot (trimming doesn't work because it breaks function call/response pairing).
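The recovery logic can be sketched with plain dicts (not the real API types; `validate_history` and `recover_history` are illustrative names):

```python
def validate_history(history):
    """Check Gemini's strict role alternation: consecutive messages from the
    same role trigger 400 INVALID_ARGUMENT. Returns True if the history is
    safe to send. (Sketch using plain dicts with a 'role' key.)"""
    return all(a["role"] != b["role"] for a, b in zip(history, history[1:]))

def recover_history(initial_prompt, fresh_screenshot):
    """Full reset on a 400 error: initial prompt plus a fresh screenshot.
    Trimming individual turns instead would break the function-call /
    function-response pairing that Gemini enforces."""
    return [{"role": "user", "parts": [initial_prompt, fresh_screenshot]}]
```

The key point is that partial trimming is never safe: removing one turn can orphan a function call or response, so the only reliable recovery is a clean restart with current screen state.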
Click accuracy on small UI elements was another major hurdle. FreeCAD's toolbar icons are roughly 24 pixels -- far too small for reliable vision-based clicking. We solved this by instructing the agent to always use FreeCAD's menu bar, which has much larger text targets. This was one of our most impactful design decisions.
Macro execution reliability: Getting Python code to actually run inside FreeCAD's embedded console required solving multiple problems: the console input position varies depending on window size and layout, character-by-character typing via xdotool is slow and error-prone, and FreeCAD's Python errors are invisible unless explicitly captured. We solved this with dynamic window detection (xdotool search), clipboard paste via xclip, and try/except wrapping that writes errors to a log file.
FreeCAD API hallucination: The vision model generates macros with wrong property names (e.g., HoleType instead of the correct property), wrong face references (guessing Face6 instead of finding faces by position), and incorrect constraint syntax. We mitigated this by having the Planner inject correct API examples into the workflow, and by teaching the model to find faces dynamically: max(body.Shape.Faces, key=lambda f: f.CenterOfMass.z).
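The face-finding pattern generalizes to any axis. Here is the idea with stand-in objects (in FreeCAD, `faces` would be `body.Shape.Faces` and `CenterOfMass` a `Base.Vector`; the stand-ins just make the selection logic visible):

```python
from types import SimpleNamespace

# Stand-in Face objects; in a real macro these come from body.Shape.Faces
# and CenterOfMass is a FreeCAD Base.Vector. The selection pattern is the same.
faces = [
    SimpleNamespace(name="Face1", CenterOfMass=SimpleNamespace(z=0.0)),
    SimpleNamespace(name="Face4", CenterOfMass=SimpleNamespace(z=10.0)),
    SimpleNamespace(name="Face6", CenterOfMass=SimpleNamespace(z=5.0)),
]

# Find the top face by geometric position instead of guessing an index like "Face6"
top_face = max(faces, key=lambda f: f.CenterOfMass.z)
```

Selecting by position survives feature reordering and model edits, whereas hardcoded indices like `Face6` break whenever the topology changes.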
Shape decomposition complexity -- different shapes require fundamentally different FreeCAD workflows. We built a Planner that uses Gemini 3.1 Pro to generate workflows dynamically instead of hardcoding templates for each shape type.
Search engine reliability: Google search triggered CAPTCHAs that blocked the research agent. We switched to DuckDuckGo exclusively, which works reliably with automated browsing.
PDF generation issues -- control characters (0x00-0x1f, 0x7f-0x9f) in research results caused "Not enough horizontal space" errors. We built a safe() function that strips control characters and handles encoding.
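A minimal sketch of the sanitizer (the real function also handles encoding fallbacks):

```python
def safe(text: str) -> str:
    """Strip control characters (0x00-0x1f, 0x7f-0x9f) that break fpdf2's
    line wrapping with 'Not enough horizontal space' errors. Newlines and
    tabs are kept since the report layout uses them."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or not (ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F)
    )
```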
Stuck detection -- the agent would sometimes get stuck repeating the same action. We implemented stage budgets (per-phase turn limits), turn countdown warnings, and repeated-action detection that injects warning messages into the conversation.
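The repeated-action check reduces to comparing the tail of the action history; a sketch (function name illustrative):

```python
def is_stuck(recent_actions, window=3):
    """Detect the agent repeating itself: True if the last `window` actions
    are identical, at which point a warning message is injected into the
    conversation. (Sketch; the real system also enforces per-phase budgets.)"""
    if len(recent_actions) < window:
        return False
    tail = recent_actions[-window:]
    return all(a == tail[0] for a in tail)
```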
Video pipeline codec issues -- most VMs lack AV1 hardware decoding, so we had to force h264 encoding in the download pipeline.
Accomplishments that we're proud of
We're proud of building a truly general-purpose computer use system that works without any application-specific APIs. The agent navigates FreeCAD purely through vision -- it reads menus, clicks buttons, types dimensions, and reasons about 3D geometry just by looking at screenshots.
The Research -> CAD pipeline is particularly impressive: the system can take a request like "Make a bracket for an M6 bolt," automatically research real-world M6 bolt clearance hole specifications from engineering websites, generate a professional report with sourced data points, extract the exact dimensions needed, and then build a 3D model using those researched specifications.
The two-model architecture is a key insight -- using Gemini 3.1 Pro (the strongest reasoning model) for planning and workflow generation, while using Gemini 3 Flash (with Computer Use support) for execution. This compensates for Flash's weaker reasoning and proves that complex engineering workflows are possible when you separate the "thinking" from the "doing."
The macro execution engine with error capture and dynamic window detection represents a novel approach to bridging vision-based agents with programmatic precision -- the agent can seamlessly switch between clicking menus and running exact Python code.
The "minimal instruction + smart planning" design philosophy is a reusable insight -- we discovered that a short ~130-line system prompt combined with intelligent Planner-generated action plans outperforms verbose, detailed system instructions.
The one-command GCP deployment (./scripts/deploy.sh --gcp) provisions a complete Linux VM with everything needed in under 10 minutes.
The Skill Learning Pipeline -- processing YouTube tutorial videos into structured skill files that the agent can reference -- enables continuous learning from human demonstrations.
What we learned
We learned that complex engineering workflows are absolutely possible with Computer Use, but require a more sophisticated model than what's currently available for Computer Use. The vision model can see and click perfectly -- the bottleneck is reasoning about 3D geometry, generating correct API code, and maintaining coherence across long sessions. When a reasoning-class model like Gemini 3.1 Pro gets Computer Use support, the same architecture will unlock dramatically more complex designs.
We learned that less instruction = better performance: minimal system prompts combined with smart planning outperform verbose instructions. The vision model works best when it can focus on what it sees.
Menu-driven interaction was one of our most impactful discoveries -- using the menu bar (large text targets) instead of toolbar icons (~24px) dramatically improves click accuracy. UI interaction strategy matters enormously for vision-based agents.
One feature per macro is critical -- a single macro creating sketch + pad + pocket + hole will silently fail at the first error. Separating into one macro per feature with screenshot verification between each allows the agent to detect and respond to failures.
Thickness over Pocket: The same end result (e.g., a hollow box) can be achieved through very different workflows with dramatically different success rates. Choosing the right FreeCAD approach for each shape type matters more than improving the model's general precision.
We learned that research agent reliability depends entirely on the search engine -- Google search triggers CAPTCHAs that block automated agents, while DuckDuckGo works consistently.
Conversation history management with Gemini's strict alternation requirements taught us a lot about building reliable agentic systems. Coordinate system handling (normalized 0-1000 grids, VM resolution vs. model resolution, aspect ratio matching) reinforced the importance of precise geometric reasoning in computer use agents.
What's next for General Agent (CU)
- Waiting for Gemini 3.1 Pro Computer Use -- our architecture is ready. When the strongest reasoning model gets Computer Use support, it will unlock complex multi-feature CAD designs, self-correcting macros, and reliable spatial reasoning.
- Expanding agent capabilities beyond FreeCAD to other engineering tools (KiCad for PCB design, Blender for 3D modeling)
- Adding a web interface for remote agent control and monitoring
- Better parallelism -- running multiple agents simultaneously on different tasks
- More robust error recovery with learning from failure patterns
- Leveraging the Skill Learning Pipeline with stronger models that can process longer contexts
- Multi-part assemblies using FreeCAD's Assembly workbench
Built With
- freecad
- gemini
- google-cloud
- opencv
- playwright
- python
- ubuntu
- vm
- xdotool