The Cloud Orchestrator

Figure: The Cloud Orchestrator architecture diagram
Figure: Sample planner DAG

🚀 Inspiration

Spinning up cloud infrastructure today feels like playing whack-a-mole—juggling a dozen dashboards, wrestling with YAML files, dodging IAM errors, and praying you don’t blow past budget limits. It’s powerful, yes—but also painfully manual, fragile, and slow.

As AI evolves beyond chatbots into collaborative reasoning agents, we asked ourselves: "Why can’t we just ask the cloud what we want—and have it build itself?"

That question led us down the rabbit hole of autonomous agents, service orchestration, and natural language interfaces. When we discovered Google Cloud’s Agent Development Kit (ADK), it felt like a missing puzzle piece—a way to make cloud infrastructure not just programmable, but conversational.

We imagined a future where AI agents could talk to each other, plan tasks, enforce safety constraints, and carry out cloud operations—without a single manual click.

Cloud Orchestrator is our first step toward that future. Our goal was to build a system where multiple intelligent agents could understand human intent, validate constraints, and execute cloud workflows—all autonomously.

💡 What It Does

🧭 Our Vision: The Cloud Orchestrator

Instead of oversimplifying Google Cloud (and losing its power), we built something that makes its complexity manageable:

An intelligent, multi-agent system that understands your intent and orchestrates 20+ Google Cloud services through natural language prompts.

Our goal wasn’t to replace Google Cloud—it was to make its full capabilities instantly accessible, even to users without platform expertise.

🏗️ The Foundation: Our Tech Stack

🧠 The Brain — Gemini 2.5 Flash
Low-latency reasoning model shared by all three agent tiers.
🎯 The Orchestrator — Agent Development Kit (ADK)
Enables multi-agent coordination and communication.
☁️ The Platform — Google Cloud Platform
20+ services orchestrated through a single natural language interface.
🔧 The Logic — Python
Powers our agent backend and integrates all service tools.
🧵 The Foundation — Cloud Pub/Sub & Cloud Functions
Enables real-time, event-driven execution across the system.

Combining ADK's orchestration engine with Gemini's speed gave us the base we needed for intelligent automation.

🎼 The Agent Orchestrator: Three-Tier Intelligence Architecture

Rather than relying on one massive agent, we designed a three-tiered system where intelligent roles collaborate seamlessly:

🧠 Tier 1: PlannerAgent — The Strategic Mind

Powered by: Gemini 2.5 Flash
Role: Converts natural language into an execution-ready DAG.
What it does:
When you say “I need a scalable ML pipeline”, the agent plans every step: storage, compute, training, and deployment—mapped out as a Directed Acyclic Graph (DAG).
Why it’s cool:
It understands dependencies and execution order—no human intervention needed.
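
To make the idea of an execution-ready DAG concrete, here is a minimal sketch of how a planner's JSON output could be parsed, checked, and turned into an execution order. The plan schema (`tasks`, `id`, `service`, `depends_on`), the task names, and the helper functions are assumptions made for illustration; they are not the project's actual format or code. Only the Python standard library is used.

```python
import json
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical planner output; the field names are illustrative assumptions.
RAW_PLAN = """
{
  "tasks": [
    {"id": "create_bucket",  "service": "storage",   "depends_on": []},
    {"id": "create_cluster", "service": "gke",       "depends_on": []},
    {"id": "train_model",    "service": "vertex_ai", "depends_on": ["create_bucket"]},
    {"id": "deploy_model",   "service": "vertex_ai",
     "depends_on": ["train_model", "create_cluster"]}
  ]
}
"""


def parse_plan(text: str) -> dict[str, set[str]]:
    """Parse the JSON plan and return {task_id: set of prerequisite task ids}."""
    plan = json.loads(text)
    graph = {task["id"]: set(task["depends_on"]) for task in plan["tasks"]}
    # Reject dependencies on tasks the plan never defines.
    for task_id, deps in graph.items():
        unknown = deps - set(graph)
        if unknown:
            raise ValueError(f"{task_id} depends on unknown tasks: {sorted(unknown)}")
    return graph


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"plan is not acyclic: {exc.args[1]}") from exc


graph = parse_plan(RAW_PLAN)
order = execution_order(graph)
print(order)
```

Rejecting unknown dependencies and cycles before anything runs is one way to catch a malformed plan early, whatever model produced it.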

🛡️ Tier 2: GuardAgent — Your Intelligent Safety Net

Powered by: Gemini 2.5 Flash
Role: Checks for budget, quota, and policy compliance.
What it does:
Validates each action using real-time calls to Google Cloud’s Billing and Quota APIs, so that budget overruns and quota failures are caught before anything is provisioned.
Why it matters:
Think of it as your personal cloud risk manager.
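
The check itself can be sketched as a pure function over numbers that have already been fetched. In this hedged example the `PlannedAction` and `Limits` types, the cost estimates, and the `CPUS` metric name are all hypothetical; in the real system the remaining budget and quota would come from the Billing and Quota APIs, which are deliberately not called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    """One step of a plan, with illustrative (assumed) cost and quota estimates."""
    task_id: str
    estimated_cost_usd: float
    quota_metric: str
    quota_needed: int


@dataclass(frozen=True)
class Limits:
    """Snapshot of remaining budget and quota, assumed to be fetched elsewhere."""
    remaining_budget_usd: float
    remaining_quota: dict  # metric name -> units still available


def guard(actions, limits):
    """Return a list of violation messages; an empty list means the plan may run."""
    violations = []

    total_cost = sum(a.estimated_cost_usd for a in actions)
    if total_cost > limits.remaining_budget_usd:
        violations.append(
            f"estimated cost ${total_cost:.2f} exceeds remaining budget "
            f"${limits.remaining_budget_usd:.2f}"
        )

    # Sum quota demand per metric across the whole plan, not per action.
    needed = {}
    for a in actions:
        needed[a.quota_metric] = needed.get(a.quota_metric, 0) + a.quota_needed
    for metric, units in needed.items():
        available = limits.remaining_quota.get(metric, 0)
        if units > available:
            violations.append(f"{metric}: need {units}, only {available} available")

    return violations


actions = [
    PlannedAction("create_cluster", 40.0, "CPUS", 8),
    PlannedAction("train_model", 25.0, "CPUS", 16),
]
limits = Limits(remaining_budget_usd=50.0, remaining_quota={"CPUS": 32})
for problem in guard(actions, limits):
    print(problem)
```

Checking the plan as a whole, instead of one action at a time, matters because two actions that each fit the quota can still exceed it together.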

⚡ Tier 3: WorkerHubAgent — The Execution Expert

Powered by: Gemini 2.5 Flash
Role: Executes cloud operations by coordinating 20+ sub-agents.
What it does:
Each GCP service has its own specialized agent (BigQuery, Vertex AI, Compute Engine, etc.) that executes safely and autonomously under WorkerHub’s guidance.
Why it works:
Modular, parallel, fault-tolerant execution—all handled by intelligent agents.
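
As a rough illustration of the dispatch pattern, the sketch below runs every task whose dependencies are satisfied in parallel and routes each one to a per-service handler. The handler functions, the `services` mapping, and `run_plan` are hypothetical stand-ins written with plain `asyncio`; they do not use ADK and are not the project's actual sub-agents.

```python
import asyncio
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical sub-agents: one coroutine per service, standing in for real calls.
async def storage_agent(task: str) -> str:
    await asyncio.sleep(0)
    return f"storage handled {task}"


async def gke_agent(task: str) -> str:
    await asyncio.sleep(0)
    return f"gke handled {task}"


async def vertex_agent(task: str) -> str:
    await asyncio.sleep(0)
    return f"vertex_ai handled {task}"


async def run_plan(graph, services, agents):
    """Run tasks in dependency order, executing each ready batch in parallel.

    Returns (results, failed). Scheduling stops after the first batch that
    contains a failure, so dependents of a failed task never start.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    results, failed = {}, {}
    while sorter.is_active() and not failed:
        ready = sorter.get_ready()  # tasks whose prerequisites are all done
        outcomes = await asyncio.gather(
            *(agents[services[task]](task) for task in ready),
            return_exceptions=True,  # collect errors instead of cancelling siblings
        )
        for task, outcome in zip(ready, outcomes):
            if isinstance(outcome, Exception):
                failed[task] = outcome
            else:
                results[task] = outcome
                sorter.done(task)
    return results, failed


graph = {
    "create_bucket": set(),
    "create_cluster": set(),
    "deploy_model": {"create_bucket", "create_cluster"},
}
services = {"create_bucket": "storage", "create_cluster": "gke", "deploy_model": "vertex_ai"}
agents = {"storage": storage_agent, "gke": gke_agent, "vertex_ai": vertex_agent}

results, failed = asyncio.run(run_plan(graph, services, agents))
print(results, failed)
```

Here `create_bucket` and `create_cluster` run together in the first batch, and `deploy_model` starts only after both succeed. A production hub would likely add retries and finer-grained scheduling; this sketch only shows the routing and ordering.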

With this system, Cloud Orchestrator lets you deploy complex infrastructure with a single sentence—no scripts, no toggles, just AI agents working in harmony.

We tested the system with complex real-world workflows like:

"Provision a GKE Autopilot cluster, connect it to a VPC, and deploy a Vertex AI model."

Everything—from infrastructure setup to model deployment—was automated with one natural prompt.

Challenges we ran into

LLM hallucinations were a major early blocker. We addressed this by enforcing structured JSON output and tightly constrained prompts.
Quota and budget enforcement was not trivial: we had to integrate deeply with Google’s Billing and Quota APIs and design the GuardAgent to catch violations before execution.
Scaling to 20+ GCP services required designing for modularity, error resilience, and async retries.
IAM permission propagation delays caused race conditions during execution. We resolved this with backoff-based polling and smarter task scheduling.
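
Backoff-based polling of the kind mentioned in the last item can be sketched in a few lines. This is a generic exponential backoff loop under assumed parameters (`attempts`, `base_delay`, `max_delay`); the `check` callable is a hypothetical placeholder for whatever probe confirms that a permission has taken effect, and it is not the project's actual code.

```python
import time


def poll_with_backoff(check, *, attempts=6, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call check() until it returns True, doubling the wait between tries.

    Returns the number of attempts used, or raises TimeoutError if every
    attempt fails. `sleep` is injectable so the loop can be tested without
    actually waiting.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait at max_delay
    raise TimeoutError(f"condition not met after {attempts} attempts")


# Usage with a stand-in probe that succeeds on the fourth call.
calls = {"count": 0}


def permission_is_active():
    calls["count"] += 1
    return calls["count"] >= 4


waits = []
used = poll_with_backoff(permission_is_active, sleep=waits.append)
print(used, waits)
```

Adding random jitter to each delay is a common refinement when many tasks poll at once; it is left out here to keep the sketch short.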

Accomplishments that we're proud of

🚀 Built a fully autonomous, multi-agent system capable of orchestrating 20+ Google Cloud services using just natural language prompts.
🧠 Designed a scalable, modular 3-tier agent architecture separating planning, validation, and execution—enabling easy extension and debugging.
🔐 Implemented real-time quota and budget enforcement using Google Cloud’s Billing and Quota APIs, ensuring safe and cost-aware deployments.
📊 Generated structured DAGs from LLMs reliably, solving common hallucination issues through constraint-based prompting.
🔁 Enabled full infrastructure lifecycle automation, from provisioning to deployment to monitoring, across services like GKE, Vertex AI, BigQuery, and Cloud Run.
⚡ Demonstrated complex cloud orchestration through a single prompt: “Deploy a GKE Autopilot cluster and serve a Vertex AI model”—no scripts, no toggles, just AI agents in action.
✍️ Published a technical blog post to share our architecture, challenges, and learnings with the broader AI and DevOps community.
🤖 Pushed the boundaries of DevOps automation, showing what's possible when AI agents collaborate intelligently.

What we learned

How to build modular agent architectures that separate planning, validation, and execution
How to leverage Google’s Gemini LLMs for structured task planning and DAG generation
How to safely interact with quotas and budgets via Google’s Billing and Quota APIs
How to scale a multi-agent system to handle 20+ GCP services while staying robust and maintainable
The importance of constraint enforcement and fallback logic when working with real cloud systems