Dev Loop Intelligence

AI agent that automates your entire dev pipeline.


Inspiration

One pipeline, triggered by a spec, that carries a feature from English prose to a carbon-optimized production deploy without a human having to hand anything off.

The headline idea is green scheduling: deferring cloud deploys to the hours when the electrical grid is cleanest. The difference is measurable:

$$ \Delta\text{CO}_2 = (\lambda_{\text{baseline}} - \lambda_{\text{slot}}) \times P \times \Delta t \times 10^{-3} \quad [\text{kg}\,\text{CO}_2] $$

where $\lambda$ is grid carbon intensity in $\text{gCO}_2\text{eq}/\text{kWh}$,
$P$ is deploy power draw in $\text{kW}$, and
$\Delta t$ is duration in hours.

For a team shipping ten deploys a day on the California grid, this is 0.5–2 kg CO₂ daily — small per deploy, significant at scale, and completely free to capture if you just know when to wait.
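As a worked sketch of the formula above (the intensity and power figures here are illustrative, not measured values from the project):

```python
def deploy_co2_savings_kg(baseline_gco2_kwh, slot_gco2_kwh, power_kw, hours):
    """ΔCO2 = (λ_baseline − λ_slot) × P × Δt × 10⁻³  [kg CO₂]"""
    return (baseline_gco2_kwh - slot_gco2_kwh) * power_kw * hours * 1e-3

# e.g. a California evening peak (~350 g/kWh) vs a midday solar slot (~120 g/kWh),
# for a 0.5 kW deploy running half an hour:
saving = deploy_co2_savings_kg(350, 120, 0.5, 0.5)
print(f"{saving:.4f} kg CO₂")  # → 0.0575 kg CO₂
```

At ten deploys a day, that single shift is roughly 0.6 kg CO₂, consistent with the 0.5–2 kg range above.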


What it does

Dev Loop Intelligence is an always-on SDLC agent with five stages that run sequentially and stream live progress back to the browser over SSE.

Stage 1 — Spec Intake. Paste any English (PRD, Slack message, bullet list). Claude parses it into structured GitLab issues with acceptance criteria, effort estimates ($S / M / L \mapsto 1 / 3 / 8$ story points), dependency graphs, and an auto-generated epic parent linking all child stories.

Stage 2 — Semantic Code Review. When a merge request opens, the agent fetches the linked issue's acceptance criteria, embeds them into ChromaDB, builds a coverage map of which code sections relate to which criteria, and asks Claude to return a structured PASS / FAIL verdict for each. Results post as inline MR comments.

Stage 3 — Merge Risk Scoring. Queries 30 days of CI failure history from BigQuery, extracts a 7-feature vector, and sends it to a Vertex AI prediction endpoint. The result is a LOW / MEDIUM / HIGH badge that posts to the MR and gates the release:

$$ \vec{f} = \begin{bmatrix} \delta_{\text{complexity}} \\ s_{\text{trouble}} \\ \tau_{\text{test}} \\ \delta_{\text{deps}} \\ |H|/1000 \\ \bar{r}_{\text{fail}} \\ |C|/10 \end{bmatrix}, \quad \hat{r} = \text{VertexAI}(\vec{f}) \in [0, 1] $$
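Assembling that vector might look like the following sketch; the dictionary field names are assumptions for illustration, not the project's actual schema:

```python
def build_feature_vector(mr):
    """Assemble the 7-feature risk vector in the order shown above.
    Field names are hypothetical; only the feature semantics come from the doc."""
    return [
        mr["complexity_delta"],         # δ_complexity
        mr["trouble_score"],            # s_trouble
        mr["test_signal"],              # τ_test
        mr["dependency_delta"],         # δ_deps
        mr["hunk_count"] / 1000,        # |H| / 1000
        mr["recent_fail_rate"],         # r̄_fail (30-day CI failure rate)
        mr["components_touched"] / 10,  # |C| / 10
    ]
```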

Stage 4 — Release Orchestration. Checks all gates (issues resolved, no HIGH risk MRs, security scans green), then uses Claude to write grounded release notes from real milestone data. An empty-category filter prevents Claude from hallucinating content for sections with no actual changes.

Stage 5 — Green Deploy Scheduling. Queries a provider chain (Carbon Aware SDK → ElectricityMaps → WattTime → Mock) for a 6-hour carbon intensity forecast, finds the lowest-carbon slot, and defers the deploy if and only if:

$$ \text{defer} = \mathbb{1}\!\left[\, t_{\text{wait}} > 300\,\text{s} \;\wedge\; \frac{\lambda_{\text{baseline}} - \lambda_{\text{slot}}}{\lambda_{\text{baseline}}} > 0.10 \;\wedge\; \ell \neq \texttt{green-urgent} \,\right] $$

If defer = true, the orchestrator opens the deploy MR with a green-hold label and creates a GitLab scheduled pipeline for the green window. Actual CO₂ savings are recorded to BigQuery on completion.
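The deferral predicate translates directly to code; a minimal sketch, assuming labels arrive as a list of strings:

```python
def should_defer(baseline, slot, wait_seconds, labels):
    """Defer iff waiting >5 min, saving >10% carbon, and not labelled urgent."""
    relative_saving = (baseline - slot) / baseline
    return (
        wait_seconds > 300
        and relative_saving > 0.10
        and "green-urgent" not in labels
    )
```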


How we built it

Node.js gateway (server.js) orchestrates all five stages. It health-checks each Python microservice on every request and falls back to Claude simulation or heuristics if any service is offline — the pipeline never blocks.

Three Python FastAPI microservices run on Cloud Run:

| Service | Port | Key dependency |
| --- | --- | --- |
| review-service | 8001 | ChromaDB + Claude |
| risk-service | 8002 | BigQuery + Vertex AI |
| green-service | 8003 | ElectricityMaps / WattTime / BigQuery |

All three share a single Dockerfile and are differentiated by the SERVICE_MODULE environment variable — one build context, three services.

GitLab Duo integration runs through the AI Gateway. Stages 1–2 are implemented as External Agents (.gitlab/duo/agents/*.yml) with injectGatewayToken: true. Stages 3–5 run as a Foundational Flow (.gitlab/duo/flows/release_orchestrator.yml). The green stage uses create_scheduled_pipeline — a native GitLab API call — rather than a cron job.

Carbon intensity provider chain uses a priority-ordered fallback list:

self._chain = []
if cas_url:  self._chain.append(_CarbonAwareSDKProvider(cas_url))
if em_key:   self._chain.append(_ElectricityMapsProvider(em_key))
if wt_user:  self._chain.append(_WattTimeProvider(wt_user, wt_pass))
self._chain.append(_MockProvider())  # always last — never fails

WattTime returns marginal emissions in $\text{lbs CO}_2/\text{MWh}$; everything else is normalised to $\text{gCO}_2\text{eq}/\text{kWh}$ before leaving the adapter:

$$ \lambda_{\text{normalised}} = \lambda_{\text{MOER}} \times \frac{453.592\,\text{g/lb}}{1000\,\text{kWh/MWh}} = \lambda_{\text{MOER}} \times 0.453592 $$
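The conversion is a single constant factor; a minimal sketch of the adapter-side normalisation described above:

```python
LBS_PER_MWH_TO_G_PER_KWH = 453.592 / 1000  # = 0.453592

def normalise_moer(moer_lbs_per_mwh):
    """Convert WattTime MOER (lbs CO₂/MWh) to gCO₂eq/kWh."""
    return moer_lbs_per_mwh * LBS_PER_MWH_TO_G_PER_KWH

# e.g. a MOER of 900 lbs/MWh ≈ 408.2 g/kWh
```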

Frontend is a draggable OS-style interface with five live stage cards driven by SSE. The green metrics HUD polls /api/devloop/green/intensity and /api/devloop/green/savings every 10 seconds and renders a real-time carbon intensity bar.


Challenges we ran into

ChromaDB session collision. Using positional IDs (criteria_0, criteria_1) for embeddings meant that reviewing a second MR silently overwrote the embeddings from the first — a bug that only appeared under load. Fixed by switching to content-based MD5 IDs:

$$ \text{id}_i = \text{MD5}(c_i) \quad \text{where } c_i \text{ is the criterion text} $$
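In code, the fix is one line: derive the embedding ID from the criterion text itself, so two MRs reviewed in the same session can no longer collide on positional IDs.

```python
import hashlib

def criterion_id(text: str) -> str:
    """Content-addressed embedding ID: identical criteria map to the same ID;
    distinct criteria from different MRs can never overwrite each other."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```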

Stopword contamination in coverage mapping. The keyword matching step treated "return", "a", and "the" as meaningful signal. The word "return" matched every Python function; "a" matched almost everything. Coverage scores were inflated to near 1.0 for all criteria. Fixed with a _STOPWORDS frozenset filter applied before matching.
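A minimal sketch of that filter; the actual stopword list is larger, and the words shown here are only the examples named above plus a few common assumptions:

```python
_STOPWORDS = frozenset({"a", "an", "the", "and", "or", "return", "if", "is", "to"})

def keywords(criterion: str) -> set[str]:
    """Tokenise a criterion, dropping words too common to carry signal."""
    return {w for w in criterion.lower().split() if w not in _STOPWORDS}
```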

Context window pressure on large MRs. A 5,000-line diff with 20 acceptance criteria and a coverage map would exceed the model's context window. Fixed with a hard truncation at $10^5$ characters (~25k tokens), with an explicit notice to Claude that the diff was cut — so it doesn't confabulate coverage for the truncated portion.
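The truncation plus explicit notice can be sketched as follows (the notice wording is illustrative):

```python
MAX_DIFF_CHARS = 100_000  # ≈ 25k tokens at ~4 chars/token

def truncate_diff(diff: str) -> str:
    """Hard-cap the diff and tell the model it was cut, so it does not
    confabulate coverage for the omitted remainder."""
    if len(diff) <= MAX_DIFF_CHARS:
        return diff
    return diff[:MAX_DIFF_CHARS] + (
        "\n[NOTE: diff truncated at 100,000 characters; "
        "do not infer coverage for the omitted remainder]"
    )
```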

WattTime token expiry race. The v3 API issues bearer tokens with a 30-minute TTL. Caching for exactly 30 minutes meant a token fetched at $t=0$ could expire mid-request at $t=29\text{m}\,59\text{s}$. Fixed by caching for 28 minutes:

$$ T_{\text{cache}} = T_{\text{TTL}} - \Delta_{\text{safety}} = 30\text{m} - 2\text{m} = 28\text{m} $$
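A minimal sketch of the safety-margin cache, assuming a caller-supplied fetch function rather than the real WattTime login call:

```python
import time

CACHE_TTL_S = 28 * 60  # 30 min token TTL minus a 2 min safety margin

class TokenCache:
    def __init__(self, fetch):
        self._fetch = fetch
        self._token, self._at = None, 0.0

    def get(self):
        # Refresh before the upstream TTL expires, never at the boundary.
        if self._token is None or time.monotonic() - self._at > CACHE_TTL_S:
            self._token = self._fetch()
            self._at = time.monotonic()
        return self._token
```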

SQL injection in the BigQuery history query. The days parameter was interpolated directly into an f-string SQL query. Fixed with an explicit int() cast and range guard before interpolation: $1 \leq \texttt{days} \leq 365$.
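The guard is small; a sketch under the assumption of a hypothetical `ci.pipeline_runs` table name (the real query and schema are not shown here):

```python
def history_query(days) -> str:
    days = int(days)  # non-numeric input (e.g. "30; DROP TABLE") raises ValueError
    if not 1 <= days <= 365:
        raise ValueError("days must be in [1, 365]")
    # Safe to interpolate: days is now a validated int, not raw user input.
    return (
        "SELECT * FROM `ci.pipeline_runs` "
        f"WHERE started_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)"
    )
```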

Epic parent silently discarded. The spec processor created a parent epic issue but never added it to the returned list. Callers had no reference to it and couldn't link child issues. Fixed by prepending the parent to the return value at index 0.


Accomplishments that we're proud of

The pipeline actually runs end-to-end. With only ANTHROPIC_API_KEY set, all five stages complete — stages 2, 3, and 5 fall back to Claude simulation, heuristic scoring, and mock carbon data respectively. docker-compose up is the entire setup.

The green savings formula is auditable. Every deploy writes baseline_intensity, slot_intensity, kwh_consumed, and saved_kgco2 to a partitioned BigQuery table. The ESG number isn't an estimate — it's the integral of measured emissions over measured time, queryable by any BI tool.

Zero credential management for GitLab users. The Duo agent integration uses injectGatewayToken: true, which means GitLab manages the Anthropic API credentials through the AI Gateway. A user onboarding to Dev Loop Intelligence never touches an API key.

The risk feature vector is interpretable. Each of the seven features maps to a human-readable explanation that posts to the MR:

$$ f_0 > 0.5 \Rightarrow \texttt{"High complexity delta"}, \quad f_3 > 0.2 \Rightarrow \texttt{"Dependency changes increase risk"}, \quad \ldots $$

Engineers see not just the risk score but exactly why it was assigned.


What we learned

Fallback design is the actual product. We initially treated the Python microservices as the core and the Node fallbacks as a demo convenience. We got this backwards. The fallbacks are what make the pipeline trustworthy in production — a review service restart shouldn't block a release. Designing the fallback paths with the same care as the primary paths changed how we thought about every stage.

Empty categories are a hallucination trap. Sending Claude a prompt with five change-type buckets when only two have content reliably produces plausible-sounding text for the empty ones. The fix — filtering to active_changes before building the prompt — is three lines of Python, but finding the failure mode required actually reading Claude's output carefully rather than assuming correctness.

Carbon intensity signal choice matters for compliance. Average intensity (ElectricityMaps) is Scope 2 / GHG Protocol compatible. Marginal intensity (WattTime MOER) is theoretically more accurate for load-shifting but is not accepted for Scope 2 reporting. Using WattTime as primary would have made our CO₂ savings numbers unauditable for enterprise customers. The provider chain priority order encodes this decision explicitly.

SSE is underused for agentic workflows. Polling a /status endpoint for pipeline progress introduces latency and wastes requests. SSE gives the frontend a push channel with zero infrastructure overhead — no WebSocket server, no message broker. For a five-stage pipeline where each stage takes 3–8 seconds, the UX difference is significant.


What's next

Vertex AI model training pipeline. The BigQuery schema and model config exist; the training job is not yet scripted. The next step is a Cloud Build trigger that retrains the risk model weekly on fresh CI history, so the risk scores improve as the codebase grows.

Multi-repo blast radius. The current risk scorer counts components touched within a single repo. In a microservices architecture, a change to a shared library should propagate risk scores to downstream services. This requires a dependency graph across repos — queryable from BigQuery if the CI jobs emit the right structured logs.

Per-team deferral budgets. Right now max_delay_hours is a scalar. A richer model would allocate a carbon budget per team per sprint:

$$ B_{\text{team}} = \sum_{\text{sprint}} \Delta\text{CO}_2^{\text{target}}, \quad \text{defer} = \text{defer} \;\wedge\; B_{\text{remaining}} > 0 $$

Teams that overspend their budget get more aggressive deferral; teams under budget can release immediately with a green-now override.

Spec-to-test generation. Stage 1 already produces acceptance criteria. The natural next step is generating skeleton test files from those criteria before any code is written — closing the TDD loop automatically and giving Stage 2 something real to evaluate coverage against.

GitLab Duo Chat integration. The pipeline today is triggered by a REST call or the OS frontend. The right UX is /devloop run as a Duo Chat slash command — so engineers can kick off a full pipeline without leaving their MR tab.
