Inspiration

Every organization has an invisible bottleneck that nobody talks about — the L1/L2 support team.

These are the people who get paged at 2am when a service goes down. They manually check dashboards, scan hundreds of log lines, decide if something is a P1 or a P3, write up an incident report, create a GitLab issue, document a Root Cause Analysis, and notify the team — all before they can even start fixing the actual problem.

I've seen this pain firsthand. Support engineers spend 60-70% of their time on repetitive triage work that follows the same pattern every single time.

The pattern never changes. Only the specific error changes. And that's exactly the kind of repetitive, pattern-matching work that AI agents should handle.

That's what inspired OPERO: the Operational Incident Response & Orchestration Agent.

Not a chatbot that answers questions. An agent that reacts, triages, and acts the moment something goes wrong.

What it does

OPERO is an AI-powered incident monitoring and triage agent built on the GitLab Duo Agent Platform.

It watches your services 24/7 and the moment something fails:

  1. Detects — monitors API endpoints, CPU usage, memory, DB connections every 60 seconds
  2. Classifies — identifies the failure type from 15+ known patterns (service down, DB timeout, OOM, auth failure, etc.)
  3. Predicts severity — assigns P1 (Critical) to P4 (Low) with reasoning
  4. Emails the team — sends a beautiful HTML alert to L1/L2 engineers instantly
  5. Creates a GitLab issue — pre-triaged, labelled, and ready to assign
  6. Generates an RCA report — auto-publishes a Root Cause Analysis as a GitLab Wiki page, with a direct link in the email
  7. Shows everything live — real-time web dashboard at localhost:5000
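
As an illustration of the Detects step, here is a minimal sketch of a threshold check run against one monitoring snapshot. The HealthSample type, the detect_failures function, and all three limits are hypothetical and chosen only for this example; they are not taken from sanity_runner.py.

```python
from dataclasses import dataclass

# Hypothetical limits, chosen only for illustration.
CPU_LIMIT_PERCENT = 90.0
MEMORY_LIMIT_PERCENT = 90.0
DB_CONNECTION_LIMIT = 100


@dataclass
class HealthSample:
    """One snapshot of a service, taken once per check interval."""
    api_status: int        # HTTP status code returned by the health endpoint
    cpu_percent: float
    memory_percent: float
    db_connections: int


def detect_failures(sample: HealthSample) -> list[str]:
    """Return one human-readable finding for every check the sample fails."""
    findings = []
    if sample.api_status >= 500:
        findings.append(f"API endpoint returned HTTP {sample.api_status}")
    if sample.cpu_percent > CPU_LIMIT_PERCENT:
        findings.append(f"CPU usage at {sample.cpu_percent:.0f}%")
    if sample.memory_percent > MEMORY_LIMIT_PERCENT:
        findings.append(f"Memory usage at {sample.memory_percent:.0f}%")
    if sample.db_connections > DB_CONNECTION_LIMIT:
        findings.append(f"{sample.db_connections} open DB connections")
    return findings
```

An empty list means the snapshot is healthy; any findings would be handed to the classification step as text.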

OPERO also lives inside GitLab Duo Chat as a public agent in the AI Catalog. Paste any log or error message and get an instant structured triage response powered by Anthropic Claude.

How I built it

OPERO is built in three layers that work together:

Layer 1 — GitLab Duo Agent (agent.yml)

A custom public agent published in the GitLab AI Catalog. Uses Anthropic Claude via GitLab Duo to analyze any log or error message and respond with a structured triage report — failure type, severity P1-P4, fix steps, triage summary, and whether the incident was preventable.

Layer 2 — Python automation engine

The core automation that runs locally and connects everything:

  • sanity_runner.py — runs API health checks, CPU/memory/DB monitoring every 60 seconds
  • github_log_watcher.py — watches facebook/create-react-app GitHub Actions for real failed CI runs
  • opero_runner.py — the rules engine that reads rules.yml and classifies failures by matching keywords
  • rules/rules.yml — 15+ YAML-defined patterns covering Java/Spring Boot, Python, PostgreSQL, MySQL, CI/CD errors
  • dashboard.py — Flask web server showing a live dashboard at localhost:5000
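
To make the keyword-matching idea behind the rules engine concrete, here is a small sketch. The RULES list stands in for the parsed contents of rules/rules.yml, and its field names (name, severity, keywords), the sample rules, and the P4 default for unmatched logs are all assumptions for illustration, not the project's actual schema.

```python
# Stand-in for the parsed rules file. Field names and entries are
# assumptions made for this example only.
RULES = [
    {"name": "db_timeout", "severity": "P2",
     "keywords": ["connection timed out", "could not connect to server"]},
    {"name": "out_of_memory", "severity": "P1",
     "keywords": ["outofmemoryerror", "out of memory"]},
    {"name": "auth_failure", "severity": "P3",
     "keywords": ["401 unauthorized", "invalid token"]},
]


def classify(log_text, rules=RULES):
    """Pick the rule whose keywords appear most often in the log text."""
    text = log_text.lower()  # keywords are stored lowercase
    best_rule, best_hits = None, 0
    for rule in rules:
        hits = sum(1 for keyword in rule["keywords"] if keyword in text)
        if hits > best_hits:
            best_rule, best_hits = rule, hits
    if best_rule is None:
        # No keyword matched at all; the P4 default is an assumption.
        return {"failure_type": "unknown", "severity": "P4"}
    return {"failure_type": best_rule["name"], "severity": best_rule["severity"]}
```

For example, classify("java.lang.OutOfMemoryError: Java heap space") selects the out_of_memory rule, while a log line that contains none of the keywords falls through to "unknown".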

Layer 3 — GitLab Flow (flow.yml)

A GitLab Duo Flow that auto-triggers when a pipeline fails, runs the OPERO agent, and creates a triage issue — no human needed, no Python script required.

Challenges I ran into

GitLab flow schema kept changing — The flow.yml schema required multiple iterations. The validator kept throwing errors about missing definition, flow, and required properties. Eventually figured out that name, description, and public must be at the root level, with everything else nested under definition.
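
A quick local check can catch this layout mistake before the validator does. The sketch below only encodes the rule described above (name, description, and public at the root, everything else under definition); the function name check_flow_layout is hypothetical, and this is not GitLab's validator or its full schema.

```python
# Root-level keys as described above; anything else is expected to be
# nested under "definition". This is not the complete GitLab schema.
ROOT_KEYS = {"name", "description", "public", "definition"}


def check_flow_layout(flow: dict) -> list[str]:
    """Return a list of layout problems for an already-parsed flow.yml."""
    problems = []
    present = set(flow)
    for key in sorted(ROOT_KEYS - present):
        problems.append(f"missing root-level key: {key}")
    for key in sorted(present - ROOT_KEYS):
        problems.append(f"unexpected root-level key: {key} (move it under definition)")
    if "definition" in flow and not isinstance(flow["definition"], dict):
        problems.append("definition must be a mapping")
    return problems
```

An empty result means the top-level layout matches the rule; it says nothing about whether the contents of definition are valid.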

GitHub API connection drops — The GitHub Actions API would sometimes drop connections mid-request. Added timeout=30 and wrapped everything in try/except so the watcher never crashes.
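
The timeout-plus-try/except approach could look roughly like the sketch below. It uses the standard library's urllib rather than whatever HTTP client the watcher actually uses, and the fetch_json name and the injectable opener parameter are assumptions added so the function can be exercised without network access.

```python
import http.client
import json
import urllib.request


def fetch_json(url, timeout=30, opener=urllib.request.urlopen):
    """Return the decoded JSON body, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, http.client.HTTPException, ValueError) as exc:
        # OSError covers timeouts, resets, and urllib's URLError;
        # ValueError covers a body that is not valid UTF-8 JSON.
        # The failure is reported and skipped so a polling loop can
        # simply try again on its next cycle.
        print(f"request to {url} failed: {exc}")
        return None
```

Returning None instead of raising keeps the caller's loop simple: it checks for None, skips the cycle, and carries on.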

rules.yml matching too broadly — Early versions of the rules engine matched "unknown" for too many errors because the keywords weren't specific enough. Refined the keyword lists and added context-enriched text to improve matching accuracy.

Accomplishments that I'm proud of

  • OPERO is live in the GitLab AI Catalog as a public agent — anyone can use it right now
  • End-to-end automation works — a service failure goes from detection to email + issue + RCA in under 5 seconds, with zero human action
  • Monitors real CI/CD failures — pulling actual failed runs from facebook/create-react-app on GitHub in production

What I learned

  • How to build and publish custom agents on the GitLab Duo Agent Platform
  • How to create GitLab Flows with the correct YAML schema — definition, components, prompts, routers
  • How to use the GitLab REST API for automated issue creation, wiki management, and catalog integration
  • How Anthropic Claude powers GitLab Duo agents under the hood
  • How AI agents can genuinely remove friction from developer workflows: they no longer just answer questions, but take real action on behalf of the team
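
For the automated issue creation mentioned above, a request to the GitLab REST API's create-issue endpoint (POST /projects/:id/issues) can be assembled as in this sketch. The function name build_issue_request, the example host, and the label values are hypothetical; the sketch only builds the request and leaves sending it to the caller.

```python
import json
import urllib.parse
import urllib.request


def build_issue_request(base_url, project_id, token, title, description, labels):
    """Build (but do not send) a request that creates a GitLab issue."""
    # A project path such as "group/project" must be URL-encoded.
    project = urllib.parse.quote(str(project_id), safe="")
    url = f"{base_url}/api/v4/projects/{project}/issues"
    payload = {
        "title": title,
        "description": description,
        "labels": ",".join(labels),  # the API takes a comma-separated string
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen with a timeout would submit it; keeping the build step separate makes the payload easy to inspect before anything is sent.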

What's next for OPERO

  • GitHub Actions CI/CD monitoring: I have already built a working prototype (github_log_watcher.py) that watches real open source projects on GitHub for failed CI runs, pulls actual job logs via the GitHub Actions API, triages them with OPERO, and auto-creates GitLab issues with RCA reports. It currently monitors facebook/create-react-app in real time and is ready to be extended to any public or private GitHub repository.
  • Teams integration: alert engineers directly in their chat channels
  • Auto-assignment: route issues to the right team/group DL based on failure type and service ownership
  • Similar issue tracking: automatically escalate incidents that have occurred before
  • Anomaly detection: detect unusual patterns before they become incidents
