AI Ops Factory
Inspiration
Modern AI systems break across multiple layers at once: latency spikes, GPU pressure, serving instability, alert storms, and job queue failures. We were inspired by how hard it is for operators to connect these signals quickly during incidents. Synapse AI Ops was built to turn fragmented telemetry into one clear, explainable diagnosis workflow.
What it does
Synapse AI Ops is a multi-agent AIOps web app that lets users ask natural-language operational questions and get:
- an executive summary,
- evidence-backed findings from specialist agents,
- visual insights (line/bar/area/pie),
- and transparent traces showing what was queried and why.
It correlates signals across inference, node metrics, serving health, alerts/logs, and job queues to help teams prioritize root-cause investigation faster.
How we built it
We built a full-stack system with:
- FastAPI backend for /chat orchestration,
- LangGraph orchestrator for agent routing and synthesis,
- DuckDB-backed specialist agents for deterministic metric extraction,
- LiteLLM + Gemini/OpenAI support for LLM-first planning and summaries,
- Next.js frontend for chat UX, diagnostics, and chart rendering,
- Recharts for safe chart visualization,
- and an offline eval harness for quality checks.
We used schema validation and deterministic fallbacks to keep output reliable when LLM responses are imperfect.
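To make the validation-plus-fallback idea concrete, here is a minimal sketch of how an LLM planner's output could be checked before routing. It is an illustration, not the project's actual code: the agent names, the keyword map, and the `{"agents": [...]}` plan shape are all assumptions.

```python
import json
from typing import List, Optional

# Hypothetical agent names, loosely mirroring the signal areas listed above.
ALLOWED_AGENTS = {"inference", "node_metrics", "serving", "alerts", "job_queue"}

# Hypothetical keyword routes, used only when the LLM plan is unusable.
KEYWORD_ROUTES = {
    "latency": "inference",
    "gpu": "node_metrics",
    "queue": "job_queue",
    "alert": "alerts",
}


def validate_plan(raw: str) -> Optional[List[str]]:
    """Return the agent list if `raw` is a well-formed plan, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    agents = data.get("agents")
    if not isinstance(agents, list) or not agents:
        return None
    # Reject the whole plan if any entry is not a known agent name.
    if not all(isinstance(a, str) and a in ALLOWED_AGENTS for a in agents):
        return None
    return agents


def fallback_plan(question: str) -> List[str]:
    """Deterministic routing by keyword; query every agent if nothing matches."""
    q = question.lower()
    picked = [agent for keyword, agent in KEYWORD_ROUTES.items() if keyword in q]
    return picked or sorted(ALLOWED_AGENTS)


def plan_agents(question: str, llm_output: str) -> List[str]:
    """Prefer the validated LLM plan, otherwise use the deterministic fallback."""
    return validate_plan(llm_output) or fallback_plan(question)
```

With this shape, a malformed or hallucinated plan never reaches the agents; the worst case is a broader but still deterministic set of queries.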
Challenges we ran into
- Getting consistent structured JSON from LLM planner outputs.
- Balancing flexibility with reliability (LLM routing vs deterministic SQL).
- Designing chart generation that is useful without being noisy or unsafe.
- Keeping responses explainable and operator-friendly, not just metric dumps.
- Maintaining trust with traceability while still moving fast on UX improvements.
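One common way to handle the structured-JSON challenge is to salvage the first JSON object from a reply that wraps it in prose or a code fence. The sketch below uses only the Python standard library; the function name `salvage_json_object` is hypothetical and this is not necessarily how the project does it.

```python
import json
from typing import Optional


def salvage_json_object(text: str) -> Optional[dict]:
    """Return the first parseable JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Try the next opening brace.
        start = text.find("{", start + 1)
    return None
```

A salvaged object would still need schema validation before use, since parsing successfully says nothing about whether the fields are the expected ones.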
Accomplishments that we're proud of
- Built a working multi-agent AIOps copilot with real orchestration and traces.
- Implemented robust fallback/salvage logic for planner failures.
- Added validated chart suggestions (line/bar/area/pie) with safe frontend rendering.
- Improved UX with executive summaries, diagnostics, and polished visual design.
- Added evaluation workflow and logging to verify behavior and reduce regressions.
- Shipped iteratively and pushed production-style improvements quickly.
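As an illustration of what validating a chart suggestion might involve, the sketch below checks a suggested spec against the rows it would plot before anything is sent to the frontend. The spec fields (`type`, `x`, `y`) and the function name are assumptions made for this example; only the four chart types come from the list above.

```python
from typing import Any, Dict, List

# The four chart types named above.
ALLOWED_CHART_TYPES = {"line", "bar", "area", "pie"}


def is_valid_chart(spec: Dict[str, Any], rows: List[Dict[str, Any]]) -> bool:
    """Accept a chart spec only if its type is allowed and its keys match the data."""
    chart_type = spec.get("type")
    if not isinstance(chart_type, str) or chart_type not in ALLOWED_CHART_TYPES:
        return False
    x_key, y_key = spec.get("x"), spec.get("y")
    if not isinstance(x_key, str) or not isinstance(y_key, str):
        return False
    if not rows:
        return False  # nothing to plot
    for row in rows:
        if x_key not in row or y_key not in row:
            return False
        value = row[y_key]
        # The y value must be numeric; bool is excluded even though it subclasses int.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
    return True
```

Specs that fail the check can simply be dropped, so the response still carries its summary and findings without a misleading chart.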
What we learned
- Reliability matters more than “magic” in ops workflows.
- LLMs are strongest when constrained by deterministic data pipelines.
- Transparent diagnostics build user trust faster than opaque automation.
- Good incident UX needs both narrative context and hard evidence.
- Small schema and fallback decisions have huge impact on real-world usability.
What's next for Synapse AI Ops
- Add richer time-window and scenario-scoped filtering.
- Improve chart intelligence (better type selection, richer trend detection).
- Expand specialist agent library for deeper subsystem coverage.
- Add alerting/playbook recommendations and action-oriented remediation suggestions.
- Introduce team features (shared investigations, exportable reports, incident timelines).
- Move toward deployment-ready observability and guardrailed autonomous investigations.
Built With
- fast-api
- lang-graph
- python
- typescript

