AI Ops Factory

Inspiration

Modern AI systems break across multiple layers at once: latency spikes, GPU pressure, serving instability, alert storms, and job queue failures. We were inspired by how hard it is for operators to connect these signals quickly during an incident. Synapse AI Ops was built to turn fragmented telemetry into one clear, explainable diagnosis workflow.

What it does

Synapse AI Ops is a multi-agent AIOps web app that lets users ask natural-language operational questions and get:

an executive summary,
evidence-backed findings from specialist agents,
visual insights (line/bar/area/pie),
and transparent traces showing what was queried and why.

It correlates signals across inference, node metrics, serving health, alerts/logs, and job queues to help teams prioritize root-cause investigation faster.
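To make the four outputs above concrete, here is a minimal sketch of what a single chat response could look like as a data structure. The class names, field names, agent label, and the sample values are all hypothetical and chosen for illustration; they are not taken from the project's code.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Finding:
    agent: str       # hypothetical specialist name, e.g. "inference"
    claim: str       # one-sentence finding
    evidence: dict   # metric name -> value backing the claim


@dataclass
class TraceStep:
    agent: str
    query: str       # what was queried
    reason: str      # why it was queried


@dataclass
class ChatResponse:
    executive_summary: str
    findings: list = field(default_factory=list)
    charts: list = field(default_factory=list)
    traces: list = field(default_factory=list)


# Illustrative values only, not real measurements.
response = ChatResponse(
    executive_summary="Latency rose while GPU memory was saturated.",
    findings=[Finding("inference", "p95 latency rose", {"p95_latency_ms": 840.0})],
    charts=[{"type": "line", "x": "ts", "y": "p95_latency_ms"}],
    traces=[TraceStep("inference", "SELECT ts, p95_latency_ms FROM inference", "question mentions latency")],
)

# asdict() recursively converts nested dataclasses, giving a JSON-ready payload.
payload = asdict(response)
```

Keeping traces next to findings in the same payload is one way to let the frontend show what was queried and why alongside each conclusion.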

How we built it

We built a full-stack system with:

FastAPI backend for /chat orchestration,
LangGraph orchestrator for agent routing and synthesis,
DuckDB-backed specialist agents for deterministic metric extraction,
LiteLLM + Gemini/OpenAI support for LLM-first planning and summaries,
Next.js frontend for chat UX, diagnostics, and chart rendering,
Recharts for safe chart visualization,
and an offline eval harness for quality checks.

We used schema validation and deterministic fallbacks to keep output reliable when LLM responses are imperfect.
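As a rough illustration of schema validation with a deterministic fallback, the sketch below checks a planner's JSON output and falls back to keyword routing when the output is unusable. The agent names, the keyword table, and the function names are assumptions made for this example, and it uses only the standard library rather than whatever validation layer the project actually uses.

```python
import json

# Hypothetical specialist agent names, based on the signal areas listed above.
KNOWN_AGENTS = {"inference", "node_metrics", "serving", "alerts", "job_queue"}

# Hypothetical keyword routing table used only when the LLM plan is unusable.
KEYWORD_ROUTES = {
    "latency": "inference",
    "gpu": "node_metrics",
    "alert": "alerts",
    "queue": "job_queue",
}


def fallback_plan(question):
    """Deterministic routing: match keywords, else run every agent."""
    q = question.lower()
    agents = sorted({agent for kw, agent in KEYWORD_ROUTES.items() if kw in q})
    return {"agents": agents or sorted(KNOWN_AGENTS), "source": "fallback"}


def plan_from_llm(raw, question):
    """Accept the LLM plan only if it parses and names known agents."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback_plan(question)
    agents = data.get("agents") if isinstance(data, dict) else None
    if not isinstance(agents, list) or not agents:
        return fallback_plan(question)
    if not all(isinstance(a, str) and a in KNOWN_AGENTS for a in agents):
        return fallback_plan(question)
    return {"agents": agents, "source": "llm"}
```

With this shape, a malformed or hallucinated plan never reaches the agents; the request still gets an answer, and the "source" field records which path produced the plan.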

Challenges we ran into

Getting consistent structured JSON from LLM planner outputs.
Balancing flexibility with reliability (LLM routing vs deterministic SQL).
Designing chart generation that is useful without being noisy or unsafe.
Keeping responses explainable and operator-friendly, not just metric dumps.
Maintaining trust with traceability while still moving fast on UX improvements.
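The first challenge, inconsistent structured JSON, often shows up as a valid object wrapped in extra chatter. One possible way to handle that is sketched below: scan the text for the first position where a JSON object decodes cleanly. The function name is hypothetical, and this is a generic standard-library technique, not necessarily the salvage logic the project uses.

```python
import json


def salvage_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


noisy = 'Sure, here is the plan: {"agents": ["serving"], "window": "1h"} Hope that helps.'
plan = salvage_json_object(noisy)
```

A salvaged object should still go through schema validation afterwards, since decoding cleanly says nothing about whether the fields are the expected ones.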

Accomplishments that we're proud of

Built a working multi-agent AIOps copilot with real orchestration and traces.
Implemented robust fallback/salvage logic for planner failures.
Added validated chart suggestions (line/bar/area/pie) with safe frontend rendering.
Improved UX with executive summaries, diagnostics, and polished visual design.
Added evaluation workflow and logging to verify behavior and reduce regressions.
Shipped iteratively and pushed production-style improvements quickly.
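For the validated chart suggestions, a minimal sketch of the idea is to accept a suggested chart only if its type is one of the four supported kinds and its axes refer to numeric data that actually exists in the result rows. The spec fields ("type", "x", "y", "title") and the function name are assumptions for illustration, not the project's actual schema.

```python
ALLOWED_TYPES = {"line", "bar", "area", "pie"}


def validate_chart(spec, rows):
    """Return a cleaned chart spec, or None if the suggestion should be dropped."""
    if not isinstance(spec, dict):
        return None
    chart_type = spec.get("type")
    if not isinstance(chart_type, str) or chart_type not in ALLOWED_TYPES:
        return None
    x, y = spec.get("x"), spec.get("y")
    if not isinstance(x, str) or not isinstance(y, str):
        return None
    # Both axis keys must exist in every row that will be plotted.
    if not rows or any(x not in r or y not in r for r in rows):
        return None
    # The y values must be real numbers (bool is excluded on purpose).
    for r in rows:
        if isinstance(r[y], bool) or not isinstance(r[y], (int, float)):
            return None
    # Copy only known fields so nothing unexpected reaches the frontend.
    return {"type": chart_type, "x": x, "y": y, "title": str(spec.get("title", ""))[:80]}


rows = [{"ts": "12:00", "p95_ms": 410.0}, {"ts": "12:05", "p95_ms": 840.0}]  # illustrative data
chart = validate_chart({"type": "line", "x": "ts", "y": "p95_ms", "extra": "ignored"}, rows)
```

Returning a fresh dictionary with only the allowed keys means the rendering layer never sees free-form fields from the model.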

What we learned

Reliability matters more than “magic” in ops workflows.
LLMs are strongest when constrained by deterministic data pipelines.
Transparent diagnostics build user trust faster than opaque automation.
Good incident UX needs both narrative context and hard evidence.
Small schema and fallback decisions have huge impact on real-world usability.

What's next for Synapse AI Ops

Add richer time-window and scenario-scoped filtering.
Improve chart intelligence (better type selection, richer trend detection).
Expand specialist agent library for deeper subsystem coverage.
Add alerting/playbook recommendations and action-oriented remediation suggestions.
Introduce team features (shared investigations, exportable reports, incident timelines).
Move toward deployment-ready observability and guardrailed autonomous investigations.

Built With

fast-api
lang-graph
python
typescript

Updates

Aashrith Devulapally posted an update — May 17, 2026 12:58 PM EDT

RackVision AI is an intelligent observability copilot built for managing high-performance GPU cluster infrastructure. Designed for modern "AI Factories," the platform acts as a visual control tower that cross-references live telemetry layers such as VRAM utilization, RDMA network latency, and critical hardware faults directly with active engineering runbooks. When complex cluster failures or placement fragmentations occur, RackVision AI bypasses raw log digging to instantly diagnose root causes. It automatically generates precise, data-grounded engineering recommendations alongside a structured JSON payload, enabling infrastructure teams to optimize workloads, eliminate bottlenecks, and maximize expensive GPU uptime.

Suhaas Teja Vijjagiri started this project — May 17, 2026 12:55 PM EDT
