Inspiration
Devs who know SQL but not Elasticsearch integrate the product for full-text search, then make predictable mistakes: wrong field mappings, over-sharded indices, missing replicas, unoptimized queries. ES is still the right tool for full-text search alongside a SQL primary store, but teams avoid it because it's another operational burden.
I wanted to take that operational load to zero.
What it does
Preflex is an autonomous Elasticsearch ops agent.
It continuously monitors cluster health, detects anomalies (latency spikes, yellow/red health, fielddata pressure, shard overhead), diagnoses root causes via MCP tools, and directly executes fixes: reindexing with corrected mappings, shrinking over-sharded indices, tuning replica counts, and adjusting slow-log thresholds. Every action is logged to Slack in real time. A live dashboard shows cluster vitals, incident cards, and the agent's tool-call timeline as it works.
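Under the hood, each of those fixes maps onto a plain Elasticsearch REST call. The following is an illustrative sketch of how those requests are shaped — the function and type names here are hypothetical, not Preflex internals:

```typescript
// Illustrative: each remediation is just a well-formed ES REST request.
type EsRequest = { method: "POST" | "PUT"; path: string; body: unknown };

// Reindex into a new index whose mapping has been corrected.
function buildReindexRequest(source: string, dest: string): EsRequest {
  return {
    method: "POST",
    path: "/_reindex",
    body: { source: { index: source }, dest: { index: dest } },
  };
}

// Shrink an over-sharded index down to fewer primary shards.
function buildShrinkRequest(index: string, target: string, shards: number): EsRequest {
  return {
    method: "POST",
    path: `/${index}/_shrink/${target}`,
    body: { settings: { "index.number_of_shards": shards } },
  };
}

// Tune replica counts or slow-log thresholds via the settings API.
function buildSettingsRequest(index: string, settings: Record<string, unknown>): EsRequest {
  return { method: "PUT", path: `/${index}/_settings`, body: settings };
}
```

Note that in a real cluster `_shrink` also requires the source index to be made read-only and its shards relocated to one node first, which is part of what makes automating these fixes fiddly.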
Preflex defends at multiple levels: it also ships a GitHub bot that scans PRs for poorly written ES queries, or relational-DB queries that should probably be mirrored into ES and queried there. It automatically opens a follow-up PR with the corrections, which the developer can then review and merge.
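To make this concrete, here is a minimal sketch of the kind of lint pass such a bot could run over a PR's added lines — the smell names and patterns are illustrative assumptions, not the bot's actual rules:

```typescript
// Hypothetical sketch: flag added diff lines containing common ES query smells.
const SMELLS: Array<{ name: string; pattern: RegExp }> = [
  // A quoted term starting with "*" forces a full term-dictionary scan.
  { name: "leading-wildcard", pattern: /"\*\w/ },
  // Large "from" offsets mean expensive deep pagination.
  { name: "deep-pagination", pattern: /"from"\s*:\s*\d{4,}/ },
];

function scanDiff(addedLines: string[]): string[] {
  const findings: string[] = [];
  for (const line of addedLines) {
    for (const smell of SMELLS) {
      if (smell.pattern.test(line)) findings.push(smell.name);
    }
  }
  return findings;
}
```

In practice a regex pass like this only triages; the agent then rewrites the flagged query with full cluster context.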
How we built it
Elastic Agent powers the core reasoning loop, but we've heavily enhanced it, both in its tools and in when we choose to invoke it.
The core is a TypeScript MCP server exposing 11 read tools (cluster_health, index_mapping, query_profile, allocation_explain, etc.) and 6 write tools (reindex, shrink_index, update_settings, manage_aliases, cancel_task) with an optional human-in-the-loop confirmation mode.
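The read/write split plus the confirmation mode can be sketched as a small tool registry; this is a simplified stand-in (synchronous, no MCP SDK) for how the real server gates write tools, and all names here are illustrative:

```typescript
// Sketch: write tools are gated behind an optional confirmation callback,
// read tools always execute. Not the actual Preflex server code.
type ToolHandler = (args: Record<string, unknown>) => string;

interface Tool {
  name: string;
  kind: "read" | "write";
  handler: ToolHandler;
}

class ToolRegistry {
  private tools = new Map<string, Tool>();

  constructor(
    // When set, every write tool asks this callback before executing.
    private confirm?: (toolName: string, args: unknown) => boolean,
  ) {}

  register(tool: Tool): void {
    this.tools.set(tool.name, tool);
  }

  call(name: string, args: Record<string, unknown>): string {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    if (tool.kind === "write" && this.confirm && !this.confirm(name, args)) {
      return `declined: ${name} was not executed`;
    }
    return tool.handler(args);
  }
}
```

Running with no `confirm` callback gives the fully autonomous mode; passing one gives human-in-the-loop.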
A streaming monitor polls ES every 10s, builds statistical baselines, and fires anomaly rules using 2-sigma thresholds with per-anomaly cooldown locks.
There are many automated SREs out there; Preflex is verified and tested on a set of real evals: bash scripts that spin up isolated Docker clusters (ES 9.3 + Kibana + MCP server) with deliberately broken configurations. The evals cover four main scenarios — bad-mapping (wrong field types), over-sharded (10 shards for 500 docs), slow-queries (leading wildcards, deep pagination, unbounded aggs), and bad-replicas (replicas=3 on a single node) — each with setup scripts, 1000-doc datasets, and multi-phase query bombardment (warmup → analyst → dashboard loads at increasing QPS). The harness then scores the agent's diagnosis against my handwritten notes of what it should have done.
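The 2-sigma rule with a cooldown lock, as described for the monitor above, can be sketched like this (the class name and defaults are illustrative, not the monitor's real internals):

```typescript
// Sketch: fire when a value exceeds mean + 2*sigma of the running baseline,
// then hold a per-anomaly cooldown so the same incident doesn't re-fire.
class AnomalyRule {
  private samples: number[] = [];
  private lastFiredAt = -Infinity;

  constructor(
    private readonly name: string,
    private readonly cooldownMs: number,
    private readonly minSamples = 10,
  ) {}

  // Feed one observation; returns true if the rule fires.
  observe(value: number, nowMs: number): boolean {
    const fired =
      this.isAnomalous(value) && nowMs - this.lastFiredAt >= this.cooldownMs;
    this.samples.push(value);
    if (fired) this.lastFiredAt = nowMs;
    return fired;
  }

  private isAnomalous(value: number): boolean {
    if (this.samples.length < this.minSamples) return false; // baseline not ready
    const mean = this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
    const variance =
      this.samples.reduce((a, b) => a + (b - mean) ** 2, 0) / this.samples.length;
    return value > mean + 2 * Math.sqrt(variance);
  }
}
```

The cooldown matters more than it looks: without it, a sustained latency plateau would re-trigger the agent every poll and burn tokens on the same incident.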
The GitHub bot uses GitHub webhooks as a trigger, then the Claude agent framework (essentially Claude Code in the cloud) with the ES agent as a tool, for expertise in rewriting the queries.
Challenges we ran into
Token cost is brutal when you cron the Elastic Agent. A single incident investigation can burn $10+ in API calls as the agent chains diagnostic tool calls. This made iteration expensive and testing difficult. Tool-calling reliability also drops sharply on models below SOTA; I ended up on gemini-3-flash, the cheapest model that still handled consistent multi-step diagnose-and-fix chains.
Agent Builder-specific feedback
Plugging our custom MCP server directly into Agent Builder gave the agent access to both diagnostic and write tools (reindex, update_settings, manage_aliases) without any glue code. The built-in platform tools like list_indices, get_index_mapping, and execute_esql meant we didn't have to implement basic cluster introspection ourselves, which saved a lot of time. ES|QL tools were a particularly clean primitive — we used them to give the agent parameterized diagnostic queries baked in at the tool level rather than hardcoded in the prompt.
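As an illustration of what "parameterized diagnostic queries baked in at the tool level" means, a tool can render an ES|QL query from a couple of inputs rather than trusting the prompt to write it. The query shape, field names, and function below are assumptions for the example, not Preflex's actual tool definitions:

```typescript
// Hypothetical ES|QL diagnostic baked into a tool: find the indices with the
// most slow queries. Only indexPattern and thresholdMs are agent-controlled.
function slowQueryDiagnostic(indexPattern: string, thresholdMs: number): string {
  return [
    `FROM ${indexPattern}`,
    `| WHERE duration_ms > ${thresholdMs}`,
    `| STATS slow_count = COUNT(*), p95 = PERCENTILE(duration_ms, 95) BY index_name`,
    `| SORT slow_count DESC`,
    `| LIMIT 10`,
  ].join("\n");
}
```

Keeping the pipeline fixed and exposing only the parameters makes the tool's behavior predictable regardless of which model is driving it.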
The main challenge we ran into was MCP tool import reliability: _bulk_create_mcp wouldn't always register all tools in a single call, so we had to build a retry loop that checked created + skipped == total before trusting the agent was fully set up.
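That retry loop is simple but worth showing; here is a minimal sketch, where `bulkCreate` stands in for the real `_bulk_create_mcp` call and the result shape is assumed from the `created`/`skipped` counts mentioned above:

```typescript
// Sketch: re-run the MCP tool import until every tool is accounted for,
// either freshly created or skipped because a prior attempt registered it.
interface BulkCreateResult {
  created: number;
  skipped: number;
}

async function importAllTools(
  bulkCreate: () => Promise<BulkCreateResult>, // stand-in for _bulk_create_mcp
  total: number,
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { created, skipped } = await bulkCreate();
    // Only trust the agent setup once created + skipped === total.
    if (created + skipped === total) return;
  }
  throw new Error(`tool import incomplete after ${maxAttempts} attempts`);
}
```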
Built With
- bash
- docker
- elastic
- typescript
