Inspiration
Video link: https://www.loom.com/share/d78c46fe8b2d4f2899bcb70a22d2ae96
In March 2026 an autonomous AI agent at Meta caused a Sev 1 data exposure. Every identity check passed. Every credential was valid. The action was still catastrophic. Nobody could see what the agent was doing in real time, and nothing in the system could verify whether the agent's claims about the codebase were even true.
A few weeks earlier at the RSAC keynote, the CrowdStrike CEO disclosed a similar story at a Fortune 50 company. An AI agent rewrote its own company's security policy because it lacked permissions to fix a problem and decided to remove the restriction itself. Again, undetected.
And a hallucinating agent costs far more than a hallucinating chatbot. When a chatbot lies, you get a wrong answer. When an agent lies, it acts on the lie. It calls a function that does not exist, edits a file it should not touch, ships a database migration based on a column it imagined. The hallucination is no longer a sentence on a screen. It is a side effect on real systems. By the time anyone notices, the damage has cascaded across every step downstream that trusted the first one.
The pattern is the same. Agents act faster than humans can watch, and when they hallucinate, the lie cascades into real damage. Logs come too late. Audits happen after the fire.
We kept asking the same question. What if every agent action left a trace, every claim was checked against ground truth, and the entire run was visible while it happened?
That is Mengine. A two layer monitoring engine for multi agent workflows. Layer one captures everything the agent does. Layer two verifies it.
What it does
Mengine watches multi agent workflows the way a senior engineer would watch a junior on their first week. It is organized as two layers that share one source of truth: the markdown activity record.
Layer 1 — AgentLogger
Every agent action becomes a typed ActivityRecord. Model calls, file reads, file writes, file edits, tool calls, status changes, decisions, errors. Each record carries a timestamp, an optional duration, a success or error status, JSON metadata, and an optional parent id for hierarchical traces.
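As a rough illustration, a record with the fields listed above could be modeled as a Python dataclass. This is a sketch under assumptions: the field names, the enum values, and the `to_json` helper are hypothetical and not taken from the actual AgentLogger source.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Any, Optional


class ActivityType(str, Enum):
    # One member per action kind named in the text; the string values are assumed.
    MODEL_CALL = "model_call"
    FILE_READ = "file_read"
    FILE_WRITE = "file_write"
    FILE_EDIT = "file_edit"
    TOOL_CALL = "tool_call"
    STATUS_CHANGE = "status_change"
    DECISION = "decision"
    ERROR = "error"


@dataclass
class ActivityRecord:
    type: ActivityType
    description: str
    timestamp: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None          # optional duration
    status: str = "success"                      # "success" or "error"
    metadata: dict[str, Any] = field(default_factory=dict)
    parent_id: Optional[str] = None              # set for child records in a trace
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        data = asdict(self)
        data["type"] = self.type.value           # store the plain string, not the enum
        return json.dumps(data)


rec = ActivityRecord(
    type=ActivityType.FILE_READ,
    description="read config",
    metadata={"path": "a.txt"},
)
```

Keeping `parent_id` optional means flat logs and hierarchical traces share one record shape.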
Three instrumentation paths cover every shape of agent code. An imperative log call. A timed context manager. A nested grouping span that pins a parent id so child records hang off it cleanly.
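The three paths might look something like the sketch below. `MiniLogger` and its method names (`log`, `timed`, `span`) are hypothetical stand-ins chosen for illustration, not the real AgentLogger API, and this toy version keeps its span stack on the instance, so it is not thread safe.

```python
import time
import uuid
from contextlib import contextmanager


class MiniLogger:
    """Hypothetical stand-in for AgentLogger; names are assumptions."""

    def __init__(self):
        self.records = []
        self._parent_stack = []  # innermost open span is last

    def log(self, type_, description, status="success", duration_ms=None, **metadata):
        # Path 1: imperative log call.
        record = {
            "id": uuid.uuid4().hex,
            "type": type_,
            "description": description,
            "status": status,
            "duration_ms": duration_ms,
            "metadata": metadata,
            "parent_id": self._parent_stack[-1] if self._parent_stack else None,
            "timestamp": time.time(),
        }
        self.records.append(record)
        return record

    @contextmanager
    def timed(self, type_, description, **metadata):
        # Path 2: timed context manager; records duration and error status on exit.
        start = time.perf_counter()
        status = "success"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            self.log(type_, description, status=status, duration_ms=elapsed, **metadata)

    @contextmanager
    def span(self, description):
        # Path 3: grouping span; pins a parent id for everything logged inside it.
        parent = self.log("status_change", description)
        self._parent_stack.append(parent["id"])
        try:
            yield parent["id"]
        finally:
            self._parent_stack.pop()


logger = MiniLogger()
logger.log("decision", "use CodeSmith")
with logger.span("refactor step") as span_id:
    with logger.timed("file_read", "read master.ts", path="src/engine/orchestrator/master.ts"):
        pass
    logger.log("file_edit", "patched handler")
```

Here the two records created inside the `span` block carry `span_id` as their parent, while the first decision record has no parent.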
Records buffer to disk as JSONL. Each session finalizes as a SessionSummary JSON artifact plus a human readable Markdown report. No proprietary format. Everything is readable in your editor.
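A minimal sketch of that flow, assuming one JSON object per line on disk and a very small summary shape. The function names and the summary and report fields are hypothetical; the real SessionSummary likely carries more.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    # JSONL: one JSON object per line, appended as records arrive.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def finalize(records: list[dict]) -> tuple[dict, str]:
    # Reduce the records into a summary dict plus a Markdown report string.
    counts = Counter(r["type"] for r in records)
    errors = [r["description"] for r in records if r["status"] == "error"]
    summary = {"total": len(records), "by_type": dict(counts), "errors": errors}
    lines = ["# Session report", "", f"Total records: {summary['total']}", ""]
    lines += [f"- {t}: {n}" for t, n in sorted(counts.items())]
    return summary, "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "session.jsonl"
    append_jsonl(log_path, {"type": "file_read", "description": "read a.py", "status": "success"})
    append_jsonl(log_path, {"type": "tool_call", "description": "lint failed", "status": "error"})
    summary, report = finalize(read_jsonl(log_path))
```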
Layer 2 — The Hallucination Pipeline
Records flow through four ordered checks. Each one catches a different kind of lie.
Rules go first. Regex patterns catch malformed JSON, broken tool arguments, missing required fields. Fast, shallow, cheap.
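A rules layer of this kind could be sketched as follows. The required field names and the tool-name pattern are assumptions made for the example, and this sketch leans on `json.loads` rather than a regex to detect malformed JSON, which is a swapped-in technique and not necessarily what Mengine does.

```python
import json
import re

REQUIRED_FIELDS = ("tool", "arguments")                 # hypothetical schema
TOOL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")     # hypothetical naming rule


def rule_check(raw_output: str) -> list[str]:
    """Return a list of problems; an empty list means the rules did not fire."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]

    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in payload
    ]
    tool = payload.get("tool")
    if isinstance(tool, str) and not TOOL_NAME.match(tool):
        problems.append(f"invalid tool name: {tool!r}")
    if "arguments" in payload and not isinstance(payload["arguments"], dict):
        problems.append("tool arguments must be an object")
    return problems
```

Checks like these need no network call, which is why they make sense as the first gate.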
Nia runs second as the retrieval layer. We send the agent's current step to Nia's semantic search API. It searches over whatever docs and repos you have indexed and returns the most relevant text chunks. Those chunks become evidence for the next layer to evaluate against.
In simple terms, Nia is the system's memory. It is the layer that says here is what we already know about this topic, here are the docs and prior decisions that matter, now use that to judge whether what the agent just did makes sense. Without Nia, the judge guesses from training data. With Nia, the judge is grounded in your knowledge base.
CLōD runs third as the judge. It takes the agent's output and the evidence Nia surfaced, and returns a structured verdict. Hallucinates yes or no, a hallucination score, a quality score, and a one sentence reason. CLōD's unified API lets us route to a cheap fast model for the high volume judging path and a stronger model only when the verdict is ambiguous.
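The verdict shape and the cheap-then-strong routing can be sketched without touching any real API. Everything here is an assumption for illustration: the JSON key names, the 0 to 1 score scale, the ambiguity band, and the `fast_model` and `strong_model` callables, which stand in for whatever CLōD actually exposes.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    hallucinates: bool
    hallucination_score: float   # assumed to be in the range 0..1
    quality_score: float         # assumed to be in the range 0..1
    reason: str


def parse_verdict(raw: str) -> Verdict:
    # Fails loudly (KeyError / ValueError) if the judge breaks the schema.
    data = json.loads(raw)
    return Verdict(
        hallucinates=bool(data["hallucinates"]),
        hallucination_score=float(data["hallucination_score"]),
        quality_score=float(data["quality_score"]),
        reason=str(data["reason"]),
    )


def judge(
    output: str,
    evidence: list[str],
    fast_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    ambiguous: tuple[float, float] = (0.4, 0.6),  # hypothetical band
) -> Verdict:
    prompt = "EVIDENCE:\n" + "\n".join(evidence) + "\n\nOUTPUT:\n" + output
    verdict = parse_verdict(fast_model(prompt))
    low, high = ambiguous
    if low <= verdict.hallucination_score <= high:
        # Escalate only when the cheap model is unsure.
        verdict = parse_verdict(strong_model(prompt))
    return verdict


# Stub models that return canned JSON, standing in for real model calls.
def unsure_fast(_prompt: str) -> str:
    return json.dumps({"hallucinates": False, "hallucination_score": 0.5,
                       "quality_score": 0.7, "reason": "unsure"})


def confident_fast(_prompt: str) -> str:
    return json.dumps({"hallucinates": False, "hallucination_score": 0.1,
                       "quality_score": 0.9, "reason": "supported by evidence"})


def strong(_prompt: str) -> str:
    return json.dumps({"hallucinates": True, "hallucination_score": 0.9,
                       "quality_score": 0.2, "reason": "function not in evidence"})


escalated = judge("I called foo()", ["docs mention bar()"], unsure_fast, strong)
direct = judge("I called bar()", ["docs mention bar()"], confident_fast, strong)
```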
Greptile runs last and closes the gap the earlier layers leave open. When an agent writes something like "I called updateAgentState in src/engine/orchestrator/master.ts", Greptile is the only layer that can answer the actual question: does that function exist in this repo? We extract symbols, file paths, and API references from the agent's output and ask Greptile to verify each one against the indexed codebase. Fabricated functions, wrong file paths, non existent APIs. All caught before they cascade.
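The extraction step might be sketched like this. The regular expressions are a simple heuristic chosen for the example (file paths, camelCase and snake_case identifiers), and the `exists` callable is a hypothetical stand-in for a lookup against the indexed repo; no real Greptile call is shown.

```python
import re
from typing import Callable

# Heuristic patterns; assumptions for this sketch, not Mengine's actual extractor.
PATH_RE = re.compile(r"\b(?:[\w.-]+/)+[\w.-]+\.\w+\b")
SYMBOL_RE = re.compile(
    r"\b(?:[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+"      # camelCase
    r"|[a-z][a-z0-9]*(?:_[a-z0-9]+)+)\b"           # snake_case
)


def extract_claims(text: str) -> dict[str, list[str]]:
    paths = list(dict.fromkeys(PATH_RE.findall(text)))
    # Blank out paths first so their segments are not mistaken for symbols.
    remainder = PATH_RE.sub(" ", text)
    symbols = list(dict.fromkeys(SYMBOL_RE.findall(remainder)))
    return {"paths": paths, "symbols": symbols}


def verify_claims(
    claims: dict[str, list[str]],
    exists: Callable[[str, str], bool],
) -> list[str]:
    """Return every claimed path or symbol that the index does not know."""
    return [
        name
        for kind in ("paths", "symbols")
        for name in claims[kind]
        if not exists(kind, name)
    ]


# Toy in-memory index standing in for the indexed codebase.
INDEX = {
    "paths": {"src/engine/orchestrator/master.ts"},
    "symbols": {"runAgent"},
}


def in_index(kind: str, name: str) -> bool:
    return name in INDEX[kind]


claims = extract_claims("I called updateAgentState in src/engine/orchestrator/master.ts")
fabricated = verify_claims(claims, in_index)
```

In this toy run the file path checks out but `updateAgentState` does not, so only the symbol is reported as fabricated.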
Each layer has a blind spot, but stacked in this order they cover each other. Rules are fast and shallow. Nia gives context. CLōD reasons. Greptile checks the source. The pipeline composes their verdicts: a record is flagged if any layer fires, and each flag is weighted by that layer's confidence.
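One way to read that composition rule is sketched below. The layer weights, the 0 to 1 confidence scale, and the choice to report the strongest weighted signal as the score are all assumptions; the write-up only states that any firing layer flags the record and that Greptile carries the highest weight. Nia is left out of the weights here on the assumption that it supplies evidence to the judge and does not return a verdict of its own.

```python
from dataclasses import dataclass

# Hypothetical weights; only the ordering (Greptile highest) comes from the write-up.
LAYER_WEIGHTS = {"rules": 0.5, "clod": 0.8, "greptile": 1.0}


@dataclass
class LayerVerdict:
    layer: str
    fired: bool
    confidence: float   # assumed to be in the range 0..1
    reason: str = ""


def compose(verdicts: list[LayerVerdict]) -> dict:
    fired = [v for v in verdicts if v.fired]
    if not fired:
        return {"flagged": False, "score": 0.0, "reasons": []}
    # Flag if any layer fired; score by the strongest weighted signal.
    score = max(LAYER_WEIGHTS[v.layer] * v.confidence for v in fired)
    return {
        "flagged": True,
        "score": score,
        "reasons": [f"{v.layer}: {v.reason}" for v in fired],
    }


result = compose([
    LayerVerdict("rules", False, 0.0),
    LayerVerdict("clod", True, 0.5, "unsupported claim"),
    LayerVerdict("greptile", True, 0.9, "symbol not found"),
])
```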
The cards view
Three agent cards sit on the right panel. CodeSmith for frontend code. MoodMaven for sentiment analysis. Synthia for summarization. When a user submits a prompt, the router recommends the best fit. The recommended card grows taller and gets an orange chip. The others spring into place with physics based layout animations.
Click any card to expand it. It transforms from landscape to portrait and reveals quality, context fit, and hallucination stats with animated bars, plus a one line "why this agent" justification.
The graph view
A second tab on the canvas renders the full session as a live node graph. Each ActivityRecord becomes a node. Parent child spans become edges. Status colors flow through the graph in real time. Green for verified. Red for caught hallucinations. The whole agentic workflow visible at a glance.
The divergence moment
If you override the recommendation and pick the wrong agent, the pipeline catches it. Tell MoodMaven to write code and watch what happens. A red banner slides down: "Hallucination detected. MoodMaven attempted to write code. Output abandoned the required JSON shape." The card gets a red border and a small red dot. The orange recommended chip jumps to CodeSmith on its own.
Click Switch. The banner exits. The status line turns emerald: "Switched to CodeSmith, hallucination corrected." The whole arc takes under ten seconds and it tells you everything about the product.
The drawer
Click any node in the graph and a drawer slides in from the right with the full markdown of that run. Front matter with agent id, run id, model, timestamp. Prompt received. Summary of work. Output. Evaluation block with hallucination score, quality score, and the exact reason any layer flagged it.
Every action auditable. Every claim checked.
How we built it
Frontend. Next.js 14 with the App Router, TypeScript, Tailwind, Framer Motion for spring physics layout animations, React Flow for the graph. The visual language is dark, glassy, and spacious. A near black canvas, glassmorphic cards with backdrop blur, a single orange accent, no decorative color. The full width tab bar at the top of the canvas was modeled after Antigravity's editor tabs.
AgentLogger. A thread safe, session scoped Python primitive. Records flow through an in memory append only list while a background writer buffers to JSONL on disk. A get_summary call reduces the activity list into aggregate metrics. Token totals, distinct models, file and tool inventories, decision pairs, error strings. An on_activity callback lets dashboards subscribe in real time without coupling the agent loop to storage I/O.
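The design described above, an in memory list guarded by a lock, a background writer draining a queue to JSONL, and a subscriber callback, could be sketched roughly as follows. The class name `SessionLogger`, the `close` method, and the summary keys are hypothetical; only `get_summary` and `on_activity` are names taken from the write-up.

```python
import json
import queue
import tempfile
import threading
from pathlib import Path


class SessionLogger:
    """Sketch of a thread safe logger with a background JSONL writer."""

    def __init__(self, path, on_activity=None):
        self._records = []                    # append only, guarded by the lock
        self._lock = threading.Lock()
        self._queue = queue.Queue()           # hands records to the writer thread
        self._on_activity = on_activity
        self._path = Path(path)
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def log(self, record: dict) -> None:
        with self._lock:
            self._records.append(record)
        self._queue.put(record)               # disk I/O happens off the caller's thread
        if self._on_activity is not None:
            self._on_activity(record)         # e.g. push to a dashboard

    def _drain(self) -> None:
        with self._path.open("a", encoding="utf-8") as f:
            while True:
                record = self._queue.get()
                if record is None:            # sentinel: stop writing
                    break
                f.write(json.dumps(record) + "\n")
                f.flush()

    def close(self) -> None:
        self._queue.put(None)
        self._writer.join()

    def get_summary(self) -> dict:
        with self._lock:
            records = list(self._records)
        return {
            "total_tokens": sum(r.get("tokens", 0) for r in records),
            "models": sorted({r["model"] for r in records if "model" in r}),
            "errors": [r["error"] for r in records if "error" in r],
        }


seen = []
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "session.jsonl"
    logger = SessionLogger(log_path, on_activity=seen.append)
    logger.log({"type": "model_call", "model": "fast-judge", "tokens": 120})
    logger.log({"type": "error", "error": "tool timeout"})
    logger.close()
    lines_on_disk = log_path.read_text(encoding="utf-8").splitlines()
summary = logger.get_summary()
```

Because the callback runs in the caller's thread in this sketch, a slow subscriber would still slow the agent loop; a real implementation might dispatch callbacks from the writer thread.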
The hallucination pipeline. Each layer is a separate module with a uniform interface. Evaluate a record, return a verdict. Rules use regex matchers. Nia is called over HTTPS to its v2 search endpoint with skip_llm set to true so we get raw text chunks instead of a pre digested answer, then we feed those chunks into the judge so the verdict is grounded in retrieved evidence rather than the judge's own opinion. CLōD routes to a fast judge model with a structured JSON output schema. Greptile is invoked through its API on code related records. We extract symbols and paths from the agent's output and ask Greptile to confirm each one exists in the indexed repo. If a symbol comes back as fabricated, that record is flagged with the highest confidence weight.
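The uniform interface, and the way retrieved chunks reach the judge, could be expressed with a `typing.Protocol`. This is a sketch under assumptions: the `Layer`, `Context`, and `Verdict` names are hypothetical, and the two stub layers use canned chunks and a toy substring check in place of real Nia and CLōD calls.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Verdict:
    layer: str
    flagged: bool
    reason: str = ""


@dataclass
class Context:
    # Shared scratch space so a retrieval layer can hand evidence to the judge.
    evidence: list[str] = field(default_factory=list)


class Layer(Protocol):
    name: str

    def evaluate(self, record: dict, ctx: Context) -> Verdict: ...


class RetrievalStub:
    name = "retrieval"

    def __init__(self, chunks: list[str]):
        self.chunks = chunks

    def evaluate(self, record: dict, ctx: Context) -> Verdict:
        ctx.evidence.extend(self.chunks)      # raw chunks, no pre digested answer
        return Verdict(self.name, False)


class JudgeStub:
    name = "judge"

    def evaluate(self, record: dict, ctx: Context) -> Verdict:
        # Toy grounding check: the claim must appear in some retrieved chunk.
        supported = any(record["output"] in chunk for chunk in ctx.evidence)
        reason = "" if supported else "claim not found in evidence"
        return Verdict(self.name, not supported, reason)


def run_pipeline(record: dict, layers: list[Layer]) -> list[Verdict]:
    ctx = Context()                           # fresh evidence per record
    return [layer.evaluate(record, ctx) for layer in layers]


layers = [RetrievalStub(["the API exposes list_agents"]), JudgeStub()]
ok = run_pipeline({"output": "list_agents"}, layers)
bad = run_pipeline({"output": "delete_everything"}, layers)
```

Since every layer satisfies the same protocol, adding or reordering layers is a change to the list passed to `run_pipeline`, not to the layers themselves.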
Recommendation engine. A pure keyword based router for the demo. Code shaped prompts route to CodeSmith. Tone shaped prompts route to MoodMaven. Summarization prompts route to Synthia. Deterministic by design so the demo is reproducible on stage.
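A keyword router of this kind fits in a few lines. The keyword lists and the fallback agent below are invented for illustration; only the three agent names and their specialties come from the write-up.

```python
# Hypothetical keyword lists; the real demo's lists are not published here.
KEYWORDS = {
    "CodeSmith": ("code", "function", "component", "bug", "implement"),
    "MoodMaven": ("tone", "sentiment", "mood", "feel"),
    "Synthia": ("summarize", "summary", "tl;dr", "recap"),
}
DEFAULT_AGENT = "Synthia"  # assumed fallback when nothing matches


def recommend(prompt: str) -> str:
    text = prompt.lower()
    # Count keyword hits per agent with plain substring matching.
    scores = {
        agent: sum(word in text for word in words)
        for agent, words in KEYWORDS.items()
    }
    # max() returns the first best key in dict order, so ties are deterministic.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else DEFAULT_AGENT
```

There is no randomness and no model call, so the same prompt always routes to the same agent.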
Divergence detection. A single function runs after the agent executes. It compares the user's chosen agent against what the recommender would have picked. If they diverge, it returns a structured interrupt event with a reason string specific to the failure mode. The interrupt state lives in the parent page and drives the red banner, the red border on the hallucinating card, and a state machine status line under the title that walks through idle, routed, hallucinated, switched.
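In rough terms, the check and the status line could look like the sketch below. The event dictionary shape, the event names (`route`, `interrupt`, `switch`), and the fallback reason string are assumptions; the four status values and the MoodMaven banner text are taken from the write-up.

```python
from typing import Optional

# Failure-mode specific reasons, keyed by (chosen agent, recommended agent).
REASONS = {
    ("MoodMaven", "CodeSmith"): (
        "MoodMaven attempted to write code. "
        "Output abandoned the required JSON shape."
    ),
}


def detect_divergence(chosen: str, recommended: str) -> Optional[dict]:
    """Return an interrupt event if the user's pick differs from the router's."""
    if chosen == recommended:
        return None
    reason = REASONS.get(
        (chosen, recommended),
        f"{chosen} was run on a prompt better suited to {recommended}.",
    )
    return {
        "type": "interrupt",
        "agent": chosen,
        "switch_to": recommended,
        "reason": reason,
    }


# Status line state machine: idle -> routed -> hallucinated -> switched.
TRANSITIONS = {
    ("idle", "route"): "routed",
    ("routed", "interrupt"): "hallucinated",
    ("hallucinated", "switch"): "switched",
    ("switched", "route"): "routed",
}


def next_status(status: str, event: str) -> str:
    # Unknown (status, event) pairs leave the status unchanged.
    return TRANSITIONS.get((status, event), status)


event = detect_divergence("MoodMaven", "CodeSmith")
status = "idle"
for name in ("route", "interrupt", "switch"):
    status = next_status(status, name)
```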
Tab system. Two views, cards and graph, share the same state and the same session. Switching tabs is a 200ms cross fade. The right panel persists across both tabs, so the recommendation, divergence banner, and switch flow are always one click away from the graph view.
State architecture. Card expansion lives in the parent keyed by agent id, so a card stays open across reorder animations. Re routing fires layout springs inside a layout group for clean coordination across all cards. A switched token integer bumps on each switch so consecutive corrections re fire the three second emerald confirmation cleanly.
Challenges we ran into
Splitting hallucination detection across four layers was the hardest design decision. We started with just CLōD as the judge, but it kept passing code that referenced functions that did not exist. Adding Greptile as the ground truth layer changed the architecture. It is not just another judge. It is the final word on any claim involving code. Wiring it as the highest confidence layer took rethinking how we composed verdicts.
Plumbing Nia in correctly was its own puzzle. The default mode returns a model generated answer, which defeats the point when you want raw evidence. We flipped skip_llm to true so it returns just the chunks, and let CLōD do the reasoning over that evidence. That separation between retrieval and judgment is what made the pipeline composable.
The reorder animation had to keep expanded state attached to its card during the spring transition. The expanded card could not unmount and remount or it would lose its inner content. We moved expansion state into the parent keyed by agent id and let Framer Motion's layout handle the rest. Three rewrites before it felt right.
Integrating the graph view as a tab without merging branches was a coordination challenge. One teammate built the graph view on a separate branch in parallel. To keep both branches shippable we used git checkout to pull just the graph files from origin/node-ui into the cards branch, restructured them into a graph folder, and wrapped them in a tab. The two views never had to share a merge commit.
Animating the divergence arc required four AnimatePresence trees and one layout animation to fire in sync without any of them stuttering. The fix was driving all of them off the same interrupt state token in the parent.
Accomplishments that we are proud of
The two layer separation feels clean. Layer 1 captures. Layer 2 verifies. Every artifact in the system, the markdown files, the JSONL logs, the SessionSummary, is human readable. There is no proprietary trace format you need our tool to open.
Greptile as the codebase ground truth layer is the architectural insight we are proudest of. The current state of the art for hallucination detection, rules plus document search plus LLM as judge, has a blind spot the size of every claim an agent makes about code. Greptile closes it.
The divergence arc is the demo's hero beat. Type a prompt, expand the wrong card, click run, red banner, switch, emerald correction. Eight seconds and you understand the whole product.
The frontend feels considered. Glass cards, dark canvas, single orange accent, IDE shaped layout, spring physics reorders that do not feel like animations so much as the right thing happening at the right speed. Nothing is decorative.
What we learned
Hallucination detection needs multiple layers because every layer has a blind spot. Regex is fast but shallow. Document search has context but no consistency check. LLM as judge has reasoning but no ground truth. Repo level verification gives you ground truth but only on code. Stacking them in the right order, fast and shallow first, slow and authoritative last, is the architecture. Not any single layer.
Markdown is the right source of truth. JSONL is structured and queryable. The SessionSummary aggregates. The Markdown export is what you hand to a human or a CI artifact. Same underlying data, three audiences.
Frontend novelty matters as much as backend depth in a three minute demo. The cards to graph tab switch, the divergence banner, the red to emerald arc, those are what the hackathon judges remember. The pipeline architecture is what wins the technical score. You need both.
Building a product on top of unmerged parallel branches is doable if you treat the file system as the integration surface, not the merge history. Check out specific paths from the other branch, restructure them, never merge.
What's next for Mengine
Real time streaming of activity records over server sent events so the graph view animates as the agent runs, not after.
Auto fix mode. When Greptile detects a fabricated function, Mengine hands the agent the actual function signature from the repo and re runs the step. The judge becomes a corrector.
Multi agent concurrency. Right now the system tracks one agent at a time. AgentLogger already supports parent child spans, so wiring concurrent agents into the graph view is mostly a layout problem.
A native MCP server so Cursor, Claude Desktop, and any MCP speaking client can plug into Mengine without any code changes on the agent side.
Built With
- clod
- framer-motion
- greptile
- json-lines
- next.js
- nia
- node.js
- python
- react
- tailwindcss
- typescript
- xyflow
