Inspiration

I cofounded Appacella, an AI mobile app developer. While building it, I manually read hundreds of AI coding traces to find issues and improve the agent. Now I work on RL environments and hit the same wall: I need to read hundreds of transcripts of the model attempting a task. I can't trust the model to read the whole thing itself, so I wanted an LLM to help me read transcripts quickly, plus a tool to group and keep track of model runs. That's how the idea for this project started.


What it does

RLX-ray is a dashboard for agent run observability. You ingest RL runs from Hugging Face datasets or by triggering SWE-agent (the standard harness for SWE-bench) on Modal. Each run is then analyzed by an LLM agent that segments the trajectory into sections (good / warning / failure) and summarizes what went wrong. You can search massive transcripts with Elasticsearch, compare runs within a task using embedding-backed similarity, and surface runs with similar failure modes. The goal is to understand how the agent interacts with the environment (where it gets stuck, what it tries, where it goes wrong) without reading hundreds of traces by hand.


How I built it

  • Frontend: Next.js. It's a dashboard for displaying runs and sections; server-side rendering works well for that.
  • Storage & search: Elasticsearch. I needed to store runs and search over massive transcripts; Elasticsearch made that tractable.
  • Comparing runs: I wanted embeddings to compare runs within a task. I use Jina (via Elastic) for embeddings and keep everything in Elastic, so I went with Elastic Agent Builder for the trajectory-analyzer agent. It integrates cleanly with the same DB and lets the agent query via Elasticsearch instead of stuffing the whole transcript into context (transcripts can be near the context limit or larger with autocompaction).
  • User-supplied runs: I ingest runs from Hugging Face, but the product only really works if users can run their own. So I use Modal's sandbox to run the SWE-agent harness on a SWE-bench task: the app triggers a Modal endpoint that runs SWE-agent and writes results back to Elasticsearch.
  • Analysis design: I didn't want to dump the entire transcript into the model. I use Elasticsearch so the agent can query the run (e.g. by event type or step) and do sectioning without blowing context limits.
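To make the query-instead-of-ingest idea concrete, here is a minimal sketch of the kind of filtered, paged query the analyzer agent could issue. The index name (`runs-events`) and field names (`run_id`, `step`, `event_type`) are illustrative assumptions, not the app's actual schema.

```python
from typing import Optional

# Sketch: page through one run's events in Elasticsearch instead of
# loading the whole transcript into the LLM's context.
def events_query(run_id: str, event_type: Optional[str] = None,
                 after_step: int = 0, size: int = 50) -> dict:
    """Build an ES _search body for one slice of a run's events."""
    filters = [
        {"term": {"run_id": run_id}},
        {"range": {"step": {"gt": after_step}}},
    ]
    if event_type:
        filters.append({"term": {"event_type": event_type}})
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{"step": "asc"}],
        "size": size,
    }

# The agent would send this body to the _search endpoint of the events
# index and iterate, raising after_step each page, so context stays bounded.
body = events_query("run-42", event_type="tool_error", after_step=10)
```

Because each page is a bounded query result, the agent can section a run that is far larger than its context window.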

Challenges I ran into

  1. Section splits: Getting the LLM to produce good section boundaries instead of splitting arbitrarily on errors was hard. I had to tune prompts and tooling so sections reflect meaningful phases (e.g. "got stuck here," "repeated attempt") rather than every failure.
  2. Grouping similar runs: With embeddings, runs that feel similar (e.g. "error + loop") should cluster together. But if the error text differs, embeddings put them in different buckets. I needed a pseudo re-ranking approach (not classic search re-ranking) to surface runs that are behaviorally similar even when the surface text differs.
  3. Modal + SWE-agent: SWE-agent was rough to wire up for a few reasons. It isn't a simple pip-and-run library: it expects a full repo layout (config/, tools/, trajectories/), so I had to build a Modal image that clones the repo and does an editable install, then runs the CLI. SWE-agent also uses Modal internally for code execution (via swe-rex), so I ended up with Modal-in-Modal: my orchestrator runs on Modal and the CLI spawns its own Modal sandboxes for each run. I hit API compatibility issues too: SWE-agent uses litellm, and some models (e.g. GPT-5, Claude) error on unsupported params like top_p, so I added a custom entrypoint that sets litellm.drop_params=True before invoking the real runner. On top of that, I had to discover and parse the trajectory output (.traj files), normalize "history" vs "trajectory" (schema varies by version), and write events plus run status back to Elasticsearch. Getting the whole chain reliable took a lot of iteration.

Accomplishments that I'm proud of

  • End-to-end pipeline: From raw runs (Hugging Face or live Modal) to searchable, sectioned trajectories and "similar runs" in one dashboard.
  • LLM that queries instead of ingesting: The trajectory-analyzer agent uses Elasticsearch as its source of truth and queries by step/event type instead of stuffing full transcripts into context, so I can handle long runs without blowing context limits.
  • Sectioning that surfaces mild failures: Sections don't just split on hard errors; they highlight "got stuck," "repeated attempt," and partial failures so you see where a run went off the rails, not just pass/fail.
  • Custom similar-run grouping: I built a pseudo re-ranking approach so runs that are behaviorally similar (e.g. same failure mode) cluster together even when the literal error text differs, beyond off-the-shelf embedding similarity.
  • Modal + SWE-agent integration: Getting the app → Modal → swe-rex sandbox chain working so users can launch and track SWE-agent runs from the UI.
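The pseudo re-ranking idea can be sketched as a two-signal score: embedding similarity plus a boost when behavioral labels match. The 0.7/0.3 weights and the failure-mode labels here are made up for the example; RLX-ray's actual scoring differs.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query: dict, candidates: list) -> list:
    """Order candidates by embedding similarity plus a failure-mode match
    boost, so 'same behavior, different error text' still ranks high."""
    def score(c):
        sim = cosine(query["embedding"], c["embedding"])
        same_mode = 1.0 if c["failure_mode"] == query["failure_mode"] else 0.0
        return 0.7 * sim + 0.3 * same_mode  # illustrative weights
    return sorted(candidates, key=score, reverse=True)
```

With this kind of score, a run with lower raw embedding similarity but the same labeled failure mode can outrank a textually closer run with a different failure mode.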

What I learned

  • Formats: LLM trajectories are stored in many formats; normalizing and ingesting them is a real integration challenge.
  • Elasticsearch: I learned Elasticsearch properly: indices, k-NN, and wiring an agent to query it instead of relying on raw context.
  • Beyond re-ranking: I built a custom approach to group "similar" runs when standard embedding similarity wasn't enough (different error text, same failure mode).
  • Product: Focus on value. The core value is analyzing trajectories: sections, similar runs, "where did it go wrong." The Modal "run SWE-agent from the UI" flow is cool but secondary; the dashboard and analysis are what actually help when reading hundreds of runs.

What's next for RLX-ray

  • Environment versioning: Track versions of environments so you can compare runs across env changes and know exactly what code/config a run used.
  • Custom RL environments: Let people upload their own RL environments (e.g. via Docker plus a standardized interface) and run tasks against them, instead of being limited to the tasks already in the system.
  • More trajectory formats: Support additional agent/RL run formats so more teams can ingest their runs.
  • Smarter similar-run logic: Improve the pseudo re-ranking so "same failure mode, different text" clusters even better.

Built with

TypeScript, React, Next.js, Python, Elasticsearch, Kibana, Elastic Agent Builder, Jina, Hugging Face, Modal, Vercel, SWE-agent, swe-rex, Tailwind CSS, uv

Languages & frameworks

  • TypeScript, React 19, Next.js 16
  • Python (workers, scripts)

Infrastructure & data

  • Elasticsearch (Elastic Cloud): store runs, events, sections; full-text and k-NN search
  • Kibana / Elastic Agent Builder: trajectory-analyzer agent (converse API, tools, query over ES)
  • Jina: section and run-summary embeddings (768-dim), similarity for "find similar runs/sections"
  • Hugging Face: dataset ingestion for runs
  • Modal: run SWE-agent in sandboxes (swe-rex); app triggers Modal endpoint, worker writes to Elasticsearch
  • Vercel: host the Next.js app
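For the k-NN side, a hedged sketch of the kind of request "find similar runs/sections" could issue against the `_search` REST body. The index and field names are assumptions; 768 matches the Jina embedding size mentioned above.

```python
# Sketch: an Elasticsearch k-NN _search body for "find similar sections".
# Field/index names are illustrative; 768 is the Jina embedding width.
def similar_sections_request(vector: list, k: int = 5) -> dict:
    assert len(vector) == 768, "Jina section embeddings are 768-dim"
    return {
        "knn": {
            "field": "section_embedding",  # assumed dense_vector field name
            "query_vector": vector,
            "k": k,
            "num_candidates": max(10 * k, 100),
        },
        "_source": ["run_id", "section_label", "summary"],
    }

req = similar_sections_request([0.0] * 768)
```

The body would be POSTed to the sections index's `_search` endpoint; `num_candidates` trades recall for latency in the approximate k-NN phase.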

APIs & tooling

  • Elasticsearch API, Kibana Agent Builder Converse API
  • Jina Embeddings API
  • SWE-agent / swe-rex (execution harness)

Other

  • Tailwind CSS, react-markdown
  • uv (Python envs and scripts)
