Inspiration

Healthcare data is broken at the seams. A single patient might have lab results in a CSV from one hospital, clinical notes as a scanned PDF from another, and genomic variants in a VCF file from a third. None of it talks to each other. Researchers and clinicians who need a complete picture spend more time wrangling formats than doing science.

We've seen this problem up close: the bottleneck isn't the AI, it's the data plumbing that comes before it. DataGrid eliminates that bottleneck entirely: give it messy, multi-source patient records and get back clean, standardized, research-ready data with a full audit trail, ready for US pharma companies. Tempus AI currently does this for US-based hospitals.

What it does

DataGrid is a five-agent autonomous pipeline that ingests clinical records across formats (PDF, CSV, VCF, JSON), maps entities to ICD-10 and OMOP CDM v5.4 standards, validates low-confidence mappings with a second-pass agent, and outputs structured Parquet datasets with full provenance — all without human intervention.

Each agent has a distinct role and a distinct identity:

  • Ingestion Agent — pulls records via Airbyte, routes to format parsers
  • Modality Agent — identifies and fills data gaps per patient
  • Harmonization Agent — maps clinical entities to ICD-10 / OMOP
  • Validation Agent — second-pass review on low-confidence mappings
  • Output Agent — writes final Parquet tables and provenance to Ghost DB
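The handoff between the agents above can be sketched as a simple chain where each stage stamps its provenance. This is a minimal illustration, not the actual implementation: the stub agent logic and `PatientRecord` shape are hypothetical, and the real pipeline runs these stages through Modal and Ghost.

```python
# Minimal sketch of the agent chain; each stage records a provenance entry.
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    raw: dict
    provenance: list = field(default_factory=list)

def run_agent(name, fn, record):
    """Run one agent and append a provenance entry for auditability."""
    record.raw = fn(record.raw)
    record.provenance.append(name)
    return record

# Stub agent logic; the real agents parse, fill gaps, harmonize,
# validate, and write Parquet.
AGENTS = [
    ("ingestion",     lambda r: {**r, "parsed": True}),
    ("modality",      lambda r: {**r, "gaps_filled": True}),
    ("harmonization", lambda r: {**r, "icd10": "E11.9"}),
    ("validation",    lambda r: {**r, "validated": True}),
    ("output",        lambda r: {**r, "written": True}),
]

def run_pipeline(raw):
    record = PatientRecord(raw=raw)
    for name, fn in AGENTS:
        record = run_agent(name, fn, record)
    return record

record = run_pipeline({"patient_id": "p-001"})
```

Every hop lands in `record.provenance`, which is what makes the final dataset auditable rather than just correct.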

How we built it

The core pipeline is orchestrated in Python with Modal handling parallel harmonization across patient batches. What makes DataGrid architecturally distinct is how we wired the sponsor infrastructure around it:

Ghost serves as our agent-native database layer. Every pipeline run forks a fresh Postgres database, uses it for caching, job state, and provenance, then discards it on completion. Agents never share stale state across runs.
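The fork-per-run lifecycle looks roughly like this. The `ForkedDB` class is an in-memory stand-in for the forked Postgres instance; the real calls go through Ghost's client, not shown here.

```python
# Sketch of the fork-per-run pattern with an in-memory stand-in for
# Ghost's forked Postgres database.
import contextlib
import uuid

class ForkedDB:
    """Ephemeral per-run store: created fresh, discarded on completion."""
    def __init__(self):
        self.run_id = str(uuid.uuid4())
        self.cache = {}
        self.job_state = {}
        self.provenance = []

@contextlib.contextmanager
def pipeline_run():
    db = ForkedDB()          # fork: a fresh database for this run only
    try:
        yield db
    finally:
        del db               # discard: nothing survives the run

with pipeline_run() as db:
    db.cache["icd10:diabetes"] = "E11.9"
    db.provenance.append(("harmonization", "mapped diabetes -> E11.9"))

with pipeline_run() as db2:
    assert "icd10:diabetes" not in db2.cache  # nothing leaks between runs
```

The design choice is that isolation is structural, not disciplinary: an agent cannot read a previous run's state because that database no longer exists.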

Auth0 enforces least-privilege access across the pipeline. Each agent requests an M2M token scoped to exactly what it needs — read:records, write:harmonized, read:validated, write:omop — before it executes. The frontend uses a PKCE login flow so patient consent is explicit and auditable.
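The per-agent token request follows Auth0's standard client-credentials (M2M) grant. In this sketch the audience identifier and credentials are placeholders, and the exact scope-to-agent mapping is illustrative; only the scope names come from the design above.

```python
# Shape of the client-credentials (M2M) token request each agent makes
# before executing. Audience and credentials are placeholders.
AGENT_SCOPES = {
    "ingestion":     "read:records",
    "harmonization": "read:records write:harmonized",
    "validation":    "read:validated",
    "output":        "write:omop",
}

def token_request(agent, client_id, client_secret):
    return {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "audience": "https://datagrid.example/api",  # placeholder API identifier
        "scope": AGENT_SCOPES[agent],                # only what this agent needs
    }

# Sent via e.g. requests.post(f"https://{TENANT}/oauth/token", json=payload)
payload = token_request("harmonization", "CLIENT_ID", "CLIENT_SECRET")
```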

Airbyte replaces raw file-system ingestion with a proper source connector, decoupling the pipeline from any specific data delivery mechanism. The parsers downstream are untouched — Airbyte just delivers the bytes.
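Because Airbyte just delivers bytes, the downstream routing reduces to format detection. A simplified version of that router, using well-known magic headers (`%PDF-` for PDF, `##fileformat=VCF` for VCF), might look like this; the real parsers are more thorough.

```python
# Minimal format router for the bytes Airbyte delivers.
import json

def detect_format(data: bytes) -> str:
    head = data.lstrip()[:16]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"##fileformat=VCF"):
        return "vcf"
    if head.startswith((b"{", b"[")):
        return "json"
    return "csv"  # fall back to delimited text

def route(data: bytes):
    fmt = detect_format(data)
    if fmt == "json":
        return fmt, json.loads(data)
    return fmt, data  # hand raw bytes to the format-specific parser

fmt, parsed = route(b'{"patient_id": "p-001"}')
```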

Overmind wraps every LLM call in the pipeline with zero code changes to core logic. Every ICD-10 mapping decision, every validation judgment, every confidence score is traced, queryable, and fed back into continuous improvement.
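The wrap-don't-modify pattern is essentially a decorator around each LLM call. This sketch uses a plain list as the trace sink and a stubbed mapping function; Overmind provides the real tracing and storage, and its API is not shown here.

```python
# Sketch of wrapping every LLM call with tracing, with zero changes
# to the core mapping logic.
import functools
import time

TRACES = []  # stand-in for the real trace store

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACES.append({
            "call": fn.__name__,
            "args": args,
            "result": result,
            "latency_s": round(time.time() - start, 3),
        })
        return result
    return wrapper

@traced
def map_to_icd10(term: str):
    # stub for the real LLM mapping call
    return {"code": "E11.9", "confidence": 0.92}

map_to_icd10("type 2 diabetes")
```

After the call, the mapping decision, its inputs, and its confidence score all sit in the trace, queryable after the fact.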

Challenges

The hardest problem wasn't the AI — it was state management across autonomous agents. When five agents run in sequence (and some in parallel on Modal), you need a shared source of truth that's fast, ephemeral, and doesn't leak state between runs. Ghost's fork-per-run model solved this cleanly, but wiring it into the orchestrator required rethinking how cache reads and writes propagated through the pipeline.

Auth0's M2M pattern also required careful scope design. It's easy to give every agent full access and call it done — the challenge was defining meaningful, enforceable boundaries that actually reflect the trust model of a real clinical system.
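Enforceable boundaries mean a call fails loudly when the token lacks the right scope. A simplified version of that check (the decorator and token shape are illustrative; in production the scopes come from the validated Auth0 access token):

```python
# Sketch of enforcing agent boundaries as scopes: an action is refused
# if the presented token was not granted the scope for it.
def require_scope(scope):
    def decorator(fn):
        def wrapper(token, *args, **kwargs):
            if scope not in token.get("scope", "").split():
                raise PermissionError(f"{fn.__name__} requires {scope}")
            return fn(token, *args, **kwargs)
        return wrapper
    return decorator

@require_scope("write:harmonized")
def write_harmonized(token, rows):
    return len(rows)

ok = write_harmonized({"scope": "read:records write:harmonized"}, [1, 2])
try:
    write_harmonized({"scope": "read:records"}, [1])
except PermissionError:
    pass  # an ingestion-scoped token cannot write harmonized data
```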

What we learned

Building DataGrid made it clear that the infrastructure layer is the trust layer in agentic AI. A pipeline that produces the right answer but can't prove how it got there is not deployable in healthcare. Every design decision — ephemeral databases, scoped tokens, LLM tracing — was in service of making the system's behavior auditable, not just accurate.

We also learned that agent boundaries map surprisingly well to auth scopes. If you can't describe what an agent is allowed to do, you probably haven't thought carefully enough about what it should do.

Built with

Python · FastAPI · Modal · Anthropic Claude · Auth0 · Ghost · Airbyte · Overmind · OMOP CDM v5.4 · FHIR R4 · Apache Parquet
