Prometheus: Autonomous AI Research Lab

Inspiration

AI models often fail when deployed on real-world data that differs from their training set—a problem called distribution shift. For example, a hospital readmission model trained on clinic patients might fail catastrophically on emergency room patients. Solving this requires extensive experimentation: testing dozens of training strategies (reweighting, regularization, domain adaptation) to find robust approaches.

The problem: This research process takes PhD students weeks or months.

Our vision: What if an AI agent could do this autonomously in hours?

What It Does

Prometheus is an autonomous research lab that:

  1. Analyzes a machine learning task with distribution shift
  2. Designs experiments testing different robustness strategies
  3. Generates code for each experiment (via Cline)
  4. Reviews code for correctness (via CodeRabbit)
  5. Runs experiments and tracks results
  6. Learns from outcomes and proposes better strategies
  7. Iterates until finding robust solutions

Key innovation: The system optimizes for worst-group accuracy—ensuring models work reliably even on the hardest subgroups, not just on average.
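As an illustration of why this metric matters, worst-group accuracy scores each subgroup separately and takes the minimum. The sketch below is not Prometheus's actual implementation, just a minimal version showing how a model with a reasonable average accuracy can still score zero on its hardest group:

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst-group accuracy, per-group accuracies)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[g] = float((y_true[mask] == y_pred[mask]).mean())
    return min(accs.values()), accs

# Toy example: perfect on group "a", wrong on every sample in group "b"
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]
worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
# average accuracy is 0.5, but worst-group accuracy is 0.0
```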

How We Built It

Architecture:

  • Backend: Flask + PostgreSQL for experiment tracking
  • Frontend: Next.js dashboard (Vercel) with real-time experiment monitoring
  • Code Generation: Cline API converts strategy descriptions → executable Python code
  • Code Review: CodeRabbit validates experiment implementations
  • ML Training: Oumi for strategies requiring custom model fine-tuning

Pipeline:

Problem Input → Agent Analyzes Task & Prior Results → Proposes Strategies →
Cline Generates Code → CodeRabbit Reviews → Execute Experiments →
Track Metrics → Iterate
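The pipeline above can be sketched as a loop. All stage functions here (`propose_strategies`, `generate_code`, `review_code`, `run_experiment`) are hypothetical placeholders standing in for the Cline/CodeRabbit integrations, with toy implementations so the sketch runs end to end; this is the shape of the loop, not Prometheus's actual API:

```python
def research_loop(task, n_cycles=3):
    history = []  # accumulated (strategy, metrics) pairs across cycles
    for cycle in range(n_cycles):
        strategies = propose_strategies(task, history)  # agent reasons over past results
        for strategy in strategies:
            code = generate_code(strategy)              # e.g. via Cline
            if not review_code(code):                   # e.g. via CodeRabbit
                continue                                # rejected code never runs
            metrics = run_experiment(code)
            history.append((strategy, metrics))
    # select the strategy with the best worst-group accuracy
    return max(history, key=lambda h: h[1]["worst_group_acc"])

# Toy stand-ins so the sketch is executable:
def propose_strategies(task, history):
    # second cycle builds on the first by combining strategies
    return ["reweighting", "group_dro"] if not history else ["reweighting+l2"]

def generate_code(strategy):
    return f"# code for {strategy}"

def review_code(code):
    return True

def run_experiment(code):
    return {"worst_group_acc": 0.8 if "+" in code else 0.6}

best = research_loop("hospital_readmission", n_cycles=2)
```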

Datasets: Tested on TableShift benchmarks (hospital readmission, income prediction, recidivism) to prove generalization.

Technical Challenges

  1. Designing the agent reasoning loop: How do we get the LLM to propose experiments that outperform random search? Solution: Provide rich context (current results, domain analysis, technique references) and require explicit reasoning before each proposal.

  2. Worst-group metric implementation: Correctly tracking per-group performance while handling invalid/small groups and imbalanced data.

  3. Multi-cycle learning: Making each research cycle build on previous insights rather than starting fresh.

  4. Generalization across domains: Abstracting the system to work on any tabular classification task with group shifts, not just healthcare.
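For challenge 2, one common way to handle invalid or tiny groups is to exclude any group below a minimum size before taking the minimum, since small groups give noisy accuracy estimates. This is a sketch under that assumption; the threshold and exclusion rule are illustrative, not Prometheus's exact policy:

```python
import numpy as np

MIN_GROUP_SIZE = 5  # assumed threshold, chosen for illustration

def worst_group_accuracy(y_true, y_pred, groups, min_size=MIN_GROUP_SIZE):
    """Worst per-group accuracy, skipping groups too small to score reliably."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        if mask.sum() < min_size:
            continue  # tiny groups produce noisy, often degenerate estimates
        accs[g] = float((y_true[mask] == y_pred[mask]).mean())
    if not accs:
        raise ValueError("no group meets the minimum size threshold")
    return min(accs.values())
```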

What We Learned

  • Autonomous research is possible: With proper metrics and context, LLMs can design meaningful experiments
  • Worst-group optimization matters: Average accuracy hides failures on critical subgroups
  • Code review is essential: CodeRabbit caught bugs that would've invalidated experiments
  • Iteration compounds: Each research cycle produces genuinely better strategies
  • Platform thinking wins: Building for generality (not one niche) creates more impact

Accomplishments

  • ✅ Reduced the robustness gap from 15% to 3% on the hospital readmission task
  • ✅ Discovered novel strategy combinations (e.g., group-aware regularization + importance sampling)
  • ✅ Proved generalization across 3 different domains
  • ✅ Autonomous multi-cycle research: each iteration improves on the last
  • ✅ Full integration of all 4 sponsor tools (Cline, CodeRabbit, Vercel, Oumi)

What's Next

  • Expand to more domains: Computer vision, NLP, time series
  • Meta-learning: Train the agent on outcomes from many tasks to improve proposal quality
  • Automated paper generation: Convert experiment results → publication-ready reports
  • Collaborative research: Multiple Prometheus instances working on related problems
  • Production deployment: API for researchers to submit tasks and get robust models

Prometheus proves AI can do science autonomously—designing experiments, discovering insights, and solving robustness problems that previously required months of PhD-level work.
