Inspiration

At Prophet Hacks, we wanted an agent that forecasts like a superforecaster—not from stale memory, but from fresh evidence, multiple models, and calibrated probabilities. TomMind AI targets the Forecasting track: minimize Brier score on real Kalshi-style prediction markets.


What it does

TomMind AI exposes POST /predict (compatible with the ai-prophet CLI). Given a market (title, rules, category, close_time), it returns:

  • p_yes — probability in [0.01, 0.99]
  • rationale — short explanation
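
A minimal sketch of the request and response shapes described above, using only the standard library. The class names `Market` and `Prediction` are hypothetical, the string type for `close_time` is an assumption, and clamping `p_yes` on construction is one plausible way to keep it in [0.01, 0.99], not necessarily how the server does it.

```python
from dataclasses import dataclass


@dataclass
class Market:
    """Input fields named in the write-up (types are assumptions)."""
    title: str
    rules: str
    category: str
    close_time: str  # assumed to be an ISO 8601 timestamp string


@dataclass
class Prediction:
    """Output fields named in the write-up."""
    p_yes: float
    rationale: str

    def __post_init__(self) -> None:
        # Assumption: out-of-range values are clamped rather than rejected.
        self.p_yes = min(0.99, max(0.01, self.p_yes))
```

Clamping away from 0 and 1 caps the Brier penalty for a single confidently wrong forecast at 0.99² ≈ 0.98 instead of 1.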

Pipeline (v3, default):

  1. Load Kalshi implied price (when available) as context
  2. Analyst → 2–3 keyword queries → Serper news (window anchored to close_time)
  3. Parallel ensemble — GPT-4o, Phi-4, Gemini Flash with different prompt styles
  4. Median aggregate → blend with market price (stronger when models disagree) → calibrate toward 0.5
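
Step 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function name `aggregate`, the weights `base_weight` and `max_weight`, the `shrink` factor, and the use of the max-min spread as the disagreement measure are all hypothetical choices.

```python
from statistics import median


def clamp(p: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a probability inside [lo, hi]."""
    return max(lo, min(hi, p))


def aggregate(model_probs, market_price=None,
              base_weight=0.2, max_weight=0.6, shrink=0.1):
    """Median of model forecasts, blended with the market, shrunk toward 0.5.

    All numeric defaults are illustrative assumptions.
    """
    p = median(model_probs)

    if market_price is not None:
        # Disagreement measured as the spread between the most extreme models.
        spread = max(model_probs) - min(model_probs)
        # The market gets more weight as the models disagree more.
        w = base_weight + (max_weight - base_weight) * spread
        p = (1 - w) * p + w * market_price

    # Calibration: pull the forecast a fixed fraction of the way toward 0.5.
    p = 0.5 + (1 - shrink) * (p - 0.5)
    return clamp(p)
```

With a market price of 0.4, three models that all say 0.8 yield a forecast nearer 0.8 than models spread over 0.6, 0.8, and 1.0 do, because the wider spread shifts weight to the market.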

Offline eval: on Prophet’s sample-resolved set (26 markets) and our 5-market smallTest slice, Brier is well below the 0.25 “always 50%” baseline (roughly 0.06–0.08 on the full resolved set in local runs). These are dev benchmarks, not the final hackathon leaderboard.
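
For reference, the Brier score on binary markets is the mean squared difference between the forecast and the 0/1 outcome. The sketch below uses a hypothetical helper, `brier_score`, that is not taken from evaluate_local.py, and shows where the 0.25 baseline comes from.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecasts and binary outcomes (1 = YES)."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Always answering 50% costs (0.5)^2 = 0.25 on every market,
# regardless of how the markets resolve.
baseline = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Lower is better: a forecaster who says 0.9 on a YES market and 0.1 on a NO market scores 0.01.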


How we built it

  • FastAPI server + LangGraph orchestration (parallel forecasters via Send)
  • OpenRouter for LLMs; Serper for news; Kalshi public API for market priors
  • Prompt variants inspired by forecasting research (superforecaster, base-rate, scratchpad / 7-step)
  • evaluate_local.py + Prophet CLI forecast retrieve / evaluate for Brier scoring
  • Iteration: single-model MVP → analyst + search → v3 ensemble + calibration + market blend
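
The parallel fan-out can be illustrated without LangGraph. The sketch below swaps in plain `asyncio.gather` as a stand-in for the Send-based graph, with a per-forecaster timeout; `run_forecaster`, `run_ensemble`, and the forecaster callables are hypothetical names, and the real calls would go through OpenRouter.

```python
import asyncio


async def run_forecaster(name, forecast_fn, market, timeout_s):
    """Run one forecaster; return (name, None) if it fails or times out."""
    try:
        p = await asyncio.wait_for(forecast_fn(market), timeout=timeout_s)
        return name, p
    except Exception:
        # Broad catch for the sketch: a timeout or API error drops this
        # model from the ensemble instead of failing the whole request.
        return name, None


async def run_ensemble(forecasters, market, timeout_s=30.0):
    """Run all forecasters concurrently and keep the ones that answered.

    `forecasters` maps a model label to an async callable taking the market.
    """
    tasks = [
        run_forecaster(name, fn, market, timeout_s)
        for name, fn in forecasters.items()
    ]
    results = await asyncio.gather(*tasks)
    return {name: p for name, p in results if p is not None}
```

Dropping a slow model rather than waiting on it bounds latency at roughly `timeout_s` per market, at the cost of a smaller ensemble for that forecast.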

Challenges we ran into

  • Raw event titles often return no news; we needed an LLM analyst to generate short search queries
  • Latency — multiple models + several Serper calls per market; batch runs need high timeouts
  • Models default to extreme probabilities; calibration and market blending were required
  • Hard markets (politics, entertainment, long-horizon) hurt our Brier score more than post-match sports markets with clear headlines
  • Balancing offline wins on resolved sets vs. performance on the open hackathon slate

Accomplishments that we're proud of

  • End-to-end Forecasting-track agent with research-backed design (retrieval + silicon-crowd ensemble)
  • v3 LangGraph pipeline shipping as production default
  • Measurable offline Brier gains over the 0.25 baseline on resolved-market evals
  • Reproducible eval harness (smallTest for cheap runs, full sample-resolved for regression)

What we learned

  • Search + structured prompts beat parametric memory on time-sensitive markets
  • Ensemble diversity (models and prompts) matters more than one strong model
  • Calibration is part of the product, not an afterthought, on a Brier leaderboard
  • Small resolved benchmarks are useful for iteration but not a guarantee on the live slate

What's next for TomMind AI

  • Related-market consistency across tickers on the same event
  • Rigorous A/B (with vs. without search) on fixed eval sets
  • Larger regression sets (Prophet Arena 100, pa1200 cleaned data)
  • Faster, cheaper runs (parallel search, fallback models, tighter concurrency)

Built With

  • ai-prophet-cli
  • digitalocean-app-platform
  • docker
  • fastapi
  • json/csv
  • kalshi-api
  • langgraph
  • openai-sdk
  • openrouter (gpt-4o, phi-4, gemini-flash)
  • pydantic
  • python-3.13
  • serper
  • uv
  • uvicorn