Inspiration
At Prophet Hacks, we wanted an agent that forecasts like a superforecaster—not from stale memory, but from fresh evidence, multiple models, and calibrated probabilities. TomMind AI targets the Forecasting track: minimize Brier score on real Kalshi-style prediction markets.
What it does
TomMind AI exposes POST /predict (compatible with the ai-prophet CLI). Given a market (title, rules, category, close_time), it returns:
- p_yes: probability in [0.01, 0.99]
- rationale: short explanation
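As a rough illustration of that contract, the request and response shapes might look like the sketch below. It uses stdlib dataclasses rather than the project's actual Pydantic models. The class names, the string type for close_time, and the clamping behaviour are assumptions; only the field names and the [0.01, 0.99] range come from the description above.

```python
from dataclasses import dataclass, asdict

# Bounds stated in the write-up for p_yes.
P_MIN, P_MAX = 0.01, 0.99


@dataclass
class MarketRequest:  # hypothetical name
    title: str
    rules: str
    category: str
    close_time: str  # assumed to be an ISO 8601 timestamp string


@dataclass
class Prediction:  # hypothetical name
    p_yes: float
    rationale: str

    def __post_init__(self) -> None:
        # Assumption: out-of-range values are clamped rather than rejected.
        self.p_yes = min(P_MAX, max(P_MIN, float(self.p_yes)))


def to_response(pred: Prediction) -> dict:
    """Serialize a prediction into the JSON-style body returned by POST /predict."""
    return asdict(pred)
```

Clamping keeps a single overconfident answer from producing the worst possible Brier penalty of 1.0 on a market that resolves the other way.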
Pipeline (v3, default):
- Load Kalshi implied price (when available) as context
- Analyst → 2–3 keyword queries → Serper news (window anchored to close_time)
- Parallel ensemble: GPT-4o, Phi-4, Gemini Flash with different prompt styles
- Median aggregate → blend with market price (stronger when models disagree) → calibrate toward 0.5
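The last step of the pipeline can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the linear rule that raises the market weight with model spread, and the default values for base_market_weight, spread_gain, and shrink are all assumptions chosen for the example.

```python
from statistics import median
from typing import Optional, Sequence


def aggregate(
    model_probs: Sequence[float],
    market_price: Optional[float] = None,
    base_market_weight: float = 0.2,  # assumed default
    spread_gain: float = 1.0,         # assumed default
    shrink: float = 0.9,              # assumed default; 1.0 disables calibration
) -> float:
    """Median of the ensemble, blended with the market price, shrunk toward 0.5."""
    p = median(model_probs)

    if market_price is not None:
        # The more the models disagree, the more weight the market price gets.
        spread = max(model_probs) - min(model_probs)
        w = min(1.0, base_market_weight + spread_gain * spread)
        p = (1.0 - w) * p + w * market_price

    # Calibrate toward 0.5 to damp extreme probabilities.
    p = 0.5 + shrink * (p - 0.5)

    # Keep the result inside the range the endpoint promises.
    return min(0.99, max(0.01, p))
```

With these example defaults, three models that agree on 0.7 and a market at 0.5 land at about 0.644, while models at 0.2, 0.5, and 0.9 with a market at 0.6 land at about 0.581, much closer to the market price.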
Offline eval: on Prophet’s sample-resolved (26 markets) and our 5-market smallTest slice, Brier is well below the 0.25 “always 50%” baseline (e.g. ~0.06–0.08 on the full resolved set in local runs). These are dev benchmarks, not the final hackathon leaderboard.
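For reference, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, which is why an "always 50%" forecaster scores exactly 0.25. A minimal sketch (the function name is illustrative and the sample data is made up, not taken from the eval sets):

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean of (p - y)^2 over markets, where y is 1 for YES and 0 for NO."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Made-up outcomes for illustration only.
outcomes = [1, 0, 1, 0]
baseline = brier_score([0.5] * len(outcomes), outcomes)  # always 50%
sharper = brier_score([0.8, 0.2, 0.7, 0.1], outcomes)    # confident and correct
```

Lower is better: a confident correct forecast drives the score toward 0, while a confident wrong one is penalized up to 1.0 per market.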
How we built it
- FastAPI server + LangGraph orchestration (parallel forecasters via Send)
- OpenRouter for LLMs; Serper for news; Kalshi public API for market priors
- Prompt variants inspired by forecasting research (superforecaster, base-rate, scratchpad / 7-step)
- evaluate_local.py + Prophet CLI forecast retrieve/evaluate for Brier scoring
- Iteration: single-model MVP → analyst + search → v3 ensemble + calibration + market blend
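The parallel fan-out of forecasters can be illustrated without LangGraph. The sketch below uses plain asyncio as a stand-in for the Send-based graph, so it shows the pattern rather than the project's orchestration code. The forecaster is a placeholder that sleeps instead of calling an LLM, and all names and the timeout value are assumptions.

```python
import asyncio
from typing import Awaitable, List, Optional


async def fake_forecaster(p: float, delay_s: float) -> float:
    """Placeholder for one LLM forecaster; a real one would call a model API."""
    await asyncio.sleep(delay_s)
    return p


async def run_ensemble(calls: List[Awaitable[float]], timeout_s: float) -> List[float]:
    """Run all forecasters concurrently and drop any that exceed the timeout."""

    async def guarded(call: Awaitable[float]) -> Optional[float]:
        try:
            return await asyncio.wait_for(call, timeout=timeout_s)
        except asyncio.TimeoutError:
            return None  # a slow model should not block the whole forecast

    results = await asyncio.gather(*(guarded(c) for c in calls))
    return [r for r in results if r is not None]


def demo() -> List[float]:
    calls = [
        fake_forecaster(0.6, 0.0),
        fake_forecaster(0.7, 0.0),
        fake_forecaster(0.9, 1.0),  # simulated slow model, cut off by the timeout
    ]
    return asyncio.run(run_ensemble(calls, timeout_s=0.2))
```

Dropping timed-out forecasters instead of failing the request fits a median aggregate, which still works when fewer models respond.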
Challenges we ran into
- Raw event titles often return no news; we needed an LLM analyst to generate short search queries
- Latency — multiple models + several Serper calls per market; batch runs need high timeouts
- Models default to extreme probabilities; calibration and market blending were required
- Hard markets (politics, entertainment, long-horizon) hurt more than post-match sports with clear headlines
- Balancing offline wins on resolved sets vs. performance on the open hackathon slate
Accomplishments that we're proud of
- End-to-end Forecasting-track agent with research-backed design (retrieval + silicon-crowd ensemble)
- v3 LangGraph pipeline shipping as production default
- Measurable offline Brier gains over the 0.25 baseline on resolved-market evals
- Reproducible eval harness (smallTest for cheap runs, full sample-resolved for regression)
What we learned
- Search + structured prompts beat parametric memory on time-sensitive markets
- Ensemble diversity (models and prompts) matters more than one strong model
- Calibration is part of the product, not an afterthought, on a Brier leaderboard
- Small resolved benchmarks are useful for iteration but not a guarantee on the live slate
What's next for TomMind AI
- Related-market consistency across tickers on the same event
- Rigorous A/B (with vs. without search) on fixed eval sets
- Larger regression sets (Prophet Arena 100, pa1200 cleaned data)
- Faster, cheaper runs (parallel search, fallback models, tighter concurrency)
Built With
- ai-prophet-cli
- digitalocean-app-platform
- docker
- fastapi
- gemini-flash
- json/csv
- kalshi-api
- langgraph
- openai-sdk
- openrouter
- gpt-4o
- phi-4
- pydantic
- python-3.13
- serper
- uv
- uvicorn