TomMind AI

Inspiration

At Prophet Hacks, we wanted an agent that forecasts like a superforecaster—not from stale memory, but from fresh evidence, multiple models, and calibrated probabilities. TomMind AI targets the Forecasting track: minimize Brier score on real Kalshi-style prediction markets.

What it does

TomMind AI exposes POST /predict (compatible with the ai-prophet CLI). Given a market (title, rules, category, close_time), it returns:

p_yes — probability in [0.01, 0.99]
rationale — short explanation
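The request and response shapes above can be sketched as plain data classes. This is a minimal illustration, not the project's actual schema: the class names `MarketRequest` and `PredictResponse`, the helper `build_response`, the string type for `close_time`, and the choice to clamp (rather than reject) out-of-range values are all assumptions. The project lists pydantic in its stack; standard-library dataclasses are used here only to keep the sketch dependency-free.

```python
from dataclasses import dataclass, asdict

# Output range stated in the write-up: p_yes lies in [0.01, 0.99].
P_MIN, P_MAX = 0.01, 0.99


@dataclass
class MarketRequest:
    """Hypothetical request body for POST /predict."""
    title: str
    rules: str
    category: str
    close_time: str  # assumed to be an ISO 8601 string; the wire format is not given


@dataclass
class PredictResponse:
    """Hypothetical response body: a probability plus a short rationale."""
    p_yes: float
    rationale: str

    def __post_init__(self) -> None:
        # Assumption: out-of-range values are clamped into [0.01, 0.99].
        self.p_yes = min(P_MAX, max(P_MIN, float(self.p_yes)))


def build_response(raw_p: float, rationale: str) -> dict:
    """Return a JSON-serializable dict with p_yes clamped into range."""
    return asdict(PredictResponse(p_yes=raw_p, rationale=rationale))
```

Keeping the probability strictly inside (0, 1) means a single confidently wrong answer can never cost the full squared error of 1.0 on a Brier-scored market.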

Pipeline (v3, default):

Load Kalshi implied price (when available) as context
Analyst → 2–3 keyword queries → Serper news (window anchored to close_time)
Parallel ensemble — GPT-4o, Phi-4, Gemini Flash with different prompt styles
Median aggregate → blend with market price (stronger when models disagree) → calibrate toward 0.5
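The last pipeline step (median, market blend, calibration) could look roughly like the sketch below. The write-up does not give the actual weights or formulas, so everything numeric here is an assumption: the function name `aggregate`, the use of population standard deviation as the disagreement measure, the `max_market_weight`, `disagreement_scale`, and `shrink` parameters, and the linear shrinkage toward 0.5.

```python
from statistics import median, pstdev
from typing import Optional, Sequence


def aggregate(
    model_probs: Sequence[float],
    market_price: Optional[float] = None,
    max_market_weight: float = 0.5,    # assumed cap on how much the market can count
    disagreement_scale: float = 0.25,  # assumed spread at which the cap is reached
    shrink: float = 0.1,               # assumed calibration strength toward 0.5
) -> float:
    """Median of model forecasts, blended with the market price, then shrunk toward 0.5."""
    p = median(model_probs)

    if market_price is not None:
        # Disagreement measured as the standard deviation of the model forecasts.
        # The market weight grows from 0 (full agreement) up to max_market_weight.
        spread = pstdev(model_probs)
        w = max_market_weight * min(1.0, spread / disagreement_scale)
        p = (1.0 - w) * p + w * market_price

    # Linear shrinkage toward 0.5 to damp overconfident forecasts.
    p = 0.5 + (1.0 - shrink) * (p - 0.5)

    # Keep the result inside the endpoint's stated output range.
    return min(0.99, max(0.01, p))
```

With these assumed defaults, three models that all say 0.70 ignore the market and return about 0.68, while forecasts of 0.1, 0.5, and 0.9 against a market price of 0.2 give the market its full weight and return about 0.365.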

Offline eval: on Prophet’s sample-resolved set (26 markets) and our 5-market smallTest slice, the Brier score is well below the 0.25 “always 50%” baseline (e.g. ~0.06–0.08 on the full resolved set in local runs). These are dev benchmarks, not the final hackathon leaderboard.
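For reference, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome. A forecast of 0.5 scores (0.5 - o)^2 = 0.25 whether the outcome o is 0 or 1, which is where the 0.25 baseline comes from. The helper below is a generic sketch; the name `brier_score` is illustrative and is not taken from evaluate_local.py.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes (1 = YES)."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# An "always 50%" forecaster scores 0.25 regardless of how markets resolve.
baseline = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Lower is better: a perfect forecaster scores 0.0, and confident wrong answers are penalized quadratically.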

How we built it

FastAPI server + LangGraph orchestration (parallel forecasters via Send)
OpenRouter for LLMs; Serper for news; Kalshi public API for market priors
Prompt variants inspired by forecasting research (superforecaster, base-rate, scratchpad / 7-step)
evaluate_local.py + Prophet CLI forecast retrieve / evaluate for Brier scoring
Iteration: single-model MVP → analyst + search → v3 ensemble + calibration + market blend
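The parallel fan-out to several forecasters can be illustrated without the project's actual graph code. The sketch below swaps in plain `asyncio` as a stand-in for LangGraph's Send, and stub coroutines as stand-ins for the OpenRouter model calls; `run_ensemble`, `make_stub`, and the `timeout_s` parameter are hypothetical names. It also shows one way a slow or failing model could be dropped rather than blocking the whole prediction.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# A forecaster takes the market question and returns a probability of YES.
Forecaster = Callable[[str], Awaitable[float]]


async def run_ensemble(
    question: str,
    forecasters: Dict[str, Forecaster],
    timeout_s: float = 30.0,  # assumed per-model timeout
) -> List[float]:
    """Run all forecasters concurrently; drop any that time out or raise."""

    async def guarded(fn: Forecaster) -> Optional[float]:
        try:
            return await asyncio.wait_for(fn(question), timeout=timeout_s)
        except Exception:
            # Timeouts and model errors are treated as a missing vote.
            return None

    results = await asyncio.gather(*(guarded(fn) for fn in forecasters.values()))
    return [p for p in results if p is not None]


def make_stub(p: float, delay: float = 0.0) -> Forecaster:
    """Stub forecaster for illustration; a real one would call an LLM."""

    async def forecast(question: str) -> float:
        await asyncio.sleep(delay)
        return p

    return forecast
```

A call such as `asyncio.run(run_ensemble("Will X happen?", {"a": make_stub(0.6), "b": make_stub(0.7)}))` returns the list of surviving probabilities, which can then be passed to the aggregation step.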

Challenges we ran into

Raw event titles often return no news; we needed an LLM analyst to generate short search queries
Latency — multiple models + several Serper calls per market; batch runs need high timeouts
Models default to extreme probabilities; calibration and market blending were required
Hard markets (politics, entertainment, long-horizon) hurt our Brier score more than post-match sports markets with clear headlines
Balancing offline wins on resolved sets vs. performance on the open hackathon slate

Accomplishments that we're proud of

End-to-end Forecasting-track agent with research-backed design (retrieval + silicon-crowd ensemble)
v3 LangGraph pipeline ships as the production default
Measurable offline Brier gains over the 0.25 baseline on resolved-market evals
Reproducible eval harness (smallTest for cheap runs, full sample-resolved for regression)

What we learned

Search + structured prompts beat parametric memory on time-sensitive markets
Ensemble diversity (models and prompts) matters more than relying on one strong model
Calibration is part of the product, not an afterthought, on a Brier leaderboard
Small resolved benchmarks are useful for iteration but not a guarantee on the live slate

What's next for TomMind AI

Related-market consistency across tickers on the same event
Rigorous A/B (with vs. without search) on fixed eval sets
Larger regression sets (Prophet Arena 100, pa1200 cleaned data)
Faster, cheaper runs (parallel search, fallback models, tighter concurrency)

Built With

ai-prophet-cli
digitalocean-app-platform
docker
fastapi
json/csv
kalshi-api
langgraph
openai-sdk
openrouter (gpt-4o, phi-4, gemini-flash)
pydantic
python-3.13
serper
uv
uvicorn

Updates

Dev Jayesh Patel started this project — May 17, 2026 09:38 PM EDT
