Inspiration

Prophet Arena's scoring is brutal: your number isn't "how accurate are you?" — it's $(\text{your Brier} - \text{market Brier}) \times \text{completion rate}$. You don't win by being smart on the hard questions; you win by matching the market on the questions where you have no edge, and being decisively right on the ones where you do.
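To make the scoring rule concrete, here is a minimal sketch. It assumes the harness uses the standard binary Brier score (mean squared error against the 0/1 outcome, lower is better); the function names `brier` and `relative_score` are ours for illustration, not the harness's API.

```python
from typing import Sequence


def brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def relative_score(
    ours: Sequence[float],
    market: Sequence[float],
    outcomes: Sequence[int],
    completion_rate: float,
) -> float:
    """(our Brier - market Brier) * completion rate, as stated above."""
    return (brier(ours, outcomes) - brier(market, outcomes)) * completion_rate


# Illustrative numbers only: we copy the market on two events and
# diverge (correctly) on the third.
outcomes = [1, 0, 1]
market = [0.7, 0.3, 0.6]
ours = [0.7, 0.3, 0.9]
delta = relative_score(ours, market, outcomes, completion_rate=1.0)
```

In this toy case the two copied events contribute nothing to the difference, and the whole gap comes from the single event where we diverged.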

Most LLM-as-forecaster baselines pick one prompt strategy and hope. We thought the answer was an ensemble — multiple analysts with genuinely different priors, blended in log-odds space, then anchored to the public market whenever our internal agreement is too weak to trust the divergence.

What it does

Ensemble Prophet is a forecasting agent that, given a binary or multi-outcome question, returns a calibrated probability and a human-readable rationale. For each event it:

  1. Researches — generates category-aware search queries, hits DuckDuckGo (with UA rotation + throttle backoff), extracts content with trafilatura, compiles an ~8K-char brief.
  2. Runs three independent analysts in parallel — an evidence-weighted analyst (weighs sources by recency/authoritativeness), a base-rate analyst (anchors to the outside view via reference-class reasoning), and a contrarian analyst (stress-tests the consensus).
  3. Adds a market signal — exact-ticker lookup against Kalshi, fuzzy-title fallback to Polymarket.
  4. Deliberates — a meta-LLM call adjudicates (not averages) the four estimates.
  5. Ensembles in log-odds space with confidence-weighted shrinkage:

$$\hat{p} = \sigma\!\left(\frac{\sum_i w_i \cdot \text{logit}(p_i)}{\sum_i w_i}\right)$$

where $p_i$ is analyst $i$'s probability, $w_i$ is that analyst's self-reported confidence, and $\sigma$ is the logistic function. The result is then shrunk toward 0.5 by a factor that depends on internal agreement and time-to-close.
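The blend can be sketched in a few lines of Python. The weighted log-odds mean follows the formula directly. The clamp `EPS` and the linear form of `shrink_toward_half` are our assumptions for illustration, and the shrinkage factor is taken as an input because its dependence on agreement and time-to-close is not spelled out here.

```python
import math
from typing import Sequence

EPS = 1e-6  # assumed clamp so logit() stays finite at exactly 0 or 1


def logit(p: float) -> float:
    p = min(max(p, EPS), 1.0 - EPS)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def blend_log_odds(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Confidence-weighted mean of logits, mapped back to a probability."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    z = sum(w * logit(p) for p, w in zip(probs, weights)) / total
    return sigmoid(z)


def shrink_toward_half(p: float, shrinkage: float) -> float:
    """Assumed linear shrinkage: 0 leaves p unchanged, 1 returns exactly 0.5."""
    return 0.5 + (1.0 - shrinkage) * (p - 0.5)


# Illustrative inputs: three analysts with self-reported confidences.
p_blend = blend_log_odds([0.70, 0.80, 0.55], [0.9, 0.6, 0.3])
p_final = shrink_toward_half(p_blend, shrinkage=0.10)
```

Because the weights are normalized, only their ratios matter: an analyst reporting confidence 0.9 counts three times as much as one reporting 0.3.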

  6. Market-anchors: if the ensemble disagrees with the market by ≥10% AND internal agreement is weak, it blends 60/40 toward the market; if agreement is strong, it trusts itself. This caps downside on events with no edge and preserves upside where the strategies converge against the crowd.
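The anchoring decision can be sketched as a small pure function. Several details are assumptions on our part and are labeled in the code: the 10% gap is read as 10 percentage points, "60/40 toward the market" is read as 60% weight on the market price, the blend is done in probability space, and `agreement` is a hypothetical score in [0, 1] with a hypothetical 0.7 cutoff for "strong".

```python
from typing import Optional


def market_anchor(
    p_ensemble: float,
    p_market: Optional[float],
    agreement: float,
    *,
    gap_threshold: float = 0.10,       # assumed: 10 percentage points
    agreement_threshold: float = 0.7,  # hypothetical cutoff for "strong"
    market_weight: float = 0.6,        # assumed: 60% weight on the market
) -> float:
    """Blend toward the market only when we diverge AND disagree internally."""
    if p_market is None:
        # No market signal was found, so there is nothing to anchor to.
        return p_ensemble
    diverges = abs(p_ensemble - p_market) >= gap_threshold
    weak = agreement < agreement_threshold
    if diverges and weak:
        return market_weight * p_market + (1.0 - market_weight) * p_ensemble
    return p_ensemble
```

For example, `market_anchor(0.80, 0.50, agreement=0.3)` pulls the forecast to 0.62, while the same call with `agreement=0.9` leaves it at 0.80.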

It ships three interfaces: a CLI (prophet forecast predict --strategy ensemble), a REST /predict endpoint with batch support, and an OpenAI-compatible /v1/chat/completions endpoint that drops directly into the Prophet Arena evaluation harness.

How we built it

  • Python 3.12, FastAPI server, dataclasses + Protocol typing throughout.
  • LLM stack: tier-based fallback across Groq (primary, fast), OpenRouter, and Anthropic. Two named tiers — research (cheap, parallel) and reasoning (smarter, sequential).
  • Research: DuckDuckGo HTML search with 3-UA rotation, jittered inter-query pauses, POST→GET retry on 202/403/429 throttle codes. Trafilatura for content extraction.
  • Ensemble math: log-odds blending with adaptive shrinkage, temporal decay buckets (1.0 / 0.8 / 0.5 / 0.3 by time-to-close), and Bayesian-style confidence rebalancing when research comes back thin.
  • Caching: file-based JSON cache with thread-safe lock + 6-hour TTL, because the eval harness polls the same events repeatedly over a two-week window.
  • Deployment: Dockerized, deployed to Railway. Live at https://ai-prophet-production.up.railway.app.
  • Calibration tool: a built-in calibrate CLI that runs the agent blind on resolved events and emits per-strategy Brier, calibration curves, ECE, and an 8-value shrinkage sweep with an auto-recommended winner.
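As an illustration of the caching bullet, here is a minimal sketch of a file-backed JSON cache with a lock and a 6-hour TTL. The class name `JsonFileCache`, the on-disk layout, and the write-to-temp-then-rename step are our assumptions for the example; the lock only protects threads inside one process.

```python
import json
import threading
import time
from pathlib import Path
from typing import Any, Optional


class JsonFileCache:
    """Single-file JSON cache; entries expire after ttl_seconds."""

    def __init__(self, path: str, ttl_seconds: float = 6 * 3600) -> None:
        self._path = Path(path)
        self._ttl = ttl_seconds
        self._lock = threading.Lock()

    def _load(self) -> dict:
        try:
            return json.loads(self._path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            return {}  # missing or corrupt file is treated as an empty cache

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.time() if now is None else now
        with self._lock:
            entry = self._load().get(key)
        if entry is None or now - entry["stored_at"] > self._ttl:
            return None
        return entry["value"]

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        with self._lock:
            data = self._load()
            data[key] = {"stored_at": now, "value": value}
            # Write to a sibling temp file, then rename over the cache file,
            # so a reader never sees a half-written file.
            tmp = self._path.with_suffix(".tmp")
            tmp.write_text(json.dumps(data), encoding="utf-8")
            tmp.replace(self._path)
```

The optional `now` argument exists only to make expiry testable without sleeping; normal callers would omit it.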

Challenges we ran into

  • DuckDuckGo throttling. Our initial Brier was 0.26, slightly worse than the 0.25 an always-0.5 forecast scores, which made us suspicious. Logs showed ~30% of search calls returning 202/403 because we were bursting too fast. UA rotation + jittered pauses + POST→GET retry dropped Brier to 0.20.
  • OpenAI ChatCompletion format mismatch. Discovered late that the eval harness sends events via /v1/chat/completions with a specific JSON-string-inside-content protocol, not raw /predict JSON. Reverse-engineered it from the evaluator source and built a parser that extracts the event title + market list from chat messages and re-serializes the response correctly.
  • Kalshi auth. The public elections endpoint is unauthenticated but has limited inventory; the real trading-api requires bearer auth. Made the URL switch automatic based on whether KALSHI_API_KEY is set.
  • Calibration discipline. Our first-pass strategies were all overconfident. The calibration tool's per-strategy Brier breakdown showed the evidence-weighted analyst alone scored 0.043 — better than the ensemble. That drove a strategy-confidence rebalance and a DEFAULT_SHRINKAGE tune from 0.15 to 0.10.
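The throttling fix from the first bullet can be sketched as a retry loop. This is a sketch under stated assumptions, not our actual client: `send` is a hypothetical injected callable that performs the HTTP request and returns `(status, body)`, the user-agent strings are placeholders, and the exponential growth of the pause is our choice for the example (the write-up only says the pauses are jittered).

```python
import itertools
import random
import time
from typing import Callable, Optional, Sequence, Tuple

THROTTLE_CODES = {202, 403, 429}  # the throttle responses named above

Send = Callable[[str, str, str], Tuple[int, str]]  # (method, url, ua) -> (status, body)


def fetch_with_backoff(
    send: Send,
    url: str,
    *,
    user_agents: Sequence[str] = ("ua-1", "ua-2", "ua-3"),  # placeholder strings
    max_attempts: int = 4,
    base_pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Try POST first; on a throttle code, rotate UA, fall back to GET, and pause."""
    agents = itertools.cycle(user_agents)
    method = "POST"
    for attempt in range(max_attempts):
        status, body = send(method, url, next(agents))
        if status not in THROTTLE_CODES:
            return body if status == 200 else None
        method = "GET"  # POST was throttled: retry as GET
        if attempt < max_attempts - 1:
            # Jitter factor in [0.5, 1.5) so parallel workers do not sync up.
            sleep(base_pause * (2 ** attempt) * (0.5 + rng()))
    return None
```

Injecting `send`, `sleep`, and `rng` keeps the loop testable without touching the network or the clock.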

Accomplishments that we're proud of

  • Brier 0.0977 vs always-0.5 baseline 0.2500, about 2.6× lower (0.2500 / 0.0977 ≈ 2.56). First-pass blind run on 19 resolved binary events.
  • 324 unit tests, all green — ensemble math, log-odds invariants, temporal bucket boundaries, cache TTL, market-anchor decision branches, fast-resolve smart routing, OpenAI ChatCompletion handler, batch endpoint, request logging.
  • Three live interfaces — CLI, REST, OpenAI-compatible — running on Railway with ~5 s per cached event and ~30 s per fresh event.
  • A bonus PR: while auditing the codebase we found two long-broken wiring bugs in the unrelated trade CLI (dry_run= vs paper= mismatch, and a stale call to a removed BettingEngine.on_forecast method). Filed upstream as PR #29.

What we learned

  • Log-odds is the right ensemble space for probabilities. When analysts agree on a direction, averaging in probability space lands closer to 0.5 than averaging logits does; averaging logits preserves confident agreement and decays disagreement gracefully.
  • Calibration > capability. Our most "capable" strategy (contrarian, lots of reasoning) was the worst-calibrated. Evidence-weighted, the simplest one, was the best. Measure before you trust.
  • The market is a co-author, not a competitor. Once you internalize $(\text{ours} - \text{market}) \times \text{completion}$, you stop trying to beat the market on every question and start trying to match it on the ones where you have no edge. Different — and easier — game.
  • Shipping multiple endpoints is cheap insurance. We almost shipped /predict-only. Adding the OpenAI-compatible route took an afternoon and was the only thing the eval harness actually called.
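As a worked illustration of the log-odds point (the numbers are made up for the example, not taken from our runs), take three equally weighted analysts at 0.99, 0.99, and 0.60:

$$\bar{p}_{\text{prob}} = \frac{0.99 + 0.99 + 0.60}{3} = 0.86$$

$$\bar{p}_{\text{logit}} = \sigma\!\left(\frac{\ln 99 + \ln 99 + \ln 1.5}{3}\right) \approx \sigma(3.199) \approx 0.961$$

The logit average stays close to the confident pair, while the probability average gives up much of that confidence. For a symmetric split such as 0.90 vs 0.10, the logits cancel ($\ln 9 - \ln 9 = 0$) and both methods return 0.5.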

What's next for Ensemble Prophet: Multi-Strategy Forecasting Agent

  • More analysts — a Bayesian model-averaging analyst, a "red-team" analyst that searches specifically for disqualifying evidence, and a category-specialist router that swaps in different analyst weights for Sports vs Economics vs Politics.
  • Continuous calibration — rerun the calibration sweep nightly during the two-week eval window and auto-adjust DEFAULT_SHRINKAGE.
  • Real betting integration — the codebase has a trade CLI with a Kalshi adapter. Once the trading-track API stabilizes, wire the ensemble's p_yes into the betting engine and run paper trades.
  • Learned market-anchor blend — right now the blend weight is a fixed 0.6/0.4. Learn it per category from historical Brier deltas.
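One simple way the learned blend could work is a per-category grid search over the market weight, picking the value with the lowest historical Brier. This is a sketch of one possible approach, not something we have built: the record layout, the function name `fit_blend_weights`, and the 0.0 to 1.0 grid are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

# Hypothetical record layout: (category, p_ensemble, p_market, outcome in {0, 1})
Record = Tuple[str, float, float, int]


def fit_blend_weights(
    records: Iterable[Record],
    grid: Optional[Sequence[float]] = None,
) -> Dict[str, float]:
    """Per category, pick the market weight that minimizes historical Brier."""
    grid = list(grid) if grid is not None else [i / 10 for i in range(11)]
    by_category: Dict[str, List[Tuple[float, float, int]]] = defaultdict(list)
    for category, p_ens, p_mkt, outcome in records:
        by_category[category].append((p_ens, p_mkt, outcome))

    best: Dict[str, float] = {}
    for category, rows in by_category.items():

        def brier_at(w: float, rows=rows) -> float:
            # Brier score of the blended forecast w * market + (1 - w) * ensemble.
            return sum(
                (w * p_mkt + (1.0 - w) * p_ens - y) ** 2 for p_ens, p_mkt, y in rows
            ) / len(rows)

        best[category] = min(grid, key=brier_at)
    return best
```

With only a handful of resolved events per category, a fit like this would overfit easily, so in practice it would likely need a prior toward the current fixed weight or a minimum sample size per category.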
