Brier Patch: A Market-Anchored Forecasting Agent
Markets as informative priors, LLMs as evidence sources, calibration as the bound. A forecasting cascade engineered for 14 days of unattended evaluation.
Inspiration
Most LLM-based forecasting agents treat the LLM as the forecaster: take a question, ask Claude or GPT for a probability, return it. That approach ignores the most calibrated source of forecasts available, which is the prediction market price itself. Markets aggregate the beliefs of bettors with skin in the game, and on liquid questions they are very hard to beat. Prophet Arena's own paper documents that LLMs aggregate information more slowly than markets near resolution.
We wanted to invert that stack. Treat Kalshi and Polymarket prices as the prior; use a cross-vendor LLM ensemble only where markets lack signal (illiquid books, multi-outcome questions outside market coverage, edge bands) or where a structural reason exists for the model to add value. Every deviation from the market prior is treated as a bounded Bayesian update.
Two facts about Brier scoring drove the architecture.
First, the quadratic penalty means confident wrong predictions are catastrophically expensive. A 0.95 prediction on the wrong side costs 0.9025, nearly twice the 0.49 of a hedged 0.70 and nearly four times the 0.25 of a flat 0.50. So calibration beats sharpness: pulling the LLM toward 0.5 in the tails is cheap Brier insurance against the overconfidence mode that dominates losses.
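To make the quadratic penalty concrete, here is a minimal sketch of the per-outcome squared error for a few forecasts that all land on the wrong side. The function name brier_penalty is ours for illustration and is not part of the agent.

```python
def brier_penalty(p_yes: float, outcome: int) -> float:
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (p_yes - outcome) ** 2


# Forecasts of YES on an event that resolves NO (outcome = 0).
for p in (0.50, 0.70, 0.95):
    print(f"p={p:.2f}  penalty if wrong={brier_penalty(p, 0):.4f}")
```

The three penalties come out to 0.25, 0.49, and 0.9025, so the last 0.25 of confidence (0.70 to 0.95) nearly doubles the cost of being wrong.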
Second, completion rate is multiplicative on the final score. A hung or crashed request hurts as much as a wrong prediction. So we engineered the agent so that no single failure can break a /predict response.
How we built it
The agent is a tiered fallback chain. Each tier acts only when it has a structural reason to.
For binary events (2 outcomes):
- Kalshi market lookup by market_ticker. Depth-weighted midprice; more demand on the bid shifts the true price toward the ask.
- Tail-anchor triage. When Kalshi is confidently at one tail (p < 0.05 or p > 0.95) with 24h volume above $500, return the market price directly with a 3% safety shrink. Tails are settled consensus; LLM disagreement there only hurts Brier.
- Cross-venue agreement gate. Fetch Polymarket and check whether the two venues agree. When the gap is within 3 cents, skip the blend (no information in agreement); otherwise volume-weighted-blend the two prices.
- Volume-weighted shrinkage toward 0.5, tuned aggressive on thin books and gentle on deep ones.
- Typed external-data priors when markets are silent: NWS sigmoid for weather, yfinance plus lognormal for crypto thresholds, ESPN moneylines for sports games, Manifold for political and tournament questions.
- Cross-vendor LLM ensemble (Claude Opus extended-thinking + GPT-5-mini + Gemini 2.5 Flash) with shared web search: Anthropic anchors the search; its findings are injected as search_context into the OpenAI and Gemini calls running in parallel. One set of searches across three reasoning passes.
- Tail-aware non-linear LLM shrinkage across three tiers, with extra penalty in the deep tails. Decisive (alpha = 0.02) when the rationale describes a resolved outcome. Grounded (alpha = 0.05) when it cites current data. Speculative (alpha = 0.15) on base-rate reasoning.
- Market sanity guardrail. If our final probability deviates by more than 0.30 from a deep liquid Kalshi mid, anchor 60/40 toward the market.
- Path-stratified calibration, where every prediction is labeled with which pipeline branch produced it (tail-anchor, kalshi-anchor, kalshi+poly-blend, llm-grounded, llm-speculative, multi-outcome-kalshi, and so on). The nightly refit fits one table per branch. Per-bucket yes-rates are Beta-Bernoulli shrunk toward the bucket's mean prediction so that small-N buckets behave sensibly: the posterior mean is (n × mean_actual + N_0 × mean_pred) / (n + N_0), with N_0 = 10. The final shift is bounded to plus or minus 0.05 from the raw value, which is AGM-style bounded belief revision applied to calibration.
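Three of the steps above (the depth-weighted midprice, the market sanity guardrail, and the bounded calibration shift) can be sketched roughly as follows. This is an illustration under assumptions, not the agent's actual code: all function and parameter names are hypothetical, the midprice uses the standard size-weighted formula, and "60/40 toward the market" is read as 60% weight on the market price.

```python
def depth_weighted_mid(bid: float, ask: float,
                       bid_size: float, ask_size: float) -> float:
    """Size-weighted mid: heavier bid depth pulls the estimate toward the ask."""
    total = bid_size + ask_size
    if total <= 0:
        return (bid + ask) / 2.0  # no depth information: plain midpoint
    # Each side's price is weighted by the opposite side's resting size.
    return (bid * ask_size + ask * bid_size) / total


def market_guardrail(p_model: float, p_market: float,
                     max_gap: float = 0.30,
                     market_weight: float = 0.60) -> float:
    """Assumed reading of the guardrail: blend 60/40 toward the market
    when the model deviates by more than max_gap from a deep liquid mid."""
    if abs(p_model - p_market) > max_gap:
        return market_weight * p_market + (1.0 - market_weight) * p_model
    return p_model


def calibrate(raw_p: float, n: int, mean_actual: float, mean_pred: float,
              n0: float = 10.0, max_shift: float = 0.05) -> float:
    """Shrunk bucket yes-rate, then a shift bounded to +/- max_shift of raw_p."""
    # Posterior mean from the write-up: (n*mean_actual + N_0*mean_pred)/(n + N_0).
    posterior = (n * mean_actual + n0 * mean_pred) / (n + n0)
    lo, hi = raw_p - max_shift, raw_p + max_shift
    bounded = min(max(posterior, lo), hi)
    return min(max(bounded, 0.0), 1.0)
```

With a bucket of only five resolved predictions, calibrate(0.80, 5, 0.40, 0.80) has a posterior of about 0.667, but the bounded shift caps the output at 0.75. With no data at all (n = 0) the posterior equals the bucket's mean prediction, so the raw value passes through unchanged.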
For multi-outcome events (3+ outcomes), Kalshi exposes a parent event with N independent child YES/NO markets. We detect K (single-winner vs top-K) using a three-tier signal hierarchy: Kalshi's mutually_exclusive flag is primary, an explicit "top N" text regex on the title is secondary, and the rounded sum of child probabilities is tertiary (with an ambiguity guard at distance 0.30 from the nearest integer). The agent submits a distribution that sums to K rather than to 1 when the event is genuinely top-K. This matches the official scoring rule, which sums per-outcome squared errors as-is without normalization.
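A minimal sketch of the three-tier K detection described above. The function name, the regex, and the fallback to K = 1 when no tier gives a confident answer are our assumptions for illustration; only the tier ordering, the 0.30 ambiguity guard, and the mutually_exclusive flag come from the description.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern for titles such as "Top 4", "top-4", or "top 5".
_TOP_N = re.compile(r"\btop[\s-]*(\d+)\b", re.IGNORECASE)


def detect_k(mutually_exclusive: Optional[bool], title: str,
             child_probs: Sequence[float], ambiguity: float = 0.30) -> int:
    """Return K, the number of outcomes expected to resolve YES."""
    n = len(child_probs)

    # Tier 1: the venue's own flag says exactly one child can win.
    if mutually_exclusive:
        return 1

    # Tier 2: an explicit "top N" in the title, kept only when 2 <= N <= n-1.
    match = _TOP_N.search(title)
    if match:
        k = int(match.group(1))
        if 2 <= k <= n - 1:
            return k

    # Tier 3: rounded sum of child probabilities, rejected when the sum
    # sits 0.30 or more away from the nearest integer.
    total = sum(child_probs)
    k = round(total)
    if abs(total - k) < ambiguity and 2 <= k <= n - 1:
        return k

    # Assumed fallback: treat the event as single-winner.
    return 1
```

For example, a seven-team event titled "Bundesliga top-4 finish" resolves to K = 4 at tier 2, while an untitled event whose child prices sum to 1.6 is too ambiguous for tier 3 and falls back to K = 1.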
Challenges
The output contract evolved during the build. Early upstream code suggested {p_yes: float} was the submission format. The official spec was clarified to require {probabilities: [{market, probability}]} with markets matching event outcomes exactly. We refactored every code path through a single _wrap_binary and _normalize_distribution chokepoint to enforce this.
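As a rough sketch of what a binary chokepoint of this kind might look like: the function below is a hypothetical stand-in for _wrap_binary, and it assumes the response carries one entry per event outcome with the YES-side outcome listed first. Neither assumption is confirmed by the spec text quoted here.

```python
from typing import Dict, List, Sequence


def wrap_binary(p_yes: float, outcomes: Sequence[str]) -> Dict[str, List[dict]]:
    """Shape a single YES probability into the {probabilities: [...]} contract."""
    if len(outcomes) != 2:
        raise ValueError("wrap_binary expects exactly two outcomes")
    p = min(max(float(p_yes), 0.0), 1.0)  # clamp to a valid probability
    return {
        "probabilities": [
            # Market names are copied verbatim so they match the event outcomes.
            {"market": outcomes[0], "probability": p},
            {"market": outcomes[1], "probability": 1.0 - p},
        ]
    }
```

Routing every binary return through one function like this means a later change to the contract touches a single place.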
Sum-to-K vs sum-to-1 on top-K events. When the docs said "probabilities are normalized before scoring," we shipped sum-to-1 enforcement. Two organizers separately confirmed on Discord that the server does not in fact normalize, so sum-to-K is the right submission shape for events like Bundesliga top-4 or Eurovision top-5. We added a conservative K detector (an explicit numeric K in the title, accepted only when it lies between 2 and n-1) and threaded a target_sum parameter through every multi-outcome return path. Live probes against Kalshi confirmed that child prices naturally sum to K on these events.
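One way a target_sum parameter could work is sketched below. The function name is hypothetical, and the capping strategy (pin any outcome that would exceed 1.0 and rescale the rest) is our assumption; the write-up only states that the distribution sums to K.

```python
from typing import List, Sequence


def normalize_to_target_sum(probs: Sequence[float],
                            target_sum: float = 1.0) -> List[float]:
    """Rescale probs to sum to target_sum while keeping every entry <= 1."""
    n = len(probs)
    if n == 0:
        return []
    if not 0 < target_sum <= n:
        raise ValueError("target_sum must be in (0, n]")
    out = [max(float(p), 0.0) for p in probs]
    capped = set()  # indices pinned at 1.0
    for _ in range(n):
        free = [i for i in range(n) if i not in capped]
        remaining = target_sum - len(capped)
        free_total = sum(out[i] for i in free)
        if free_total <= 0.0:
            # Nothing to scale: spread the remaining mass evenly.
            for i in free:
                out[i] = remaining / len(free)
            break
        scale = remaining / free_total
        over = [i for i in free if out[i] * scale > 1.0]
        if not over:
            for i in free:
                out[i] *= scale
            break
        # Pin the overflowing entries and rescale the rest on the next pass.
        for i in over:
            out[i] = 1.0
            capped.add(i)
    return out
```

With target_sum = 1.0 this reduces to ordinary normalization; with target_sum = 4 on a six-outcome event, a 0.9 favorite is pinned at 1.0 and the remaining five outcomes share the other 3.0.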
Kalshi's status filter was rejecting settled markets. Our initial code treated status="finalized" and status="closed" as "no signal," falling back to the LLM ensemble. But those are exactly the states where Kalshi has the most signal: per-child result="yes"/"no" is the canonical answer. We accept these statuses now and use the result field directly when present. Coverage on our verification sweep went from 2/7 multi-outcome events to 7/7.
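The corrected behavior can be sketched as below. The status and result fields and the "finalized" and "closed" values come from the description above; the "active" status name, the mid key holding an already-normalized price, and the function name are assumptions for illustration and may not match Kalshi's actual payload.

```python
from typing import Optional

# Assumption: "active" stands in for whatever the live-trading status is called.
ACCEPTED_STATUSES = {"active", "closed", "finalized"}


def child_probability(market: dict) -> Optional[float]:
    """Prefer the settled result; otherwise fall back to the market price."""
    result = market.get("result")
    if result == "yes":
        return 1.0
    if result == "no":
        return 0.0
    if market.get("status") not in ACCEPTED_STATUSES:
        return None  # no usable signal; the caller falls through to the next tier
    mid = market.get("mid")  # hypothetical key, price already in [0, 1]
    return float(mid) if mid is not None else None
```

Checking result before status is the point of the fix: a settled child answers the question outright, so it should never be filtered out as "no signal".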
Defensive exception handling on sequential paths. The parallel branch of the ensemble caught vendor exceptions cleanly via per-future try/except, but two sequential paths (the shared-search anchor call and the single-model short-circuit) let exceptions propagate. A single SDK exception during the anchor would have killed the whole ensemble. We wrapped both call sites to treat exceptions as None returns, identical semantics to the per-vendor functions' internal handling.
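A minimal sketch of that wrapping pattern, assuming a generic helper rather than the agent's actual call sites. The name call_or_none and the logger name are hypothetical.

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("ensemble")


def call_or_none(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Run a vendor call; convert any exception into a None return."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Log the traceback, then report "no signal" to the caller.
        log.exception("vendor call failed; treating as no signal")
        return None
```

A sequential path would then call something like call_or_none(anchor_search, question) and treat a None result the same way it treats a vendor that returned nothing, where anchor_search is a placeholder for the real function.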
The local DNS on our dev machine does not resolve the Kalshi or Polymarket hosts, although production (Cloud Run) resolves them fine. For live local scripts we use a dnspython-based socket override.
Multi-outcome subtitle mapping is fuzzy. Polymarket structures events differently than Kalshi, prices return as integer cents on some endpoints but dollar decimals on others, and child subtitles vary ("Bayer Leverkusen" vs "Leverkusen"). We wrote a price reader that tries both conventions and a child-to-outcome matcher that does exact match first, then token-subset fallback.
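A sketch of the two helpers under stated assumptions: integers are read as cents, anything else is parsed as a number and treated as cents only when it exceeds 1, and the token-subset fallback gives up when more than one outcome matches. The function names and these tie-breaking rules are ours, not the agent's.

```python
import re
from typing import Optional, Sequence


def read_price(raw) -> Optional[float]:
    """Return a price in [0, 1] from either integer cents or dollar decimals."""
    if raw is None or isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        value = raw / 100.0  # assumed convention: ints are cents
    else:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            return None
        if value > 1.0:
            value /= 100.0  # e.g. "62" or 62.0 read as cents
    return value if 0.0 <= value <= 1.0 else None


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_child(subtitle: str, outcomes: Sequence[str]) -> Optional[str]:
    """Exact (case-insensitive) match first, then token-subset fallback."""
    norm = subtitle.strip().lower()
    for outcome in outcomes:
        if outcome.strip().lower() == norm:
            return outcome
    sub = _tokens(subtitle)
    if not sub:
        return None
    hits = []
    for outcome in outcomes:
        toks = _tokens(outcome)
        if toks and (sub <= toks or toks <= sub):
            hits.append(outcome)
    # Only accept an unambiguous fallback match.
    return hits[0] if len(hits) == 1 else None
```

Here match_child("Leverkusen", ["Bayer Leverkusen", "Bayern Munich"]) resolves to "Bayer Leverkusen" through the token-subset path, while a subtitle that is a subset of two different outcomes returns None rather than guessing.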
Cloud Run filesystem ephemerality. Predictions written to local disk are lost on container restart. We added a GCS mirror (one JSON object per event) so the daily calibration refit has a durable read source and post-hoc audit is possible.
What we learned
Calibration beats sharpness. Brier's quadratic penalty makes the confident-wrong path catastrophic. Tail-aware shrinkage on the LLM output is the most important single defense.
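To illustrate the idea, here is one possible shape for a tail-aware shrink. The three alpha values are the tiers described earlier; the functional form (a shrink toward 0.5 whose strength grows with the fourth power of the distance from 0.5) is purely our assumption, since the write-up does not give the actual formula.

```python
ALPHA = {"decisive": 0.02, "grounded": 0.05, "speculative": 0.15}


def shrink_llm(p: float, tier: str, tail_power: int = 4) -> float:
    """Pull an LLM probability toward 0.5, harder in the deep tails."""
    alpha = ALPHA[tier]
    distance = 2.0 * abs(p - 0.5)  # 0 at the center, 1 at p = 0 or p = 1
    # Assumed non-linearity: the effective alpha up to doubles at the extremes.
    alpha_eff = alpha * (1.0 + distance ** tail_power)
    return 0.5 + (1.0 - alpha_eff) * (p - 0.5)
```

Under this form a speculative 0.99 comes back as roughly 0.85 while a decisive 0.99 stays near 0.97, and the mapping remains monotonic for all three tiers, so the ordering of forecasts is preserved.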
Ensemble independence is purchased at a cost. Cross-vendor diversification (Anthropic, OpenAI, Google have different training data) works. Same-family variants are correlated and don't help. Shared web search trades a bit of evidence-side independence for substantial cost and latency wins.
The market is hard to beat in its own zone. Roughly 50 to 70% of binary events should hit the tail-anchor or cross-venue paths, where we are the market. The edge has to come from the 30 to 50% in illiquid bands, multi-outcome events that lack venue coverage, and the categories where the LLM ensemble has a structural information advantage.
Engineering for the eval window, not just the build window. Path-stratified calibration with bounded shift, daily refit cron, GCS-mirrored predictions, async ensemble deadline, retry without web search on total failure, defensive exception handling on every external call, diff-sanity guard before a calibration table is published. These do not show up in unit tests. They are what makes a 14-day evaluation actually run without intervention.
The official rules require a public, reproducible, evaluator-runnable submission. We shipped a public GitHub repo under MIT and a one-shot scripts/evaluate_agent.sh that handles Python version checks, virtualenv setup, dependency install, API-key verification, and dataset retrieval with future-event filtering, then prints a human-readable summary. Reviewers can clone, set three API keys, and run a smoke test for under $0.25 in LLM costs.