The Oracles is a probabilistic forecasting agent built for Prophet Arena's forecasting track. It answers POSTed event webhooks with calibrated per-outcome probabilities, backed by recent web evidence and anchored to any cited market odds.
Five observable pipeline stages run for every event:
- Build a Brave Search query from the event title and most informative outcome.
- Retrieve five web-evidence snippets, deduped by domain with .gov / .edu / exchanges of record prioritized.
- Rank and cap the chunks.
- Single Anthropic Claude Opus 4.7 forecast call. The system prompt enforces a 0.50 to 0.90 calibration scale and explicit market-odds anchoring: if the evidence cites an implied probability, anchor to it and move more than 5pp only with specific contrary signal.
- Kalshi longshot floor: every per-outcome probability floored at min(0.10, max(0.05, 0.5/n)), then renormalized. The 0.10 cap is the empirical Kalshi threshold below which buyers historically lose more than 60%.
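The floor stage can be sketched as follows. This is a minimal illustration, not the project's code: the function names are made up, and it assumes that "renormalized" means outcomes pinned at the floor stay at the floor while the remaining outcomes are rescaled to absorb the leftover mass. `legacy_floor` is the earlier formula without the 0.10 cap, included only for comparison.

```python
def longshot_floor(n_outcomes: int) -> float:
    """Floor from the text: min(0.10, max(0.05, 0.5/n))."""
    return min(0.10, max(0.05, 0.5 / n_outcomes))


def legacy_floor(n_outcomes: int) -> float:
    """Earlier formula without the 0.10 cap: max(0.05, 0.5/n)."""
    return max(0.05, 0.5 / n_outcomes)


def apply_floor(probs: list[float], floor: float) -> list[float]:
    """Raise every probability to at least `floor`, keeping the sum at 1.

    Assumption: floored outcomes are pinned at the floor and the rest are
    rescaled proportionally. The loop repeats because rescaling can push
    another outcome below the floor.
    """
    n = len(probs)
    total = sum(probs)
    if total <= 0 or floor * n >= 1.0:
        # Degenerate input or infeasible floor: fall back to uniform.
        return [1.0 / n] * n
    p = [x / total for x in probs]
    pinned: set[int] = set()
    while True:
        newly = {i for i, x in enumerate(p) if i not in pinned and x < floor}
        if not newly:
            return p
        pinned |= newly
        free = [i for i in range(n) if i not in pinned]
        remaining = 1.0 - floor * len(pinned)
        free_mass = sum(p[i] for i in free)
        for i in pinned:
            p[i] = floor
        for i in free:
            p[i] *= remaining / free_mass


confident = [0.97, 0.03]
print(apply_floor(confident, longshot_floor(2)))  # approximately [0.90, 0.10]
print(apply_floor(confident, legacy_floor(2)))    # approximately [0.75, 0.25]
```

Under this reading, the capped formula leaves a confident binary forecast at 0.90, while the uncapped one pulls it down to 0.75.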
Each call writes a full trace to disk: Brave query, raw model output, parse path, per-stage latency, fuzzy outcome-label matches, and warnings. None of those fields are returned to Prophet Arena; they live in the auth-gated /observatory dashboard.
The submission is grounded in three findings and two negative results.
Findings:
- Bug-fix dominance. About 85% of the headline Brier improvement over the Sonnet baseline came from a one-line fix to the post-LLM longshot floor formula, not from a model upgrade. The old max(0.05, 0.5/n) evaluated to 0.25 for binary events, silently clamping every binary prediction into [0.25, 0.75] regardless of model output.
- Scoring rule matters more than the model. Prophet Arena's CLI evaluator scores single-binary Brier; their published docs describe proper multi-class Brier; their actual live scoring is a Brier skill score against snapshotted Kalshi/Polymarket prices. The three rules rank our model lineup differently on n=26.
- Schema discipline. With the same retrieval, prompt, and post-processing, GPT-5.5 and Gemini 3.1 Pro Preview scored 18 to 46 times worse than Opus 4.7 on the multi-outcome events. The failure was in the JSON output (probability mass assigned to outcome labels that weren't in the supplied list), not a reasoning gap. Both are strong models that didn't fit this exact contract.
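To make the output-contract failure concrete, here is a hedged sketch of how off-list probability mass could be measured. It is not the project's parser: `align_labels`, the 0.8 similarity cutoff, and the example labels are assumptions, and `difflib.get_close_matches` stands in for whatever fuzzy label matching the trace records.

```python
from difflib import get_close_matches


def align_labels(
    raw: dict[str, float], allowed: list[str], cutoff: float = 0.8
) -> tuple[dict[str, float], float]:
    """Map model-emitted labels onto the supplied outcome list.

    Returns the aligned distribution plus the probability mass that landed
    on labels with no sufficiently close match in `allowed`.
    """
    aligned = {label: 0.0 for label in allowed}
    lookup = {label.casefold(): label for label in allowed}
    off_list_mass = 0.0
    for label, prob in raw.items():
        match = get_close_matches(label.casefold(), list(lookup), n=1, cutoff=cutoff)
        if match:
            aligned[lookup[match[0]]] += prob
        else:
            off_list_mass += prob
    return aligned, off_list_mass


# Made-up example: the model invents an outcome that was never offered.
allowed = ["Chiefs", "Eagles"]
raw = {"chiefs": 0.6, "Bills": 0.4}
aligned, lost = align_labels(raw, allowed)
print(aligned, lost)
```

Any mass counted in `lost` is effectively a forecast of zero for the true outcome whenever the true outcome is on the supplied list, which is one way a strong model can score badly without reasoning badly.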
Negative results:
- Adversarial-review prompts regress on a calibrated production model. Two independent variants (two-call self-critique, one-call verification field) both pull confident-and-correct predictions toward the middle, costing Brier where production was right to be confident.
- Two changes that looked like obvious wins didn't survive a paired-bootstrap CI on n=26: an adaptive retrieval count, and prioritizing exchange sources only. The CIs crossed zero, so neither shipped. Recorded in docs/DECISIONS.md as research notes.
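The "CI crossed zero" test can be illustrated with a small paired bootstrap. This is a sketch under stated assumptions, not the contents of scripts/bootstrap_brier_ci.py: the function name, the percentile method, and the default of 10,000 resamples are all choices made for this example.

```python
import random


def paired_bootstrap_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Percentile CI for the mean per-event score difference (a minus b).

    Pairing matters: both variants are scored on the same events, so we
    resample events (differences), not the two score lists independently.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


def crosses_zero(lo: float, hi: float) -> bool:
    """True when the interval does not exclude 'no difference'."""
    return lo <= 0.0 <= hi
```

With per-event Brier scores for a candidate and for production as the two inputs, a change would ship under this rule only when `crosses_zero(lo, hi)` is false and the sign favors the candidate.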
Every published claim is backed by an on-disk artifact: prediction JSON, bootstrap-CI script, leakage-audit script, variance ablation. The audit script flags 38.5 percent of resolved events as having at least one post-resolution URL in the evidence list, so the headline 0.0378 Brier is best-case-with-hindsight and we say so prominently in the report.
Scale validation. We re-ran the same production variant on Prophet Arena's public 1200-event resolved dataset (Prophet-Arena-Subset-1200 on HuggingFace), 46 times the size of the n=26 resolved sample. Mean Brier was 0.1224 with 95% bootstrap CI [0.110, 0.135]. The leakage rate at scale drops to 21.8 percent of events / 7.0 percent of URLs, vs 38.5 percent / 23.8 percent on the small set, confirming the small-set headline was hindsight-inflated. The 0.1224 number is the more credible expected magnitude on live PA events. We do not replace the 0.0378 (it is what the local prophet forecast evaluate CLI computes on its own dataset); we publish both side by side.
The auth-gated research console at agent.forecastingpath.com/observatory shows live prediction traces, a 5-model side-by-side gallery with per-event drill-down, a cross-model heatmap, an interactive abstain-policy slider that visualizes Prophet Arena's actual scoring rule, a bootstrap distribution histogram, an intra-model variance plot, and a step-by-step pipeline trace explorer for educational replay. The public root shows a sparse product page.
We coordinated two AI agents in parallel via docs/AGENT_STATUS.md with explicit file-ownership claims. Zero merge conflicts. Methodology lessons live in docs/DECISIONS.md (append-only, 20+ dated entries) so any future iteration can see what was tried and rejected before being tried again.
Python 3.13. FastAPI plus uvicorn on Railway with Nixpacks builds. Anthropic Claude Opus 4.7 as the forecasting model, with Sonnet 4.6 and Haiku 4.5 as fallbacks. Brave Search Web API for retrieval. OpenRouter unified API for the multi-vendor ablation harness (tested Opus 4.6, GPT-5.2, GPT-5.5, and Gemini 3.1 Pro Preview through the same pipeline).
Engineering scaffolding:
- scripts/preflight.sh: gate-and-print check before every deploy. Checks that verify is green, the working tree is clean, HEAD = origin/main, and the upload size is sane (caught a real 18MB worktree-bloat bug), then prints the live vs local SHA delta.
- scripts/agent/deploy.sh: single safe path to railway up. Pins commit SHA into PROPHET_BUILD_COMMIT_SHA env so /healthz.commit reflects what is actually serving.
- scripts/full_check.sh: 10-step end-to-end audit across source state, deployed surface, auth gates, and watcher process.
- scripts/bootstrap_brier_ci.py: paired-bootstrap confidence interval for any two prediction files.
- scripts/check_retrieval_leakage.py: word-boundaried regex audit of evidence URLs for post-resolution markers.
- scripts/ablate_variance.py: 5-run intra-model variance estimate.
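As an illustration of the leakage audit idea, the sketch below flags evidence URLs that contain post-resolution markers and reports both a per-event and a per-URL rate. The marker list is hypothetical (the real list in scripts/check_retrieval_leakage.py is not given here), as are the function names and the input shape.

```python
import re

# Hypothetical markers; the project's real list is not stated in the text.
POST_RESOLUTION_MARKERS = ("recap", "final-score", "results", "highlights")

# Word boundaries keep "recap" from matching inside "recapture".
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in POST_RESOLUTION_MARKERS) + r")\b",
    re.IGNORECASE,
)


def is_flagged(url: str) -> bool:
    """True if the URL contains any marker as a whole word."""
    return _MARKER_RE.search(url) is not None


def leakage_rates(evidence_by_event: dict[str, list[str]]) -> tuple[float, float]:
    """Return (share of events with >=1 flagged URL, share of URLs flagged)."""
    all_urls = [u for urls in evidence_by_event.values() for u in urls]
    if not evidence_by_event or not all_urls:
        return 0.0, 0.0
    flagged_events = sum(
        any(is_flagged(u) for u in urls) for urls in evidence_by_event.values()
    )
    flagged_urls = sum(is_flagged(u) for u in all_urls)
    return flagged_events / len(evidence_by_event), flagged_urls / len(all_urls)
```

A URL-only check like this is a heuristic: it can miss leaked content behind neutral URLs, so the rates are better read as a lower bound on leakage than as an exact measurement.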
Frontend is hand-written HTML plus a small amount of Plotly via CDN for the interactive views. No build step, no framework. Auth-gated static research pages live at /static/ behind a PIN-checked middleware; the public root, /healthz, /static/summary.pdf, and /static/architecture.svg are open.
Testing: 270+ pytest cases. The verify gate (scripts/agent/verify.sh) was tightened mid-build after we discovered it had been silently swallowing pytest failures (now loud). The submission report's public copy quality is itself a test (tests/test_public_text_quality.py bans non-ASCII em-dashes, marketing speak, and any number known to be stale).
Some challenges:
A silent production bug. The Kalshi longshot floor formula returned 0.25 for binary events instead of the intended 0.10, silently clamping every binary prediction into [0.25, 0.75] regardless of LLM output. Found by a smoke test on a synthetic Chiefs and Super-Bowl-LXI event after an unrelated model swap. About 85 percent of the eventual headline improvement came from this one-line fix; the model swap accounted for the remaining 15 percent.
A methodology bug in the ablation harness. backtest_forecast.py dropped the per-outcome probabilities array from each saved prediction, keeping only p_yes. The summary-report generator then defaulted to a uniform 1/n distribution for multi-class scoring, producing apparent 25-times multi-outcome gaps that turned out to be metric-mixing across files. Fixed; postmortem in docs/DECISIONS.md.
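One way to guard against this class of bug (not necessarily the fix that was applied) is to make the scorer refuse to guess when the per-outcome array is missing. The field name `probabilities` and the function name below are assumptions made for illustration.

```python
def outcome_probabilities(record: dict, outcomes: list[str]) -> list[float]:
    """Return the saved per-outcome probabilities or fail loudly.

    Deliberately has no uniform 1/n fallback: a silent default makes files
    saved with and without the array look comparable when they are not.
    """
    probs = record.get("probabilities")  # hypothetical field name
    if probs is None:
        raise ValueError("prediction has no per-outcome probabilities")
    if len(probs) != len(outcomes):
        raise ValueError(
            f"expected {len(outcomes)} probabilities, got {len(probs)}"
        )
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities do not sum to 1")
    return list(probs)
```

A record that carries only `p_yes` then raises at scoring time instead of being scored as a uniform forecast.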
Three Railway deploys failed silently with TLS BadRecordMac errors before we noticed .claude/worktrees/ was uploading 18MB of agent state on every push. Fix: add the worktrees to .gitignore and add a du -sh check to preflight.
The scoring rule was not what we thought. Prophet Arena's CLI scores single-binary Brier; their published docs describe proper multi-class; their actual live scoring (confirmed mid-build in Discord) is a Brier skill score against snapshotted market prices. Three different metrics, three different model rankings on n=26. The right adjustment was to report all three and resist promoting any change on a metric the actual evaluator does not implement.
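The three metrics can be written down in a few lines. This sketch uses the textbook definitions; in particular, the skill score is the standard 1 - BS_model / BS_reference with the market price as reference, which is an assumption about Prophet Arena's live rule rather than a confirmed formula. Function names are illustrative.

```python
def binary_brier(p_yes: float, resolved_yes: bool) -> float:
    """Single-binary Brier: squared error on one yes/no outcome."""
    return (p_yes - (1.0 if resolved_yes else 0.0)) ** 2


def multiclass_brier(probs: list[float], winner_index: int) -> float:
    """Multi-class Brier: squared error summed over every outcome."""
    return sum(
        (p - (1.0 if i == winner_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def brier_skill_score(model_brier: float, market_brier: float) -> float:
    """Skill relative to the market: positive means the model beat it."""
    if market_brier == 0.0:
        raise ValueError("reference score is zero; skill score undefined")
    return 1.0 - model_brier / market_brier


# Made-up binary example: model says 0.90, market snapshot says 0.80, YES resolves.
model = binary_brier(0.90, True)    # about 0.01
market = binary_brier(0.80, True)   # about 0.04
print(multiclass_brier([0.90, 0.10], 0))   # about 0.02
print(brier_skill_score(model, market))    # about 0.75
```

The same forecast gets 0.01, 0.02, or 0.75 depending on the rule, and only the last depends on what the market said, which is why rankings can differ across the three.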
Adversarial-review prompts looked promising on a single run and failed to replicate. A two-call self-critique pattern improved Brier by 0.003 in run 1, regressed by 0.003 in run 2. A one-call verification-field pattern regressed by 0.019. Both pull confident-and-correct predictions toward the middle. Recorded as negative results.
If we are lucky enough to be chosen to continue, our next steps include:
- Run our agent through FutureSim's three-month chronological-replay benchmark as a non-leaking substrate.
- Try a confidence-aware critique: only revise low-confidence initial predictions, leaving confident-and-correct ones alone. Open hypothesis after the two regression results.
- Test market-aware blending. If Prophet Arena's actual live payload includes a market price for the outcome, blend it with the model probability conditional on retrieval confidence.
- Validate the schema-discipline finding on a larger non-leaking sample. n=26 with 16 sports matchups is directional, not conclusive.
Built With
- claude
- fastapi
- python
- railway