Our system runs three independent LLM forecaster lanes in parallel: Gemini 2.5 Flash (via Google's native search grounding), Claude Sonnet 4.6 (via OpenRouter with :online web access), and Grok 4.20 (via OpenRouter with live data). Each lane is given a distinct prompt philosophy to maximize independence and reduce correlated errors.
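As a rough illustration of the parallel-lane layout, the sketch below fans one market out to several forecaster callables and collects their probabilities. The names `run_lanes` and `Lane`, and the idea that each lane is a plain function from a market ticker to a probability, are assumptions made for this example and not the project's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical lane signature: takes a market ticker, returns P(YES).
Lane = Callable[[str], float]


def run_lanes(lanes: Dict[str, Lane], market_ticker: str) -> Dict[str, float]:
    """Run every lane concurrently and return {lane_name: probability}."""
    # Threads suit this workload because each lane mostly waits on network I/O.
    with ThreadPoolExecutor(max_workers=max(1, len(lanes))) as pool:
        futures = {name: pool.submit(fn, market_ticker) for name, fn in lanes.items()}
        # result() blocks until that lane finishes and re-raises its exception, if any.
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    # Stub lanes stand in for the real model calls.
    stubs = {
        "gemini": lambda ticker: 0.62,
        "claude": lambda ticker: 0.58,
        "grok": lambda ticker: 0.65,
    }
    print(run_lanes(stubs, "EXAMPLE-TICKER"))
```

Because the lanes share no state, a failure or slow response in one lane does not affect what the others return, which fits the stated goal of keeping the forecasts independent.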
Gemini follows a market-first evidence hierarchy: it starts from the Kalshi market price as a prior, works through a strict source quality ladder (official primary data → reputable aggregators → low-latency reporting), and applies domain-specific research checklists per category (sports injury reports, NOAA forecasts, official legislative dockets, etc.). It reports calibrated adjustments from the prior rather than raw probability estimates.
Claude uses a structured four-phase protocol: rules parsing, targeted evidence assembly (2–4 web calls), market comparison with spread/liquidity regime classification, and an explicit red-team pass. It deviates from the prior only when its research-based estimate differs from the prior by more than 5 percentage points and that gap is backed by citable evidence. Its prior blends the Kalshi microprice with the Polymarket price.
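The prior and the deviation rule can be sketched as follows. This is a minimal illustration under stated assumptions: the microprice uses the common size-weighted definition, the 0.7 blend weight is a placeholder (the write-up does not give the actual Kalshi/Polymarket weighting), and all function names are hypothetical.

```python
from typing import Optional


def microprice(bid: float, ask: float, bid_size: float, ask_size: float) -> float:
    """Size-weighted mid price; it leans toward the side with less resting size."""
    total = bid_size + ask_size
    if total <= 0:
        return (bid + ask) / 2.0  # fall back to the plain mid on an empty book
    return (bid * ask_size + ask * bid_size) / total


def blended_prior(kalshi_micro: float, polymarket_price: Optional[float],
                  kalshi_weight: float = 0.7) -> float:
    """Blend the two venues; kalshi_weight is an assumed placeholder value."""
    if polymarket_price is None:
        return kalshi_micro  # no matching Polymarket market
    return kalshi_weight * kalshi_micro + (1.0 - kalshi_weight) * polymarket_price


def gated_forecast(prior: float, research_estimate: float,
                   has_citable_evidence: bool, min_gap: float = 0.05) -> float:
    """Leave the prior only when the gap exceeds 5pp and evidence is citable."""
    if has_citable_evidence and abs(research_estimate - prior) > min_gap:
        return research_estimate
    return prior


if __name__ == "__main__":
    prior = blended_prior(microprice(0.60, 0.64, 100, 300), 0.65)
    print(gated_forecast(prior, 0.70, has_citable_evidence=True))   # gap > 5pp: deviates
    print(gated_forecast(prior, 0.66, has_citable_evidence=True))   # gap < 5pp: keeps prior
```

The gate makes the market the default answer, so an estimate has to clear both a size threshold and an evidence requirement before it moves the forecast.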
Grok is tuned to fix its known failure mode of hedging confident markets toward 0.5. The prompt explicitly instructs it to trust market extremes (e.g. if the market is at 0.97, stay in [0.90, 0.99]). For binary markets it uses bidirectional prompting to reduce anchoring bias: it queries P(YES) and P(NO) independently, then averages P(YES) with the complement of P(NO). A noise-removal filter skips the API call entirely for high-volume, extreme-priced, or ATP tennis markets (where Grok has historically underperformed) and returns the market price directly instead.
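A minimal sketch of the bidirectional estimate and the noise-removal filter is shown below. The volume cutoff, the extreme-price threshold, the category label, and every name in the code are assumptions chosen for illustration; the write-up does not state the real values. The combination rule assumed here is the average of P(YES) and 1 - P(NO).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Market:
    price: float    # market-implied P(YES), between 0 and 1
    volume: int     # traded contracts
    category: str   # e.g. "atp_tennis", "politics"


def should_skip_model(m: Market, volume_cutoff: int = 100_000,
                      extreme: float = 0.05) -> bool:
    """Noise filter: True when the lane should just echo the market price.

    volume_cutoff and extreme are hypothetical thresholds.
    """
    is_extreme = m.price <= extreme or m.price >= 1.0 - extreme
    return m.volume >= volume_cutoff or is_extreme or m.category == "atp_tennis"


def bidirectional_estimate(p_yes: float, p_no: float) -> float:
    """Average the direct YES answer with the complement of the direct NO answer."""
    return (p_yes + (1.0 - p_no)) / 2.0


def grok_lane(m: Market, ask_yes: Callable[[Market], float],
              ask_no: Callable[[Market], float]) -> float:
    """Return the market price for filtered markets, else the bidirectional estimate."""
    if should_skip_model(m):
        return m.price  # no API call is made
    return bidirectional_estimate(ask_yes(m), ask_no(m))


if __name__ == "__main__":
    quiet = Market(price=0.55, volume=500, category="politics")
    # Stub callables stand in for the two independent model queries.
    print(grok_lane(quiet, lambda m: 0.70, lambda m: 0.40))
```

If the two queries were perfectly consistent, P(YES) and 1 - P(NO) would be equal and the average would change nothing; the average only matters when the framing of the question shifts the model's answer.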
Ensemble aggregation weights each lane dynamically at runtime based on its evidence quality, rules clarity, liquidity assessment, and whether it deferred to the market, so a low-confidence or market-deferring lane automatically contributes less signal. A Kalshi market anchor (weight 1.5) acts as a stabilizing Bayesian prior. The raw pool is then shrunk toward live Kalshi market-implied probabilities via a calibration layer. Finally, a GPT supervisor acts as a pure meta-aggregator: it sees all three lane distributions plus the deterministic ensemble, performs no research of its own, and selects the strongest lane, mixes the lanes, or defers to the deterministic result.
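One way the deterministic part of this pipeline could look for a binary market is sketched below. Only the anchor weight of 1.5 comes from the description; the multiplicative weight formula, the 0.5 deferral penalty, the 0.3 shrinkage factor, and all names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaneOutput:
    prob: float               # lane's P(YES)
    evidence_quality: float   # self-reported scores, each between 0 and 1
    rules_clarity: float
    liquidity: float
    deferred_to_market: bool


def lane_weight(lane: LaneOutput, defer_penalty: float = 0.5) -> float:
    """Hypothetical weight: product of the quality scores, cut when the lane deferred."""
    w = lane.evidence_quality * lane.rules_clarity * lane.liquidity
    return w * defer_penalty if lane.deferred_to_market else w


def pooled_probability(lanes: List[LaneOutput], market_price: float,
                       anchor_weight: float = 1.5) -> float:
    """Weighted linear pool in which the market price is an extra, fixed-weight voter."""
    weights = [lane_weight(lane) for lane in lanes]
    num = anchor_weight * market_price + sum(w * l.prob for w, l in zip(weights, lanes))
    den = anchor_weight + sum(weights)
    return num / den


def shrink_to_market(pool: float, market_price: float, shrink: float = 0.3) -> float:
    """Calibration step: move the pooled estimate part of the way back to the market."""
    return (1.0 - shrink) * pool + shrink * market_price


if __name__ == "__main__":
    lanes = [
        LaneOutput(0.70, 1.0, 1.0, 1.0, deferred_to_market=False),
        LaneOutput(0.60, 0.5, 1.0, 1.0, deferred_to_market=True),
    ]
    raw = pooled_probability(lanes, market_price=0.60)
    print(raw, shrink_to_market(raw, market_price=0.60))
```

With these placeholder numbers the confident lane carries weight 1.0, the deferring lane 0.25, and the anchor 1.5, so the pooled estimate stays much closer to the market than to the most aggressive lane. The GPT supervisor step is not modeled here.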