AI-Prophet: Project Documentation

The Overview

Forecasting the future has always felt like the ultimate test of machine intelligence.

When we first started playing around with LLMs for prediction, we noticed a frustrating pattern: they’re either dangerously overconfident, or they timidly sit on the fence with a safe 50/50 guess. We also looked at real-world prediction markets like Kalshi and realized that while the "wisdom of the crowd" is great, those markets often go stale or get things wrong when breaking news hits.

We thought, what if we could build a system that doesn't just guess, but actually argues with itself like a room full of analysts?

At its core, AI-Prophet is an automated forecasting engine designed to predict complex, multi-outcome real-world events—spanning politics, sports, finance, and pop culture. It doesn't just fire off a single prompt. It scours the live web to build a research brief, checks historical market prices, and spins up 6 distinct AI personas to debate the outcome. It's built to know exactly when to trust the crowd, and when to confidently bet against it.


Architecture & How We Built It

We wanted this engine to be lightning fast but also cheap to run, so we built a 6-stage pipeline whose debate stage runs in parallel:

  • The Time Machine (Cutoff Engine): Resolves the exact date an event happens to establish a strict temporal boundary for our searches.
  • Evidence Gathering: We use OpenRouter (specifically Claude Haiku) to scrape the web and build an "Evidence Brief," while our Kalshi API client pulls historical candlestick market pricing.
  • The Parallel Debate: We run 6 distinct AI personas simultaneously. They include one that thinks purely in base-rate probabilities, one that maps out decision trees (Tree-of-Thought), one that plays devil's advocate (Contrarian), and one that runs autonomous deep-search loops.
  • The Supervisor: A master Supervisor AI reviews the entire 6-way debate, ranks the arguments, and outputs a consensus prediction with a strict confidence rating.
  • The Math Engine: We run the raw probabilities through our custom calibration layer (applying Platt scaling and market blending) to ensure the numbers are mathematically sound.
  • The Output: The system clips the probabilities to safe tournament margins and outputs a perfectly formatted JSON prediction.
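
To make the fan-out and the final output stage concrete, here is a minimal sketch, not our production code. The persona labels, the `run_persona` stub, the plain average standing in for the Supervisor, the 0.01 floor, and the `probabilities` JSON key are all illustrative assumptions.

```python
import asyncio
import json

# Labels for the four personas named above; the names are illustrative.
PERSONAS = ["base_rate", "tree_of_thought", "contrarian", "deep_search"]

async def run_persona(name: str, outcomes: list[str]) -> dict[str, float]:
    """Hypothetical stand-in for one persona's model call."""
    await asyncio.sleep(0)  # placeholder for the real network request
    # A real persona would return its own forecast; the stub returns a uniform one.
    return {o: 1.0 / len(outcomes) for o in outcomes}

async def debate(outcomes: list[str]) -> list[dict[str, float]]:
    """Run every persona concurrently; results come back in PERSONAS order."""
    return await asyncio.gather(*(run_persona(p, outcomes) for p in PERSONAS))

def average(forecasts: list[dict[str, float]]) -> dict[str, float]:
    """Plain mean across personas, standing in for the Supervisor stage."""
    outcomes = forecasts[0].keys()
    return {o: sum(f[o] for f in forecasts) / len(forecasts) for o in outcomes}

def clip_and_normalize(probs: dict[str, float], floor: float = 0.01) -> dict[str, float]:
    """Clip each probability into [floor, 1 - floor], then rescale to sum to 1."""
    clipped = {o: min(max(p, floor), 1.0 - floor) for o, p in probs.items()}
    total = sum(clipped.values())
    return {o: p / total for o, p in clipped.items()}

def forecast_json(outcomes: list[str]) -> str:
    """Fan out, aggregate, clip, and serialize the final forecast."""
    forecasts = asyncio.run(debate(outcomes))
    final = clip_and_normalize(average(forecasts))
    return json.dumps({"probabilities": final}, indent=2)
```

Note that rescaling after clipping can leave a value marginally below the floor when many outcomes are clipped at once, so a stricter margin would need a second pass.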

Key Design Decisions

David vs. Goliath Routing

The biggest lesson we learned was that throwing a single, hyper-expensive model (like Claude Opus) at a problem is rarely the best answer. By orchestrating a structured debate between 6 cheaper, specialized models (Claude 3.5 Sonnet, GPT-4o-mini, and Gemini 2.5 Flash), we produced dramatically better reasoning while keeping our evaluation costs low.

Zero-Downtime Fallbacks

Web environments are volatile. We wrapped our entire pipeline in a universal exception catcher. If an API key fails or a rate limit is hit, the server gracefully falls back to a safe, uniform probability distribution rather than crashing, ensuring we never submit a disqualified tournament entry.
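
The fallback idea can be sketched in a few lines. This is a simplified illustration that assumes the pipeline is a single callable; `safe_forecast` and `uniform` are hypothetical names, not functions from our codebase.

```python
import logging

logger = logging.getLogger("forecast")

def uniform(outcomes):
    """Equal probability for every outcome."""
    return {o: 1.0 / len(outcomes) for o in outcomes}

def safe_forecast(pipeline, outcomes):
    """Call pipeline(outcomes); on any exception, return a uniform distribution."""
    try:
        return pipeline(outcomes)
    except Exception:
        # Record the traceback so the failure is visible, then degrade gracefully.
        logger.exception("pipeline failed; falling back to uniform")
        return uniform(outcomes)
```

Catching `Exception` rather than using a bare `except` still lets `KeyboardInterrupt` and `SystemExit` through, so the server can be stopped normally.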

Sandboxed Portability

We completely dockerized the repository to eliminate weird local dependencies. We had to sandbox volatile scraper cache directories so the Docker container could run with zero privileges, allowing us to deploy it live as a persistent web service on Render in just a few clicks.


Key Innovations

Beating Look-Ahead Bias (The Ultimate Cheat)

When backtesting historical events, standard web-search tools instantly retrieve the "future" answer, ruining the evaluation. We built a brutal 6-layer temporal debiasing engine that scrapes HTML metadata and physically redacts future dates and phrases like "won the election" from the context window before the AI ever sees it.
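
As a rough illustration of the redaction step only, the sketch below masks ISO-formatted dates that fall after the cutoff, plus a short list of outcome phrases. The regular expressions, the `[REDACTED]` marker, and the `redact` function are assumptions for this example; the real engine has more layers and handles many more date formats.

```python
import re
from datetime import date

# Matches dates written as YYYY-MM-DD; other formats are out of scope here.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# Illustrative phrase list; a real list would be much longer.
OUTCOME_PHRASES = re.compile(r"won the election", re.IGNORECASE)

def redact(text: str, cutoff: date) -> str:
    """Mask dates later than the cutoff and phrases that reveal the outcome."""
    def mask_date(match):
        try:
            found = date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
        except ValueError:
            return match.group(0)  # not a real calendar date; leave it alone
        return "[REDACTED]" if found > cutoff else match.group(0)

    text = ISO_DATE.sub(mask_date, text)
    return OUTCOME_PHRASES.sub("[REDACTED]", text)
```

Dates on or before the cutoff are left untouched, so the model still sees the legitimate historical context.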

Disagreement-Aware Blending

Standard forecasting models blindly merge their guesses with market priors. We built an engine that actually measures its own confidence. If our AI models are highly confident and strongly disagree with a flat, stale Kalshi market, our system automatically scales the market's influence down to zero and trusts its own reasoning.
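
One way to express this behavior, shown here as a sketch under stated assumptions rather than our exact rule: measure confidence as one minus the normalized entropy of the model forecast, measure disagreement as the total variation distance to the market, and shrink the market weight by their product. The `base_weight` of 0.5 is a placeholder, and the sketch does not model market staleness.

```python
import math

def normalized_entropy(p):
    """Shannon entropy divided by its maximum, so the result lies in [0, 1]."""
    if len(p) < 2:
        return 0.0
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

def total_variation(p, q):
    """Half the L1 distance between two distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def blend(model, market, base_weight=0.5):
    """Mix model and market; the market weight shrinks as the model grows
    more confident and disagrees more strongly with the market."""
    confidence = 1.0 - normalized_entropy(model)
    disagreement = total_variation(model, market)
    w = base_weight * (1.0 - confidence * disagreement)
    mixed = [(1.0 - w) * m + w * k for m, k in zip(model, market)]
    return mixed, w
```

With this rule the market weight reaches zero only when the model is fully confident and completely at odds with the market; when the two agree, the market keeps its full base weight.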

Adaptive Multi-Class Platt Scaling

Standard Platt scaling (which sharpens predictions) works beautifully for Yes/No questions. But when we threw a 14-candidate primary election at it, the math totally broke and destroyed the probability distributions. We had to sit down and derive an entirely new adaptive cardinality-scaled Platt formula that dynamically scales based on how many outcomes are possible, ensuring perfect calibration across the board.
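
The sketch below shows the general shape of the idea without reproducing our formula. For two outcomes, raising each probability to a power `alpha` and renormalizing is the same as Platt scaling with slope `alpha` and zero intercept. The `adaptive_alpha` schedule, which relaxes the exponent toward 1 as the number of outcomes grows, is a purely hypothetical stand-in, as is the default value of 1.7.

```python
import math

def sharpen(probs, alpha):
    """Raise each probability to alpha and renormalize (alpha > 1 sharpens)."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def adaptive_alpha(alpha_binary, n_outcomes):
    """Hypothetical schedule: full strength at 2 outcomes, closer to 1 for larger fields."""
    if n_outcomes < 2:
        raise ValueError("need at least two outcomes")
    return 1.0 + (alpha_binary - 1.0) * math.log(2) / math.log(n_outcomes)

def calibrate(probs, alpha_binary=1.7):
    """Sharpen with an exponent chosen from the number of outcomes."""
    return sharpen(probs, adaptive_alpha(alpha_binary, len(probs)))
```

In a 14-way field with a 30% favorite, the fixed binary exponent inflates the favorite more than the relaxed one does, which is the kind of over-sharpening a cardinality-aware exponent is meant to damp.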
