Inspiration
Most LLM forecasting systems optimize for final event accuracy: estimate the true probability of an outcome, compare it to the market price, and trade if the gap is large. But in Prophet Arena's trading track, open positions are marked to market every tick. That changes the problem.
We asked: beyond “is this market mispriced at resolution?”, can an LLM-assisted trading agent also ask “will this market's midpoint move enough over the next few ticks to justify entering, holding, reducing, or exiting a position?”
That led us to build a price-aware RAG-BLF trading agent: forecasting is still useful, but only as one input into a short-horizon trading policy.
What it does
67 Prophets is an autonomous trading agent for live Kalshi prediction markets on Prophet Arena. Every 15-minute tick, it:
- Loads market candidates and portfolio state
- Updates cross-tick memory for each market, including prior bid/ask/midpoint, spread, repeated-market history, previous forecasts, evidence state, and decision history
- Ranks candidates before expensive LLM calls, prioritizing repeated markets, tight spreads, recent midpoint movement, near-term catalysts, and existing positions that may need exits
- Runs RAG/BLF selectively, using retrieval and belief-state estimation as an evidence engine rather than blindly forecasting every market
- Separates two kinds of edge:
  - long-run outcome edge: p_final - executable price
  - short-horizon price edge: predicted_next_mid - executable price
- Trades only through guarded live channels, including:
  - clean price-action entries
  - fresh-event catalyst entries
  - clean forecast-mispricing micro-entries
- Actively manages exits, reducing stale or weak positions instead of only looking for new buys
The key design shift is that the bot does not treat raw LLM probability gaps as automatic trades. It asks whether the edge survives spread, uncertainty, evidence quality, stale-data checks, market movement, and position-risk constraints.
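To make the two edges concrete, here is a minimal sketch rather than our production code. It assumes the executable price for a YES buy is the current ask. The names `Quote`, `outcome_edge`, `price_edge`, and `survives_guards`, along with the threshold defaults, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Top-of-book YES quote, in dollars per contract (0.0 to 1.0)."""
    bid: float
    ask: float

    @property
    def mid(self) -> float:
        return (self.bid + self.ask) / 2

    @property
    def spread(self) -> float:
        return self.ask - self.bid


def outcome_edge(p_final: float, quote: Quote) -> float:
    """Long-run edge: p_final minus the executable price (assumed: the ask)."""
    return p_final - quote.ask


def price_edge(predicted_next_mid: float, quote: Quote) -> float:
    """Short-horizon edge: predicted next midpoint minus the executable price."""
    return predicted_next_mid - quote.ask


def survives_guards(
    edge: float,
    quote: Quote,
    *,
    min_edge: float = 0.02,    # hypothetical threshold
    max_spread: float = 0.05,  # hypothetical threshold
    evidence_ok: bool = True,
) -> bool:
    """A raw edge is tradable only if evidence is clean and the spread is tight."""
    return evidence_ok and quote.spread <= max_spread and edge >= min_edge


quote = Quote(bid=0.48, ask=0.52)
long_run = outcome_edge(0.60, quote)   # forecast says 60% at resolution
short_run = price_edge(0.53, quote)    # midpoint expected to drift to 0.53
print(survives_guards(long_run, quote), survives_guards(short_run, quote))
```

In this example the same market passes on outcome edge but fails on short-horizon price edge, which is why the two are scored separately.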
How we built it
- Price memory layer: stores quote history per market across ticks, including midpoint movement, spread changes, repeated-market count, previous forecasts, and prior decision state.
- Candidate pre-ranker: deterministically reorders markets before RAG/BLF so the limited processing budget goes to markets with tighter spreads, repeated price history, recent movement, near-term catalysts, or existing-position risk.
- RAG / BLF forecasting stack: uses web-grounded evidence and belief-state estimation to produce calibrated fair-value anchors, evidence-quality scores, confidence labels, and resolution-risk flags.
- Short-horizon price layer: estimates whether the market midpoint is likely to move over the next few ticks, using price momentum, forecast-market disagreement, catalyst evidence, spread quality, and stale-evidence penalties.
- Guarded live trading policy: separates outcome-edge trades from price-movement trades. Forecast-only entries remain conservative, while clean price-action and catalyst entries can trade at tiny size when evidence is strong.
- Risk and freeze system: blocks live entries whenever a market's analysis came from a fallback path, hit a rate limit or timeout, relied on stale evidence, returned invalid JSON, carried unresolved resolution risk, or failed verification. A forecast-only freeze can pause risky entries while still allowing clean price-action trades and exits.
- Diagnostics and replay mindset: every tick produces structured audit data so we can evaluate whether predicted midpoint movement actually matched realized mark-to-market behavior.
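The candidate pre-ranker described above can be sketched as a deterministic scoring function. This is an illustration under stated assumptions: the `Candidate` fields, the weights, and the caps are all hypothetical, and only the list of prioritized signals comes from our design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    ticker: str
    spread: float            # ask minus bid, in dollars
    times_seen: int          # how many prior ticks included this market
    recent_mid_move: float   # midpoint change over recent ticks
    has_near_catalyst: bool
    has_open_position: bool


def prerank_score(c: Candidate) -> float:
    """Higher score means the market gets analyzed earlier. Weights are illustrative."""
    score = 0.0
    score += 3.0 if c.has_open_position else 0.0       # positions may need exits
    score += 2.0 if c.has_near_catalyst else 0.0
    score += min(c.times_seen, 5) * 0.4                # repeated quote history
    score += min(abs(c.recent_mid_move), 0.10) * 10.0  # recent movement
    score -= min(c.spread, 0.20) * 10.0                # penalize wide spreads
    return score


def select_for_analysis(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Keep the top `budget` markets; ties break on ticker so ordering is deterministic."""
    ranked = sorted(candidates, key=lambda c: (-prerank_score(c), c.ticker))
    return ranked[:budget]


markets = [
    Candidate("A", 0.02, 4, 0.03, False, False),
    Candidate("B", 0.15, 0, 0.00, False, False),
    Candidate("C", 0.06, 1, 0.00, False, True),
]
print([c.ticker for c in select_for_analysis(markets, budget=2)])
```

Because the ranking uses no randomness and no LLM calls, it is cheap to run on every candidate before any RAG/BLF budget is spent.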
Challenges
- Forecasting was not the same as trading: early versions focused too much on final-event probability. Most executable taker edges disappeared after spread and market-price shrinkage. We changed the objective from “find a true probability gap” to “find a tradable mark-to-market edge.”
- The bot was too conservative: heavy market anchoring and strict evidence gates caused many zero-trade ticks. We added separate guarded entry channels for price action, fresh events, and clean post-shrinkage forecast mispricings.
- No-history markets were difficult: momentum and mean-reversion signals require repeated quote history, but some catalyst opportunities appear before there is enough history. We added a separate fresh-event channel that allows tiny entries only under strict evidence and spread requirements.
- RAG/BLF failures had to be kept from becoming trades: rate limits, stale evidence, malformed responses, or fallback paths could accidentally make weak markets look tradable. We added live-freeze behavior and evidence-recovery rules so cached evidence can help diagnostics without authorizing unsafe live entries.
- Processing budget mattered: the bot cannot deeply analyze every market every tick. Candidate pre-ranking became essential so RAG/BLF calls are spent on markets where trading is actually plausible.
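The freeze behavior and entry channels mentioned above can be sketched as a small gate in front of live orders. This is a simplified illustration, not our actual policy code. The flag names, the `Channel` enum, and `allow_live` are hypothetical. The sketch also assumes that exits are never blocked and that a forecast-only freeze pauses only the forecast channel.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PRICE_ACTION = "price_action"
    FRESH_EVENT = "fresh_event"
    FORECAST = "forecast"
    EXIT = "exit"


# Hypothetical failure flags; any one of them blocks a live entry.
BLOCKING_FLAGS = frozenset({
    "fallback", "rate_limit", "stale_evidence", "invalid_json",
    "timeout", "resolution_risk", "verifier_failed",
})


@dataclass(frozen=True)
class Proposal:
    channel: Channel
    flags: frozenset[str] = frozenset()


def allow_live(proposal: Proposal, *, forecast_freeze: bool = False) -> bool:
    """Return True if the proposed order may be sent to the live market."""
    if proposal.channel is Channel.EXIT:
        return True   # assumption: reducing risk is always permitted
    if proposal.flags & BLOCKING_FLAGS:
        return False  # degraded analysis never authorizes an entry
    if forecast_freeze and proposal.channel is Channel.FORECAST:
        return False  # freeze pauses forecast-only entries
    return True


print(allow_live(Proposal(Channel.FORECAST), forecast_freeze=True))
print(allow_live(Proposal(Channel.PRICE_ACTION), forecast_freeze=True))
```

Keeping the gate as a pure function of the proposal and the freeze state makes each blocked trade easy to explain in the per-tick audit data.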
What we learned
The biggest lesson was that prediction-market trading is not just forecasting. A model can be directionally right about final probability and still lose money if the spread is too wide, the timing is wrong, or the price does not move during the holding window.
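A small worked example shows how a directionally correct forecast can still be marked at a loss. It assumes open positions are marked at the midpoint, which is our simplification for illustration. The function names are hypothetical.

```python
def mark_to_market_pnl(entry_price: float, mark_price: float, contracts: int = 1) -> float:
    """Unrealized PnL of a long YES position at the current mark."""
    return (mark_price - entry_price) * contracts


def breakeven_mid_move(bid: float, ask: float) -> float:
    """Midpoint rise needed before a buy at the ask is marked at zero PnL."""
    return ask - (bid + ask) / 2


bid, ask = 0.45, 0.55     # wide spread, midpoint 0.50
p_final = 0.60            # forecast is above the ask, so the outcome edge is positive
mid_next_tick = 0.50      # but the midpoint does not move during the holding window

pnl = mark_to_market_pnl(entry_price=ask, mark_price=mid_next_tick, contracts=10)
print(f"outcome edge: {p_final - ask:+.2f}, mark-to-market PnL: {pnl:+.2f}")
print(f"midpoint must rise by {breakeven_mid_move(bid, ask):.2f} to break even")
```

Under these assumptions the position starts roughly half a spread underwater, so the midpoint has to move by that much before the trade shows any gain.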
We learned to treat LLM forecasting as a research engine, not a direct trading signal. RAG and BLF are useful for detecting evidence, catalysts, resolution risks, and fair-value anchors. But the trading layer needs its own logic: quote history, spread-aware execution, midpoint-movement prediction, position exits, and risk controls.
The final strategy became a price-aware portfolio manager: trade small, require clean evidence, separate outcome edge from short-horizon price edge, and measure success by mark-to-market decisions rather than only final probability calibration.
Built With
- ai-prophet-core
- kalshi
- openrouter
- perplexity-sonar
- prophet-arena-api
- python-3.12