Inspiration
LLMs price prediction markets confidently but trade them badly: they ignore the quote already on the screen, swing across reruns, and have no notion of how much to bet. Prophet Arena's Trading Track scores on combined Sharpe + PnL rank — so a strategy that overbets on a noisy forecast is a strategy that loses. We wanted to wrap a small, careful Bayesian brain around the LLM, use it as one opinion among several, and let risk-adjusted return fall out of the math instead of being bolted on.
What it does
Project overview. everyday-intelli is an end-to-end Trading Track
agent that plugs directly into the ai-prophet SDK harness. On every
15-minute BenchmarkSession tick it:
- Pulls the candidate market list and current quotes.
- Forecasts $P(\text{YES})$ for each market with an LLM, tempered by the market mid-price as a Bayesian prior.
- Updates a per-market 4-state regime POMDP and folds its entropy into forecast uncertainty.
- Gates each market on expected value, then sizes survivors with fractional Kelly — shrunk by belief uncertainty and current drawdown.
- Submits intents through the SDK, finalizes, and advances. State persists per-slug, so the agent is resumable across crashes and safe to leave running for the full 2-week evaluation window.
The same brain runs in an offline harness (no API key required) that scores Brier, log loss, accuracy, and a synthetic PnL on resolved markets — so we could iterate on the math without burning live ticks.
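As a rough illustration of the offline scoring step, the three forecast metrics named above can be computed in a few lines. This is a minimal sketch, not the project's harness: the function names, the 0.5 accuracy threshold, and the clipping constant are assumptions made for the example.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between forecast P(YES) and the 0/1 resolution."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; eps clipping (assumed) avoids log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def accuracy(probs, outcomes, threshold=0.5):
    """Share of markets where the thresholded forecast matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)
```

Lower is better for Brier and log loss. Log loss punishes confident misses much harder than Brier does, which is one reason to look at both when checking calibration.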
How we built it
Architecture & design decisions.
- Model choice. Provider-agnostic LLM via an OpenAI-compatible endpoint: OpenRouter by default, but LLM_BASE_URL/LLM_MODEL switches it to any compatible host. Falls back to a deterministic base-rate heuristic when no key is set, so the pipeline always runs. Forecasts are disk-cached by content hash so reruns are free.
- Data sources. The Prophet Arena BenchmarkSession.load_candidates feed (question text, rules, resolution time, live best bid/ask), and the agent's own per-market price history persisted in data/prophet/regime_state.json.
- Bayesian core. A vendored log-odds belief updater. The market mid-price becomes the prior; the LLM forecast is folded in as a log-odds opinion pool: $$\text{logit}(p_\text{post}) = (1-w)\,\text{logit}(p_\text{prior}) + w\,\text{logit}(p_\text{llm})$$ with $w \in [0,1]$ controlling LLM trust, and a temperature $T \geq 1$ flattening overconfident model outputs before the fold.
- Latent regime POMDP. Per market, a 4-state belief over {bull, bear, sideways, volatile} updated from binned price-change observations: $$b'(s') = \eta \cdot O(o\mid s') \sum_s T(s'\mid s)\,b(s)$$ where $\eta$ normalizes the belief to sum to 1. Shannon entropy of the belief feeds back into forecast uncertainty, which shrinks Kelly sizing in volatile regimes without biasing the LLM-vs-prior weighting.
- Strategy logic, built for Sharpe. Trades must clear an EV threshold. Survivors are sized by fractional Kelly with three multiplicative shrinks: $$f = f_\text{kelly} \cdot \text{kelly}^*(p, q) \cdot (1 - 2\sigma)_+ \cdot \big(1 - \tfrac{\text{dd}}{\text{dd}_\text{max}}\big)_+$$ where $\sigma$ is belief spread, $\text{dd}$ is current drawdown off peak equity, and $(x)_+ = \max(x, 0)$. The drawdown shrink linearly de-risks as the book bleeds (the Sharpe-friendly choice) and zeroes out at $\text{dd}_\text{max}$. Position caps and gross-exposure caps from the Prophet Arena ruleset are enforced before submission.
- Resilience. Every file write is atomic (tmp+replace). Tick-level operations are idempotent. Re-running with the same PA_SLUG resumes the existing experiment instead of starting over, which matters for the 2-week eval window where a crash mid-run shouldn't cost a day of trades.
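The tmp+replace pattern from the resilience bullet can be sketched with the standard library alone. This is an illustrative version under assumptions, not the project's code: the helper names `atomic_write_json` and `load_state` are hypothetical, and the fsync before the rename is an extra precaution the write-up does not mention.

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_json(path, payload):
    """Write JSON to a temp file in the same directory, then rename over the target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # assumed extra step: push bytes to disk first
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise

def load_state(path, default):
    """Return the persisted state, or the default when no file exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return default
```

A resumed run would call something like `load_state("data/prophet/regime_state.json", {})` on startup and `atomic_write_json` after each tick, so a crash between the two leaves the previous state file untouched.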
Challenges we ran into
- LLM overconfidence. Raw model probabilities cluster around 0.05 and 0.95. Log-odds temperature scaling ($\text{logit}/T$ with $T>1$) before the prior fold made calibration usable.
- Sizing under uncertainty. Full Kelly on a 0.7 prior is a great way to blow up. It took layering three shrink factors (fractional Kelly, belief uncertainty, drawdown) before sizing felt right, and that was the difference between a Sharpe-positive agent and a coin-flip one.
- Resumability against an unforgiving lifecycle. Leases expire and ticks must be finalized — partial state would have corrupted experiments. Atomic persistence + idempotent per-tick steps fixed it.
- Holding both sides. Early versions tried to BUY NO on a market we already held YES on; the server (correctly) rejects it. We now gate on held_side before deciding.
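The EV gate and three-stage shrink from the sizing discussion can be sketched as follows. Several things here are assumptions for illustration: $p$ is read as the posterior $P(\text{YES})$ and $q$ as the YES price, the edge is measured as $p - q$ per contract, and the defaults for `f_kelly`, `dd_max`, and `ev_min` are placeholders rather than the project's tuned values.

```python
def kelly_star(p, q):
    """Full-Kelly bankroll fraction for buying YES at price q with belief p.

    A YES contract bought at q pays 1 on YES, so the net odds are (1 - q) / q
    and the Kelly fraction reduces to (p - q) / (1 - q). Returns 0 with no edge.
    """
    if not 0.0 < q < 1.0:
        return 0.0
    return max(0.0, (p - q) / (1.0 - q))

def position_fraction(p, q, sigma, dd, f_kelly=0.25, dd_max=0.20, ev_min=0.02):
    """Fraction of bankroll to stake on YES after the gate and all shrinks."""
    # EV gate: skip the market unless the per-contract edge clears the threshold.
    if p - q < ev_min:
        return 0.0
    # (1 - 2*sigma)_+ : wider belief spread means a smaller bet.
    uncertainty_shrink = max(0.0, 1.0 - 2.0 * sigma)
    # (1 - dd/dd_max)_+ : linear de-risking that reaches zero at dd_max.
    drawdown_shrink = max(0.0, 1.0 - dd / dd_max)
    return f_kelly * kelly_star(p, q) * uncertainty_shrink * drawdown_shrink
```

The NO side can reuse the same function by passing $1 - p$ and the NO price. Position and gross-exposure caps would be applied to the result afterwards and are not shown.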
Accomplishments that we're proud of
Key innovations — what makes our approach distinct:
- Market quote as Bayesian prior, not competitor. Most LLM-trading agents treat the price as something to beat. We treat it as the prior of a log-odds opinion pool, so the LLM has to earn its pull. This is the single biggest stability win.
- Latent-regime POMDP that feeds uncertainty, not direction. A bull belief doesn't shift our $P(\text{YES})$; instead, the entropy of the regime belief shrinks our bet size. Regime info shows up where it should: in risk, not in point estimates.
- Three-stage Kelly shrink targeting Sharpe. Fractional Kelly is common; layering uncertainty- and drawdown-aware shrinks on top means our sizing actually responds to the things that hurt Sharpe.
- Crash-resumable Arena agent. State persisted per-slug with atomic writes — designed from day one to survive the 2-week evaluation window without manual babysitting.
- Zero-key offline harness. The same brain scores on resolved markets with no API key, so calibration is reproducible and auditable.
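To make the regime belief concrete, here is a minimal sketch of a discrete Bayes filter over the four regimes, plus the entropy that would feed forecast uncertainty. The transition matrix, the observation matrix, the five observation bins, and the bin thresholds are all invented for the example; the write-up does not give the project's actual values.

```python
import math

STATES = ("bull", "bear", "sideways", "volatile")

# Hypothetical sticky transitions: TRANS[s][s2] = P(next = s2 | current = s).
TRANS = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]

# Hypothetical observation model: OBS[s][o] = P(bin o | state s),
# with bins 0..4 = big down, down, flat, up, big up.
OBS = [
    [0.05, 0.10, 0.25, 0.45, 0.15],  # bull
    [0.15, 0.45, 0.25, 0.10, 0.05],  # bear
    [0.02, 0.18, 0.60, 0.18, 0.02],  # sideways
    [0.35, 0.10, 0.10, 0.10, 0.35],  # volatile
]

def bin_price_change(delta, small=0.01, big=0.05):
    """Map a price change to one of five bins (thresholds are assumptions)."""
    if delta <= -big:
        return 0
    if delta < -small:
        return 1
    if delta <= small:
        return 2
    if delta < big:
        return 3
    return 4

def update_belief(belief, obs):
    """One filter step: predict with TRANS, weight by OBS, then normalize."""
    n = len(belief)
    predicted = [sum(TRANS[s][s2] * belief[s] for s in range(n)) for s2 in range(n)]
    unnorm = [OBS[s2][obs] * predicted[s2] for s2 in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        return predicted  # observation impossible under the model; keep prediction
    return [u / total for u in unnorm]

def normalized_entropy(belief):
    """Shannon entropy of the belief scaled to [0, 1] (1 = uniform)."""
    h = -sum(b * math.log(b) for b in belief if b > 0.0)
    return h / math.log(len(belief))
```

Starting from a uniform belief and feeding alternating big-up and big-down bins concentrates mass on the volatile state under these made-up matrices. The entropy value is what would be mixed into $\sigma$; it never touches the point forecast.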
What we learned
- The market mid is information, not noise. Building it in as a prior did more for our metrics than any prompt change.
- Log-odds space is the right place to do almost everything — tempering, pooling, opinion blending. Probability space is where bugs live.
- POMDPs are underused in trading. A tiny 4-state belief added more signal than another LLM call would have, at a fraction of the cost.
- Sharpe and PnL pull in different directions. Designing for the combined-rank metric — not pure PnL — changed every default we picked.
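The log-odds tempering and pooling mentioned above fit in a few lines. This is a sketch under assumptions rather than the vendored updater itself: the function names and the clipping constant `eps` are illustrative, and the temperature is applied to the LLM logit before the weighted fold, as the write-up describes.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 (eps is an assumed guard)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Numerically stable inverse of logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def temper(p, temperature):
    """Flatten an overconfident probability by dividing its logit by T >= 1."""
    return sigmoid(logit(p) / temperature)

def pool(p_prior, p_llm, w, temperature=1.0):
    """Log-odds opinion pool: market mid as prior, tempered LLM as opinion."""
    z = (1.0 - w) * logit(p_prior) + w * logit(p_llm) / temperature
    return sigmoid(z)
```

With $w = 0$ the posterior is just the market mid, and with $w = 1$, $T = 1$ it is the raw LLM forecast; anything in between makes the model earn its pull against the quote.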
What's next for everyday-intelli
- Value iteration over the regime POMDP so it can choose information-gathering trades, not just observe.
- Multi-forecaster ensembles with explicit per-model calibration weights; the log-odds pool already supports weighted blending.
- Online calibration: keep updating the calibration map as new resolutions land instead of fitting once.
- Walk-forward backtesting on longer historical Arena data to tune $w$, $T$, and $f_\text{kelly}$ rigorously instead of by feel.
- Hardening for the 2-week eval window: tighter rate-limit back-off, structured error telemetry, and a watchdog that restarts on any unhandled exit.