Inspiration

LLMs price prediction markets confidently but trade them badly: they ignore the quote already on the screen, swing across reruns, and have no notion of how much to bet. Prophet Arena's Trading Track scores on combined Sharpe + PnL rank — so a strategy that overbets on a noisy forecast is a strategy that loses. We wanted to wrap a small, careful Bayesian brain around the LLM, use it as one opinion among several, and let risk-adjusted return fall out of the math instead of being bolted on.

What it does

Project overview. everyday-intelli is an end-to-end Trading Track agent that plugs directly into the ai-prophet SDK harness. On every 15-minute BenchmarkSession tick it:

  1. Pulls the candidate market list and current quotes.
  2. Forecasts $P(\text{YES})$ for each market with an LLM, tempered by the market mid-price as a Bayesian prior.
  3. Updates a per-market 4-state regime POMDP and folds its entropy into forecast uncertainty.
  4. Gates each market on expected value, then sizes survivors with fractional Kelly — shrunk by belief uncertainty and current drawdown.
  5. Submits intents through the SDK, finalizes, and advances. State persists per-slug, so the agent is resumable across crashes and safe to leave running for the full 2-week evaluation window.
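Step 2 can be sketched in a few lines. This is a minimal illustration rather than the project's vendored updater: the function names and the default values for the LLM weight `w` and the temperature are assumptions chosen for the example.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 so the log stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def pool_forecast(p_prior: float, p_llm: float,
                  w: float = 0.3, temperature: float = 1.5) -> float:
    """Blend the market mid (prior) with a tempered LLM forecast in log-odds space.

    w and temperature defaults are hypothetical, not tuned values.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if temperature < 1.0:
        raise ValueError("temperature must be >= 1")
    # Dividing the log-odds by the temperature pulls the LLM forecast toward 0.5.
    tempered_llm = logit(p_llm) / temperature
    pooled = (1.0 - w) * logit(p_prior) + w * tempered_llm
    return sigmoid(pooled)
```

With `w = 0` the function returns the market mid unchanged, and with a low `w` even a 0.95 LLM forecast only nudges a 0.50 quote, which is the behaviour the prior is meant to enforce.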

The same brain runs in an offline harness (no API key required) that scores Brier, log loss, accuracy, and a synthetic PnL on resolved markets — so we could iterate on the math without burning live ticks.
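The three forecast-quality metrics named above are standard and easy to compute on resolved markets. The sketch below assumes outcomes are encoded as 1 for YES and 0 for NO; the function name and the returned dictionary layout are illustrative, and the synthetic PnL score is not covered.

```python
import math
from typing import Sequence


def score_forecasts(probs: Sequence[float], outcomes: Sequence[int],
                    eps: float = 1e-12) -> dict:
    """Return Brier score, log loss, and accuracy for binary forecasts."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    n = len(probs)
    # Brier score: mean squared error between forecast and outcome.
    brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
    # Log loss: mean negative log-likelihood, clipped to avoid log(0).
    log_loss = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for p, y in zip(probs, outcomes)
    ) / n
    # Accuracy: fraction of markets where the forecast lands on the right side of 0.5.
    accuracy = sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes)) / n
    return {"brier": brier, "log_loss": log_loss, "accuracy": accuracy}
```

Lower is better for Brier and log loss. Log loss punishes confident misses far harder than Brier does, so reporting both helps separate calibration problems from plain inaccuracy.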

How we built it

Architecture & design decisions.

  • Model choice. Provider-agnostic LLM via an OpenAI-compatible endpoint — OpenRouter by default, but LLM_BASE_URL / LLM_MODEL switches it to any compatible host. Falls back to a deterministic base-rate heuristic when no key is set, so the pipeline always runs. Forecasts are disk-cached by content hash so reruns are free.
  • Data sources. The Prophet Arena BenchmarkSession.load_candidates feed (question text, rules, resolution time, live best bid/ask), and the agent's own per-market price history persisted in data/prophet/regime_state.json.
  • Bayesian core. A vendored log-odds belief updater. The market mid-price becomes the prior; the LLM forecast is folded in as a log-odds opinion pool: $$\text{logit}(p_\text{post}) = (1-w)\,\text{logit}(p_\text{prior}) + w\,\text{logit}(p_\text{llm})$$ with $w \in [0,1]$ controlling LLM trust, and a temperature $T \geq 1$ flattening overconfident model outputs before the fold.
  • Latent regime POMDP. Per market, a 4-state belief over {bull, bear, sideways, volatile} updated from binned price-change observations: $$b'(s') = \eta \cdot O(o\mid s') \sum_s T(s'\mid s)\,b(s)$$ Shannon entropy of the belief feeds back into forecast uncertainty, which shrinks Kelly sizing in volatile regimes without biasing the LLM-vs-prior weighting.
  • Strategy logic, built for Sharpe. Trades must clear an EV threshold. Survivors are sized by fractional Kelly with three multiplicative shrinks: $$f = f_\text{kelly} \cdot \text{kelly}^*(p, q) \cdot (1 - 2\sigma)_+ \cdot \big(1 - \tfrac{\text{dd}}{\text{dd}_\text{max}}\big)_+$$ where $(x)_+ = \max(x, 0)$, $\sigma$ is belief spread, and $\text{dd}$ is current drawdown off peak equity. The drawdown shrink linearly de-risks as the book bleeds (the Sharpe-friendly choice) and zeroes out at $\text{dd}_\text{max}$. Position caps and gross-exposure caps from the Prophet Arena ruleset are enforced before submission.
  • Resilience. Every file write is atomic (tmp + replace). Tick-level operations are idempotent. Re-running with the same PA_SLUG resumes the existing experiment instead of starting over — important for the 2-week eval window where a crash mid-run shouldn't cost a day of trades.
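The sizing rule in the strategy bullet can be sketched as follows. Several pieces here are assumptions: `kelly_star` uses the textbook Kelly fraction for buying a binary YES contract at price `q` with belief `p`, the EV gate is taken to be the per-contract edge `p - q`, and every default value is a placeholder rather than the project's setting. Only the YES side is shown; the NO side follows by swapping in `1 - p` and `1 - q`.

```python
def kelly_star(p: float, q: float) -> float:
    """Full-Kelly bankroll fraction for buying YES at price q with belief p.

    Assumed form: (p - q) / (1 - q), floored at zero when there is no edge.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("price q must lie strictly between 0 and 1")
    return max(0.0, (p - q) / (1.0 - q))


def position_fraction(p: float, q: float, sigma: float, drawdown: float,
                      f_kelly: float = 0.25, dd_max: float = 0.20,
                      ev_threshold: float = 0.02) -> float:
    """Bankroll fraction to stake on YES after the EV gate and all shrinks.

    f_kelly, dd_max, and ev_threshold defaults are hypothetical.
    """
    # EV gate: expected profit per contract must clear the threshold.
    if p - q < ev_threshold:
        return 0.0
    # Uncertainty shrink (1 - 2*sigma)_+ : wide belief spread cuts size to zero.
    uncertainty_shrink = max(0.0, 1.0 - 2.0 * sigma)
    # Drawdown shrink (1 - dd/dd_max)_+ : linear de-risking, zero at dd_max.
    drawdown_shrink = max(0.0, 1.0 - drawdown / dd_max)
    return f_kelly * kelly_star(p, q) * uncertainty_shrink * drawdown_shrink
```

For example, with `p = 0.7`, `q = 0.5`, `sigma = 0.1`, and a 5% drawdown, the factors are 0.25, 0.4, 0.8, and 0.75, giving a stake of 6% of bankroll before any position or exposure caps.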

Challenges we ran into

  • LLM overconfidence. Raw model probabilities cluster around 0.05 and 0.95. Log-odds temperature scaling ($\text{logit}/T$ with $T>1$) before the prior fold made calibration usable.
  • Sizing under uncertainty. Full Kelly on a 0.7 prior is a great way to blow up. It took layering three shrink factors (fractional Kelly, belief uncertainty, drawdown) before sizing felt right, and that was the difference between a Sharpe-positive agent and a coin-flip one.
  • Resumability against an unforgiving lifecycle. Leases expire and ticks must be finalized — partial state would have corrupted experiments. Atomic persistence + idempotent per-tick steps fixed it.
  • Holding both sides. Early versions tried to BUY NO on a market we already held YES on; the server (correctly) rejects it. We now gate on held_side before deciding.
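The atomic persistence mentioned in the resumability item follows a common pattern: write to a temporary file in the same directory, then swap it into place. The sketch below is an illustration of that pattern, not the project's code; the helper names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, payload) -> None:
    """Write payload as JSON so readers see either the old file or the new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the swap
        os.replace(tmp_name, path)  # replaces any existing file in one step
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_json(path, default):
    """Load saved state, or return default on a first run with no file yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

A crash during the write leaves the previous state file untouched, so a restarted run loads the last fully saved state and repeats the interrupted step.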

Accomplishments that we're proud of

Key innovations — what makes our approach distinct:

  1. Market quote as Bayesian prior, not competitor. Most LLM-trading agents treat the price as something to beat. We treat it as the prior of a log-odds opinion pool, so the LLM has to earn its pull. This is the single biggest stability win.
  2. Latent-regime POMDP that feeds uncertainty, not direction. A bull belief doesn't shift our $P(\text{YES})$ — it shrinks our bet size. Regime info shows up where it should: in risk, not in point estimates.
  3. Three-stage Kelly shrink targeting Sharpe. Fractional Kelly is common; layering uncertainty- and drawdown-aware shrinks on top means our sizing actually responds to the things that hurt Sharpe.
  4. Crash-resumable Arena agent. State persisted per-slug with atomic writes — designed from day one to survive the 2-week evaluation window without manual babysitting.
  5. Zero-key offline harness. The same brain scores on resolved markets with no API key, so calibration is reproducible and auditable.
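The regime filter behind item 2 is a standard discrete Bayes filter. In the sketch below the four state names come from the project, but the transition matrix, the three observation bins (down, flat, up), and the observation likelihoods are made-up numbers for illustration only.

```python
import math

STATES = ("bull", "bear", "sideways", "volatile")

# Hypothetical sticky transitions: TRANSITION[s][s_next] = T(s_next | s).
TRANSITION = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]

# Hypothetical likelihoods: OBSERVATION[s][o] = O(o | s), bins 0=down, 1=flat, 2=up.
OBSERVATION = [
    [0.15, 0.25, 0.60],  # bull: mostly up moves
    [0.60, 0.25, 0.15],  # bear: mostly down moves
    [0.15, 0.70, 0.15],  # sideways: mostly flat
    [0.45, 0.10, 0.45],  # volatile: large moves either way
]


def update_belief(belief, obs):
    """One filter step: predict with the transitions, then reweight by the observation."""
    n = len(STATES)
    predicted = [sum(TRANSITION[s][s_next] * belief[s] for s in range(n))
                 for s_next in range(n)]
    unnormalized = [OBSERVATION[s_next][obs] * predicted[s_next] for s_next in range(n)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("observation has zero likelihood under every state")
    return [u / total for u in unnormalized]  # dividing by total is the eta term


def normalized_entropy(belief):
    """Shannon entropy of the belief, scaled to [0, 1] by log(number of states)."""
    h = -sum(b * math.log(b) for b in belief if b > 0.0)
    return h / math.log(len(belief))
```

Starting from a uniform belief, a run of up-moves concentrates mass on `bull` and lowers the entropy. Only that entropy would flow into sizing; the regime label itself never touches the probability estimate.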

What we learned

  • The market mid is information, not noise. Building it in as a prior did more for our metrics than any prompt change.
  • Log-odds space is the right place to do almost everything — tempering, pooling, opinion blending. Probability space is where bugs live.
  • POMDPs are underused in trading. A tiny 4-state belief added more signal than another LLM call would have, at a fraction of the cost.
  • Sharpe and PnL pull in different directions. Designing for the combined-rank metric — not pure PnL — changed every default we picked.

What's next for everyday-intelli

  • Value iteration over the regime POMDP so it can choose information-gathering trades, not just observe.
  • Multi-forecaster ensembles with explicit per-model calibration weights — the log-odds pool already supports weighted blending.
  • Online calibration: keep updating the calibration map as new resolutions land instead of fitting once.
  • Walk-forward backtesting on longer historical Arena data to tune $w$, $T$, and $f_\text{kelly}$ rigorously instead of by feel.
  • Hardening for the 2-week eval window: tighter rate-limit back-off, structured error telemetry, and a watchdog that restarts on any unhandled exit.

Built With

  • ai-prophet-core
  • bayesian-inference
  • kelly-criterion
  • llm-forecasting
  • log-odds-pooling
  • numpy
  • openai-compatible-api
  • openrouter
  • pomdp
  • prophet-arena-sdk
  • python
  • scipy
  • sharpe-ratio