everyday-intelli

Inspiration

LLMs price prediction markets confidently but trade them badly: they ignore the quote already on the screen, swing across reruns, and have no notion of how much to bet. Prophet Arena's Trading Track scores on a combined Sharpe + PnL rank, so a strategy that overbets on a noisy forecast is a strategy that loses. We wanted to wrap a small, careful Bayesian brain around the LLM, treat the model as one opinion among several, and let risk-adjusted return fall out of the math instead of being bolted on.

What it does

Project overview. everyday-intelli is an end-to-end Trading Track agent that plugs directly into the ai-prophet SDK harness. On every 15-minute BenchmarkSession tick it:

Pulls the candidate market list and current quotes.
Forecasts $P(\text{YES})$ for each market with an LLM, tempered by the market mid-price as a Bayesian prior.
Updates a per-market 4-state regime POMDP and folds its entropy into forecast uncertainty.
Gates each market on expected value, then sizes survivors with fractional Kelly — shrunk by belief uncertainty and current drawdown.
Submits intents through the SDK, finalizes, and advances. State persists per-slug, so the agent is resumable across crashes and safe to leave running for the full 2-week evaluation window.
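
The per-slug persistence in the last step can be illustrated with a tmp + replace write, the pattern named under Resilience. This is a minimal sketch rather than the project's actual code: the helper names `atomic_write_json` and `load_state` are hypothetical, and the layout of the state file is not specified in this writeup.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write payload as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path, default):
    """Return the persisted state, or default when no file exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

On restart, a call such as `load_state(path, {})` either sees the last fully written state or the default, never a truncated file, which is what makes re-running a tick safe.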

The same brain runs in an offline harness (no API key required) that scores Brier, log loss, accuracy, and a synthetic PnL on resolved markets — so we could iterate on the math without burning live ticks.
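
For reference, the three calibration metrics named above can be computed as follows. This is a generic sketch of the standard definitions, not the harness itself; the function names are ours, and the synthetic PnL is omitted because its definition is not given here.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast P(YES) and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def accuracy(probs, outcomes, threshold=0.5):
    """Share of markets where the thresholded forecast matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)
```

Lower is better for Brier and log loss. Log loss punishes confident misses much harder than Brier does, which is why the two are worth tracking together.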

How we built it

Architecture & design decisions.

Model choice. Provider-agnostic LLM via an OpenAI-compatible endpoint — OpenRouter by default, but LLM_BASE_URL / LLM_MODEL switches it to any compatible host. Falls back to a deterministic base-rate heuristic when no key is set, so the pipeline always runs. Forecasts are disk-cached by content hash so reruns are free.
Data sources. The Prophet Arena BenchmarkSession.load_candidates feed (question text, rules, resolution time, live best bid/ask), and the agent's own per-market price history persisted in data/prophet/regime_state.json.
Bayesian core. A vendored log-odds belief updater. The market mid-price becomes the prior; the LLM forecast is folded in as a log-odds opinion pool: $$\text{logit}(p_\text{post}) = (1-w)\,\text{logit}(p_\text{prior}) + w\,\text{logit}(p_\text{llm})$$ with $w \in [0,1]$ controlling LLM trust, and a temperature $T \geq 1$ flattening overconfident model outputs before the fold.
Latent regime POMDP. Per market, a 4-state belief over {bull, bear, sideways, volatile} updated from binned price-change observations: $$b'(s') = \eta \cdot O(o\mid s') \sum_s T(s'\mid s)\,b(s)$$ Here $T(s'\mid s)$ is the regime transition model (not the forecast temperature $T$ above), $O(o\mid s')$ is the observation likelihood, and $\eta$ normalizes the belief to sum to 1. Shannon entropy of the belief feeds back into forecast uncertainty, which shrinks Kelly sizing in volatile regimes without biasing the LLM-vs-prior weighting.
Strategy logic, built for Sharpe. Trades must clear an EV threshold. Survivors are sized by fractional Kelly with three multiplicative shrinks: $$f = f_\text{kelly} \cdot \text{kelly}^*(p, q) \cdot (1 - 2\sigma)_+ \cdot \big(1 - \tfrac{\text{dd}}{\text{dd}_\text{max}}\big)_+$$ where $(x)_+ = \max(x, 0)$, $\sigma$ is belief spread, and $\text{dd}$ is current drawdown off peak equity. The drawdown shrink linearly de-risks as the book bleeds (the Sharpe-friendly choice) and zeroes out at $\text{dd}_\text{max}$. Position caps and gross-exposure caps from the Prophet Arena ruleset are enforced before submission.
Resilience. Every file write is atomic (tmp + replace). Tick-level operations are idempotent. Re-running with the same PA_SLUG resumes the existing experiment instead of starting over — important for the 2-week eval window where a crash mid-run shouldn't cost a day of trades.
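
The log-odds opinion pool from the Bayesian core item is easy to sketch. The code below is a minimal illustration of that formula, assuming the temperature is applied to the LLM's logit before pooling. The function names and the default values of `w` and `temperature` are placeholders, not the tuned settings.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped so 0 and 1 do not produce infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def pool_forecast(p_prior, p_llm, w=0.3, temperature=1.5):
    """Blend the market mid (prior) with a tempered LLM forecast in log-odds space."""
    # Dividing the logit by T >= 1 pulls extreme LLM outputs back toward 0.5.
    tempered = logit(p_llm) / temperature
    pooled = (1.0 - w) * logit(p_prior) + w * tempered
    return sigmoid(pooled)
```

With a mid of 0.50 and an LLM forecast of 0.95, these placeholder settings give a posterior of roughly 0.64: the model moves the estimate, but it has to earn its pull against the quote.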

Challenges we ran into

LLM overconfidence. Raw model probabilities cluster around 0.05 and 0.95. Log-odds temperature scaling ($\text{logit}/T$ with $T>1$) before the prior fold made calibration usable.
Sizing under uncertainty. Full Kelly on a 0.7 prior is a great way to blow up. Layering three shrink factors (fractional Kelly, belief uncertainty, drawdown) on top of the raw Kelly stake was the difference between a Sharpe-positive agent and a coin-flip one.
Resumability against an unforgiving lifecycle. Leases expire and ticks must be finalized — partial state would have corrupted experiments. Atomic persistence + idempotent per-tick steps fixed it.
Holding both sides. Early versions tried to BUY NO on a market we already held YES on; the server (correctly) rejects it. We now gate on held_side before deciding.
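
The sizing and gating fixes above can be sketched together. This is an illustration under stated assumptions, not the agent's code: `kelly_binary` uses the textbook Kelly fraction for a contract that costs `q` and pays 1, which is our reading of $\text{kelly}^*(p, q)$. The EV gate is modeled as a minimum edge over the ask, and `f_kelly`, `dd_max`, and `ev_min` are placeholder values.

```python
def kelly_binary(p, q):
    """Full-Kelly bankroll fraction for a contract bought at price q that pays 1."""
    if not 0.0 < q < 1.0:
        return 0.0
    return max(0.0, (p - q) / (1.0 - q))


def size_position(p, q, sigma, drawdown, f_kelly=0.25, dd_max=0.2):
    """Fractional Kelly shrunk by belief spread and by drawdown off peak equity."""
    uncertainty_shrink = max(0.0, 1.0 - 2.0 * sigma)
    drawdown_shrink = max(0.0, 1.0 - drawdown / dd_max)
    return f_kelly * kelly_binary(p, q) * uncertainty_shrink * drawdown_shrink


def decide(p_yes, yes_ask, no_ask, sigma, drawdown, held_side=None, ev_min=0.02):
    """Return (side, bankroll_fraction), or (None, 0.0) when no trade passes the gates."""
    candidates = [("YES", p_yes, yes_ask), ("NO", 1.0 - p_yes, no_ask)]
    # Pick the side with the larger expected edge per contract.
    side, p, price = max(candidates, key=lambda c: c[1] - c[2])
    if p - price < ev_min:
        return None, 0.0  # fails the EV gate
    if held_side is not None and held_side != side:
        return None, 0.0  # never buy the opposite side of an open position
    return side, size_position(p, price, sigma, drawdown)
```

For example, `decide(0.7, 0.6, 0.45, sigma=0.1, drawdown=0.05)` picks YES and sizes it at about 3.75% of bankroll: full Kelly of 25% is cut by the 0.25 fraction, then by 0.8 for uncertainty and 0.75 for drawdown.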

Accomplishments that we're proud of

Key innovations — what makes our approach distinct:

Market quote as Bayesian prior, not competitor. Most LLM-trading agents treat the price as something to beat. We treat it as the prior of a log-odds opinion pool, so the LLM has to earn its pull. This is the single biggest stability win.
Latent-regime POMDP that feeds uncertainty, not direction. A bull belief doesn't shift our $P(\text{YES})$ — it shrinks our bet size. Regime info shows up where it should: in risk, not in point estimates.
Three-stage Kelly shrink targeting Sharpe. Fractional Kelly is common; layering uncertainty- and drawdown-aware shrinks on top means our sizing actually responds to the things that hurt Sharpe.
Crash-resumable Arena agent. State persisted per-slug with atomic writes — designed from day one to survive the 2-week evaluation window without manual babysitting.
Zero-key offline harness. The same brain scores on resolved markets with no API key, so calibration is reproducible and auditable.
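
To make the regime belief concrete, here is a small sketch of the 4-state filter update and the entropy it feeds into risk. The transition and observation matrices below are made-up illustrative numbers, the four observation bins are an assumption, and the mapping from entropy to the belief spread used in sizing is not shown because this writeup does not specify it.

```python
import math

STATES = ("bull", "bear", "sideways", "volatile")

# Hypothetical sticky transition model TRANSITION[s][s2] = P(s2 | s); rows sum to 1.
TRANSITION = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]

# Hypothetical observation model OBSERVATION[s2][o] over binned price changes,
# with o in {0: down, 1: flat, 2: up, 3: big move}; rows sum to 1.
OBSERVATION = [
    [0.10, 0.25, 0.55, 0.10],  # bull
    [0.55, 0.25, 0.10, 0.10],  # bear
    [0.15, 0.65, 0.15, 0.05],  # sideways
    [0.20, 0.10, 0.20, 0.50],  # volatile
]


def update_belief(belief, obs):
    """One Bayes-filter step: predict through TRANSITION, weight by OBSERVATION, normalize."""
    n = len(belief)
    predicted = [sum(TRANSITION[s][s2] * belief[s] for s in range(n)) for s2 in range(n)]
    unnormalized = [OBSERVATION[s2][obs] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def normalized_entropy(belief):
    """Shannon entropy scaled to [0, 1]; 1 means the regime is completely unknown."""
    h = -sum(b * math.log(b) for b in belief if b > 0.0)
    return h / math.log(len(belief))
```

Starting from a uniform belief, a single "up" observation shifts most of the mass to bull and lowers the entropy. The point estimate of $P(\text{YES})$ is untouched; only the uncertainty that drives sizing changes.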

What we learned

The market mid is information, not noise. Building it in as a prior did more for our metrics than any prompt change.
Log-odds space is the right place to do almost everything — tempering, pooling, opinion blending. Probability space is where bugs live.
POMDPs are underused in trading. A tiny 4-state belief added more signal than another LLM call would have, at a fraction of the cost.
Sharpe and PnL pull in different directions. Designing for the combined-rank metric — not pure PnL — changed every default we picked.

What's next for everyday-intelli

Value iteration over the regime POMDP so it can choose information-gathering trades, not just observe.
Multi-forecaster ensembles with explicit per-model calibration weights — the log-odds pool already supports weighted blending.
Online calibration: keep updating the calibration map as new resolutions land instead of fitting once.
Walk-forward backtesting on longer historical Arena data to tune $w$, $T$, and $f_\text{kelly}$ rigorously instead of by feel.
Hardening for the 2-week eval window: tighter rate-limit back-off, structured error telemetry, and a watchdog that restarts on any unhandled exit.

Built With

ai-prophet-core
bayesian-inference
kelly-criterion
llm-forecasting
log-odds-pooling
numpy
openai-compatible-api
openrouter
pomdp
prophet-arena-sdk
python
scipy
sharpe-ratio

Updates

Brown A started this project — May 17, 2026 06:52 AM EDT
