Benchmark LLMs by having them play Magic: The Gathering against each other — 1v1 duels and multiplayer Commander.
Built on XMage, a full rules engine with enforcement for 28,000+ unique cards. LLMs interact via MCP tools exposed by the bridge — they see the board state, choose actions, and play full games with no manual intervention.
- Java 17+ and Maven
- Python 3.11+ and uv
- FFmpeg (for video recording)
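A quick way to confirm the tools above are installed is to look them up on `PATH`. The sketch below is not part of the repo; it assumes the usual executable names (`java`, `mvn`, `python3`, `uv`, `ffmpeg`) and checks presence only, not versions.

```python
import shutil

# Executable names are assumptions (the usual ones for each tool);
# this checks presence on PATH only, not the minimum versions listed above.
REQUIRED_TOOLS = ["java", "mvn", "python3", "uv", "ffmpeg"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All prerequisite tools found on PATH.")
```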
Card images aren't included in the repo. Download them once via the XMage desktop client:
- Run `make run-client` to launch the client.
- Dismiss the "Unable connect to server" error; no server is needed for downloads.
- Click Download in the top toolbar.
- Download both mana symbols and card images separately. Pick a Scryfall source — "normal" is ~10 GB, "small" is ~1.5 GB.
- Close the client when done. Images are cached in `plugins/images/` and reused by all future runs.
export OPENROUTER_API_KEY="sk-..."
make run CONFIG=commander-gauntlet
This runs 4 LLM pilots against each other in a Commander game with streaming and video recording. Recordings and logs are saved to `~/.mage-bench/logs/`.
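The invocation above can also be scripted. The following sketch is a hypothetical wrapper, not part of the repo: it refuses to start an LLM game when `OPENROUTER_API_KEY` is unset and assembles the `make run` arguments. The helper names `build_make_command` and `run_llm_game` are assumptions made for illustration; only the `CONFIG` and `OUTPUT` make variables come from this README.

```python
import os
import subprocess


def build_make_command(config=None, output=None):
    """Assemble the argument list for `make run`.

    CONFIG and OUTPUT are the two make variables shown in this README;
    both are optional.
    """
    command = ["make", "run"]
    if config is not None:
        command.append(f"CONFIG={config}")
    if output is not None:
        command.append(f"OUTPUT={output}")
    return command


def run_llm_game(config, env=None):
    """Run `make run` for an LLM config, failing early if the API key is missing."""
    env = dict(os.environ if env is None else env)
    if not env.get("OPENROUTER_API_KEY"):
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return subprocess.run(build_make_command(config), env=env, check=True)
```

With the key exported, `run_llm_game("commander-gauntlet")` would be equivalent to the two shell lines above.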
Other configs:
# Default: no API keys needed (2 CPU Standard duel)
make run
# 1 LLM pilot + 3 CPU opponents
make run CONFIG=commander-1v3
# Long-lived test server (stays running between games)
make run CONFIG=modern-staller
# List all available configs
make configs
# Custom config file
make run CONFIG=path/to/my-config.json
# Record to a specific file
make run OUTPUT=/path/to/video.mov
After a game finishes, the puppeteer prompts to upload the recording to YouTube.
- Set up Google Cloud OAuth credentials (see `doc/youtube.md`).
- Save the client secrets to `~/.mage-bench/youtube-client-secrets.json`.
- To target a specific playlist, set `YOUTUBE_PLAYLIST_ID` in your environment or `.env` file. Defaults to the mage-bench playlist.
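The playlist lookup could work roughly as sketched below. This is an illustration under assumptions, not the puppeteer's actual code: it assumes the process environment takes precedence over `.env`, that `.env` holds plain `KEY=value` lines, and it uses a placeholder `DEFAULT_PLAYLIST_ID` because the real mage-bench playlist ID is not given here.

```python
import os

# Placeholder only: the real default playlist ID is not stated in this README.
DEFAULT_PLAYLIST_ID = "<mage-bench-playlist-id>"


def parse_dotenv(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_playlist_id(environ=None, dotenv_text=""):
    """Pick the playlist ID: environment first (assumed), then .env, then default."""
    environ = os.environ if environ is None else environ
    return (
        environ.get("YOUTUBE_PLAYLIST_ID")
        or parse_dotenv(dotenv_text).get("YOUTUBE_PLAYLIST_ID")
        or DEFAULT_PLAYLIST_ID
    )
```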
Three layers:
- XMage server: upstream game engine, handles rules enforcement and game state. Unmodified from upstream.
- Java clients (`Mage.Client.Headless`, `Mage.Client.Streaming`): the bridge lets LLMs play via MCP tool calls, and the spectator renders the game and records video.
- Puppeteer (`puppeteer/`): orchestrates everything by spawning processes, connecting LLMs to bridge clients, tracking costs, and managing recordings.
Game logic and XMage workarounds live in the Java bridge layer. The puppeteer stays simple.
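For orientation, an MCP tool call is a JSON-RPC 2.0 request with method `tools/call`, carrying the tool name and its arguments. The sketch below builds such a request. The tool name `pass_priority` and its argument are hypothetical, since the bridge's actual tool list is not documented in this README.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def make_tool_call(tool_name, arguments=None):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


# Hypothetical tool name and argument, for illustration only.
request = make_tool_call("pass_priority", {"until": "end_of_turn"})
print(json.dumps(request, indent=2))
```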
| Type | LLM? | Description |
|---|---|---|
| Pilot | Yes | Strategic LLM player — sees board state, chooses actions |
| Sleepwalker | No | MCP auto-player with chat, no LLM |
| CPU | No | XMage's built-in AI (COMPUTER_MAD) |
| Potato | No | Dumbest auto-player |
| Staller | No | Like potato but slow; stays connected between games |
Configure players in JSON config files (see `configs/`).
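The exact config schema lives in `configs/` and is not reproduced here. As a rough sketch, a config could list one entry per seat with a `type` drawn from the table above. The key names (`players`, `name`, `type`, `model`), the model string, and the rule that a pilot needs a `model` are all assumptions for illustration, not the project's real schema.

```python
import json

# Player types from the table above, mapped to whether they use an LLM.
PLAYER_TYPES = {
    "pilot": True,
    "sleepwalker": False,
    "cpu": False,
    "potato": False,
    "staller": False,
}

# Hypothetical config: key names and the model string are illustrative only.
EXAMPLE_CONFIG = """
{
  "players": [
    {"name": "Alice", "type": "pilot", "model": "example/model-name"},
    {"name": "Bob", "type": "cpu"}
  ]
}
"""


def validate_players(config):
    """Return a list of problems found in the (assumed) players section."""
    problems = []
    for index, player in enumerate(config.get("players", [])):
        player_type = player.get("type")
        if player_type not in PLAYER_TYPES:
            problems.append(f"player {index}: unknown type {player_type!r}")
        elif PLAYER_TYPES[player_type] and not player.get("model"):
            problems.append(f"player {index}: type {player_type!r} needs a model")
    return problems


print(validate_players(json.loads(EXAMPLE_CONFIG)))
```

Compare the key names against a real file in `configs/` before relying on a check like this.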
The spectator provides:
- Live game visualization (JavaFX)
- Video recording via FFmpeg
See AGENTS.md for development conventions, code isolation rules, and how to run things.
Based on XMage.