A lightweight, customizable benchmark runner for pi-coding-agent, inspired by opencode-bench.
pi-bench automates the process of testing an AI coding agent against real-world tasks. It does this by:
- Cloning a target repository to a temporary workspace (or using a pre-configured SWE-bench container).
- Checking out a specific baseline commit.
- Spinning up `pi-coding-agent` in the workspace with a predefined task prompt.
- Letting the agent use its tools (`read`, `bash`, `edit`, `write`) to complete the task.
- Capturing the generated patch (`git diff`).
- Running the test suite, either from a `testCommand` (curated tasks) or SWE-bench `FAIL_TO_PASS` tests (inside the container).
- Using a secondary LLM Judge (Gemini) to evaluate the patch and provide a rationale for the score.
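The clone, checkout, and patch-capture steps can be sketched as a plain list of git commands. This is an illustrative sketch, not the runner's actual code: the `TaskRef` type, the `planGitCommands` helper, and the workspace path are hypothetical, and it assumes an `owner/name` repo refers to GitHub.

```typescript
// Hypothetical task shape; only the fields this sketch needs.
interface TaskRef {
  repo: string;   // e.g. "chalk/chalk"
  commit: string; // tag or SHA used as the baseline
}

// Returns the git commands as argv arrays, in the order the steps describe.
// Assumption: "owner/name" repos are hosted on GitHub.
function planGitCommands(task: TaskRef, workspace: string): string[][] {
  const url = `https://github.com/${task.repo}.git`;
  return [
    ["git", "clone", url, workspace],
    ["git", "-C", workspace, "checkout", task.commit],
    // ... the agent edits files inside `workspace` at this point ...
    ["git", "-C", workspace, "diff"],
  ];
}

const plan = planGitCommands(
  { repo: "chalk/chalk", commit: "v5.3.0" },
  "/tmp/pi-bench-ws",
);
for (const argv of plan) console.log(argv.join(" "));
```

Note that a plain `git diff` does not include untracked files, so a runner that wants newly created files in the patch would likely need to stage them first.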
First, install the required dependencies (using bun or npm):
bun install

Benchmark tasks are defined as simple JSON files. See `tasks/curated/easy.json` for a reference:
{
"id": "curated-easy",
"repo": "chalk/chalk",
"commit": "v5.3.0",
"prompt": "There is a typo in the README.md file in the `chalk` repository. Please find the typo 'colos' and fix it to 'colors'.",
"expectedDiff": "diff --git a/README.md b/README.md\n...",
"testCommand": "npm install && npm test"
}

Note: `solutionCommit`, `expectedDiff`, and `testCommand` are optional. If `testCommand` is provided, the runner executes it in the workspace after the agent completes. An exit code of 0 automatically grants a perfect score, bypassing the subjective LLM judge.
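The task format and the exit-code rule can be expressed as a small type plus a validator. This is a minimal sketch under assumptions: `Task`, `parseTask`, and `needsJudge` are hypothetical names, the four remaining fields are assumed to be required because the note lists only the other three as optional, and a failing test command is assumed to fall through to the judge.

```typescript
interface Task {
  id: string;
  repo: string;
  commit: string;
  prompt: string;
  solutionCommit?: string;
  expectedDiff?: string;
  testCommand?: string;
}

// Assumption: everything not listed as optional is required.
const REQUIRED = ["id", "repo", "commit", "prompt"] as const;
const OPTIONAL = ["solutionCommit", "expectedDiff", "testCommand"] as const;

function parseTask(json: string): Task {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("task must be a JSON object");
  }
  const obj = raw as Record<string, unknown>;
  for (const key of REQUIRED) {
    if (typeof obj[key] !== "string" || obj[key] === "") {
      throw new Error(`missing or empty required field: ${key}`);
    }
  }
  for (const key of OPTIONAL) {
    if (obj[key] !== undefined && typeof obj[key] !== "string") {
      throw new Error(`optional field must be a string: ${key}`);
    }
  }
  return obj as unknown as Task;
}

// The judge is bypassed only when a testCommand exists and exited with 0.
// Assumption: any other outcome is handed to the judge for scoring.
function needsJudge(task: Task, testExitCode: number | null): boolean {
  if (task.testCommand === undefined) return true;
  return testExitCode !== 0;
}

const task = parseTask(JSON.stringify({
  id: "curated-easy",
  repo: "chalk/chalk",
  commit: "v5.3.0",
  prompt: "Fix the typo 'colos' in README.md.",
  testCommand: "npm install && npm test",
}));
console.log(task.id, needsJudge(task, 0));
```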
pi-bench supports multiple datasets to evaluate the agent's performance.
A highly curated subset of 50 verified tasks from the SWE-bench dataset. This is the recommended dataset for rapid, high-quality evaluation as it tests a broad set of capabilities without taking days to run.
To download and import this dataset directly from HuggingFace, simply run:
./scripts/download-swe-mini.sh

This automatically generates the 50 task files inside the `tasks/verified-mini/` directory.
SWE-bench tasks run inside official SWE-bench Docker containers from ghcr.io/epoch-research/swe-bench.eval.x86_64.*. Each task gets its own container with:
- The correct Python version (e.g. Python 3.6 for Django 3.1, Python 3.8+ for Sphinx)
- All dependencies pre-installed
- The repository checked out at the right commit in `/testbed`
This eliminates the environment mismatch problems that plague host-side execution.
Download all 49 container images upfront (~2.4 GB download, ~6 GB on disk due to heavy layer sharing):
./scripts/pull-swe-containers.sh

You can configure and use both local and cloud-based models as the backend engine for the pi-coding-agent.
Local providers are configured in models.json in the project root. By default:
- `llama.cpp` expects a local server running at `http://localhost:8080/v1`
- `ds4` expects a local `ds4.c` server running at `http://localhost:8000/v1`
When using a local provider, you do not need to specify a model name via --model. pi-bench will automatically query the local provider's /v1/models endpoint to retrieve the active model name and format the results directory accordingly. Whatever model your local server is currently running will be used.
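Model auto-detection of this kind can be sketched as follows. The sketch assumes the local server returns the usual OpenAI-compatible list shape (`{ "data": [{ "id": ... }] }`) and that the first entry is the active model; `firstModelId` and `detectModel` are hypothetical helper names, not pi-bench functions.

```typescript
// Shape of an OpenAI-compatible GET /v1/models response (only the part used here).
interface ModelList {
  data: { id: string }[];
}

// Pick the first reported model id, or fail loudly if there is none.
// Assumption: the first entry is the model the server is currently running.
function firstModelId(body: unknown): string {
  const list = body as Partial<ModelList> | null;
  const id = list?.data?.[0]?.id;
  if (typeof id !== "string" || id === "") {
    throw new Error("no model reported by /v1/models");
  }
  return id;
}

// baseUrl is the provider endpoint, e.g. "http://localhost:8080/v1".
async function detectModel(baseUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/models`);
  if (!res.ok) throw new Error(`GET ${baseUrl}/models failed: ${res.status}`);
  return firstModelId(await res.json());
}

// Offline example with a hand-written response body (the id is made up).
const detected = firstModelId({ object: "list", data: [{ id: "example-model.gguf" }] });
console.log(detected);
```

With a server running, `await detectModel("http://localhost:8080/v1")` would perform the same lookup over HTTP.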
Example: Running with llama.cpp
./run-swe-bench.sh tasks/verified-mini/ \
--provider llama.cpp \
--judge-model google/gemini-3.1-pro-preview \
--platform strix-halo \
--rocm-version 7.2.4 \
--timeout 45

Example: Running with ds4
./run-swe-bench.sh tasks/verified-mini/ \
--provider ds4 \
--judge-model google/gemini-3.1-pro-preview \
--platform strix-halo \
--rocm-version 7.2.4 \
--timeout 45

For cloud providers like OpenRouter, the provider endpoint is queried. Because these platforms host many models, you must specify which model to run using the `--model` flag.
Example: Running with OpenRouter
./run-swe-bench.sh tasks/verified-mini/django__django-11790.json \
--provider openrouter \
--model deepseek/deepseek-v4-flash \
--judge-model google/gemini-3.1-pro-preview \
--platform openrouter \
--timeout 30

After the agent finishes editing code, the runner:
- Applies the test patch from the SWE-bench dataset (adds the regression tests)
- Runs the `FAIL_TO_PASS` tests inside the container using the correct Python and test runner
- Score is ground truth: if the tests pass, `score = 1`; if they fail, `score = 0`
- The LLM Judge (Gemini) receives both the diff and the test results, and provides a human-readable rationale explaining why the fix worked or didn't
This combines the objectivity of SWE-bench's test-based evaluation with the explainability of an LLM judge.
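The binary scoring rule can be written down in a few lines. This is a sketch only: `sweScore`, the `TestOutcome` type, and the test names are hypothetical, and treating an empty `FAIL_TO_PASS` list as a failure is an assumption rather than documented behavior.

```typescript
type TestOutcome = "passed" | "failed" | "error";

// score = 1 only when every FAIL_TO_PASS test passed; otherwise 0.
function sweScore(
  failToPass: string[],
  outcomes: Record<string, TestOutcome>,
): 0 | 1 {
  // Assumption: a task with nothing to verify is scored as a failure.
  if (failToPass.length === 0) return 0;
  // A test missing from `outcomes` (never ran) also counts as not passed.
  return failToPass.every((name) => outcomes[name] === "passed") ? 1 : 0;
}

const fixed = sweScore(["test_a", "test_b"], { test_a: "passed", test_b: "passed" });
const broken = sweScore(["test_a", "test_b"], { test_a: "passed", test_b: "failed" });
console.log(fixed, broken);
```

The judge's rationale is attached alongside this score; in this scheme it explains the result but does not change it.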
For non-SWE-bench tasks (curated, custom), use the Docker runner:
./run-docker.sh tasks/curated/ \
--provider llama.cpp \
--judge-model google/gemini-3.1-pro-preview \
--platform strix-halo \
--timeout 30

Running the benchmark locally executes the agent on your host machine.
bun run src/index.ts tasks/curated/easy.json

| Flag | Description | Default |
|---|---|---|
| `--provider <name>` | Inference provider: `llama.cpp`, `ds4`, or `openrouter` | `llama.cpp` |
| `--model <model-id>` | Model ID within the provider (e.g. `deepseek/deepseek-v4-flash`) | Auto-detected |
| `--judge-model <provider/id>` | Judge model (e.g. `google/gemini-3.1-pro-preview`) | Same as agent |
| `--port <port>` | Override the local server port | 8080 (`llama.cpp`), 8000 (`ds4`) |
| `--engine <name>` | Backward-compatible alias for `--provider` | - |
Local providers (llama.cpp, ds4) auto-detect the model name by querying the local server's /v1/models endpoint. No --model needed.
Cloud providers (openrouter) require --model to specify which model to use, since the provider may host many models.
Backward compatibility: --model openrouter/deepseek/deepseek-v4-flash (without --provider) still works — the provider is parsed from the first path segment.
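The flag rules above (default provider, default ports, required `--model` for cloud providers, and the backward-compatible `--model` prefix) can be sketched as one resolution function. `resolveProvider`, `CliFlags`, and `Resolved` are hypothetical names, and the choice to split the prefix only when it matches a known provider is an assumption.

```typescript
type Provider = "llama.cpp" | "ds4" | "openrouter";

interface CliFlags {
  provider?: string;
  engine?: string; // backward-compatible alias for --provider
  model?: string;
  port?: number;
}

interface Resolved {
  provider: Provider;
  model: string | null; // null means "auto-detect from the local server"
  port: number | null;  // null for cloud providers
}

const PROVIDERS: readonly string[] = ["llama.cpp", "ds4", "openrouter"];
const DEFAULT_PORTS: Record<string, number> = { "llama.cpp": 8080, ds4: 8000 };

function resolveProvider(flags: CliFlags): Resolved {
  let name = flags.provider ?? flags.engine;
  let model = flags.model ?? null;

  // Backward compatibility: read the provider from the first path segment of --model.
  // Assumption: only a known provider name is treated as a prefix.
  if (name === undefined && model !== null) {
    const slash = model.indexOf("/");
    const head = slash === -1 ? "" : model.slice(0, slash);
    if (PROVIDERS.includes(head)) {
      name = head;
      model = model.slice(slash + 1);
    }
  }

  name = name ?? "llama.cpp";
  if (!PROVIDERS.includes(name)) throw new Error(`unknown provider: ${name}`);
  const provider = name as Provider;

  if (provider === "openrouter") {
    if (model === null || model === "") {
      throw new Error("--model is required for openrouter");
    }
    return { provider, model, port: null };
  }
  return { provider, model, port: flags.port ?? DEFAULT_PORTS[provider] ?? null };
}

const legacy = resolveProvider({ model: "openrouter/deepseek/deepseek-v4-flash" });
console.log(legacy.provider, legacy.model);
```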
| Flag | Description | Default |
|---|---|---|
| `--platform <id>` | Save results to `benchmark_results/<platform>/` | - |
| `--model-tag <tag>` | Append a suffix to the results directory (e.g. `mtp`) | - |
| `--rocm-version <ver>` | ROCm version running the backend | 7.2.4 |
| `--timeout <minutes>` | Agent timeout per task | 30 |
# Local llama.cpp (auto-detects model from server)
./run-swe-bench.sh tasks/verified-mini/ \
--judge-model google/gemini-3.1-pro-preview \
--platform strix-halo \
--rocm-version 7.2.4 \
--timeout 45
# Local ds4 server on custom port
./run-swe-bench.sh tasks/verified-mini/ \
--provider ds4 --port 9000 \
--judge-model google/gemini-3.1-pro-preview \
--platform strix-halo \
--rocm-version 7.2.4 \
--timeout 45
# OpenRouter cloud
./run-swe-bench.sh tasks/verified-mini/ \
--provider openrouter --model deepseek/deepseek-v4-flash \
--judge-model google/gemini-3.1-pro-preview \
--platform openrouter \
--timeout 30
# Single task, backward-compat style
./run-swe-bench.sh tasks/verified-mini/django__django-11790.json \
--model openrouter/deepseek/deepseek-v4-flash \
--judge-model google/gemini-3.1-pro-preview \
--platform openrouter \
--timeout 30

If you need to configure custom API endpoints or model parameters (like max tokens or context windows), edit the `models.json` file in the project root.
Create a .env file in the root pi-bench/ directory with your API keys:
GEMINI_API_KEY=...
OPENROUTER_API_KEY=...
Both run-docker.sh and run-swe-bench.sh automatically pass this file into the container.
When a single run completes, it outputs a JSON artifact to the current directory (e.g. results-curated-easy.json).
When running a batch (providing a directory like tasks/verified-mini/), pi-bench automatically generates a uniquely named directory for the results based on the model (e.g., Qwen3_6-35B-A3B-UD-Q8_K_XL_gguf_results/).
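One way such a directory name could be derived is sketched below. The sanitizing rule is a guess that happens to reproduce the example above when the model id is `Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf` (itself an assumed input); `resultsDirName` and `resultsPath` are hypothetical helpers, and the `--model-tag` suffix is left out because its exact placement is not described here.

```typescript
// Replace every character outside [A-Za-z0-9_-] with "_" and append "_results".
// Assumption: this rule is inferred from the documented example only.
function resultsDirName(model: string): string {
  return model.replace(/[^A-Za-z0-9_-]/g, "_") + "_results";
}

// With --platform, results are routed under benchmark_results/<platform>/.
function resultsPath(model: string, platform?: string): string {
  const dir = resultsDirName(model);
  return platform ? `benchmark_results/${platform}/${dir}` : dir;
}

const dirName = resultsDirName("Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf");
console.log(dirName);
console.log(resultsPath("Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf", "strix-halo"));
```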
pi-bench includes a dynamic HTML dashboard that can track results across multiple hardware platforms. To get your results onto the dashboard:
- Create your platform metadata: If it's a new platform, create a folder for it inside `benchmark_results/` and add a `platform.json` describing your hardware:

      mkdir -p benchmark_results/r9700

  `benchmark_results/r9700/platform.json`:

      { "id": "r9700", "name": "Radeon 9700", "gpu": "Radeon 9700 16GB", "ram": "32GB DDR5" }

- Run your benchmark with the `--platform` flag:

      ./run-swe-bench.sh tasks/verified-mini/ \
        --judge-model google/gemini-3.1-pro-preview \
        --platform r9700

  This automatically routes the results folder (e.g. `Qwen3_6..._results`) right into `benchmark_results/r9700/`.

- Generate the report: This script parses all new results in `benchmark_results/` and compiles them into a single `docs/data.json` file. The frontend dashboard (`app.js`) requires this JSON file to display data.

      bun run scripts/generate-report.ts

- Serve the dashboard: The dashboard is a static website. Serve the `docs/` folder, open your browser (e.g., `http://localhost:8082`), and the Vue frontend (`app.js`) will automatically load the updated `data.json`.

      python3 -m http.server 8082 -d docs/
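Before generating the report, it can help to sanity-check a `platform.json`. The sketch below is illustrative only: `checkPlatform` is a hypothetical helper, and both the "all four fields are required" rule and the "`id` matches the folder name" rule are assumptions drawn from the `r9700` example, not documented requirements.

```typescript
interface Platform {
  id: string;
  name: string;
  gpu: string;
  ram: string;
}

// Returns a list of problems; an empty list means the file looks usable.
function checkPlatform(raw: unknown, folder: string): string[] {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return ["platform.json must contain a JSON object"];
  }
  const obj = raw as Record<string, unknown>;
  const problems: string[] = [];
  // Assumption: every field shown in the example is required.
  for (const key of ["id", "name", "gpu", "ram"]) {
    if (typeof obj[key] !== "string" || obj[key] === "") {
      problems.push(`missing field: ${key}`);
    }
  }
  // Assumption: the id is expected to match the folder it lives in.
  const id = obj["id"];
  if (typeof id === "string" && id !== "" && id !== folder) {
    problems.push(`id "${id}" does not match folder "${folder}"`);
  }
  return problems;
}

const example: Platform = {
  id: "r9700",
  name: "Radeon 9700",
  gpu: "Radeon 9700 16GB",
  ram: "32GB DDR5",
};
console.log(checkPlatform(example, "r9700"));
```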