Pi-Bench

A lightweight, customizable benchmark runner for pi-coding-agent, inspired by opencode-bench.

Overview

pi-bench automates the process of testing an AI coding agent against real-world tasks. It does this by:

Cloning a target repository to a temporary workspace (or using a pre-configured SWE-bench container).
Checking out a specific baseline commit.
Spinning up pi-coding-agent in the workspace with a predefined task prompt.
Letting the agent use its tools (read, bash, edit, write) to complete the task.
Capturing the generated patch (git diff).
Running the test suite — either from a testCommand (curated tasks) or SWE-bench FAIL_TO_PASS tests (inside the container).
Using a secondary LLM Judge (Gemini) to evaluate the patch and provide a rationale for the score.

Setup

First, install the required dependencies (using bun or npm):

```
bun install
```

Defining Tasks

Benchmark tasks are defined as simple JSON files. See tasks/curated/easy.json for a reference:

```
{
  "id": "curated-easy",
  "repo": "chalk/chalk",
  "commit": "v5.3.0",
  "prompt": "There is a typo in the README.md file in the `chalk` repository. Please find the typo 'colos' and fix it to 'colors'.",
  "expectedDiff": "diff --git a/README.md b/README.md\n...",
  "testCommand": "npm install && npm test"
}
```

Note: solutionCommit, expectedDiff, and testCommand are optional. If testCommand is provided, the runner will execute it in the workspace after the agent completes. A 0 exit code automatically grants a perfect score, bypassing the subjective LLM judge.
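The task format and the exit-code rule in the note above can be sketched in TypeScript. This is a minimal illustration, not pi-bench's actual loader: the `Task` interface, `parseTask`, and `shouldBypassJudge` are hypothetical names, and treating `id`, `repo`, `commit`, and `prompt` as required is an assumption drawn from the note listing the other fields as optional.

```typescript
// Hypothetical shape of a task file, inferred from the example above.
// Assumption: every field not listed as optional in the note is required.
interface Task {
  id: string;
  repo: string;
  commit: string;
  prompt: string;
  solutionCommit?: string;
  expectedDiff?: string;
  testCommand?: string;
}

const REQUIRED = ["id", "repo", "commit", "prompt"] as const;
const OPTIONAL = ["solutionCommit", "expectedDiff", "testCommand"] as const;

// Parse the text of a task file and validate its fields; throws on a malformed task.
function parseTask(json: string): Task {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("task must be a JSON object");
  }
  const obj = raw as Record<string, unknown>;
  for (const key of REQUIRED) {
    if (typeof obj[key] !== "string" || obj[key] === "") {
      throw new Error(`missing or empty required field: ${key}`);
    }
  }
  for (const key of OPTIONAL) {
    if (obj[key] !== undefined && typeof obj[key] !== "string") {
      throw new Error(`optional field must be a string: ${key}`);
    }
  }
  return obj as unknown as Task;
}

// True when the task has a testCommand and it exited with code 0, which is the
// case where the test result stands on its own and the LLM judge is bypassed.
// Assumption: every other case (no testCommand, non-zero exit) goes to the judge.
function shouldBypassJudge(task: Task, exitCode: number | null): boolean {
  return task.testCommand !== undefined && exitCode === 0;
}
```

Validating up front means a malformed task file fails before any repository is cloned or any agent time is spent.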

Included Datasets

pi-bench supports multiple datasets to evaluate the agent's performance.

SWE-bench Verified Mini (Recommended)

A highly curated subset of 50 verified tasks from the SWE-bench dataset. This is the recommended dataset for rapid, high-quality evaluation as it tests a broad set of capabilities without taking days to run.

To download and import this dataset directly from HuggingFace, simply run:

```
./scripts/download-swe-mini.sh
```

This will automatically generate the 50 task files inside the tasks/verified-mini/ directory.

Running Benchmarks

SWE-bench Tasks (Recommended)

SWE-bench tasks run inside official SWE-bench Docker containers from ghcr.io/epoch-research/swe-bench.eval.x86_64.*. Each task gets its own container with:

The correct Python version (e.g. Python 3.6 for Django 3.1, Python 3.8+ for Sphinx)
All dependencies pre-installed
The repository checked out at the right commit in /testbed

This eliminates the environment mismatch problems that plague host-side execution.

Pre-pull containers (optional)

Download all 49 container images upfront (~2.4 GB download; only ~6 GB on disk thanks to heavy layer sharing):

```
./scripts/pull-swe-containers.sh
```

Provider Setup & Execution

You can configure and use both local and cloud-based models as the backend engine for the pi-coding-agent.

Local Providers (`llama.cpp` and `ds4`)

Local providers are configured in models.json in the project root. By default:

llama.cpp expects a local server running at http://localhost:8080/v1
ds4 expects a local ds4.c server running at http://localhost:8000/v1

When using a local provider, you do not need to specify a model name via --model. pi-bench will automatically query the local provider's /v1/models endpoint to retrieve the active model name and format the results directory accordingly. Whatever model your local server is currently running will be used.
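As a rough sketch of the auto-detection described above: OpenAI-compatible servers answer `/v1/models` with a body of the form `{ "data": [{ "id": "..." }] }`. The helper below extracts a model name from such a body. `pickActiveModel` is a hypothetical name, and taking the first entry is an assumption; pi-bench's own selection logic may differ.

```typescript
// Extract the active model name from an OpenAI-compatible /v1/models body.
// Assumption: the first entry in `data` is the model the server is running.
function pickActiveModel(body: unknown): string {
  if (typeof body !== "object" || body === null) {
    throw new Error("unexpected /v1/models response");
  }
  const data = (body as { data?: unknown }).data;
  if (!Array.isArray(data) || data.length === 0) {
    throw new Error("server reported no models");
  }
  const id = (data[0] as { id?: unknown } | null | undefined)?.id;
  if (typeof id !== "string" || id === "") {
    throw new Error("model entry has no id");
  }
  return id;
}
```

With Bun or Node 18+, the body could come from `await (await fetch("http://localhost:8080/v1/models")).json()`.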

Example: Running with llama.cpp

```
./run-swe-bench.sh tasks/verified-mini/ \
  --provider llama.cpp \
  --judge-model google/gemini-3.1-pro-preview \
  --platform strix-halo \
  --rocm-version 7.2.4 \
  --timeout 45
```

Example: Running with ds4

```
./run-swe-bench.sh tasks/verified-mini/ \
  --provider ds4 \
  --judge-model google/gemini-3.1-pro-preview \
  --platform strix-halo \
  --rocm-version 7.2.4 \
  --timeout 45
```

Cloud Providers (`openrouter`)

For cloud providers like OpenRouter, pi-bench queries the provider's hosted endpoint. Because these platforms host many models, the model cannot be auto-detected: you must specify which one to run with the --model flag.

Example: Running with OpenRouter

```
./run-swe-bench.sh tasks/verified-mini/django__django-11790.json \
  --provider openrouter \
  --model deepseek/deepseek-v4-flash \
  --judge-model google/gemini-3.1-pro-preview \
  --platform openrouter \
  --timeout 30
```

How SWE-bench evaluation works

After the agent finishes editing code, the runner:

Applies the test patch from the SWE-bench dataset (adds the regression tests).
Runs the FAIL_TO_PASS tests inside the container using the correct Python version and test runner.
Assigns a ground-truth score: 1 if the tests pass, 0 if they fail.
Passes both the diff and the test results to the LLM Judge (Gemini), which writes a human-readable rationale explaining why the fix worked or didn't.

This combines the objectivity of SWE-bench's test-based evaluation with the explainability of an LLM judge.
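The scoring rule above is simple enough to sketch directly. The snippet assumes a hypothetical `TestResults` map from test name to pass/fail; how pi-bench actually parses test runner output is not shown here, and counting a missing test as a failure is an assumption.

```typescript
// Hypothetical per-test outcome map: test name -> whether it passed.
type TestResults = Record<string, boolean>;

// Score is 1 only if every FAIL_TO_PASS test passed; otherwise 0.
// Assumptions: a test absent from the results counts as failed, and an
// empty FAIL_TO_PASS list is scored 0 because nothing was verified.
function sweBenchScore(failToPass: string[], results: TestResults): 0 | 1 {
  if (failToPass.length === 0) return 0;
  return failToPass.every((name) => results[name] === true) ? 1 : 0;
}
```

Because the score comes only from the tests, the judge's rationale can explain the outcome but cannot change it.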

Curated Tasks (Docker sandbox)

For non-SWE-bench tasks (curated, custom), use the Docker runner:

```
./run-docker.sh tasks/curated/ \
  --provider llama.cpp \
  --judge-model google/gemini-3.1-pro-preview \
  --platform strix-halo \
  --timeout 30
```

Local Execution (Use with Caution)

Running the benchmark locally executes the agent directly on your host machine, so its tools (including bash) run without a container sandbox.

```
bun run src/index.ts tasks/curated/easy.json
```

CLI Reference

Provider & Model Flags

| Flag | Description | Default |
| --- | --- | --- |
| `--provider <name>` | Inference provider: `llama.cpp`, `ds4`, or `openrouter` | `llama.cpp` |
| `--model <model-id>` | Model ID within the provider (e.g. `deepseek/deepseek-v4-flash`) | Auto-detected |
| `--judge-model <provider/id>` | Judge model (e.g. `google/gemini-3.1-pro-preview`) | Same as agent |
| `--port <port>` | Override the local server port | `8080` (llama.cpp), `8000` (ds4) |
| `--engine <name>` | Backward-compatible alias for `--provider` | (none) |

Local providers (llama.cpp, ds4) auto-detect the model name by querying the local server's /v1/models endpoint. No --model needed.
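To make the `--port` row in the table above concrete, here is a minimal sketch of how a local base URL might be derived. `localBaseUrl` is a hypothetical helper, and the assumption is that `--port` replaces only the port of the default `http://localhost:<port>/v1` address.

```typescript
type LocalProvider = "llama.cpp" | "ds4";

// Default ports from the flag table: 8080 for llama.cpp, 8000 for ds4.
const DEFAULT_PORTS: Record<LocalProvider, number> = { "llama.cpp": 8080, ds4: 8000 };

// Build the base URL for a local provider, honoring an optional port override.
function localBaseUrl(provider: LocalProvider, port?: number): string {
  const p = port ?? DEFAULT_PORTS[provider];
  if (!Number.isInteger(p) || p < 1 || p > 65535) {
    throw new Error(`invalid port: ${p}`);
  }
  return `http://localhost:${p}/v1`;
}
```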

Cloud providers (openrouter) require --model to specify which model to use, since the provider may host many models.

Backward compatibility: --model openrouter/deepseek/deepseek-v4-flash (without --provider) still works — the provider is parsed from the first path segment.
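The flag rules above (default provider, required `--model` for cloud providers, and the backward-compatible prefix form) could be resolved along these lines. This is an illustrative sketch: `resolveProvider` and its return shape are hypothetical, not taken from the pi-bench source.

```typescript
const PROVIDERS = ["llama.cpp", "ds4", "openrouter"] as const;
type Provider = (typeof PROVIDERS)[number];

interface Resolved {
  provider: Provider;
  model: string | undefined; // undefined means "auto-detect from the local server"
}

function isProvider(s: string): s is Provider {
  return (PROVIDERS as readonly string[]).indexOf(s) !== -1;
}

// Resolve --provider / --model into a provider and an optional model ID.
function resolveProvider(flags: { provider?: string; model?: string }): Resolved {
  let provider = flags.provider;
  let model = flags.model;
  // Backward-compatible form: no --provider, provider is the first path segment of --model.
  if (provider === undefined && model !== undefined) {
    const slash = model.indexOf("/");
    const head = slash === -1 ? "" : model.slice(0, slash);
    if (isProvider(head)) {
      provider = head;
      model = model.slice(slash + 1);
    }
  }
  provider = provider ?? "llama.cpp"; // documented default
  if (!isProvider(provider)) throw new Error(`unknown provider: ${provider}`);
  if (provider === "openrouter" && !model) {
    throw new Error("--model is required for openrouter");
  }
  return { provider, model };
}
```

With this logic, `--model openrouter/deepseek/deepseek-v4-flash` and `--provider openrouter --model deepseek/deepseek-v4-flash` resolve to the same provider and model.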

Other Flags

| Flag | Description | Default |
| --- | --- | --- |
| `--platform <id>` | Save results to `benchmark_results/<platform>/` | (none) |
| `--model-tag <tag>` | Append a suffix to the results directory (e.g. `mtp`) | (none) |
| `--rocm-version <ver>` | ROCm version running the backend | `7.2.4` |
| `--timeout <minutes>` | Agent timeout per task | `30` |

Examples

```
# Local llama.cpp (auto-detects model from server)
./run-swe-bench.sh tasks/verified-mini/ \
  --judge-model google/gemini-3.1-pro-preview \
  --platform strix-halo \
  --rocm-version 7.2.4 \
  --timeout 45

# Local ds4 server on custom port
./run-swe-bench.sh tasks/verified-mini/ \
  --provider ds4 --port 9000 \
  --judge-model google/gemini-3.1-pro-preview \
  --platform strix-halo \
  --rocm-version 7.2.4 \
  --timeout 45

# OpenRouter cloud
./run-swe-bench.sh tasks/verified-mini/ \
  --provider openrouter --model deepseek/deepseek-v4-flash \
  --judge-model google/gemini-3.1-pro-preview \
  --platform openrouter \
  --timeout 30

# Single task, backward-compat style
./run-swe-bench.sh tasks/verified-mini/django__django-11790.json \
  --model openrouter/deepseek/deepseek-v4-flash \
  --judge-model google/gemini-3.1-pro-preview \
  --platform openrouter \
  --timeout 30
```

Configuring Models

If you need to configure custom API endpoints or model parameters (like max tokens or context windows), edit the models.json file in the project root.

API Keys

Create a .env file in the root pi-bench/ directory with your API keys:

```
GEMINI_API_KEY=...
OPENROUTER_API_KEY=...
```

Both run-docker.sh and run-swe-bench.sh automatically pass this file into the container.

Results & Multi-Platform Dashboard

When a single run completes, it outputs a JSON artifact to the current directory (e.g. results-curated-easy.json).

When running a batch (providing a directory like tasks/verified-mini/), pi-bench automatically generates a uniquely named directory for the results based on the model (e.g., Qwen3_6-35B-A3B-UD-Q8_K_XL_gguf_results/).
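One plausible way to derive such a directory name is to replace every character outside `[A-Za-z0-9_-]` with an underscore and append `_results`. This sanitization rule is a guess that happens to reproduce the example above from a model name like `Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf` (itself an assumed input); the real naming code may differ, and `resultsDirName` and `resultsPath` are hypothetical helpers.

```typescript
// Turn a model ID into a filesystem-friendly results directory name.
// Assumption: anything other than letters, digits, "_" and "-" becomes "_".
function resultsDirName(modelId: string): string {
  const safe = modelId.replace(/[^A-Za-z0-9_-]/g, "_");
  return `${safe}_results`;
}

// With --platform, results are saved under benchmark_results/<platform>/.
function resultsPath(modelId: string, platform?: string): string {
  const dir = resultsDirName(modelId);
  return platform ? `benchmark_results/${platform}/${dir}` : dir;
}
```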

Populating the Dashboard

pi-bench includes a dynamic HTML dashboard that can track results across multiple hardware platforms. To get your results onto the dashboard:

Create your platform metadata: If it's a new platform, create a folder for it inside benchmark_results/ and add a platform.json describing your hardware:
```
mkdir -p benchmark_results/r9700
```
benchmark_results/r9700/platform.json:
```
{
  "id": "r9700",
  "name": "Radeon 9700",
  "gpu": "Radeon 9700 16GB",
  "ram": "32GB DDR5"
}
```
Run your benchmark with the --platform flag:
```
./run-swe-bench.sh tasks/verified-mini/ \
  --judge-model google/gemini-3.1-pro-preview \
  --platform r9700
```
This automatically routes the results folder (e.g. Qwen3_6..._results) right into benchmark_results/r9700/.
Generate the report: This script parses all new results in benchmark_results/ and compiles them into a single docs/data.json file. The frontend dashboard (app.js) requires this JSON file to display data.
```
bun run scripts/generate-report.ts
```
Serve the dashboard: The dashboard is a static website. Serve the docs/ folder, open your browser (e.g., http://localhost:8082), and the Vue frontend (app.js) will automatically load the updated data.json.
```
python3 -m http.server 8082 -d docs/
```
