GitHub - juyangbai/MAS-PromptBench: A reproducible benchmark for studying when prompt optimization improves multi-agent LLM systems.

A Benchmark of Prompt Optimization for Multi-Agent LLM Systems

🌐 Website | 📖 Paper | 💻 GitHub

Contents

News and Updates
Introduction
Code Structure
Quickstart
Referenced Resources
Contributing

🔔 News and Updates

[2026-06-22] — Initial public release of the code!

📖 Introduction

A reproducible benchmark for studying when prompt optimization improves multi-agent LLM systems across optimizers, tasks, topologies, communication formats, and team sizes.

Optimizer — GEPA and MIPRO prompt optimizers, run on the real multi-agent runners.
Task dataset — 9 reasoning, coding, and tool-use benchmarks, each scored with its official / community-standard scorer.
Workflow Topology — single, independent, sequential, centralized, and decentralized, implemented across LangGraph, CrewAI, AutoGen, and the OpenAI SDK.
Communication format — three inter-agent message formats (freeform, semi_structured, structured_soft).
Team size — the number of agents per team, r ∈ {2, 4, 8, 10}.

🌳 Code Structure

Path	Contents
`benchmarks/`	per-dataset evaluation-ID manifests + the API-Bank source
`communications/`	inter-agent message-format studies
`configs/`	seed per-role prompts (`configs/prompts/<topology>/<dataset>/<role>.txt`)
`frameworks/`	LangGraph, CrewAI, AutoGen, and debate submodules (editable installs)
`models/`	node-agnostic vLLM serve scripts (Qwen3.5-9B / 122B)
`optimizers/`	GEPA and MIPRO prompt optimizers over the real runners
`scripts/`	helper and launcher scripts
`teamsizes/`	team-size sweeps (number of agents per team)
`topologies/`	the core benchmark — 5 topologies × 9 datasets, one runnable pair each

🚀 Quickstart

Go from a fresh clone to scored results in four steps — install, connect a model, run a baseline, then optimize its prompts.

1. Install

Clone with the framework submodules and create the conda environment:

git clone https://github.com/juyangbai/MAS-PromptBench.git
cd MAS-PromptBench
git submodule update --init --recursive

conda env create -f environment.yml      # Python 3.11 + vLLM + benchmark deps
conda activate mas-promptbench

2. Connect a model

Every agent talks to the same OpenAI-compatible chat endpoint, configured via VLLM_BASE_URL and MODEL_ID (plus OPENAI_API_KEY when the provider requires one). Pick one path:

Option A — Use a model API (no GPU). Manage keys in a .env file — uncomment one provider block, fill in your key, and load it:

# edit .env — pick a provider, set OPENAI_API_KEY
set -a && source .env && set +a

.env ships ready-to-use blocks for OpenAI, Anthropic, Google Gemini, Mistral AI, DeepSeek, and local vLLM — no local GPU required.

Option B — Local serving (vLLM, on your own GPUs). Serve a Qwen model from models/, then point runs at it:

bash models/serve_qwen3_5_9b.sh
export VLLM_BASE_URL=http://localhost:8000/v1
export MODEL_ID=Qwen/Qwen3.5-9B
# OPENAI_API_KEY not needed (the local endpoint accepts any key)

Script	Model	Serving	GPUs needed
`serve_qwen3_5_9b.sh`	`Qwen/Qwen3.5-9B`	one replica per GPU	≥ 1 CUDA GPU
`serve_qwen3_5_122b.sh`	`Qwen/Qwen3.5-122B-A10B-FP8`	tensor-parallel (TP=4)	4 FP8-capable GPUs (Hopper / Blackwell)

3. Run a baseline

Every (topology, dataset) pair ships a no-arg smoke demo and a --batch mode that writes predictions, per-instance results, and traces to results/<dataset>/:

# smoke demo (built-in example, no dataset download)
python topologies/single/hotpotqa/langgraph_hotpotqa.py

# real batch on a slice
python topologies/single/hotpotqa/langgraph_hotpotqa.py --batch --limit 100

See topologies/README.md for the full run interface, per-dataset setup, and scoring details.

4. Optimize prompts

optimizers/ holds two optimizers — GEPA (reflective prompt evolution) and MIPRO (instruction + few-shot example search) — that improve the seed prompts by running the real topology runners and re-scoring, so gains transfer directly back to the benchmark:

# from optimizers/gepa/
python -m real_runner_gepa.pilots.run_gepa_dataset \
  --dataset math --topology single --train-size 25 --val-size 25 \
  --max-full-evals 5 --out results/gepa/single_math

Train/val splits draw from the frozen eval-ID manifests, so optimization never trains on reported eval instances. See optimizers/README.md.

🔗 Referenced Resources

The agent frameworks under frameworks/ retain their own upstream licenses:

LangGraph — stateful graph-based agent orchestration
CrewAI — role-based multi-agent framework
AutoGen — conversational multi-agent framework
LLM Multi-Agent Debate — multi-agent debate reference implementation

MAS-PromptBench evaluates on nine existing benchmarks. Please cite and comply with the license of each original dataset when reporting results:

GPQA — graduate-level science multiple-choice QA
HotpotQA — multi-hop open-domain QA
MATH — competition mathematics
LiveCodeBench — contamination-free code generation
APPS — programming problems
Berkeley Function Calling Leaderboard (BFCL) — function / tool calling
SWE-bench Verified — real-world GitHub issue resolution
API-Bank — tool-augmented API calling
ToolHop — multi-hop tool use

MAS-PromptBench also builds on:

vLLM — high-throughput, OpenAI-compatible LLM serving (the models/ endpoints)
DSPy — the prompt-optimization backend behind GEPA and MIPRO

🤝 Contributing

Contributions are very welcome — a new topology, dataset, or optimizer, a framework integration, a bug fix, or even a typo. Every bit helps!

A few tips to make it smooth:

For anything substantial, open an issue first so we can talk through the approach together.
Match the existing pair / runner conventions — topologies/README.md and the per-optimizer ADDING_PAIR.md guides are great starting points.
Give the relevant smoke demo a quick run (plus a small --batch --limit slice) before opening your PR.

Then send over a pull request — and thank you for helping make MAS-PromptBench better! 🙌

📚 Citation

If you use MAS-PromptBench in your research, please cite it:

@misc{mas-promptbench,
  title  = {MAS-PromptBench: A Multi-Agent Topology Benchmark and Prompt-Optimization Testbed},
  author = {MAS-PromptBench contributors},
  year   = {2026},
  url    = {<repo-url>}
}

⚖️ License

MAS-PromptBench is released under the MIT License. The framework libraries under frameworks/ and the API-Bank source under benchmarks/apibank/apibank_upstream/ retain their respective upstream licenses — comply with each when redistributing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Benchmark of Prompt Optimization for Multi-Agent LLM Systems

🔔 News and Updates

📖 Introduction

🌳 Code Structure

🚀 Quickstart

1. Install

2. Connect a model

3. Run a baseline

4. Optimize prompts

🔗 Referenced Resources

🤝 Contributing

📚 Citation

⚖️ License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
assets		assets
benchmarks		benchmarks
communications		communications
configs		configs
frameworks		frameworks
models		models
optimizers		optimizers
scripts		scripts
teamsizes		teamsizes
topologies		topologies
.gitignore		.gitignore
.gitmodules		.gitmodules
.prettierignore		.prettierignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Folders and files

Latest commit

History

Repository files navigation

A Benchmark of Prompt Optimization for Multi-Agent LLM Systems

🔔 News and Updates

📖 Introduction

🌳 Code Structure

🚀 Quickstart

1. Install

2. Connect a model

3. Run a baseline

4. Optimize prompts

🔗 Referenced Resources

🤝 Contributing

📚 Citation

⚖️ License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages