Skip to content

juyangbai/MAS-PromptBench

Repository files navigation

MAS-PromptBench

A Benchmark of Prompt Optimization for Multi-Agent LLM Systems

🌐 Website | 📖 Paper | 💻 GitHub

Contents
  1. News and Updates
  2. Introduction
  3. Code Structure
  4. Quickstart
  5. Referenced Resources
  6. Contributing

🔔 News and Updates

  • [2026-06-22] — Initial public release of the code!

📖 Introduction

MAS-PromptBench overview

A reproducible benchmark for studying when prompt optimization improves multi-agent LLM systems across optimizers, tasks, topologies, communication formats, and team sizes.

  • Optimizer — GEPA and MIPRO prompt optimizers, run on the real multi-agent runners.
  • Task dataset — 9 reasoning, coding, and tool-use benchmarks, each scored with its official / community-standard scorer.
  • Workflow Topologysingle, independent, sequential, centralized, and decentralized, implemented across LangGraph, CrewAI, AutoGen, and the OpenAI SDK.
  • Communication format — three inter-agent message formats (freeform, semi_structured, structured_soft).
  • Team size — the number of agents per team, r ∈ {2, 4, 8, 10}.

🌳 Code Structure

Path Contents
benchmarks/ per-dataset evaluation-ID manifests + the API-Bank source
communications/ inter-agent message-format studies
configs/ seed per-role prompts (configs/prompts/<topology>/<dataset>/<role>.txt)
frameworks/ LangGraph, CrewAI, AutoGen, and debate submodules (editable installs)
models/ node-agnostic vLLM serve scripts (Qwen3.5-9B / 122B)
optimizers/ GEPA and MIPRO prompt optimizers over the real runners
scripts/ helper and launcher scripts
teamsizes/ team-size sweeps (number of agents per team)
topologies/ the core benchmark — 5 topologies × 9 datasets, one runnable pair each

🚀 Quickstart

Go from a fresh clone to scored results in four steps — install, connect a model, run a baseline, then optimize its prompts.

1. Install

Clone with the framework submodules and create the conda environment:

git clone https://github.com/juyangbai/MAS-PromptBench.git
cd MAS-PromptBench
git submodule update --init --recursive

conda env create -f environment.yml      # Python 3.11 + vLLM + benchmark deps
conda activate mas-promptbench

2. Connect a model

Every agent talks to the same OpenAI-compatible chat endpoint, configured via VLLM_BASE_URL and MODEL_ID (plus OPENAI_API_KEY when the provider requires one). Pick one path:

Option A — Use a model API (no GPU). Manage keys in a .env file — uncomment one provider block, fill in your key, and load it:

# edit .env — pick a provider, set OPENAI_API_KEY
set -a && source .env && set +a

.env ships ready-to-use blocks for OpenAI, Anthropic, Google Gemini, Mistral AI, DeepSeek, and local vLLM — no local GPU required.

Option B — Local serving (vLLM, on your own GPUs). Serve a Qwen model from models/, then point runs at it:

bash models/serve_qwen3_5_9b.sh
export VLLM_BASE_URL=http://localhost:8000/v1
export MODEL_ID=Qwen/Qwen3.5-9B
# OPENAI_API_KEY not needed (the local endpoint accepts any key)
Script Model Serving GPUs needed
serve_qwen3_5_9b.sh Qwen/Qwen3.5-9B one replica per GPU ≥ 1 CUDA GPU
serve_qwen3_5_122b.sh Qwen/Qwen3.5-122B-A10B-FP8 tensor-parallel (TP=4) 4 FP8-capable GPUs (Hopper / Blackwell)

3. Run a baseline

Every (topology, dataset) pair ships a no-arg smoke demo and a --batch mode that writes predictions, per-instance results, and traces to results/<dataset>/:

# smoke demo (built-in example, no dataset download)
python topologies/single/hotpotqa/langgraph_hotpotqa.py

# real batch on a slice
python topologies/single/hotpotqa/langgraph_hotpotqa.py --batch --limit 100

See topologies/README.md for the full run interface, per-dataset setup, and scoring details.

4. Optimize prompts

optimizers/ holds two optimizers — GEPA (reflective prompt evolution) and MIPRO (instruction + few-shot example search) — that improve the seed prompts by running the real topology runners and re-scoring, so gains transfer directly back to the benchmark:

# from optimizers/gepa/
python -m real_runner_gepa.pilots.run_gepa_dataset \
  --dataset math --topology single --train-size 25 --val-size 25 \
  --max-full-evals 5 --out results/gepa/single_math

Train/val splits draw from the frozen eval-ID manifests, so optimization never trains on reported eval instances. See optimizers/README.md.


🔗 Referenced Resources

The agent frameworks under frameworks/ retain their own upstream licenses:

  • LangGraph — stateful graph-based agent orchestration
  • CrewAI — role-based multi-agent framework
  • AutoGen — conversational multi-agent framework
  • LLM Multi-Agent Debate — multi-agent debate reference implementation

MAS-PromptBench evaluates on nine existing benchmarks. Please cite and comply with the license of each original dataset when reporting results:

MAS-PromptBench also builds on:

  • vLLM — high-throughput, OpenAI-compatible LLM serving (the models/ endpoints)
  • DSPy — the prompt-optimization backend behind GEPA and MIPRO

🤝 Contributing

Contributions are very welcome — a new topology, dataset, or optimizer, a framework integration, a bug fix, or even a typo. Every bit helps!

A few tips to make it smooth:

  • For anything substantial, open an issue first so we can talk through the approach together.
  • Match the existing pair / runner conventions — topologies/README.md and the per-optimizer ADDING_PAIR.md guides are great starting points.
  • Give the relevant smoke demo a quick run (plus a small --batch --limit slice) before opening your PR.

Then send over a pull request — and thank you for helping make MAS-PromptBench better! 🙌


📚 Citation

If you use MAS-PromptBench in your research, please cite it:

@misc{mas-promptbench,
  title  = {MAS-PromptBench: A Multi-Agent Topology Benchmark and Prompt-Optimization Testbed},
  author = {MAS-PromptBench contributors},
  year   = {2026},
  url    = {<repo-url>}
}

⚖️ License

MAS-PromptBench is released under the MIT License. The framework libraries under frameworks/ and the API-Bank source under benchmarks/apibank/apibank_upstream/ retain their respective upstream licenses — comply with each when redistributing.

About

A reproducible benchmark for studying when prompt optimization improves multi-agent LLM systems.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors