# Tables & Resources

This page contains statistical tables and resources from our comprehensive survey on Issue Resolution in Software Engineering.
## Evaluation & Training Datasets

A statistical overview of issue resolution datasets, categorized by programming language, modality support (Multimodal), number of source repositories (Repos), data scale (Amount), and the availability of a reproducible execution environment (Environment).
| Dataset | Language | Multimodal | Repos | Amount | Environment | Link |
|---|---|---|---|---|---|---|
| **Single-PL Datasets** | | | | | | |
| SWE-Fixer | Python | ❌ | 856 | 115,406 | ❌ | |
| SWE-smith | Python | ❌ | 128 | 50k | ✅ | |
| SWE-Lego | Python | ❌ | 3,251 | 32,119 | ✅ | |
| SWE-rebench | Python | ❌ | 3,468 | 21,336 | ✅ | |
| SWE-bench-train | Python | ❌ | 37 | 19k | ❌ | |
| SWE-Flow | Python | ❌ | 74 | 18,081 | ✅ | |
| Skywork-SWE | Python | ❌ | 2,531 | 10,169 | ✅ | - |
| R2E-Gym | Python | ❌ | 10 | 8,135 | ✅ | |
| RepoForge | Python | ❌ | - | 7.3k | ✅ | - |
| SWE-bench-extra | Python | ❌ | 2k | 6.38k | ✅ | |
| SWE-Gym | Python | ❌ | 11 | 2,438 | ✅ | |
| SWE-bench | Python | ❌ | 12 | 2,294 | ✅ | |
| SWE-bench-java | Java | ❌ | 19 | 1,797 | ✅ | |
| FEA-bench | Python | ❌ | 83 | 1,401 | ✅ | |
| SWE-bench-Live | Python | ❌ | 164 | 1,565 | ✅ | |
| Loc-Bench | Python | ❌ | - | 560 | ❌ | |
| SWE-bench Verified | Python | ❌ | - | 500 | ✅ | |
| SWE-bench Lite | Python | ❌ | 12 | 300 | ✅ | |
| SWE-MERA | Python | ❌ | 200 | 300 | ✅ | |
| SWE-Bench-CL | Python | ❌ | 8 | 273 | ✅ | |
| SWE-Sharp-Bench | C# | ❌ | 17 | 150 | ✅ | |
| SWE-Perf | Python | ❌ | 12 | 140 | ✅ | |
| Visual SWE-bench | Python | ✅ | 11 | 133 | ✅ | |
| SWE-EVO | Python | ❌ | 7 | 48 | ✅ | |
| **Multi-PL Datasets** | | | | | | |
| SWE-Mirror | Python, Rust, Go | ❌ | 40 | 60k | ✅ | - |
| Multi-SWE-bench | Java, JS, TS, Go, Rust, C, C++ | ❌ | 76 | 4,723 | ✅ | |
| Swing-Bench | Python, Go, C++, Rust | ❌ | 400 | 2,300 | ✅ | - |
| SWE-PolyBench | Python, Java, JS, TS | ❌ | 21 | 2,110 | ✅ | |
| SWE-Compass | Python, JS, TS, Java, C, C++, Go, Rust, Kotlin, C# | ❌ | - | 2,000 | ✅ | |
| SWE-Bench Pro | Python, Go, TS | ❌ | 41 | 1,865 | ✅ | |
| SWE-bench++ | Python, Go, TS, JS, Ruby, PHP, Java, Rust, C++, C#, C | ❌ | 3,971 | 1,782 | ✅ | |
| SWE-Lancer | JS, TS | ❌ | - | 1,488 | ✅ | |
| OmniGIRL | Python, TS, Java, JS | ✅ | 15 | 959 | ✅ | |
| SWE-bench Multimodal | JS, TS, HTML, CSS | ✅ | 17 | 619 | ✅ | |
| SWE-fficiency | Python, Cython | ❌ | 9 | 498 | ✅ | |
| SWE-Factory | Python, Java, JS, TS | ❌ | 12 | 430 | ✅ | |
| SWE-bench-Live-MultiLang & Windows | Python, JS, TS, C, C++, C#, Java, Go, Rust | ❌ | 238 | 418 | ✅ | |
| SWE-bench Multilingual | C, C++, Go, Java, JS, TS, Rust, Python, Ruby, PHP | ❌ | 42 | 300 | ✅ | |
| SWE-InfraBench | Python, TS | ❌ | - | 100 | ✅ | - |
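As a quick illustration of how the statistics above can be sliced programmatically, the sketch below encodes a few representative rows copied from the table and filters for datasets that ship a reproducible execution environment. The row values come from the table; the helper names (`with_environment`, `total_instances`) are our own, not part of any dataset's API.

```python
# A few representative rows copied from the dataset table above.
# Fields: name, language(s), instance count, has reproducible environment.
DATASETS = [
    ("SWE-Fixer",          "Python", 115_406, False),
    ("SWE-bench",          "Python",   2_294, True),
    ("SWE-bench Verified", "Python",     500, True),
    ("Multi-SWE-bench",    "Java, JS, TS, Go, Rust, C, C++", 4_723, True),
]

def with_environment(rows):
    """Keep only datasets that provide a reproducible execution environment."""
    return [name for name, _lang, _n, has_env in rows if has_env]

def total_instances(rows):
    """Sum instance counts across the given datasets."""
    return sum(n for _name, _lang, n, _env in rows)

print(with_environment(DATASETS))  # SWE-Fixer is excluded: no environment
print(total_instances(DATASETS))
```

The same pattern extends to any column of the table, e.g. grouping by language or filtering multimodal datasets.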
## Training Trajectory Datasets

A survey of trajectory datasets used for agent training or analysis. For each dataset we list the programming language, the number of source repositories (Repos), and the total number of trajectories (Amount).
| Dataset | Language | Repos | Amount | Link |
|---|---|---|---|---|
| SWE-Fixer | Python | 856 | 69,752 | |
| SWE-rebench | Python | 1,823 | 67,074 | |
| R2E-Gym | Python | 10 | 3,321 | |
| SWE-Synth | Python | 11 | 3,018 | |
| SWE-Factory | Python | 10 | 2,809 | |
| SWE-Gym | Python | 11 | 491 | |
| SWE-Lego | Python | 3,251 | 14.6k | |
## SFT-based Methods

Overview of SFT-based methods for issue resolution. This table categorizes models by their base model, architecture, and training scaffold, sorted by resolution rate (Res.(%)).
| Model Name | Base Model | Size | Arch. | Training Scaffold | Res.(%) | Code | Data | Model |
|---|---|---|---|---|---|---|---|---|
| SWE-rebench-openhands-Qwen3-235B-A22B | Qwen3-235B-A22B | 235B-A22B | MoE | OpenHands | 59.9 | - | | |
| SWE-Lego-Qwen3-32B | Qwen3-32B | 32B | Dense | OpenHands | 57.6 | | | |
| SWE-rebench-openhands-Qwen3-30B-A3B | Qwen3-30B-A3B | 30B-A3B | MoE | OpenHands | 49.7 | - | | |
| Devstral | Mistral Small 3 | 22B | Dense | OpenHands | 46.8 | - | | |
| Co-PatcheR | Qwen2.5-Coder-14B | 3×14B | Dense | PatchPilot-mini | 46.0 | | - | |
| SWE-Swiss-32B | Qwen2.5-32B-Instruct | 32B | Dense | Agentless | 45.0 | | | |
| SWE-Lego-Qwen3-8B | Qwen3-8B | 8B | Dense | OpenHands | 44.4 | | | |
| Lingma SWE-GPT | Qwen2.5-72B-Instruct | 72B | Dense | SWESynInfer | 30.2 | | - | - |
| SWE-Gym-Qwen-32B | Qwen2.5-Coder-32B | 32B | Dense | OpenHands, MoatlessTools | 20.6 | | - | |
| Lingma SWE-GPT | Qwen2.5-Coder-7B | 7B | Dense | SWESynInfer | 18.2 | | - | - |
| SWE-Gym-Qwen-14B | Qwen2.5-Coder-14B | 14B | Dense | OpenHands, MoatlessTools | 16.4 | | - | |
| SWE-Gym-Qwen-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands, MoatlessTools | 10.6 | | - | |
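The Res.(%) column throughout these tables is a resolution rate: the fraction of benchmark issues an approach resolves, expressed as a percentage. A minimal sketch of the computation, assuming a 500-instance benchmark such as SWE-bench Verified (the function name and rounding convention are our own):

```python
def resolution_rate(resolved: int, total: int) -> float:
    """Percentage of benchmark issues resolved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("benchmark must contain at least one instance")
    return round(100.0 * resolved / total, 1)

# On a 500-instance benchmark, a reported 57.6% corresponds to
# 288 resolved issues: 100 * 288 / 500 = 57.6.
print(resolution_rate(288, 500))  # 57.6
```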
## RL-based Methods

An overview of RL-based specialized models for issue resolution, grouped by parameter size. The table details each model's base model, architecture, the training scaffold used for rollout, the type of reward signal employed (outcome vs. process), and the resolution rate (Res.(%)) on issue resolution benchmarks.
| Model Name | Base Model | Size | Arch. | Train. Scaffold | Reward | Res.(%) | Code | Data | Model |
|---|---|---|---|---|---|---|---|---|---|
| **560B Models (MoE)** | | | | | | | | | |
| LongCat-Flash-Think | LongCatFlash-Base | 560B-A27B | MoE | R2E-Gym | Outcome | 60.4 | | - | |
| **72B Models** | | | | | | | | | |
| Kimi-Dev | Qwen 2.5-72B-Base | 72B | Dense | BugFixer + TestWriter | Outcome | 60.4 | | - | |
| SWE-RL | Llama-3.3-70B-Instruct | 70B | Dense | Agentless-mini | Outcome | 41.0 | | - | - |
| Multi-turn RL (Nebius) | Qwen2.5-72B-Instruct | 72B | Dense | SWE-agent | Outcome | 39.0 | - | - | - |
| Agent-RLVR-RM-72B | Qwen2.5-Coder-72B | 72B | Dense | Localization + Repair | Outcome | 27.8 | - | - | - |
| Agent-RLVR-72B | Qwen2.5-Coder-72B | 72B | Dense | Localization + Repair | Outcome | 22.4 | - | - | - |
| **32B Models** | | | | | | | | | |
| OpenHands Critic | Qwen2.5-Coder-32B | 32B | Dense | SWE-Gym | - | 66.4 | | - | |
| KAT-Dev-32B | Qwen3-32B | 32B | Dense | - | - | 62.4 | - | - | |
| SWE-Swiss-32B | Qwen2.5-32B-Instruct | 32B | Dense | - | Outcome | 60.2 | | | |
| FoldAgent | Seed-OSS-36B-Instruct | 36B | Dense | FoldAgent | Process | 58.0 | | | - |
| SeamlessFlow-32B | Qwen3-32B | 32B | Dense | SWE-agent | Outcome | 45.8 | | - | - |
| DeepSWE | Qwen3-32B | 32B | Dense | R2E-Gym | Outcome | 42.2 | | | |
| SA-SWE-32B | - | 32B | Dense | SkyRL-Agent | - | 39.4 | - | - | - |
| OpenHands LM v0.1 | Qwen2.5-Coder-32B | 32B | Dense | SWE-Gym | - | 37.2 | | - | |
| SWE-Dev-32B | Qwen2.5-Coder-32B | 32B | Dense | OpenHands | Outcome | 36.6 | | - | |
| Satori-SWE | Qwen2.5-Coder-32B | 32B | Dense | Retriever + Code editor | Outcome | 35.8 | | | |
| SoRFT-32B | Qwen2.5-Coder-32B | 32B | Dense | Agentless | Outcome | 30.8 | - | - | - |
| Agent-RLVR-32B | Qwen2.5-Coder-32B | 32B | Dense | Localization + Repair | Outcome | 21.6 | - | - | - |
| **14B Models** | | | | | | | | | |
| Agent-RLVR-14B | Qwen2.5-Coder-14B | 14B | Dense | Localization + Repair | Outcome | 18.0 | - | - | - |
| SEAlign-14B | Qwen2.5-Coder-14B | 14B | Dense | OpenHands | Process | 17.7 | - | - | - |
| **7-8B Models** | | | | | | | | | |
| SeamlessFlow-8B | Qwen3-8B | 8B | Dense | SWE-agent | Outcome | 27.4 | | - | - |
| SWE-Dev-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands | Outcome | 23.4 | | - | |
| SoRFT-7B | Qwen2.5-Coder-7B | 7B | Dense | Agentless | Outcome | 21.4 | - | - | - |
| SWE-Dev-8B | Llama-3.1-8B | 8B | Dense | OpenHands | Outcome | 18.0 | | - | |
| SEAlign-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands | Process | 15.0 | - | - | - |
| SWE-Dev-9B | GLM-4-9B | 9B | Dense | OpenHands | Outcome | 13.6 | | - | |
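The Reward column distinguishes outcome rewards, which score only the final result (typically whether the repository's test suite passes after the patch is applied), from process rewards, which score intermediate steps of the agent's trajectory. A minimal sketch of the two signal shapes (the function names and the 0/1 and averaging conventions are illustrative, not any specific paper's formulation):

```python
def outcome_reward(tests_passed: bool) -> float:
    """Sparse outcome signal: 1.0 only if the final patch makes all tests pass."""
    return 1.0 if tests_passed else 0.0

def process_reward(step_scores: list[float]) -> float:
    """Dense process signal: average per-step score over the whole trajectory."""
    if not step_scores:
        return 0.0
    return sum(step_scores) / len(step_scores)

print(outcome_reward(True))              # 1.0: patch resolves the issue
print(process_reward([0.5, 1.0, 0.75]))  # 0.75: mean of the step scores
```

Outcome rewards are easy to compute from test execution but sparse; process rewards give denser feedback per step at the cost of needing a step-level scoring model.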
## General Foundation Models

Overview of general foundation models evaluated on issue resolution. The table details the inference scaffold (e.g., OpenHands, Agentless) used to obtain each reported result.
| Model Name | Size | Arch. | Inf. Scaffold | Reward | Res.(%) | Code | Model |
|---|---|---|---|---|---|---|---|
| MiMo-V2-Flash | 309B-A15B | MoE | Agentless | Outcome | 73.4 | | |
| KAT-Coder | - | - | Claude Code | Outcome | 73.4 | - | |
| Deepseek V3.2 | 671B-A37B | MoE | Claude Code, RooCode | - | 73.1 | | |
| Kimi-K2-Instruct | 1T | MoE | Agentless | Outcome | 71.6 | - | |
| Qwen3-Coder | 480B-A35B | MoE | OpenHands | Outcome | 69.6 | | |
| GLM-4.6 | 355B-A32B | MoE | OpenHands | Outcome | 68.0 | - | |
| gpt-oss-120b | 116.8B-A5.1B | MoE | Internal tool | Outcome | 62.0 | | |
| Minimax M2 | 230B-A10B | MoE | R2E-Gym | Outcome | 61.0 | | |
| gpt-oss-20b | 20.9B-A3.6B | MoE | Internal tool | Outcome | 60.0 | | |
| GLM-4.5-Air | 106B-A12B | MoE | OpenHands | Outcome | 57.6 | - | - |
| Minimax M1-80k | 456B-A45.9B | MoE | Agentless | Outcome | 56.0 | | |
| Minimax M1-40k | 456B-A45.9B | MoE | Agentless | Outcome | 55.6 | | |
| Seed1.5-Thinking | 200B-A20B | MoE | - | Outcome | 47.0 | | - |
| Llama 4 Maverick | 400B-A17B | MoE | mini-SWE-agent | Outcome | 21.0 | | |
| Llama 4 Scout | 109B-A17B | MoE | mini-SWE-agent | Outcome | 9.1 | | |