CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis (COLM'25)
- Create and activate a Conda environment:

```bash
conda create -y -n CodeARC python=3.10.12
conda activate CodeARC
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Set API keys. Make sure you have valid keys for the providers you plan to use:

```bash
export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export TOGETHER_API_KEY=<your_together_api_key>
```
Run the evaluation:

```bash
python3 run.py --model_name openai/gpt-4o-mini --total_idx 20
```

We support OpenAI models (e.g., openai/gpt-4o), Anthropic models (e.g., anthropic/claude-3-7-sonnet-20250219), and models served by Together AI (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo). For testing purposes, you can pass --total_idx 20 to limit evaluation to 20 problems instead of the full dataset (1,114 problems). See run.py for additional configuration options.
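For instance, evaluating the other supported providers uses the same flags; a sketch using the example model names above (with the corresponding API keys set as described earlier):

```bash
# Anthropic model on the 20-problem test subset
python3 run.py --model_name anthropic/claude-3-7-sonnet-20250219 --total_idx 20

# Together AI-served Llama model on the full 1,114-problem dataset
python3 run.py --model_name meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
```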
To summarize results:

```bash
python3 src/compute_metrics.py
```

The CodeARC datasets are hosted on HuggingFace:
- Problems Dataset: anjiangwei/CodeARC-Problems
- Invocations Dataset: anjiangwei/CodeARC-Invocations
- Obtain an access token: go to HuggingFace Tokens and generate a token with `read` or `write` permissions.
- Login using the token.

Option A: Use the command line:

```bash
huggingface-cli login
huggingface-cli whoami
```

Option B: Set the token as an environment variable:

```bash
export HF_TOKEN=<your_huggingface_token>
```
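If you prefer to authenticate from Python, here is a minimal sketch using the `huggingface_hub` library, assuming `HF_TOKEN` is set as above:

```python
import os

from huggingface_hub import HfApi, login

# Log in with the token from the environment variable set above.
login(token=os.environ["HF_TOKEN"])

# Sanity check: print the account name associated with the token.
print(HfApi().whoami()["name"])
```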
You can directly load the datasets using the HuggingFace datasets library:
```python
from datasets import load_dataset

# Define dataset paths
hf_problems_path = "anjiangwei/CodeARC-Problems"
hf_invocations_path = "anjiangwei/CodeARC-Invocations"

# Load datasets
problems_dataset = load_dataset(hf_problems_path)
invocations_dataset = load_dataset(hf_invocations_path)

# Example: Access the first training sample
print(problems_dataset["train"][0])
print(invocations_dataset["train"][0])
```

If our research inspires you, please cite our paper:
```bibtex
@inproceedings{wei2025codearc,
  title={Code{ARC}: Benchmarking Reasoning Capabilities of {LLM} Agents for Inductive Program Synthesis},
  author={Anjiang Wei and Tarun Suresh and Jiannan Cao and Naveen Kannan and Yuheng Wu and Kai Yan and Thiago S. F. X. Teixeira and Ke Wang and Alex Aiken},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=Q5pVZCrrKr}
}
```

This project is licensed under the Apache 2.0 License.