
Figure 1: (a) RaDeR performance overview; (b) data efficiency.
- Overview
- Installation
- Retrieval data generation
- Generating queries
- Training models
- Evaluating models
- Evaluation results
We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall performance.

Figure 2: Data generation pipeline for RaDeR. The OST action stands for one-step thought generation, and CRS stands for the complete-remaining-solution-steps action.
# Create a Python environment with version 3.12.3.
python -m venv RaDeR_env
source RaDeR_env/bin/activate
pip install -r requirements.txt
Then set up the necessary API keys and other environment variables in the .env file:
HUGGINGFACE_CACHE_DIR="" # Local path to huggingface cache
HUGGING_FACE_HUB_TOKEN="" # Huggingface token
HF_HOME="" # Local path to huggingface cache
AZURE_OPENAI_ENDPOINT=""
AZURE_OPENAI_API_KEY=""
REPLLAMA_MERGED_HUGGINGFACE_PATH="" # Path to RepLLama huggingface merged model
RaDeR_MERGED_HUGGINGFACE_PATH="" # Path to RaDeR huggingface merged model
Before setting the path to the stored RepLLama doc embeddings, we first need to compute the doc embeddings and store them in the BRIGHT_cache directory. We can precompute the embeddings of the document corpus of any split of the BRIGHT dataset (with any retriever supported in vLLM). For the MCTS data generation pipeline, we need the doc embeddings of the TheoremQA theorems corpus using RepLLama.
bash get_doc_embeddings.sh RepLLama
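For reference, the cached embeddings are plain NumPy arrays, one .npy file per corpus shard. A minimal sketch of the caching format, with a stand-in embedder in place of the real retriever (the actual encoding logic lives in get_doc_embeddings.sh and the scripts it calls):

import os
import numpy as np

# Toy corpus standing in for the TheoremQA theorems documents.
corpus = ["Theorem (Cauchy-Schwarz): ...", "Theorem (AM-GM): ..."]

def encode_documents(texts):
    # Stand-in embedder so this sketch runs end to end; in practice each
    # passage is encoded by the retriever (e.g., RepLLama served via vLLM).
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), 4096)).astype(np.float32)

embeddings = encode_documents(corpus)  # shape: (num_docs, dim)

# Cache to the location the .env file points at.
os.makedirs("BRIGHT_cache/doc_emb/RepLLama", exist_ok=True)
np.save("BRIGHT_cache/doc_emb/RepLLama/0.npy", embeddings)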
Then set the path to the stored RepLLama doc embeddings in the .env file:
REPLLAMA_THEOREMT_DOC_EMB_CACHE="BRIGHT_cache/doc_emb/RepLLama/0.npy" # Path to cache for TheoremQA theorems BRIGHT repllama doc embeddings
Similarly, to precompute doc embeddings with a RaDeR retriever:
bash get_doc_embeddings.sh RaDeR
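The pipeline scripts read these values from the process environment. As a quick sanity check before launching long jobs, a minimal sketch using python-dotenv (assuming the repository follows the usual .env convention):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

# Fail fast if a required variable was left unset.
for var in ("HUGGING_FACE_HUB_TOKEN", "REPLLAMA_THEOREMT_DOC_EMB_CACHE"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set in .env")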
We generate training data using a retrieval-augmented Monte Carlo Tree Search (MCTS) approach to solve mathematical reasoning problems using LLMs. Our motivation is twofold. First, solving mathematical problems often requires applying theorems to subproblems, which enables the integration of retrievers. Theorems found to be relevant to subproblems are also relevant to the original question due to the reasoning steps that connect them. Second, verifying LLM answers against gold answers provides a proxy for evaluating the utility of retrieved theorems in solving subproblems.
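This second point is what the --bool_goldanswer_reward option below implements. A minimal sketch of such a boolean reward, with a hypothetical normalize_answer() helper that canonicalizes answer strings (the repository's actual answer checker may differ):

def normalize_answer(ans: str) -> str:
    # Hypothetical canonicalization: strip whitespace and LaTeX wrappers.
    return ans.strip().strip("$").replace(" ", "")

def gold_answer_reward(llm_answer: str, gold_answer: str) -> float:
    # Boolean reward: 1.0 if the final answer matches the gold answer, else
    # 0.0. This guides MCTS and serves as a proxy for whether the retrieved
    # theorems were actually useful in solving the subproblem.
    return 1.0 if normalize_answer(llm_answer) == normalize_answer(gold_answer) else 0.0

# Example: gold_answer_reward("  $42$ ", "42") -> 1.0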
Follow the commands in the run_MCTS.sh file to execute the data generation process, which generates retrieval-augmented reasoning paths guided by rewards based on final-answer correctness.
First, run the following commands to start the retriever servers (BM25 and RepLLama/RaDeR):
# Start BM25 server
python models/BM25_server_API.py > BM25_server.log 2>&1 &

# Start RepLLama server
./servers/repllama_server.sh > servers/RepLLama_server.log 2>&1 &

# Start RaDeR server
./servers/rader_server.sh RaDeR/merged_retriever_Qwen-2.5-7B-Instruct_MATH_questionpartialsol_and_LLMquery_full > servers/RaDeR_server.log 2>&1 &
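Once a server is up, the MCTS workers query it over HTTP. A minimal sketch of a client call, assuming a JSON POST endpoint; the endpoint path, port, and payload fields here are hypothetical, so check the server scripts for the actual interface:

import requests

# Hypothetical endpoint and payload; the real server API may differ.
resp = requests.post(
    "http://localhost:8001/retrieve",
    json={"query": "Cauchy-Schwarz inequality for integrals", "top_k": 5},
    timeout=60,
)
resp.raise_for_status()
for doc in resp.json()["results"]:  # hypothetical response schema
    print(doc)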
Example command to run the MCTS pipeline with parallel processing:
# Command to run MCTS on MATH dataset with Qwen2.5-7B-instruct and Repllama
python run_src/run_parallel_workers.py \
--dataset_name MATH \
--test_json_filename shuffled_train \
--api vllm-server \
--model_ckpt Qwen/Qwen2.5-7B-Instruct \
--note TEST_MCTS_MATH_repllama \
--retriever repllama \
--max_depth_allowed 6 \
--num_a1_steps 2 \
--num_rollouts 16 \
--disable_a5 \
--save_tree \
--disable_a3 \
--disable_a4 \
--bool_goldanswer_reward \
--LLM_candidate_theorems \
--num_workers 10 \
--retrieval_selfreasoning \
--run_outputs_root outputs_MCTS/run_outputs \
--cache_dir BRIGHT_cache/doc_emb/RepLLama \
--num_cpu 16
Important parameters for the MCTS run:
--retriever # Retriever choices: RaDeR/repllama
--LLM_candidate_theorems # Whether to use LLM candidate theorems prompt
--bool_goldanswer_reward # Use rewards based on final answer correctness to guide MCTS
--retrieval_selfreasoning # Whether to use self-reflection on the retrieved document
--dataset_name # Which math reasoning dataset to use for MCTS
--test_json_filename # Name of the json file in the dataset directory to use
Note
We use the rStar (Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers) GitHub repository as the foundation for our MCTS data generation pipeline.
We prompt an LLM to generate a query based on the math question M, the reasoning steps up to the query node (excluding the query), and the retrieved theorem. Assuming outputs_MCTS/run_outputs/MATH/TEST_RUN/answer_sheets is the path to your MCTS answer directory, use the following command:
python generate_LLMqueries.py --answers_directory_path outputs_MCTS/run_outputs/MATH/TEST_RUN/answer_sheets --model_ckpt Qwen/Qwen2.5-7B-Instruct --dataset_name MATH
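For illustration, a sketch of how such a query-generation prompt can be assembled; the wording is illustrative, not the exact prompt used in generate_LLMqueries.py:

def build_query_generation_prompt(question: str, partial_solution: str, theorem: str) -> str:
    # Illustrative template combining the math question M, the reasoning
    # steps up to (but excluding) the query node, and the retrieved theorem.
    return (
        "You are given a math question, a partial solution, and a theorem "
        "that helps complete the solution.\n\n"
        f"Question: {question}\n\n"
        f"Partial solution: {partial_solution}\n\n"
        f"Theorem: {theorem}\n\n"
        "Write a search query that a student at this point in the solution "
        "would issue to find this theorem."
    )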
To generate queries that have high term similarity with their respective relevant theorems, we prompt the LLM using only the theorem from the MCTS. We provide helper code to generate these queries, assuming you have the path to a CSV file with an input column containing the retrieved theorem. Uncomment the code in the main function of vLLM_server_API.py and run the following command:
bash generate_lexicalqueries.sh
Note
First, you need to make a CSV file with a column called input containing the theorems from the MCTS runs; generate_LLMqueries.py shows how to extract the theorems from MCTS.
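A minimal sketch of building that CSV with pandas, where theorems is assumed to be a list of theorem strings you extracted from the MCTS answer sheets:

import pandas as pd

# Assumed: a list of theorem strings extracted from the MCTS answer sheets.
theorems = ["Theorem (Cauchy-Schwarz): ...", "Theorem (AM-GM): ..."]

# The helper script expects a column literally named "input".
pd.DataFrame({"input": theorems}).to_csv("theorems.csv", index=False)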
For training our RaDeR retrievers and rerankers, we use Tevatron (GitHub), a toolkit for training billion-scale LLM neural retrievers on GPUs and TPUs. We use its functionality for parameter-efficient tuning with LoRA. Tevatron integrates vLLM, DeepSpeed, FlashAttention, gradient accumulation, and other efficient training and inference techniques.
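For context, a minimal sketch of the kind of LoRA configuration this corresponds to in peft; the rank and alpha values here are assumptions for illustration, while the target modules mirror the --lora_target_modules flag in the training command below:

from peft import LoraConfig

# Illustrative LoRA setup; r and lora_alpha are assumed values.
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,             # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "up_proj", "gate_proj"],
    task_type="FEATURE_EXTRACTION",
)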
Installation
- Clone the Tevatron repository.
- Install PyTorch for your CUDA version from the PyTorch website.
- Install dependencies and Tevatron.
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu
pip install -e .
Run training
Example command for Retriever training:
deepspeed --include localhost:0,1 --master_port 60000 --module tevatron.retriever.driver.train \
--deepspeed deepspeed/ds_zero3_config.json \
--output_dir retriever-RaDeR \
--model_name_or_path Qwen/Qwen2.5-7B-Instruct \
--lora \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 500 \
--dataset_name RaDer/MATH_allquery_types \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--per_device_train_batch_size 2 \
--gradient_checkpointing \
--train_group_size 12 \
--learning_rate 1e-4 \
--query_max_len 32 \
--passage_max_len 3900 \
--num_train_epochs 1 \
--logging_steps 10 \
--overwrite_output_dir \
--gradient_accumulation_steps 16
(1) For evaluation of retrievers on the BRIGHT dataset, run retriever_evaluation.sh with the correct arguments. An example command is shown in the file.
(2) For evaluation of rerankers on the BRIGHT dataset, run reranker_evaluation.sh with the correct arguments. An example command is shown in the file.
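Both scripts score rankings against BRIGHT's relevance judgments, where the standard metric is nDCG@10. A minimal sketch of computing that metric with pytrec_eval, using toy qrels and run dictionaries (illustrative data, not the repo's actual score-file format):

import pytrec_eval

# Toy relevance judgments and a toy ranking (query -> doc -> score).
qrels = {"q1": {"doc_a": 1, "doc_b": 0}}
run = {"q1": {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
results = evaluator.evaluate(run)
print(results["q1"]["ndcg_cut_10"])  # 1.0 here, since doc_a ranks first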
Note
The score files from our two best-performing RaDeR models with Qwen-32B-Instruct reranking are provided in the BRIGHT_score_files directory.