This repository contains the implementation of DEL-ToM, designed to run on NVIDIA H100 GPUs.
It provides step-by-step instructions for (1) environment setup and (2) fine-tuning the Process Belief Model (PBM) using Axolotl on a curated ToM dataset.
In addition, it includes code for (3) generating the PBM training data and (4) performing inference-time scaling using PBM together with Best-of-N (BoN).
It is recommended to create a conda environment using Python 3.11 and PyTorch 2.7.1, as these versions ensure full compatibility with Axolotl.
# Create and activate environment
conda create -n del_tom python=3.11 -y
conda activate del_tom
# Install PyTorch
pip install torch==2.7.1
# Clone the DEL-ToM repository
git clone https://github.com/joel-wu/DEL-ToM
cd DEL-ToM

We fine-tuned meta-llama/Llama-3.1-8B-Instruct and meta-llama/Llama-3.2-3B-Instruct on our PBM dataset (20,000 conversations). Here is an example of how to train a PBM.

pip install --no-build-isolation "axolotl[flash-attn,deepspeed]"

If you plan to use gated models, log in first:

huggingface-cli login

We provide a ready-to-use llama.yml configuration file for full fine-tuning on Llama-3.2-1B-Instruct using ChatML formatting and training only on assistant turns.
Key points:
- Loads conversations from the dataset (conversations list with role and content).
- Uses chat_template: chatml.
- roles_to_train: ["assistant"], masks non-assistant tokens.
- Full fine-tuning by default (for LoRA, see Axolotl documentation).
- You can set your own model and output path by editing base_model and output_dir in the YAML.
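To make the effect of roles_to_train concrete, here is a minimal sketch of assistant-only loss masking. It is not Axolotl's implementation: the function name mask_non_assistant, the whitespace "tokenizer", the position-as-label shortcut, and the sample conversation are all assumptions made for illustration. The only borrowed convention is the label value -100, which PyTorch's cross-entropy loss ignores by default.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(
    conversations: List[Dict[str, str]],
) -> Tuple[List[str], List[int]]:
    """Return (tokens, labels); labels are IGNORE_INDEX outside assistant turns.

    Toy setup: whitespace splitting stands in for a real tokenizer, and the
    token position stands in for a real token id.
    """
    tokens: List[str] = []
    labels: List[int] = []
    for turn in conversations:
        for tok in turn["content"].split():
            position = len(tokens)
            tokens.append(tok)
            # Only tokens from assistant turns contribute to the loss.
            if turn["role"] == "assistant":
                labels.append(position)
            else:
                labels.append(IGNORE_INDEX)
    return tokens, labels


# Hypothetical sample: a user step followed by a "+" verdict from the assistant.
sample = [
    {"role": "user", "content": "Where does Anna think the ball is?"},
    {"role": "assistant", "content": "+"},
]
tokens, labels = mask_non_assistant(sample)
print(list(zip(tokens, labels)))
```

Every user token is paired with -100, so only the assistant's verdict token would be trained on.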
axolotl train llama.yml

That’s it - Axolotl will automatically:
- Download the base model
- Preprocess the dataset
- Start full fine-tuning
- Save the trained model to your specified output directory.
We have released our fine-tuned models on Hugging Face:
- joooelw/ToM-PBM-8B (base model: meta-llama/Llama-3.1-8B-Instruct)
- joooelw/ToM-PBM-3B (base model: meta-llama/Llama-3.2-3B-Instruct)
Both models are trained on our PBM dataset. Later, we will download these checkpoints for evaluation and inference-time scaling experiments.
If you are curious about how to generate ToM data, please see this section.
Otherwise, you can simply use our released PBM models for inference.
Our simulator is adapted from the Hi-ToM Dataset Repository.
We mainly modify the following files to enable belief tracking:
- ./Hi-ToM_dataset/generate_tasks.py
- ./Hi-ToM_dataset/tasks.py
These modifications allow the simulator to record the full belief trace of each agent during task execution.
To generate new ToM training or testing data with belief trace:
cd Hi-ToM_dataset
python generate_tasks.py -w world_xlarge.txt -n 20 -ptn=0.1
python generate_tasks.py -w world_xlarge.txt -n 20 -ptn=0.1 --tell True

The generated belief traces and questions will appear under the ./Hi-ToM_dataset/data_ToMh directory.
Each JSON file contains both the full belief-state sequence and the corresponding ToM reasoning questions.
If you want to recreate the PBM dataset from the simulator outputs, you can run our one-shot pipeline that:
- calls an LLM to produce step-by-step traces,
- aligns them with ground-truth belief states, and
- exports PRM-style training lines.
# 1) Put your OpenAI key in the env (required)
export OPENAI_API_KEY="YOUR_KEY_HERE"
# 2) Install deps for the generator
pip install openai aiofiles tqdm tiktoken
# 3) From repo root, run the pipeline script
cd ..
python generate_PBM_dataset.py \
--base-dir Hi-ToM_dataset/data_ToMh \
--trace-out gpt_trace.jsonl \
--conversations-out gpt_trace_conversations_format.jsonl \
--prm-out ToM_PBM_test.jsonl \
--model gpt-4o-mini \
--concurrency 200 \
--only-length length_1

- --base-dir points to the simulator output directory from 3.2.
- --trace-out stores raw LLM traces.
- --conversations-out stores aligned stepwise +/- conversations.
- --prm-out writes the final PBM-style jsonl: one line per sample as {"conversations": [...]}.
- Use --only-length all to include length_1/2/3 in one run.
This will produce a PBM jsonl similar in structure to our released datasets (e.g., joooelw/ToM-PRM-Train).
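Before training on a regenerated file, it can be worth checking that every line has the expected shape. The sketch below is a hypothetical helper, not part of the repository: it assumes only what is stated above, namely one JSON object per line with a "conversations" list whose items carry "role" and "content". The function name validate_pbm_lines and the demo line are made up for illustration.

```python
import json
from typing import Iterable, List


def validate_pbm_lines(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines and check the {"conversations": [...]} structure."""
    samples: List[dict] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        conversations = record.get("conversations")
        if not isinstance(conversations, list) or not conversations:
            raise ValueError(f"line {line_no}: missing or empty 'conversations'")
        for turn in conversations:
            if not isinstance(turn, dict) or not {"role", "content"} <= set(turn):
                raise ValueError(f"line {line_no}: turn lacks 'role' or 'content'")
        samples.append(record)
    return samples


# Demo on an in-memory line; for a real file, pass an open file handle instead,
# e.g. `with open("ToM_PBM_test.jsonl") as f: validate_pbm_lines(f)`.
demo_lines = [
    '{"conversations": [{"role": "user", "content": "step 1"}, '
    '{"role": "assistant", "content": "+"}]}',
]
samples = validate_pbm_lines(demo_lines)
print(f"{len(samples)} valid sample(s)")
```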
Sections 2 & 3 are optional.
Our released PBM models and datasets are available on Hugging Face, so you can directly perform inference.
We generate multiple candidate reasoning traces using a base LLM (e.g., Qwen) and later evaluate them with our trained PBM.
⚠️ Note: vLLM may conflict with Axolotl dependencies.
It’s recommended to create a new virtual environment for inference.
# Install vLLM
pip install vllm
# Inference
python bon_generation.py \
--data_path ./Hi-ToM_dataset/Hi-ToM_data \
--model Qwen/Qwen3-4B \
--n_samples 256

Command line arguments:
- --data_path: Path to the Hi-ToM dataset directory.
- --model: Hugging Face model name for the generator (e.g., Qwen/Qwen3-4B).
- --n_samples: Number of candidate completions per prompt.
- --output_dir: Directory to save JSONL results.
- --batch_size: Batch size for generation.
- --temperature: Sampling temperature (falls back to model default).
- --max_tokens: Maximum number of generated tokens (falls back to model default).
Files are saved to --output_dir as: ToM_BoN_candidates_{model_name}_n{n_samples}.jsonl
Each line is a JSON object:
{
"path": "path/to/sample.txt",
"question": "question text",
"answer": "ground truth answer",
"answers": [
"generated_answer_1",
"generated_answer_2"
]
}

Once reasoning traces are generated, we evaluate them using the PBM to identify which reasoning trajectory best aligns with ground-truth belief dynamics.
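Before scoring, the candidates file can be sanity-checked with a plain majority vote over each record's "answers" field. This is a minimal sketch, not code from the repository: the helper names, the strip/lowercase normalization, and the sample record are assumptions for illustration, and the repository's own answer matching may differ.

```python
import json
from collections import Counter
from typing import List


def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return answer.strip().lower()


def majority_answer(answers: List[str]) -> str:
    """Return the most frequent normalized answer (first seen wins ties)."""
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][0]


# Hypothetical record in the format described above.
record_line = (
    '{"path": "path/to/sample.txt", "question": "question text", '
    '"answer": "red_box", "answers": ["red_box", "blue_box", "Red_box "]}'
)
record = json.loads(record_line)
prediction = majority_answer(record["answers"])
is_correct = prediction == normalize(record["answer"])
print(prediction, is_correct)
```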
Run the PBM evaluation script:
python PBM_evaluation.py \
--input_file outputs/ToM_BoN_candidates_Qwen_Qwen3_4B_n256.jsonl \
--pbm_model joooelw/ToM-PBM-8B \
--output_file outputs/Qwen3-4B_scores.jsonl \
--n_traces 256 \
--batch_size 128

Command line arguments:
- --input_file: Path to the generated reasoning traces.
- --pbm_model: Model name for the trained PBM.
- --output_file: Output file path to save PBM scoring results.
- --n_traces: Number of reasoning traces to evaluate.
- --batch_size: Batch size for PBM evaluation.
⚠️ Note:
This script is designed for our released PBM models (joooelw/ToM-PBM-3B, joooelw/ToM-PBM-8B).
If you train your own PBM, make sure to align the chat template and tokenizer format
before computing scores, so that message roles and special tokens match your model’s configuration.
Finally, we evaluate the PBM reranking performance and visualize the Pass@N improvement curves across different sample sizes.
Install dependencies:
pip install matplotlib

Run evaluation and plotting:
python evaluate_draw.py \
--candidates_file outputs/ToM_BoN_candidates_Qwen_Qwen3_4B_n256.jsonl \
--scores_file outputs/Qwen3-4B_scores.jsonl \
--model_name "Qwen3-4B" \
--n_values 1 4 16 64 256 1024 \
--trials 20

Command line arguments:
- --candidates_file: Path to candidates JSONL file from bon_generation.py (required)
- --scores_file: Path to PBM scores JSONL file from PBM_evaluation.py (required)
- --output_dir: Output directory for plots (default: ./plots)
- --model_name: Model name for plot titles (default: Model)
- --mode: Evaluation mode - vanilla, weighted, or both (default: both)
- --trials: Number of random trials for evaluation (default: 10)
- --n_values: N values to test (default: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
- --output_format: Output format - pdf, png, or svg (default: pdf)
The script generates accuracy plots showing:
- Majority: Simple majority voting baseline
- Avg: Average score across all steps
- Last: Score of the final step
- Min: Minimum score across all steps
- Prod: Product of all step scores
Each plot shows:
- Accuracy vs N
- Confidence bands from multiple random trials
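The aggregation rules listed above (Avg, Last, Min, Prod) each reduce a trace's per-step PBM scores to one number, which can then drive either a pick-the-best selection or a score-weighted vote. The sketch below illustrates that idea under stated assumptions: the function names, the in-memory data layout, and the exact meaning of the vanilla and weighted modes are guesses for illustration, not the behavior of evaluate_draw.py.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List

# One reducer per aggregation rule; each maps step scores to a trace score.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "avg": lambda scores: sum(scores) / len(scores),
    "last": lambda scores: scores[-1],
    "min": min,
    "prod": math.prod,
}


def best_of_n(answers: List[str], step_scores: List[List[float]], rule: str) -> str:
    """Return the answer of the single highest-scoring trace."""
    aggregate = AGGREGATORS[rule]
    trace_scores = [aggregate(scores) for scores in step_scores]
    best_index = max(range(len(answers)), key=trace_scores.__getitem__)
    return answers[best_index]


def weighted_vote(answers: List[str], step_scores: List[List[float]], rule: str) -> str:
    """Sum trace scores per distinct answer and return the heaviest answer."""
    aggregate = AGGREGATORS[rule]
    totals: Dict[str, float] = defaultdict(float)
    for answer, scores in zip(answers, step_scores):
        totals[answer] += aggregate(scores)
    return max(totals, key=totals.get)


# Made-up example: three candidate traces with three scored steps each.
answers = ["red_box", "blue_box", "blue_box"]
step_scores = [[0.9, 0.8, 0.9], [0.9, 0.2, 0.95], [0.6, 0.5, 0.4]]

for rule in AGGREGATORS:
    print(rule, best_of_n(answers, step_scores, rule), weighted_vote(answers, step_scores, rule))
```

In this toy example, Min penalizes the second trace for its single weak step, while Last rewards it for a strong finish, so the two rules select different answers.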
If you find this work useful, please cite our paper:
@inproceedings{wu2025del-tom,
title={DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic},
author={Wu, Yuheng and Xie, Jianwen and Zhang, Denghui and Xu, Zhaozhuo},
booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing},
month={November},
year={2025}
}

For questions or issues, please contact Yuheng Wu or open an issue in this repository.