This document outlines the step-by-step process for handling the Pararel dataset in our pipeline.

First, set up the environment:

```shell
pip install -r requirements.txt
```
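Throughout the pipeline, the `--case blank` flag refers to fill-in-the-blank style prompts built from Pararel's relation templates. For intuition, the templating step can be pictured as turning a subject–relation–object record into a blank-style question. This is a hypothetical sketch, not the actual `template.py` logic; the field names and the `[X]`/`[Y]` placeholders are assumptions:

```python
# Hypothetical illustration of blank-style templating; not the actual template.py.
def to_blank_prompt(record):
    """Turn a record with a relation template into a fill-in-the-blank
    question and its gold answer. The template is assumed to contain
    [X] (subject slot) and [Y] (the slot to blank out)."""
    question = (record["template"]
                .replace("[X]", record["subject"])
                .replace("[Y]", "____"))
    return question, record["object"]

q, a = to_blank_prompt({
    "subject": "Paris",
    "template": "[X] is the capital city of [Y].",
    "object": "France",
})
print(q)  # Paris is the capital city of ____.
print(a)  # France
```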
Before training, the dataset needs to be preprocessed. The following commands process the test and training sets:
```shell
# Process the Pararel test dataset
python template.py --data_path dataset/blank/pararel_test.json --save_path dataset/blank/processed_pararel_test.json --case blank --question_number 3

# Process the Pararel training dataset
python template.py --data_path dataset/blank/pararel_train.json --save_path dataset/blank/processed_pararel_train.json --case blank --question_number 3
```

Once the dataset is processed, we generate model predictions using the following commands:
```shell
# Generate predictions for the Pararel test dataset
python generate_output.py --data_path dataset/blank/processed_pararel_test.json --save_path result/blank/pararel.json --case blank --generate_vllm --question_number 3 --gpu 0

# Generate predictions for the Pararel training dataset
python generate_output.py --data_path dataset/blank/processed_pararel_train.json --save_path result/blank/pararel.json --case blank --generate_vllm --question_number 3 --gpu 0
```

To evaluate the model’s performance, compare its predictions against the ground truth labels:
```shell
python compare.py --data_path result/blank/pararel.json --case blank --question_number 3
```

To improve model robustness, we categorize the dataset into certain and uncertain instances:
```shell
python divide_dataset.py --data_path dataset/blank/processed_pararel_train.json --result result/blank/pararel.json --save_path dataset/blank/pararel_split/pararel --case blank
```

To enhance model performance, fine-tune it using the Pararel dataset:
```shell
# Fine-tune using LLaMA3
python fine_tune.py --data_path dataset/blank/pararel_split/pararel --save_path models/blank/llama3_pararel --case blank --question_number 3 --gpu 0

# Fine-tune using Qwen
python fine_tune_Qwen.py --data_path dataset/blank/pararel_split/pararel --save_path models/blank/qwen_pararel --case blank --question_number 3 --gpu 0
```

After fine-tuning, we generate new predictions using the updated model:
```shell
python generate_output.py --data_path dataset/blank/processed_pararel_test.json --save_path fine_tune_result/blank/pararel.json --lora_model --lora_path models/blank/llama3_pararel --case blank --question_number 3 --gpu 0
```

To assess the improvement, compare the fine-tuned model’s output:
```shell
python compare.py --data_path fine_tune_result/blank/pararel.json --case blank --question_number 3
```

To quantify the model’s reliability, compute the AP (Average Precision) score:
```shell
# AP score for the fine-tuned model
python calculate_ap.py --data_path fine_tune_result/blank/pararel.json --lora_model --lora_path models/blank/llama3_pararel --case blank --gpu 0
```

This pipeline ensures a systematic approach to processing, fine-tuning, and evaluating the Pararel dataset. 🚀
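For intuition, average precision over confidence-ranked predictions can be sketched as follows. This is a generic illustration of the metric, not the exact logic of `calculate_ap.py`:

```python
# Generic sketch of average precision (AP) over confidence-ranked predictions;
# not the exact logic of calculate_ap.py.
def average_precision(scored):
    """scored: list of (confidence, is_correct) pairs."""
    # Rank predictions from most to least confident.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            # Precision at each rank where a correct prediction appears.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Correct answers concentrated at high confidence yield a high AP.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)]))  # 0.8333...
```

A well-calibrated model assigns higher confidence to its correct answers, so its AP approaches 1.0.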
If you find this repository helpful, please consider citing our paper to support the research.
```bibtex
@misc{huang2025mactuningllmmulticompositionalproblem,
      title={MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness},
      author={Junsheng Huang and Zhitao He and Sandeep Polisetty and Qingyun Wang and May Fung},
      year={2025},
      eprint={2504.21773},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.21773},
}
```