By David Wan, Jesse Vig, Mohit Bansal, and Shafiq Joty.
- `data/generation`: Data and code for replicating the data used for summary generation
- `data/metric_benchmark`: Data and code for replicating the data used for the metric meta-evaluation
- `src`: Code for running the metric and the different generation methods
- `prompts`: All prompts used for generation
This section describes how to replicate the datasets used for both the metric meta-evaluation and the summary generation tasks. All processed data and processing scripts are located in the data/ directory.
The benchmark data is used to evaluate the performance of faithfulness metrics. Each line in the jsonl files has the following structure:
- `id`: A unique identifier.
- `documents`: A list of source documents.
- `document`: The concatenated original document.
- `summary`: The system-generated summary to be evaluated.
- `faithfulness`: A binary label indicating faithfulness (1 for faithful, 0 for not).
- `ranking`: A list of document indices, ordered from most to least important.
- Download the original annotations from the DiverseSumm repository.
- Run the processing script:
python data/metric_benchmark/process_diversesumm.py ${path_to_diversesumm_annotation_jsonl}
- Clone the LongEval repository to obtain the sentence-level faithfulness annotations.
- Place the `pubmed_annotations` and `squality_annotations` directories into `data/metric_benchmark/longeval/`.
- Download the `pubmed_test.txt` file from Cohan et al., 2018 and place it in the same directory.
- Note: The script `process_longeval.py` contains hardcoded paths. You may need to adjust the paths on lines 13, 14, 28, 116, 119, and 122 to match your file locations.
- Run the processing script:
python data/metric_benchmark/process_longeval.py
- Download the compiled annotations from Infuse.
- From the Infuse data, place `DiverSumm.csv` into `data/metric_benchmark/diversumm/`.
- Download the ArXiv and PubMed test sets from Cohan et al., 2018 and place `arxiv_test.txt` into `data/metric_benchmark/scientific_papers/`.
- Note: You may need to update the hardcoded paths in `process_other.py` (e.g., lines 6 and 72) if your file structure differs.
- Run the processing script:
python data/metric_benchmark/process_other.py
This data is used as input for the various summary generation methods. Each jsonl file contains lines with id and document (a list of documents).
- DiverseSumm: Use the data from the original authors and run:
python data/generation/process_diversesumm.py
- Other Datasets: We use the `datasets` library. Please refer to the corresponding scripts in `data/generation/` for more details.
All source code is located in the src/ directory. Prompts for all models and tasks are in prompts/.
To evaluate generated summaries using an LLM-based metric (e.g., GPT-4o), use src/evaluate_summaries.py.
Example:
python src/evaluate_summaries.py \
--model_name gpt4o \
--data_file data/metric_benchmark/arxiv.jsonl \
--output_file results/arxiv_eval.json \
--document_merge_type max

All prompts used for generation can be found in the `prompts/` directory.
Use src/generate_summaries.py. The "Focus" method uses a modified prompt (prompts/arxiv_prompt_top.txt) to guide the model.
python src/generate_summaries.py \
--model_name gpt4o \
--data_file data/generation/arxiv.jsonl \
--prompt_file prompts/arxiv.txt \
--output_file results/summaries_standard.json

For incremental generation, use `src/generate_summaries_incremental.py`:

python src/generate_summaries_incremental.py \
--model_name gpt4o \
--data_file data/generation/arxiv.jsonl \
--prompt_file prompts/arxiv_incremental.txt \
--original_prompt_file prompts/arxiv.txt \
--output_file results/summaries_incremental.json

Hierarchical merging is a two-step process.
Step 1: Generate summaries for each document individually.
python src/generate_summaries_individual.py \
--model_name gpt4o \
--data_file data/generation/arxiv.jsonl \
--prompt_file prompts/arxiv.txt \
--output_file results/summaries_individual.json

Step 2: Merge the individual summaries.
python src/generate_summaries_merge.py \
--model_name gpt4o \
--prompt_file prompts/arxiv_merge.txt \
--data_file results/summaries_individual.json \
--output_file results/summaries_hierarchical.json

Calibrated generation is similar to hierarchical merging but uses a different initial generation script.
Step 1: Generate summaries for each document with calibration.
python src/generate_summaries_calibration_initial.py \
--model_name gpt4o \
--data_file data/generation/arxiv.jsonl \
--prompt_file prompts/arxiv.txt \
--output_file results/summaries_calibration_individual.json

Step 2: Merge the individual summaries.
python src/generate_summaries_merge.py \
--model_name gpt4o \
--prompt_file prompts/arxiv_merge.txt \
--data_file results/summaries_calibration_individual.json \
--output_file results/summaries_calibration.json

We provide all our generated outputs and metric scores in the following Google Drive folder:
- Metric Scores: These are JSON files containing sentence-level faithfulness scores. Each file is a nested list with shape `[num_examples, num_documents, num_sentences]`.
  - Scores from `MiniCheck` are floats; scores from `GPT-4o` are strings.
  - The `_split` variant contains scores for each document separately; the `_full` variant contains scores for the concatenated document.
- Generated Summaries: These are JSON files that mirror the input data format but include an additional `generated_summary` field. We also include the `MiniCheck` scores for all generated outputs.
If you find our work useful in your research, please cite our paper:
@inproceedings{wan-etal-2025-positional,
title = "On Positional Bias of Faithfulness for Long-form Summarization",
author = "Wan, David and Vig, Jesse and Bansal, Mohit and Joty, Shafiq",
editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.442/",
doi = "10.18653/v1/2025.naacl-long.442",
pages = "8791--8810",
ISBN = "979-8-89176-189-6",
}