
PersonalizedCIR

This is a reproducibility study, submitted to SIGIR 2025, of the original paper "How to Leverage Personal Textual Knowledge for Personalized Conversational Information Retrieval". The original codebase can be found here. Our codebase has been substantially extended to allow for self-contained experiments.

Environment and dependencies

We highly suggest using a Conda environment through the provided installation script, since several requirements do not install correctly from PyPI automatically. To run this script, invoke the following in the repository's main folder. The script will automatically install Anaconda if it cannot find or load Conda. The environment name (pcir by default) can be changed at the top of the script.

bash install_environment.sh
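
Once the script finishes, the environment can be activated with the standard Conda command (assuming the default environment name from the script):

conda activate pcir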

Not recommended: If you want to install the dependencies manually, please see the list below. Main packages:

  • python 3.11.x
  • torch 2.5.1
  • transformers 4.46.2
  • numpy 2.0.2
  • faiss-gpu 1.9.0
  • pyserini 0.43.0
  • openai 1.54.1
  • pytrec-eval 0.5
  • toml 0.10.2
  • tenacity 9.0.0
  • pandas 2.2.3
  • tqdm 4.66.6
  • accelerate 1.2.0
  • scipy 1.14.1
  • httpx 0.27.2

When installing manually, don't forget to install the repository itself as a package:

pip install -e .

Preparation

1. Downloading data

This repository already contains the 2023 TREC iKAT conversation data inside the data folder. The 116M document collection has to be downloaded from the iKAT TREC website. Note that the document collection is not publicly available; an account is required to access it. Please refer to the "collection" section in the iKAT README for more details.

This repository contains Slurm job scripts to download both the raw JSONL passage data and the prepared BM25 index. If you are not using Snellius Slurm, please also change the output directory at the top of each job script. Running these scripts directly from the terminal is possible by copying their contents into a bash script.

Important

  • Both scripts below require an account to download the collection files. Please make sure to set the IKAT_USERNAME and IKAT_PASSWORD variables prior to downloading, for example by defining them in a set_secrets.sh script as sketched below.
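
A minimal set_secrets.sh could look like this (the values are placeholders for your own iKAT credentials):

export IKAT_USERNAME="your_username"
export IKAT_PASSWORD="your_password"

Load it into your shell with source set_secrets.sh before submitting the download jobs.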

The passages can be downloaded into a single collection.jsonl file using the script below. This file is only needed for ANCE/dense retrieval. Make sure to modify the destination directory as needed.

sbatch jobs/download/download_raw_dataset.job

To download the BM25 index for Pyserini, use the script below; this is only required for BM25 retrieval. Make sure to modify the destination directory as needed.

sbatch jobs/download/download_index_dataset.job

2. Preprocessing

After downloading the data, the JSONL collection needs to be preprocessed for the ANCE dense retrieval tasks. The pre-trained ad-hoc search model ANCE is used to generate passage embeddings, and is hosted by us on HuggingFace. The entire preprocessing can be invoked using the following two Python scripts. Make sure to modify the filepaths in both configuration files to correspond to your file system.

Important

These scripts take roughly 80 hours to complete on a single A100 GPU.

python index/gen_tokenized_doc.py --config=index/gen_tokenized_doc.toml
python index/gen_doc_embeddings.py --config=index/gen_doc_embeddings.toml

Optional: After the preprocessing has finished, it is possible to verify that it completed correctly:

python index/verify_outputs.py --config=index/gen_tokenized_doc.toml

Optional: Later reproduction stages require flattened versions of the iKAT conversation data. These are already present in this repository, specifically 2023_test_topics_flattened.jsonl. Should you want to recreate this file, this can be done through:

python pcir/preprocessing_data.py

3. iKAT TREC 2024

We have also repeated our research for the new iteration of the iKAT TREC dataset, and we host the 2024 conversation data in the data folder as well. Note that the 2024 gold-standard relevance file (2024-qrels.all_turns.txt) is not yet public, so it has not been included in this repository. While our entire pipeline supports this dataset, you have to obtain this file yourself once it has been published and place it in the data directory. For clarity, the default arguments and examples below assume the 2023 dataset, so the command line/configuration arguments need to be changed for 2024.

Reproduction

This section outlines how to reproduce every experiment, provided the datasets have been downloaded and preprocessed. First, the basic steps are outlined, after which we briefly explain how to run ablations such as in-context learning, using different LLMs, or using OpenAI batch processing to efficiently repeat experiments.

0. Dataset statistics

It is possible to automatically calculate some rudimentary statistics of the dataset, which form the basis of the table in our paper. This can be done by running:

python data/generate_data_table.py

1. Query reformulation

Note

If you do not want to run the query reformulation process yourself, you can use the files we created, which are also hosted in this repository (including subfolders for in-context learning and Llama). Note that these are already the output files of the reformulation, so this entire section can be skipped when using them.

Note

The query reformulation process inherently introduces differences between runs, as OpenAI's API will give different results for the same query. To get our exact results, please use the files already hosted in this repository.

The method of this research distinguishes between two different pipelines. First, there are approaches that separately select PTKB (either intelligently or through a baseline) and then reformulate using an LLM; these can be reproduced through section 1.1. Second, there is the Select And Reformulate (SAR) pipeline, which does both PTKB selection and reformulation in one pass, as explained in section 1.2. The prompt templates are provided in prompt_template.md. All scripts output files according to a standard naming scheme (in the data folder), so filepaths often do not have to be specified. If this does not work, each script also accepts explicit input and output file arguments.

1.1 Two-stage approaches

The paper distinguishes between five approaches that consist of a separate PTKB selection and reformulation stage: All, None, Human, Automatic and LLM (STR). The three baselines (None, All and Human) can be run by directly invoking the reformulation script. LLM/STR requires an explicit first step to be run, and Automatic has its own custom script. See the list below for an example of each of the five approaches.

  • None (no PTKB): python pcir/methods/reformulate.py --annotation 'None'
  • All (use all PTKB): python pcir/methods/reformulate.py --annotation 'All'
  • Human: python pcir/methods/reformulate.py --annotation 'human'
  • STR: First run python pcir/methods/select_ptkb_Xshot.py --shot 0 to select the relevant PTKB. Then run python pcir/methods/reformulate.py --annotation 'LLM' --shot 0
  • Automatic: python pcir/methods/ptkb_automatic_method.py

1.2 Select and reformulate (SAR)

To run the SAR pipeline, which selects PTKB and reformulates the query in a single pass, the following can be used:

python pcir/methods/select_reformulate_Xshot.py --shot 0

1.3 Output file details and overwrite option

  • Default output file:
    If not explicitly provided via --output_path, the script will automatically create an output file in data/results/ with a name following the pattern:
    2023_test[_<ptkb_selection_type>]_<N>shot[_<llm_model>].jsonl
    For example: data/results/2023_test_human_0shot.jsonl

    Note: using gpt-3.5-turbo-16k leaves llm_model empty as it's the default model

  • Overwriting Existing Output:
    The scripts check for already-processed sample IDs in the output file and skip them in subsequent runs.
    To re-run the scripts from scratch, set your OpenAI API key and use the --overwrite flag:

  source set_secrets.sh
  python pcir/methods/select_ptkb_Xshot.py --shot 0 --overwrite
  python pcir/methods/reformulate.py --annotation LLM --shot 0 --prompt_type 1 --overwrite

2. Retrieval evaluation

Once the self-contained queries have been obtained through LLM reformulation, sparse or dense retrieval can be employed to evaluate the quality of the queries.

2.1 Sparse retrieval

We can perform sparse retrieval to evaluate the personalized reformulated queries by running, for example, the command below. For a JSONL file obtained through the automatic method, make sure to also pass the --automatic_method flag (see the second example).

python pcir/eval/bm25_ikat.py --input_query_path data/results/2023_test_SAR_0shot.jsonl --index_dir_path path/to/bm25/index
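
For instance, for an output of the automatic method the call could look as follows (the input filename is a placeholder; use whatever file the automatic script produced):

python pcir/eval/bm25_ikat.py --input_query_path data/results/<automatic_output>.jsonl --index_dir_path path/to/bm25/index --automatic_method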

2.2 Dense retrieval

We can perform dense retrieval to evaluate the personalized reformulated queries by running:

python pcir/eval/ance_ikat.py --config pcir/eval/ance_ikat_config.toml

You will need to modify the 'passage_offset2pid_path' and 'passage_collection_path' in this configuration file accordingly. For a JSONL file obtained through the automatic method, make sure to use the --automatic_method flag too.

3. In context learning

The procedure to run in-context learning does not differ much from the pipelines outlined above. Simply run either STR or SAR with the --shot x flag in all scripts from section 1 of this part, where 'x' denotes the number of in-context learning examples (currently 0, 1, 3 or 5) taken from the 2023 training dataset (note that a 2024 train set does not exist, so we used the 2023 examples for 2024 too). The retrieval evaluation is identical, using the reformulation JSONL file obtained with multiple shots. Note that the original paper always used the same fixed examples, but we also support using random examples through the --random_examples flag, both for SAR and STR.
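
For instance, running SAR with three fixed in-context examples, or with random examples, could look like this:

python pcir/methods/select_reformulate_Xshot.py --shot 3
python pcir/methods/select_reformulate_Xshot.py --shot 3 --random_examples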

4. Using a different LLM

The original paper used OpenAI's gpt-3.5-turbo-16k for all experiments. We extended this by also implementing gpt-4o-mini and Llama-3.1 8B. To use a different LLM, simply add the command line argument --llm_model model_string to any script in section 1 of this part. If the model string is not a GPT version, it is assumed to be a HuggingFace identifier. This allows us to pass --llm_model "meta-llama/Meta-Llama-3.1-8B-Instruct" to use Llama-3.1 8B.
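
For example, the SAR pipeline can be run with Llama-3.1 8B as follows:

python pcir/methods/select_reformulate_Xshot.py --shot 0 --llm_model "meta-llama/Meta-Llama-3.1-8B-Instruct"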

5. Repeating experiments with batch processing

As an extension, we allow experiments to be repeated multiple times using OpenAI's cheaper batch processing API. All files related to this reformulation can be found in the batch_processing folder. There are four submit scripts: one for the SAR method, one for the LLM-based PTKB selection (STR), one for the reformulation required for All, None, Human and STR, and one for the automatic method.

Next, there are four similar scripts that check if the corresponding job has completed, and save the results to a file if applicable. Use batch_cancel.py to cancel a job.

5.1 Aggregation

If you have run an experiment multiple times, you can aggregate the results to calculate means and standard deviations. Use --automatic_method if it concerns an automatic method run. Note that this script assumes the files are named {input}_run{i}.jsonl, where i goes from 1 to the number of runs specified (5 by default).

python pcir/eval/aggregate_results.py --input data/batch/gpt-4o-mini/processed/batch_SAR_5shot --year "2023" --num_runs 5

This script will output an aggregated file, by default in the data/batch/gpt-4o-mini/aggregated/ folder.

5.2 Significance testing

Given multiple aggregation summary files, each from a different method, it can be interesting to perform a t-test to see whether certain methods differ significantly. This can be done through a dedicated script, where n denotes how many experiments were conducted for each aggregation file.

python pcir/eval/significance_test.py --folder basefolder --files file1.jsonl file2.jsonl file3.jsonl -n 5

This script will automatically test for significance across any pair of aggregation files, for any defined subset and metric.
