LLM-SelfReflection-Eval

This code is associated with the NAACL 2024 Findings paper: "When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models". If you use this code or the results from our paper, please cite:

@inproceedings{li-etal-2024-hindsight,
    title = "When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models",
    author = "Li, Yanhong  and Yang, Chenghao  and Ettinger, Allyson",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
    month = jun,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-naacl.237"
}

Setup Environment

To install the necessary dependencies, run:

pip install -r requirements.txt

Generating Data

Initial Responses and Critiques

To generate initial response and critique pairs, along with responses with and without reflection, run the following command. We used --n_samples 4 in our paper:

python generate.py \
    --cache_dir YOUR_CACHE_DIR \
    --data_dir YOUR_DATA_DIR \
    --dataset truthfulqa \
    --api_key YOUR_API_KEY \
    --n_samples 4 \
    --model gpt-3.5-turbo
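
If you prefer not to type the API key on the command line, the same invocation can be assembled from Python. The following is a minimal sketch, not part of the repository: the helper name build_generate_command and the OPENAI_API_KEY environment variable are assumptions made for illustration, and only the flags shown in the command above are used.

```python
import os
import sys


def build_generate_command(cache_dir, data_dir, dataset="truthfulqa",
                           n_samples=4, model="gpt-3.5-turbo", env=None):
    """Assemble the argv list for generate.py using the flags shown above."""
    env = os.environ if env is None else env
    # OPENAI_API_KEY is an assumed variable name; the repository does not define it.
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before building the command.")
    return [
        sys.executable, "generate.py",
        "--cache_dir", str(cache_dir),
        "--data_dir", str(data_dir),
        "--dataset", dataset,
        "--api_key", api_key,
        "--n_samples", str(n_samples),
        "--model", model,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root, so that a failing run raises an error instead of passing silently.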

Synthetic Datastore for HotpotQA

To build a synthetic datastore for HotpotQA, run:

python build_synthetic_dataset.py \
    --cache_dir YOUR_CACHE_DIR \
    --data_dir YOUR_DATA_DIR \
    --api_key YOUR_API_KEY \
    --model gpt-3.5-turbo

Generate with Synthetic Data

To generate data using the synthetic datastore, use the following command:

python generate_w_synthetic_data.py \
    --cache_dir YOUR_CACHE_DIR \
    --data_dir YOUR_DATA_DIR \
    --dataset hotpotqa \
    --api_key YOUR_API_KEY \
    --n_samples 10 \
    --model gpt-3.5-turbo
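
The two HotpotQA steps above can also be chained from a small script. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes that generate_w_synthetic_data.py expects the datastore from build_synthetic_dataset.py to exist already, which the section order suggests but the README does not state outright.

```python
import subprocess
import sys


def hotpotqa_commands(cache_dir, data_dir, api_key,
                      n_samples=10, model="gpt-3.5-turbo"):
    """Return the two HotpotQA commands in the order they appear above."""
    build = [
        sys.executable, "build_synthetic_dataset.py",
        "--cache_dir", cache_dir,
        "--data_dir", data_dir,
        "--api_key", api_key,
        "--model", model,
    ]
    generate = [
        sys.executable, "generate_w_synthetic_data.py",
        "--cache_dir", cache_dir,
        "--data_dir", data_dir,
        "--dataset", "hotpotqa",
        "--api_key", api_key,
        "--n_samples", str(n_samples),
        "--model", model,
    ]
    return [build, generate]


def run_in_order(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling run_in_order(hotpotqa_commands("YOUR_CACHE_DIR", "YOUR_DATA_DIR", "YOUR_API_KEY")) from the repository root runs both steps; stopping at the first failure avoids generating against a missing or partial datastore.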

Note

The data used to plot the figures in the paper is in the plots directory. Run python plots/plot_4.py and python plots/plot_10.py to reproduce Figures 2 and 3 of the paper. Our generated results are saved in the results directory for reference.
