
wang2226/PAD


Privacy-Aware Decoding (PAD)

License: CC BY-NC 4.0 | Python | arXiv

📖 Overview

This repository contains the official implementation and datasets for our paper:

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

Project Structure

PAD/
├── 📁 data/                 # Attack prompts
├── 📁 result/               # Output results
├── 📁 processed/            # Processed data files
├── 📁 corpus/               # Corpus files for retrieval
├── 📁 RetrievalBase/        
├── 🐍 generate.py           # Main generation script
├── 🐍 llm.py                # LLM engine with PAD
├── 🐍 retriever.py          # Retrieval system
├── 🐍 evaluate.py           # Evaluation script
├── 🐍 utils.py              
├── 📄 environment.yml       
└── 📄 .gitignore           

⚙️ Installation & Setup

Prerequisites

  • Python: 3.9 or higher
  • Conda: For environment management

Quick Start

  1. Create and activate conda environment:

    conda env create -n pad --file environment.yml
    conda activate pad
  2. Download required datasets:

    Medical Datasets:

    Email Dataset:

🚀 Usage

Running Extraction Attacks (Baseline)

python generate.py \
    --dataset healthcaremagic \
    --model_name EleutherAI/pythia-6.9b \
    --retriever_model BAAI/bge-large-en-v1.5 \
    --temperature 0.2 \
    --max_tokens 256 \
    --output_file result/healthcaremagic/pythia/baseline.json
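For scripted or repeated runs, the same invocation can be assembled from Python. The following is a minimal sketch, not part of this repository: `build_generate_cmd` and `run_generate` are hypothetical helper names, and only flags that appear in this README are used. Creating the output directory beforehand is a precaution, since the README does not say whether `generate.py` creates it.

```python
import subprocess
from pathlib import Path


def build_generate_cmd(output_file,
                       dataset="healthcaremagic",
                       model_name="EleutherAI/pythia-6.9b",
                       retriever_model="BAAI/bge-large-en-v1.5",
                       temperature=0.2,
                       max_tokens=256,
                       extra_args=()):
    """Return the generate.py argument list (hypothetical helper).

    extra_args is appended verbatim, e.g. the PAD flags from this README.
    """
    cmd = [
        "python", "generate.py",
        "--dataset", dataset,
        "--model_name", model_name,
        "--retriever_model", retriever_model,
        "--temperature", str(temperature),
        "--max_tokens", str(max_tokens),
        "--output_file", str(output_file),
    ]
    cmd.extend(str(a) for a in extra_args)
    return cmd


def run_generate(cmd, output_file):
    """Create the output directory (a precaution), then run the command."""
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    out = "result/healthcaremagic/pythia/baseline.json"
    # Print the command instead of running it; call run_generate(...) to execute.
    print(" ".join(build_generate_cmd(out)))
```

Passing the command as a list avoids shell quoting issues with model names that contain slashes.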

Running PAD (Privacy-Aware Decoding)

python generate.py \
    --dataset healthcaremagic \
    --model_name EleutherAI/pythia-6.9b \
    --retriever_model BAAI/bge-large-en-v1.5 \
    --temperature 0.2 \
    --add_noise \
    --epsilon 0.2 \
    --noise_amplification 3.0 \
    --min_sensitivity 0.4 \
    --max_tokens 256 \
    --output_file result/healthcaremagic/pythia/pad.json
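This README does not describe how `--epsilon`, `--noise_amplification`, and `--min_sensitivity` are used inside `llm.py`. As a rough intuition only, the sketch below shows one way such parameters could shape Gaussian noise added to next-token logits, using the classical Gaussian mechanism from differential privacy. Everything here is an assumption for illustration: the function names, the `delta` parameter (there is no such flag in this README), and the roles assigned to each flag. It is not the repository's implementation.

```python
import math
import random


def gaussian_sigma(epsilon, sensitivity, delta=1e-5,
                   min_sensitivity=0.4, noise_amplification=3.0):
    """Noise scale under the classical Gaussian mechanism.

    Assumed roles (not confirmed by the repository):
      - min_sensitivity acts as a floor on the sensitivity estimate
      - noise_amplification multiplies the resulting scale
      - delta is a hypothetical parameter with no matching CLI flag
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    s = max(sensitivity, min_sensitivity)
    base = s * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return noise_amplification * base


def add_noise_to_logits(logits, sigma, rng=None):
    """Add independent zero-mean Gaussian noise to each logit."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in logits]


if __name__ == "__main__":
    sigma = gaussian_sigma(epsilon=0.2, sensitivity=0.1)
    noisy = add_noise_to_logits([2.0, 0.5, -1.0], sigma, random.Random(0))
    print(f"sigma={sigma:.3f}", noisy)
```

In this sketch a smaller `epsilon` yields a larger noise scale, which is the usual privacy-utility trade-off: stronger protection at the cost of noisier generations.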

📊 Evaluation

Evaluate baseline extraction attack:

python evaluate.py \
    --input_file result/healthcaremagic/pythia/baseline.json \
    > result/healthcaremagic/pythia/baseline.txt

Evaluate PAD results:

python evaluate.py \
    --input_file result/healthcaremagic/pythia/pad.json \
    > result/healthcaremagic/pythia/pad.txt
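To evaluate several result files in one go, the two commands above can be wrapped in a short Python script that reproduces the `>` redirection. This is a sketch under stated assumptions: `build_evaluate_cmd` and `run_to_file` are hypothetical helpers, and `evaluate.py` is assumed to print its report to standard output, as the redirection in the commands above suggests.

```python
import subprocess
import sys
from pathlib import Path


def build_evaluate_cmd(input_file, python=sys.executable):
    """Return the evaluate.py argument list (hypothetical helper)."""
    return [python, "evaluate.py", "--input_file", str(input_file)]


def run_to_file(cmd, out_path):
    """Run cmd and write its standard output to out_path, like `cmd > out_path`."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as fh:
        subprocess.run(cmd, stdout=fh, check=True)
    return out


if __name__ == "__main__":
    base = Path("result/healthcaremagic/pythia")
    for name in ("baseline", "pad"):
        # Reads <name>.json and writes the report to <name>.txt.
        run_to_file(build_evaluate_cmd(base / f"{name}.json"), base / f"{name}.txt")
```

With `check=True`, a failing evaluation raises `subprocess.CalledProcessError` instead of silently leaving a partial report.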

📚 Citation

If you find this work useful, please cite our paper:

@article{wang2025privacy,
  title={Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation},
  author={Wang, Haoran and Xu, Xiongxiao and Huang, Baixiang and Shu, Kai},
  journal={arXiv preprint arXiv:2508.03098},
  year={2025}
}

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
