- (🔥 New) MiniLLM is supported in HuggingFace TRL. Check out the docs.
- MiniLLM's "minimizing reverse KLD by on-policy distillation" is introduced by Thinking Machines Lab. Check out their implementation.
See also:
- DPKD: A simple improvement of MiniLLM using DPO.
- MiniPLM: Knowledge distillation for pre-training language models.
pip3 install git+https://github.com/t1101675/transformers@minillm
pip3 install torch
pip3 install deepspeed
pip3 install numerize
pip3 install rouge-score
pip3 install torchtyping
pip3 install rich
pip3 install accelerate
pip3 install datasets
pip3 install peft
or
bash install.sh
Our data and pre-trained models are uploaded to our HuggingFace repo. We modified the transformers code base to support model (tensor) parallel and teacher-mixed sampling. The modified lines are wrapped with
# ### MiniLLM BEGIN ###
... SOME NEW CODES ...
# ### MiniLLM END ###
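To see which parts of a file in the modified transformers fork were changed, one option is to scan for the two marker comments. The following is a minimal sketch, not a tool shipped with this repository; the function name and the sample source are made up for illustration.

```python
# Sketch: list the regions wrapped by the MiniLLM markers in a source file.
# The marker strings come from the text above; the sample source is made up.
BEGIN = "# ### MiniLLM BEGIN ###"
END = "# ### MiniLLM END ###"


def find_marked_regions(source: str) -> list[tuple[int, int]]:
    """Return 1-based (begin_line, end_line) pairs for each marked region."""
    regions = []
    start = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped == BEGIN:
            start = lineno
        elif stripped == END and start is not None:
            regions.append((start, lineno))
            start = None
    return regions


sample = """def forward(x):
    # ### MiniLLM BEGIN ###
    x = x + 1
    # ### MiniLLM END ###
    return x
"""
regions = find_marked_regions(sample)
print(regions)  # [(2, 4)]
```

Running the same function over the text of a real source file gives the line ranges to review.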
- The training/evaluation instruction-response data before processing can be downloaded from the following links: dolly, self-inst, vicuna, sinst, and uinst.
huggingface-cli download MiniLLM/dolly --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/dolly/
huggingface-cli download MiniLLM/self-inst --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/self-inst/
huggingface-cli download MiniLLM/Vicuna --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/vicuna/
huggingface-cli download MiniLLM/sinst --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/sinst/
huggingface-cli download MiniLLM/uinst --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/data/uinst/
- (Optional) The plain-text corpus $\mathcal{D}_\text{PT}$ can be downloaded from the HuggingFace datasets repository. For reproducibility, we recommend using the following preprocessed data.
- The processed data can be downloaded from the following links: dolly, openwebtext (Optional), roberta-corpus (Optional).
huggingface-cli download MiniLLM/dolly-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/dolly/
huggingface-cli download MiniLLM/openwebtext-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/openwebtext/gpt2/512/10M/ # Optional
huggingface-cli download MiniLLM/roberta-corpus-processed --repo-type dataset --local-dir /PATH_TO/LMOps/minillm/processed_data/openwebtext/ # Optional
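The dataset downloads all follow one pattern, so they can be generated rather than typed. The sketch below only builds the argument lists and prints them; the helper name is hypothetical, and it assumes the target directory is passed with `--local-dir`, as in the processed-data commands.

```python
# Sketch: build huggingface-cli argument lists for the MiniLLM datasets.
# BASE_PATH is the placeholder used throughout this README.
BASE_PATH = "/PATH_TO/LMOps/minillm"

# Hub repository name -> local directory name (note the capital "V" in Vicuna).
RAW_DATASETS = {
    "dolly": "dolly",
    "self-inst": "self-inst",
    "Vicuna": "vicuna",
    "sinst": "sinst",
    "uinst": "uinst",
}


def download_command(repo_name: str, local_dir: str) -> list[str]:
    """Return the argv for downloading one dataset repo into local_dir."""
    return [
        "huggingface-cli", "download", f"MiniLLM/{repo_name}",
        "--repo-type", "dataset",
        "--local-dir", local_dir,
    ]


commands = [
    download_command(repo, f"{BASE_PATH}/data/{local}/")
    for repo, local in RAW_DATASETS.items()
]
for cmd in commands:
    print(" ".join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```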
bash scripts/gpt2/tools/process_data_dolly.sh /PATH_TO/LMOps/minillm # Process Dolly Train / Validation Data
bash scripts/opt/tools/process_data_dolly.sh /PATH_TO/LMOps/minillm # Process Dolly Train / Validation Data
bash scripts/llama/tools/process_data_dolly.sh /PATH_TO/LMOps/minillm # Process Dolly Train / Validation Data
Get the plain-text corpus:
python3 tools/get_openwebtext.py
This script replaces consecutive \n characters in each document with the special token "<@x(x!>" and writes each OpenWebText document on one line, which is convenient for parallel processing. data/openwebtext/data.txt gives an example of the resulting format. You can follow this format to prepare corpora other than OpenWebText.
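For preparing another corpus in the same one-document-per-line format, the conversion can be sketched as follows. This is an illustration, not the code of tools/get_openwebtext.py: it assumes that each run of newlines becomes a single token, and the function names are hypothetical.

```python
import re

NEWLINE_TOKEN = "<@x(x!>"  # special token quoted in the text above


def to_single_line(document: str) -> str:
    """Collapse each run of newlines into one special token (assumed behaviour)."""
    return re.sub(r"\n+", NEWLINE_TOKEN, document.strip("\n"))


def from_single_line(line: str) -> str:
    """Approximate inverse: every token becomes a single newline."""
    return line.replace(NEWLINE_TOKEN, "\n")


docs = ["First paragraph.\n\nSecond paragraph.\n", "A one-line document."]
lines = [to_single_line(d) for d in docs]
print("\n".join(lines))  # one document per output line
```

Because each document then occupies exactly one line, the file can be split by line count and handed to parallel workers.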
Tokenize the data and store them in binary files:
bash scripts/gpt2/tools/process_data_pretrain.sh /PATH_TO/LMOps/minillm # Process OpenWebText Train / Validation Data
bash scripts/opt/tools/process_data_pretrain.sh /PATH_TO/LMOps/minillm # Process RoBERTa Corpus Train / Validation Data
bash scripts/llama/tools/process_data_pretrain.sh /PATH_TO/LMOps/minillm # Process RoBERTa Corpus Train / Validation Data
- The pre-trained models (MiniLLM and the baselines) can be found in this collection.
To run fine-tuning or standard KD baselines, you need to download the model checkpoints from the HuggingFace Model Hub and put them in checkpoints/. For example, for gpt2-large, you can download the model from this link and put it in checkpoints/gpt2-large.
huggingface-cli download gpt2 --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-base
huggingface-cli download gpt2-medium --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-medium
huggingface-cli download gpt2-large --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-large
huggingface-cli download gpt2-xl --repo-type model --local-dir /PATH_TO/LMOps/minillm/checkpoints/gpt2-xlarge
Alternatively, you can change the CKPT variable in each script to the corresponding model name so that Transformers downloads the base model automatically. For example, setting CKPT="gpt2-large" in scripts/gpt2/sft/sft_large.sh downloads the gpt2-large base model from the HuggingFace model hub.
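The two ways of supplying a base model (a local directory under checkpoints/ or a Hub model name in CKPT) can be combined in a small helper. This is a hypothetical sketch: the repository's scripts simply read the CKPT variable and do not perform this fallback themselves.

```python
from pathlib import Path
import tempfile


def resolve_ckpt(ckpt: str, base_path: str) -> str:
    """Prefer a local directory under checkpoints/, else return the Hub name.

    Hypothetical helper, not part of the repository.
    """
    local_dir = Path(base_path) / "checkpoints" / ckpt
    if local_dir.is_dir():
        return str(local_dir)
    return ckpt  # a name such as "gpt2-large" is fetched by Transformers


with tempfile.TemporaryDirectory() as base:
    # Simulate a checkpoint that was downloaded in advance.
    (Path(base) / "checkpoints" / "gpt2-large").mkdir(parents=True)
    local_result = resolve_ckpt("gpt2-large", base)   # local directory
    hub_result = resolve_ckpt("gpt2-medium", base)    # falls back to the name

print(local_result)
print(hub_result)
```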
NOTE:
- LLaMA models require license and cannot be directly downloaded.
- If you want to use model parallel for training, it is recommended to download the models to checkpoints because you need to run tools/convert_mp.py to change their model parallel sizes (see next section).
If you find the model is too large to fit in your GPUs, you can increase/decrease the tensor parallel sizes with
python3 tools/convert_mp.py \
--input_path results/llama/train/minillm/7B-init-13B-sft \
--source_mp_size 1 \
--target_mp_size 4 \
--model_type llama # choose from opt and llama
To use the model with Model Parallel, we provide two example scripts for training and evaluation.
NOTE: Model parallelism is not applied to gpt2 because these models are generally sufficiently small to fit in common GPUs.
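Conceptually, changing the tensor parallel size means re-partitioning each sharded weight matrix into a different number of pieces. The toy sketch below illustrates this with a column-wise split on nested lists. It is not the logic of tools/convert_mp.py, which operates on real checkpoints and may split different layers along different dimensions; all function names are hypothetical.

```python
def split_columns(weight, mp_size):
    """Split a matrix (list of rows) column-wise into mp_size equal shards."""
    n_cols = len(weight[0])
    if n_cols % mp_size != 0:
        raise ValueError("column count must be divisible by mp_size")
    width = n_cols // mp_size
    return [[row[i * width:(i + 1) * width] for row in weight]
            for i in range(mp_size)]


def merge_columns(shards):
    """Concatenate column shards back into the full matrix."""
    n_rows = len(shards[0])
    return [sum((shard[r] for shard in shards), []) for r in range(n_rows)]


def reshard(shards, target_mp_size):
    """Change the number of shards by merging, then splitting again."""
    return split_columns(merge_columns(shards), target_mp_size)


full = [[r * 8 + c for c in range(8)] for r in range(2)]  # 2 x 8 toy weight
four_way = reshard([full], target_mp_size=4)              # mp size 1 -> 4
two_way = reshard(four_way, target_mp_size=2)             # mp size 4 -> 2
print(len(four_way), len(four_way[0][0]))  # 4 2
```

Merging the shards of either partition recovers the original matrix, which is why the conversion can go in both directions.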
bash scripts/gpt2/eval/run_eval.sh /PATH_TO/LMOps/minillm
bash scripts/opt/eval/run_eval.sh /PATH_TO/LMOps/minillm
bash scripts/llama/eval/run_eval.sh /PATH_TO/LMOps/minillm
We provide example commands for GPT-2 models. Similar scripts for the other model families can be found in scripts/opt and scripts/llama. All our experiments are conducted on 16 × 32GB V100 GPUs; fewer GPUs suffice for small models.
Some large models require tensor parallel size = 4, which is set in the scripts with --model-parallel and --model-parallel-size options.
The final checkpoints are selected by the Rouge-L scores.
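As an illustration of selecting a checkpoint by Rouge-L, the sketch below scores the generations of several checkpoints against a reference and keeps the best one. It uses a simplified, self-contained Rouge-L F1 (whitespace tokens, no stemming) rather than the rouge-score package installed above, so absolute values may differ from the repository's evaluation; the checkpoint names and texts are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """Simplified Rouge-L F1: whitespace tokens, lowercased, no stemming."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
# Hypothetical checkpoint names and their generations for one validation prompt.
outputs = {
    "step_1000": "a dog sat",
    "step_2000": "the cat sat on a mat",
    "step_3000": "the cat",
}
scores = {name: rouge_l_f1(reference, text) for name, text in outputs.items()}
best = max(scores, key=scores.get)
print(best)  # step_2000
```

In practice the score would be averaged over the whole validation set before comparing checkpoints.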
bash scripts/gpt2/sft/sft_xlarge.sh /PATH_TO/LMOps/minillm
Fine-tuned teacher model:
bash scripts/gpt2/sft/sft_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_large.sh /PATH_TO/LMOps/minillm
The SFT models
bash scripts/gpt2/kd/kd_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/kd/kd_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/kd/kd_large.sh /PATH_TO/LMOps/minillm
The KD models
Generate and process responses with the teacher:
bash scripts/gpt2/tools/generate_data_seqkd.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/tools/process_pseudo_data_seqkd.sh /PATH_TO/LMOps/minillm
Fine-tune the model with SeqKD:
bash scripts/gpt2/seqkd/seqkd_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/seqkd/seqkd_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/seqkd/seqkd_large.sh /PATH_TO/LMOps/minillm
The SeqKD models
We first conduct SFT on base models to get a better initialization for the following RL-based MiniLLM training.
bash scripts/gpt2/sft/sft_base.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_medium.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/sft/sft_large.sh /PATH_TO/LMOps/minillm
For this initialization step, the final checkpoints are selected by the validation loss. The trained checkpoints:
For MiniLLM training, the final checkpoints are selected by the Rouge-L scores.
bash scripts/gpt2/minillm/train_base_xl.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/minillm/train_medium_xl.sh /PATH_TO/LMOps/minillm
bash scripts/gpt2/minillm/train_large_xl.sh /PATH_TO/LMOps/minillm
For the data we use:
- PROMPT_DATA_DIR is the SFT data ($\mathcal{D}$, Dolly), which is required.
- LM_DATA_DIR is the plain-text corpus ($\mathcal{D}_\text{PT}$), which is optional. See minillm/scripts/gpt2/minillm/train_base_xl_no_pt.sh for training without LM_DATA_DIR (by just commenting out the OPTS+=" --lm-data-dir ${LM_DATA_DIR}" line).
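The required/optional distinction between the two data directories can be sketched as an option builder. Only `--lm-data-dir` appears in the text above; the `--prompt-data-dir` flag name and the helper itself are assumptions made for illustration.

```python
def build_data_opts(prompt_data_dir, lm_data_dir=None):
    """Assemble the data-related options for a MiniLLM training run (sketch)."""
    if not prompt_data_dir:
        raise ValueError("PROMPT_DATA_DIR is required")
    # "--prompt-data-dir" is an assumed flag name, not confirmed by this README.
    opts = ["--prompt-data-dir", prompt_data_dir]
    if lm_data_dir:
        # Omitting this mirrors commenting out the OPTS+=" --lm-data-dir ..." line.
        opts += ["--lm-data-dir", lm_data_dir]
    return opts


with_pt = build_data_opts("PROMPT_DIR", "LM_DIR")
without_pt = build_data_opts("PROMPT_DIR")
print(" ".join(with_pt))
print(" ".join(without_pt))
```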
The MiniLLM models
Multi-Node training is launched by deepspeed. We provide an example script in scripts/llama/sft/sft_7B_mn.sh for multi-node training. Compared to single-node scripts, some of the DISTRIBUTED_ARGS are changed, and you need to specify a hostfile like configs/hostfiles/node_0_1 to tell the script which nodes to use. For more information, please refer to HuggingFace's tutorial.
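A DeepSpeed hostfile lists one node per line in the form `<hostname> slots=<number of GPUs>`. The sketch below parses such a file and counts the total number of GPU slots, which is a quick sanity check before launching. The node names are hypothetical, and the contents of configs/hostfiles/node_0_1 are not reproduced here.

```python
def parse_hostfile(text: str) -> dict[str, int]:
    """Parse DeepSpeed-style hostfile lines of the form '<hostname> slots=<n>'."""
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        name, slots = line.split()
        key, value = slots.split("=")
        if key != "slots":
            raise ValueError(f"unexpected field: {slots}")
        hosts[name] = int(value)
    return hosts


# Hypothetical two-node hostfile; real node names depend on your cluster.
example = """
node-0 slots=8
node-1 slots=8
"""
hosts = parse_hostfile(example)
total_gpus = sum(hosts.values())
print(hosts, total_gpus)
```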
@inproceedings{minillm,
title={MiniLLM: Knowledge Distillation of Large Language Models},
author={Gu, Yuxian and Dong, Li and Wei, Furu and Huang, Minlie},
booktitle={Proceedings of ICLR},
year={2024}
}
