This repository contains the implementation of B-cos LM, as introduced in the paper "B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability".
B-cos LM transforms pre-trained language models to improve explainability while maintaining task performance. Our implementation provides:
- B-cos versions of BERT, DistilBERT, RoBERTa, GPT-2 and Llama models
- Support for training B-cos and conventional models
- Evaluation of B-cos and various post-hoc explanation methods
The core implementations are in:
- `bcos_lm/models/` – Contains B-cos model architectures
- `bcos_lm/modules/` – Contains essential components for B-cos adaptation
B-cos adaptations in the code are marked with `## bcos` comments for clarity.
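For intuition, here is a minimal sketch of a B-cos linear layer following the published B-cos formulation; the actual modules in `bcos_lm/modules/` may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    """Minimal B-cos linear layer sketch (not the repository's exact code).

    Computes w_hat^T x rescaled by |cos(x, w_hat)|^(B-1), so outputs are
    down-weighted when the input is poorly aligned with the weight vector.
    """
    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.b = b

    def forward(self, x):
        w_hat = F.normalize(self.weight, dim=-1)   ## bcos: unit-norm weight rows
        out = F.linear(x, w_hat)                   # w_hat^T x = ||x|| * cos(x, w_hat)
        cos = out / (x.norm(dim=-1, keepdim=True) + 1e-6)
        return out * cos.abs().pow(self.b - 1)     ## bcos: |cos|^(B-1) scaling
```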
Our code requires `transformers==4.45.2`.
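For example:

```bash
pip install transformers==4.45.2
```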
To train a B-cos LM, run:

```bash
bash train_bcos_models.sh
```

You can specify:
- Model (e.g., BERT, DistilBERT, RoBERTa)
- Dataset
- Hyperparameters
Modify `train_bcos_models.sh` to customize these settings.
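For instance, the script may expose settings along these lines (the variable names and values below are illustrative, not the script's actual contents):

```bash
# Hypothetical excerpt of train_bcos_models.sh; check the script itself
# for the actual variable names and flags.
MODEL=bert-base-uncased   # or distilbert-base-uncased, roberta-base, ...
DATASET=ag_news           # downstream classification dataset
LR=2e-5                   # learning rate
EPOCHS=3                  # number of training epochs
```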
To generate explanations using B-cos and other explanation methods, run:

```bash
bash generate_explanations.sh
```

You can specify which explanation methods to use in the script.
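Conceptually, a B-cos LM computes its output as an input-dependent linear transform of the token embeddings, so token contributions can be read off with an input-times-gradient summary. Below is a minimal sketch under that assumption (the helper name is hypothetical, and the repository's own explanation code may differ):

```python
import torch

def bcos_token_attributions(model, embeddings, target_class):
    """Sketch: per-token contributions to a target logit, computed as
    input * gradient summed over embedding dimensions. Assumes `model`
    accepts inputs_embeds like a Hugging Face classifier."""
    embeddings = embeddings.detach().requires_grad_(True)
    logits = model(inputs_embeds=embeddings).logits
    logits[:, target_class].sum().backward()
    return (embeddings.grad * embeddings).sum(-1)  # shape: (batch, seq_len)
```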
To evaluate explanations using perturbation-based methods, run:

```bash
bash run_perturbation_evaluation.sh
```

To evaluate using the sequence pointing game (SeqPG):

- Generate SeqPG examples using conventional models:

  ```bash
  bash create_pointing_game_examples.sh
  ```

- Evaluate using SeqPG:

  ```bash
  bash run_pointing_game_evaluation.sh
  ```
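For reference, perturbation-based evaluation typically masks the most highly attributed tokens and measures the drop in the predicted class probability. A minimal sketch of one such metric (illustrative only, not the repository's implementation; `mask_id` is assumed to be the tokenizer's mask token, e.g. 103 for BERT):

```python
import torch

def comprehensiveness(model, input_ids, attributions, k=0.2, mask_id=103):
    """Mask the top-k fraction of tokens by attribution score and report
    how much the predicted class probability drops; larger drops suggest
    more faithful explanations."""
    with torch.no_grad():
        probs = model(input_ids).logits.softmax(-1)
        pred = probs.argmax(-1, keepdim=True)
        n_mask = max(1, int(k * input_ids.size(1)))
        top = attributions.topk(n_mask, dim=-1).indices
        masked = input_ids.clone()
        masked.scatter_(1, top, mask_id)
        masked_probs = model(masked).logits.softmax(-1)
        return (probs - masked_probs).gather(-1, pred)
```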
Decoder-only model experiments (GPT-2 and Llama) can be run by executing `decoder_only_model_experiments.sh` in the `decoder_experiments` directory:
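```bash
cd decoder_experiments
bash decoder_only_model_experiments.sh
```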