By Jeffrey Ouyang-Zhang, Chengyue Gong, Yue Zhao, Philipp Krähenbühl, Adam Klivans, Daniel J. Diaz
This repository is an official implementation of the paper Distilling Structural Representations into Protein Sequence Models.
TL;DR: ESM2 with enriched structural representations.
Each model URL below provides ISM in both Hugging Face and ESM2 (fair-esm) formats. These models can be used as drop-in replacements for ESM2 (see Quickstart). Most users should use the first model (ISM-650M-UC30PDB). The second model (ISM-650M-UC30) is for users who do not want a model trained on PDB (e.g., for benchmarking).
| Name | Layers | #params | Dataset | Model URL |
|---|---|---|---|---|
| ISM-650M-UC30PDB | 33 | 650M | Uniclust30 + PDB | https://huggingface.co/jozhang97/ism_t33_650M_uc30pdb |
| ISM-650M-UC30 | 33 | 650M | Uniclust30 | https://huggingface.co/jozhang97/ism_t33_650M_uc30 |
| ISM-3B-UC30 | 36 | 3B | Uniclust30 | https://huggingface.co/jozhang97/ism_t36_3B_uc30 |
| ISM-C-300M | 30 | 300M | Uniclust30 + PDB | https://huggingface.co/jozhang97/ismc-300m-2024-12 |
| ISM-C-600M | 36 | 600M | Uniclust30 + PDB | https://huggingface.co/jozhang97/ismc-600m-2024-12 |
This quickstart assumes that the user is already working with ESM2 and is interested in replacing ESM with ISM. First, download ISM.

```bash
# recommended
huggingface-cli download jozhang97/ism_t33_650M_uc30pdb --local-dir /path/to/save/ism

# alternative
git clone https://huggingface.co/jozhang97/ism_t33_650M_uc30pdb
```
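The download can also be scripted from Python with the `huggingface_hub` package; a minimal sketch (the target directory is a placeholder):

```python
# Sketch: download ISM programmatically (assumes huggingface_hub is installed).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="jozhang97/ism_t33_650M_uc30pdb",
    local_dir="/path/to/save/ism",  # placeholder path
)
print(f"Model files downloaded to {local_dir}")
```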
If the user is starting from fair-esm, add the following lines of code.

```python
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
ckpt = torch.load('/path/to/ism_t33_650M_uc30pdb/checkpoint.pth')
model.load_state_dict(ckpt)
```
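Since ISM keeps the ESM2 architecture, the standard fair-esm workflow for extracting per-residue embeddings should carry over unchanged. A minimal sketch, with an arbitrary example sequence:

```python
import torch
import esm

# Load the ESM2-650M architecture and swap in the ISM weights, as above.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.load_state_dict(torch.load('/path/to/ism_t33_650M_uc30pdb/checkpoint.pth'))
model.eval()

# Standard fair-esm batching: (label, sequence) pairs -> padded token tensor.
batch_converter = alphabet.get_batch_converter()
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, batch_tokens = batch_converter(data)

with torch.no_grad():
    out = model(batch_tokens, repr_layers=[33])

# Per-residue, structure-enriched embeddings from the final (33rd) layer.
embeddings = out["representations"][33]
print(embeddings.shape)  # (1, sequence length + 2 special tokens, 1280)
```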
If the user is starting from huggingface, replace the model and tokenizer with the following lines of code.

```python
from transformers import AutoTokenizer, AutoModel

config_path = "/path/to/ism_t33_650M_uc30pdb/"
model = AutoModel.from_pretrained(config_path)
tokenizer = AutoTokenizer.from_pretrained(config_path)
```

Please change /path/to/ism_t33_650M_uc30pdb to the path where the model is downloaded.
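For completeness, here is a minimal sketch of extracting per-residue embeddings through the Hugging Face interface; the sequence is an arbitrary example, and the checkpoint is assumed to load as an ESM-style encoder:

```python
import torch
from transformers import AutoTokenizer, AutoModel

config_path = "/path/to/ism_t33_650M_uc30pdb/"
tokenizer = AutoTokenizer.from_pretrained(config_path)
model = AutoModel.from_pretrained(config_path)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Drop the CLS/EOS positions added by the ESM-style tokenizer to keep one vector per residue.
per_residue = outputs.last_hidden_state[0, 1:-1]
print(per_residue.shape)  # (sequence length, hidden size)
```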
The following reproduction setup walks through how to structure-tune ISM, fine-tune it on a downstream task, and evaluate it. First, prepare the conda environment.
```bash
conda create -n ism python=3.10 -y
conda activate ism
pip install -r requirements.txt
```

ISM is initialized from ESM2 and fine-tuned on structural tokens. Download the dataset from here (131 GB uncompressed, 22 GB compressed).
If you are using SLURM, use the following command. Training takes roughly 1 day.

```bash
cd plm_train
python submitit_train.py --nodes 32 --ngpus 1 --dist_eval \
    --loss_func allmergedce_ce \
    --data_path /path/to/dataset \
    --job_dir logs/%j_ism
```

If you are training on an 8-GPU machine, the following training script corresponds to the above command (8 GPUs with 4 gradient-accumulation steps matches the 32 GPUs used above).
```bash
cd plm_train
torchrun --nproc_per_node=8 main_train.py --accum_iter 4 --dist_eval \
    --loss_func allmergedce_ce \
    --data_path /path/to/dataset \
    --output_dir logs/ism
```

Here, we show how to reproduce our performance on the secondary structure and binding residue datasets. The datasets are made available at plm_eval/data.
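To get a quick look at the downstream data before training, a small sketch using pandas (it only prints whatever column layout main_train.py expects):

```python
# Sketch: inspect one of the provided downstream CSVs (assumes pandas is installed).
import pandas as pd

df = pd.read_csv("plm_eval/data/secondary_structure/train.csv")
print(df.columns.tolist())
print(df.head())
```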
To retrain the models, use the following commands.
```bash
cd plm_eval

# binding residues
torchrun --nproc_per_node=8 main_train.py \
    --data_path data/binding_residue/development_set/train.csv \
    --eval_data_path data/binding_residue/development_set/test.csv \
    --freeze_at 33 --lr 1e-4 \
    --finetune_backbone /path/to/ism_t33_650M_uc30/checkpoint.pth \
    --output_dir logs/ism_binding_residue

# secondary structure
torchrun --nproc_per_node=4 main_train.py \
    --data_path data/secondary_structure/train.csv \
    --eval_data_path data/secondary_structure/test.csv \
    --freeze_at 33 \
    --finetune_backbone /path/to/ism_t33_650M_uc30/checkpoint.pth \
    --output_dir logs/ism_secondary_structure
```

To evaluate our models, add `--resume /path/to/ism_finetuned --eval` to the above commands. The model fine-tuned for secondary structure is available here and the model fine-tuned for binding residues here. Each model is 2.4 GB.
(Note that here we evaluate ISM structure-tuned only on AlphaFold structures in Uniclust30, i.e. ISM-650M-UC30, to avoid data leakage.)
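For intuition about what this downstream fine-tuning does (a frozen ISM backbone with a lightweight per-residue classification head, which --freeze_at 33 suggests for a 33-layer model), here is an illustrative sketch. It is not the repository's main_train.py; the head, the number of classes, and the sequence are placeholders:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

config_path = "/path/to/ism_t33_650M_uc30/"
tokenizer = AutoTokenizer.from_pretrained(config_path)
backbone = AutoModel.from_pretrained(config_path)

# Freeze the structure-tuned backbone; only the small head would be trained.
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 3  # placeholder, e.g. 3-state secondary structure
head = nn.Linear(backbone.config.hidden_size, num_classes)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = backbone(**inputs).last_hidden_state  # (1, length + 2, hidden size)

# Per-residue class logits (special tokens stripped); train the head with cross-entropy.
logits = head(hidden[:, 1:-1])
print(logits.shape)
```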
This project builds on ESM. Please refer to its original license for more details.
If you find ISM useful in your research, please consider citing:
```bibtex
@article{ouyangzhang2024distilling,
  title={Distilling Structural Representations into Protein Sequence Models},
  author={Ouyang-Zhang, Jeffrey and Gong, Chengyue and Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp and Klivans, Adam and Diaz, Daniel J},
  journal={bioRxiv},
  doi={10.1101/2024.11.08.622579},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```