This repository contains the source code and resources for the DeepDrug Protein Embeddings Bank (DPEB), a multimodal database of human protein embeddings designed to enhance protein-protein interaction modeling.
A preprint version of this work is available on arXiv:2510.22008.
The dataset can be accessed from this link: DPEB AWS S3 Bucket Link.
This repository provides the environment setup used in our experiments related to deep protein embeddings, graph neural networks, and AlphaFold-based representations.
We present DPEB, the first searchable database focused exclusively on human proteins. It covers 22,043 proteins, with sequences up to 1,975 amino acids, and integrates four distinct protein embedding types, each capturing a different biological modality. This resource combines structurally informed features (AlphaFold2 [18]), transformer-based sequence embeddings (BioEmbeddings [8]), contextual amino acid patterns (ESM-2: Evolutionary Scale Modeling [23]), and sequence-based n-gram statistics (ProtVec [3]), which together provide a multi-dimensional characterization of human proteins. By incorporating these complementary embedding approaches, our database captures protein properties at multiple biological levels, from primary sequence patterns to tertiary structural features, enabling a more holistic analysis than any single embedding method could achieve.
The novelty of DPEB extends beyond its human-specific focus and multimodal embeddings. Our platform enables researchers to cross-reference predictions across different embedding types, revealing insights that might be missed under a single representation strategy. The observed variation in predictive accuracy across embedding types suggests that each representation captures distinct and complementary aspects of protein function, structure, or sequence, a key advantage for modeling complex biological systems. For example, BioEmbedding achieved the highest AUROC (87.37%) in our experiments, while AlphaFold2 captured interactions potentially driven by structural similarity. These findings indicate that no single embedding type captures the full spectrum of protein properties, making multimodal integration essential rather than optional for comprehensive interactome modeling.
Unlike existing databases that lack species specificity or rely on limited embedding approaches, DPEB combines human-specific data from multiple established repositories, including UniProt [7], STRING [33], and IntAct [28], with our multimodal embedding framework. This comprehensive protein interaction resource integrates structural, sequential, evolutionary, and functional properties within a unified analysis platform.
Our database supports multiple graph-based neural network architectures, including GraphSage, Graph Convolutional Networks (GCNs), Graph Transformer Networks (GTNs), Graph Neural Networks (GNNs), and Graph Isomorphism Networks (GINs). This flexibility enables researchers to apply the most appropriate model for specific biological questions, further enhancing the utility and adaptability of our platform for diverse research needs in computational biology.
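For illustration, a minimal two-layer GCN in DGL (one of the frameworks included in the DPEB environment) might look as follows. This is a generic sketch, not the exact architecture used in our experiments:

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCN(nn.Module):
    """Two-layer GCN mapping node (protein) embeddings to class scores."""
    def __init__(self, in_feats, hidden_feats, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)
        self.conv2 = GraphConv(hidden_feats, num_classes)

    def forward(self, g, feat):
        # g: a DGLGraph; feat: node feature matrix of shape [num_nodes, in_feats]
        h = F.relu(self.conv1(g, feat))
        return self.conv2(g, h)
```

Analogous layer modules (e.g., `dgl.nn.SAGEConv` for GraphSAGE or `dgl.nn.GINConv` for GIN, each with its own constructor arguments) can be swapped in on the same graph.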
By providing researchers with this human-focused resource that leverages the complementary strengths of four distinct embedding approaches, DPEB aims to accelerate research in systems biology, drug discovery, and personalized medicine through more accurate and comprehensive protein interaction predictions than previously possible.
As shown in Figure 2, DPEB provides a flexible framework for constructing protein-protein interaction graphs, allowing users to select their preferred protein embeddings while incorporating established interaction data from external sources such as HuMap. In this approach, each protein functions as a node in the graph, with neighborhoods defined by relationships among proteins. Nodes carry features derived from amino acid sequences and structural properties, while edges encode information about residue interactions drawn from known PPI databases. The resulting network is a mathematical representation of known and predicted protein-protein contacts, which helps improve the accuracy of interaction prediction.
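A minimal construction sketch in Python with DGL is shown below. The protein IDs, edge list, and random feature matrix are hypothetical placeholders; in practice the features come from DPEB embeddings and the edges from a PPI source such as HuMap, STRING, or IntAct:

```python
import dgl
import numpy as np
import torch

# Hypothetical inputs: UniProt IDs and 384-dim aggregated embeddings
# (random placeholders here; in practice, load them from DPEB).
protein_ids = ["Q9Y6X2", "P12345", "X6RFL8"]
embeddings = np.random.rand(len(protein_ids), 384).astype(np.float32)

# Known interactions as UniProt ID pairs (placeholder edge list).
pairs = [("Q9Y6X2", "P12345"), ("P12345", "X6RFL8")]
idx = {pid: i for i, pid in enumerate(protein_ids)}
src = [idx[a] for a, b in pairs]
dst = [idx[b] for a, b in pairs]

# Undirected PPI graph: proteins as nodes, interactions as edges.
g = dgl.graph((src + dst, dst + src), num_nodes=len(protein_ids))
g = dgl.add_self_loop(g)  # avoids zero-in-degree issues in GCN layers
g.ndata["feat"] = torch.from_numpy(embeddings)
```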
We host the DeepDrug Protein Embeddings Bank (DPEB) on the Amazon Web Services (AWS) Open Data Program. This public repository provides curated, multimodal protein embeddings, including AlphaFold2 structural vectors and ESM-2, ProtVec, and BioEmbeddings sequence representations. By making DPEB openly accessible via AWS, we aim to facilitate reproducibility and promote downstream discovery in drug development and systems biology. The dataset can be accessed from this DPEB AWS S3 Bucket Link.
The data repository contains four main subdirectories under the deepdrug-dpeb/ directory, each corresponding to a different protein embedding type.
The data directory includes:
```
deepdrug-dpeb/
│
├── dpeb_aggreagated_embeddings_all_in_one.csv
│
├── AlphaFold-2/
│   ├── All_ePPI_Alphafold2_Embeddings_np_v1.3.rar
│   │   ├── Q9Y6X2.npy
│   │   ├── P12345.npy
│   │   └── ...
│   └── eppi_alphafold_aggregated_embeddings.csv
│
├── ESM-2/
│   ├── esm2_dict_embeddings.rar
│   │   ├── Q9Y6X2.npy
│   │   ├── P12345.npy
│   │   └── ...
│   └── ProteinID_proteinSEQ_ESM_emb.csv
│
├── ProtVec/
│   ├── protvec_dict_embeddings.rar
│   │   ├── Q9Y6X2.npy
│   │   ├── P12345.npy
│   │   └── ...
│   └── protvec_aggregated_embeddings.csv
│
└── BioEmbedding/
    ├── All_ePPI_Bio_Embeddings_np.rar
    │   ├── Q9Y6X2.npy
    │   ├── P12345.npy
    │   └── ...
    └── bio_embeddings_ePPI.csv
```
Each `.npy` file inside a `.rar` archive corresponds to one protein and contains its embedding matrix or vector:
- AlphaFold2: `[L × 384]` structure-informed residue embeddings
- ESM-2: `[L × 1280]` or `[L × 2560]` contextualized transformer embeddings
- ProtVec: `[100]` pooled trigram-based sequence vector
- BioEmbeddings: `[L × 1024]` embeddings from language models such as SeqVec or ProtBert
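The length-dependent `[L × D]` matrices can be reduced to fixed-size vectors, e.g., by averaging across residues as in the aggregated CSV files. A minimal NumPy sketch with a random placeholder matrix:

```python
import numpy as np

# Placeholder for a per-residue matrix, e.g., an AlphaFold2 [L x 384] embedding.
per_residue = np.random.rand(228, 384)

# Residue-wise mean pooling: average over the length dimension.
pooled = per_residue.mean(axis=0)
print(pooled.shape)  # (228, 384) -> (384,)
```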
The .csv metadata files contain UniProt IDs, amino acid sequences, and optionally precomputed averaged embeddings for fast access.
File descriptions:
- `.rar` files: archives containing individual `.npy` embedding files for each protein.
- `.csv` files: metadata files with UniProt IDs, amino acid sequences, or pre-aggregated embeddings.
- `dpeb_aggreagated_embeddings_all_in_one.csv`: combined metadata and aggregated embeddings for all proteins.
User suggestions: when to use which file

- Use the `.rar` archives if you need individual per-protein embeddings and want to analyze or model proteins separately (e.g., for custom downstream tasks or when working with raw embedding matrices).
- Use the `.csv` metadata files in each directory if you want aggregated embeddings (e.g., averaged across residues) and quick access to UniProt IDs and sequences, typically for fast prototyping, graph construction, or ML tasks that do not require per-residue information.
- Use `dpeb_aggreagated_embeddings_all_in_one.csv` if you want a single file containing aggregated embeddings of all types for all proteins (see the loading sketch below). This is recommended for benchmarking, tabular machine learning, or analyses requiring a unified multimodal representation for every protein. The file provides separate columns for each embedding type, so you can select, combine, or compare AlphaFold2, BioEmbedding, ESM-2, and ProtVec embeddings in one place.
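A minimal loading sketch with pandas; the column prefix used to select a modality is hypothetical, so inspect `df.columns` for the actual schema:

```python
import pandas as pd

# Load the combined table of aggregated embeddings for all proteins.
df = pd.read_csv("dpeb_aggreagated_embeddings_all_in_one.csv")
print(df.columns[:10])  # check the actual column names

# Example: select one modality's columns by a hypothetical prefix.
protvec_cols = [c for c in df.columns if c.startswith("protvec_")]
X = df[protvec_cols].to_numpy()
```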
Dataset Directory Size Table
Below is a summary table showing the size of each main embedding directory and its key files:
| Directory | File(s) | Size |
|-------------------|--------------------------------------------|-------------|
| AlphaFold-2/      | All_ePPI_Alphafold2_Embeddings_np_v1.3.rar | 14.857 GB   |
| | eppi_alphafold_aggregated_embeddings.csv | 171.96 MB |
| BioEmbedding/ | All_ePPI_Bio_Embeddings_np.rar | 45.6364 GB |
| | bio_embeddings_ePPI.csv | 501.52 MB |
| ESM-2/ | esm2_dict_embeddings.rar | 49.3481 GB |
| | ProteinID_proteinSEQ_ESM_emb.csv | 622.86 MB |
| ProtVec/ | protvec_dict_embeddings.rar | 3.8175 GB |
| | protvec_aggregated_embeddings.csv | 90.46 MB |
| *root directory* | dpeb_aggreagated_embeddings_all_in_one.csv | 1.2749 GB |
DPEB can also be accessed directly using the AWS Command Line Interface (CLI).
No credentials are required since the dataset is hosted under the AWS Open Data Program.
1. Install AWS CLI (Linux example)

```
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```

2. Download example
To download any file, for example the aggregated ProtVec embeddings, into your current directory, run:

```
aws s3 cp s3://deepdrug-dpeb/ProtVec/protvec_aggregated_embeddings.csv . --no-sign-request
```

You can replace the object path with any file or folder under the S3 bucket path deepdrug-dpeb/ to access other parts of the dataset (e.g., AlphaFold-2, BioEmbedding, ESM-2).
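The same files can also be fetched programmatically. A minimal sketch using boto3 with an unsigned (anonymous) client, which works because the bucket is public; depending on the bucket's region you may also need to pass `region_name`:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: no credentials needed for AWS Open Data buckets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Download one object (bucket, key, local filename).
s3.download_file(
    "deepdrug-dpeb",
    "ProtVec/protvec_aggregated_embeddings.csv",
    "protvec_aggregated_embeddings.csv",
)
```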
A Python Colab notebook is provided to help users easily download data from the AWS DeepDrug Protein Embeddings Bank (DPEB) S3 bucket (deepdrug-dpeb).
The Colab notebook and download instructions can be found at this link: DPEB download
To reproduce the results or run any experiments in this repository, use the provided Conda environment file to set up your environment. The environment is named DPEB and includes dependencies such as PyTorch, DGL (CUDA 10.2), scikit-learn, transformers, and AlphaFold-related utilities.
You can recreate the DPEB environment using the provided env.yml file:
```
conda env create -f env.yml
conda activate DPEB
```

Each `.npy` file inside the `.rar` archives contains the embedding and metadata for a single protein, stored as a Python dictionary. These files can be loaded and inspected using NumPy:
```python
import numpy as np

# Path to an example AlphaFold2 embedding file
file_path = "/data/saiful/ePPI/alphafold_eppi_embeddings/All_ePPI_Alphafold2_Embeddings_np_v1.3/X6RFL8_embedding.npy"

# Load the file (set allow_pickle=True to load Python objects)
embedding = np.load(file_path, allow_pickle=True)

# The stored object is a Python dictionary, so extract it using .item()
content = embedding.item()

# Inspect the structure
print("Extracted object type:", type(content))
print("Protein ID:", content['protein_id'])
print("FASTA sequence:", content['fasta'][:60], "...")  # Preview the sequence
print("Embedding shape:", content['embedding'].shape)
```

Each file contains a Python dictionary with the following keys:
- `protein_id`: a UniProt-style identifier for the protein. Example: `"X6RFL8"`
- `fasta`: the amino acid sequence of the protein in FASTA format. Example: `"MATAPYNYSYIFKYIIIGDMGVGKSCLLHQFTEKKFMADCPHTI..."`
- `embedding`: a NumPy array of shape `[L × D]`, where `L` is the number of amino acids (the length of the protein) and `D` is the embedding dimension, e.g., `384` for AlphaFold2, `1024` for BioEmbeddings.
This array contains per-residue embeddings suitable for structural or sequence-based modeling tasks such as classification or graph-based learning.
Example output:

```
Extracted object type: <class 'dict'>
Protein ID: X6RFL8
FASTA sequence: MATAPYNYSYIFKYIIIGDMGVGKSCLLHQFTEKKFMADCPHTIGVEFGT ...
Embedding shape: (228, 384)
```

These embeddings can be directly used as input features for deep learning models in tasks such as:
- Graph-based protein–protein interaction prediction
- Enzyme vs. non-enzyme classification
- Protein function and family clustering
- Any downstream computational biology pipeline
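As a concrete example of the enzyme classification task above, aggregated embeddings can feed a standard classifier. A minimal sketch with scikit-learn (included in the DPEB environment); the feature matrix and labels below are random placeholders, in practice taken from the aggregated CSVs and the enzyme annotations in the Supporting Data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data: X would come from aggregated embeddings,
# y from enzyme/non-enzyme annotations.
X = np.random.rand(500, 384)
y = np.random.randint(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```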
For a step-by-step guide on how to use the DeepDrug Protein Embeddings Bank (DPEB) and reproduce key experiments, please refer to the tutorial section of this repository:
This tutorial includes example scripts, usage instructions, and recommendations for running protein embedding pipelines using AlphaFold2, ESM-2, ProtVec, and BioEmbeddings.
In addition to the main DPEB embeddings hosted on the AWS Open Data Program, we provide a set of Supporting Data files to facilitate reproducibility of the analyses described in our paper. These include protein family labels, enzyme annotations, FASTA sequences, and protein–protein interaction data.
Download Supporting Data (Box Link)
Detailed file descriptions can be found in the `Supporting Data/readme.txt` file.
The reference numbers cited throughout the text (e.g., [3], [8], [18], [23], [28], [33]) correspond to the full bibliographic entries listed in the following document:
- Code: MIT
- Data: Creative Commons Attribution 4.0 International (CC BY 4.0)
