This repository contains the source code and resources for the DeepDrug Protein Embeddings Bank (DPEB), a multimodal database of human protein embeddings designed to enhance protein-protein interaction modeling.
A preprint version of this work is available on arXiv:2510.22008.
The dataset can be accessed from this link: DPEB AWS S3 Bucket Link.
This repository provides the environment setup used in our experiments related to deep protein embeddings, graph neural networks, and AlphaFold-based representations.
We present DPEB, the first searchable database focused exclusively on human proteins. It covers 22,043 proteins, with sequences up to 1,975 amino acids, and integrates four distinct protein embedding types, each capturing a different biological modality. This resource combines structurally informed features (AlphaFold2 [18]), transformer-based sequence embeddings (BioEmbeddings [8]), contextual amino acid patterns (ESM-2: Evolutionary Scale Modeling [23]), and sequence-based n-gram statistics (ProtVec [3]), which together provide a multi-dimensional characterization of human proteins. By incorporating these complementary embedding approaches, our database captures protein properties at multiple biological levels, from primary sequence patterns to tertiary structural features, enabling a more holistic analysis than any single embedding method could achieve.
The novelty of DPEB extends beyond its human-specific focus and multimodal embeddings. Our platform enables researchers to cross-reference predictions across different embedding types, revealing insights that might be missed under a single representation strategy. The observed variation in predictive accuracy across embedding types suggests that each representation captures distinct and complementary aspects of protein function, structure, or sequence, a key advantage for modeling complex biological systems. For example, BioEmbedding achieved the highest AUROC (87.37%) in our experiments, while AlphaFold2 captured interactions potentially driven by structural similarity. These findings indicate that no single embedding type captures the full spectrum of protein properties, making multimodal integration essential rather than optional for comprehensive interactome modeling.
Unlike existing databases that lack species specificity or rely on limited embedding approaches, DPEB combines human-specific data from multiple established repositories, including UniProt [7], STRING [33], and IntAct [28], with our multimodal embedding framework. This comprehensive protein interaction resource integrates structural, sequential, evolutionary, and functional properties within a unified analysis platform.
Our database supports multiple graph-based neural network architectures, including GraphSage, Graph Convolutional Networks (GCNs), Graph Transformer Networks (GTNs), Graph Neural Networks (GNNs), and Graph Isomorphism Networks (GINs). This flexibility enables researchers to apply the most appropriate model for specific biological questions, further enhancing the utility and adaptability of our platform for diverse research needs in computational biology.
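For illustration, a minimal two-layer GCN in DGL (one of the frameworks included in the DPEB environment) might look as follows. This is a generic sketch, not the exact architecture used in our experiments:

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCN(nn.Module):
    """Two-layer GCN mapping node (protein) embeddings to class scores."""
    def __init__(self, in_feats, hidden_feats, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)
        self.conv2 = GraphConv(hidden_feats, num_classes)

    def forward(self, g, feat):
        # g: a DGLGraph; feat: node feature matrix of shape [num_nodes, in_feats]
        h = F.relu(self.conv1(g, feat))
        return self.conv2(g, h)
```

Analogous layer modules (e.g., `dgl.nn.SAGEConv` for GraphSAGE or `dgl.nn.GINConv` for GIN, each with its own constructor arguments) can be swapped in on the same graph.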
By providing researchers with this human-focused resource that leverages the complementary strengths of four distinct embedding approaches, DPEB aims to accelerate research in systems biology, drug discovery, and personalized medicine through more accurate and comprehensive protein interaction predictions than previously possible.
As shown in Figure 2, DPEB provides a flexible framework for constructing protein-protein interaction graphs, allowing users to select their preferred protein embeddings while incorporating established interaction data from external sources such as HuMap. In this approach, each protein functions as a node in the graph, with neighborhoods defined by relationships among proteins. Nodes carry features derived from amino acid sequences and structural properties, while edges encode information about residue interactions drawn from known PPI databases. The resulting network is a mathematical representation of known and predicted protein-protein contacts, which helps improve the accuracy of interaction prediction.
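A minimal construction sketch in Python with DGL is shown below. The protein IDs, edge list, and random feature matrix are hypothetical placeholders; in practice the features come from DPEB embeddings and the edges from a PPI source such as HuMap, STRING, or IntAct:

```python
import dgl
import numpy as np
import torch

# Hypothetical inputs: UniProt IDs and 384-dim aggregated embeddings
# (random placeholders here; in practice, load them from DPEB).
protein_ids = ["Q9Y6X2", "P12345", "X6RFL8"]
embeddings = np.random.rand(len(protein_ids), 384).astype(np.float32)

# Known interactions as UniProt ID pairs (placeholder edge list).
pairs = [("Q9Y6X2", "P12345"), ("P12345", "X6RFL8")]
idx = {pid: i for i, pid in enumerate(protein_ids)}
src = [idx[a] for a, b in pairs]
dst = [idx[b] for a, b in pairs]

# Undirected PPI graph: proteins as nodes, interactions as edges.
g = dgl.graph((src + dst, dst + src), num_nodes=len(protein_ids))
g = dgl.add_self_loop(g)  # avoids zero-in-degree issues in GCN layers
g.ndata["feat"] = torch.from_numpy(embeddings)
```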
We host the DeepDrug Protein Embeddings Bank (DPEB) on the Amazon Web Services (AWS) Open Data Program. This public repository provides curated, multimodal protein embeddings, including AlphaFold2 structural vectors and ESM-2, ProtVec, and BioEmbeddings sequence representations. By making DPEB openly accessible via AWS, we aim to facilitate reproducibility and promote downstream discovery in drug development and systems biology. The dataset can be accessed from this DPEB AWS S3 Bucket Link.
The data repository contains four main subdirectories under the deepdrug-dpeb/ directory, each corresponding to a different protein embedding type.
The data directory includes:
```
deepdrug-dpeb/
│
├── dpeb_aggreagated_embeddings_all_in_one.csv
│
├── AlphaFold-2/
│   ├── All_ePPI_Alphafold2_Embeddings_np_v1.3.rar
│   │   ├── Q9Y6X2.npy
│   │   ├── P12345.npy
│   │   └── ...
│   └── eppi_alphafold_aggregated_embeddings.csv
│
├── ESM-2/
│   ├── esm2_dict_embeddings.rar
│   │   ├── Q9Y6X2.npy
│   │   ├── P12345.npy
│   │   └── ...
│   └── ProteinID_proteinSEQ_ESM_emb.csv
│
├── ProtVec/
│   ├── protvec_dict_embeddings.rar
│   │   ├── Q9Y6X2.npy
│   │   ├── P12345.npy
│   │   └── ...
│   └── protvec_aggregated_embeddings.csv
│
└── BioEmbedding/
    ├── All_ePPI_Bio_Embeddings_np.rar
    │   ├── Q9Y6X2.npy
    │   ├── P12345.npy
    │   └── ...
    └── bio_embeddings_ePPI.csv
```
Each `.npy` file inside a `.rar` archive corresponds to one protein and contains its embedding matrix or vector:
- AlphaFold2: `[L × 384]` structure-informed residue embeddings
- ESM-2: `[L × 1280]` or `[L × 2560]` contextualized transformer embeddings
- ProtVec: `[100]` pooled trigram-based sequence vector
- BioEmbeddings: `[L × 1024]` embeddings from language models such as SeqVec or ProtBert
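The length-dependent `[L × D]` matrices can be reduced to fixed-size vectors, e.g., by averaging across residues as in the aggregated CSV files. A minimal NumPy sketch with a random placeholder matrix:

```python
import numpy as np

# Placeholder for a per-residue matrix, e.g., an AlphaFold2 [L x 384] embedding.
per_residue = np.random.rand(228, 384)

# Residue-wise mean pooling: average over the length dimension.
pooled = per_residue.mean(axis=0)
print(pooled.shape)  # (228, 384) -> (384,)
```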
The .csv metadata files contain UniProt IDs, amino acid sequences, and optionally precomputed averaged embeddings for fast access.
File descriptions:
- `.rar` files: archives containing individual `.npy` embedding files for each protein.
- `.csv` files: metadata files with UniProt IDs, amino acid sequences, or pre-aggregated embeddings.
- `dpeb_aggreagated_embeddings_all_in_one.csv`: combined metadata and aggregated embeddings for all proteins.
User suggestions: when to use which file

- Use the `.rar` archives if you need individual per-protein embeddings and want to analyze or model proteins separately (e.g., for custom downstream tasks or when working with raw embedding matrices).
- Use the `.csv` metadata files in each directory if you want aggregated embeddings (e.g., averaged across residues) and quick access to UniProt IDs and sequences, typically for fast prototyping, graph construction, or ML tasks that do not require per-residue information.
- Use `dpeb_aggreagated_embeddings_all_in_one.csv` if you want a single file containing aggregated embeddings of all types for all proteins (see the loading sketch below). This is recommended for benchmarking, tabular machine learning, or analyses requiring a unified multimodal representation for every protein. The file provides separate columns for each embedding type, so you can select, combine, or compare AlphaFold2, BioEmbedding, ESM-2, and ProtVec embeddings in one place.
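A minimal loading sketch with pandas; the column prefix used to select a modality is hypothetical, so inspect `df.columns` for the actual schema:

```python
import pandas as pd

# Load the combined table of aggregated embeddings for all proteins.
df = pd.read_csv("dpeb_aggreagated_embeddings_all_in_one.csv")
print(df.columns[:10])  # check the actual column names

# Example: select one modality's columns by a hypothetical prefix.
protvec_cols = [c for c in df.columns if c.startswith("protvec_")]
X = df[protvec_cols].to_numpy()
```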
Dataset Directory Size Table
Below is a summary table showing the size of each main embedding directory and its key files:
| Directory | File(s) | Size |
|-------------------|--------------------------------------------|-------------|
| AlphaFold-2/      | All_ePPI_Alphafold2_Embeddings_np_v1.3.rar | 14.857 GB   |
| | eppi_alphafold_aggregated_embeddings.csv | 171.96 MB |
| BioEmbedding/ | All_ePPI_Bio_Embeddings_np.rar | 45.6364 GB |
| | bio_embeddings_ePPI.csv | 501.52 MB |
| ESM-2/ | esm2_dict_embeddings.rar | 49.3481 GB |
| | ProteinID_proteinSEQ_ESM_emb.csv | 622.86 MB |
| ProtVec/ | protvec_dict_embeddings.rar | 3.8175 GB |
| | protvec_aggregated_embeddings.csv | 90.46 MB |
| *root directory* | dpeb_aggreagated_embeddings_all_in_one.csv | 1.2749 GB |
DPEB can also be accessed directly using the AWS Command Line Interface (CLI).
No credentials are required since the dataset is hosted under the AWS Open Data Program.
1. Install AWS CLI (Linux example)

```
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```

2. Download example
To download any file, for example the aggregated ProtVec embeddings, into your current directory, run:

```
aws s3 cp s3://deepdrug-dpeb/ProtVec/protvec_aggregated_embeddings.csv . --no-sign-request
```

You can replace the object path with any file or folder under the S3 bucket path deepdrug-dpeb/ to access other parts of the dataset (e.g., AlphaFold-2, BioEmbedding, ESM-2).
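The same files can also be fetched programmatically. A minimal sketch using boto3 with an unsigned (anonymous) client, which works because the bucket is public; depending on the bucket's region you may also need to pass `region_name`:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: no credentials needed for AWS Open Data buckets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Download one object (bucket, key, local filename).
s3.download_file(
    "deepdrug-dpeb",
    "ProtVec/protvec_aggregated_embeddings.csv",
    "protvec_aggregated_embeddings.csv",
)
```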
A Python Colab notebook is provided to help users easily download data from the AWS DeepDrug Protein Embeddings Bank (DPEB) S3 bucket (deepdrug-dpeb).
The Colab notebook and download instructions can be found at this link: DPEB download
To reproduce the results or run any experiments in this repository, use the provided Conda environment file to set up your environment. The environment is named DPEB and includes dependencies such as PyTorch, DGL (CUDA 10.2), scikit-learn, transformers, and AlphaFold-related utilities.
You can recreate the DPEB environment using the provided env.yml file:
```
conda env create -f env.yml
conda activate DPEB
```

Each `.npy` file inside the `.rar` archives contains the embedding and metadata for a single protein, stored as a Python dictionary. These files can be loaded and inspected using NumPy:
```python
import numpy as np

# Path to an example AlphaFold2 embedding file
file_path = "/data/saiful/ePPI/alphafold_eppi_embeddings/All_ePPI_Alphafold2_Embeddings_np_v1.3/X6RFL8_embedding.npy"

# Load the file (set allow_pickle=True to load Python objects)
embedding = np.load(file_path, allow_pickle=True)

# The stored object is a Python dictionary, so extract it using .item()
content = embedding.item()

# Inspect the structure
print("Extracted object type:", type(content))
print("Protein ID:", content['protein_id'])
print("FASTA sequence:", content['fasta'][:60], "...")  # Preview the sequence
print("Embedding shape:", content['embedding'].shape)
```

Each file contains a Python dictionary with the following keys:
- `protein_id`: a UniProt-style identifier for the protein. Example: `"X6RFL8"`
- `fasta`: the amino acid sequence of the protein in FASTA format. Example: `"MATAPYNYSYIFKYIIIGDMGVGKSCLLHQFTEKKFMADCPHTI..."`
- `embedding`: a NumPy array of shape `[L × D]`, where `L` is the number of amino acids (the length of the protein) and `D` is the embedding dimension, e.g., `384` for AlphaFold2, `1024` for BioEmbeddings.
This array contains per-residue embeddings suitable for structural or sequence-based modeling tasks such as classification or graph-based learning.
Example output:

```
Extracted object type: <class 'dict'>
Protein ID: X6RFL8
FASTA sequence: MATAPYNYSYIFKYIIIGDMGVGKSCLLHQFTEKKFMADCPHTIGVEFGT ...
Embedding shape: (228, 384)
```

These embeddings can be directly used as input features for deep learning models in tasks such as:
- Graph-based protein–protein interaction prediction
- Enzyme vs. non-enzyme classification
- Protein function and family clustering
- Any downstream computational biology pipeline
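As a concrete example of the enzyme classification task above, aggregated embeddings can feed a standard classifier. A minimal sketch with scikit-learn (included in the DPEB environment); the feature matrix and labels below are random placeholders, in practice taken from the aggregated CSVs and the enzyme annotations in the Supporting Data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data: X would come from aggregated embeddings,
# y from enzyme/non-enzyme annotations.
X = np.random.rand(500, 384)
y = np.random.randint(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```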
For a step-by-step guide on how to use the DeepDrug Protein Embeddings Bank (DPEB) and reproduce key experiments, please refer to the tutorial section of this repository:
This tutorial includes example scripts, usage instructions, and recommendations for running protein embedding pipelines using AlphaFold2, ESM-2, ProtVec, and BioEmbeddings.
In addition to the main DPEB embeddings hosted on the AWS Open Data Program, we provide a set of Supporting Data files to facilitate reproducibility of the analyses described in our paper. These include protein family labels, enzyme annotations, FASTA sequences, and protein–protein interaction data.
Download Supporting Data (Box Link)
Detailed file descriptions can be found in the `Supporting Data/readme.txt` file.
The reference numbers cited throughout the text (e.g., [3], [8], [18], [23], [28], [33]) correspond to the full bibliographic entries listed in the following document:
- Code: MIT
- Data: Creative Commons Attribution 4.0 International (CC BY 4.0)
