    __  __           _
   / / / /___  _____(_)___  __  ______
  / /_/ / __ \/ ___/ /_  / / / / / __ \
 / __  / /_/ / /  / / / /_/ /_/ / / / /
/_/ /_/\____/_/  /_/ /___/\__, /_/ /_/
                         /____/
Official implementation of the Horizyn model for matching reactions and enzymes. Register here to query Horizyn interactively or via the API.
Horizyn is a dual-encoder contrastive learning model that learns to match enzymatic reactions with their catalyzing proteins. The model uses:
- Reaction Encoder: Concatenated RDKit+ (structural) and DRFP fingerprints → MLP
- Protein Encoder: Pre-computed T5 embeddings → MLP
- Loss: Maximum Likelihood Noise Contrastive Estimation (MLNCE)
- Embeddings: 512-dimensional normalized outputs for both encoders
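Because both encoders emit normalized embeddings in the same space, matching a reaction to candidate enzymes reduces to ranking proteins by dot product (cosine similarity). The following is a minimal sketch of that ranking step only, using toy 3-dimensional vectors in place of the 512-dimensional outputs; the function names `l2_normalize` and `rank_proteins` and the protein IDs are hypothetical and are not taken from the Horizyn package.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_proteins(reaction_emb, protein_embs, top_k=3):
    """Return the top_k (protein_id, score) pairs, highest similarity first.

    Hypothetical helper: reaction_emb is one embedding, protein_embs maps
    protein IDs to embeddings of the same dimension.
    """
    query = l2_normalize(reaction_emb)
    scored = []
    for protein_id, emb in protein_embs.items():
        target = l2_normalize(emb)
        score = sum(a * b for a, b in zip(query, target))
        scored.append((protein_id, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data standing in for encoder outputs (assumed values, not real embeddings).
reaction = [1.0, 0.0, 0.0]
proteins = {
    "P1": [0.9, 0.1, 0.0],
    "P2": [0.0, 1.0, 0.0],
    "P3": [-1.0, 0.0, 0.0],
}
ranked = rank_proteins(reaction, proteins, top_k=2)
for protein_id, score in ranked:
    print(f"{protein_id}\t{score:.4f}")
```

With ~216K proteins, a real implementation would batch this as a single matrix product rather than a Python loop; the loop is kept here only to make the scoring rule explicit.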
Install dependencies with UV (recommended):
uv sync
Or with pip:
pip install -e .
Download the training dataset and protein embeddings (~1GB). This provides the ~216K pre-computed ProtT5-XL protein embeddings needed by both evaluation and prediction:
uv run python scripts/download_training_data.py
Download both official checkpoints (~201MB each, ~402MB total):
uv run python scripts/download_checkpoint.py
This downloads two checkpoints:
- horizyn_v1_0_dev.ckpt: trained on the train split only (paper-faithful); use for evaluation
- horizyn_v1_0_inf.ckpt: trained on full data; use for prediction
To download only one: uv run python scripts/download_checkpoint.py --only dev
Evaluate the dev checkpoint on the test set (requires both the dataset and dev checkpoint above):
uv run python scripts/evaluate.py
The evaluation script computes retrieval metrics (Top-K hit rates, MRR) on the held-out test set. Expected: top-1 ≈ 32.4%.
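For readers unfamiliar with these retrieval metrics, here is a minimal sketch of how a Top-K hit rate and mean reciprocal rank (MRR) are commonly computed from ranked candidate lists. It assumes a query counts as a hit when any of its true enzymes appears in the top K, and that MRR uses the rank of the first true enzyme; the exact definitions in metrics.py may differ, and all names and data below are hypothetical.

```python
def rank_of_first_hit(ranked_ids, true_ids):
    """1-based rank of the first correct protein, or None if none is retrieved."""
    for rank, protein_id in enumerate(ranked_ids, start=1):
        if protein_id in true_ids:
            return rank
    return None


def top_k_hit_rate(rankings, truths, k):
    """Fraction of queries with at least one true protein in the top k."""
    hits = 0
    for ranked_ids, true_ids in zip(rankings, truths):
        rank = rank_of_first_hit(ranked_ids, true_ids)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank of the first true protein (0 when it is missing)."""
    total = 0.0
    for ranked_ids, true_ids in zip(rankings, truths):
        rank = rank_of_first_hit(ranked_ids, true_ids)
        if rank is not None:
            total += 1.0 / rank
    return total / len(rankings)


# Three toy queries whose true enzyme "A" is ranked 1st, 2nd, and 3rd.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truths = [{"A"}, {"A"}, {"A"}]

top1 = top_k_hit_rate(rankings, truths, k=1)
top2 = top_k_hit_rate(rankings, truths, k=2)
mrr = mean_reciprocal_rank(rankings, truths)
print(f"top-1={top1:.3f} top-2={top2:.3f} MRR={mrr:.3f}")
```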
Find the most likely catalyzing enzymes for a reaction SMILES, using the inference checkpoint against the bundled ~216K protein embeddings (requires both the dataset and inf checkpoint above):
# Example: adenosine 3',5'-bisphosphate + H2O -> AMP + phosphate
uv run python scripts/predict.py "NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])[O-])[C@@H](OP(=O)([O-])[O-])[C@H]1O.[H]O[H]>>NC1=NC=NC2=C1N=CN2[C@@H]1O[C@H](COP(=O)([O-])[O-])[C@@H](O)[C@H]1O.O=P([O-])([O-])O" --top-k 10
Use --bidirectional to score both forward and reverse reaction directions (averaged):
uv run python scripts/predict.py "SMILES>>SMILES" --bidirectional --top-k 20
Train the SOTA model from scratch (requires ~16GB RAM, single GPU with 16GB+ VRAM):
uv run python train.py --config configs/sota.yaml
Run the test suite:
uv run pytest
Requirements:
- RAM: 8GB minimum (4GB for data loaded entirely in memory)
- GPU: Single NVIDIA GPU with 16GB+ VRAM (e.g., T4, A10G, V100)
- Disk: 20GB free space for dataset and checkpoints
- Platform: Linux x86_64 with CUDA 12.1
horizyn/
├── horizyn/                          # Main package
│   ├── model.py                      # DualContrastiveModel, MLP
│   ├── lightning_module.py           # Training loop logic
│   ├── data_module.py                # Data loading orchestration
│   ├── config.py                     # Configuration management
│   ├── losses.py                     # MLNCE loss function
│   ├── metrics.py                    # Retrieval metrics
│   ├── datasets/                     # Dataset classes
│   │   ├── base.py                   # Base dataset abstractions
│   │   ├── collection.py             # Dataset composition utilities
│   │   ├── csv.py                    # CSV dataset loader
│   │   ├── hdf5.py                   # HDF5 embedding loader
│   │   ├── transform.py              # Data transformations
│   │   └── fingerprints/             # Chemical fingerprint generation
│   │       ├── base.py               # Fingerprint base class
│   │       ├── rdkit_plus.py         # RDKit structural fingerprints
│   │       └── drfp.py               # Differential reaction fingerprints
│   ├── chemistry/                    # Chemistry utilities
│   │   └── standardizer.py           # SMILES standardization
│   └── utils/                        # Utility functions
│       ├── cache.py                  # In-memory caching
│       └── collate.py                # Batch collation
├── configs/                          # Training configurations
│   ├── sota.yaml                     # SOTA configuration
│   └── nano.yaml                     # Small test configuration
├── scripts/                          # Helper scripts
│   ├── download_training_data.py     # Training data download
│   ├── download_checkpoint.py        # Pre-trained checkpoint download
│   ├── predict.py                    # Query model with a reaction SMILES
│   └── evaluate.py                   # Model evaluation
├── train.py                          # Main training entry point
└── tests/                            # Test suite
The Horizyn model uses a dual-encoder architecture:
- Query Encoder (Reactions): 2048-dim fingerprints → 4096-dim hidden → 512-dim embedding
- Target Encoder (Proteins): 1024-dim T5 embeddings → 4096-dim hidden → 512-dim embedding
- Loss Function: MLNCE with temperature parameter (β=10.0)
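To illustrate how a scale parameter such as β shapes a contrastive objective, here is a minimal sketch of a generic InfoNCE-style loss over a batch similarity matrix. It is a stand-in for illustration, not the MLNCE loss in losses.py: it assumes one positive protein per reaction (on the diagonal), treats every other protein in the batch as a negative, and assumes β multiplies the cosine similarities before the softmax. The function name `info_nce_loss` is hypothetical.

```python
import math


def info_nce_loss(sim, beta=10.0):
    """Mean cross-entropy of picking the diagonal (positive) entry in each row.

    sim[i][j] is the cosine similarity between reaction i and protein j.
    beta is assumed to scale similarities into logits; larger values
    sharpen the softmax over candidates.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [beta * s for s in sim[i]]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / n


# Perfectly matched pairs: positives have similarity 1, negatives 0.
aligned = [[1.0, 0.0], [0.0, 1.0]]
# Swapped pairs: each reaction is most similar to the wrong protein.
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(aligned, beta=10.0)
loss_swapped = info_nce_loss(swapped, beta=10.0)
print(f"aligned={loss_aligned:.6f} swapped={loss_swapped:.6f}")
```

In this toy case the aligned batch gives a loss near zero and the swapped batch a loss near β, which shows why the scale controls how strongly mismatches are penalized.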
If you use this code in your research, please cite:
@article{horizyn2026,
title = {Dual-encoder contrastive learning accelerates enzyme discovery},
author = {Rocks, Jason W. and Truong, Dat P. and Rappoport, Dmitrij and Maddrell-Mander, Sam and Martin-Alarcon, Daniel A. and Lee, Toni and Crossan, Steve and Goldford, Joshua E.},
journal = {Proc. Natl. Acad. Sci. U.S.A.},
volume = {123},
number = {12},
pages = {e2520070123},
year = {2026},
doi = {10.1073/pnas.2520070123},
}
This code is licensed under the PolyForm Noncommercial License 1.0.0.
- ✅ Noncommercial use: Free to use and modify for noncommercial purposes
- ✅ Research and education: Permitted for academic, research, and educational purposes
- ❌ Commercial use: Prohibited without separate commercial licensing
- 📧 Commercial inquiries: info@dayhofflabs.com
See LICENSE for full terms or visit https://polyformproject.org/licenses/noncommercial/1.0.0
This repository is maintained by Dayhoff Labs. For questions or issues, please open a GitHub issue.