LUKE: USE the Forces

NOTE: This is a pre-release and the scripts here do not yet work together as a single pipeline. Each step in the protocol currently exists as a standalone script, and the unified pipeline is under construction.

Largest Uncertainty Kaleidoscope Estimator: Uncertainty-driven Sampling of high-Error Forces

Yes, I fit the acronym to the title of the project.

LUKE: USE the Forces is a molecular fragmentation protocol designed to improve active learning in machine-learned interatomic potential models. Built on TorchANI, LUKE identifies atomic environments with high force uncertainty and fragments molecules around them, generating smaller molecular systems to enhance training data diversity.

Overview

LUKE leverages TorchANI to:

  • Detect high-uncertainty atomic force predictions (see the sketch after this list)
  • Fragment molecules around high-error atoms
  • Introduce new, diverse molecular structures to the training dataset
  • Improve localized understanding of chemical space
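
The core idea can be sketched with plain TorchANI. The snippet below is a minimal illustration, not LUKE's implementation: it assumes the torchani 2.x builtin ensemble interface (model[i] indexing and len(model)), uses the bundled ANI2x model in place of ANI2xr, and the uncertainty cutoff is an arbitrary placeholder. The spread of force predictions across ensemble members serves as the per-atom uncertainty.

# Sketch only: per-atom force uncertainty from an ANI ensemble.
import torch
import torchani

device = torch.device("cpu")
model = torchani.models.ANI2x(periodic_table_index=True).to(device)

# A single water molecule: atomic numbers and coordinates in Angstrom
species = torch.tensor([[8, 1, 1]], device=device)
coordinates = torch.tensor([[[0.00, 0.00, 0.00],
                             [0.00, 0.00, 0.96],
                             [0.93, 0.00, -0.24]]],
                           device=device, requires_grad=True)

# Forces from every ensemble member (each forward pass builds its own graph)
member_forces = []
for i in range(len(model)):
    energy = model[i]((species, coordinates)).energies
    forces = -torch.autograd.grad(energy.sum(), coordinates)[0]
    member_forces.append(forces)

# Per-atom uncertainty: spread of the force vectors across members
force_std = torch.stack(member_forces).std(dim=0).norm(dim=-1)  # shape (1, n_atoms)
threshold = 0.5  # hypothetical cutoff; LUKE's units and default may differ
high_error_atoms = (force_std > threshold).nonzero()
print(force_std)
print(high_error_atoms)

Atoms whose force spread exceeds the cutoff are the candidates LUKE fragments around.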

Features

  • Automated high-uncertainty detection using force magnitude predictions
  • Efficient molecular fragmentation guided by the TorchANI neighbor list
  • Designed for active learning workflows in neural network potentials
  • Seamless integration with existing TorchANI-based training pipelines

Installation

LUKE relies on TorchANI as a git submodule (vendored source). All runtime and development dependencies are declared in pyproject.toml (PEP 621). Install in editable mode with the chemistry and development extras for full functionality.

git clone --recursive git@github.com:roitberg-group/LUKE.git
cd LUKE
git submodule update --init --recursive
python3.11 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .[chem,dev]
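
A quick sanity check of the install (this assumes the editable install and the vendored torchani submodule built cleanly; it only confirms that the packages import):

python -c "import torch, torchani, luke; print(torch.__version__, torchani.__version__)"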

Platform Note (Torch CPU wheels)

The CI pins torch==2.3.1 (CPU build) via the PyTorch CPU index on Linux. On macOS and Windows the +cpu suffix is not used; the plain version is installed instead. For a local environment that mirrors CI, use the helper script:

bash ./dev_ci_setup.sh

This script:

  1. Creates/updates .venv with Python 3.11
  2. Installs pinned torch (CPU variant where available)
  3. Installs editable torchani (vendored submodule) with its dependencies
  4. Installs LUKE with chemistry + dev extras
  5. Verifies torchani internal tuple import

Command-line interface

After installation, a console command luke is available.

  • Run full pipeline:
luke pipeline example_structures/test_mol.xyz -o results --model ANI2xr --device cpu --threshold 0.5
  • Or directly via Python module:
python -m luke.cli pipeline example_structures/test_mol.xyz -o results

Usage

LUKE is designed to be integrated into molecular simulation and machine learning workflows. Below is an example of how to use the pipeline:

Example

python run.py --input example_structures/test_mol.xyz --output results/

This command will:

  1. Read the input XYZ file.
  2. Run ANI forces to detect high-uncertainty atomic environments.
  3. Fragment molecules around high-error atoms.
  4. Sanitize the resulting structures.
  5. Save the output to the specified directory.

Detailed Usage Examples

Running the Full Pipeline

To execute the LUKE pipeline, use the run.py script. Below is an example:

python run.py --input example_structures/test_mol.xyz --output results/

This command will:

  1. Read the Input File:

    • Parses the molecular structure from the specified XYZ file.
  2. Run ANI Forces:

    • Computes atomic forces and identifies high-uncertainty atoms using the ANI model.
  3. Isolate High-Error Atoms:

    • Fragments molecules around high-error atoms to generate smaller substructures (see the sketch after this list).
  4. Sanitize Structures:

    • Ensures chemical viability of the fragmented structures.
  5. Save Results:

    • Outputs the sanitized structures and logs to the specified directory.
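
To make step 3 concrete, here is a minimal sketch of fragmenting around one flagged atom with ASE and a plain distance cutoff. It is an illustration only: LUKE's actual fragmentation is guided by the TorchANI neighbor list, and the atom index, cutoff, and output path below are placeholders.

# Illustrative fragmentation around a high-error atom (not LUKE's exact logic).
import os
import numpy as np
from ase.io import read, write

atoms = read("example_structures/test_mol.xyz")
high_error_index = 4   # hypothetical atom flagged by the uncertainty step
cutoff = 3.5           # hypothetical radius in Angstrom

# Keep every atom within the cutoff of the flagged atom (including the atom itself)
distances = atoms.get_distances(high_error_index, np.arange(len(atoms)))
fragment = atoms[np.where(distances < cutoff)[0]]

os.makedirs("results", exist_ok=True)
write("results/fragment_0.xyz", fragment)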

Running Individual Modules

Each module in LUKE can be executed independently. Below are examples for running specific modules:

1. ANI Forces

python -m luke.ani_forces --dataset example_structures/test_mol.xyz --model ANI2xr --device cuda --batch_size 1000
  • Parameters:
    • --dataset: Path to the input dataset (HDF5 or XYZ format).
    • --model: ANI model to use (default: ANI2xr).
    • --device: Device for computation (cuda or cpu).
    • --batch_size: Number of structures processed per batch.
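
The --dataset argument accepts either input format. For HDF5, the file must already follow TorchANI's dataset layout; the snippet below is a rough sketch of inspecting such a file with torchani's data utilities (the file name is a placeholder, and the exact keys depend on how the dataset was packed):

# Sketch: iterating a TorchANI-formatted HDF5 dataset.
import torchani

dataset = torchani.data.load("training_data.h5")
for conformer in dataset:
    print(conformer["species"], conformer["coordinates"].shape)
    break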

2. Structure Sanitizer

python -m luke.structure_sanitizer --input results/high_error_atoms.xyz --output results/sanitized_structures.xyz
  • Parameters:
    • --input: Path to the input XYZ file.
    • --output: Path to save the sanitized XYZ file.
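
For a rough picture of what sanitization involves, the snippet below uses RDKit to perceive bonds from an XYZ geometry and run RDKit's valence and aromaticity checks. This is only an illustration of the idea; LUKE's sanitizer may apply different or additional checks, and a real high_error_atoms.xyz may hold many structures that would be processed one at a time.

# Illustration of fragment sanitization with RDKit (not necessarily LUKE's exact checks).
from rdkit import Chem
from rdkit.Chem import rdDetermineBonds

mol = Chem.MolFromXYZFile("results/high_error_atoms.xyz")  # geometry only, no bonds yet
if mol is not None:
    rdDetermineBonds.DetermineBonds(mol, charge=0)         # perceive connectivity from geometry
    try:
        Chem.SanitizeMol(mol)                              # valence/aromaticity checks
        print("structure looks chemically viable")
    except Chem.rdchem.MolSanitizeException:
        print("structure rejected by sanitization")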

Example Dataset

An example dataset is provided in the example_structures/ directory. Use test_mol.xyz to test the pipeline:

python run.py --input example_structures/test_mol.xyz --output results/

This will generate:

  • Fragmented molecular structures in the results/ directory.
  • Logs and intermediate results for debugging.

Troubleshooting

If you encounter issues while using LUKE, here are some common problems and their solutions:

  • Problem: torchani submodule is not initialized.

    • Solution: Run the following commands to initialize the submodule:

      git submodule update --init --recursive
  • Problem: Missing dependencies.

    • Solution: Ensure all dependencies declared in pyproject.toml are installed; reinstalling with pip install -e .[chem,dev] pulls in the chemistry and development extras.
  • Problem: CUDA device not available.

    • Solution: Check if your system has a compatible GPU and CUDA installed. If not, use the --device cpu flag.

Detailed Input/Output Formats

Input Formats

  1. HDF5:

    • Hierarchical data format for large datasets; must be formatted to TorchANI dataset standards.
    • Contains molecular structures, atomic species, and coordinates.
  2. XYZ:

    • Standard molecular structure format.

    • Example:

      3
      Comment line
      H 0.0 0.0 0.0
      O 0.0 0.0 1.0
      H 1.0 0.0 0.0
      

Output Formats

  1. Fragmented Structures:

    • XYZ files containing smaller molecular fragments.
  2. Logs:

    • Detailed logs for debugging and analysis.
  3. Intermediate Results:

    • Stored in the specified output directory for further inspection.

Dependencies

  • Python 3.11
  • PyTorch
  • TorchANI
  • ASE
  • RDKit
  • Other dependencies (declared in pyproject.toml)

Running Tests & Quality Gates

After installation (or via dev_ci_setup.sh):

ruff check luke tests
mypy luke
pytest --disable-warnings --cov=luke

To build distribution artifacts locally:

python -m build --sdist --wheel --outdir dist
twine check dist/*

Or run everything through the Makefile target (see below):

make ci

Local CI Mirror

Running dev_ci_setup.sh followed by make ci closely mirrors the GitHub Actions workflow, so CI results can be reproduced locally before pushing.

Makefile Targets

Common developer targets are provided:

make ci      # Full lint/type/test/build cycle
make lint    # Ruff lint
make type    # mypy type check
make test    # pytest with coverage

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch for your feature or bugfix.
  3. Submit a pull request with a detailed description of your changes.

Guidelines

  • Follow PEP 8 for Python code.
  • Write comprehensive tests for new features.
  • Update the documentation as needed.

Roadmap

  • Integrate all standalone scripts into a cohesive pipeline.
  • Add more example scripts and datasets.
  • Improve test coverage.
  • Optimize performance for large datasets.

License

This project is licensed under the MIT License. See the LICENSE file for details.
