NOTE: This is a pre-release; the scripts here do not fully cooperate yet. Each step in the protocol currently exists as a standalone script, and the pipeline is under construction.
Yes, I fit the acronym to the title of the project.
LUKE: USE the Forces is a molecular fragmentation protocol designed to improve active learning in machine-learned interatomic potential models. Built on TorchANI, LUKE identifies atomic environments with high force uncertainty and fragments molecules around them, generating smaller molecular systems to enhance training data diversity.
LUKE leverages TorchANI to:
- Detect high-uncertainty atomic force predictions
- Fragment molecules around high-error atoms
- Introduce new, diverse molecular structures to the training dataset
- Improve localized understanding of chemical space
- Automated high-uncertainty detection using force magnitude predictions
- Efficient molecular fragmentation guided by the TorchANI neighbor list
- Designed for active learning workflows in neural network potentials
- Seamless integration with existing TorchANI-based training pipelines
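As a rough illustration of the high-uncertainty detection described above, the sketch below flags atoms whose predicted force magnitude varies strongly across the members of a model ensemble. It is a minimal pure-Python sketch, not LUKE's implementation: the function names, the use of a population standard deviation as the uncertainty measure, and the toy force values are all assumptions made for illustration, and the metric LUKE actually applies may differ.

```python
import math
from statistics import pstdev


def force_magnitudes(forces):
    """Return the Euclidean norm of each atom's (fx, fy, fz) force vector."""
    return [math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces]


def high_uncertainty_atoms(ensemble_forces, threshold):
    """Return indices of atoms whose force magnitude spread exceeds `threshold`.

    ensemble_forces: one list of per-atom (fx, fy, fz) tuples per ensemble member.
    threshold: cutoff on the standard deviation, in the same units as the forces.
    (Hypothetical helper; the uncertainty measure is an assumption.)
    """
    per_model = [force_magnitudes(forces) for forces in ensemble_forces]
    n_atoms = len(per_model[0])
    flagged = []
    for atom in range(n_atoms):
        # Spread of this atom's force magnitude across the ensemble members.
        spread = pstdev([model[atom] for model in per_model])
        if spread > threshold:
            flagged.append(atom)
    return flagged


# Toy data: two ensemble members, two atoms. The members agree on atom 0
# and disagree on atom 1.
ensemble = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (3.0, 0.0, 0.0)],
]
print(high_uncertainty_atoms(ensemble, threshold=0.5))  # prints [1]
```

The flagged indices are the atoms around which fragments would then be cut.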
LUKE relies on TorchANI as a git submodule (vendored source). All runtime and development
dependencies are declared in pyproject.toml (PEP 621). Install in editable mode with the
chemistry and development extras for full functionality.
git clone --recursive git@github.com:roitberg-group/LUKE.git
cd LUKE
git submodule update --init --recursive
python3.11 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .[chem,dev]
The CI pins torch==2.3.1 (CPU build) via the PyTorch CPU index on Linux. On macOS and Windows the
+cpu suffix is not used; only the plain version is specified. For a local environment that mirrors CI, use the
helper script:
bash ./dev_ci_setup.sh
This script:
- Creates/updates .venv with Python 3.11
- Installs pinned torch (CPU variant where available)
- Installs editable torchani (vendored submodule) with its dependencies
- Installs LUKE with chemistry + dev extras
- Verifies torchani internal tuple import
After installation, a console command luke is available.
- Run full pipeline:
luke pipeline example_structures/test_mol.xyz -o results --model ANI2xr --device cpu --threshold 0.5
- Or directly via Python module:
python -m luke.cli pipeline example_structures/test_mol.xyz -o results
LUKE is designed to be integrated into molecular simulation and machine learning workflows. Below is an example of how to use the pipeline:
python run.py --input example_structures/test_mol.xyz --output results/
This command will:
- Read the input XYZ file.
- Run ANI forces to detect high-uncertainty atomic environments.
- Fragment molecules around high-error atoms.
- Sanitize the resulting structures.
- Save the output to the specified directory.
To execute the LUKE pipeline, use the run.py script. Below is an example:
python run.py --input example_structures/test_mol.xyz --output results/
This command will:
- Read the Input File:
  - Parses the molecular structure from the specified XYZ file.
- Run ANI Forces:
  - Computes atomic forces and identifies high-uncertainty atoms using the ANI model.
- Isolate High-Error Atoms:
  - Fragments molecules around high-error atoms to generate smaller substructures.
- Sanitize Structures:
  - Ensures chemical viability of the fragmented structures.
- Save Results:
  - Outputs the sanitized structures and logs to the specified directory.
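The fragmentation step can be pictured as cutting out every atom that lies within some radius of a flagged atom. The sketch below does this with a brute-force distance check; it is an illustration only. LUKE is described as using the TorchANI neighbor list for this, and the function name, the cutoff value, and the toy coordinates here are assumptions. Sanitizing the cut-out fragment (for example, handling broken bonds) is a separate step and is not shown.

```python
import math


def fragment_around(symbols, coords, center, cutoff):
    """Return (symbols, coords) of all atoms within `cutoff` of atom `center`.

    Hypothetical helper: a brute-force stand-in for a neighbor-list lookup.
    Distances are in the same units as `coords`.
    """
    origin = coords[center]
    keep = [
        i for i, position in enumerate(coords)
        if math.dist(position, origin) <= cutoff
    ]
    return [symbols[i] for i in keep], [coords[i] for i in keep]


# Toy system: a water-like cluster plus a distant carbon atom.
symbols = ["O", "H", "H", "C"]
coords = [
    (0.0, 0.0, 0.0),
    (0.96, 0.0, 0.0),
    (-0.24, 0.93, 0.0),
    (5.0, 0.0, 0.0),
]
frag_symbols, frag_coords = fragment_around(symbols, coords, center=0, cutoff=1.5)
print(frag_symbols)  # prints ['O', 'H', 'H']
```

The center atom is always part of its own fragment because its distance to itself is zero.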
Each module in LUKE can be executed independently. Below are examples for running specific modules:
python -m luke.ani_forces --dataset example_structures/test_mol.xyz --model ANI2xr --device cuda --batch_size 1000
- Parameters:
  - --dataset: Path to the input dataset (HDF5 or XYZ format).
  - --model: ANI model to use (default: ANI2xr).
  - --device: Device for computation (cuda or cpu).
  - --batch_size: Number of structures processed per batch.
python -m luke.structure_sanitizer --input results/high_error_atoms.xyz --output results/sanitized_structures.xyz
- Parameters:
  - --input: Path to the input XYZ file.
  - --output: Path to save the sanitized XYZ file.
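For readers wiring these modules into their own scripts, the flags documented for luke.ani_forces could be declared with argparse roughly as follows. This mirrors the documented interface only; it is not LUKE's source. The function name build_parser, the program name, and the defaults chosen for --device and --batch_size are assumptions (only the ANI2xr default for --model is stated above).

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented luke.ani_forces flags (sketch)."""
    parser = argparse.ArgumentParser(prog="ani_forces_sketch")
    parser.add_argument("--dataset", required=True,
                        help="Path to the input dataset (HDF5 or XYZ format)")
    parser.add_argument("--model", default="ANI2xr",
                        help="ANI model to use")
    # Default of "cpu" is an assumption; the README does not state one.
    parser.add_argument("--device", choices=["cuda", "cpu"], default="cpu",
                        help="Device for computation")
    # Default of 1000 is an assumption taken from the example command.
    parser.add_argument("--batch_size", type=int, default=1000,
                        help="Number of structures processed per batch")
    return parser


args = build_parser().parse_args(
    ["--dataset", "example_structures/test_mol.xyz", "--device", "cpu"]
)
print(args.model, args.device, args.batch_size)  # prints ANI2xr cpu 1000
```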
An example dataset is provided in the example_structures/ directory. Use test_mol.xyz to test the pipeline:
python run.py --input example_structures/test_mol.xyz --output results/
This will generate:
- Fragmented molecular structures in the results/ directory.
- Logs and intermediate results for debugging.
If you encounter issues while using LUKE, here are some common problems and their solutions:
- Problem: torchani submodule is not initialized.
  - Solution: Run the following command to initialize the submodule:
    git submodule update --init --recursive
- Problem: Missing dependencies.
  - Solution: Ensure you have installed all dependencies listed in environment.yaml or requirements.txt.
- Problem: CUDA device not available.
  - Solution: Check if your system has a compatible GPU and CUDA installed. If not, use the --device cpu flag.
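When scripting around the CUDA issue above, the device can be chosen programmatically before it is passed to the --device flag. The helper below is a small sketch (pick_device is a hypothetical name, not part of LUKE); it relies only on torch.cuda.is_available() and falls back to the CPU when PyTorch is missing or no GPU is visible.

```python
def pick_device(requested="cuda"):
    """Return `requested` if it is usable, otherwise fall back to "cpu"."""
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return requested if torch.cuda.is_available() else "cpu"


device = pick_device("cuda")
print(f"--device {device}")
```

The result can then be substituted into the commands shown earlier, for example as the value of --device for luke pipeline.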
- HDF5:
  - Hierarchical data format for large datasets; must be formatted to TorchANI dataset standards.
  - Contains molecular structures, atomic species, and coordinates.
- XYZ:
  - Standard molecular structure format.
  - Example:
    3
    Comment line
    H 0.0 0.0 0.0
    O 0.0 0.0 1.0
    H 1.0 0.0 0.0
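To make the XYZ layout concrete (atom count, comment line, then one line per atom with an element symbol and three coordinates), here is a minimal parser sketch for a single-frame file. The function name parse_xyz is hypothetical and this is not how LUKE reads its input; in practice a library such as ASE, which is among the listed requirements, can read XYZ files directly.

```python
def parse_xyz(text):
    """Parse a single-frame XYZ string into (comment, symbols, coords)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])  # first line: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # second line: free text
    atom_lines = lines[2:2 + n_atoms]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"expected {n_atoms} atom lines, found {len(atom_lines)}")
    symbols, coords = [], []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        symbols.append(symbol)
        coords.append((float(x), float(y), float(z)))
    return comment, symbols, coords


example = """3
Comment line
H 0.0 0.0 0.0
O 0.0 0.0 1.0
H 1.0 0.0 0.0
"""
comment, symbols, coords = parse_xyz(example)
print(symbols)  # prints ['H', 'O', 'H']
```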
- Fragmented Structures:
  - XYZ files containing smaller molecular fragments.
- Logs:
  - Detailed logs for debugging and analysis.
- Intermediate Results:
  - Stored in the specified output directory for further inspection.
- Python 3.11
- PyTorch
- TorchANI
- ASE
- RDKit
- Other dependencies (see requirements.txt)
After installation (or via dev_ci_setup.sh):
ruff check luke tests
mypy luke
pytest --disable-warnings --cov=luke
To build distribution artifacts locally:
python -m build --sdist --wheel --outdir dist
twine check dist/*
Or run everything through the Makefile target (see below):
make ci
dev_ci_setup.sh + make ci closely emulate the GitHub Actions workflow for reproducibility before pushing.
Common developer targets are provided:
make ci # Full lint/type/test/build cycle
make lint # Ruff lint
make type # mypy type check
make test # pytest with coverage
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Submit a pull request with a detailed description of your changes.
- Follow PEP 8 for Python code.
- Write comprehensive tests for new features.
- Update the documentation as needed.
- Integrate all standalone scripts into a cohesive pipeline.
- Add more example scripts and datasets.
- Improve test coverage.
- Optimize performance for large datasets.
This project is licensed under the MIT License. See the LICENSE file for details.