Substructure & MCS Search for Chemical Graphs
SMSD Pro is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It runs on Java, C++ (header-only), and Python, with GPU acceleration (CUDA + Apple Metal). It builds on established algorithms from the graph-isomorphism literature (VF2++, McSplit, McGregor, Horton, Vismara).
Copyright (c) 2018-2026 Syed Asad Rahman — BioInception PVT LTD
<dependency>
<groupId>com.bioinceptionlabs</groupId>
<artifactId>smsd</artifactId>
<version>6.2.1</version>
</dependency>
curl -LO https://github.com/asad/SMSD/releases/download/v6.2.1/smsd-6.2.1-jar-with-dependencies.jar
java -jar smsd-6.2.1-jar-with-dependencies.jar \
--Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
pip install smsd
import smsd
result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")
# Tautomer-aware MCS
mcs = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)
# Similarity upper bound (fast pre-filter)
sim = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")
fp = smsd.fingerprint("c1ccccc1", kind="mcs")
# Circular fingerprint (ECFP4 equivalent, tautomer-aware)
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)
SMSD works standalone or alongside RDKit. Use RDKit for parsing, descriptors, and drawing; use SMSD for fast MCS and substructure matching.
RDKit molecules + SMSD matching (recommended for existing RDKit workflows):
from rdkit import Chem
import smsd
mol1 = Chem.MolFromSmiles("c1ccccc1")
mol2 = Chem.MolFromSmiles("c1ccc(O)cc1")
# MCS via SMSD -- pass RDKit Mol objects directly
result = smsd.mcs_rdkit(mol1, mol2)
# Substructure search
mapping = smsd.substructure_rdkit(mol1, mol2)
# Convert once, reuse with any SMSD function
g = smsd.from_rdkit(mol1)
sim = smsd.similarity(g, smsd.from_rdkit(mol2))
SMSD standalone (no RDKit needed):
import smsd
result = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")
mapping = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
sim = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")
CDK (Java) for parsing + SMSD for matching:
import com.bioinception.smsd.core.*;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;
SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
var mol1 = sp.parseSmiles("c1ccccc1");
var mol2 = sp.parseSmiles("c1ccc(O)cc1");
SMSD smsd = new SMSD(mol1, mol2, new ChemOptions());
boolean isSub = smsd.isSubstructure();
var mcs = smsd.findMCS();
Performance note: RDKit is an optional dependency; SMSD does not require it. The helpers convert via a SMILES round-trip (sub-millisecond overhead). For batch workloads, convert once with from_rdkit() and reuse the MolGraph objects.
Use SMSD for matching, RDKit for visualization and export:
import smsd
# Depict MCS with highlighted atoms (works in Jupyter)
img = smsd.depict_mcs("c1ccccc1", "c1ccc(O)cc1")
img.save("mcs.png")
# Depict substructure match
img = smsd.depict_substructure("c1ccccc1", "c1ccc(O)cc1")
# Generate SVG
svg = smsd.to_svg("c1ccccc1")
# Export to SDF file
mols = [smsd.parse_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
smsd.export_sdf(mols, "output.sdf")
# Convert to RDKit Mol for any RDKit function
rdmol = smsd.to_rdkit(smsd.parse_smiles("c1ccccc1"))
git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path; no other dependencies needed
#include "smsd/smsd.hpp"
auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");
bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});
git clone https://github.com/asad/SMSD.git
cd SMSD
# Java
mvn -U clean package
# C++
mkdir cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Python (run from the repository root)
cd python && pip install -e .
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
Same machine, same Python process, best of 5 runs.
Full data: benchmarks/results_python.tsv
| Pair | Category | SMSD (ms) | RDKit (ms) | SMSD MCS | RDKit MCS |
|---|---|---|---|---|---|
| Cubane (self) | Cage | 0.003 | 0.241 | 8 | 8 |
| Coronene (self) | PAH | 0.006 | 0.727 | 24 | 24 |
| NAD / NADH | Cofactor | 0.012 | timeout | **44** | 33 |
| Caffeine / Theophylline | Drug pair | 0.016 | 0.354 | 13 | 13 |
| Morphine / Codeine | Alkaloid | 0.049 | 550.5 | 20 | 20 |
| Ibuprofen / Naproxen | NSAID | 0.069 | 3.5 | 15 | 15 |
| ATP / ADP | Nucleotide | 0.085 | 0.897 | 27 | 27 |
| PEG-12 / PEG-16 | Polymer | 1.6 | 2.2 | 40 | 40 |
| RDKit #1585 | Edge case | 25.0 | timeout | **29** | 24 |
| Paclitaxel / Docetaxel | Taxane | 2,405 | timeout | **56** | 53 |
SMSD is faster on 17 of 19 pairs, with speedups ranging from 1.5x to 11,200x. Bold marks pairs where SMSD found a larger MCS than RDKit; "timeout" means the 10 s limit was reached.
28/28 pairs correct — all match CDK. Cached speedup: 2x-16x faster across all pairs.
Run python benchmarks/benchmark_python_vs_rdkit.py to reproduce.
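
For readers who want to time their own pairs the same way, here is a minimal sketch of a "best of 5 runs" measurement. It is not the benchmark script itself: `best_of` and `speedup` are hypothetical helper names, and the workload is a stand-in that you would replace with the call you want to measure.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of `fn()` in milliseconds."""
    best_ms = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best_ms = min(best_ms, (time.perf_counter() - start) * 1000.0)
    return best_ms


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Stand-in workload (assumption); swap in the matching call you want to time.
elapsed = best_of(lambda: sorted(range(10_000), reverse=True))
print(f"best of 5: {elapsed:.3f} ms")

# Morphine / Codeine row from the table: 550.5 ms vs 0.049 ms
print(f"speedup: {speedup(550.5, 0.049):,.0f}x")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and cache warm-up, which matters for sub-millisecond timings like those in the table.
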
| Level | Algorithm | Based on |
|---|---|---|
| L0 | Label-frequency upper bound | Degree-aware coverage-driven termination |
| L0.25 | Chain fast-path | O(n*m) DP for linear polymers (PEG, lipids) |
| L0.5 | Tree fast-path | Kilpeläinen-Mannila DP for branched polymers (dendrimers, glycogen) |
| L0.75 | Greedy probe | O(N) fast path for near-identical molecules |
| L1 | Substructure containment | VF2++ check if smaller molecule is subgraph |
| L1.25 | Augmenting path extension | Forced-extension bond growth from substructure seed |
| L1.5 | Seed-and-extend | Bond-growth from rare-label seeds |
| L2 | McSplit + RRSplit | Partition refinement (McCreesh 2017) with maximality pruning |
| L3 | Bron-Kerbosch | Product-graph clique with Tomita pivoting + k-core + orbit pruning |
| L4 | McGregor extension | Forced-assignment bond-grow frontier (McGregor 1982) |
| L5 | Extra seeds | Ring skeleton, heavy-atom core, label-degree anchor seeds |
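
The L0 row can be illustrated with a small sketch of a label-frequency upper bound: no common substructure can contain more atoms of a given element than the molecule with fewer of them. This is a simplified illustration, not SMSD's implementation; it uses element labels only (the degree-aware refinement from the table is not shown), and `label_frequency_upper_bound` is a hypothetical name.

```python
from collections import Counter


def label_frequency_upper_bound(labels_a, labels_b):
    """Upper bound on MCS size: per-label minimum of the two atom counts."""
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    return sum(min(n, count_b[label]) for label, n in count_a.items())


# Heavy atoms only: caffeine is C8 N4 O2, theophylline is C7 N4 O2.
caffeine = ["C"] * 8 + ["N"] * 4 + ["O"] * 2
theophylline = ["C"] * 7 + ["N"] * 4 + ["O"] * 2

bound = label_frequency_upper_bound(caffeine, theophylline)
print(bound)
```

For this pair the bound equals the MCS size reported in the benchmark table (13), so a search that finds a 13-atom mapping can stop immediately because no larger answer is possible.
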
| Variant | Flag |
|---|---|
| MCIS (induced) | induced=true |
| MCCS (connected) | default |
| MCES (edge subgraph) | maximizeBonds=true |
| dMCS (disconnected) | disconnectedMCS=true |
| N-MCS (multi-molecule) | findNMCS() |
| Weighted MCS | atomWeights |
| Scaffold MCS | findScaffoldMCS() |
| Tautomer-aware MCS | ChemOptions.tautomerProfile() |
VF2++ (Juttner & Madarasi 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, bit-parallel candidate domains, and GPU-accelerated domain initialization (CUDA + Metal).
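
As a rough illustration of bit-parallel candidate domains with one level of neighbor-label-frequency (NLF) pruning, the sketch below stores each query atom's candidates as an integer bitmask over target atoms. It is a teaching example under simple assumptions (molecules given as label lists plus adjacency lists, hypothetical function names), not the library's data layout or its 3-level filter.

```python
from collections import Counter


def neighbor_labels(labels, adj, atom):
    """Multiset of labels on the neighbors of `atom`."""
    return Counter(labels[n] for n in adj[atom])


def candidate_domains(q_labels, q_adj, t_labels, t_adj):
    """One bitmask per query atom; bit i is set if target atom i is a candidate."""
    domains = []
    for qi, q_label in enumerate(q_labels):
        need = neighbor_labels(q_labels, q_adj, qi)
        mask = 0
        for ti, t_label in enumerate(t_labels):
            if t_label != q_label:
                continue  # label filter
            if len(t_adj[ti]) < len(q_adj[qi]):
                continue  # degree filter
            have = neighbor_labels(t_labels, t_adj, ti)
            if all(have[label] >= count for label, count in need.items()):
                mask |= 1 << ti  # NLF filter passed
        domains.append(mask)
    return domains


# Query: C-O fragment. Target: ethanol heavy atoms C0-C1-O2.
query_labels, query_adj = ["C", "O"], [[1], [0]]
target_labels, target_adj = ["C", "C", "O"], [[1], [0, 2], [1]]

domains = candidate_domains(query_labels, query_adj, target_labels, target_adj)
print([bin(d) for d in domains])
```

Here the query carbon keeps only target atom 1 (the carbon bonded to oxygen) and the query oxygen keeps only target atom 2. An empty domain (mask 0) proves that no substructure match exists before any backtracking starts, and intersecting domains is a single bitwise AND.
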
Horton's candidate generation + 2-phase GF(2) elimination (Vismara 1997) for relevant cycles, orbit-based grouping for Unique Ring Families (URFs).
| Output | Description |
|---|---|
| SSSR / MCB | Smallest Set of Smallest Rings (minimum cycle basis) |
| RCB | Relevant Cycle Basis |
| URF | Unique Ring Families (automorphism orbit grouping) |
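
The GF(2) elimination step can be sketched as follows: encode each candidate cycle as a bitmask over bond indices, process candidates from shortest to longest, and keep a cycle only if it is linearly independent (under XOR) of the cycles already kept. This is a simplified illustration of selecting a minimum cycle basis. Horton's candidate generation and the two-phase relevant-cycle logic are not reproduced, the candidates are supplied by hand, and all names are hypothetical.

```python
def edge_mask(cycle, edge_index):
    """Encode a cycle (list of atom indices, implicitly closed) as a bond bitmask."""
    mask = 0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        mask |= 1 << edge_index[frozenset((a, b))]
    return mask


def minimum_cycle_basis(candidates, edges):
    """Greedy selection of independent cycles via Gaussian elimination over GF(2)."""
    edge_index = {frozenset(e): i for i, e in enumerate(edges)}
    pivots = {}  # leading bit -> reduced vector with that leading bit
    chosen = []
    for cycle in sorted(candidates, key=len):
        vec = edge_mask(cycle, edge_index)
        while vec:
            lead = vec.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = vec  # independent: keep this cycle
                chosen.append(cycle)
                break
            vec ^= pivots[lead]  # eliminate the leading bit and retry
        # if vec reached 0, the cycle is an XOR of shorter kept cycles
    return chosen


# Naphthalene skeleton: two six-membered rings sharing the 4-5 bond.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
ring_a = [0, 1, 2, 3, 4, 5]
ring_b = [4, 6, 7, 8, 9, 5]
envelope = [0, 1, 2, 3, 4, 6, 7, 8, 9, 5]  # the 10-membered outer cycle

basis = minimum_cycle_basis([envelope, ring_a, ring_b], edges)
print(basis)
```

The outer 10-membered cycle is rejected because it is the XOR of the two six-membered rings, leaving a basis of size 2, which matches the cyclomatic number (11 bonds - 10 atoms + 1).
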
| Option | Values |
|---|---|
| Chirality | R/S tetrahedral, E/Z double bond |
| Isotope | matchIsotope=true |
| Tautomers | 15 transforms with pKa-informed weights (Sitzmann 2010) |
| Solvent | AQUEOUS, DMSO, METHANOL, CHLOROFORM, ACETONITRILE, DIETHYL_ETHER |
| Ring fusion | IGNORE / PERMISSIVE / STRICT |
| Bond order | STRICT / LOOSE / ANY |
| Aromaticity | STRICT / FLEXIBLE |
| Lenient SMILES | ParseOptions{.lenient=true} (C++) / ChemOptions.lenientSmiles (Java) |
Preset profiles: ChemOptions() (default), .tautomerProfile(), .fmcsProfile() (RDKit-compatible)
Solvent-aware tautomers (Tier 2 pKa): opts.withSolvent(Solvent.DMSO) adjusts tautomer equilibrium weights for non-aqueous environments.
| Platform | CPU | GPU |
|---|---|---|
| macOS (Apple Silicon) | OpenMP | Metal (zero-copy unified memory) |
| Linux | OpenMP | CUDA |
| Windows | OpenMP | CUDA |
| Any (no GPU) | OpenMP | Automatic CPU fallback |
GPU acceleration covers RASCAL batch screening and domain initialization. Recursive backtracking (VF2++, BK, McSplit) runs on CPU. Dispatch: CUDA -> Metal -> OpenMP -> sequential.
| Tool | Description |
|---|---|
| CIP R/S/E/Z assignment | Full digraph-based stereo descriptors (IUPAC 2013 Rules 1-2) |
| Circular fingerprint (ECFP/FCFP) | Tautomer-aware Morgan/ECFP with configurable radius (-1 = whole molecule) |
| Count-based ECFP/FCFP | ecfpCounts() / fcfpCounts() — superior to binary for ML |
| Topological Torsion fingerprint | 4-atom path with atom typing (SOTA on peptide benchmarks) |
| Path fingerprint | Graph-aware, tautomer-invariant path enumeration |
| MCS fingerprint | MCS-aware, auto-sized |
| Similarity metrics | Tanimoto, Dice, Cosine, Soergel (binary + count-vector) |
| Fingerprint formats | toBitSet(), toHex(), toBinaryString(), fromBitSet(), fromHex() |
| MCS SMILES extraction | findMcsSmiles() — extract MCS as canonical SMILES |
| findAllMCS | Top-N MCS enumeration with canonical SMILES dedup |
| SMARTS-based MCS | findMcsSmarts() — largest substructure matching a SMARTS pattern |
| R-group decomposition | decomposeRGroups() |
| MatchResult | Structured result: size, mapping, tanimoto, query/target atom counts |
| RASCAL screening | O(V+E) similarity upper bound |
| Canonical SMILES / SMARTS | deterministic, toolkit-independent (including X total connectivity) |
| Reaction atom mapping | mapReaction() |
| 2D depiction | SVG rendering with atom highlighting |
| Lenient SMILES parser | Best-effort recovery from malformed SMILES |
| N-MCS | Multi-molecule MCS with provenance tracking |
| Tautomer validation | validateTautomerConsistency() — proton conservation check |
| 30 tautomer transforms | pKa-informed weights, 6 solvents, pH-sensitive, ring-chain tautomerism |
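
For reference, the four similarity metrics listed above can be written in a few lines. The sketch below treats a fingerprint as a mapping from feature to count, so binary fingerprints are the special case where every count is 1. These are the textbook definitions, not SMSD's code: the count-vector Tanimoto uses the min/max form (one of several conventions), two empty fingerprints are scored as identical, and the function names are hypothetical.

```python
import math


def _pairs(a, b):
    """Aligned (count_a, count_b) pairs over the union of features."""
    return [(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)]


def tanimoto(a, b):
    pairs = _pairs(a, b)
    denom = sum(max(x, y) for x, y in pairs)
    return sum(min(x, y) for x, y in pairs) / denom if denom else 1.0


def dice(a, b):
    denom = sum(a.values()) + sum(b.values())
    return 2 * sum(min(x, y) for x, y in _pairs(a, b)) / denom if denom else 1.0


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a.values()) * sum(y * y for y in b.values()))
    return sum(x * y for x, y in _pairs(a, b)) / norm if norm else 1.0


def soergel(a, b):
    """Soergel distance; equals 1 - min/max Tanimoto."""
    return 1.0 - tanimoto(a, b)


def from_bits(on_bits):
    """Binary fingerprint given as a collection of set bit positions."""
    return {bit: 1 for bit in on_bits}


fp1, fp2 = from_bits({1, 2, 3, 4}), from_bits({3, 4, 5})
print(tanimoto(fp1, fp2), dice(fp1, fp2), cosine(fp1, fp2), soergel(fp1, fp2))

counts1, counts2 = {"x": 2, "y": 1}, {"x": 1, "y": 1, "z": 1}
print(tanimoto(counts1, counts2))
```

Note that Soergel is a distance (0 means identical) while the other three are similarities (1 means identical), so they should not be mixed in the same ranking without conversion.
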
| Format | Read | Write |
|---|---|---|
| SMILES | Java, C++ | Java, C++ |
| SMARTS | Java, C++ | C++ |
| MOL V2000 | Java, C++ | C++ |
| SDF | Java, C++ | — |
| Mol2, PDB, CML | Java | — |
Every release includes all platforms:
| Download | Description |
|---|---|
| SMSD.Pro-6.2.1.dmg | macOS installer (Apple Silicon); drag to Applications |
| SMSD.Pro-6.2.1.msi | Windows installer (next, next, finish) |
| smsd-pro_6.2.1_amd64.deb | Linux installer (sudo dpkg -i) |
| smsd-6.2.1.jar | Pure library JAR (Maven/Gradle dependency) |
| smsd-6.2.1-jar-with-dependencies.jar | Standalone CLI (just java -jar) |
| smsd-cpp-6.2.1-headers.tar.gz | C++ header-only library (unpack, #include "smsd/smsd.hpp") |
| pip install smsd | Python package (PyPI) |
# Native installer — download .dmg / .msi / .deb, double-click, done
# CLI
java -jar smsd-6.2.1-jar-with-dependencies.jar --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
# Docker CLI
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
# Python
pip install smsd
- 1,082 Java tests (7 consolidated suites): heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases, fast-path validation, solvent corrections
- 170 C++ tests (3 suites): 63 core + 91 parser (including SMARTS X primitive) + 16 batch/GPU
- 1,003 diverse molecules: all parse correctly in C++ SMILES parser
- AddressSanitizer: zero memory errors
- Python tests: full API coverage including hydrogen handling and charged species
| Document | Description |
|---|---|
| WHITEPAPER | Algorithms & design (11-level MCS, VF2++, ring perception) |
| HOWTO-INSTALL | Build from source guide |
| NOTICE | Attribution, trademark, and novel algorithm terms |
If you use SMSD Pro in your research, please cite:
Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009. DOI: 10.1186/1758-2946-1-12
GitHub renders a "Cite this repository" button from CITATION.cff.
Syed Asad Rahman — BioInception PVT LTD
Copyright (c) 2018-2026 BioInception PVT LTD. Algorithm Copyright (c) 2009-2026 Syed Asad Rahman.