SMSD Pro

Substructure & MCS Search for Chemical Graphs


SMSD Pro is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It runs on Java, C++ (header-only), and Python, with GPU acceleration (CUDA + Apple Metal). It builds on established algorithms from the graph-isomorphism literature (VF2++, McSplit, McGregor, Horton, Vismara).

Copyright (c) 2018-2026 Syed Asad Rahman — BioInception PVT LTD


Install

Java (Maven)

<dependency>
  <groupId>com.bioinceptionlabs</groupId>
  <artifactId>smsd</artifactId>
  <version>6.2.1</version>
</dependency>

Java (Download JAR)

curl -LO https://github.com/asad/SMSD/releases/download/v6.2.1/smsd-6.2.1-jar-with-dependencies.jar

java -jar smsd-6.2.1-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Python (pip)

pip install smsd

import smsd

result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs    = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")

# Tautomer-aware MCS
mcs    = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)

# Similarity upper bound (fast pre-filter)
sim    = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

# MCS fingerprint
fp     = smsd.fingerprint("c1ccccc1", kind="mcs")

# Circular fingerprint (ECFP4 equivalent, tautomer-aware)
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)

Using with RDKit

SMSD works standalone or alongside RDKit. Use RDKit for parsing, descriptors, and drawing; use SMSD for fast MCS and substructure matching.

RDKit molecules + SMSD matching (recommended for existing RDKit workflows):

from rdkit import Chem
import smsd

mol1 = Chem.MolFromSmiles("c1ccccc1")
mol2 = Chem.MolFromSmiles("c1ccc(O)cc1")

# MCS via SMSD -- pass RDKit Mol objects directly
result = smsd.mcs_rdkit(mol1, mol2)

# Substructure search
mapping = smsd.substructure_rdkit(mol1, mol2)

# Convert once, reuse with any SMSD function
g = smsd.from_rdkit(mol1)
sim = smsd.similarity(g, smsd.from_rdkit(mol2))

SMSD standalone (no RDKit needed):

import smsd

result = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")
mapping = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
sim = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

CDK (Java) for parsing + SMSD for matching:

import com.bioinception.smsd.core.*;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
var mol1 = sp.parseSmiles("c1ccccc1");
var mol2 = sp.parseSmiles("c1ccc(O)cc1");

SMSD smsd = new SMSD(mol1, mol2, new ChemOptions());
boolean isSub = smsd.isSubstructure();
var mcs = smsd.findMCS();

Performance note: RDKit is an optional dependency -- SMSD does not require it. The helpers convert via a SMILES round-trip (sub-millisecond overhead). For batch workloads, convert once with from_rdkit() and reuse the MolGraph objects.
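
The "convert once, reuse" advice can be sketched as a small helper. This is a minimal illustration, not part of SMSD: `pairwise_scores`, `fake_convert`, and `fake_score` are hypothetical names, and the stand-in functions exist only so the snippet runs without SMSD or RDKit installed.

```python
from itertools import combinations

def pairwise_scores(mols, convert, score):
    """Convert every molecule exactly once, then score all unordered pairs."""
    graphs = [convert(m) for m in mols]        # one conversion per molecule
    return {
        (i, j): score(graphs[i], graphs[j])    # reuse the converted objects
        for i, j in combinations(range(len(graphs)), 2)
    }

# Stand-ins (assumptions) so the sketch is self-contained.
calls = []

def fake_convert(smiles):
    calls.append(smiles)                       # record how often we convert
    return frozenset(smiles)                   # placeholder for a graph object

def fake_score(a, b):
    return len(a & b) / len(a | b)             # placeholder similarity

scores = pairwise_scores(["CCO", "CCN", "CCC"], fake_convert, fake_score)
print(len(calls), "conversions for", len(scores), "comparisons")
```

In an RDKit workflow, the same pattern would presumably take `smsd.from_rdkit` as `convert` and `smsd.similarity` as `score`, as in the snippet above. The conversion cost then grows with the number of molecules rather than the number of pairs.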

Export & Depiction (via RDKit)

Use SMSD for matching, RDKit for visualization and export:

import smsd

# Depict MCS with highlighted atoms (works in Jupyter)
img = smsd.depict_mcs("c1ccccc1", "c1ccc(O)cc1")
img.save("mcs.png")

# Depict substructure match
img = smsd.depict_substructure("c1ccccc1", "c1ccc(O)cc1")

# Generate SVG
svg = smsd.to_svg("c1ccccc1")

# Export to SDF file
mols = [smsd.parse_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
smsd.export_sdf(mols, "output.sdf")

# Convert to RDKit Mol for any RDKit function
rdmol = smsd.to_rdkit(smsd.parse_smiles("c1ccccc1"))

C++ (Header-Only)

git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path; no other dependencies are needed

#include "smsd/smsd.hpp"

auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");

bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs   = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});

Build from Source

git clone https://github.com/asad/SMSD.git
cd SMSD

# Java
mvn -U clean package

# C++ (from the repository root)
mkdir -p cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Python (from the repository root)
cd python && pip install -e .

Docker

docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Benchmarks

MCS: SMSD (pip) vs RDKit 2025.09.2

Same machine, same Python process, best of 5 runs. Full data: benchmarks/results_python.tsv

| Pair | Category | SMSD (ms) | RDKit (ms) | SMSD MCS | RDKit MCS |
|---|---|---|---|---|---|
| Cubane (self) | Cage | 0.003 | 0.241 | 8 | 8 |
| Coronene (self) | PAH | 0.006 | 0.727 | 24 | 24 |
| NAD / NADH | Cofactor | 0.012 | timeout | **44** | 33 |
| Caffeine / Theophylline | Drug pair | 0.016 | 0.354 | 13 | 13 |
| Morphine / Codeine | Alkaloid | 0.049 | 550.5 | 20 | 20 |
| Ibuprofen / Naproxen | NSAID | 0.069 | 3.5 | 15 | 15 |
| ATP / ADP | Nucleotide | 0.085 | 0.897 | 27 | 27 |
| PEG-12 / PEG-16 | Polymer | 1.6 | 2.2 | 40 | 40 |
| RDKit #1585 | Edge case | 25.0 | timeout | **29** | 24 |
| Paclitaxel / Docetaxel | Taxane | 2,405 | timeout | **56** | 53 |

SMSD is faster on 17 of the 19 benchmarked pairs (10 are shown above), with speedups ranging from 1.5x to 11,200x. Bold marks pairs where SMSD found a larger MCS. "timeout" means the 10 s limit was reached.
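
The speedup figures are simply the ratio of the two timing columns. The sketch below recomputes them for a few rows copied from the table above. The `speedup` helper is illustrative and not part of the benchmark script.

```python
# (SMSD ms, RDKit ms) copied from rows of the table above
timings = {
    "Caffeine / Theophylline": (0.016, 0.354),
    "Morphine / Codeine": (0.049, 550.5),
    "Ibuprofen / Naproxen": (0.069, 3.5),
    "PEG-12 / PEG-16": (1.6, 2.2),
}

def speedup(smsd_ms, rdkit_ms):
    """How many times faster SMSD is than RDKit for one pair."""
    return rdkit_ms / smsd_ms

speedups = {pair: speedup(*t) for pair, t in timings.items()}

# Print slowest-to-fastest relative gain
for pair, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{pair:26s} {s:10.1f}x")
```

Rows where RDKit timed out have no finite ratio, so they are left out here.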

Substructure: SMSD Java (cached) vs CDK DfPattern 2.11

All 28 pairs are correct and match CDK. With caching, SMSD is 2x-16x faster across all pairs.

Run python benchmarks/benchmark_python_vs_rdkit.py to reproduce.
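
For timing your own molecule pairs with a "best of 5 runs" protocol, a harness along these lines may be enough. It is a sketch under assumptions: `best_of` is a hypothetical helper, the workload is a placeholder, and the repository's benchmark script may measure differently.

```python
import time

def best_of(fn, repeats=5):
    """Return the fastest wall-clock time in milliseconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best

# Placeholder workload; swap in the call being measured.
elapsed_ms = best_of(lambda: sum(range(10_000)))
print(f"best of 5: {elapsed_ms:.3f} ms")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and cache warm-up on very short calls.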


Algorithms

MCS Pipeline (11-level funnel)

| Level | Algorithm | Based on |
|---|---|---|
| L0 | Label-frequency upper bound | Degree-aware coverage-driven termination |
| L0.25 | Chain fast-path | O(n*m) DP for linear polymers (PEG, lipids) |
| L0.5 | Tree fast-path | Kilpelainen-Mannila DP for branched polymers (dendrimers, glycogen) |
| L0.75 | Greedy probe | O(N) fast path for near-identical molecules |
| L1 | Substructure containment | VF2++ check if smaller molecule is subgraph |
| L1.25 | Augmenting path extension | Forced-extension bond growth from substructure seed |
| L1.5 | Seed-and-extend | Bond-growth from rare-label seeds |
| L2 | McSplit + RRSplit | Partition refinement (McCreesh 2017) with maximality pruning |
| L3 | Bron-Kerbosch | Product-graph clique with Tomita pivoting + k-core + orbit pruning |
| L4 | McGregor extension | Forced-assignment bond-grow frontier (McGregor 1982) |
| L5 | Extra seeds | Ring skeleton, heavy-atom core, label-degree anchor seeds |
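
To illustrate the idea behind the first level, here is a minimal sketch of a label-frequency upper bound. It assumes atoms are labelled by element symbol only and ignores the degree-aware refinement mentioned in the table. `label_frequency_bound` is a hypothetical name, not SMSD's API.

```python
from collections import Counter

def label_frequency_bound(labels_a, labels_b):
    """Upper bound on MCS size: atoms of each label that both molecules can
    contribute, summed over labels (Counter & Counter takes the minimum)."""
    return sum((Counter(labels_a) & Counter(labels_b)).values())

# Heavy-atom labels only (an assumption made for this sketch)
benzene = ["C"] * 6
phenol = ["C"] * 6 + ["O"]
pyridine = ["C"] * 5 + ["N"]

print(label_frequency_bound(benzene, phenol))    # 6
print(label_frequency_bound(benzene, pyridine))  # 5
```

If a cheaper stage has already found a mapping whose size equals this bound, no larger MCS can exist and the more expensive levels can be skipped.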

MCS Variants

| Variant | Flag |
|---|---|
| MCIS (induced) | induced=true |
| MCCS (connected) | default |
| MCES (edge subgraph) | maximizeBonds=true |
| dMCS (disconnected) | disconnectedMCS=true |
| N-MCS (multi-molecule) | findNMCS() |
| Weighted MCS | atomWeights |
| Scaffold MCS | findScaffoldMCS() |
| Tautomer-aware MCS | ChemOptions.tautomerProfile() |

Substructure Search (VF2++)

VF2++ (Juttner & Madarasi 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, bit-parallel candidate domains, and GPU-accelerated domain initialization (CUDA + Metal).
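
Neighbor-label-frequency (NLF) pruning can be sketched at a single level as follows. This is an illustration of the general technique, not SMSD's 3-level implementation: the adjacency-dict graph representation and the function names are assumptions made for the example.

```python
from collections import Counter

def nlf(graph, labels, v):
    """Neighbor label frequency: how many neighbors of v carry each label."""
    return Counter(labels[u] for u in graph[v])

def nlf_compatible(q_graph, q_labels, q, t_graph, t_labels, t):
    """q may map to t only if labels agree and t has at least as many
    neighbors of every label as q does."""
    if q_labels[q] != t_labels[t]:
        return False
    q_nlf = nlf(q_graph, q_labels, q)
    t_nlf = nlf(t_graph, t_labels, t)
    return all(t_nlf[lab] >= n for lab, n in q_nlf.items())

# Query: a C-O fragment. Target: ethanol heavy atoms C0-C1-O2.
q_graph, q_labels = {0: [1], 1: [0]}, {0: "C", 1: "O"}
t_graph, t_labels = {0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "O"}

# Candidate domain for each query atom after pruning
candidates = {
    q: [t for t in t_graph
        if nlf_compatible(q_graph, q_labels, q, t_graph, t_labels, t)]
    for q in q_graph
}
print(candidates)  # {0: [1], 1: [2]}
```

The terminal carbon C0 is rejected for the query carbon because it has no oxygen neighbor, so the backtracking search starts with one candidate per query atom instead of two.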

Ring Perception

Horton's candidate generation + 2-phase GF(2) elimination (Vismara 1997) for relevant cycles, orbit-based grouping for Unique Ring Families (URFs).

| Output | Description |
|---|---|
| SSSR / MCB | Smallest Set of Smallest Rings |
| RCB | Relevant Cycle Basis |
| URF | Unique Ring Families (automorphism orbit grouping) |
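
As a sanity check on ring perception output, the number of rings in any SSSR or minimum cycle basis equals the cyclomatic number E - V + C (bonds minus atoms plus connected components). The sketch below computes it with a small union-find. `cycle_basis_size` is an illustrative helper, not an SMSD function, and it assumes each bond is listed once.

```python
def cycle_basis_size(num_atoms, bonds):
    """Number of rings in an SSSR/MCB: E - V + C (cyclomatic number)."""
    parent = list(range(num_atoms))

    def find(x):
        # Union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_atoms
    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:               # bond joins two components
            parent[ra] = rb
            components -= 1
    return len(bonds) - num_atoms + components

# Heavy-atom skeletons: a 6-cycle, and a 10-cycle with one fusion bond
benzene = [(i, (i + 1) % 6) for i in range(6)]
naphthalene = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]

print(cycle_basis_size(6, benzene))       # 1
print(cycle_basis_size(10, naphthalene))  # 2
```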

Chemistry Options

| Option | Values |
|---|---|
| Chirality | R/S tetrahedral, E/Z double bond |
| Isotope | matchIsotope=true |
| Tautomers | 15 transforms with pKa-informed weights (Sitzmann 2010) |
| Solvent | AQUEOUS, DMSO, METHANOL, CHLOROFORM, ACETONITRILE, DIETHYL_ETHER |
| Ring fusion | IGNORE / PERMISSIVE / STRICT |
| Bond order | STRICT / LOOSE / ANY |
| Aromaticity | STRICT / FLEXIBLE |
| Lenient SMILES | ParseOptions{.lenient=true} (C++) / ChemOptions.lenientSmiles (Java) |

Preset profiles: ChemOptions() (default), .tautomerProfile(), .fmcsProfile() (RDKit-compatible)

Solvent-aware tautomers (Tier 2 pKa): opts.withSolvent(Solvent.DMSO) adjusts tautomer equilibrium weights for non-aqueous environments.


Platform & GPU Support

| Platform | CPU | GPU |
|---|---|---|
| macOS (Apple Silicon) | OpenMP | Metal (zero-copy unified memory) |
| Linux | OpenMP | CUDA |
| Windows | OpenMP | CUDA |
| Any (no GPU) | OpenMP | Automatic CPU fallback |

GPU acceleration covers RASCAL batch screening and domain initialization. Recursive backtracking (VF2++, BK, McSplit) runs on CPU. Dispatch: CUDA -> Metal -> OpenMP -> sequential.
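
The dispatch order amounts to a first-available fallback chain. The sketch below shows that logic only. The backend names and `pick_backend` are hypothetical, and how SMSD actually probes for CUDA, Metal, or OpenMP is not described here.

```python
# Order taken from the dispatch chain described above
DISPATCH_ORDER = ("cuda", "metal", "openmp", "sequential")

def pick_backend(available):
    """Return the first backend in the dispatch order that is available."""
    for name in DISPATCH_ORDER:
        if name in available:
            return name
    return "sequential"  # plain single-threaded CPU is always possible

print(pick_backend({"metal", "openmp"}))  # metal
print(pick_backend({"openmp"}))           # openmp
print(pick_backend(set()))                # sequential
```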


Additional Tools

| Tool | Description |
|---|---|
| CIP R/S/E/Z assignment | Full digraph-based stereo descriptors (IUPAC 2013 Rules 1-2) |
| Circular fingerprint (ECFP/FCFP) | Tautomer-aware Morgan/ECFP with configurable radius (-1 = whole molecule) |
| Count-based ECFP/FCFP | ecfpCounts() / fcfpCounts(): superior to binary for ML |
| Topological Torsion fingerprint | 4-atom path with atom typing (SOTA on peptide benchmarks) |
| Path fingerprint | Graph-aware, tautomer-invariant path enumeration |
| MCS fingerprint | MCS-aware, auto-sized |
| Similarity metrics | Tanimoto, Dice, Cosine, Soergel (binary + count-vector) |
| Fingerprint formats | toBitSet(), toHex(), toBinaryString(), fromBitSet(), fromHex() |
| MCS SMILES extraction | findMcsSmiles(): extract MCS as canonical SMILES |
| findAllMCS | Top-N MCS enumeration with canonical SMILES dedup |
| SMARTS-based MCS | findMcsSmarts(): largest substructure matching a SMARTS pattern |
| R-group decomposition | decomposeRGroups() |
| MatchResult | Structured result: size, mapping, tanimoto, query/target atom counts |
| RASCAL screening | O(V+E) similarity upper bound |
| Canonical SMILES / SMARTS | Deterministic, toolkit-independent (including X total connectivity) |
| Reaction atom mapping | mapReaction() |
| 2D depiction | SVG rendering with atom highlighting |
| Lenient SMILES parser | Best-effort recovery from malformed SMILES |
| N-MCS | Multi-molecule MCS with provenance tracking |
| Tautomer validation | validateTautomerConsistency(): proton conservation check |
| 30 tautomer transforms | pKa-informed weights, 6 solvents, pH-sensitive, ring-chain tautomerism |
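
For reference, the four similarity metrics listed in the table have the following standard definitions on binary fingerprints, shown here on sets of on-bit indices. This is a plain-Python sketch, not SMSD's implementation. Treating two empty fingerprints as identical is a convention assumed for this example.

```python
from math import sqrt

def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0  # both empty: identical

def dice(a, b):
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

def cosine(a, b):
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def soergel(a, b):
    """Soergel distance; for binary fingerprints this is 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)

# Two toy fingerprints given as sets of on-bit positions
fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 7, 12, 20}

for name, fn in [("Tanimoto", tanimoto), ("Dice", dice),
                 ("Cosine", cosine), ("Soergel", soergel)]:
    print(f"{name:9s} {fn(fp1, fp2):.4f}")
```

Tanimoto, Dice, and Cosine are similarities (1 means identical), while Soergel is a distance (0 means identical).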

File Formats

Format Read Write
SMILES Java, C++ Java, C++
SMARTS Java, C++ C++
MOL V2000 Java, C++ C++
SDF Java, C++
Mol2, PDB, CML Java

Release Downloads

Every release includes all platforms:

| Download | Description |
|---|---|
| SMSD.Pro-6.2.1.dmg | macOS installer (Apple Silicon); drag to Applications |
| SMSD.Pro-6.2.1.msi | Windows installer; next, next, finish |
| smsd-pro_6.2.1_amd64.deb | Linux installer; sudo dpkg -i |
| smsd-6.2.1.jar | Pure library JAR (Maven/Gradle dependency) |
| smsd-6.2.1-jar-with-dependencies.jar | Standalone CLI (just java -jar) |
| smsd-cpp-6.2.1-headers.tar.gz | C++ header-only library (unpack, #include "smsd/smsd.hpp") |
| pip install smsd | Python package (PyPI) |

# Native installer: download .dmg / .msi / .deb, double-click, done

# CLI
java -jar smsd-6.2.1-jar-with-dependencies.jar --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Docker CLI
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Python
pip install smsd

Tests

  • 1,082 Java tests (7 consolidated suites) — heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases, fast-path validation, solvent corrections
  • 170 C++ tests (3 suites) — 63 core + 91 parser (including SMARTS X primitive) + 16 batch/GPU
  • 1,003 diverse molecules — all parse correctly in C++ SMILES parser
  • AddressSanitizer — zero memory errors
  • Python tests — full API coverage including hydrogen handling and charged species

Documentation

| Document | Description |
|---|---|
| WHITEPAPER | Algorithms & design (11-level MCS, VF2++, ring perception) |
| HOWTO-INSTALL | Build from source guide |
| NOTICE | Attribution, trademark, and novel algorithm terms |

Citation

If you use SMSD Pro in your research, please cite:

Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009. DOI: 10.1186/1758-2946-1-12

GitHub renders a "Cite this repository" button from CITATION.cff.


Author

Syed Asad Rahman, BioInception PVT LTD

Copyright (c) 2018-2026 BioInception PVT LTD. Algorithm Copyright (c) 2009-2026 Syed Asad Rahman.

License

Apache License 2.0 — see LICENSE and NOTICE