SMSD Pro

Substructure & MCS Search for Chemical Graphs

SMSD Pro is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It is available for Java, C++ (header-only), and Python, with GPU acceleration through CUDA and Apple Metal. It builds on established algorithms from the graph-matching and ring-perception literature (VF2++, McSplit, McGregor, Horton, Vismara).

Install

Java (Maven)

<dependency>
  <groupId>com.bioinceptionlabs</groupId>
  <artifactId>smsd</artifactId>
  <version>6.2.1</version>
</dependency>

Java (Download JAR)

curl -LO https://github.com/asad/SMSD/releases/download/v6.2.1/smsd-6.2.1-jar-with-dependencies.jar

java -jar smsd-6.2.1-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Python (pip)

pip install smsd

import smsd

result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs    = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")

# Tautomer-aware MCS
mcs    = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)

# Similarity upper bound (fast pre-filter)
sim    = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

# MCS-aware fingerprint
fp     = smsd.fingerprint("c1ccccc1", kind="mcs")

# Circular fingerprint (ECFP4 equivalent, tautomer-aware)
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)

Using with RDKit

SMSD works standalone or alongside RDKit. Use RDKit for parsing, descriptors, and drawing; use SMSD for fast MCS and substructure matching.

RDKit molecules + SMSD matching (recommended for existing RDKit workflows):

from rdkit import Chem
import smsd

mol1 = Chem.MolFromSmiles("c1ccccc1")
mol2 = Chem.MolFromSmiles("c1ccc(O)cc1")

# MCS via SMSD -- pass RDKit Mol objects directly
result = smsd.mcs_rdkit(mol1, mol2)

# Substructure search
mapping = smsd.substructure_rdkit(mol1, mol2)

# Convert once, reuse with any SMSD function
g = smsd.from_rdkit(mol1)
sim = smsd.similarity(g, smsd.from_rdkit(mol2))

SMSD standalone (no RDKit needed):

import smsd

result = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")
mapping = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
sim = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

CDK (Java) for parsing + SMSD for matching:

import com.bioinception.smsd.core.*;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
var mol1 = sp.parseSmiles("c1ccccc1");
var mol2 = sp.parseSmiles("c1ccc(O)cc1");

SMSD smsd = new SMSD(mol1, mol2, new ChemOptions());
boolean isSub = smsd.isSubstructure();
var mcs = smsd.findMCS();

Performance note: RDKit is an optional dependency; SMSD does not require it. The RDKit helpers convert via a SMILES round-trip, which adds sub-millisecond overhead. For batch workloads, convert each molecule once with from_rdkit() and reuse the resulting MolGraph objects.

Export & Depiction (via RDKit)

Use SMSD for matching, RDKit for visualization and export:

import smsd

# Depict MCS with highlighted atoms (works in Jupyter)
img = smsd.depict_mcs("c1ccccc1", "c1ccc(O)cc1")
img.save("mcs.png")

# Depict substructure match
img = smsd.depict_substructure("c1ccccc1", "c1ccc(O)cc1")

# Generate SVG
svg = smsd.to_svg("c1ccccc1")

# Export to SDF file
mols = [smsd.parse_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
smsd.export_sdf(mols, "output.sdf")

# Convert to RDKit Mol for any RDKit function
rdmol = smsd.to_rdkit(smsd.parse_smiles("c1ccccc1"))

C++ (Header-Only)

git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path — no other dependencies needed

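#include "smsd/smsd.hpp"

int main() {
    // Parse the query (benzene) and the target (phenol)
    auto mol1 = smsd::parseSMILES("c1ccccc1");
    auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");

    // Substructure test, then MCS, both with default options
    bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
    auto mcs   = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});
    return 0;
}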

Build from Source

git clone https://github.com/asad/SMSD.git
cd SMSD

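# Java (from the repository root)
mvn -U clean package

# C++ (from the repository root)
mkdir -p cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)   # on macOS, use: make -j$(sysctl -n hw.ncpu)

# Python (from the repository root)
cd python && pip install -e .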

Docker

docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Benchmarks

MCS: SMSD (pip) vs RDKit 2025.09.2

Same machine, same Python process, best of 5 runs. Full data: benchmarks/results_python.tsv

Pair	Category	SMSD (ms)	RDKit (ms)	SMSD MCS (atoms)	RDKit MCS (atoms)
Cubane (self)	Cage	0.003	0.241	8	8
Coronene (self)	PAH	0.006	0.727	24	24
NAD / NADH	Cofactor	0.012	timeout	44	33
Caffeine / Theophylline	Drug pair	0.016	0.354	13	13
Morphine / Codeine	Alkaloid	0.049	550.5	20	20
Ibuprofen / Naproxen	NSAID	0.069	3.5	15	15
ATP / ADP	Nucleotide	0.085	0.897	27	27
PEG-12 / PEG-16	Polymer	1.6	2.2	40	40
RDKit #1585	Edge case	25.0	timeout	29	24
Paclitaxel / Docetaxel	Taxane	2,405	timeout	56	53

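SMSD is faster on 17 of the 19 benchmarked pairs; the table above lists 10 of them. Speedups range from 1.5x to 11,200x. SMSD found a larger MCS than RDKit for NAD / NADH, RDKit #1585, and Paclitaxel / Docetaxel. timeout = 10 s limit.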
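The "best of 5 runs" protocol can be sketched with the standard library alone. This is a minimal illustration, not the project's benchmark script: `best_of` and `workload` are hypothetical names, and `workload` is a placeholder for whatever call is being measured.

```python
import timeit

def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in milliseconds, over `repeats` single calls."""
    times = timeit.repeat(func, number=1, repeat=repeats)
    return min(times) * 1000.0

def workload():
    # Placeholder for the measured call (for example, one MCS computation).
    return sum(i * i for i in range(10_000))

elapsed_ms = best_of(workload)
print(f"best of 5: {elapsed_ms:.3f} ms")

# A speedup is the ratio of the two best times.
# Morphine / Codeine row from the table: RDKit 550.5 ms, SMSD 0.049 ms.
morphine_speedup = 550.5 / 0.049
print(f"Morphine / Codeine speedup: {morphine_speedup:,.0f}x")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and cache warm-up on very short timings.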

Substructure: SMSD Java (cached) vs CDK DfPattern 2.11

28/28 pairs correct — all match CDK. Cached speedup: 2x-16x faster across all pairs.

Run python benchmarks/benchmark_python_vs_rdkit.py to reproduce the Python MCS benchmark.

Algorithms

MCS Pipeline (11-level funnel)

Level	Algorithm	Based on
L0	Label-frequency upper bound	Degree-aware coverage-driven termination
L0.25	Chain fast-path	O(n*m) DP for linear polymers (PEG, lipids)
L0.5	Tree fast-path	Kilpeläinen-Mannila DP for branched polymers (dendrimers, glycogen)
L0.75	Greedy probe	O(N) fast path for near-identical molecules
L1	Substructure containment	VF2++ check if smaller molecule is subgraph
L1.25	Augmenting path extension	Forced-extension bond growth from substructure seed
L1.5	Seed-and-extend	Bond-growth from rare-label seeds
L2	McSplit + RRSplit	Partition refinement (McCreesh 2017) with maximality pruning
L3	Bron-Kerbosch	Product-graph clique with Tomita pivoting + k-core + orbit pruning
L4	McGregor extension	Forced-assignment bond-grow frontier (McGregor 1982)
L5	Extra seeds	Ring skeleton, heavy-atom core, label-degree anchor seeds
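The idea behind the L0 label-frequency bound can be sketched as follows. This is an illustration under simplifying assumptions (labels are plain element symbols, and the degree-aware refinement is omitted); `label_frequency_upper_bound` is a hypothetical name, not part of the SMSD API. No common substructure can contain more atoms of a given label than the smaller of the two molecules' counts for that label, so summing those minima gives an upper bound on MCS size.

```python
from collections import Counter

def label_frequency_upper_bound(labels_a, labels_b):
    """Upper bound on MCS atom count: sum over labels of the smaller frequency."""
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Counter returns 0 for labels missing from the other molecule.
    return sum(min(n, count_b[label]) for label, n in count_a.items())

# Heavy atoms of caffeine (C8H10N4O2) and theophylline (C7H8N4O2).
caffeine = ["C"] * 8 + ["N"] * 4 + ["O"] * 2
theophylline = ["C"] * 7 + ["N"] * 4 + ["O"] * 2

bound = label_frequency_upper_bound(caffeine, theophylline)
print(bound)  # 13
```

For this pair the bound equals the MCS size of 13 reported in the benchmark table, which is the situation where a search can stop as soon as a mapping of that size is found.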

MCS Variants

Variant	Flag
MCIS (induced)	`induced=true`
MCCS (connected)	default
MCES (edge subgraph)	`maximizeBonds=true`
dMCS (disconnected)	`disconnectedMCS=true`
N-MCS (multi-molecule)	`findNMCS()`
Weighted MCS	`atomWeights`
Scaffold MCS	`findScaffoldMCS()`
Tautomer-aware MCS	`ChemOptions.tautomerProfile()`

Substructure Search (VF2++)

VF2++ (Jüttner & Madarasi 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, bit-parallel candidate domains, and GPU-accelerated domain initialization (CUDA + Metal).
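For orientation, the core of any such matcher is a backtracking search that extends a partial atom mapping one query atom at a time. The sketch below is a plain backtracking matcher with label and degree checks, not VF2++ itself: it has no NLF pruning, no bit-parallel domains, and a much simpler matching order. The graph encoding (label list plus adjacency dict) and all function names are assumptions made for this example, and bond orders and aromaticity are ignored.

```python
def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return one query->target atom mapping, or None if the query does not embed."""
    # Try high-degree query atoms first: they tend to have the fewest candidates.
    order = sorted(range(len(q_labels)), key=lambda a: -len(q_adj[a]))
    mapping, used = {}, set()

    def feasible(q, t):
        # Labels must agree and the target atom needs at least as many neighbours.
        if q_labels[q] != t_labels[t] or len(q_adj[q]) > len(t_adj[t]):
            return False
        # Every already-mapped query neighbour must land on a neighbour of t.
        return all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping)

    def extend(depth):
        if depth == len(order):
            return True
        q = order[depth]
        for t in range(len(t_labels)):
            if t not in used and feasible(q, t):
                mapping[q] = t
                used.add(t)
                if extend(depth + 1):
                    return True
                del mapping[q]      # undo and try the next candidate
                used.discard(t)
        return False

    return dict(mapping) if extend(0) else None

def ring(n):
    """Adjacency dict of an n-membered ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

benzene_labels, benzene_adj = ["C"] * 6, ring(6)
phenol_labels, phenol_adj = ["C"] * 6 + ["O"], ring(6)
phenol_adj[0] = phenol_adj[0] | {6}   # attach the oxygen to ring atom 0
phenol_adj[6] = {0}

match = find_substructure(benzene_labels, benzene_adj, phenol_labels, phenol_adj)
print(match)
```

Benzene embeds in phenol, so `match` is a mapping over all six ring atoms; swapping the arguments returns `None` because phenol has an atom that benzene cannot host.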

Ring Perception

Horton's candidate generation + 2-phase GF(2) elimination (Vismara 1997) for relevant cycles, orbit-based grouping for Unique Ring Families (URFs).

Output	Description
SSSR / MCB	Smallest Set of Smallest Rings / Minimum Cycle Basis
RCB	Relevant Cycle Basis
URF	Unique Ring Families (automorphism orbit grouping)
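One quick sanity check for any SSSR / MCB output is its size, which must equal the cyclomatic number of the molecular graph: bonds minus atoms plus connected components. The sketch below only computes that count; it does not perform Horton candidate generation or the GF(2) elimination. `sssr_size` is a hypothetical helper name.

```python
def sssr_size(num_atoms, bonds):
    """Cyclomatic number E - V + C, i.e. the number of rings in an SSSR / MCB."""
    parent = list(range(num_atoms))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components

# Cubane: 8 carbons on the corners of a cube, 12 bonds.
cubane_bonds = [(0, 1), (1, 2), (2, 3), (3, 0),
                (4, 5), (5, 6), (6, 7), (7, 4),
                (0, 4), (1, 5), (2, 6), (3, 7)]
# Naphthalene: a 10-atom perimeter plus the fusion bond between atoms 0 and 5.
naphthalene_bonds = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]

print(sssr_size(8, cubane_bonds))        # 5
print(sssr_size(10, naphthalene_bonds))  # 2
```

Cubane has six square faces but a basis of only five rings, which is one reason relevant cycles and URFs are reported in addition to a single basis.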

Chemistry Options

Option	Values
Chirality	R/S tetrahedral, E/Z double bond
Isotope	`matchIsotope=true`
Tautomers	15 transforms with pKa-informed weights (Sitzmann 2010)
Solvent	AQUEOUS, DMSO, METHANOL, CHLOROFORM, ACETONITRILE, DIETHYL_ETHER
Ring fusion	IGNORE / PERMISSIVE / STRICT
Bond order	STRICT / LOOSE / ANY
Aromaticity	STRICT / FLEXIBLE
Lenient SMILES	`ParseOptions{.lenient=true}` (C++) / `ChemOptions.lenientSmiles` (Java)

Preset profiles: ChemOptions() (default), .tautomerProfile(), .fmcsProfile() (RDKit-compatible)

Solvent-aware tautomers (Tier 2 pKa): opts.withSolvent(Solvent.DMSO) adjusts tautomer equilibrium weights for non-aqueous environments.

Platform & GPU Support

Platform	CPU	GPU
macOS (Apple Silicon)	OpenMP	Metal (zero-copy unified memory)
Linux	OpenMP	CUDA
Windows	OpenMP	CUDA
Any (no GPU)	OpenMP	Automatic CPU fallback

GPU acceleration covers RASCAL batch screening and domain initialization. Recursive backtracking (VF2++, BK, McSplit) runs on CPU. Dispatch: CUDA -> Metal -> OpenMP -> sequential.
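The dispatch order reads as a simple priority list: take the first backend that is present, and fall back to sequential execution otherwise. A minimal sketch of that selection logic, with hypothetical names (`DISPATCH_ORDER`, `choose_backend`) and with backend detection left to the caller:

```python
# Priority order taken from the text above: CUDA -> Metal -> OpenMP -> sequential.
DISPATCH_ORDER = ("cuda", "metal", "openmp", "sequential")

def choose_backend(available):
    """Return the highest-priority backend found in `available`."""
    for name in DISPATCH_ORDER:
        if name in available:
            return name
    # Sequential execution needs no special support, so it is always a valid fallback.
    return "sequential"

print(choose_backend({"openmp", "metal"}))  # metal
print(choose_backend(set()))                # sequential
```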

Additional Tools

Tool	Description
CIP R/S/E/Z assignment	Full digraph-based stereo descriptors (IUPAC 2013 Rules 1-2)
Circular fingerprint (ECFP/FCFP)	Tautomer-aware Morgan/ECFP with configurable radius (-1 = whole molecule)
Count-based ECFP/FCFP	`ecfpCounts()` / `fcfpCounts()` — superior to binary for ML
Topological Torsion fingerprint	4-atom path with atom typing (SOTA on peptide benchmarks)
Path fingerprint	Graph-aware, tautomer-invariant path enumeration
MCS fingerprint	MCS-aware, auto-sized
Similarity metrics	Tanimoto, Dice, Cosine, Soergel (binary + count-vector)
Fingerprint formats	`toBitSet()`, `toHex()`, `toBinaryString()`, `fromBitSet()`, `fromHex()`
MCS SMILES extraction	`findMcsSmiles()` — extract MCS as canonical SMILES
findAllMCS	Top-N MCS enumeration with canonical SMILES dedup
SMARTS-based MCS	`findMcsSmarts()` — largest substructure matching a SMARTS pattern
R-group decomposition	`decomposeRGroups()`
MatchResult	Structured result: size, mapping, tanimoto, query/target atom counts
RASCAL screening	O(V+E) similarity upper bound
Canonical SMILES / SMARTS	deterministic, toolkit-independent (including `X` total connectivity)
Reaction atom mapping	`mapReaction()`
2D depiction	SVG rendering with atom highlighting
Lenient SMILES parser	Best-effort recovery from malformed SMILES
N-MCS	Multi-molecule MCS with provenance tracking
Tautomer validation	`validateTautomerConsistency()` — proton conservation check
30 tautomer transforms	pKa-informed weights, 6 solvents, pH-sensitive, ring-chain tautomerism
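The similarity metrics listed in the table have standard textbook definitions, sketched below for reference. This is not SMSD's implementation: the function names are hypothetical, binary fingerprints are represented as sets of on-bit indices, and the count-vector Tanimoto uses the min/max form, which is one common generalization among several.

```python
import math

def binary_similarities(a, b):
    """Tanimoto, Dice, Cosine and Soergel for two sets of on-bit indices."""
    a, b = set(a), set(b)
    common, union = len(a & b), len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are dissimilar.
        return {"tanimoto": 0.0, "dice": 0.0, "cosine": 0.0, "soergel": 1.0}
    tanimoto = common / union
    dice = 2 * common / (len(a) + len(b))
    cosine = common / math.sqrt(len(a) * len(b)) if a and b else 0.0
    # Soergel is a distance, not a similarity: the complement of Tanimoto.
    return {"tanimoto": tanimoto, "dice": dice, "cosine": cosine,
            "soergel": 1.0 - tanimoto}

def count_tanimoto(x, y):
    """Min/max Tanimoto for count vectors stored as {feature: count} dicts."""
    keys = set(x) | set(y)
    shared = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    total = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return shared / total if total else 0.0

scores = binary_similarities({1, 2, 3, 4}, {3, 4, 5, 6})
print(scores)
print(count_tanimoto({"a": 2, "b": 1}, {"a": 1, "c": 1}))  # 0.25
```

With two shared bits out of six distinct bits, the binary example gives a Tanimoto of 1/3, Dice and Cosine of 0.5, and a Soergel distance of 2/3.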

File Formats

Format	Read	Write
SMILES	Java, C++	Java, C++
SMARTS	Java, C++	C++
MOL V2000	Java, C++	C++
SDF	Java, C++	—
Mol2, PDB, CML	Java	—

Release Downloads

Every release includes all platforms:

Download	Description
`SMSD.Pro-6.2.1.dmg`	macOS installer (Apple Silicon) — drag to Applications
`SMSD.Pro-6.2.1.msi`	Windows installer — next, next, finish
`smsd-pro_6.2.1_amd64.deb`	Linux installer — `sudo dpkg -i`
`smsd-6.2.1.jar`	Pure library JAR (Maven/Gradle dependency)
`smsd-6.2.1-jar-with-dependencies.jar`	Standalone CLI (just `java -jar`)
`smsd-cpp-6.2.1-headers.tar.gz`	C++ header-only library (unpack, `#include "smsd/smsd.hpp"`)
`pip install smsd`	Python package (PyPI)

# Native installer — download .dmg / .msi / .deb, double-click, done

# CLI
java -jar smsd-6.2.1-jar-with-dependencies.jar --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Docker CLI
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Python
pip install smsd

Tests

1,082 Java tests (7 consolidated suites) — heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases, fast-path validation, solvent corrections
170 C++ tests (3 suites) — 63 core + 91 parser (including SMARTS X primitive) + 16 batch/GPU
1,003 diverse molecules — all parse correctly in C++ SMILES parser
AddressSanitizer — zero memory errors
Python tests — full API coverage including hydrogen handling and charged species

Documentation

Document	Description
WHITEPAPER	Algorithms & design (11-level MCS, VF2++, ring perception)
HOWTO-INSTALL	Build from source guide
NOTICE	Attribution, trademark, and novel algorithm terms

Citation

If you use SMSD Pro in your research, please cite:

Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009. DOI: 10.1186/1758-2946-1-12

GitHub renders a "Cite this repository" button from CITATION.cff.

Author

Syed Asad Rahman — BioInception PVT LTD

License

Apache License 2.0 — see LICENSE and NOTICE
