Clustering based on density with variable density clusters

HDBSCAN

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning – and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it’s a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017 [pdf]

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, is available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/.

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works and comparing performance with other python clustering implementations are available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features); an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

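# Generate 1000 sample points drawn from Gaussian blobs
data, _ = make_blobs(1000)

# Fit the clusterer and get one integer cluster label per sample
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)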

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. Version 0.7 performance can be seen in this notebook. In particular, performance on low dimensional data is better than sklearn’s DBSCAN, and via support for caching with joblib, re-clustering with different parameters can be almost free.

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting data the clusterer object has attributes for:

  • The condensed cluster hierarchy

  • The robust single linkage cluster hierarchy

  • The reachability distance minimal spanning tree

All of which come equipped with methods for plotting and converting to Pandas or NetworkX for further analysis. See the notebook on how HDBSCAN works for examples and further details.

The clusterer objects also have an attribute providing cluster membership strengths, resulting in optional soft clustering (and no further compute expense). Finally each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier like objects. Selecting outliers via upper quantiles is often a good approach.

Based on the paper:

R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation this is a high performance version of the algorithm outperforming scipy’s standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()

Based on the paper:

K. Chaudhuri and S. Dasgupta. “Rates of convergence for the cluster tree.” In Advances in Neural Information Processing Systems, 2010.

Branch detection

The hdbscan package supports a branch-detection post-processing step by Bot et al. Cluster shapes, such as branching structures, can reveal interesting patterns that are not expressed in density-based cluster hierarchies. The BranchDetector class mimics the HDBSCAN API and can be used to detect branching hierarchies in clusters. It provides condensed branch hierarchies, branch persistences, and branch memberships, and supports joblib’s caching functionality. A notebook demonstrating the BranchDetector is available.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(branch_detection_data=True).fit(data)
branch_detector = hdbscan.BranchDetector().fit(clusterer)
branch_detector.cluster_approximation_graph_.plot(edge_width=0.1)

Based on the paper:

D.M. Bot, J. Peeters, J. Liesenborgs and J. Aerts FLASC: a flare-sensitive clustering algorithm. PeerJ Computer Science, Vol 11, April 2025, e2792. https://doi.org/10.7717/peerj-cs.2792.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have an up to date pip:

pip install hdbscan

Binary wheels for a number of platforms are available thanks to the work of Ryan Helinski <rlhelinski@gmail.com>.

If pip has difficulty pulling the dependencies, we suggest first upgrading pip to at least version 10 and trying again:

pip install --upgrade pip
pip install hdbscan

Otherwise install the dependencies manually using anaconda followed by pulling hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install of the latest code directly from GitHub:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

Alternatively download the package, install requirements, and manually run the installer:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

pip install -r requirements.txt

python setup.py install

Running the Tests

The package tests can be run after installation using the command:

nosetests -s hdbscan

or, if nose is installed but nosetests is not in your PATH variable:

python -m nose -s hdbscan

If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new

Python Version

The hdbscan library supports both Python 2 and Python 3. However we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on GitHub. Finally, if no solution is available there, feel free to open an issue; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

To reference the high performance algorithm developed in this library please cite our paper in ICDMW 2017 proceedings.

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

@inproceedings{mcinnes2017accelerated,
  title={Accelerated Hierarchical Density Based Clustering},
  author={McInnes, Leland and Healy, John},
  booktitle={Data Mining Workshops (ICDMW), 2017 IEEE International Conference on},
  pages={33--42},
  year={2017},
  organization={IEEE}
}

If you used the branch-detection functionality in this library please cite our PeerJ paper:

Bot DM, Peeters J, Liesenborgs J, Aerts J. FLASC: a flare-sensitive clustering algorithm. In: PeerJ Computer Science, Volume 11, e2792, 2025. https://doi.org/10.7717/peerj-cs.2792

@article{bot2025flasc,
    title   = {{FLASC: a flare-sensitive clustering algorithm}},
    author  = {Bot, Dani{\"{e}}l M. and Peeters, Jannes and Liesenborgs, Jori and Aerts, Jan},
    year    = {2025},
    month   = {apr},
    journal = {PeerJ Comput. Sci.},
    volume  = {11},
    pages   = {e2792},
    issn    = {2376-5992},
    doi     = {10.7717/peerj-cs.2792},
    url     = {https://peerj.com/articles/cs-2792},
}

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.
