R package dbscan - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms
Introduction
This R package (Hahsler, Piekenbrock, and Doran
2019) provides a fast C++ (re)implementation of
several density-based algorithms with a focus on the DBSCAN family for
clustering spatial data. The package includes:
Clustering
DBSCAN: Density-based spatial clustering of applications with
noise (Ester et al. 1996).
Jarvis-Patrick Clustering: Clustering using a similarity measure
based on shared near neighbors (Jarvis and Patrick
1973).
HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction
(Campello et al. 2015).
FOSC: Framework for optimal selection of clusters for unsupervised
and semisupervised clustering of hierarchical cluster tree (Campello,
Moulavi, and Sander 2013).
OPTICS/OPTICSXi: Ordering points to identify the clustering
structure and cluster extraction methods (Ankerst et al.
1999).
The implementations use the kd-tree data structure (from library ANN)
for faster k-nearest neighbor search, and are for Euclidean distance
typically faster than the native R implementations (e.g., dbscan in
package fpc), or the implementations in
WEKA,
ELKI and Python’s
scikit-learn.
@Article{,
title = {{dbscan}: Fast Density-Based Clustering with {R}},
author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran},
journal = {Journal of Statistical Software},
year = {2019},
volume = {91},
number = {1},
pages = {1--30},
doi = {10.18637/jss.v091.i01},
}
Installation
Stable CRAN version: Install from within R with
install.packages("dbscan")
Current development version: Install from
r-universe.
The dbscan package is licensed under the GNU General Public License
(GPL) Version 3. The
OPTICSXi R implementation was directly ported from the ELKI
framework’s Java implementation (GNU AGPLv3), with permission by the
original author, Erich Schubert.
Ankerst, Mihael, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander.
1999. “OPTICS: Ordering Points to Identify the Clustering Structure.” In
ACM Sigmod Record, 28:49–60. 2. ACM.
https://doi.org/10.1145/304181.304187.
Breunig, Markus M, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander.
2000. “LOF: Identifying Density-Based Local Outliers.” In ACM Int.
Conf. On Management of Data, 29:93–104. 2. ACM.
https://doi.org/10.1145/335191.335388.
Campello, Ricardo JGB, Davoud Moulavi, and Jörg Sander. 2013.
“Density-Based Clustering Based on Hierarchical Density Estimates.” In
Pacific-Asia Conference on Knowledge Discovery and Data Mining,
160–72. Springer. https://doi.org/10.1007/978-3-642-37456-2_14.
Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg Sander.
2015. “Hierarchical Density Estimates for Data Clustering,
Visualization, and Outlier Detection.” ACM Transactions on Knowledge
Discovery from Data (TKDD) 10 (1): 5.
https://doi.org/10.1145/2733381.
Ertöz, Levent, Michael Steinbach, and Vipin Kumar. 2003. “Finding
Clusters of Different Sizes, Shapes, and Densities in Noisy, High
Dimensional Data.” In Proceedings of the 2003 SIAM International
Conference on Data Mining (SDM), 47–58.
https://doi.org/10.1137/1.9781611972733.5.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996.
“A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise.” In Proceedings of 2nd International Conference
on Knowledge Discovery and Data Mining (KDD-96), 226–31.
https://dl.acm.org/doi/10.5555/3001460.3001507.
Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019.
“dbscan: Fast Density-Based Clustering with
R.” Journal of Statistical Software 91 (1): 1–30.
https://doi.org/10.18637/jss.v091.i01.
Jarvis, R. A., and E. A. Patrick. 1973. “Clustering Using a Similarity
Measure Based on Shared Near Neighbors.” IEEE Transactions on
Computers C-22 (11): 1025–34.
https://doi.org/10.1109/T-C.1973.223640.
Moulavi, Davoud, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur
Zimek, and Jörg Sander. 2014. “Density-Based Clustering Validation.” In
Proceedings of the 2014 SIAM International Conference on Data Mining
(SDM), 839–47. https://doi.org/10.1137/1.9781611973440.96.
Introduction
This R package (Hahsler, Piekenbrock, and Doran 2019) provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:
Clustering
Outlier Detection
Cluster Evaluation
Fast Nearest-Neighbor Search (using kd-trees)
The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are for Euclidean distance typically faster than the native R implementations (e.g., dbscan in package
fpc), or the implementations in WEKA, ELKI and Python’s scikit-learn.The following R packages use
dbscan: AnimalSequences, bioregion, clayringsmiletus, CLONETv2, clusterWebApp, cordillera, CPC, crosshap, crownsegmentr, CspStandSegmentation, daltoolbox, DataSimilarity, diceR, dobin, doc2vec, dPCP, emcAdr, eventstream, evprof, fastml, FCPS, flowcluster, funtimes, FuzzyDBScan, HaploVar, immunaut, karyotapR, ksharp, LLMing, LOMAR, maotai, MapperAlgo, metaCluster, metasnf, mlr3cluster, neuroim2, oclust, omicsTools, openSkies, opticskxi, OTclust, outlierensembles, outlierMBC, pagoda2, parameters, ParBayesianOptimization, performance, PiC, rcrisp, rMultiNet, seriation, sfdep, sfnetworks, sharp, smotefamily, snap, spdep, spNetwork, ssMRCD, stream, SuperCell, synr, tidySEM, VBphenoR, VIProDesign, weirdTo cite package ‘dbscan’ in publications use:
Installation
Stable CRAN version: Install from within R with
Current development version: Install from r-universe.
Usage
Load the package and use the numeric variables in the iris dataset
DBSCAN
Visualize the resulting clustering (noise points are shown in black).
OPTICS
Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)
HDBSCAN
Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.
Using dbscan with tidyverse
dbscanprovides for all clustering algorithmstidy(),augment(), andglance()so they can be easily used with tidyverse, ggplot2 and tidymodels.Get cluster statistics as a tibble
Visualize the clustering with ggplot2 (use an x for noise points)
Using dbscan from Python
R, the R package
dbscan, and the Python packagerpy2need to be installed.License
The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework’s Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.
Changes
References
Ankerst, Mihael, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. “OPTICS: Ordering Points to Identify the Clustering Structure.” In ACM Sigmod Record, 28:49–60. 2. ACM. https://doi.org/10.1145/304181.304187.
Breunig, Markus M, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. “LOF: Identifying Density-Based Local Outliers.” In ACM Int. Conf. On Management of Data, 29:93–104. 2. ACM. https://doi.org/10.1145/335191.335388.
Campello, Ricardo JGB, Davoud Moulavi, and Jörg Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 160–72. Springer. https://doi.org/10.1007/978-3-642-37456-2_14.
Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg Sander. 2015. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 10 (1): 5. https://doi.org/10.1145/2733381.
Ertöz, Levent, Michael Steinbach, and Vipin Kumar. 2003. “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data.” In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM), 47–58. https://doi.org/10.1137/1.9781611972733.5.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–31. https://dl.acm.org/doi/10.5555/3001460.3001507.
Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software 91 (1): 1–30. https://doi.org/10.18637/jss.v091.i01.
Jarvis, R. A., and E. A. Patrick. 1973. “Clustering Using a Similarity Measure Based on Shared Near Neighbors.” IEEE Transactions on Computers C-22 (11): 1025–34. https://doi.org/10.1109/T-C.1973.223640.
Moulavi, Davoud, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek, and Jörg Sander. 2014. “Density-Based Clustering Validation.” In Proceedings of the 2014 SIAM International Conference on Data Mining (SDM), 839–47. https://doi.org/10.1137/1.9781611973440.96.