fastcluster: Fast hierarchical clustering routines for R and Python
The fastcluster package is a C++ library for hierarchical, agglomerative
clustering. It efficiently implements the seven most widely used clustering
schemes: single, complete, average, weighted/McQuitty, Ward, centroid and
median linkage. The library has interfaces to two languages: R and Python.
The Python module is designed to replace the functions
linkage, single, complete, average, weighted, centroid, median, ward
in the module scipy.cluster.hierarchy with the same functionality but
faster algorithms. Moreover, the function linkage_vector provides
memory-efficient clustering for vector data.
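As a minimal sketch of what the drop-in replacement could look like, the hypothetical helper below (cluster_both_ways is not part of either library) runs scipy.cluster.hierarchy.linkage and fastcluster.linkage on the same compressed distance matrix. It assumes that NumPy, SciPy and fastcluster are installed; the imports sit inside the function so that the definition itself loads without them.

```python
def cluster_both_ways(points, method="average"):
    """Cluster the same observations with SciPy and with fastcluster.

    Hypothetical helper for illustration; `points` is an (n x d) array.
    """
    # Imported lazily so that defining the helper does not require the packages.
    import fastcluster
    import scipy.cluster.hierarchy as sch
    from scipy.spatial.distance import pdist

    distances = pdist(points)  # compressed distance matrix, Euclidean by default
    z_scipy = sch.linkage(distances, method=method)
    z_fast = fastcluster.linkage(distances, method=method)
    # Both results are (n - 1) x 4 linkage matrices in SciPy's format.
    return z_scipy, z_fast
```

Because the output format is shared, either result can be passed on to functions such as scipy.cluster.hierarchy.fcluster or dendrogram.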
The R package is meant to replace hclust in the stats package
and the flashClust package.
See the author’s home page for more information, in particular a performance
comparison with other clustering packages. The user’s manual is the file
docs/fastcluster.pdf in the source distribution.
The Python package can be installed from PyPI (conveniently with pip),
GitHub, or from the source package at CRAN. All distributions compile and
install the same Python/C++ libraries.
Note: The following sections describe the Python interface. See the R package on
CRAN for the documentation of the R interface.
The argument X of linkage and of the method-specific functions named above is
either a compressed distance matrix or a collection of n observation vectors in
d dimensions as an (n×d) array. Apart from the argument preserve_input, these
functions have the same input and output as the functions of the same name in
the module scipy.cluster.hierarchy.
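To make the term "compressed distance matrix" concrete, the sketch below assumes the layout produced by scipy.spatial.distance.pdist: the n(n-1)/2 pairwise distances for i < j, stored row by row in one flat array. The helper names are illustrative and not part of fastcluster.

```python
import math


def condensed_distances(points):
    """Return the pairwise Euclidean distances for i < j as a flat list."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]


def condensed_index(n, i, j):
    """Position of the pair (i, j) in a compressed matrix of n observations."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # the matrix is symmetric, only i < j is stored
    return n * i + j - (i + 1) * (i + 2) // 2


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
condensed = condensed_distances(points)  # order of pairs: (0,1), (0,2), (1,2)
print(condensed[condensed_index(len(points), 0, 2)])
```

With NumPy and SciPy available, pdist(X) yields the same values as a one-dimensional array, which is the form the clustering functions expect.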
The optional argument preserve_input specifies whether the fastcluster package
first copies the distance matrix or writes into the existing array. If the
dissimilarities are generated for the clustering step only and are not
needed afterward, approximately half the memory can be saved by specifying
preserve_input=False. Note that the input array X contains unspecified
values after this procedure. You may want to write
linkage(X, method='…', preserve_input=False)
del X
to make sure that the matrix X is not accessed accidentally after it has been
used as scratch memory.
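The "approximately half" figure follows from simple arithmetic: a compressed distance matrix for n observations has n(n-1)/2 entries, and a working copy doubles that. The sketch below assumes 8-byte double-precision entries and ignores all other working memory, so the numbers are rough estimates rather than measurements. Both helper names are hypothetical.

```python
def condensed_matrix_bytes(n, itemsize=8):
    """Bytes occupied by a compressed distance matrix for n observations.

    Assumes `itemsize` bytes per entry (8 for double precision).
    """
    return n * (n - 1) // 2 * itemsize


def peak_distance_bytes(n, preserve_input=True):
    """Rough estimate: the input matrix plus a working copy if it is preserved."""
    copies = 2 if preserve_input else 1
    return copies * condensed_matrix_bytes(n)


n = 20_000
for flag in (True, False):
    gib = peak_distance_bytes(n, preserve_input=flag) / 2**30
    print(f"preserve_input={flag}: about {gib:.2f} GiB for distances")
```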
The function linkage_vector provides memory-saving clustering for vector data.
Its first parameter is a collection of n observation vectors in d dimensions as
an (n×d) array. The parameter method is either single, ward, centroid or
median. The ward, centroid and median methods require the Euclidean
metric. For single linkage, the metric parameter can be chosen from
all metrics implemented in scipy.spatial.distance.pdist. There may be differences between
linkage(scipy.spatial.distance.pdist(X, metric='…'))
and
linkage_vector(X, metric='…')
since a few corrections have been made compared to the pdist function. Please
consult the user’s manual for comprehensive details.
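A minimal usage sketch, assuming NumPy, SciPy and fastcluster are installed: the hypothetical helper single_linkage_two_ways (not part of fastcluster) runs single linkage once through a compressed distance matrix and once directly on the observation vectors. Only the second route avoids allocating the n(n-1)/2 pairwise distances.

```python
def single_linkage_two_ways(points, metric="euclidean"):
    """Single linkage via a distance matrix and via linkage_vector.

    Hypothetical helper for illustration; `points` is an (n x d) array.
    """
    # Imported lazily so that defining the helper does not require the packages.
    import fastcluster
    from scipy.spatial.distance import pdist

    # Route 1: materialise all pairwise distances, then cluster them.
    z_matrix = fastcluster.linkage(pdist(points, metric=metric), method="single")
    # Route 2: cluster the (n x d) array directly, without a distance matrix.
    z_vector = fastcluster.linkage_vector(points, method="single", metric=metric)
    return z_matrix, z_vector
```

For metrics other than the Euclidean one, the caveat above about differences from pdist applies, so it is worth comparing the two outputs before treating them as interchangeable.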
The fastcluster package is distributed under the BSD license. See the file
LICENSE in the source distribution.
Citation
To cite fastcluster in publications, please use:
Daniel Müllner, fastcluster: Fast Hierarchical, Agglomerative Clustering
Routines for R and Python, Journal of Statistical Software, 53 (2013), no. 9,
1–18, https://doi.org/10.18637/jss.v053.i09.
Distribution
The distributions on GitHub and PyPI contain only the files for the Python interface. The full source distribution with both interfaces is available on CRAN: https://CRAN.R-project.org/package=fastcluster.