Computation of measures for clustering comparison (ARI, AMI, NID and
even the χ2 distance) are usually based on the contingency table.
Traditional implementations (e.g., function adjustedRandIndex of
package mclust) are in Ω(n+uv) where
n is the size of the vectors the classifications of which are to be
compared,
u and v are the respective number of classes in each vectors.
In aricode we propose an implementation, based on radix sort, that
is in Θ(n) in time and space. Importantly, the complexity does not depend on u and v. Our
implementation of the ARI for instance is one or two orders of magnitude
faster than some standard implementation in R.
NVI: computes the the normalized variation information
AMI: computes the adjusted mutual information
Chi2: computes the Chi-square statistics
Frobenius compute the Frobenius norm between two classification as
defined in Arlot et al,
2019
entropy: computes the conditional and joint entropies
compare_clustering: computes all clustering comparison measures at
once
sort_pairs: radix sort for pairs of elements
Timings
Here are some timings to compare the cost of computing the adjusted Rand
Index with aricode or with the commonly used function
adjustedRandIndex of the mclust package: the cost of the latter can
be prohibitive for large vectors:
aricode
A package for efficient computations of standard clustering comparison measures
Installation
Stable version on the CRAN.
The development version is available via:
Description
Computation of measures for clustering comparison (ARI, AMI, NID and even the χ2 distance) are usually based on the contingency table. Traditional implementations (e.g., function
adjustedRandIndexof package mclust) are in Ω(n+uv) whereIn aricode we propose an implementation, based on radix sort, that is in Θ(n) in time and space.
Importantly, the complexity does not depend on u and v. Our implementation of the ARI for instance is one or two orders of magnitude faster than some standard implementation in
R.Available measures and functions
The functions included in
aricodeare:(A)RI: computes the (adjusted) Rand indexMARI: computes the modified adjusted rand index as defined in Sundqvist et al, 2023NID: computes the normalized information distanceNMI: computes the normalized mutual informationNVI: computes the the normalized variation informationAMI: computes the adjusted mutual informationChi2: computes the Chi-square statisticsFrobeniuscompute the Frobenius norm between two classification as defined in Arlot et al, 2019entropy: computes the conditional and joint entropiescompare_clustering: computes all clustering comparison measures at oncesort_pairs: radix sort for pairs of elementsTimings
Here are some timings to compare the cost of computing the adjusted Rand Index with aricode or with the commonly used function
adjustedRandIndexof the mclust package: the cost of the latter can be prohibitive for large vectors: