TreeCluster is a tool that, given a tree T (Newick format) and a distance threshold t, finds the minimum number of clusters of the leaves of T such that some user-specified constraint is met in each cluster. The user can also specify a branch support threshold s such that no pair of leaves in any cluster can be connected by branches with support less than or equal to s. Note that all leaves given a cluster of -1 are singletons, meaning they did not cluster with any other leaves (i.e., each leaf with a cluster of -1 is in its own cluster).
The default method is “Max Clade” (see Clustering Methods). There is no explicit default distance threshold, but because Cluster Picker recommends a distance threshold of 0.045 and because the same objective function is optimized by both Cluster Picker and TreeCluster “Max Clade”, we currently recommend 0.045 as well.
Note that TreeCluster can run within seconds even on ultra-large datasets, so it may make sense to use a range of thresholds and determine the appropriate choice based on the results. We intend to develop non-parametric methods of TreeCluster clustering in the future.
Installation
TreeCluster can be installed using pip:
sudo pip install treecluster
If you are using a machine on which you lack administrative powers, TreeCluster can be installed locally using pip:
To help users, we have provided example files in the example directory, and we have provided some helper scripts that implement common clustering-related tasks in the helper_scripts directory:
Avg Clade: Cluster the leaves such that the following conditions hold for each cluster:
The average pairwise distance between leaves in the cluster is at most t
Leaves cannot be connected by branches with support less than or equal to s
The leaves in the cluster must define a clade in T
For a tree with n leaves, this algorithm is O(n) in the worst case
If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
Leaf Dist Avg: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where “bottom” is defined as the average root-to-tip distance
Branches with support less than or equal to s are simply treated as infinitely long
For a tree with n leaves, this algorithm is O(n) in the worst case
Leaf Dist Max: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where “bottom” is defined as the furthest leaf from the root
Branches with support less than or equal to s are simply treated as infinitely long
For a tree with n leaves, this algorithm is O(n) in the worst case
Leaf Dist Min: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where “bottom” is defined as the closest leaf to the root
Branches with support less than or equal to s are simply treated as infinitely long
For a tree with n leaves, this algorithm is O(n) in the worst case
Length: Cluster the leaves such that the following conditions hold for each cluster:
The cluster does not contain any edges above length t
Leaves cannot be connected by branches with support less than or equal to s
For a tree with n leaves, this algorithm is O(n) in the worst case
Length Clade: Cluster the leaves such that the following conditions hold for each cluster:
The cluster does not contain any edges above length t
Leaves cannot be connected by branches with support less than or equal to s
The leaves in the cluster must define a clade in T
For a tree with n leaves, this algorithm is O(n) in the worst case
If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
Max: Cluster the leaves such that the following conditions hold for each cluster:
The maximum pairwise distance between leaves in the cluster is at most t
Leaves cannot be connected by branches with support less than or equal to s
For a tree with n leaves, this algorithm is O(n) in the worst case
Max Clade: Cluster the leaves such that the following conditions hold for each cluster:
The maximum pairwise distance between leaves in the cluster is at most t
Leaves cannot be connected by branches with support less than or equal to s
The leaves in the cluster must define a clade in T
For a tree with n leaves, this algorithm is O(n) in the worst case
If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
Med Clade: Cluster the leaves such that the following conditions hold for each cluster:
The median pairwise distance between leaves in the cluster is at most t
Leaves cannot be connected by branches with support less than or equal to s
The leaves in the cluster must define a clade in T
For a tree with n leaves, this algorithm is O(n² log n) in the worst case
If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
Root Dist: Cluster the leaves by cutting the tree at t distance away from the root
Branches with support less than or equal to s are simply treated as infinitely long
For a tree with n leaves, this algorithm is O(n) in the worst case
Single Linkage: Cluster the leaves such that the following conditions hold:
For any two leaves u and v, if the distance between u and v is at most t, they must be in the same cluster
Leaves cannot be connected by branches with support less than or equal to s
The number of clusters is maximized
For a tree with n leaves, this algorithm is O(n) in the worst case
Sum Branch: Cluster the leaves such that the following conditions hold for each cluster:
The total branch length of the spanning tree connecting the leaves of the cluster is at most t
Leaves cannot be connected by branches with support less than or equal to s
For a tree with n leaves, this algorithm is O(n) in the worst case
Sum Branch Clade: Cluster the leaves such that the following conditions hold for each cluster:
The total branch length of the spanning tree connecting the leaves of the cluster is at most t
Leaves cannot be connected by branches with support less than or equal to s
The leaves in the cluster must define a clade in T
For a tree with n leaves, this algorithm is O(n) in the worst case
If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
Threshold-Free Approaches
Argmax Clusters: Choose the threshold that maximizes the number of non-singleton clusters over all thresholds from 0 to t
Currently, for the sake of speed, only every 0.001 threshold is tested (i.e., 0, 0.001, 0.002, …, t)
Balaban M, Moshiri N, Mai U, Jia X, Mirarab S (2019). “TreeCluster: Clustering biological sequences using phylogenetic trees.” PLoS ONE. 14(8):e0221068. doi:10.1371/journal.pone.0221068
TreeCluster
TreeCluster is a tool that, given a tree T (Newick format) and a distance threshold t, finds the minimum number of clusters of the leaves of T such that some user-specified constraint is met in each cluster. The user can also specify a branch support threshold s such that no pair of leaves in any cluster can be connected by branches with support less than or equal to s. Note that all leaves given a cluster of -1 are singletons, meaning they did not cluster with any other leaves (i.e., each leaf with a cluster of -1 is in its own cluster).
TreeCluster was motivated by Cluster Picker.
The default method is “Max Clade” (see Clustering Methods). There is no explicit default distance threshold, but because Cluster Picker recommends a distance threshold of 0.045 and because the same objective function is optimized by both Cluster Picker and TreeCluster “Max Clade”, we currently recommend 0.045 as well.
Note that TreeCluster can run within seconds even on ultra-large datasets, so it may make sense to use a range of thresholds and determine the appropriate choice based on the results. We intend to develop non-parametric methods of TreeCluster clustering in the future.
Installation
TreeCluster can be installed using
pip:If you are using a machine on which you lack administrative powers, TreeCluster can be installed locally using
pip:TreeCluster can also be installed using conda:
Usage
Example Files and Helper Scripts
To help users, we have provided example files in the
exampledirectory, and we have provided some helper scripts that implement common clustering-related tasks in thehelper_scriptsdirectory:helper_scripts/score_clusters.py: Given two clustering files, calculate a comparison metric between themClustering Methods
Avg Clade: Cluster the leaves such that the following conditions hold for each cluster:
-v), the clades defined by the clusters will be printed to standard errorLeaf Dist Avg: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where “bottom” is defined as the average root-to-tip distance
Leaf Dist Max: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where “bottom” is defined as the furthest leaf from the root
Leaf Dist Min: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where “bottom” is defined as the closest leaf to the root
Length: Cluster the leaves such that the following conditions hold for each cluster:
Length Clade: Cluster the leaves such that the following conditions hold for each cluster:
-v), the clades defined by the clusters will be printed to standard errorMax: Cluster the leaves such that the following conditions hold for each cluster:
Max Clade: Cluster the leaves such that the following conditions hold for each cluster:
-v), the clades defined by the clusters will be printed to standard errorMed Clade: Cluster the leaves such that the following conditions hold for each cluster:
-v), the clades defined by the clusters will be printed to standard errorRoot Dist: Cluster the leaves by cutting the tree at t distance away from the root
Single Linkage: Cluster the leaves such that the following conditions hold:
Sum Branch: Cluster the leaves such that the following conditions hold for each cluster:
Sum Branch Clade: Cluster the leaves such that the following conditions hold for each cluster:
-v), the clades defined by the clusters will be printed to standard errorThreshold-Free Approaches
Requirements
Citing TreeCluster
If you use TreeCluster in your work, please cite: