This is the GitHub repository for the program Clumppling (CLUster Matching and Permutation Program that uses integer Linear programmING), a framework for aligning mixed-membership clustering results of population structure analysis.
Current version v 2.0 (Last update: Nov 2025)
This README provides a quick-start guide for installation and use. See the software manual for full details.
Refer to this tutorial for a brief guide on running an end-to-end analysis, including data preparation, population structure analysis, cluster alignment (using Clumppling), and visualization.
Feature Highlights
Flexible input parsing compatible with popular ancestry inference software such as STRUCTURE and ADMIXTURE.
Clustering alignment within and across various K values (i.e., the number of ancestries).
Mode detection in clustering results for identifying and summarizing distinct solutions.
Visualization of alignment patterns and aligned modes in a connected graph layout.
Modular design for easy integration.
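To make the idea of "cluster alignment" concrete, here is a minimal sketch of matching the columns of one membership matrix to another. It is not Clumppling's implementation: the program uses integer linear programming, whereas this sketch swaps in a brute-force search over permutations, which is only practical for small K. The function names and the toy matrices are hypothetical.

```python
from itertools import permutations


def align_clusters(reference, target):
    """Find the column order of `target` that best matches `reference`.

    Both inputs are lists of rows (one row per individual, one column per
    cluster). Returns (perm, cost), where perm[k] is the column of `target`
    matched to column k of `reference`, and cost is the summed squared
    difference under that matching.
    """
    n_clusters = len(reference[0])
    best_perm, best_cost = None, float("inf")
    # Brute force: try every permutation of cluster labels (fine for small K).
    for perm in permutations(range(n_clusters)):
        cost = sum(
            (ref_row[k] - tgt_row[perm[k]]) ** 2
            for ref_row, tgt_row in zip(reference, target)
            for k in range(n_clusters)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


def apply_permutation(matrix, perm):
    """Reorder the columns of `matrix` according to `perm`."""
    return [[row[j] for j in perm] for row in matrix]


# Toy example: q_swapped holds the same memberships as q_ref with relabeled clusters.
q_ref = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.1, 0.9]]
q_swapped = [[0.0, 0.9, 0.1], [0.1, 0.2, 0.7], [0.9, 0.0, 0.1]]

perm, cost = align_clusters(q_ref, q_swapped)
aligned = apply_permutation(q_swapped, perm)
print(perm, cost)
```

After alignment, `aligned` has its clusters in the same order as `q_ref`, so the two replicates can be compared column by column.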
Usage
There are two ways to run Clumppling.
You can run it remotely on the server, which does not require downloading or installing the program locally. The remote version provides the core functionalities of the program. Check out the Remote Notebook section.
You can download and install the Python package onto your local machine and run the program locally. The local version provides an extended list of functionalities (see the pdf Manual for details). Check out the Local Installation section.
Remote Notebook
The remote version is available through an online Colaboratory notebook, which is a Jupyter notebook that runs in the cloud served by Google. If you are interested, more details about Colab notebooks can be found at https://colab.google/.
There is no need to download and install the program locally.
To run Clumppling remotely, click on THIS LINK, which will bring you to the notebook. Next, open the notebook in Colab and follow the instructions in the notebook.
One by one, click the run button (a small round button with a triangle in the middle) to the left of each block.
Upload input files (e.g., the example files provided here) as a zip folder, specify the input data format, and change input parameters (if needed) following the instructions.
You will be able to download a zipped file containing the alignment results at the end of the notebook.
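The notebook expects the input files as a single zip archive. If you prefer to prepare that archive from Python rather than with a file manager, a minimal sketch using only the standard library could look like this. The function name `zip_input_folder` is hypothetical and not part of Clumppling.

```python
import shutil
from pathlib import Path


def zip_input_folder(folder):
    """Zip `folder` into `<folder>.zip` next to it and return the archive path."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a directory")
    # make_archive appends ".zip" to the base name; base_dir keeps the
    # folder name as the top-level entry inside the archive.
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=folder.parent, base_dir=folder.name
    )
    return Path(archive)
```

For example, `zip_input_folder("input/capeverde")` would create `input/capeverde.zip`, which can then be uploaded in the notebook.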
Local Installation
The local version requires downloading and installing the program to your local machine.
1. Use a command line interpreter (i.e., a shell)
Linux and macOS users can use the built-in Terminal.
For Windows users, you will need to obtain a terminal. For example:
After you follow Step 3 to install Conda, you can use the built-in Anaconda Prompt available from the Anaconda Navigator. Note that the installation of Python and Conda on Windows only requires running the installers and there is no need for running commands in the command window.
2. Install Python (Version >=3.9,<3.13)
You can download the Python installer from https://www.python.org/downloads/.
For Windows users, go to https://www.python.org/downloads/windows/ to download the installer corresponding to your operating system, e.g., Windows installer (64-bit). Run the executable installer and check the box ‘Add Python to environment variables’ during the installation.
For macOS users, go to https://www.python.org/downloads/macos/ to download the macOS 64-bit universal2 installer and double-click on the python--macosx.pkg file to start the Python installer.
For Linux users, if Python is not pre-installed, you can install it via command lines (sudo yum install -y python3 for CentOS and Red Hat Linux and sudo apt-get install python3 for all other Linux systems).
You can verify the installation by running
python --version
in the command line interpreter, which should give you the version of the installed Python (>=3.9,<3.13 required).
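The same check can be done from inside Python. The sketch below compares the running interpreter against the required range (>=3.9,<3.13); the helper name `python_version_supported` is hypothetical.

```python
import sys


def python_version_supported(version_info=None):
    """Return True if the (major, minor) version is >=3.9 and <3.13."""
    major_minor = tuple((version_info or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 13)


if __name__ == "__main__":
    print(sys.version.split()[0], "supported:", python_version_supported())
```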
3. Install conda and create a virtual environment
Go to https://www.anaconda.com/download to download the conda installer and run the installer. Conda is a popular package management system and environment management system.
A virtual environment is “a Python environment such that the Python interpreter, libraries and scripts installed into it are isolated from those installed in other virtual environments, and (by default) any libraries installed in a ‘system’ Python, i.e., one which is installed as part of your operating system.”
Using a virtual environment helps to keep the dependencies required by different projects separate and to avoid conflicts between projects.
Create a virtual environment named clumppling-env (feel free to specify your own name) by typing the following command in the command-line interpreter
conda create -n clumppling-env python=3.12
Activate the virtual environment by
conda activate clumppling-env
4. Install the Clumppling package
(1) Install the package
Usually, pip is installed automatically when you install Python. If it is not yet available on your system, follow the instructions at https://pip.pypa.io/en/stable/installation/ to install it.
Then run the following command to install the package:
pip install clumppling
Alternatively, you may choose to install the package in one of the two other ways:
If you have Git installed, run pip install git+https://github.com/PopGenClustering/Clumppling.
If you don’t have Git, run pip install https://github.com/PopGenClustering/Clumppling/archive/master.zip.
(2) Download the example files from the examples directory in the GitHub repository
For each zipped example dataset, unzip the files into a folder with the same name as the zip file, and put it inside a folder called “input” under a path of your choice.
More will be discussed in the section How to run (with example data).
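The unzip step can also be scripted. This is a minimal sketch, assuming the downloaded zip files are on your machine; `unzip_example` is a hypothetical helper, and the default `input` folder follows the description above.

```python
import zipfile
from pathlib import Path


def unzip_example(zip_path, input_root="input"):
    """Extract `zip_path` into `<input_root>/<zip name without .zip>/`."""
    zip_path = Path(zip_path)
    target = Path(input_root) / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # If the archive already contains a top-level folder of the same
        # name, the files will end up one level deeper than `target`.
        archive.extractall(target)
    return target
```

For example, `unzip_example("capeverde.zip")` would place the files under `input/capeverde/`.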
5. Check whether the installation is successful
Run the following command:
python -m clumppling -h
If the installation was successful, you should see the usage of the program in the command window. The usage tells you the required and optional arguments to the program. It should look like:
usage: __main__.py [-h] -i INPUT -o OUTPUT -f {generalQ,admixture,structure,fastStructure} [-v VIS]
[--custom_cmap CUSTOM_CMAP] [--plot_type {graph,list,withinK,major,all}] [--include_cost INCLUDE_COST]
[--include_label INCLUDE_LABEL] [--alt_color ALT_COLOR] [--ind_labels IND_LABELS]
[--ordered_uniq_labels ORDERED_UNIQ_LABELS] [--regroup_ind REGROUP_IND]
[--reorder_within_group REORDER_WITHIN_GROUP] [--reorder_by_max_k REORDER_BY_MAX_K]
[--order_cls_by_label ORDER_CLS_BY_LABEL] [--plot_unaligned PLOT_UNALIGNED]
[--fig_format {png,jpg,jpeg,tif,tiff,svg,pdf,eps,ps,bmp,gif}] [--extension EXTENSION]
[--skip_rows SKIP_ROWS] [--remove_missing REMOVE_MISSING]
[--cd_method {louvain,leiden,infomap,markov_clustering,label_propagation,walktrap,custom}]
[--cd_res CD_RES] [--test_comm TEST_COMM] [--comm_min COMM_MIN] [--comm_max COMM_MAX] [--merge MERGE]
[--use_rep USE_REP] [--use_best_pair USE_BEST_PAIR]
Clumppling: a tool for cluster matching and permutation program with integer linear programming
required arguments:
-i INPUT, --input INPUT
Input file path
-o OUTPUT, --output OUTPUT
Output file directory
-f {generalQ,admixture,structure,fastStructure}, --format {generalQ,admixture,structure,fastStructure}
File format
optional arguments:
-v VIS, --vis VIS Whether to generate figure(s): True (default)/False
--custom_cmap CUSTOM_CMAP
A plain text file containing customized colors (one per line; in hex code): if empty (default),
using the default colormap, otherwise use the user-specified colormap
--plot_type {graph,list,withinK,major,all}
Type of plot to generate: 'graph' (default), 'list', 'withinK', 'major', 'all'
--include_cost INCLUDE_COST
Whether to include cost values in the graph plot: True (default)/False
--include_label INCLUDE_LABEL
Whether to include individual labels in the plot: True (default)/False
--alt_color ALT_COLOR
Whether to use alternative colors for connection lines: True (default)/False
--ind_labels IND_LABELS
A plain text file containing individual labels (one per line) (default: last column from labels in
input file, which consists of columns [0, 1, 3] separated by delimiter)
--ordered_uniq_labels ORDERED_UNIQ_LABELS
A plain text file containing ordered unique individual labels (one per line) to specify the order
of grouped labels (default: based on first-seen order from ind_labels)
--regroup_ind REGROUP_IND
Whether to regroup individuals so that those with the same labels stay together (if labels are
available): True (default)/False
--reorder_within_group REORDER_WITHIN_GROUP
Whether to reorder individuals within each label group in the plot (if labels are available): True
(default)/False
--reorder_by_max_k REORDER_BY_MAX_K
Whether to reorder individuals based on the major mode with largest K: True (default)/False (based
on the major mode with smallest K)
--order_cls_by_label ORDER_CLS_BY_LABEL
Whether to reorder clusters based on total memberships within each label group in the plot: True
(default)/False (by overall total memberships)
--plot_unaligned PLOT_UNALIGNED
Whether to plot unaligned modes (in a list): True/False (default)
--fig_format {png,jpg,jpeg,tif,tiff,svg,pdf,eps,ps,bmp,gif}
Figure format for output files (default: tiff)
--extension EXTENSION
Extension of input files
--skip_rows SKIP_ROWS
Skip top rows in input files
--remove_missing REMOVE_MISSING
Remove individuals with missing data: True (default)/False
--cd_method {louvain,leiden,infomap,markov_clustering,label_propagation,walktrap,custom}
Community detection method to use (default: louvain)
--cd_res CD_RES Resolution parameter for the default Louvain community detection (default: 1.0)
--test_comm TEST_COMM
Whether to test community structure (default: True)
--comm_min COMM_MIN Minimum threshold for cost matrix (default: 1e-6)
--comm_max COMM_MAX Maximum threshold for cost matrix (default: 1e-2)
--merge MERGE Whether to merge two clusters when aligning K+1 to K (default: True)
--use_rep USE_REP Use representative modes (alternative: average): True (default)/False
--use_best_pair USE_BEST_PAIR
Use best pair as anchor for across-K alignment (alternative: major): True (default)/False
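The `--custom_cmap` option above takes a plain text file with one hex color per line. A minimal sketch for writing such a file from Python follows. The helper name `write_custom_cmap` is hypothetical, and the exact hex syntax accepted by Clumppling is an assumption here (a leading `#` followed by six hex digits); check the manual or the provided examples/custom_colors.txt if in doubt.

```python
import re
from pathlib import Path

# Assumed color syntax: "#RRGGBB". This pattern is an assumption, not taken
# from the Clumppling documentation.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def write_custom_cmap(colors, path):
    """Write one hex color per line to `path` after a basic sanity check."""
    invalid = [c for c in colors if not HEX_COLOR.match(c)]
    if invalid:
        raise ValueError(f"Not valid #RRGGBB colors: {invalid}")
    path = Path(path)
    path.write_text("\n".join(colors) + "\n")
    return path
```

The resulting file could then be passed on the command line as `--custom_cmap my_colors.txt`.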
Usage
Examples:
python -m clumppling \
-i INPUT_PATH \
-o OUTPUT_PATH \
-f generalQ \
--extension .Q # if not specified, all files under INPUT_PATH will be treated as input files
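If you want to launch the same command from a Python script (for example, inside a pipeline), one possible approach is to build the argument list and hand it to `subprocess`. This is a sketch, not part of the package: the helper names are hypothetical, and only the `-i`, `-o`, `-f`, and `--extension` flags shown in the help message above are used.

```python
import subprocess
import sys

ALLOWED_FORMATS = ("generalQ", "admixture", "structure", "fastStructure")


def build_clumppling_command(input_path, output_path, fmt, extension=None):
    """Return the argument list for `python -m clumppling` (does not run it)."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    cmd = [
        sys.executable, "-m", "clumppling",
        "-i", str(input_path),
        "-o", str(output_path),
        "-f", fmt,
    ]
    if extension is not None:
        # Without --extension, all files under the input path are treated as input.
        cmd += ["--extension", extension]
    return cmd


def run_clumppling(input_path, output_path, fmt, extension=None):
    """Run Clumppling in a subprocess; raises CalledProcessError on failure."""
    cmd = build_clumppling_command(input_path, output_path, fmt, extension)
    return subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the call inside the currently active environment (for example, `clumppling-env`), which avoids picking up a different Python from the system path.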
Example Outputs
Submodules
Each submodule is callable independently.
parseInput
clumppling.parseInput: Handles reading and parsing input files containing clustering results. Supports various formats and prepares data for downstream analysis.
Usage:
usage: __main__.py [-h] -i INPUT -o OUTPUT -f {generalQ,admixture,structure,fastStructure} [--extension EXTENSION] [--skip_rows SKIP_ROWS]
[--remove_missing REMOVE_MISSING]
clumppling.parseInput
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input file path
-o OUTPUT, --output OUTPUT
Output file directory
-f {generalQ,admixture,structure,fastStructure}, --format {generalQ,admixture,structure,fastStructure}
File format
--extension EXTENSION
Extension of input files
--skip_rows SKIP_ROWS
Skip top rows in input files
--remove_missing REMOVE_MISSING
Remove individuals with missing data: True/False
clumppling.alignWithinK: Aligns clusters within a single value of K to ensure consistent labeling and facilitate comparison across replicates.
Usage:
usage: __main__.py [-h] [--qfiles [QFILES ...]] [--qfilelist QFILELIST] -o OUTPUT
clumppling.alignWithinK
options:
-h, --help show this help message and exit
--qfiles [QFILES ...]
List of Q files to align, passed as command-line arguments
--qfilelist QFILELIST
A plain text file containing Q file names (one per line).
-o OUTPUT, --output OUTPUT
Output file name
clumppling.detectMode: Detects modes (distinct clustering solutions) among multiple runs for a given K, helping to identify stable and alternative solutions.
Usage:
usage: __main__.py [-h] --align_res ALIGN_RES --qfilelist QFILELIST -o OUTPUT [--qnamelist QNAMELIST]
[--cd_method {louvain,leiden,infomap,markov_clustering,label_propagation,walktrap,custom}] [--cd_res CD_RES] [--test_comm TEST_COMM]
[--comm_min COMM_MIN] [--comm_max COMM_MAX]
clumppling.alignWithinK
options:
-h, --help show this help message and exit
--align_res ALIGN_RES
Path to the alignment results file
--qfilelist QFILELIST
A plain text file containing Q file names (one per line).
-o OUTPUT, --output OUTPUT
Output file directory
--qnamelist QNAMELIST
A plain text file containing replicate names (one per line) (default: file base from qfilelist)
--cd_method {louvain,leiden,infomap,markov_clustering,label_propagation,walktrap,custom}
Community detection method to use (default: louvain)
--cd_res CD_RES Resolution parameter for the default Louvain community detection (default: 1.0)
--test_comm TEST_COMM
Whether to test community structure (default: True)
--comm_min COMM_MIN Minimum threshold for cost matrix (default: 1e-4)
--comm_max COMM_MAX Maximum threshold for cost matrix (default: 1e-2)
clumppling.alignAcrossK: Aligns clusters across different values of K, enabling tracking of cluster membership changes as K varies.
Usage:
usage: __main__.py [-h] [--qfilelist QFILELIST] -o OUTPUT [--qnamelist QNAMELIST] [--use_best_pair USE_BEST_PAIR]
clumppling.alignAcrossK
options:
-h, --help show this help message and exit
--qfilelist QFILELIST
A plain text file containing Q file names (one per line).
-o OUTPUT, --output OUTPUT
Directory to save output files
--qnamelist QNAMELIST
A plain text file containing replicate names (one per line) (default: file base from qfilelist)
--use_best_pair USE_BEST_PAIR
Use best pair as anchor for across-K alignment (alternative: major): True (default)/False
Example:
# prepare mode files
for K in 3 5; do
SRC=examples/submodules/K${K}_modes
DST=examples/submodules/K3K5_modes
mkdir -p $DST
for f in "$SRC"/*.Q; do
if [ -f "$f" ]; then
cp "$f" "$DST/K${K}$(basename "$f")"
fi
done
done
# generate list of mode files and names
ls examples/submodules/K3K5_modes/*_rep.Q > examples/submodules/K3K5_modes/K3K5_modes.qfilelist
for f in examples/submodules/K3K5_modes/*_rep.Q; do [ -f "$f" ] && basename "$f" _rep.Q; done > examples/submodules/K3K5_modes/K3K5_modes.qnamelist
# run clumppling
python -m clumppling.alignAcrossK \
--qfilelist examples/submodules/K3K5_modes/K3K5_modes.qfilelist \
--qnamelist examples/submodules/K3K5_modes/K3K5_modes.qnamelist \
-o examples/submodules/K3K5_acrossK_output
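The list-generation step in the shell example above can also be written in Python, which may be more convenient on Windows. This is a sketch under the same assumptions as the shell version (mode files end in `_rep.Q` and live in one folder); the function name `write_mode_lists` and its parameters are hypothetical.

```python
from pathlib import Path


def write_mode_lists(mode_dir, stem="K3K5_modes", suffix="_rep.Q"):
    """Write <stem>.qfilelist and <stem>.qnamelist for the mode files in `mode_dir`.

    The qfilelist holds one file path per line; the qnamelist holds the
    matching names with `suffix` stripped, in the same order.
    """
    mode_dir = Path(mode_dir)
    q_files = sorted(p for p in mode_dir.glob(f"*{suffix}") if p.is_file())
    qfilelist = mode_dir / f"{stem}.qfilelist"
    qnamelist = mode_dir / f"{stem}.qnamelist"
    # write_text overwrites existing files, so rerunning does not duplicate entries.
    qfilelist.write_text("".join(f"{p}\n" for p in q_files))
    qnamelist.write_text("".join(f"{p.name[:-len(suffix)]}\n" for p in q_files))
    return qfilelist, qnamelist
```

Calling `write_mode_lists("examples/submodules/K3K5_modes")` would produce the two list files that are passed to `--qfilelist` and `--qnamelist`.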
clumppling.compModels: Compares clustering results from different models, with potentially different K values for the results of each model.
Usage:
usage: __main__.py [-h] --models MODELS [MODELS ...] --qfilelists QFILELISTS [QFILELISTS ...]
[--qnamelists QNAMELISTS [QNAMELISTS ...]] [--mode_stats_files MODE_STATS_FILES [MODE_STATS_FILES ...]]
[--ind_labels IND_LABELS] -o OUTPUT [-v VIS] [--custom_cmap CUSTOM_CMAP]
[--bg_colors BG_COLORS [BG_COLORS ...]] [--include_sim_in_label INCLUDE_SIM_IN_LABEL]
[--fig_format {png,jpg,jpeg,tif,tiff,svg,pdf,eps,ps,bmp,gif}]
clumppling.compModels
options:
-h, --help show this help message and exit
--models MODELS [MODELS ...]
List of model names.
--qfilelists QFILELISTS [QFILELISTS ...]
List of files containing Q file names from each model.
--qnamelists QNAMELISTS [QNAMELISTS ...]
List of files containing replicate names from each model.
--mode_stats_files MODE_STATS_FILES [MODE_STATS_FILES ...]
List of files containing mode statistics from each model.
--ind_labels IND_LABELS
A plain text file containing individual labels (one per line)
-o OUTPUT, --output OUTPUT
Output file directory
-v VIS, --vis VIS Whether to generate figure(s): True (default)/False
--custom_cmap CUSTOM_CMAP
A plain text file containing customized colors (one per line; in hex code): if empty (default), using
the default colormap, otherwise use the user-specified colormap
--bg_colors BG_COLORS [BG_COLORS ...]
List of background colors to be used in the interleaving display: if empty (default), using the gray
scale colors, otherwise use the user-specified colors
--include_sim_in_label INCLUDE_SIM_IN_LABEL
Whether to include (original) alignment similarity in mode labels (if provided): True (default)/False
--fig_format {png,jpg,jpeg,tif,tiff,svg,pdf,eps,ps,bmp,gif}
Figure format for output files (default: tiff)
References
Liu, X., Kopelman, N. M., & Rosenberg, N. A. (2024). Clumppling: cluster matching and permutation program with integer linear programming. Bioinformatics, 40(1), btad751. https://doi.org/10.1093/bioinformatics/btad751
The Cape Verde data used as the example comes from: Verdu, P., Jewett, E. M., Pemberton, T. J., Rosenberg, N. A., & Baptista, M. (2017). Parallel trajectories of genetic and linguistic admixture in a genetically admixed creole population. Current Biology, 27(16), 2529-2535. https://doi.org/10.1016/j.cub.2017.07.002.
The chicken data used as the example comes from: Rosenberg, N. A., Burke, T., Elo, K., Feldman, M. W., Freidlin, P. J., Groenen, M. A., … & Weigend, S. (2001). Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics, 159(2), 699-713. https://doi.org/10.1093/genetics/159.2.699.
Acknowledgements
We thank Egor Lappo for assisting with packaging.
We thank Egor Lappo, Daniel Cotter, Maike Morrison, Chloe Shiff, and Juan Esteban Rodriguez Rodriguez for helping with the testing of the program.
Special thanks to GW and AS for reporting issues and helping us improve the tool.
Version Update History
Version 0.0 -> 1.0
Modularize each step.
Add input parsing features:
use extension to specify the file extension of the input files.
use skip_rows to specify number of rows to skip from the input files.
use remove_missing to choose whether to remove individuals with missing data (clusters).
Add flexibility in algorithmic settings:
test_comm: whether to test for community structure during mode detection, as well as extreme values for determining if nodes fall into communities (comm_min and comm_max).
use_best_pair: whether to align across-K using the best pair of modes or the pair of major modes as the anchor.
keep the features merge and use_rep.
Use the package cdlib for community detection. Change cd_method to multiple choices (default: ‘louvain’) and move cd_custom as the choice ‘custom’.
Add flexibility in plotting settings:
plot_type: which plot(s) to generate: ‘all’, ‘graph’ (default), ‘list’, ‘major’, or ‘withinK’.
include_cost: include edges indicating alignment costs in the graph of structure plots.
include_label: whether to include group labels of individuals (if available) on the x-axis and draw corresponding vertical lines in the structure plots separating groups.
ind_labels: accept user-specified individual labels from a file.
Add visualization of alignment patterns.
Version 1.0 -> 2.0
Add the model comparison module to keep modes from different clustering models separate while aligning them.
Add flexibility in plotting:
reorder_ind: whether to reorder individuals within each label group, in decreasing order of their memberships in the cluster with the largest total membership (in each label group, or over all label groups).
reorder_by_max_k: when reordering individuals (reorder_ind=True), whether to reorder based on the major mode with largest K, or the major mode with smallest K.
order_cls_by_label: when reordering individuals (reorder_ind=True), whether to reorder clusters based on total memberships within each label group or total memberships over all label groups.
Enable regrouping of individuals. If regroup_ind is set to True (default) and population labels are available (either extracted from the input files or provided separately), then individuals with the same population label will be reordered to stay together. If individuals are not grouped by population, include_label must be set to False so that plots are generated without errors.
Main Function
Input arguments
The main module takes in three required arguments and several optional ones. The required arguments are
-i (--input) path to load input files
-o (--output) path to save output files
-f (--format) input data format. This choice must be one of “generalQ”, “admixture”, “structure”, or “fastStructure”.
The optional arguments are
for input parsing: extension, skip_rows, remove_missing
for community detection: cd_method, cd_res, test_comm, comm_min, comm_max
for alignment across K: merge, use_rep, use_best_pair
for visualization: -v (--vis), plot_type, include_cost, include_label, ind_labels, custom_cmap, reorder_ind, regroup_ind, ordered_uniq_labels, reorder_by_max_k, order_cls_by_label, plot_unaligned, fig_format
See the help message from `clumppling -h` above for the usage of each argument.
Example data
As a quick start, let’s use the Cape Verde data and the chicken data as examples. The data files are available in the zip files examples/capeverde.zip and examples/chicken.zip.
Cape Verde data
The .indivq files for the Cape Verde data contain a column indicating each individual's population index. Rows in a .indivq file with K=5 clusters (ancestries) look like:
where the columns represent the individual index (5), the individual label (HGDP00908), the missing rate ((0)), the population index (1), and the clustering memberships (after the colon).
The Cape Verde data is also available in general Q format (.Q files) in examples/capeverde_admixtureQ.zip. For the same rows as above, in a .Q file they look like:
The corresponding population labels of the Cape Verde individuals are provided separately in the file examples/capeverde_ind_labels.txt.
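To illustrate the .indivq column layout described above, here is a small parsing sketch. It is not Clumppling's own parser. The example row is hypothetical: the index, label, missing rate, and population index follow the description in this section, but the five membership values are made up for illustration.

```python
def parse_indivq_row(line):
    """Split one .indivq row into its metadata fields and membership values."""
    # Everything before the colon is metadata; everything after is memberships.
    meta, _, memberships = line.partition(":")
    index, label, missing, population = meta.split()
    return {
        "index": int(index),
        "label": label,
        "missing_rate": float(missing.strip("()")),  # "(0)" -> 0.0
        "population": int(population),
        "memberships": [float(value) for value in memberships.split()],
    }


# Hypothetical row with K=5; the membership values are invented.
example_row = "5 HGDP00908 (0) 1 : 0.10 0.20 0.30 0.25 0.15"
parsed = parse_indivq_row(example_row)
print(parsed["label"], parsed["population"], parsed["memberships"])
```

The same memberships in general Q format would simply be the values after the colon, one row per individual.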
Chicken data
The chicken data is available in the output file format of the STRUCTURE software (https://web.stanford.edu/group/pritchardlab/structure.html), i.e., _f files. The clustering memberships are reported in the section starting with
where the content format closely resembles that of the .indivq file. Other sections of the STRUCTURE file are neither required nor used by Clumppling.
A file with custom colors is also provided at examples/custom_colors.txt for use in examples.
How to run (with example data)
Ensure that the data files have been successfully downloaded and put under the right directory. Download the example files from the directory “examples” in the GitHub repository. For each example dataset, unzip the files into the folder with the same name as the zip file.
Ensure that the current path is the correct directory. By default, you should be in the parent directory of the “examples” folder; that is, in your command-line interpreter, navigate to the directory where the “examples” folder is located. Alternatively, update the paths in the following example scripts accordingly.
Run the program on the Cape Verde data under the default setting, with user-provided individual labels:
The outputs will be saved in “examples/capeverde_output” under your current directory, and a zipped copy will be generated at examples/capeverde_output.zip.
Similarly, you can run the program on the chicken data as follows:
The outputs will be saved in “examples/chicken_output” under your current directory, and a zipped copy will be generated at examples/chicken_output.zip.
These commands are also provided in the example script for running Clumppling on Cape Verde data (Admixture .indivq files) and the example script for running Clumppling on chicken data (Structure _f files).
Outputs
The output folder will contain the following structure (see examples/capeverde_output for reference after running the example; suppose use_rep=True):
aligned_modes/: Contains files with clusters aligned within each K.
modes/: Contains detected modes for each K.
acrossK_alignment/: Contains results of cluster alignment across different K values.
plots/: Contains generated visualizations (structure plots, alignment graphs).
logs/: Contains log files from the run.
File names and subfolders may vary depending on your input and options.
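After a run, a quick way to see which of the subfolders listed above were produced is to check for them from Python. This is a minimal sketch: the folder names come from the list above and, as noted, may vary with your input and options; `summarize_output` is a hypothetical helper.

```python
from pathlib import Path

# Subfolder names as listed in this README; actual names may differ by options.
EXPECTED_SUBFOLDERS = ("aligned_modes", "modes", "acrossK_alignment", "plots", "logs")


def summarize_output(output_dir):
    """Return {subfolder name: True/False} depending on whether it exists."""
    output_dir = Path(output_dir)
    return {name: (output_dir / name).is_dir() for name in EXPECTED_SUBFOLDERS}
```

For example, `summarize_output("examples/capeverde_output")` returns a dictionary that makes it easy to spot a missing `plots/` folder when visualization was turned off.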
Visualizations
For generating figures, see examples/plot_submodules.py as an example of visualizing the results from examples/submodules/K3K5_acrossK_output in different ways.
Run
All commands are also provided in examples/run_submodules.sh.
For an example of running clumppling.compModels, see examples/run_comp_models.sh.
License
MIT License