getVariations plugin works only with KMC databases produced by kmc version 3.0.0 or higher.
This version currently supports only KMC database files generated with a signature length of 9 (i.e., using -p 9). Files created with other signature lengths are not guaranteed to work and may lead to unexpected behavior.
Memory Usage with --memory or -m Option: The getVariations plugin can be significantly faster when used with the --memory option, which loads the KMC database entirely into memory. However, this may lead to Java heap space errors on large DBs. To prevent such issues:
- Run with a custom heap size using the -Xmx JVM option. Example: kcftools -Xmx16G getVariations ...
- Or set the default heap size via the environment variable KCFTOOLS_HEAP_SIZE. Example: export KCFTOOLS_HEAP_SIZE=16G
Usage
kmc database
To use kcftools, you first need to create a KMC database from your query data (fasta/fastq). This can be done using the KMC tool:
--score_a : Minimum score for allele ref (default: 95)
--score_b : Minimum score for allele alt (default: 60)
--score_n : Minimum score for allele missing (default: 30)
--maf : Minimum allele frequency (default: 0.05)
--max-missing : Maximum missing data fraction (default: 0.8)
--chrs : List file of chromosomes to include (default: all)
increaseWindow
Combine subsequent windows to generate the KCF file with increased window size.
kcftools increaseWindow [options]
Required Options:
-i, --input=<kcfFile> : Input `.kcf` file
-o, --output=<kcfFile> : Output `.kcf` file
-w, --window=<windowSize> : Window size (must be larger than the window size of the input `.kcf` file)
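As a rough illustration of what combining subsequent windows could mean, the sketch below merges runs of adjacent windows on the same chromosome. It is not the kcftools implementation: the record keys (`chrom`, `start`, `end`, `total_kmers`, `observed_kmers`), the requirement that the new size be a whole multiple of the old one, and the simple summing of k-mer counts (which ignores k-mers spanning window boundaries) are all assumptions made for the example.

```python
from itertools import groupby


def merge_windows(windows, old_size, new_size):
    """Merge runs of consecutive windows into larger ones (illustrative only).

    windows: list of dicts sorted by chromosome and start, using the
    hypothetical keys 'chrom', 'start', 'end', 'total_kmers', 'observed_kmers'.
    """
    if new_size <= old_size:
        raise ValueError("new window size must be larger than the input window size")
    if new_size % old_size != 0:
        # Simplifying assumption of this sketch, not a documented kcftools rule.
        raise ValueError("this sketch only handles whole multiples of the input size")
    per_group = new_size // old_size
    merged = []
    # Never merge across chromosomes.
    for chrom, group in groupby(windows, key=lambda w: w["chrom"]):
        group = list(group)
        for i in range(0, len(group), per_group):
            chunk = group[i:i + per_group]
            merged.append({
                "chrom": chrom,
                "start": chunk[0]["start"],
                "end": chunk[-1]["end"],
                "total_kmers": sum(w["total_kmers"] for w in chunk),
                "observed_kmers": sum(w["observed_kmers"] for w in chunk),
            })
    return merged


# Five made-up 1000 bp windows on one chromosome, merged into 2000 bp windows.
windows = [
    {"chrom": "chr1", "start": s, "end": s + 1000,
     "total_kmers": 970, "observed_kmers": 900}
    for s in range(0, 5000, 1000)
]
for w in merge_windows(windows, old_size=1000, new_size=2000):
    print(w["chrom"], w["start"], w["end"], w["total_kmers"], w["observed_kmers"])
```

The last merged window is shorter than the requested size because the chromosome ends before a full group is available; how kcftools treats such trailing windows is not stated in this document.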
kcf2tsv
Convert a .kcf file to a TSV format similar to IBSpy output (with scores).
--score_a : Minimum score for allele ref (default: 95)
--score_b : Minimum score for allele alt (default: 60)
--score_n : Minimum score for allele missing (default: 30)
--maf : Minimum allele frequency (default: 0.05)
--max-missing : Maximum missing data fraction (default: 0.8)
--chrs : List file of chromosomes to include (default: all)
scoreRecalc
Recalculate identity scores in a .kcf file using new weights for inner distance, tail distance, and kmer ratio.
kcftools scoreRecalc [options]
Required Options:
-i, --input=<kcfFile> : Input `.kcf` file
-o, --output=<kcfFile> : Output `.kcf` file
--wi=<weightInner> : Weight for inner distance
--wt=<weightTail> : Weight for tail distance
--wr=<weightKmerRatio> : Weight for kmer ratio
KCF file format
The Kmer Count Format (.kcf) file summarizes the variation profile of a query relative to a reference genome based on k-mer presence/absence matrices.
KCF File Header Description
The .kcf file starts with a set of metadata headers describing the format, source, and parameters used during the k-mer based analysis. Below is a breakdown of each header:
##format=KCF0.1
Specifies the version of the KCF file format.
##date=2024-12-05
Date on which the file was generated.
##source=kcftools
Indicates the software used to generate the file (kcftools).
##reference=lsatv11.chr3.fasta
Reference genome FASTA file used to derive the reference k-mers.
##contig=<ID=chr3,length=324658466>
Specifies the reference contig (chromosome ID and its length).
##INFO=<ID=IS,Type=Float,Description="Minimum score for the window">
##INFO=<ID=XS,Type=Float,Description="Maximum score for the window">
##INFO=<ID=MS,Type=Float,Description="Mean score for the window">
##INFO=<ID=IO,Type=Integer,Description="Minimum observed kmers in the window">
##INFO=<ID=XO,Type=Integer,Description="Maximum observed kmers in the window">
##INFO=<ID=MO,Type=Integer,Description="Mean observed kmers in the window">
##INFO=<ID=IV,Type=Integer,Description="Minimum variations in the window">
##INFO=<ID=XV,Type=Integer,Description="Maximum variations in the window">
##INFO=<ID=MV,Type=Integer,Description="Mean variations in the window">
These define window-level summary statistics for identity score (IS, XS, MS), observed k-mers (IO, XO, MO), and variations (IV, XV, MV).
The command-line(s) invocation used to produce the .kcf file for reproducibility.
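The header lines shown above follow a simple `##key=value` pattern, so they are easy to read programmatically. The following is a minimal sketch of such a reader, assuming that every header line starts with `##` and that the header ends at the first line that does not. The function name `parse_kcf_header` and the regular expression are illustrative and are not part of kcftools.

```python
import re

# Matches ##INFO lines of the form shown in the header description.
INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Type=(?P<type>[^,]+),Description="(?P<desc>[^"]*)">$'
)


def parse_kcf_header(lines):
    """Split header lines into generic metadata and INFO field definitions."""
    meta, info = {}, {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("##"):
            break  # assumption: the header ends at the first non-## line
        match = INFO_RE.match(line)
        if match:
            info[match["id"]] = (match["type"], match["desc"])
        elif "=" in line:
            key, value = line[2:].split("=", 1)
            # A key such as "contig" may occur several times, so keep a list.
            meta.setdefault(key, []).append(value)
    return meta, info


header = [
    "##format=KCF0.1",
    "##date=2024-12-05",
    "##source=kcftools",
    "##reference=lsatv11.chr3.fasta",
    "##contig=<ID=chr3,length=324658466>",
    '##INFO=<ID=IS,Type=Float,Description="Minimum score for the window">',
    '##INFO=<ID=MO,Type=Integer,Description="Mean observed kmers in the window">',
]
meta, info = parse_kcf_header(header)
print(meta["format"], meta["contig"])
print(info["IS"])
```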
KCF File Data Column Description
Each row in the KCF file represents a non-overlapping genomic window analyzed for k-mer presence/absence variation. Below are the descriptions for each column:
KCFTOOLS
KCFTOOLS is a Java-based toolset for identifying genomic variations through counting kmer presence/absence between reference and query genomes. It utilizes precomputed k-mer count databases (from KMC) to perform a wide array of genomic analyses including variant detection, IBS window identification, and genotype matrix generation.
Detailed documentation is available at kcftools.readthedocs.io.
Quick Start
To get started quickly with kcftools, refer to the run_kcftools.sh script located in the utils directory. This assumes that you have installed kcftools via Bioconda.
Introduction
KCFTOOLS is designed for high-throughput genomic analysis using efficient k-mer based methods. By leveraging fast k-mer counting from tools like KMC, KCFTOOLS can rapidly compare genome samples to a reference, identify variations, and produce downstream outputs useful for population genetics and comparative genomics studies.
Methodology
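KCFTOOLS (specifically the getVariations plugin) splits the reference sequence into non-overlapping windows (fixed-length regions, gene models, or transcript features from a GTF file) and screens the reference k-mers in each window against query k-mer databases built using KMC3. For each window, the number of observed k-mers is counted, and variations are identified as consecutive gaps between matching k-mers. These gaps are used to compute the k-mer distance, which is the number of bases not covered by observed k-mers. This distance is divided into inner distance (gaps between hits within the window) and tail distance (gaps at the window edges), providing a detailed measure of sequence divergence or gene loss at multiple resolutions. The identity score for each window is calculated using the formula below:
$$
\text{Identity Score} = \left[ W_o \cdot \left( \frac{\text{obs k-mers}}{\text{total k-mers}} \right) + W_i \cdot \left( 1 - \frac{\text{inner dist}}{\text{eff length}} \right) + W_t \cdot \left( 1 - \frac{\text{tail dist}}{\text{eff length}} \right) \right] \cdot 100
$$
where:
- $W_o$, $W_i$, and $W_t$ are the weights for the k-mer ratio, the inner distance, and the tail distance, respectively
- obs k-mers and total k-mers are the observed and total reference k-mers in the window
- inner dist and tail dist are the inner and tail distances described above
- eff length is the effective length of the window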
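To make the distances and the score concrete, here is a small sketch that derives the inner and tail distances from the positions of observed k-mers in one window and then plugs them into the identity score. It is an illustration under stated assumptions, not the kcftools implementation: the function names are hypothetical, the effective length is taken to be the window length, an empty window is counted entirely as tail distance, and the weights 0.5, 0.3, and 0.2 are arbitrary values rather than kcftools defaults.

```python
def kmer_distances(hit_starts, k, window_len):
    """Return (inner_dist, tail_dist) in bases for one window.

    hit_starts: sorted 0-based start offsets (within the window) of the
    reference k-mers that were found in the query database.
    """
    if not hit_starts:
        # Assumption of this sketch: with no hits, the whole window is tail.
        return 0, window_len
    left_tail = hit_starts[0]
    inner = 0
    covered_end = hit_starts[0] + k
    for start in hit_starts[1:]:
        if start > covered_end:
            inner += start - covered_end  # uncovered gap between two hits
        covered_end = max(covered_end, start + k)
    right_tail = window_len - covered_end
    return inner, left_tail + right_tail


def identity_score(obs_kmers, total_kmers, inner_dist, tail_dist, eff_length,
                   w_o, w_i, w_t):
    """Weighted identity score on a 0-100 scale (weights assumed to sum to 1)."""
    return (w_o * obs_kmers / total_kmers
            + w_i * (1 - inner_dist / eff_length)
            + w_t * (1 - tail_dist / eff_length)) * 100


# Toy window: 20 bases, k = 5, so 16 reference k-mers; 3 of them are observed.
k, window_len = 5, 20
hits = [2, 3, 10]
inner, tail = kmer_distances(hits, k, window_len)
score = identity_score(len(hits), window_len - k + 1, inner, tail,
                       eff_length=window_len, w_o=0.5, w_i=0.3, w_t=0.2)
print(inner, tail, round(score, 3))
```

In this toy window the hits at offsets 2 and 3 overlap, the gap before the hit at offset 10 contributes 2 bases of inner distance, and the uncovered bases at both edges add up to a tail distance of 7.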
[Figure: kcftools getVariations methodology.]
Features
- Combine multiple .kcf sample files into a unified cohort.
- … .kcf files.
- Convert .kcf files into a population-level genotype table.
- … .kcf data.
- Convert .kcf files to TSV format (to replicate IBSpy-like output).
Workflow
[Figure: kcftools workflow.]
Installation
You can install kcftools using either Bioconda or from source.
1. Using Bioconda (recommended)
If you have Bioconda set up, simply run:
2. From Source
Requirements
Steps
Clone the repository:
Build the project using Maven:
The JAR file will be located in the target directory:
Run the tool:
⚠️ Limitations and Performance Notes
KMC DB Compatibility:
The getVariations plugin works only with KMC databases produced by kmc version 3.0.0 or higher.
This version currently supports only KMC database files generated with a signature length of 9 (i.e., using -p 9). Files created with other signature lengths are not guaranteed to work and may lead to unexpected behavior.
Memory Usage with --memory or -m Option:
The getVariations plugin can be significantly faster when used with the --memory option, which loads the KMC database entirely into memory. However, this may lead to Java heap space errors on large DBs. To prevent such issues:
- Run with a custom heap size using the -Xmx JVM option. Example: kcftools -Xmx16G getVariations ...
- Or set the default heap size via the environment variable KCFTOOLS_HEAP_SIZE. Example: export KCFTOOLS_HEAP_SIZE=16G
Usage
kmc database
To use kcftools, you first need to create a KMC database from your query data (fasta/fastq). This can be done using the KMC tool:
Example command to run kmc
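The original example command is not reproduced here. As a hedged sketch, the helper below assembles a KMC command line following KMC 3's general syntax, `kmc [options] <input> <output_prefix> <working_directory>`, with the signature length fixed at 9 to match the compatibility note above. The function name `build_kmc_command`, the file names, and the k-mer length of 31 are illustrative assumptions, so choose the k-mer length that matches your kcftools run.

```python
import shlex


def build_kmc_command(reads, out_prefix, tmp_dir, k, input_format="fq",
                      signature_len=9):
    """Assemble (but do not run) a KMC command line as a list of arguments.

    input_format: 'fq' for FASTQ, 'fa' for FASTA, 'fm' for multi-FASTA,
    mapped to KMC's -fq / -fa / -fm switches.
    """
    if input_format not in {"fq", "fa", "fm"}:
        raise ValueError("input_format must be one of 'fq', 'fa', 'fm'")
    return [
        "kmc",
        f"-k{k}",              # k-mer length
        f"-p{signature_len}",  # signature length (9 per the note above)
        f"-{input_format}",    # input file format
        reads,
        out_prefix,
        tmp_dir,
    ]


# Hypothetical file names; k=31 is only an example value.
cmd = build_kmc_command("sample.fastq.gz", "sample_db", "tmp", k=31)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once KMC is installed and the working directory exists.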
General Usage
kcftools provides several subcommands. General usage:
getVariations
Detect and count variations by comparing reference k-mers with a query KMC database.
Required Options:
Optional:
cohort
Combine multiple .kcf files into a single cohort for population-level analysis.
Required Options:
findIBS
Identify Identity-by-State (IBS) or variable regions in a sample.
Required Options:
Optional:
splitKCF
Split a KCF file by chromosome.
Required Options:
getAttributes
Extract attributes from a KCF file into individual TSV files.
Required Options:
kcf2gt
Generate a genotype matrix from a .kcf file, suitable for GWAS or population studies.
Required Options:
Optional:
increaseWindow
Combine subsequent windows to generate the KCF file with increased window size.
Required Options:
kcf2tsv
Convert a .kcf file to a TSV format similar to IBSpy output (with scores).
Required Options:
kcf2plink
Convert a .kcf file to PLINK format for downstream genetic analysis (experimental feature).
Required Options:
Optional:
scoreRecalc
Recalculate identity scores in a .kcf file using new weights for inner distance, tail distance, and kmer ratio.
Required Options:
KCF file format
The Kmer Count Format (.kcf) file summarizes the variation profile of a query relative to a reference genome based on k-mer presence/absence matrices.
KCF File Header Description
The .kcf file starts with a set of metadata headers describing the format, source, and parameters used during the k-mer based analysis. Below is a breakdown of each header:
- ##format: specifies the version of the KCF file format.
- ##date: date on which the file was generated.
- ##source: indicates the software used to generate the file (kcftools).
- ##reference: reference genome FASTA file used to derive the reference k-mers.
- ##contig: specifies the reference contig (chromosome ID and its length).
- ##INFO: these lines define window-level summary statistics for identity score (IS, XS, MS), observed k-mers (IO, XO, MO), and variations (IV, XV, MV).
- Per-sample field definitions: identity-by-state (IB), number of variations (VA), number of observed k-mers (OB), and calculated score (SC).
- Runtime parameters: the kcftools window size, k-mer length, IBS mode, and total number of windows.
- Command line: the command-line invocation(s) used to produce the .kcf file, for reproducibility.
KCF File Data Column Description
Each row in the KCF file represents a non-overlapping genomic window analyzed for k-mer presence/absence variation. Below are the descriptions for each column:
CHROM, START, END, TOTAL_KMERS, INFO, FORMAT, <Sample>
Format Field Attributes
IB, VA, OB, ID, LD, RD, SC
Notes
⚠️ Warning: Unit tests and comprehensive validation are still under development. Use results carefully and validate downstream.
License
This project is licensed under the GNU General Public License v3.0 only (GPL-3.0-only).
See the LICENSE file for details.
Contact
For questions or contributions, please reach out to: c.s.sivasubramani@gmail.com