Kmer-db
Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
Quick start
git clone --recurse-submodules https://github.com/refresh-bio/kmer-db
cd kmer-db && gmake
INPUT=./test/virus
OUTPUT=./output
mkdir $OUTPUT
# build a database from all 18-mers (default) contained in a set of sequences
./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db
# establish numbers of common k-mers between new sequences and the database
./bin/kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv
# calculate the Jaccard index from common k-mers
./bin/kmer-db distance jaccard $OUTPUT/n2a.csv $OUTPUT/n2a.jaccard
# extend the database with new sequences
./bin/kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db
# establish numbers of common k-mers between all sequences in the database
./bin/kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv
# build a database from 10% of 25-mers using 16 threads
./bin/kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db
# establish numbers of common 25-mers between a single sequence and the database
# (minhash filtering that retains 10% of MT159713 k-mers is done prior to the comparison)
./bin/kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv
# build two partial databases
./bin/kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.parts1.db
./bin/kmer-db build $INPUT/seqs.part2.list $OUTPUT/k18.parts2.db
# establish numbers of common k-mers between all sequences in the databases,
# computations are done in the sparse mode, the output matrix is also sparse
echo $OUTPUT/k18.parts1.db > $OUTPUT/db.list
echo $OUTPUT/k18.parts2.db >> $OUTPUT/db.list
./bin/kmer-db all2all-parts $OUTPUT/db.list $OUTPUT/k18.parts.csv
1. Installation
Kmer-db comes with a set of precompiled binaries for Linux, macOS, and Windows.
The software is also available on Bioconda:
conda install -c bioconda kmer-db
For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual.
Kmer-db can also be built from the sources distributed as:
GNU Make project for Linux and macOS (gmake 4.3 and gcc/g++ 11 or newer required),
Visual Studio 2022 solution for Windows.
Vector extensions
Kmer-db can be built for x86-64 and ARM64 8 architectures (including Apple Mx based on ARM64 8.4 core) and takes advantage of AVX2 (x86-64) and NEON (ARM) CPU extensions. The default target platform is x86-64 with AVX2 extensions. This can be changed by setting the PLATFORM variable for make:
make PLATFORM=none # unspecified platform, no extensions
make PLATFORM=sse2 # x86-64 with SSE2
make PLATFORM=avx # x86-64 with AVX
make PLATFORM=avx2 # x86-64 with AVX2 (default)
make PLATFORM=native # x86-64 with AVX2 and native architecture
make PLATFORM=arm8 # ARM64 8 with NEON
make PLATFORM=m1 # ARM64 8.4 (especially Apple M1) with NEON
Note that x86-64 binaries determine the supported extensions at runtime, which makes them backwards-compatible. For instance, the AVX executable will also work on an SSE-only platform, but with limited performance.
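As a rough sketch of how the PLATFORM value could be chosen automatically, the helper below maps the machine and OS names reported by uname to the values listed above. The function name pick_platform and the mapping itself (for example, treating every ARM64 Mac as m1) are assumptions made for illustration and are not part of Kmer-db.

```shell
# Hypothetical helper: suggest a PLATFORM value for make.
# Arguments are optional; by default the values come from uname.
pick_platform() {
    machine=${1:-$(uname -m)}
    os=${2:-$(uname -s)}
    case "$machine" in
        x86_64|amd64)
            echo avx2 ;;                      # the documented default for x86-64
        arm64|aarch64)
            # assumption: ARM64 on macOS means Apple silicon
            if [ "$os" = "Darwin" ]; then echo m1; else echo arm8; fi ;;
        *)
            echo none ;;                      # unspecified platform, no extensions
    esac
}

# Possible usage:
#   make PLATFORM="$(pick_platform)"
```

Passing the machine and OS names explicitly, as in pick_platform aarch64 Linux, makes it easy to check the mapping without switching hardware.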
2. Usage
kmer-db <mode> [options] <positional arguments>
Kmer-db operates in one of the following modes:
build - building a database from samples,
all2all - counting common k-mers - all samples in the database,
all2all-sp - counting common k-mers - all samples in the database (sparse computation),
all2all-parts - counting common k-mers - all samples across multiple partial databases (sparse computation),
new2all - counting common k-mers - set of new samples versus database,
one2all - counting common k-mers - single sample versus database,
distance - calculating similarities/distances,
minhash - storing minhashed k-mers.
Common options:
-t <threads> - number of threads (default: number of available cores),
The meaning of other options and positional arguments depends on the selected mode.
2.1. Building a database
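Construction of a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:
compressed or uncompressed genomes/reads:
kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] [-extend] [-alphabet <alphabet>] [-preserve-strand] [-t <threads>] <samples> <database>
KMC-generated k-mers:
kmer-db build -from-kmers [-f <fraction>] [-extend] [-t <threads>] <samples> <database>
minhashed k-mers produced by minhash mode:
kmer-db build -from-minhash [-extend] [-t <threads>] <samples> <database>
Parameters:
samples (input) - one of the following:
FASTA file (fa, fna, fasta, fa.gz, fna.gz, fasta.gz) with one or multiple (-multisample-fasta switch) samples,
file with a newline-separated list of samples:
sample_file_1
sample_file_2
sample_file_3
...
Every file on the list can be in one of the following formats:
FASTA genomes/reads (default). If a file on the list cannot be found, the following extensions are tested: fa, fna, fasta, gz, fa.gz, fna.gz, fasta.gz.
KMC-generated k-mer files (-from-kmers switch specified). A set of two KMC files (.kmc_pre + .kmc_suf) is required for every list entry.
minhashed k-mers (-from-minhash switch specified). Minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.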
database (output) - file with generated k-mer database,
-k <kmer-length> - length of k-mers (default: 18); ignored when -from-kmers or -from-minhash switch is specified,
-f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when -from-minhash switch is present,
-multisample-fasta - each sequence in a FASTA file is treated as a separate sample,
-extend - extend the existing database with new samples,
-alphabet <alphabet> - alphabet:
nt (4-symbol nucleotide with indistinguishable T/U; default),
aa (20-symbol amino acid),
aa12_mmseqs (amino acid reduced to 12 symbols as in MMseqs: AST,C,DN,EQ,FY,G,H,IV,KR,LM,P,W),
aa11_diamond (amino acid reduced to 11 symbols as in Diamond: KREDQN,C,G,H,ILV,M,F,Y,W,P,STA),
aa6_dayhoff (amino acid reduced to 6 symbols as proposed by Dayhoff: STPAG,NDEQ,HRK,MILV,FYW,C),
-preserve-strand - preserve strand instead of taking canonical k-mers (allowed only in nt alphabet; default: off),
-t <threads> - number of threads (default: number of available cores).
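The newline-separated sample list accepted by build can be produced in many ways. The sketch below is one possibility: it collects every file with a recognised FASTA extension from a single directory and writes one path per line. The helper name make_sample_list and the directory layout are assumptions for illustration only.

```shell
# Hypothetical helper: write a sample list (one FASTA path per line).
#   $1 - directory holding the FASTA files
#   $2 - output list file
make_sample_list() {
    dir=$1
    list=$2
    # Match the FASTA extensions named in the build documentation;
    # sorting keeps the sample order reproducible between runs.
    find "$dir" -maxdepth 1 -type f \( \
        -name '*.fa' -o -name '*.fna' -o -name '*.fasta' -o \
        -name '*.fa.gz' -o -name '*.fna.gz' -o -name '*.fasta.gz' \
    \) | LC_ALL=C sort > "$list"
}

# Possible usage:
#   make_sample_list ./genomes ./genomes.list
#   ./bin/kmer-db build ./genomes.list ./genomes.db
```

Note that -maxdepth is an extension offered by GNU and BSD find rather than a POSIX feature, so the sketch assumes one of those implementations.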
2.2. Counting common k-mers
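Samples in the database against each other:
Dense computations - recommended when the distance matrix contains few zeros. Output can be stored in the dense or sparse form (-sparse switch):
kmer-db all2all [-buffer <size_mb>] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <common_table>
Sparse computations - recommended when the distance matrix contains many zeros. Output matrix is always in the sparse form:
kmer-db all2all-sp [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <database> <common_table>
Sparse computations, partial databases - use when the distance matrix contains many zeros and there are multiple partial databases. Output matrix is always in the sparse form:
kmer-db all2all-parts [-buffer <size_mb>] [-t <threads>] [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* [-sample-rows [<criterion>:]<count>] <db_list> <common_table>
Parameters: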
database (input) - k-mer database file created by build mode,
db_list (input) - file containing a list of database files created by build mode,
common_table (output) - file containing table with common k-mer counts,
-buffer <size_mb> - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance; default: 8,
-t <threads> - number of threads (default: number of available cores),
-sparse - stores output matrix in a sparse form (always on in all2all-sp and all2all-parts modes),
-min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
-max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below),
-sample-rows [<criterion>:]<count> - retains count elements in every row using one of the strategies: (i) random selection (no criterion); (ii) the best elements with respect to criterion.
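criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). No criterion indicates num-kmers (filtering) or random elements selection (sampling). Multiple filters can be combined.
New samples against the database:
kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-t <threads>] [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <database> <samples> <common_table>
Parameters: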
database (input) - k-mer database file created by build mode,
samples (input) - file containing samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database,
common_table (output) - file containing table with common k-mer counts,
-multisample-fasta / -from-kmers / -from-minhash - see build mode for details,
-t <threads> - number of threads (default: number of available cores),
-sparse - stores output matrix in a sparse form,
-min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
-max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below),
criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (see 2.3 for definitions). No criterion indicates num-kmers. Multiple filters can be combined.
Single sample against the database:
kmer-db one2all [-from-kmers | -from-minhash] [-t <threads>] <database> <sample> <common_table>
The meaning of the parameters is the same as in new2all mode, but instead of specifying a file with a sample list, a single sample file is used as a query.
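To illustrate how several -min and -max filters can be combined, the sketch below assembles an all2all-sp call that keeps only pairs with at least 10 common k-mers and a Jaccard index between 0.1 and 0.95. The thresholds, the KMER_DB variable, and the function name are arbitrary choices for this example. By default the command is only printed (dry run), so the snippet can be tried without a database.

```shell
# Assumption: KMER_DB holds the command to run; the default is a dry run
# that merely prints the command line instead of executing kmer-db.
KMER_DB=${KMER_DB:-echo ./bin/kmer-db}

# Hypothetical wrapper: sparse all-vs-all comparison with combined filters.
#   $1 - database file, $2 - output table
sparse_filtered_all2all() {
    db=$1
    out=$2
    # $KMER_DB is left unquoted on purpose so that "echo ./bin/kmer-db"
    # splits into a command and its first argument.
    $KMER_DB all2all-sp \
        -min num-kmers:10 \
        -min jaccard:0.1 \
        -max jaccard:0.95 \
        "$db" "$out"
}

# Possible usage (real run):
#   KMER_DB=./bin/kmer-db
#   sparse_filtered_all2all ./output/k18.db ./output/a2a.filtered.csv
```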
Output format
Modes all2all, all2all-sp, all2all-parts, new2all, and one2all produce a comma-separated table with numbers of common k-mers. For all2all, new2all, and one2all modes, the table is by default stored in a dense form:
kmer-length: k fraction: f, db-samples, s1, s2, …, sn
query-samples, total-kmers, |s1|, |s2|, …, |sn|
q1, |q1|, |q1 ∩ s1|, |q1 ∩ s2|, …, |q1 ∩ sn|
q2, |q2|, |q2 ∩ s1|, |q2 ∩ s2|, …, |q2 ∩ sn|
…, …, …, …, …, …
qm, |qm|, |qm ∩ s1|, |qm ∩ s2|, …, |qm ∩ sn|
where:
k - k-mer length,
f - minhash fraction (1, when minhashing is disabled),
s1, s2, …, sn - database sample names,
q1, q2, …, qm - query sample names,
|a| - number of k-mers in sample a,
|a ∩ b| - number of k-mers common for samples a and b.
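A dense table with this layout can be flattened into (query, database sample, common k-mers) triples with a few lines of awk. The sketch below assumes exactly the layout described above (sample names from the third column of the first row, counts from the third column of every query row). The function name dense_to_triples is hypothetical, and real files may need extra handling, e.g. for quoted names.

```shell
# Hypothetical helper: print "query <TAB> db-sample <TAB> common-kmers"
# for every filled cell of a dense common-k-mer table.
dense_to_triples() {
    LC_ALL=C awk -F',' '
        # strip leading and trailing blanks from a cell
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        # row 1: database sample names start in column 3
        NR == 1 { for (i = 3; i <= NF; i++) name[i] = trim($i); next }

        # row 2: total k-mer counts of database samples, not needed here
        NR == 2 { next }

        # remaining rows: query name, |query|, then one count per db sample
        {
            for (i = 3; i <= NF; i++) {
                cell = trim($i)
                if (cell != "" && name[i] != "")
                    print trim($1) "\t" name[i] "\t" cell
            }
        }
    ' "$1"
}

# Possible usage:
#   dense_to_triples ./output/n2a.csv | sort -k3,3nr | head
```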
When the -sparse switch is specified or the all2all-sp or all2all-parts mode is used, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (column_id: value) with 1-based column indexing. Thus, rows may have a different number of elements, e.g.:
kmer-length: k fraction: f, db-samples, s1, s2, …, sn
query-samples, total-kmers, |s1|, |s2|, …, |sn|
q1, |q1|, i11: |q1 ∩ s_i11|, i12: |q1 ∩ s_i12|
q2, |q2|, i21: |q2 ∩ s_i21|, i22: |q2 ∩ s_i22|, i23: |q2 ∩ s_i23|
q3, |q3|
…, …, …
qm, |qm|, im1: |qm ∩ s_im1|
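The sparse form can be expanded in a similar way. The sketch below resolves every column_id: value pair back to a database sample name, assuming that column id 1 refers to the first database sample in the header row. The function name sparse_to_triples is hypothetical.

```shell
# Hypothetical helper: print "query <TAB> db-sample <TAB> common-kmers"
# for every non-zero element of a sparse common-k-mer table.
sparse_to_triples() {
    LC_ALL=C awk -F',' '
        # strip leading and trailing blanks from a cell
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        # row 1: database sample names; column ids are 1-based, so the
        # name stored in field 3 gets id 1, field 4 gets id 2, and so on
        NR == 1 { for (i = 3; i <= NF; i++) name[i - 2] = trim($i); next }

        # row 2: total k-mer counts of database samples, not needed here
        NR == 2 { next }

        # remaining rows: query name, |query|, then "column_id: value" pairs
        {
            for (i = 3; i <= NF; i++) {
                if (split($i, pair, ":") != 2) continue   # skip empty cells
                id = pair[1] + 0                          # numeric column id
                print trim($1) "\t" name[id] "\t" (pair[2] + 0)
            }
        }
    ' "$1"
}

# Possible usage:
#   sparse_to_triples ./output/k18.parts.csv
```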
For performance reasons, all2all, all2all-sp, and all2all-parts modes produce a lower triangular matrix.
2.3. Calculating similarities or distances
kmer-db distance <measure> [-sparse [-min [<criterion>:]<value>]* [-max [<criterion>:]<value>]* ] <common_table> <output_table>
Parameters:
measure - name of the similarity/distance measure to be calculated; one of the following:
jaccard: $J(q,s) = |q \cap s| / |q \cup s|$,
min: $\min(q,s) = |q \cap s| / \min(|q|, |s|)$,
max: $\max(q,s) = |q \cap s| / \max(|q|, |s|)$,
cosine: $\cos(q,s) = |q \cap s| / \sqrt{|q| \cdot |s|}$,
mash (Mash distance): $\mathrm{Mash}(q,s) = -\frac{1}{k} \ln \frac{2 \cdot J(q,s)}{1 + J(q,s)}$,
ani (average nucleotide identity): $\mathrm{ANI}(q,s) = 1 - \mathrm{Mash}(q,s)$,
ani-shorter - same as ani but with min used instead of jaccard.
common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode (both dense and sparse matrices are supported),
output_table (output) - file containing table with calculated distance measure,
-phylip-out - store output distance matrix in a Phylip format,
-sparse - outputs a sparse matrix (only for dense input matrices - sparse inputs always produce sparse outputs),
-min [<criterion>:]<value> - retains elements with criterion greater than or equal to value (see details below),
-max [<criterion>:]<value> - retains elements with criterion lower than or equal to value (see details below),
criterion can be num-kmers (number of common k-mers) or one of the distance/similarity measures: jaccard, min, max, cosine, mash, ani, ani-shorter (as defined above). If no criterion is specified, the measure argument is used by default. Multiple filters can be combined.
2.4. Storing minhashed k-mers
This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with the -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:
kmer-db minhash [-f <fraction>] [-k <kmer-length>] [-multisample-fasta] [-alphabet <alphabet>] [-preserve-strand] <samples>
kmer-db minhash -from-kmers [-f <fraction>] <samples>
Parameters:
samples (input) - file containing list of samples in one of the supported formats (see build mode),
-f <fraction> - fraction of all k-mers to be accepted by the minhash filter (default: 0.01),
-k <kmer-length> - length of k-mers (default: 18; maximum: 30); ignored when -from-kmers switch is specified,
-multisample-fasta / -from-kmers - see build mode for details,
-alphabet <alphabet> - alphabet:
nt (4-symbol nucleotide with indistinguishable T/U; default),
aa (20-symbol amino acid),
aa12_mmseqs (amino acid reduced to 12 symbols as in MMseqs: AST,C,DN,EQ,FY,G,H,IV,KR,LM,P,W),
aa11_diamond (amino acid reduced to 11 symbols as in Diamond: KREDQN,C,G,H,ILV,M,F,Y,W,P,STA),
aa6_dayhoff (amino acid reduced to 6 symbols as proposed by Dayhoff: STPAG,NDEQ,HRK,MILV,FYW,C),
-preserve-strand - preserve strand instead of taking canonical k-mers (allowed only in nt alphabet; default: off).
For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
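The similarity and distance measures of section 2.3 all derive from three numbers: |q|, |s|, and |q ∩ s|. As a minimal sketch, and not the implementation used by Kmer-db, the helper below evaluates them with awk. It assumes the usual definitions: Jaccard as intersection over union, cosine normalised by sqrt(|q|·|s|), Mash distance as −(1/k)·ln(2J/(1+J)), and ANI as 1 − Mash. The function name kmer_measures and the output order are hypothetical.

```shell
# Hypothetical helper: print jaccard, min, max, cosine, mash and ani
# (in this order, space-separated) for a single pair of samples.
#   $1 - |q|, $2 - |s|, $3 - |q ∩ s|, $4 - k-mer length k
kmer_measures() {
    LC_ALL=C awk -v q="$1" -v s="$2" -v c="$3" -v k="$4" 'BEGIN {
        if (c <= 0) {
            # with no common k-mers the logarithm below is undefined
            print "no common k-mers: Mash distance is undefined" > "/dev/stderr"
            exit 1
        }
        jac  = c / (q + s - c)              # |q ∩ s| / |q ∪ s|
        mn   = c / (q < s ? q : s)          # normalised by the smaller sample
        mx   = c / (q > s ? q : s)          # normalised by the larger sample
        cosv = c / sqrt(q * s)              # assumed cosine normalisation
        # ln((1+J)/(2J)) / k equals -(1/k) * ln(2J/(1+J))
        mash = log((1 + jac) / (2 * jac)) / k
        printf "%.6f %.6f %.6f %.6f %.6f %.6f\n", jac, mn, mx, cosv, mash, 1 - mash
    }'
}

# Possible usage: two samples with 100 and 300 18-mers, 50 of them shared
#   kmer_measures 100 300 50 18
```

For routine work the distance mode remains the intended route; a helper like this is mainly useful for spot-checking single values in an output table.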
3. Datasets
The list of pathogens investigated in the Kmer-db study can be found here
Citing
Deorowicz, S., Gudyś, A., Długosz, M., Kokot, M., Danek, A. (2019) Kmer-db: instant evolutionary distance estimation, Bioinformatics, 35(1): 133–136
Zielezinski, A., Gudyś, A., Barylski, J., Siminski, K., Rozwalak, P., Dutilh, B.E., Deorowicz, S. Ultrafast and accurate sequence alignment and clustering of viral genomes, Nature Methods, https://doi.org/10.1038/s41592-025-02701-7