unikmer is a toolkit for nucleic acid k-mer analysis.
It provides functions including set operations on k-mers (or k-mer sketches),
optionally with TaxIds but without count information.
K-mers are either encoded (k<=32) or hashed (k<=64, using ntHash v1) into uint64,
and serialized to binary files with the extension .unik.
TaxIds can be assigned when counting k-mers from genome sequences,
and the LCA (Lowest Common Ancestor) is computed during set operations,
including union, intersection, set difference, and finding unique and
repeated k-mers.
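As a rough illustration of how a k-mer with k<=32 can fit into a uint64, the sketch below packs each base into 2 bits. The bit assignment (A=0, C=1, G=2, T=3), the function names, and the definition of a canonical k-mer as the smaller code of a k-mer and its reverse complement are assumptions made for this example, not a description of unikmer's internals.

```python
# Assumed 2-bit code per base; chosen for illustration only.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (1 <= k <= 32) into an integer that fits in 64 bits."""
    if not 0 < len(kmer) <= 32:
        raise ValueError("k must be between 1 and 32")
    code = 0
    for base in kmer.upper():
        # Shift the previous bases left and append the 2 bits of this base.
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Unpack an encoded k-mer back into its text form."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 3])  # lowest 2 bits = last base
        code >>= 2
    return "".join(reversed(bases))


def canonical_code(kmer: str) -> int:
    """Return the smaller code of a k-mer and its reverse complement."""
    revcomp = kmer.upper()[::-1].translate(COMPLEMENT)
    return min(encode_kmer(kmer), encode_kmer(revcomp))
```

With 2 bits per base, 32 bases use exactly 64 bits, which is why longer k-mers (up to k=64) need hashing instead.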
Related projects:
kmers provides bit-packed k-mer methods for this tool.
unik provides k-mer serialization methods for this tool.
Counting
count Generate k-mers (sketch) from FASTA/Q sequences
Information
info Information of binary files
num Quickly inspect the number of k-mers in binary files
Format conversion
view Read and output binary format to plain text
dump Convert plain k-mer text to binary format
encode Encode plain k-mer texts to integers
decode Decode encoded integers to k-mer texts
Set operations
concat Concatenate multiple binary files without removing duplicates
inter Intersection of k-mers in multiple binary files
common Find k-mers shared by most of the binary files
union Union of k-mers in multiple binary files
diff Set difference of k-mers in multiple binary files
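The way TaxIds might be combined by LCA during union and intersection can be sketched as follows. The toy taxonomy, the dict-based representation (encoded k-mer mapped to TaxId), and all function names are hypothetical and only illustrate the idea; they do not reflect unikmer's data structures.

```python
def lineage(taxid, parent):
    """Return the path from a TaxId up to the root (root is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two TaxIds in a parent-pointer taxonomy."""
    ancestors = set(lineage(a, parent))
    for node in lineage(b, parent):
        if node in ancestors:
            return node
    raise ValueError("TaxIds do not share a root")


def union(x, y, parent):
    """Union of two k-mer sets; shared k-mers get the LCA of their TaxIds."""
    result = dict(x)
    for kmer, taxid in y.items():
        result[kmer] = lca(result[kmer], taxid, parent) if kmer in result else taxid
    return result


def inter(x, y, parent):
    """Intersection of two k-mer sets, with TaxIds merged by LCA."""
    return {kmer: lca(x[kmer], y[kmer], parent) for kmer in x if kmer in y}


# Hypothetical taxonomy: 1 is the root; 3 and 4 are siblings under 2.
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
```

For example, a k-mer seen with TaxId 3 in one file and TaxId 4 in another would be assigned TaxId 2 in the union under this toy taxonomy.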
Split and merge
sort Sort k-mers to reduce the file size and accelerate downstream analysis
split Split k-mers into sorted chunk files
tsplit Split k-mers according to TaxId
merge Merge k-mers from sorted chunk files
Subset
head Extract the first N k-mers
sample Sample k-mers from binary files
grep Search k-mers from binary files
filter Filter out low-complexity k-mers
rfilter Filter k-mers by taxonomic rank
Searching on genomes
locate Locate k-mers in genome
map Map k-mers back to the genome and extract successive regions/subsequences
Misc
autocompletion Generate shell autocompletion script
version Print version information and check for update
Binary file
K-mers (represented as uint64 in RAM) are serialized in 8-byte arrays
(or fewer bytes for shorter k-mers in compact format,
or far fewer bytes for sorted k-mers) and
optionally compressed in gzip format, with the extension .unik.
TaxIds are optionally stored next to k-mers using 4 or fewer bytes.
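The saving from the compact format follows from simple arithmetic: a k-mer needs 2k bits, so the number of bytes is 2k/8 rounded up. The sketch below assumes 2 bits per base and big-endian byte order; both the byte order and the function names are assumptions for illustration, not the actual .unik layout.

```python
def bytes_per_kmer(k: int, compact: bool = True) -> int:
    """Bytes needed to store one encoded k-mer (2 bits per base assumed)."""
    if not 0 < k <= 32:
        raise ValueError("k must be between 1 and 32")
    if not compact:
        return 8  # a full uint64
    return (2 * k + 7) // 8  # ceil(2k / 8)


def serialize_kmer(code: int, k: int, compact: bool = True) -> bytes:
    """Write an encoded k-mer as a fixed-length byte string (big-endian assumed)."""
    return code.to_bytes(bytes_per_kmer(k, compact), "big")
```

Under these assumptions a 15-mer takes 4 bytes instead of 8, and a 32-mer still takes the full 8 bytes.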
Compression ratio comparison
No TaxIds stored in this test.
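| label        | encoded-kmer (a) | gzip-compressed (b) | compact-format (c) | sorted (d) | comment                                                  |
|--------------|------------------|---------------------|--------------------|------------|----------------------------------------------------------|
| plain        |                  |                     |                    |            | plain text                                               |
| gzip         |                  | ✔                   |                    |            | gzipped plain text                                       |
| unik.default | ✔                | ✔                   |                    |            | gzipped encoded k-mers in fixed-length byte array        |
| unik.compat  | ✔                | ✔                   | ✔                  |            | gzipped encoded k-mers in shorter fixed-length byte array |
| unik.sorted  | ✔                | ✔                   |                    | ✔          | gzipped sorted encoded k-mers                            |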
(a) One k-mer is encoded as uint64 and serialized in 8 bytes.
(b) K-mer files are compressed in gzip format by default;
users can switch on the global option -C/--no-compress to output uncompressed files.
(c) One k-mer is encoded as uint64 and serialized in 8 bytes by default.
However, fewer bytes are needed for short k-mers, e.g., 4 bytes are enough for
15-mers (30 bits). This makes the file more compact and smaller, and is
controlled by the global option -c/--compact.
(d) One k-mer is encoded as uint64; all k-mers are sorted and compressed
using the varint-GB algorithm.
In all tests, the flag --canonical is on when running unikmer count.
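To give an intuition for why sorted k-mers compress so well, here is a simplified group-varint sketch: sorted values are turned into deltas, and each pair of deltas is written with one control byte that records how many bytes each delta uses. This is an assumption-laden illustration of the general technique; the control-byte layout, the delta step, the zero padding for odd counts, and the function names are invented for the example and are not the actual on-disk format.

```python
def byte_len(value: int) -> int:
    """Smallest number of bytes (1 to 8) that can hold a uint64 value."""
    return max(1, (value.bit_length() + 7) // 8)


def encode_sorted(codes):
    """Delta-encode ascending uint64 values, then pack them in pairs."""
    deltas = [cur - prev for prev, cur in zip([0] + codes[:-1], codes)]
    if len(deltas) % 2:
        deltas.append(0)  # pad so every group holds two values
    out = bytearray()
    for i in range(0, len(deltas), 2):
        d1, d2 = deltas[i], deltas[i + 1]
        n1, n2 = byte_len(d1), byte_len(d2)
        # Control byte: 3 bits per value, storing (length - 1).
        out.append(((n1 - 1) << 3) | (n2 - 1))
        out += d1.to_bytes(n1, "big") + d2.to_bytes(n2, "big")
    return bytes(out)


def decode_sorted(data: bytes, count: int):
    """Inverse of encode_sorted; count drops the padding value if present."""
    codes, pos, prev = [], 0, 0
    while pos < len(data):
        ctrl = data[pos]
        pos += 1
        for n in ((ctrl >> 3) + 1, (ctrl & 7) + 1):
            prev += int.from_bytes(data[pos:pos + n], "big")
            pos += n
            codes.append(prev)
    return codes[:count]
```

Because neighbouring values in a sorted list differ by small amounts, most deltas fit in a few bytes rather than the full 8.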
unikmer: a versatile toolkit for k-mers with taxonomic information
Documents: https://bioinf.shenwei.me/unikmer/
Table of Contents
Using cases
Installation
Downloading executable binary files.
Via Bioconda

Commands
Usages
Counting
Information
Format conversion
Set operations
Split and merge
Subset
Searching on genomes
Misc
Support
Please open an issue to report bugs, propose new functions or ask for help.
License
MIT License