kmindex

kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.

Given a databank $D = \{S_1, ..., S_n\}$, where each $S_{i}$ is a genomic dataset (genome or raw reads), kmindex computes the percentage of k-mers shared between a query $Q$ and each $S \in D$. It supports multiple databanks and can search each sub-index $D_i \in G = \{D_1,...,D_m\}$. Queries benefit from the findere algorithm. In short, findere reduces the false positive rate at query time by querying $(s + z)$-mers instead of $s$-mers, which are the indexed words, usually called $k$-mers.
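To make the findere idea concrete, here is a minimal Python sketch. It is not kmindex's implementation: a plain Python set stands in for the real index, the helper names `kmers` and `shared_fraction` are hypothetical, and the sketch assumes that an $(s + z)$-mer is reported present only when all $z + 1$ of its constituent $s$-mers are found in the index. A spurious entry is added to the set by hand to mimic a false positive.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_fraction(query, index, s, z):
    """Fraction of the query's (s+z)-mers reported present in `index`.

    Assumption for this sketch: an (s+z)-mer counts as present only if
    every one of its z+1 constituent s-mers is in `index`.
    """
    words = kmers(query, s + z)
    if not words:
        return 0.0
    hits = sum(all(sm in index for sm in kmers(w, s)) for w in words)
    return hits / len(words)


# Toy sample: the index holds the s-mers of one short sequence.
s, z = 4, 2
sample = "ACGTACGGTCA"
index = set(kmers(sample, s))
# Hand-made spurious entry, standing in for a false positive of the index.
index.add("TTTT")

query_hit = "GTACGGTC"    # a substring of the sample
query_miss = "AATTTTAA"   # unrelated, but contains the spurious s-mer

print(shared_fraction(query_hit, index, s, z))    # 1.0
print(shared_fraction(query_miss, index, s, 0))   # 0.2 (z = 0: plain s-mer query)
print(shared_fraction(query_miss, index, s, z))   # 0.0
```

With `z = 0` the lone spurious $s$-mer produces a non-zero score for an unrelated query. With `z = 2` it would need two neighbouring false positives as well, so the score drops to zero, while the true match is unaffected.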

Indexing/Querying example (can be tested in the examples directory):

Index a dataset:

kmindex build --fof fof1.txt --run-dir D1_index --index ./G --register-as D --hard-min 2 --kmer-size 25 --nb-cell 1000000

Query the index:

kmindex query --index ./G --fastx query.fasta --zvalue 3

Full documentation is available at https://tlemane.github.io/kmindex

Citation: Lemane, Téo, et al. “Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.” Nature Computational Science 4.2 (2024): 104-109.

A preprint of the paper is available on bioRxiv.