
kmindex


kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.

Given a databank $D = \{S_1, ..., S_n\}$, with each $S_i$ being any genomic dataset (genome or raw reads), kmindex computes the percentage of shared k-mers between a query $Q$ and each $S \in D$. It supports multiple datasets and allows searching each sub-index $D_i \in G = \{D_1, ..., D_m\}$. Queries benefit from the findere algorithm. In short, findere reduces the false positive rate at query time by querying $(s+z)$-mers instead of $s$-mers, which are the indexed words, usually called $k$-mers.
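The findere idea described above can be illustrated with a minimal Python sketch. It is not kmindex's implementation: a plain `set` stands in for the index of $s$-mers (kmindex relies on structures built by kmtricks), and the function names `smers`, `findere_hits`, and `shared_fraction` are hypothetical. The sketch assumes the usual findere rule that an $(s+z)$-mer is reported as present only if all $z+1$ of its consecutive $s$-mers are found in the index.

```python
def smers(seq, s):
    """Return the s-mers of seq, in order of appearance."""
    return [seq[i:i + s] for i in range(len(seq) - s + 1)]


def findere_hits(query, indexed, s, z):
    """One boolean per (s+z)-mer of the query.

    An (s+z)-mer counts as present only if all of its z+1
    consecutive s-mers are found in `indexed`.
    """
    present = [w in indexed for w in smers(query, s)]
    n_kmers = len(query) - (s + z) + 1
    return [all(present[i:i + z + 1]) for i in range(max(n_kmers, 0))]


def shared_fraction(query, indexed, s, z):
    """Fraction of the query's (s+z)-mers reported as present."""
    hits = findere_hits(query, indexed, s, z)
    return sum(hits) / len(hits) if hits else 0.0


# Toy data: a set stands in for the real index (assumption).
sample = "ACGTACGTTGCA"
indexed = set(smers(sample, 4))

# This query shares only two isolated 4-mers with the sample.
query = "TTTTACGTCCCC"

print(shared_fraction(sample, indexed, s=4, z=2))  # query identical to the sample
print(shared_fraction(query, indexed, s=4, z=0))   # plain s-mer lookup
print(shared_fraction(query, indexed, s=4, z=2))   # findere-style lookup
```

With `z=0` the two isolated matches are counted, while with `z=2` they are discarded because no run of three consecutive 4-mers is found. Spurious single hits from a probabilistic index would be filtered the same way, which is why the false positive rate drops at query time.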

Indexing/querying example (can be tested in the examples directory):

  1. Index a dataset:

    kmindex build --fof fof1.txt --run-dir D1_index --index ./G --register-as D --hard-min 2 --kmer-size 25 --nb-cell 1000000

  2. Query the index:

    kmindex query --index ./G --fastx query.fasta --zvalue 3

Full documentation is available at https://tlemane.github.io/kmindex

Citation

Lemane, Téo, et al. “Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA.” Nature Computational Science 4.2 (2024): 104-109.

A pre-print of the paper is available on bioRxiv.
