Taxor: Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters
Citation
Ulrich, J. U., & Renard, B. Y. (2024). Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters. Genome Research, gr-278623. doi: 10.1101/gr.278623.123
Taxor is a taxonomic classification and profiling tool that efficiently classifies DNA sequences against large sets of genomic reference sequences. Taxor stores k-mers in an optimized hierarchical interleaved XOR filter (HIXF) index and combines k-mer similarity and genome coverage information for precise taxonomic classification and profiling. It features:
Low false positive rates for k-mer matching
NCBI taxonomy integration
Open canonical syncmers as k-mer selection scheme for improved downsampling
Classification with binning and taxonomic profiling
Read reassignment via an EM algorithm for multi-matching reads
Advanced filtration of search results
Taxonomic and sequence abundance reports with genome size correction
Benchmarking results based on simulated and real long-read data sets demonstrate that Taxor enables more precise taxonomic classification and profiling of microbial populations while having a smaller memory footprint than other tools.
Installation
The easiest way to install Taxor is via Conda.
Alternatively, you can build Taxor from source using the following commands. This requires CMake (>= 3.16) and GCC (>= 10).
git clone https://github.com/JensUweUlrich/Taxor.git
cd Taxor
mkdir build
cd build
cmake ../src
cmake --build . --config Release
Pre-built databases
Users can easily build custom databases as described below or use the following pre-built database index files
Taxor build
taxor-build - Creates an HIXF index of a given set of fasta files
=================================================================
DESCRIPTION
Creates an HIXF index using either k-mers or syncmers
OPTIONS
Basic options:
-h, --help
Prints the help page.
-hh, --advanced-help
Prints the help page including advanced options.
--version
Prints the version information.
--copyright
Prints the copyright/license information.
--export-help (std::string)
Export the help page information. Value must be one of [html, man].
Main options:
--input-file (std::string)
tab-separated-value file containing taxonomy information and reference file names
--input-sequence-dir (std::string)
directory containing the fasta reference files Default: .
--output-filename (std::string)
A file name for the resulting index. Default: .
--kmer-size (signed 32 bit integer)
size of kmers used for index construction Default: 20. Value must be in range [1,30].
--syncmer-size (signed 32 bit integer)
size of syncmer used for index construction Default: 10. Value must be in range [1,26].
--threads (signed 32 bit integer)
The number of threads to use. Default: 1. Value must be in range [1,32].
--use-syncmer
enable using syncmers for smaller index size
input-file
This file contains all relevant information about the organisms that will be indexed in the database. All values are tab-separated, and the file should have the following columns:
Column 1: Assembly accession: the assembly accession.version reported in this field is
a unique identifier for the set of sequences in this particular version of
the genome assembly.
Column 2: Taxonomy ID: the NCBI taxonomy identifier for the organism from which the
genome assembly was derived. The NCBI Taxonomy Database is a curated
classification and nomenclature for all of the organisms in the public
sequence databases. The taxonomy record can be retrieved from the NCBI
Taxonomy resource:
https://www.ncbi.nlm.nih.gov/taxonomy/
Column 3: FTP path: the path to the directory on the NCBI genomes FTP site from which
data for this genome assembly can be downloaded
Column 4: Organism name
Column 5: Taxonomy string
Column 6: Taxonomy ID string
A two-line example of such a file is provided below. You can easily create such a file by following the preprocessing steps described in the Usage section.
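As a rough illustration of the six-column layout described above, the following Python sketch reads such a tab-separated file into typed records and rejects rows with the wrong number of columns. The names `ReferenceEntry` and `read_reference_table` are hypothetical helpers for this example and are not part of Taxor.

```python
import csv
from dataclasses import dataclass


@dataclass
class ReferenceEntry:
    accession: str     # column 1: assembly accession.version
    taxid: int         # column 2: NCBI taxonomy ID
    ftp_path: str      # column 3: FTP path of the assembly directory
    organism: str      # column 4: organism name
    tax_string: str    # column 5: taxonomy string
    taxid_string: str  # column 6: taxonomy ID string


def read_reference_table(handle):
    """Parse the six-column, tab-separated input file into ReferenceEntry records."""
    entries = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 6:
            raise ValueError(f"line {line_no}: expected 6 columns, found {len(row)}")
        entries.append(ReferenceEntry(row[0], int(row[1]), *row[2:]))
    return entries
```

Running a check like this before `taxor build` can catch truncated or misaligned rows early; how Taxor itself reacts to malformed rows is not covered here.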
input-sequence-dir
Path to the directory containing the (compressed) fasta files of the organisms listed in the tab-separated file described above. The file stem of each fasta file must match the last path component of the FTP path in column 3 of the input file (e.g. GCF_000002495.2_MG8).
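The naming rule above can be checked before building an index. The sketch below derives the expected stem from each FTP path and reports stems for which no file in the sequence directory starts with that stem. Both function names are hypothetical, and the prefix comparison is an assumption: how Taxor strips file extensions when comparing stems is not described here.

```python
from pathlib import Path


def expected_stem(ftp_path: str) -> str:
    """Return the last path component of the FTP path (column 3 of the input file)."""
    return ftp_path.rstrip("/").rsplit("/", 1)[-1]


def missing_references(ftp_paths, sequence_dir):
    """List expected stems for which no file name in sequence_dir starts with the stem."""
    names = [p.name for p in Path(sequence_dir).iterdir() if p.is_file()]
    missing = []
    for ftp_path in ftp_paths:
        stem = expected_stem(ftp_path)
        # Assumption: a file "belongs" to a reference if its name begins with the stem.
        if not any(name.startswith(stem) for name in names):
            missing.append(stem)
    return missing
```

An empty result from `missing_references` suggests that every listed assembly has a candidate file in the directory.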
output-filename
Path to the output file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.
kmer-size
Size of the k-length substrings (k-mers) used for pseudo-mapping. When syncmers are used for downsampling, the k-mer size must be even because Taxor uses open canonical syncmers. The maximum supported k-mer size is 30.
syncmer-size
Size of the substrings used to select k-mers for pseudo-mapping. The syncmer size must also be even because open canonical syncmers are used. It must be smaller than the k-mer size, and the maximum supported size is 26.
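The constraints on kmer-size and syncmer-size stated above can be summarized in a small pre-flight check. This is a sketch based only on the ranges and rules given in this README; `check_sizes` is a hypothetical helper, and Taxor performs its own validation.

```python
def check_sizes(kmer_size: int, syncmer_size: int, use_syncmer: bool) -> list:
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    # Ranges as printed in the help page of taxor build.
    if not 1 <= kmer_size <= 30:
        problems.append("kmer-size must be in the range [1, 30]")
    if not 1 <= syncmer_size <= 26:
        problems.append("syncmer-size must be in the range [1, 26]")
    if use_syncmer:
        # Additional rules that apply only when syncmers are enabled.
        if kmer_size % 2 != 0:
            problems.append("kmer-size must be even when syncmers are used")
        if syncmer_size % 2 != 0:
            problems.append("syncmer-size must be even when syncmers are used")
        if syncmer_size >= kmer_size:
            problems.append("syncmer-size must be smaller than kmer-size")
    return problems
```

For example, the defaults (20 and 10) pass with syncmers enabled, while a k-mer size of 21 is only acceptable when syncmers are disabled.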
use-syncmer
Switch that enables the usage of syncmers for downsampling of k-mers.
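To give an intuition for how syncmers downsample k-mers, the sketch below implements the general open syncmer definition from the literature: a k-mer is selected if its smallest s-mer starts at a fixed offset. This is not Taxor's implementation. Lexicographic ordering, offset 0, and the omission of canonical (strand-independent) handling are simplifying assumptions made for this example.

```python
def open_syncmers(seq: str, k: int, s: int, offset: int = 0):
    """Return (position, k-mer) pairs whose smallest s-mer starts at `offset`."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # index() returns the leftmost occurrence of the minimum, which breaks ties.
        if smers.index(min(smers)) == offset:
            selected.append((i, kmer))
    return selected


# In "ACGTAC" with k=4 and s=2, the k-mer "GTAC" is dropped because its
# smallest 2-mer ("AC") sits at position 2 rather than at the offset 0.
print(open_syncmers("ACGTAC", 4, 2))
```

Because selection depends only on the content of each k-mer, the same k-mers are chosen in a read and in a reference, which is what makes the downsampled sets comparable.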
threads
Number of threads used for computing the hierarchical structure and building the HIXF index.
Taxor search
taxor-search - Queries a file of DNA sequences against an HIXF index
====================================================================
DESCRIPTION
Query sequences against the taxor HIXF index structure
OPTIONS
Basic options:
-h, --help
Prints the help page.
-hh, --advanced-help
Prints the help page including advanced options.
--version
Prints the version information.
--copyright
Prints the copyright/license information.
--export-help (std::string)
Export the help page information. Value must be one of [html, man].
Main options:
--index-file (std::string)
taxor index file containing HIXF index and reference sequence information
--query-file (std::string)
file containing sequences to query against the index Default: .
--output-file (std::string)
A file name for the resulting output. Default: .
--threads (unsigned 8 bit integer)
The number of threads to use. Default: 1. Value must be in range [1,32].
--percentage (double)
If set, this threshold is used instead of the k-mer/syncmer models. Default: -1. Value must be in range
[0,1].
--error-rate (double)
Expected error rate of reads that will be queried Default: 0.04. Value must be in range [0,1].
index-file
Path to the file containing the hierarchical interleaved XOR filter index of the reference sequences and taxonomy information for the profiling step.
query-file
Path to a fast(a/q) file containing sequenced reads of a sample, which shall be taxonomically classified. This file can be gzip compressed.
output-file
Path to the output file containing results of the classification step. This file is tab-separated with 10 columns per line.
Column 1 : read identifier
Column 2 : Assembly Accession ID of the matching reference
Column 3 : Organism name of the matching reference
Column 4 : Taxonomy ID of the matching reference
Column 5 : Cumulative length of the matching reference sequence
Column 6 : Sequence length of the queried read
Column 7 : Overall number of k-mers (syncmers) generated from the queried read
Column 8 : Number of k-mers (syncmers) of the query read that match with the reference sequence
Column 9 : Taxonomy string of the matched reference
Column 10 : Taxonomy ID string of the matched reference
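The ten-column layout above can be read with a few lines of Python, for example to compute the fraction of matching k-mers (column 8 divided by column 7) per hit. `SearchHit` and `read_search_hits` are hypothetical names for this sketch. Whether the file contains a header line or non-numeric placeholders for unclassified reads is not stated here, so rows whose numeric columns cannot be parsed are skipped as a precaution.

```python
import csv
from typing import NamedTuple


class SearchHit(NamedTuple):
    read_id: str          # column 1
    accession: str        # column 2
    organism: str         # column 3
    taxid: str            # column 4
    ref_length: int       # column 5
    read_length: int      # column 6
    total_kmers: int      # column 7
    matching_kmers: int   # column 8
    tax_string: str       # column 9
    taxid_string: str     # column 10

    @property
    def match_fraction(self) -> float:
        """Share of the read's k-mers (syncmers) that matched the reference."""
        return self.matching_kmers / self.total_kmers if self.total_kmers else 0.0


def read_search_hits(handle):
    """Parse tab-separated search results, skipping rows that do not fit the layout."""
    hits = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 10:
            continue
        try:
            numbers = [int(value) for value in row[4:8]]
        except ValueError:
            continue  # e.g. a header line or a non-numeric placeholder
        hits.append(SearchHit(*row[:4], *numbers, *row[8:]))
    return hits
```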
threads
Number of threads used for querying the sequences in the input file against the HIXF index.
percentage
Minimum fraction of a read's k-mers/syncmers that must match a reference for the match to be reported, given as a value between 0 and 1. By default, we instead use the k-mer model from Blanca et al. or, for syncmers, empirically computed values to determine the thresholds for reporting a match.
error-rate
For more accurate classification of reads, we calculate the expected number of mutated k-mers for each read prefix based on the expected sequencing error rate. Then a confidence interval for the number of mutated k-mers is calculated as described by Blanca et al., and the minimum number of matching k-mers is derived from the upper bound of that interval. The confidence level of the interval is set to 95% by default. When using syncmers, we use empirically calculated minimum numbers of matching syncmers for a given error rate and k-mer length.
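The first step of this calculation can be sketched as follows, assuming a simple mutation model in which every base is altered independently with probability equal to the error rate. A k-mer then survives unchanged with probability (1 - r)^k. The function name is hypothetical, and the confidence interval itself requires the variance formula from Blanca et al., which is not reproduced here, so this is not the full threshold computation used by Taxor.

```python
def expected_mutated_kmers(read_length: int, k: int, error_rate: float) -> float:
    """Expected number of k-mers hit by at least one error under independent errors."""
    n_kmers = max(read_length - k + 1, 0)          # number of k-mers in the read
    p_mutated = 1.0 - (1.0 - error_rate) ** k      # chance that a single k-mer is mutated
    return n_kmers * p_mutated


# With the defaults (k = 20, error rate 0.04), more than half of the k-mers
# of a 1,000 bp read are expected to contain at least one error.
print(round(expected_mutated_kmers(1000, 20, 0.04), 1))
```

This illustrates why the matching threshold must drop sharply for noisy long reads: the share of unaffected k-mers shrinks exponentially with k.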
Taxor profile
taxor-profile - Taxonomic profiling of a sample by giving read matching results of Taxor search
===============================================================================================
DESCRIPTION
Taxonomic profiling of the given read set
OPTIONS
Basic options:
-h, --help
Prints the help page.
-hh, --advanced-help
Prints the help page including advanced options.
--version
Prints the version information.
--copyright
Prints the copyright/license information.
--export-help (std::string)
Export the help page information. Value must be one of [html, man].
Main options:
--search-file (std::string)
taxor search file containing results of read querying against the HIXF index
--cami-report-file (std::string)
output file reporting genomic abundances in CAMI profiling format. This is the relative genome abundance in
terms of the genome copy number for the respective TAXID in the overall sample. Note that this is not
identical to the relative abundance in terms of assigned base pairs.
--seq-abundance-file (std::string)
output file reporting sequence abundance in CAMI profiling format (including unclassified reads). This is
the relative sequence abundance in terms of sequenced base pairs for the respective TAXID in the overall
sample. Note that this is not identical to the genomic abundance in terms of genome copy number for the
respective TAXID. Default: .
--binning-file (std::string)
output file reporting read to taxa assignments in CAMI binning format
--sample-id (std::string)
Identifier of the analyzed sample
--threads (unsigned 8 bit integer)
The number of threads to use. Default: 1. Value must be in range [1,32].
search-file
Path to the output file of the search step containing results of the classification. This file is tab-separated with 10 columns per line as described above.
cami-report-file
Output file reporting genomic abundances in CAMI profiling format. This is the relative genome abundance or taxonomic abundance in terms of the genome copy number for the respective TAXID in the overall sample.
seq-abundance-file
Output file reporting sequence abundance in CAMI profiling format (including unclassified reads). This is the relative sequence abundance in terms of sequenced base pairs for the respective TAXID in the overall sample.
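The difference between the two reports can be illustrated with a small calculation. In this sketch, sequence abundance is the share of assigned base pairs per taxon, while genomic abundance first divides the assigned base pairs by the genome length (a proxy for genome copy number) and then normalizes. The function and its inputs are hypothetical; Taxor's actual computation, including its handling of unclassified reads, may differ.

```python
def abundances(assigned_bases: dict, genome_lengths: dict):
    """Return (sequence_abundance, genomic_abundance) as dicts keyed by taxon."""
    total_bases = sum(assigned_bases.values())
    if total_bases == 0:
        raise ValueError("no bases assigned to any taxon")
    # Sequence abundance: fraction of sequenced base pairs per taxon.
    seq_abundance = {t: b / total_bases for t, b in assigned_bases.items()}
    # Genome size correction: base pairs per genome length approximates copy number.
    coverage = {t: b / genome_lengths[t] for t, b in assigned_bases.items()}
    total_coverage = sum(coverage.values())
    genome_abundance = {t: c / total_coverage for t, c in coverage.items()}
    return seq_abundance, genome_abundance


# Two taxa with equal sequenced bases but a four-fold difference in genome size.
seq, gen = abundances({"A": 1000, "B": 1000}, {"A": 100, "B": 400})
print(seq)
print(gen)
```

Both taxa contribute half of the sequenced bases, yet the taxon with the smaller genome accounts for most genome copies, which is why the two reports are not interchangeable.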
binning-file
Output file reporting read to taxon assignments in CAMI binning format.
sample-id
String that identifies the analyzed sample.
threads
Number of threads used for taxonomic profiling.
Usage
First download the reference sequences and taxonomy dump of the sequences from the NCBI using genome_updater.
Then, unpack the taxonomy dump and create a tab-separated-values file using the Linux command cut and taxonkit.
Now we can build the hierarchical interleaved XOR filter (HIXF) index of the reference sequences and the NCBI taxonomy.
Then, we query the sample fastq file against the index, allowing in this case a sequencing error rate of 15%.
Finally, the query result file is used as input for taxonomic profiling, which produces three output files: taxonomic abundances and sequence abundances in CAMI report format, and a binning file with the final read-to-reference assignments.
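The build, search, and profile steps can be chained as sketched below. The script only assembles and prints the command lines and does not execute them. The options are taken from the help pages above; the `taxor <subcommand>` invocation form, all file names, the sample identifier, and the k-mer and syncmer sizes are assumptions chosen for illustration.

```python
import shlex


def taxor_commands(tsv, seq_dir, index, reads, prefix, sample_id, threads=4):
    """Assemble argument lists for the build, search and profile steps."""
    search_out = f"{prefix}.search.txt"  # hypothetical naming scheme
    build = ["taxor", "build", "--input-file", tsv, "--input-sequence-dir", seq_dir,
             "--output-filename", index, "--kmer-size", "22", "--syncmer-size", "12",
             "--use-syncmer", "--threads", str(threads)]
    # The error rate of 0.15 corresponds to the 15% mentioned in the usage text.
    search = ["taxor", "search", "--index-file", index, "--query-file", reads,
              "--output-file", search_out, "--error-rate", "0.15",
              "--threads", str(threads)]
    profile = ["taxor", "profile", "--search-file", search_out,
               "--cami-report-file", f"{prefix}.report",
               "--seq-abundance-file", f"{prefix}.abundance",
               "--binning-file", f"{prefix}.binning",
               "--sample-id", sample_id, "--threads", str(threads)]
    return [build, search, profile]


for cmd in taxor_commands("refs.tsv", "refs/files", "refs.hixf",
                          "sample.fq.gz", "sample", "sample"):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run(cmd, check=True)` once the paths have been adapted.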