A post-processing tool for Kraken 2 read classifications. Run Kraken 2 once and re-classify the reads with any confidence score stringency of your choice afterwards, saving you lots of compute time. Creates Kraken 2 style report and read classification files.
For additional insight into your Kraken 2 classifications, try out KrakMeOpen - a downstream analysis toolkit for Kraken 2 classification quality metrics.
Installation
StringMeUp is available to install through conda. Simply run the following command to install it:
The confidence score (CS) for a given read R classified to a given node J is calculated by dividing the number of k-mers that hit any node in the clade rooted at node J (N) by the total number of k-mers that were queried against the database (M). Any k-mer with an ambiguous nucleotide is not queried against the database, and is thus not part of M.
CS = N / M
If the CS for a given read R at a given node J is equal to or larger than the specified cutoff, read R is classified to node J. If not, the CS of read R is calculated for the parent of node J. This is repeated until the CS >= CS cutoff or until we reach the root of the taxonomy. If the CS < CS cutoff at the root, the read is deemed unclassified.
Reclassifying Kraken 2 output
To reclassify reads classified by Kraken 2 with a confidence cutoff of 0.1:
original_classifications.kraken2 is the output file from Kraken 2 that contain the read-by-read classifications.
names.dmp and nodes.dmp are the same NCBI taxonomy files used for the building of the database that was used to produce the classifications in original_classifications.kraken2.
This command would output a Kraken 2 style report to stdout. Adding --output_report <FILE> would save the report in a file.
To save the read-by-read classifications, add --output_classifications <FILE> to the command.
To save a verbose version of the read-by-read classifications, add --output_verbose <FILE> to the command. The verbose version of the read-by-read classifications will contain the following columns:
Column
Explanation
READ_ID
The ID of the read
READ_LENGTH
The length of the read (same as Kraken 2 output)
MINIMIZER_HIT_GROUPS*
The number of minimizer hit groups found during Kraken 2 classification*
TAX_LVL_MOVES
How many levels in the taxonomy that the read moved during reclassification
ORIGINAL_TAXID
The taxID that the read was classified to originally
NEW_TAXID
The taxID that the read was reclassified to
ORIGINAL_CONFIDENCE
The original confidence score
NEW_CONFIDENCE
The confidence score at the taxID that the read was reclassified to
MAX_CONFIDENCE
The maximum confidence that the read can have
ORIGINAL_TAX_LVL
The taxonomic rank of the orignally classified taxID
NEW_TAX_LVL
The taxonomic rank of the reclassified taxID
ORIGINAL_NAME
The scientific name of the original taxID
NEW_NAME
The scientific name of the reclassified taxID
KMER_STRING
The k-mer string (same as Kraken 2 output)
*: Is only present if the forked version of Kraken 2 was used for initial classification.
Reclassifying with minimum hit groups
This option requires an input file that was produced with my fork of Kraken 2.
Add --minimum_hit_groups <INT> to the command. A read can only be considered classified if the number of minimizer hit groups is at or above the minimum_hit_groups setting.
StringMeUp
A post-processing tool for Kraken 2 read classifications. Run Kraken 2 once and re-classify the reads with any confidence score stringency of your choice afterwards, saving you lots of compute time. Creates Kraken 2 style report and read classification files.
For additional insight into your Kraken 2 classifications, try out KrakMeOpen - a downstream analysis toolkit for Kraken 2 classification quality metrics.
Installation
StringMeUp is available to install through conda. Simply run the following command to install it:
conda install -c conda-forge -c bioconda stringmeupUsage
A good start is to run
stringmeup --help.About the confidence score
The confidence score (CS) for a given read R classified to a given node J is calculated by dividing the number of k-mers that hit any node in the clade rooted at node J (N) by the total number of k-mers that were queried against the database (M). Any k-mer with an ambiguous nucleotide is not queried against the database, and is thus not part of M.
CS = N / M
If the CS for a given read R at a given node J is equal to or larger than the specified cutoff, read R is classified to node J. If not, the CS of read R is calculated for the parent of node J. This is repeated until the CS >= CS cutoff or until we reach the root of the taxonomy. If the CS < CS cutoff at the root, the read is deemed unclassified.
Reclassifying Kraken 2 output
To reclassify reads classified by Kraken 2 with a confidence cutoff of 0.1:
stringmeup --names <names.dmp> --nodes <nodes.dmp> 0.1 <original_classifications.kraken2>Where:
This command would output a Kraken 2 style report to stdout. Adding
--output_report <FILE>would save the report in a file.To save the read-by-read classifications, add
--output_classifications <FILE>to the command.To save a verbose version of the read-by-read classifications, add
--output_verbose <FILE>to the command. The verbose version of the read-by-read classifications will contain the following columns:*: Is only present if the forked version of Kraken 2 was used for initial classification.
Reclassifying with minimum hit groups
This option requires an input file that was produced with my fork of Kraken 2.
Add
--minimum_hit_groups <INT>to the command. A read can only be considered classified if the number of minimizer hit groups is at or above the minimum_hit_groups setting.