StringMeUp

A post-processing tool for Kraken 2 read classifications. Run Kraken 2 once, then reclassify the reads afterwards at any confidence score cutoff you choose, without spending compute time on re-running Kraken 2. StringMeUp creates Kraken 2 style report and read classification files.

For additional insight into your Kraken 2 classifications, try out KrakMeOpen - a downstream analysis toolkit for Kraken 2 classification quality metrics.

Installation

StringMeUp is available to install through conda. Simply run the following command to install it:

conda install -c conda-forge -c bioconda stringmeup

Usage

A good start is to run stringmeup --help.

About the confidence score

The confidence score (CS) for a given read R classified to a given node J is calculated by dividing the number of k-mers that hit any node in the clade rooted at node J (N) by the total number of k-mers that were queried against the database (M). Any k-mer with an ambiguous nucleotide is not queried against the database, and is thus not part of M.

CS = N / M
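As a rough illustration of the formula, the sketch below computes N and M from a k-mer string in the space-separated taxid:count format that Kraken 2 writes. The function names and the example clade are hypothetical and chosen for illustration only; this is not StringMeUp's own code.

```python
def parse_kmer_string(kmer_string):
    """Return (taxid, count) pairs for the k-mers that were queried.

    Tokens starting with "A" mark k-mers with an ambiguous nucleotide and
    "|:|" separates the mates of a paired read; neither is part of M.
    """
    hits = []
    for token in kmer_string.split():
        taxid, count = token.split(":")
        if taxid in ("A", "|"):
            continue
        hits.append((int(taxid), int(count)))
    return hits


def confidence(hits, clade_taxids):
    """CS = N / M for the clade given as a set of taxids."""
    m = sum(count for _, count in hits)  # all queried k-mers, taxid 0 included
    n = sum(count for taxid, count in hits if taxid in clade_taxids)
    return n / m if m else 0.0


hits = parse_kmer_string("562:13 561:4 A:31 0:1 562:3")
# Assumed for this example: the clade rooted at 562 contains only taxid 562.
cs = confidence(hits, {562})
print(f"CS = {cs:.3f}")
```

Here N = 13 + 3 = 16 and M = 13 + 4 + 1 + 3 = 21. The 31 ambiguous k-mers are left out of M, while the k-mer with taxid 0 (no hit in the database) still counts towards it.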

If the CS for a given read R at a given node J is equal to or larger than the specified cutoff, read R is classified to node J. If not, the CS of read R is calculated for the parent of node J. This is repeated until the CS >= CS cutoff or until we reach the root of the taxonomy. If the CS < CS cutoff at the root, the read is deemed unclassified.
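The procedure described above can be sketched as a walk from the originally assigned node towards the root. This is a minimal sketch under stated assumptions: the three-node taxonomy, the `parent` mapping (with `None` marking the root's parent), and the function names are all hypothetical, and the code is not taken from StringMeUp itself.

```python
def clade_hits(taxid, kmer_hits, parent):
    """Count k-mers whose taxid lies in the clade rooted at `taxid` (N)."""
    total = 0
    for hit_taxid, count in kmer_hits.items():
        node = hit_taxid
        # Follow the lineage of the hit upwards; it belongs to the clade
        # if the walk passes through `taxid`.
        while node is not None:
            if node == taxid:
                total += count
                break
            node = parent.get(node)
    return total


def reclassify(start_taxid, kmer_hits, parent, cutoff):
    """Return (taxid, CS) of the first node on the way up that meets the cutoff."""
    m = sum(kmer_hits.values())  # all queried k-mers (M)
    node = start_taxid
    while node is not None:
        cs = clade_hits(node, kmer_hits, parent) / m if m else 0.0
        if cs >= cutoff:
            return node, cs
        node = parent.get(node)  # try the parent of the current node
    return 0, 0.0  # cutoff not met even at the root: unclassified


# Toy taxonomy (assumed): 1 is the root, 561 its child, 562 a child of 561.
parent = {1: None, 561: 1, 562: 561}
# k-mer hits per taxid; taxid 0 stands for a k-mer with no database hit.
kmer_hits = {562: 16, 561: 4, 0: 1}

for cutoff in (0.1, 0.9, 0.99):
    print(cutoff, reclassify(562, kmer_hits, parent, cutoff))
```

With these toy numbers the read stays at 562 for a cutoff of 0.1 (16/21), moves up to 561 for 0.9 (20/21), and becomes unclassified for 0.99, since the k-mer without a database hit keeps the CS at the root below 1.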

Reclassifying Kraken 2 output

To reclassify reads classified by Kraken 2 with a confidence cutoff of 0.1:

stringmeup --names <names.dmp> --nodes <nodes.dmp> 0.1 <original_classifications.kraken2>

Where:

original_classifications.kraken2 is the output file from Kraken 2 that contains the read-by-read classifications.
names.dmp and nodes.dmp are the same NCBI taxonomy files that were used to build the database that produced the classifications in original_classifications.kraken2.

This command would output a Kraken 2 style report to stdout. Adding --output_report <FILE> would save the report in a file.

To save the read-by-read classifications, add --output_classifications <FILE> to the command.

To save a verbose version of the read-by-read classifications, add --output_verbose <FILE> to the command. The verbose version of the read-by-read classifications will contain the following columns:

Column	Explanation
READ_ID	The ID of the read
READ_LENGTH	The length of the read (same as Kraken 2 output)
MINIMIZER_HIT_GROUPS*	The number of minimizer hit groups found during Kraken 2 classification*
TAX_LVL_MOVES	The number of taxonomic levels that the read moved during reclassification
ORIGINAL_TAXID	The taxID that the read was classified to originally
NEW_TAXID	The taxID that the read was reclassified to
ORIGINAL_CONFIDENCE	The original confidence score
NEW_CONFIDENCE	The confidence score at the taxID that the read was reclassified to
MAX_CONFIDENCE	The maximum confidence that the read can have
ORIGINAL_TAX_LVL	The taxonomic rank of the originally classified taxID
NEW_TAX_LVL	The taxonomic rank of the reclassified taxID
ORIGINAL_NAME	The scientific name of the original taxID
NEW_NAME	The scientific name of the reclassified taxID
KMER_STRING	The k-mer string (same as Kraken 2 output)

*: Only present if the forked version of Kraken 2 was used for the initial classification.

Reclassifying with minimum hit groups

This option requires an input file that was produced with my fork of Kraken 2.

Add --minimum_hit_groups <INT> to the command. A read can only be considered classified if the number of minimizer hit groups is at or above the minimum_hit_groups setting.
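One way to picture how this setting combines with the confidence cutoff is the small sketch below. The function and its parameter names are hypothetical, and the exact order in which StringMeUp applies the two checks is an assumption here.

```python
def passes_filters(confidence, hit_groups, cutoff, minimum_hit_groups):
    """A read counts as classified only if both thresholds are met (assumed logic)."""
    return hit_groups >= minimum_hit_groups and confidence >= cutoff


# A read with a high confidence score but too few minimizer hit groups
# would still end up unclassified under this rule.
print(passes_filters(confidence=0.8, hit_groups=2, cutoff=0.1, minimum_hit_groups=3))
```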