This tool is designed to classify metagenomic sequences (marker genes, genomes and amplicon reads) using a hierarchical taxonomic classifier.
Please also check the wiki for more information.
Dependencies
The stag classifier requires:
If you have conda, you can install all the dependencies in conda_env_stag.yaml. See the Installation wiki for more info.
Installation
Note: in the following examples we assume that the python script stag is in the system path.
Execution
Taxonomically annotate gene sequences
Given a fasta file (let’s say unknown_seq.fasta), you can find the taxonomy annotation of these sequences using:
The output is:
You can either create a database (see Create a database), or use one that we already compiled:
Taxonomically annotate genomes
Given a fasta file (let’s say unknown_genome.fasta), you can find the taxonomy annotation of this genome with:
The output is saved in the directory res_dir. Inside you will find the file genome_annotation, with the annotation in the same format as in the gene classification. More information on the other files can be found here.
To classify multiple genomes, you can use:
Here, all/genomes/dir is a directory; every fasta file inside it will be classified.
Finally, you can find some databases to classify genomes (gtdb_30.stagDB in the examples) here.
Schematic depiction of the STAG workflow.
(a) Example taxonomic tree alongside thirteen (partial) 16S sequences in a multiple sequence alignment (MSA). Four positions in the MSA are highlighted for the information they contain to distinguish the different clades shown. For example, at position 455, a ‘G’ distinguishes Enterobacteriaceae from Erwiniaceae. To leverage this information, STAG trains one LASSO logistic regression classifier for each node in the tree; coefficients corresponding to aligned bases are shown in b, c and d. For example, a ‘C’ at position 648 is learnt to facilitate discrimination of Escherichia coli from Escherichia albertii.
To annotate a new sequence, it is first aligned to the MSA constructed during training. Second, the sequence is classified along the tree, following the path with the highest posterior probabilities (as returned by the node classifiers). Finally, the taxonomic lineage of the new sequence is inferred from the probabilities accrued in the previous step; this in particular entails the decision of which ranks not to assign.
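The annotation steps described above can be illustrated with a small toy sketch in Python. It is not STAG's implementation: the tree, the coefficients, the function names (node_probability, classify, trim) and the fixed probability cutoff are all assumptions made up for illustration, and the genus level is skipped for brevity. It only shows the general idea of one logistic classifier per node over aligned bases, a greedy descent along the most probable path, and a final decision about which ranks not to assign.

```python
import math

# Hypothetical toy tree: each internal node maps to its child clades.
TREE = {
    "Enterobacterales": ["Enterobacteriaceae", "Erwiniaceae"],
    "Enterobacteriaceae": ["Escherichia coli", "Escherichia albertii"],
}

# Made-up coefficients for (alignment position, base) features, one
# classifier per clade. With LASSO regularisation most coefficients would
# be zero, so only the few informative positions are listed here.
WEIGHTS = {
    "Enterobacteriaceae": {(455, "G"): 2.0},
    "Erwiniaceae": {(455, "G"): -2.0},
    "Escherichia coli": {(648, "C"): 1.5},
    "Escherichia albertii": {(648, "C"): -1.5},
}


def node_probability(clade, aligned):
    """Logistic score of one clade for a sequence given as {position: base}."""
    score = sum(
        coef
        for (pos, base), coef in WEIGHTS[clade].items()
        if aligned.get(pos) == base
    )
    return 1.0 / (1.0 + math.exp(-score))


def classify(aligned, root):
    """Descend the tree, always following the most probable child."""
    lineage = []
    node = root
    while node in TREE:
        probs = {child: node_probability(child, aligned) for child in TREE[node]}
        best = max(probs, key=probs.get)
        lineage.append((best, probs[best]))
        node = best
    return lineage


def trim(lineage, cutoff=0.6):
    """Keep ranks until the first one below the cutoff (illustrative rule only)."""
    kept = []
    for clade, prob in lineage:
        if prob < cutoff:
            break
        kept.append(clade)
    return kept


if __name__ == "__main__":
    confident = classify({455: "G", 648: "C"}, "Enterobacterales")
    ambiguous = classify({455: "G", 648: "T"}, "Enterobacterales")
    print(trim(confident))
    print(trim(ambiguous))
```

In the second call the base at position 648 carries no signal for either species, so the sketch stops at the family level rather than assigning a species. The fixed cutoff is only a stand-in for however STAG actually makes that decision.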