CONSTAX (CONSensus TAXonomy) is a tool, written in Python 3, for improved taxonomic resolution of environmental fungal ITS sequences. Briefly, CONSTAX compares the taxonomic classifications obtained from RDP Classifier, UTAX or BLAST, and SINTAX, and merges them into an improved consensus taxonomy using a 2-out-of-3 rule (e.g., if an OTU is classified as taxon A by RDP and UTAX/BLAST and as taxon B by SINTAX, taxon A is used in the consensus taxonomy), with the classification p-value breaking ties (e.g., when three different classifications are obtained for the same OTU). The tool also produces summary classification outputs that are useful for downstream analyses. In summary, our results demonstrate that independent taxonomy assignment tools classify unique members of the fungal community, and that greater classification power (the proportion of operational taxonomic units assigned at a given taxonomic rank) is realized by generating a consensus taxonomy from the available classifiers with CONSTAX.
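The 2-out-of-3 rule and its tie-breaker can be illustrated with a small sketch. This is not the CONSTAX source code: the function name `consensus`, the (taxon, score) tuple layout, and the example calls are assumptions made for illustration, and the score stands in for whatever value each classifier reports.

```python
from collections import Counter

def consensus(calls):
    """Pick a consensus taxon from (taxon, score) pairs, one per classifier.

    A taxon named by at least two of the three classifiers wins; if all
    three disagree, the call with the highest score is used instead.
    """
    counts = Counter(taxon for taxon, _ in calls)
    taxon, votes = counts.most_common(1)[0]
    if votes >= 2:
        return taxon
    # No majority: fall back to the single most confident call.
    return max(calls, key=lambda call: call[1])[0]

# Hypothetical calls for one OTU at one rank, in the order RDP, BLAST, SINTAX.
majority = consensus([("Ascomycota", 0.95), ("Ascomycota", 0.88), ("Basidiomycota", 0.91)])
tie_break = consensus([("Mortierella", 0.70), ("Linnemannia", 0.93), ("Podila", 0.81)])
print(majority)   # Ascomycota
print(tie_break)  # Linnemannia
```

In the first call two classifiers agree, so their taxon is kept even though the dissenting call has a higher score than one of them. In the second call no two classifiers agree, so the highest-scoring call decides.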
CONSTAX v.2 improves upon v.1 with the following features:
Updated software requirements, including Python 3 and Java 8.
Compatibility with SILVA-formatted databases
Streamlined command-line implementation
BLAST classification option, due to legacy status of UTAX
git clone https://github.com/rdpstaff/RDPTools.git
cd RDPTools
git submodule init AlignmentTools ReadSeq classifier TaxonomyTree
git submodule update
sed -i 's/1.5/1.6/' AlignmentTools/nbproject/project.properties ReadSeq/nbproject/project.properties classifier/nbproject/project.properties
sed -i 's/basedir="."/basedir="." xmlns:unless="ant:unless"/' classifier/build.xml
sed -i 's/name="download-traindata" unless="offline"/name="download-traindata" unless="skip_td_download"/' classifier/build.xml
sed -i 's+move file="${dist.dir}/data.tgz"+move unless:set="skip_td_download" file="${dist.dir}/data.tgz"+' classifier/build.xml
cd classifier
ant jar -Dskip_td_download=true
cp dist/classifier.jar ../
BLAST installation
From Bioconda
conda install -c bioconda blast
From NCBI
Download the BLAST executables from here. The ncbi-blast-<version>+-x64-<system>64.tar.gz file works fine.
Unzip with tar -xzvf ncbi-blast-<version>+-x64-<system>64.tar.gz (replace version and system).
Add the blastn and makeblastdb executables to your path. You can do this by moving them to your bin directory.
CONSTAX installation
Clone the repository: git clone https://github.com/liberjul/CONSTAXv2.git
Make constax.sh executable.
cd CONSTAXv2
chmod +x constax.sh
ln -s constax.sh constax
Datasets
Depending on where your sequences originate, you will need an appropriate database with which to classify them.
For Fungi or all Eukaryotes, the UNITE database is preferred.
For Bacteria and Archaea, we recommend the SILVA database. The SILVA_XXX_SSURef_tax_silva.fasta.gz file can be gunzip-ped and used.
Note: SILVA taxonomy is not assigned by Linnaean ranks (Kingdom, Phylum, etc.), so placeholder ranks 1-n are used instead. Also, the size of the SILVA database means that a server or cluster is required to train the classifier (128GB RAM for RDP). If you have a computer with 32GB of RAM, you may be able to train using the UNITE database. If you cannot train locally for UNITE, the RDP files can be downloaded from here. The genus_wordConditionalProbList.txt.gz file should be gunzip-ped after downloading.
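To make the idea of placeholder ranks concrete, here is a minimal sketch that splits a SILVA-style header (an accession followed by a semicolon-delimited lineage) into numbered ranks. The function name `silva_ranks`, the `Rank_N` labels, and the example header are assumptions for illustration; CONSTAX may name and store the ranks differently.

```python
def silva_ranks(header):
    """Split a SILVA-style FASTA header into an accession and numbered ranks."""
    accession, _, lineage = header.lstrip(">").partition(" ")
    # Drop empty fields left by a trailing semicolon.
    levels = [level for level in lineage.split(";") if level]
    ranks = {f"Rank_{i}": level for i, level in enumerate(levels, start=1)}
    return accession, ranks

# Hypothetical header; real SILVA lineages vary in depth from record to record.
accession, ranks = silva_ranks(
    ">AB000001.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales"
)
print(accession)        # AB000001.1.1500
print(ranks["Rank_2"])  # Proteobacteria
```

Because the lineage depth differs between records, the number of placeholder ranks (the n in 1-n) depends on the deepest lineage in the database.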
We have included a script for filtering the databases, which can create a Bacteria-only database, for example. The -k or --keyword argument is a substring of the record header.
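The filtering step amounts to keeping every FASTA record whose header contains the keyword. The sketch below shows that logic only; it is not the included script, and the function name `select_by_keyword` and the example records are assumptions.

```python
def select_by_keyword(lines, keyword):
    """Yield the lines of FASTA records whose header contains the keyword."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = keyword in line
        if keep:
            yield line

# Hypothetical records; sequence lines follow their header until the next ">".
fasta = [
    ">seq1 k__Fungi;p__Ascomycota\n", "ACGT\n",
    ">seq2 k__Bacteria\n", "GGCC\n", "TTAA\n",
]
filtered = list(select_by_keyword(fasta, "Bacteria"))
print("".join(filtered), end="")
```

Since the match is a plain substring test on the header, a keyword that also appears in accessions or species names would match those records too.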
usage: constax [-h] [-c CONF] [-n NUM_THREADS] [-m MHITS] [-e EVALUE] [-p P_IDEN] [-d DB] [-f TRAINFILE] [-i INPUT] [-o OUTPUT] [-x TAX] [-t] [-b]
[--select_by_keyword SELECT_BY_KEYWORD] [-s] [--make_plot] [--check] [--mem MEM] [--sintax_path SINTAX_PATH] [--utax_path UTAX_PATH]
[--rdp_path RDP_PATH] [--constax_path CONSTAX_PATH] [--pathfile PATHFILE] [--isolates ISOLATES] [--isolates_query_coverage ISOLATES_QUERY_COVERAGE]
[--isolates_percent_identity ISOLATES_PERCENT_IDENTITY] [--high_level_db HIGH_LEVEL_DB] [--high_level_query_coverage HIGH_LEVEL_QUERY_COVERAGE]
[--high_level_percent_identity HIGH_LEVEL_PERCENT_IDENTITY] [-v]
optional arguments:
-h, --help show this help message and exit
-c CONF, --conf CONF Classification confidence threshold (default: 0.8)
-n NUM_THREADS, --num_threads NUM_THREADS
Number of threads to use for parallel computing steps (default: 1)
-m MHITS, --mhits MHITS
Maximum number of BLAST hits to use, for use with -b option (default: 10)
-e EVALUE, --evalue EVALUE
Maximum expect value of BLAST hits to use, for use with -b option (default: 1.0)
-p P_IDEN, --p_iden P_IDEN
Minimum proportion identity of BLAST hits to use, for use with -b option (default: 0.0)
-d DB, --db DB Database to train classifiers, in FASTA format (default: )
-f TRAINFILE, --trainfile TRAINFILE
Path to which training files will be written (default: ./training_files)
-i INPUT, --input INPUT
Input file in FASTA format containing sequence records to classify (default: otus.fasta)
-o OUTPUT, --output OUTPUT
Output directory for classifications (default: ./outputs)
-x TAX, --tax TAX Directory for taxonomy assignments (default: ./taxonomy_assignments)
-t, --train Complete training if specified (default: False)
-b, --blast Use BLAST instead of UTAX if specified (default: False)
--select_by_keyword SELECT_BY_KEYWORD
Takes a keyword argument and --input FASTA file to produce a filtered database with headers containing the keyword with name --output (default: False)
-s, --conservative If specified, use conservative consensus rule (2 False = False winner) (default: False)
--consistent If specified, show if the consensus taxonomy is consistent with the real hierarchical taxonomy (default: False)
--make_plot If specified, run R script to make plot of classified taxa (default: False)
--check If specified, runs checks but stops before training or classifying (default: False)
--mem MEM Memory available to use for RDP, in MB. 32000MB recommended for UNITE, 128000MB for SILVA (default: 32000)
--sintax_path SINTAX_PATH
Path to USEARCH/VSEARCH executable for SINTAX classification (default: False)
--utax_path UTAX_PATH
Path to USEARCH executable for UTAX classification (default: False)
--rdp_path RDP_PATH Path to RDP classifier.jar file (default: False)
--constax_path CONSTAX_PATH
Path to CONSTAX scripts (default: False)
--pathfile PATHFILE File with paths to SINTAX, UTAX, RDP, and CONSTAX executables (default: pathfile.txt)
--isolates ISOLATES FASTA formatted file of isolates to use BLAST against (default: False)
--isolates_query_coverage ISOLATES_QUERY_COVERAGE
Threshold of sequence query coverage to report isolate matches (default: 75)
--isolates_percent_identity ISOLATES_PERCENT_IDENTITY
Threshold of aligned sequence percent identity to report isolate matches (default: 1)
--high_level_db HIGH_LEVEL_DB
FASTA database file of representative sequences for assignment of high level taxonomy (default: False)
--high_level_query_coverage HIGH_LEVEL_QUERY_COVERAGE
Threshold of sequence query coverage to report high-level taxonomy matches (default: 75)
--high_level_percent_identity HIGH_LEVEL_PERCENT_IDENTITY
Threshold of aligned sequence percent identity to report high-level taxonomy matches (default: 1)
-v, --version Display version and exit (default: False)
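When CONSTAX is driven from a pipeline, the options listed in the help text can be assembled into an argument list. The sketch below only builds and prints the command and does not run it. The helper name `constax_command` and the database file name are assumptions; the flags themselves are taken from the help text.

```python
import shlex

def constax_command(db, input_fasta="otus.fasta", train=False, blast=True,
                    conf=0.8, num_threads=1):
    """Assemble a constax argument list from options shown in the help text."""
    args = ["constax", "--db", db, "--input", input_fasta,
            "--conf", str(conf), "--num_threads", str(num_threads)]
    if train:
        args.append("--train")  # needed the first time a database is used
    if blast:
        args.append("--blast")  # use BLAST instead of UTAX
    return args

# "unite_db.fasta" is a placeholder name for a downloaded database file.
cmd = constax_command("unite_db.fasta", train=True)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a system where `constax` is on the path.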
If using a database for the first time, you will need to use the -t or --train flag to train the classifiers on the dataset.
In the directory with your OTU/zOTU/ASV/ESV FASTA file:
The classification results are in the output directory. The file consensus_taxonomy.txt can be read into R for microbiome analysis.
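The same file can also be loaded outside of R. The sketch below assumes the table is tab-delimited with a header row, which should be checked against your own output. The column names in the demonstration file are invented, and `read_taxonomy` is a hypothetical helper.

```python
import csv
import os
import tempfile

def read_taxonomy(path, delimiter="\t"):
    """Load a delimited taxonomy table into a list of per-OTU dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))

# Demonstration on a small made-up table; the column names are assumptions.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "consensus_taxonomy.txt")
    with open(path, "w", newline="") as handle:
        handle.write("OTU_ID\tKingdom\tPhylum\nOTU_1\tFungi\tAscomycota\n")
    rows = read_taxonomy(path)

print(rows[0]["Phylum"])  # Ascomycota
```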
-c, --conf=0.8
Classification confidence threshold, used by each classifier (0,1]. Increase for improved specificity, reduced sensitivity.
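The trade-off can be seen in a small sketch. It assumes that an assignment is cut at the first rank whose confidence falls below the threshold, which is one common way to apply such a cutoff and may differ in detail from what each classifier does. The function name `apply_threshold` and the example lineage are made up.

```python
def apply_threshold(lineage, conf=0.8):
    """Keep ranks until the first one whose confidence falls below conf."""
    kept = []
    for rank, taxon, score in lineage:
        if score < conf:
            break
        kept.append((rank, taxon))
    return kept

# Hypothetical per-rank confidences for a single OTU.
lineage = [
    ("Kingdom", "Fungi", 1.0),
    ("Phylum", "Ascomycota", 0.97),
    ("Class", "Sordariomycetes", 0.85),
    ("Order", "Hypocreales", 0.62),
    ("Family", "Nectriaceae", 0.40),
]
default_cut = apply_threshold(lineage)           # assigned down to Class
strict_cut = apply_threshold(lineage, conf=0.9)  # assigned down to Phylum only
print(len(default_cut), len(strict_cut))         # 3 2
```

Raising the threshold from 0.8 to 0.9 drops the Class assignment: fewer ranks are assigned, but those that remain are better supported.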
-n, --num_threads=1
Number of threads to use for parallelization. Maximum classification speed at about 32 threads. Training only uses 1 thread.
-m, --mhits=10
Maximum number of BLAST hits to use, for use with -b option. When classifying with BLAST, this many hits are kept. Confidence for a given taxon is based on the proportion of these hits that agree with that taxon. 5 works well for UNITE, 20 with SILVA (standard, not NR).
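As a rough illustration of this proportion-based confidence, the sketch below keeps the first `mhits` hits and reports the most common taxon along with its share of the kept hits. It simplifies to a single rank; the function name `blast_confidence` and the hit list are assumptions, not CONSTAX internals.

```python
from collections import Counter

def blast_confidence(hit_taxa, mhits=10):
    """Return the most common taxon among the top hits and its share of them."""
    kept = hit_taxa[:mhits]
    if not kept:
        return None, 0.0
    taxon, count = Counter(kept).most_common(1)[0]
    return taxon, count / len(kept)

# Hypothetical genus labels of BLAST hits, best hit first.
hits = ["Fusarium", "Fusarium", "Fusarium", "Gibberella", "Fusarium", "Nectria"]
taxon, confidence = blast_confidence(hits, mhits=5)
print(taxon, confidence)  # Fusarium 0.8
```

With `mhits=5` the sixth hit is ignored and four of the five kept hits agree, giving 0.8. A larger `mhits` brings in more distant hits, which tends to lower the proportion for any one taxon.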
-e, --evalue=1
Maximum expect value of BLAST hits to use, for use with -b option. When classifying with BLAST, only hits under this expect value threshold are used. Decreasing will increase specificity, but decrease sensitivity at high taxonomic ranks.
-p, --p_iden=0.8
Minimum proportion identity of BLAST hits to use, for use with -b option. A hit is kept only if at least this proportion of its bases is conserved.
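The expect-value and identity cutoffs act as a filter applied to the hits before they are used. A minimal sketch, assuming each hit carries an expect value and an identity expressed as a proportion between 0 and 1; the `Hit` type, the `filter_hits` function, and the example values are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    subject: str
    evalue: float
    p_iden: float  # proportion of identical bases, 0 to 1

def filter_hits(hits, max_evalue, min_p_iden):
    """Keep hits at or under the expect-value cutoff and at or above the identity cutoff."""
    return [h for h in hits if h.evalue <= max_evalue and h.p_iden >= min_p_iden]

# Hypothetical hits for one query sequence.
hits = [
    Hit("ref_A", 1e-50, 0.99),
    Hit("ref_B", 1e-20, 0.75),  # fails the identity cutoff
    Hit("ref_C", 5.0, 0.95),    # fails the expect-value cutoff
]
kept = filter_hits(hits, max_evalue=1.0, min_p_iden=0.8)
print([h.subject for h in kept])  # ['ref_A']
```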
-d, --db
Database to train classifiers. UNITE and SILVA formats are supported. See Datasets.
-f, --trainfile=./training_files
Path to which training files will be written.
-i, --input=otus.fasta
Input file in FASTA format containing sequence records to classify.
-o, --output=./outputs
Output directory for classifications.
-x, --tax=./taxonomy_assignments
Directory for taxonomy assignments.
-t, --train
Complete training if specified. Cannot run classification without training files present.
-b, --blast
Use BLAST instead of UTAX if specified. If installed with conda, this is the option that will work by default. UTAX is available from USEARCH. BLAST classification generally performs better, with faster training, similar classification speed, and greater accuracy.
--select_by_keyword
Takes a keyword argument and --input FASTA file to produce a filtered database with headers containing the keyword with name --output. Helpful for limiting search database to Bacteria, Archaea, Fungi, or other group.
--conservative
If specified, use conservative consensus rule (2 null = null winner). This option works better with SILVA.
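One reading of "2 null = null winner" is that when two of the three classifiers return no assignment, the consensus is left unassigned, whereas without the flag the single remaining call is kept. The sketch below encodes that reading only. It is an interpretation, not the CONSTAX implementation, and `consensus_with_nulls` is a hypothetical name.

```python
from collections import Counter

def consensus_with_nulls(calls, conservative=False):
    """Resolve (taxon, score) calls where taxon is None for 'no assignment'."""
    assigned = [(taxon, score) for taxon, score in calls if taxon is not None]
    nulls = len(calls) - len(assigned)
    if not assigned or (conservative and nulls >= 2):
        return None
    taxon, votes = Counter(t for t, _ in assigned).most_common(1)[0]
    if votes >= 2:
        return taxon
    # No majority among the assigned calls: take the highest-scoring one.
    return max(assigned, key=lambda call: call[1])[0]

# Hypothetical case: only one classifier makes a call at this rank.
calls = [("Fusarium", 0.9), (None, 0.0), (None, 0.0)]
print(consensus_with_nulls(calls))                     # Fusarium
print(consensus_with_nulls(calls, conservative=True))  # None
```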
--make_plot
If specified, run R script to make plot of classified taxa.
--check
If specified, runs checks but stops before training or classifying.
--mem
Memory available to use for RDP, in MB. 32000MB recommended for UNITE, 128000MB for SILVA.
--sintax_path
Path to USEARCH/VSEARCH executable for SINTAX classification. Can also be vsearch if already on path.
--utax_path
Path to USEARCH executable for UTAX classification.
--rdp_path
Path to RDP classifier.jar file, or classifier if on path from RDPTools conda install.
--constax_path
Path to CONSTAX scripts.
--pathfile
File with paths to SINTAX, UTAX, RDP, and CONSTAX executables.
--isolates
FASTA formatted file of isolates to use BLAST against.
Developed by
CONSTAX v.1 was authored by:
Documentation
See constax.readthedocs.io
Dependencies
Installation
One step, for Linux, WSL, or MacOS systems
This uses conda installations, including vsearch instead of usearch. You can modify SINTAXPATH in pathfile.txt if you have a usearch installation.
Custom installation, and installation for Windows
USEARCH installation
gunzip usearch<version>.gz to unzip.
chmod +x usearch<version> to make the file executable.
RDP installation
Running CONSTAX
--high_level_db
FASTA database file of representative sequences for assignment of high-level taxonomy, for example the SILVA NR99 database for SSU/16S/18S sequences or the UNITE Eukaryotes database.