Contig Annotation Tool (CAT) and Bin Annotation Tool (BAT) are pipelines for the taxonomic classification of long DNA sequences and metagenome assembled genomes (MAGs / bins) of both known and (highly) unknown microorganisms, as generated by contemporary metagenomics studies. The core algorithm of both programs involves gene calling, mapping of predicted ORFs against a protein database, and voting-based classification of the entire contig / MAG based on classification of the individual ORFs. CAT and BAT can be run from intermediate steps if files are formatted appropriately.
A paper describing the algorithm together with extensive benchmarks can be found at https://doi.org/10.1186/s13059-019-1817-x. If you use CAT or BAT in your research, it would be great if you could cite us:
von Meijenfeldt F.A.B., Arkhipova K., Cambuy D.D., Coutinho F.H., Dutilh B.E. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biology. 2019;20:217.
Read Annotation Tool (RAT) estimates the taxonomic composition of metagenomes using CAT and BAT output. A manuscript describing RAT with benchmarks can be found at https://doi.org/10.1038/s41467-024-47155-1. If you use RAT in your research, it would be great if you could cite:
Hauptfeld, E., Pappas, N., van Iwaarden, S., Snoek B.L., Aldas-Vargas A., Dutilh B.E., von Meijenfeldt F.A.B. Integrating taxonomic signals from MAGs and contigs improves read annotation and taxonomic profiling of metagenomes. Nature Communications 15, 3373 (2024).
von Meijenfeldt F.A.B., Arkhipova K., Cambuy D.D., Coutinho F.H., Dutilh B.E. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biology. 2019;20:217.
CAT, BAT, and RAT have been thoroughly tested on Linux systems, and should run on macOS as well.
Installation
No installation is required. You can run CAT, BAT and RAT by supplying the absolute path:
$ ./CAT_pack/CAT_pack --help
Alternatively, if you add the files in the CAT_pack directory to your $PATH variable, you can run CAT, BAT, and RAT from anywhere:
$ CAT_pack --version
Getting started
To get started with CAT/BAT/RAT, you will have to get the database files on your system.
You can either download preconstructed database files, or generate them yourself.
Downloading preconstructed database files
To download the database files, find the most recent version on tbb.bio.uu.nl/tina/CAT_pack_prepare/, download and extract, and you are ready to go!
For NCBI nr:
$ wget tbb.bio.uu.nl/tina/CAT_pack_prepare/20240422_CAT_nr.tar.gz
$ tar -xvzf 20240422_CAT_nr.tar.gz
For GTDB:
$ wget tbb.bio.uu.nl/tina/CAT_pack_prepare/20231120_CAT_gtdb.tar.gz # release 214
$ tar -xvzf 20231120_CAT_gtdb.tar.gz
Creating a fresh NCBI nr or GTDB database yourself
Instead of using the preconstructed database, you can construct a fresh database yourself. The download module can be used to download and process raw data, in preparation for building a new CAT pack database.
This will ensure that all input dependencies are met and correctly formatted for CAT_pack prepare.
Currently, two databases are supported, NCBI’s nr and the Genome Taxonomy Database (GTDB) proteins.
NCBI non-redundant protein database (nr)
$ CAT_pack download --db nr -o path/to/nr_data_dir
This will download the fasta file with the protein sequences, their mapping to a taxid, and the taxonomy information from NCBI’s ftp site.
Genome Taxonomy Database (GTDB) proteins
The files required to build a CAT pack database are provided by the GTDB downloads page.
CAT_pack download fetches the necessary files and does some additional processing to get them ready for CAT_pack prepare:
The taxonomy information from GTDB is transformed into NCBI style nodes.dmp and names.dmp files.
Protein sequences are extracted from gtdb_proteins_aa_reps.tar.gz and are subjected to a round of deduplication.
The deduplication reduces the redundancy in the DIAMOND database, thus simplifying the alignment process.
Exact duplicate sequences are identified based on a combination of the MD5sum of the protein sequences and their length.
Only one representative sequence is kept, with all duplicates encoded in the fasta header.
This information is later used by CAT_pack prepare to assign the LCA of the protein sequence appropriately in the .fastaid2LCAtaxid file.
The mapping of all protein sequences to their respective taxonomy is created.
In addition, the newick formatted trees of Bacteria and Archaea are downloaded and - artificially - concatenated under a single root node, to produce an all.tree file.
This file is not used by the CAT pack but may come in handy for downstream analyses.
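The deduplication step described above can be sketched as follows. This is a minimal illustration, not the code used by CAT_pack download: the function name, the sample sequences, and the way duplicate identifiers are joined into the header are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def deduplicate(records):
    """Collapse exact duplicate proteins.

    records: iterable of (sequence_id, amino_acid_sequence) tuples.
    Returns a list of (representative_id, duplicate_ids, sequence).
    """
    groups = defaultdict(list)
    sequences = {}
    for seq_id, seq in records:
        # Key on the MD5 digest plus the sequence length, as described above.
        key = (hashlib.md5(seq.encode("ascii")).hexdigest(), len(seq))
        groups[key].append(seq_id)
        sequences.setdefault(key, seq)
    # The first identifier seen for each key acts as the representative.
    return [(ids[0], ids[1:], sequences[key]) for key, ids in groups.items()]


records = [
    ("protA", "MKTAYIAKQR"),
    ("protB", "MKTAYIAKQR"),  # exact duplicate of protA
    ("protC", "MSLNFLDFEQ"),
]

for representative, duplicates, seq in deduplicate(records):
    # Hypothetical header layout; the real CAT_pack header format may differ.
    header = " ".join([representative] + duplicates)
    print(f">{header}\n{seq}")
```

Keeping the duplicate identifiers next to the representative is what later allows an LCA to be assigned over every taxon that shares the sequence.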
When the download and processing of the files is finished successfully you can build a CAT pack database with CAT_pack prepare.
For all command line options available see
$ CAT_pack download -h
and
$ CAT_pack prepare -h
Creating a custom database
For a custom CAT pack database, you must have the following input ready before you launch a CAT_pack prepare run.
A fasta file containing all protein sequences you want to include in your database.
A names.dmp file that contains mappings of taxids to their ranks and scientific names.
The format must be the same as the NCBI standard names.dmp (uses \t|\t as field separator).
An example looks like this:
1 | root | scientific name |
2 | Bacteria | scientific name |
562 | Escherichia coli | scientific name |
A nodes.dmp file that describes the child-parent relationship of the nodes in the taxonomy tree and their (official) rank.
The format must be the same as the NCBI standard nodes.dmp (uses \t|\t as the field separator).
An example looks like this:
1 | 1 | root |
2 | 1 | superkingdom |
1224 | 2 | phylum |
1236 | 1224 | class |
91347 | 1236 | order |
543 | 91347 | family |
561 | 543 | genus |
562 | 561 | species |
A 2-column, tab-separated file containing the mapping of each sequence in the fasta file to a taxid in the taxonomy.
This file must contain the header accession.version taxid.
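Before running CAT_pack prepare on custom files, it can help to check that every node in nodes.dmp can be walked back to the root. The sketch below assumes only the \t|\t field separator described above; the helper names are hypothetical and the embedded rows are a small illustrative taxonomy, not a full taxdump.

```python
NODES = (
    "1\t|\t1\t|\troot\t|\n"
    "2\t|\t1\t|\tsuperkingdom\t|\n"
    "1224\t|\t2\t|\tphylum\t|\n"
    "1236\t|\t1224\t|\tclass\t|\n"
    "91347\t|\t1236\t|\torder\t|\n"
    "543\t|\t91347\t|\tfamily\t|\n"
    "561\t|\t543\t|\tgenus\t|\n"
    "562\t|\t561\t|\tspecies\t|\n"
)


def parse_dmp_line(line):
    """Split one NCBI-style .dmp line into its fields."""
    line = line.rstrip("\n")
    # Each line ends with a trailing "\t|" that is not part of the last field.
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(lines):
    """Return (child -> parent, taxid -> rank) dictionaries."""
    parent, rank = {}, {}
    for line in lines:
        taxid, parent_id, node_rank = parse_dmp_line(line)[:3]
        parent[taxid] = parent_id
        rank[taxid] = node_rank
    return parent, rank


def lineage(taxid, parent):
    """Walk from a taxid up to the root and return the root-first path."""
    path = [taxid]
    while parent[taxid] != taxid:  # the root is its own parent
        taxid = parent[taxid]
        path.append(taxid)
    return path[::-1]


parent, rank = load_nodes(NODES.splitlines())
print(";".join(lineage("562", parent)))
```

A child that points at a parent taxid without a row of its own raises a KeyError in lineage, which makes mistyped taxids easy to spot.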
Once all of the above requirements are met you can run CAT_pack prepare.
All the input needs to be explicitly specified for CAT_pack prepare to work, for example:
The two subdirs db and tax are created that contain all necessary files.
The nodes.dmp and names.dmp in the tax directory are copied from their original location.
This is to ensure that the -t flag of CAT, BAT, and RAT works.
The default prefix is <YYYY-MM-DD>_CAT_pack. You can customize it with the --common_prefix option.
For all command line options available see
$ CAT_pack prepare -h
Running CAT/BAT/RAT.
The database files are needed in subsequent CAT/BAT/RAT runs. They only need to be generated/downloaded once or whenever you want to update the database.
To run CAT/BAT/RAT, respectively:
$ CAT_pack contigs # Runs CAT.
$ CAT_pack bins # Runs BAT.
$ CAT_pack reads # Runs RAT.
Getting help.
If you are unsure what options a program has, you can always add --help to a command. This is a great way to get you started with CAT, BAT, or RAT.
If you are unsure about what input files are required, you can just run CAT/BAT/RAT, as the appropriate error messages are generated if formatting is incorrect.
Taxonomic annotation of contigs or MAGs with CAT and BAT
After you have got the database files on your system, you can run CAT to annotate your contig set:
Multiple output files and a log file will be generated. The final classification files will be called out.CAT.ORF2LCA.txt and out.CAT.contig2classification.txt.
Alternatively, if you already have a predicted proteins fasta file and/or an alignment table for example from previous runs, you can supply them to CAT, which will then skip the steps that have already been done and start from there:
The headers in the predicted proteins fasta file must look like this: >{contig}_{ORFnumber}, so that CAT can couple contigs to ORFs. The alignment file must be tab-separated, with the queried ORF in the first column, the protein accession number in the second, and the bit-score in the 12th.
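To check your own files against these two requirements, something like the following may be useful. It is a sketch under the stated layout only: the function names are hypothetical, the accession in the sample line is made up, and the columns between the second and the 12th are filled with placeholder values.

```python
def contig_of(orf_id):
    """Return the contig part of an ORF identifier of the form {contig}_{ORFnumber}."""
    # Split on the last underscore only, since contig names may contain underscores.
    contig, _orf_number = orf_id.rsplit("_", 1)
    return contig


def parse_alignment_line(line):
    """Extract (ORF, accession, bit-score) from one tab-separated alignment line."""
    fields = line.rstrip("\n").split("\t")
    # Column 1 is the queried ORF, column 2 the accession, column 12 the bit-score.
    return fields[0], fields[1], float(fields[11])


# Illustrative 12-column line; the accession is not a real database entry.
line = "contig_1_ORF1\tWP_000000001.1\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t574.7"

orf, accession, bitscore = parse_alignment_line(line)
print(contig_of(orf), accession, bitscore)
```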
Multiple output files and a log file will be generated. The final classification files will be called out.BAT.ORF2LCA.txt and out.BAT.bin2classification.txt.
Similarly to CAT, BAT can be run from intermediate steps if gene prediction and alignment have already been carried out once:
If you have previously run CAT on the set of contigs from which the MAGs originate, you can use the previously predicted protein and alignment files to classify the MAGs.
$ CAT_pack contigs -c {contigs fasta} -d {database folder} -t {taxonomy folder}
$ CAT_pack bins -b {bin folder} -d {database folder} -t {taxonomy folder} -p {predicted proteins fasta from contig run} -a {alignment file from contig run}
This is a great way to run both CAT and BAT on a set of MAGs without needing to do protein prediction and alignment twice!
Interpreting the output files
The ORF2LCA output looks like this:
ORF	number of hits (r: 10)	lineage	bit-score
contig_1_ORF1	7	1;131567;2;1783272	574.7
Where the lineage is the full taxonomic lineage of the classification of the ORF, and the bit-score the top-hit bit-score that is assigned to the ORF for voting. The BAT ORF2LCA output file has an extra column where ORFs are linked to the MAG in which they are found.
The contig2classification and bin2classification output looks like this:
Where the lineage scores represent the fraction of bit-score support for each classification. contig_2 has two classifications. This can happen if the f parameter is chosen below 0.5. For an explanation of the starred classification, see Marking suggestive taxonomic assignments with an asterisk.
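The relation between the lineage scores and the f parameter can be illustrated with a simplified voting sketch. This is not the CAT implementation: it assumes each ORF contributes one root-first lineage and one bit-score, that a taxon is kept when its fraction of the summed bit-score is strictly greater than f, and the function name and example values are made up.

```python
from collections import defaultdict


def classify(orf_hits, f=0.5):
    """orf_hits: list of (lineage, bitscore); lineage is a root-first list of taxids."""
    total = sum(score for _, score in orf_hits)
    support = defaultdict(float)
    lineage_of = {}
    for lineage, score in orf_hits:
        # An ORF supports every taxon on its lineage, from the root downwards.
        for depth, taxid in enumerate(lineage):
            support[taxid] += score
            lineage_of[taxid] = lineage[: depth + 1]
    kept = {taxid for taxid, score in support.items() if score / total > f}
    # A kept taxon is a tip when no other kept taxon descends from it.
    ancestors = {a for taxid in kept for a in lineage_of[taxid][:-1]}
    results = []
    for taxid in kept - ancestors:
        path = lineage_of[taxid]
        scores = [round(support[t] / total, 2) for t in path]
        results.append((path, scores))
    return sorted(results)


hits = [
    (["1", "2", "1224"], 600.0),
    (["1", "2", "1239"], 300.0),
    (["1", "2"], 100.0),
]

print(classify(hits, f=0.5))   # one classification
print(classify(hits, f=0.25))  # two classifications
```

With f at 0.5 or higher, two taxa on different branches can never both pass the threshold, because their fractions cannot sum to more than 1. Below 0.5 they can, which is why a contig may receive more than one classification.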
To add names to the taxids in either output file, run:
If you have named a CAT or BAT classification file with official names, you can get a summary of the classification, where total length and number of ORFs supporting a taxon are calculated for contigs, and the number of MAGs per encountered taxon for MAG classification:
CAT_pack summarise currently does not support classification files wherein some contigs / MAGs have multiple classifications (as contig_2 above).
Marking suggestive taxonomic assignments with an asterisk
When we want to confidently go down to the lowest taxonomic level possible for a classification, an important assumption is that on that level conflict between classifications could have arisen. Namely, if there were conflicting classifications, the algorithm would have made the classification more conservative by moving up a level. Since it did not, we can trust the low-level classification. However, it is not always possible for conflict to arise, because in some cases no other sequences from the clade are present in the database. This is true for example for the family Dehalococcoidaceae, which in our databases is the sole representative of the order Dehalococcoidales. Thus, here we cannot confidently state that a classification on the family level is more correct than a classification on the order level. For these cases, CAT and BAT mark the lineage with asterisks, starting from the lowest level classification up to the level where conflict could have arisen because the clade contains multiple taxa with database entries. The user is advised to examine starred taxa more carefully, for example by analysing sequence identity between predicted ORFs and hits, or move up the lineage to a confident classification (i.e. the first classification without an asterisk).
If you do not want the asterisks in your output files, you can add the --no_stars flag to CAT or BAT.
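Moving up to the first classification without an asterisk can be done with a small helper. The sketch assumes that the lineage is a semicolon-separated string and that a starred taxon carries a trailing asterisk; both are assumptions about the output layout, and the lineage used here is illustrative rather than real CAT output.

```python
def confident_lineage(lineage):
    """Trim trailing starred taxa from a semicolon-separated lineage string."""
    taxa = lineage.split(";")
    # Stars run from the lowest level upwards, so only the tail needs trimming.
    while taxa and taxa[-1].endswith("*"):
        taxa.pop()
    return ";".join(taxa)


print(confident_lineage("1;2;1224;1236*;91347*"))
```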
Optimising running time, RAM, and disk usage
CAT and BAT may take a while to run, and may use quite a lot of RAM and disk space. Depending on what you value most, you can tune CAT and BAT to maximize one and minimize others. The classification algorithm itself is fast and is friendly on memory and disk space. The most expensive step is alignment with DIAMOND, hence tuning alignment parameters will have the highest impact:
The -n / --nproc argument allows you to choose the number of cores to deploy.
You can choose to run DIAMOND in sensitive mode with the --sensitive flag. This will increase sensitivity but will make alignment considerably slower.
Setting the --block_size parameter lower will decrease memory and temporary disk space usage. Setting it higher will increase performance.
For high memory machines, it is advised to set --index_chunks to 1 (currently the default). This parameter has no effect on temporary disk space usage.
You can specify the location of temporary DIAMOND files with the --tmpdir argument.
Examples
Getting help for running the prepare utility:
$ CAT_pack prepare --help
Run CAT on a contig set with default parameter settings deploying 16 cores for DIAMOND alignment. Name the contig classification output with official names, and create a summary:
Run BAT on the set of MAGs that was binned from these contigs, reusing the protein predictions and DIAMOND alignment file generated previously during the contig classification:
Run BAT on the set of MAGs with custom parameter settings, suppressing verbosity and not writing a log file. Next, add names to the ORF2LCA output file:
BAT will output any taxonomic signal with at least 1% support. Low scoring diverging signals are clear signs of contamination!
Estimating the microbial composition with RAT
RAT estimates the taxonomic composition of metagenomes by integrating taxonomic signals from MAGs, contigs, and reads. RAT has been added to the CAT pack from version 6.0.
To use RAT, you need the CAT pack database files (see Getting started for more information).
RAT makes an integrated profile using MAGs/bins, contigs, and reads. To specify which elements should be integrated, use the --mode argument. Possible letters for --mode are m (for MAGs), c (for contigs), and r (for reads). All combinations of the three letters are possible, except r alone.
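The rule for --mode can be written down as a small check. This is a sketch of the rule as stated above, not RAT's own argument parsing; it additionally assumes that the order of the letters does not matter and that a letter may not be repeated.

```python
def valid_mode(mode):
    """Return True for any combination of m, c and r except r on its own."""
    letters = set(mode)
    return (
        len(mode) > 0
        and len(letters) == len(mode)  # no repeated letters (assumption)
        and letters <= set("mcr")      # only the three known letters
        and letters != {"r"}           # reads alone are not allowed
    )


for mode in ["mcr", "mc", "cr", "mr", "m", "c", "r"]:
    print(mode, valid_mode(mode))
```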
To run RAT’s complete workflow, specify the mode, read files, contig files, bin folder, and database files:
Currently, RAT supports single read files as well as paired-end read files. Interlaced read files are currently not supported. RAT will run CAT and BAT on the contigs and MAGs, will map the reads back to the contigs, and then try to annotate any unmapped reads separately.
If you already have a sorted mapping file, you can supply it and RAT will skip the mapping step:
Similarly, if a previous RAT run crashed after the unmapped reads have already been aligned to the database with diamond, you can supply the intermediate files to continue the run:
CAT, BAT, and RAT
Introduction
Contig Annotation Tool (CAT) and Bin Annotation Tool (BAT) are pipelines for the taxonomic classification of long DNA sequences and metagenome assembled genomes (MAGs / bins) of both known and (highly) unknown microorganisms, as generated by contemporary metagenomics studies. The core algorithm of both programs involves gene calling, mapping of predicted ORFs against a protein database, and voting-based classification of the entire contig / MAG based on classification of the individual ORFs. CAT and BAT can be run from intermediate steps if files are formatted appropriately.
A paper describing the algorithm together with extensive benchmarks can be found at https://doi.org/10.1186/s13059-019-1817-x. If you use CAT or BAT in your research, it would be great if you could cite us:
Read Annotation Tool (RAT) estimates the taxonomic composition of metagenomes using CAT and BAT output. A manuscript describing RAT with benchmarks can be found at https://doi.org/10.1038/s41467-024-47155-1. If you use RAT in your research, it would be great if you could cite:
To cite the code itself:
Dependencies and where to get them
Python 3, https://www.python.org/.
DIAMOND, https://github.com/bbuchfink/diamond.
Prodigal, https://github.com/hyattpd/Prodigal.
RAT further requires (not needed for CAT and BAT):
BWA, https://github.com/lh3/bwa.
SAMtools, http://www.htslib.org/download/.
CAT, BAT, and RAT have been thoroughly tested on Linux systems, and should run on macOS as well.
Installation
No installation is required. You can run CAT, BAT and RAT by supplying the absolute path:
Alternatively, if you add the files in the CAT_pack directory to your $PATH variable, you can run CAT, BAT, and RAT from anywhere:
Getting started
To get started with CAT/BAT/RAT, you will have to get the database files on your system. You can either download preconstructed database files, or generate them yourself.
Downloading preconstructed database files
To download the database files, find the most recent version on tbb.bio.uu.nl/tina/CAT_pack_prepare/, download and extract, and you are ready to go!
For NCBI nr:
For GTDB:
Creating a fresh NCBI nr or GTDB database yourself
Instead of using the preconstructed database, you can construct a fresh database yourself. The download module can be used to download and process raw data, in preparation for building a new CAT pack database. This will ensure that all input dependencies are met and correctly formatted for CAT_pack prepare.
Currently, two databases are supported, NCBI’s nr and the Genome Taxonomy Database (GTDB) proteins.
NCBI non-redundant protein database (nr)
This will download the fasta file with the protein sequences, their mapping to a taxid, and the taxonomy information from NCBI’s ftp site.
Genome Taxonomy Database (GTDB) proteins
The files required to build a CAT pack database are provided by the GTDB downloads page.
CAT_pack download fetches the necessary files and does some additional processing to get them ready for CAT_pack prepare:
The taxonomy information from GTDB is transformed into NCBI style nodes.dmp and names.dmp files.
Protein sequences are extracted from gtdb_proteins_aa_reps.tar.gz and are subjected to a round of deduplication. The deduplication reduces the redundancy in the DIAMOND database, thus simplifying the alignment process. Exact duplicate sequences are identified based on a combination of the MD5sum of the protein sequences and their length. Only one representative sequence is kept, with all duplicates encoded in the fasta header. This information is later used by CAT_pack prepare to assign the LCA of the protein sequence appropriately in the .fastaid2LCAtaxid file.
The mapping of all protein sequences to their respective taxonomy is created.
In addition, the newick formatted trees of Bacteria and Archaea are downloaded and - artificially - concatenated under a single root node, to produce an all.tree file. This file is not used by the CAT pack but may come in handy for downstream analyses.
When the download and processing of the files is finished successfully you can build a CAT pack database with CAT_pack prepare.
For all command line options available see
and
Creating a custom database
For a custom CAT pack database, you must have the following input ready before you launch a CAT_pack prepare run.
A fasta file containing all protein sequences you want to include in your database.
A names.dmp file that contains mappings of taxids to their ranks and scientific names. The format must be the same as the NCBI standard names.dmp (uses \t|\t as field separator).
An example looks like this:
A nodes.dmp file that describes the child-parent relationship of the nodes in the taxonomy tree and their (official) rank. The format must be the same as the NCBI standard nodes.dmp (uses \t|\t as the field separator).
An example looks like this:
For more information on the nodes.dmp and names.dmp files, see the NCBI taxdump_readme.txt.
A 2-column, tab-separated file containing the mapping of each sequence in the fasta file to a taxid in the taxonomy. This file must contain the header accession.version taxid.
An example looks like this:
Once all of the above requirements are met you can run CAT_pack prepare. All the input needs to be explicitly specified for CAT_pack prepare to work, for example:
This will create an output_dir that will look like this:
Notes:
The two subdirs db and tax are created that contain all necessary files.
The nodes.dmp and names.dmp in the tax directory are copied from their original location. This is to ensure that the -t flag of CAT, BAT, and RAT works.
The default prefix is <YYYY-MM-DD>_CAT_pack. You can customize it with the --common_prefix option.
For all command line options available see
Running CAT/BAT/RAT.
The database files are needed in subsequent CAT/BAT/RAT runs. They only need to be generated/downloaded once or whenever you want to update the database.
To run CAT/BAT/RAT, respectively:
Getting help.
If you are unsure what options a program has, you can always add --help to a command. This is a great way to get you started with CAT, BAT, or RAT.
If you are unsure about what input files are required, you can just run CAT/BAT/RAT, as the appropriate error messages are generated if formatting is incorrect.
Taxonomic annotation of contigs or MAGs with CAT and BAT
After you have got the database files on your system, you can run CAT to annotate your contig set:
Multiple output files and a log file will be generated. The final classification files will be called out.CAT.ORF2LCA.txt and out.CAT.contig2classification.txt.
Alternatively, if you already have a predicted proteins fasta file and/or an alignment table for example from previous runs, you can supply them to CAT, which will then skip the steps that have already been done and start from there:
The headers in the predicted proteins fasta file must look like this: >{contig}_{ORFnumber}, so that CAT can couple contigs to ORFs. The alignment file must be tab-separated, with the queried ORF in the first column, the protein accession number in the second, and the bit-score in the 12th.
To run BAT on a set of MAGs:
Alternatively, BAT can be run on a single MAG:
Multiple output files and a log file will be generated. The final classification files will be called out.BAT.ORF2LCA.txt and out.BAT.bin2classification.txt.
Similarly to CAT, BAT can be run from intermediate steps if gene prediction and alignment have already been carried out once:
If you have previously run CAT on the set of contigs from which the MAGs originate, you can use the previously predicted protein and alignment files to classify the MAGs.
This is a great way to run both CAT and BAT on a set of MAGs without needing to do protein prediction and alignment twice!
Interpreting the output files
The ORF2LCA output looks like this:
Where the lineage is the full taxonomic lineage of the classification of the ORF, and the bit-score the top-hit bit-score that is assigned to the ORF for voting. The BAT ORF2LCA output file has an extra column where ORFs are linked to the MAG in which they are found.
The contig2classification and bin2classification output looks like this:
Where the lineage scores represent the fraction of bit-score support for each classification. contig_2 has two classifications. This can happen if the f parameter is chosen below 0.5. For an explanation of the starred classification, see Marking suggestive taxonomic assignments with an asterisk.
To add names to the taxids in either output file, run:
This will show you that for example contig_1 is classified as Terrabacteria group. To only get official rank (i.e. superkingdom, phylum, …):
Or, alternatively:
If you have named a CAT or BAT classification file with official names, you can get a summary of the classification, where total length and number of ORFs supporting a taxon are calculated for contigs, and the number of MAGs per encountered taxon for MAG classification:
CAT_pack summarise currently does not support classification files wherein some contigs / MAGs have multiple classifications (as contig_2 above).
Marking suggestive taxonomic assignments with an asterisk
When we want to confidently go down to the lowest taxonomic level possible for a classification, an important assumption is that on that level conflict between classifications could have arisen. Namely, if there were conflicting classifications, the algorithm would have made the classification more conservative by moving up a level. Since it did not, we can trust the low-level classification. However, it is not always possible for conflict to arise, because in some cases no other sequences from the clade are present in the database. This is true for example for the family Dehalococcoidaceae, which in our databases is the sole representative of the order Dehalococcoidales. Thus, here we cannot confidently state that a classification on the family level is more correct than a classification on the order level. For these cases, CAT and BAT mark the lineage with asterisks, starting from the lowest level classification up to the level where conflict could have arisen because the clade contains multiple taxa with database entries. The user is advised to examine starred taxa more carefully, for example by analysing sequence identity between predicted ORFs and hits, or move up the lineage to a confident classification (i.e. the first classification without an asterisk).
If you do not want the asterisks in your output files, you can add the --no_stars flag to CAT or BAT.
Optimising running time, RAM, and disk usage
CAT and BAT may take a while to run, and may use quite a lot of RAM and disk space. Depending on what you value most, you can tune CAT and BAT to maximize one and minimize others. The classification algorithm itself is fast and is friendly on memory and disk space. The most expensive step is alignment with DIAMOND, hence tuning alignment parameters will have the highest impact:
The -n / --nproc argument allows you to choose the number of cores to deploy.
You can choose to run DIAMOND in sensitive mode with the --sensitive flag. This will increase sensitivity but will make alignment considerably slower.
Setting the --block_size parameter lower will decrease memory and temporary disk space usage. Setting it higher will increase performance.
For high memory machines, it is advised to set --index_chunks to 1 (currently the default). This parameter has no effect on temporary disk space usage.
You can specify the location of temporary DIAMOND files with the --tmpdir argument.
Examples
Getting help for running the prepare utility:
Run CAT on a contig set with default parameter settings deploying 16 cores for DIAMOND alignment. Name the contig classification output with official names, and create a summary:
Run BAT on the set of MAGs that was binned from these contigs, reusing the protein predictions and DIAMOND alignment file generated previously during the contig classification:
Run the contig classification algorithm again with custom parameter settings, and name the output with all names in the lineage, excluding the scores:
Run BAT on the set of MAGs with custom parameter settings, suppressing verbosity and not writing a log file. Next, add names to the ORF2LCA output file:
Identifying contamination/mis-binned contigs within a MAG
We often use the combination of CAT / BAT to explore possible contamination within a MAG.
Contigs that have a different taxonomic signal than the MAG classification are probably contamination.
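Comparing contig and MAG classifications can be sketched as a lineage consistency check. The helper names and the example lineages are hypothetical, and the rule used here (a contig is flagged only when its lineage branches away from the MAG lineage, not when it is merely less or more specific) is one reasonable interpretation rather than the method used by CAT or BAT.

```python
def is_consistent(contig_lineage, mag_lineage):
    """True when one lineage is a prefix of the other, i.e. there is no conflict."""
    shared = min(len(contig_lineage), len(mag_lineage))
    return contig_lineage[:shared] == mag_lineage[:shared]


def flag_contaminants(contig_lineages, mag_lineage):
    """Return the names of contigs whose lineage conflicts with the MAG lineage."""
    return sorted(
        contig
        for contig, contig_lineage in contig_lineages.items()
        if not is_consistent(contig_lineage, mag_lineage)
    )


mag_lineage = ["1", "2", "1224", "1236"]
contig_lineages = {
    "contig_1": ["1", "2", "1224", "1236", "91347"],  # more specific, same branch
    "contig_2": ["1", "2", "1224"],                    # less specific, same branch
    "contig_3": ["1", "2", "1239"],                    # different phylum
}

print(flag_contaminants(contig_lineages, mag_lineage))
```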
Alternatively, you can look at contamination from the MAG perspective, by setting the f parameter to a low value:
BAT will output any taxonomic signal with at least 1% support. Low scoring diverging signals are clear signs of contamination!
Estimating the microbial composition with RAT
RAT estimates the taxonomic composition of metagenomes by integrating taxonomic signals from MAGs, contigs, and reads. RAT has been added to the CAT pack from version 6.0. To use RAT, you need the CAT pack database files (see Getting started for more information).
RAT makes an integrated profile using MAGs/bins, contigs, and reads. To specify which elements should be integrated, use the --mode argument. Possible letters for --mode are m (for MAGs), c (for contigs), and r (for reads). All combinations of the three letters are possible, except r alone. To run RAT’s complete workflow, specify the mode, read files, contig files, bin folder, and database files:
Currently, RAT supports single read files as well as paired-end read files. Interlaced read files are currently not supported. RAT will run CAT and BAT on the contigs and MAGs, will map the reads back to the contigs, and then try to annotate any unmapped reads separately. If you already have a sorted mapping file, you can supply it and RAT will skip the mapping step:
If CAT and/or BAT have already been run on your data, you can supply the output files to RAT to skip the CAT and BAT runs:
Similarly, if a previous RAT run crashed after the unmapped reads have already been aligned to the database with diamond, you can supply the intermediate files to continue the run:
After a RAT run is finished, you can run add_names on the abundance files (only for RAT runs with nr database):
Similar to CAT and BAT, the paths to all dependencies can be supplied via an argument:
Output files
The RAT output consists of:
r in --mode).