ConDiGA (Contigs Directed Gene Annotation) is a taxonomic annotation pipeline that uses metagenomic data to construct accurate protein sequence databases for deep metaproteomic coverage.
Setting up ConDiGA
Option 1: Installing ConDiGA using conda (recommended)
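A minimal sketch of the conda route, assuming the package is published as condiga on the Bioconda channel and that a dedicated environment is wanted. The environment name and the function wrapper are illustrative choices, not part of ConDiGA.

```shell
# Create a dedicated conda environment containing ConDiGA.
# Assumption: the default environment name "condiga" is arbitrary;
# pass a different name as the first argument if you prefer.
install_condiga_conda() {
  conda create -y -n "${1:-condiga}" -c conda-forge -c bioconda condiga
}
```

Call install_condiga_conda once, then activate the environment with conda activate condiga before running any of the commands below.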
After setting up, run the following command to ensure that condiga is working.
condiga --help
Usage
Usage: condiga [OPTIONS]

  ConDiGA: Contigs directed gene annotation for accurate protein sequence
  database construction in metaproteomics.

Options:
  -c, --contigs PATH              path to the contigs file  [required]
  -ta, --taxa PATH                path to the taxonomic classification results
                                  file  [required]
  -g, --genes PATH                path to the genes file  [required]
  -cov, --coverages PATH          path to the contig coverages file
                                  [required]
  -as, --assembly-summary PATH    path to the assembly_summary.txt file
                                  [required]
  -ra, --rel-abundance FLOAT RANGE
                                  minimum relative abundance cut-off
                                  [default: 0.0001; 0<=x<=1]
  -gc, --genome-coverage FLOAT RANGE
                                  minimum genome coverage cut-off  [default:
                                  0.001; 0<=x<=1]
  -mt, --map-threshold FLOAT RANGE
                                  minimum mapping length threshold cut-off
                                  [default: 0.5; 0<=x<=1]
  -t, --nthreads INTEGER          number of threads to use  [default: 8]
  -o, --output PATH               path to the output folder  [required]
  --help                          Show this message and exit.
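Based on the options listed above, a full invocation might look like the following sketch. All input file names are placeholders (in particular, using final.contigs.fa.lst as the genes file and taxa_results.txt as the converted classification file are assumptions), and the thresholds simply restate the documented defaults.

```shell
# Run ConDiGA with every required option set explicitly.
# Assumption: the file names are placeholders for your own inputs.
# The first argument is the output folder (default: condiga_output).
run_condiga() {
  condiga \
    -c final.contigs.fa \
    -ta taxa_results.txt \
    -g final.contigs.fa.lst \
    -cov coverage.tsv \
    -as assembly_summary.txt \
    -ra 0.0001 -gc 0.001 -mt 0.5 \
    -t 8 \
    -o "${1:-condiga_output}"
}
```

The -ra, -gc and -mt options can be omitted when the defaults are acceptable; they are spelled out here only to show where the cut-offs go.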
Preprocessing
Before running ConDiGA, you have to process your data as follows.
Step 1: Assemble reads into contigs
You have to assemble your reads into contigs using MEGAHIT as follows. Currently, ConDiGA only supports MEGAHIT assemblies.
Now, you can run the convert command to convert your result to a form that can be used as input to condiga. The result will be saved to the Kraken folder. Currently, convert supports results from Kraken2, Kaiju and BLAST.
NOTE: Since different annotation tools output results in different formats, you have to format the annotation results using convert, which outputs them in a standard format readable by ConDiGA.
Step 3: Obtain coverage of contigs
You can use CoverM to get the coverage values of contigs as follows.
You can predict the genes in the contigs using MetaGeneMark as follows. You will find the nucleotide and amino acid sequences of the predicted genes in a file named final.contigs.fa.lst.
gmhmmp -m MetaGeneMark_v1.mod final.contigs.fa
Step 5: Download assembly_summary.txt file.
You can download the assembly summary file for bacteria from NCBI as follows.
The output of ConDiGA will contain the following main files and folders.
genes.species.mapped.xlsx contains the gene annotation results
all_genes.fna contains nucleotide sequences of the predicted genes
all_genes.faa contains amino acid sequences of the predicted genes
all_genes.output contains minimap2 mapping results for the predicted genes
Assemblies contains FASTA files of the downloaded reference genomes
Issues and Questions
If you have any questions, issues or suggestions, please post them under ConDiGA Issues.
Contributing to ConDiGA
Are you interested in contributing to the ConDiGA project? If so, you can check out the contributing guidelines in CONTRIBUTING.md.
Acknowledgement
The ConDiGA logo was generated using DALL·E 3 from OpenAI with the following prompt.
Create an icon that visually represents the concept of contigs directed gene annotation for a tool logo ensuring the background is completely transparent.
Wu, E., Mallawaarachchi, V., Zhao, J. et al. Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics. Microbiome 12, 58 (2024). https://doi.org/10.1186/s40168-024-01775-3
@article{Wu2024,
author={Wu, Enhui and Mallawaarachchi, Vijini and Zhao, Jinzhi and Yang, Yi and Liu, Hebin and Wang, Xiaoqing and Shen, Chengpin and Lin, Yu and Qiao, Liang},
title={Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics},
journal={Microbiome},
year={2024},
month={Mar},
day={19},
volume={12},
number={1},
pages={58},
abstract={Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, an optimization of gene taxonomic annotation has not been reported, which, however, is extremely important for accurate metaproteomic analysis.},
issn={2049-2618},
doi={10.1186/s40168-024-01775-3},
url={https://doi.org/10.1186/s40168-024-01775-3}
}
NOTE: The database created by ConDiGA is described as MD3 in the manuscript.
Also, please cite the following tools used by ConDiGA, the assembler and the relevant taxonomic annotation tool used to obtain the results.
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 38(12): e132 (2010). https://doi.org/10.1093/nar/gkq275
ConDiGA: Contigs Directed Gene Annotation
ConDiGA (Contigs Directed Gene Annotation) is a taxonomic annotation pipeline that uses metagenomic data to construct accurate protein sequence databases for deep metaproteomic coverage.
Setting up ConDiGA
Option 1: Installing ConDiGA using conda (recommended)
You can install ConDiGA from Bioconda at https://anaconda.org/bioconda/condiga. Make sure you have conda installed.
Option 2: Installing ConDiGA using pip
You can install ConDiGA from PyPI at https://pypi.org/project/condiga/. Make sure you have pip installed.
Note: If you use pip to set up ConDiGA, you will have to install Minimap2 and TaxonKit manually and add them to your system path. Irrespective of the package manager, if you want to use Kaiju results, you have to download and set up the NCBI taxdump database for TaxonKit.
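A minimal sketch of the pip route, assuming the PyPI package is named condiga. The helper check_condiga_deps is a hypothetical convenience function, not part of ConDiGA; it only reports whether the two manually installed tools can be found.

```shell
# Install ConDiGA from PyPI.
install_condiga_pip() {
  pip install condiga
}

# Hypothetical helper: report any manually installed dependency that
# cannot be found. Returns 0 when both tools are available, 1 otherwise.
check_condiga_deps() {
  missing=0
  for tool in minimap2 taxonkit; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not found on PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_condiga_deps after installation is a quick way to confirm that Minimap2 and TaxonKit are reachable before starting a long job.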
Test the setup
After setting up, run the following command to ensure that condiga is working.
condiga --help
Usage
Preprocessing
Before running ConDiGA, you have to process your data as follows.
Step 1: Assemble reads into contigs
You have to assemble your reads into contigs using MEGAHIT as follows. Currently, ConDiGA only supports MEGAHIT assemblies.
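A minimal sketch of a paired-end MEGAHIT assembly. The read file names, output folder and thread count are placeholders supplied by the caller, and the function wrapper is only illustrative.

```shell
# Assemble paired-end reads with MEGAHIT.
# Arguments: forward reads, reverse reads, output folder, threads (default 8).
# MEGAHIT writes the assembled contigs to <output folder>/final.contigs.fa
# and refuses to run if the output folder already exists.
assemble_reads() {
  megahit -1 "$1" -2 "$2" -o "$3" -t "${4:-8}"
}
```

For example, assemble_reads reads_1.fastq reads_2.fastq megahit_out would produce megahit_out/final.contigs.fa, the contigs file used in the following steps.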
Step 2: Taxonomically annotate contigs
Next, you have to perform taxonomic annotation on your contigs. You can use any tool such as Kraken2, Kaiju or even BLAST.
As an example, let’s run Kraken2 as follows.
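One possible invocation is sketched here. The output and report file names are placeholders, and whether convert expects the --use-names form of the Kraken2 output is an assumption to check against your ConDiGA version.

```shell
# Classify contigs with Kraken2.
# Arguments: contigs FASTA file, threads (default 8).
# Assumption: DBNAME is already set in the environment.
classify_contigs() {
  kraken2 --db "$DBNAME" --threads "${2:-8}" --use-names \
    --report kraken_report.txt --output kraken_output.txt "$1"
}
```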
$DBNAME is the path to your Kraken database.
Now, you can run the convert command to convert your result to a form that can be used as input to condiga. The result will be saved to the Kraken folder. Currently, convert supports results from Kraken2, Kaiju and BLAST.
NOTE: Since different annotation tools output results in different formats, you have to format the annotation results using convert, which outputs them in a standard format readable by ConDiGA.
Step 3: Obtain coverage of contigs
You can use CoverM to get the coverage values of contigs as follows.
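A minimal sketch using CoverM's contig mode with paired-end reads. File names and thread count are placeholders, and the default coverage method of CoverM is used because the text does not specify one.

```shell
# Compute per-contig coverage with CoverM.
# Arguments: forward reads, reverse reads, contigs FASTA,
#            output file (default coverage.tsv), threads (default 8).
contig_coverage() {
  coverm contig -1 "$1" -2 "$2" -r "$3" -o "${4:-coverage.tsv}" -t "${5:-8}"
}
```

For example, contig_coverage reads_1.fastq reads_2.fastq final.contigs.fa writes a tab-separated table to coverage.tsv, which can then be passed to condiga with -cov.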
Step 4: Predict genes in contigs
You can predict the genes in the contigs using MetaGeneMark as follows.
gmhmmp -m MetaGeneMark_v1.mod final.contigs.fa
You will find the nucleotide and amino acid sequences of the predicted genes in a file named final.contigs.fa.lst.
Step 5: Download assembly_summary.txt file
You can download the assembly summary file for bacteria from NCBI as follows.
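A minimal sketch of the download, assuming the GenBank bacteria summary on the NCBI genomes FTP site is the intended file; NCBI also publishes a RefSeq equivalent, so adjust the URL if that is what your analysis needs.

```shell
# Download the NCBI assembly summary for bacteria.
# Argument: output file name (default assembly_summary.txt).
# Assumption: the GenBank (not RefSeq) summary is used.
download_assembly_summary() {
  wget -O "${1:-assembly_summary.txt}" \
    "https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt"
}
```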
Running ConDiGA
Once you have preprocessed your data and obtained all the necessary files, you can run condiga with the options listed under Usage.
Output
The output of ConDiGA will contain the following main files and folders.
genes.species.mapped.xlsx contains the gene annotation results
all_genes.fna contains nucleotide sequences of the predicted genes
all_genes.faa contains amino acid sequences of the predicted genes
all_genes.output contains minimap2 mapping results for the predicted genes
Assemblies contains FASTA files of the downloaded reference genomes
Issues and Questions
If you have any questions, issues or suggestions, please post them under ConDiGA Issues.
Contributing to ConDiGA
Are you interested in contributing to the ConDiGA project? If so, you can check out the contributing guidelines in CONTRIBUTING.md.
Acknowledgement
The ConDiGA logo was generated using DALL·E 3 from OpenAI with the following prompt.
Citation
ConDiGA is published in Microbiome at DOI: 10.1186/s40168-024-01775-3.
If you use ConDiGA in your work, please cite this publication.