COG (Clusters of Orthologous Genes) is a database that plays an important role in the annotation, classification, and analysis of microbial gene function.
Annotating and classifying the genes of a newly sequenced bacterial genome against the COG database is a common task.
However, no easy-to-use command-line tool existed for COG functional classification that could also produce publication-ready figures.
I developed COGclassifier to fill this need.
COGclassifier automates the whole process: searching query sequences against the COG database, annotating and classifying gene functions, and generating publication-ready figures (see the figures below).
Fig.1: Barchart of COG functional category classification result for E. coli
Fig.2: Piechart of COG functional category classification result for E. coli
cog-24.fun.tab: tab-delimited plain text file with descriptions of COG functional categories.
The categories form four functional groups:
1. INFORMATION STORAGE AND PROCESSING
2. CELLULAR PROCESSES AND SIGNALING
3. METABOLISM
4. POORLY CHARACTERIZED
Columns:
1. Functional category ID (one letter)
2. Functional group (1-4, as above)
3. Hexadecimal RGB color associated with the functional category
4. Functional category description
Each line corresponds to one functional category. The order of the categories is meaningful (it reflects a hierarchy of functions and determines the order of display).
cog-24.def.tab: tab-delimited plain text file with COG descriptions.
Columns:
1. COG ID
2. COG functional category (could include multiple letters in the order of importance)
3. COG name
4. Gene name associated with the COG (optional)
5. Functional pathway associated with the COG (optional)
6. PubMed ID associated with the COG (multiple entries are semicolon-separated; optional)
7. PDB ID of the structure associated with the COG (multiple entries are semicolon-separated; optional)
Each line corresponds to one COG. The order of the COGs is arbitrary (displayed in lexicographic order).
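The two column layouts above map directly onto small record types. The following is a minimal sketch, assuming each row is split on tabs exactly as described. The class and function names are hypothetical (they are not COGclassifier's API), and the sample rows are made up for illustration rather than copied from the real files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuncCategory:
    letter: str       # column 1: one-letter functional category ID
    group: int        # column 2: functional group (1-4)
    color: str        # column 3: hexadecimal RGB color
    description: str  # column 4: functional category description


@dataclass(frozen=True)
class CogDefinition:
    cog_id: str        # column 1
    categories: str    # column 2: one or more letters, most important first
    name: str          # column 3
    gene: str          # column 4 (optional)
    pathway: str       # column 5 (optional)
    pubmed_ids: tuple  # column 6 (optional, semicolon-separated)
    pdb_ids: tuple     # column 7 (optional, semicolon-separated)


def parse_func_category(line: str) -> FuncCategory:
    """Parse one row of the functional category table."""
    letter, group, color, description = line.rstrip("\n").split("\t")
    return FuncCategory(letter, int(group), color, description)


def _split_ids(field: str) -> tuple:
    """Split a semicolon-separated field, dropping empty entries."""
    return tuple(item for item in field.split(";") if item)


def parse_cog_definition(line: str) -> CogDefinition:
    """Parse one row of the COG definition table."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (7 - len(fields))  # pad missing optional columns
    cog_id, categories, name, gene, pathway, pubmed, pdb = fields[:7]
    return CogDefinition(cog_id, categories, name, gene, pathway,
                         _split_ids(pubmed), _split_ids(pdb))


# Made-up sample rows (illustrative only, not real file content)
category = parse_func_category("K\t1\tFCDCEC\tTranscription")
definition = parse_cog_definition(
    "COG4862\tKTN\tHypothetical COG name\t\t\t111;222\t"
)
```

Padding the optional columns keeps the parser tolerant of rows where trailing empty fields were trimmed.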
"cddid.tbl.gz" contains summary information about the CD models in this distribution, which are part of the default "cdd" search database and are indexed in NCBI's Entrez database. This is a tab-delimited text file, with a single row per CD model and the following columns:
1. PSSM-Id (unique numerical identifier)
2. CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK' or 'CHL')
3. CD "short name"
4. CD description
5. PSSM-Length (number of columns, the size of the search model)
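Given that layout, a lookup from PSSM-Id to COG ID can be built by keeping only rows whose accession starts with 'COG'. This is a minimal sketch under that assumption. The function names are hypothetical and the sample rows are made up for illustration.

```python
import gzip


def cdd_to_cog_from_lines(lines):
    """Map PSSM-Id -> CD accession, keeping only COG models."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        pssm_id, accession = fields[0], fields[1]
        if accession.startswith("COG"):
            mapping[pssm_id] = accession
    return mapping


def load_cdd_to_cog(path):
    """Read a gzipped cddid.tbl.gz file and build the same mapping."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return cdd_to_cog_from_lines(handle)


# Made-up rows (illustrative only, not real file content)
sample_rows = [
    "100001\tCOG0001\tShortNameA\tDescription A\t432\n",
    "100002\tpfam00001\tShortNameB\tDescription B\t120\n",
]
cdd2cog = cdd_to_cog_from_lines(sample_rows)
```

Splitting the parsing from the file reading makes the mapping logic easy to check without downloading the real file.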
2. RPS-BLAST search against COG database
Run RPS-BLAST with the query sequences against the COG database [Default: E-value = 1e-2].
The best hit (lowest E-value) for each query sequence is extracted and used in the next functional classification step.
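The best-hit extraction can be sketched as follows, assuming the standard 12-column BLAST tabular output (outfmt 6), where the query ID is the first column and the E-value is the eleventh. The function name, the tie-breaking rule, and the sample rows are illustrative assumptions, not COGclassifier's actual implementation.

```python
def best_hits(outfmt6_lines):
    """Return {query ID: fields of its lowest-E-value hit}."""
    best = {}
    for line in outfmt6_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        # Strict '<' keeps the first-listed hit on ties (an assumption).
        if query_id not in best or evalue < float(best[query_id][10]):
            best[query_id] = fields
    return best


# Made-up rows (illustrative only): qseqid, sseqid, pident, length, mismatch,
# gapopen, qstart, qend, sstart, send, evalue, bitscore
sample_hits = [
    "q1\tCDD:100001\t35.0\t200\t100\t5\t1\t200\t1\t210\t1e-30\t120",
    "q1\tCDD:100002\t40.0\t180\t90\t3\t5\t185\t2\t190\t1e-50\t180",
    "q2\tCDD:100003\t30.0\t150\t95\t4\t1\t150\t1\t160\t5e-05\t45",
]
hits = best_hits(sample_hits)
```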
3. Classify query sequences into COG functional category
From the best-hit results, extract the relationship between query sequences and COG functional categories as follows.
Best-hit results -> CDD ID
CDD ID -> COG ID (From cddid.tbl.gz)
COG ID -> COG Functional Category Letter (From cog-24.def.tab)
COG Functional Category Letter -> COG Functional Category description (From cog-24.fun.tab)
If a COG has a functional category with multiple letters, the first letter is treated as its functional category
(e.g. COG4862 has multiple letters KTN, so the letter K is treated as its functional category).
Using the above information, the number of query sequences classified into each COG functional category is calculated, and
the functional annotation and classification results are output.
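The first-letter rule and the per-category count can be sketched as below. This assumes the query-to-COG and COG-to-letters lookups have already been built. The function names and the sample data other than COG4862/KTN are hypothetical.

```python
from collections import Counter


def classify(query_to_cog, cog_to_letters):
    """Assign each query the first category letter of its best-hit COG."""
    query_to_category = {}
    for query_id, cog_id in query_to_cog.items():
        letters = cog_to_letters.get(cog_id, "")
        if letters:  # queries whose COG has no category are left out
            query_to_category[query_id] = letters[0]
    return query_to_category


def count_categories(query_to_category):
    """Count classified query sequences per functional category letter."""
    return Counter(query_to_category.values())


# Illustrative inputs: COG4862 -> KTN is the example from the text above;
# the other IDs and letters are made up.
query_to_cog = {"q1": "COG4862", "q2": "COG9001", "q3": "COG9002"}
cog_to_letters = {"COG4862": "KTN", "COG9001": "K", "COG9002": "J"}

classified = classify(query_to_cog, cog_to_letters)
counts = count_categories(classified)
```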
Usage
Basic Command
COGclassifier -i [protein fasta file] -o [output directory]
Options
$ COGclassifier --help
Usage: COGclassifier [OPTIONS]
A tool for classifying prokaryote protein sequences into COG functional category
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --infile -i Input query protein fasta file [required] │
│ * --outdir -o Output directory [required] │
│ --download_dir -d Download COG & CDD resources directory [default: /home/user/.cache/cogclassifier_v2] │
│ --thread_num -t RPS-BLAST num_thread parameter [default: MaxThread - 1] │
│ --evalue -e RPS-BLAST e-value parameter [default: 0.01] │
│ --quiet -q No print log on screen │
│ --version -v Print version information │
│ --help -h Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
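For scripted runs, the options listed above can be assembled into an argument list and passed to `subprocess`. This is a minimal sketch that uses only the flags shown in the help output. The helper names are hypothetical, and reading "MaxThread - 1" as the CPU count minus one is an assumption.

```python
import os
import subprocess


def default_thread_num():
    """One plausible reading of 'MaxThread - 1' (an assumption)."""
    return max((os.cpu_count() or 1) - 1, 1)


def build_command(infile, outdir, evalue=None, thread_num=None,
                  download_dir=None, quiet=False):
    """Build a COGclassifier argument list from the documented options."""
    cmd = ["COGclassifier", "--infile", str(infile), "--outdir", str(outdir)]
    if download_dir is not None:
        cmd += ["--download_dir", str(download_dir)]
    if thread_num is not None:
        cmd += ["--thread_num", str(thread_num)]
    if evalue is not None:
        cmd += ["--evalue", str(evalue)]
    if quiet:
        cmd.append("--quiet")
    return cmd


def run_cogclassifier(infile, outdir, **options):
    """Run COGclassifier; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(infile, outdir, **options), check=True)


# Hypothetical file names, for illustration only
example_cmd = build_command("proteins.faa", "cog_out", evalue=1e-5, quiet=True)
```

Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.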
Example Command
Click here to download example protein fasta files.
Customize Charts
COGclassifier also provides barchart and piechart plotting API/CLI commands (plot_cog_count_barchart and plot_cog_count_piechart) for customizing chart appearance.
See the notebooks and the command help below for details.
plot_cog_count_barchart
$ plot_cog_count_barchart --help
Usage: plot_cog_count_barchart [OPTIONS]
Plot COGclassifier count barchart figure
╭─ Options ───────────────────────────────────────────────────────────────────────────────────╮
│ * --infile -i Input COG count result file ('cog_count.tsv') [required] │
│ * --outfile -o Output barchart figure file (*.png|*.svg|*.html) [required] │
│ --width Figure pixel width [default: 440] │
│ --height Figure pixel height [default: 340] │
│ --bar_width Figure bar width [default: 15] │
│ --y_limit Y-axis max limit value │
│ --percent_style Plot percent style instead of number count │
│ --sort Enable descending sort by number count │
│ --dpi Figure DPI [default: 100] │
│ --help -h Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯
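The effect of `--percent_style` and `--sort` on the plotted values can be pictured with a small transformation of the per-category counts. This is only a sketch of the idea, assuming percentages are taken over the total count. The function name and sample counts are hypothetical, and the real command may compute its values differently.

```python
def to_plot_rows(counts, percent_style=False, sort=False):
    """Turn {letter: count} into (letter, value) rows for a bar chart."""
    total = sum(counts.values())
    rows = []
    for letter, count in counts.items():
        # Percent of all classified sequences, or the raw count
        value = count * 100 / total if percent_style and total else count
        rows.append((letter, value))
    if sort:
        rows.sort(key=lambda row: row[1], reverse=True)  # descending
    return rows


# Made-up counts (illustrative only)
sample_counts = {"J": 20, "K": 50, "L": 30}
plain_rows = to_plot_rows(sample_counts)
percent_rows = to_plot_rows(sample_counts, percent_style=True, sort=True)
```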
COGclassifier
Installation
Python 3.9 or later is required for installation. RPS-BLAST (ncbi-blast+) must also be installed.
Install bioconda package:
Install PyPI stable package:
Workflow
Description of COGclassifier’s automated workflow. This workflow was created based in part on cdd2cog.
1. Setup COG & CDD resources
Download & load 4 required COG & CDD files from FTP site.
cog-24.fun.tab (https://ftp.ncbi.nih.gov/pub/COG/COG2024/data/cog-24.fun.tab)
Descriptions of COG functional categories. This resource file is included in the package as cog_func_category.tsv.
cog-24.def.tab (https://ftp.ncbi.nih.gov/pub/COG/COG2024/data/cog-24.def.tab)
COG descriptions such as 'COG ID', 'COG functional category', 'COG name', etc. This resource file is included in the package as cog_definition.tsv.
cddid.tbl.gz (https://ftp.ncbi.nih.gov/pub/mmdb/cdd/)
Summary information about the CD (Conserved Domain) models.
Cog_LE.tar.gz (https://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/)
COG database, a part of CDD (Conserved Domain Database), for RPS-BLAST search.
Output Contents
rpsblast.tsv
RPS-BLAST against COG database result (format = outfmt 6).
cog_classify.tsv
Query sequences classified into COG functional category result. This file contains all classified query sequences and associated COG information (tsv format, 9 columns).
cog_count.tsv
Count of classified sequences per COG functional category (tsv format, 5 columns).
cogclassifier.log
COGclassifier log file.
cog_count_barchart.[png|html]
Barchart of COG functional category classification result. COGclassifier uses the Altair visualization library for plotting charts.
cog_count_piechart.[png|html]
Piechart of COG functional category classification result. Functional categories with percentages less than 1% do not display their letter on the piechart.
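The piechart labeling rule (no letter for categories under 1%) can be expressed as a small function. This is a sketch of the rule as stated, not the plotting code itself. The function name and sample counts are hypothetical, and showing the letter at exactly 1% follows the "less than 1%" wording.

```python
def pie_labels(counts):
    """Return {letter: label}, with an empty label for slices under 1%."""
    total = sum(counts.values())
    labels = {}
    for letter, count in counts.items():
        # Integer form of count / total < 1%, avoiding float rounding
        labels[letter] = "" if count * 100 < total else letter
    return labels


# Made-up counts (illustrative only): X is 0.5% of the total
sample_pie_counts = {"J": 995, "X": 5}
labels = pie_labels(sample_pie_counts)
```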