Cerberus transforms raw sequencing data (i.e., genomic, transcriptomic, metagenomic, and metatranscriptomic) into knowledge. It is a start-to-finish Python pipeline for versatile analysis of the Functional Ontology Assignments for Metagenomes (FOAM), KEGG, CAZy/dbCAN, VOG, pVOG, PHROG, COG, and a variety of other databases, including user-customized databases, via Hidden Markov Models (HMMs) for functional annotation and complete metabolic analysis across the tree of life (i.e., bacteria, archaea, phage, viruses, eukaryotes, and whole ecosystems). Cerberus also provides automatic differential statistics using DESeq2/edgeR, pathway enrichment with GAGE, and pathway visualization with Pathview in R.
Art by Andra Buchan
Installing Cerberus
Option 1) Mamba
Mamba install from bioconda with all dependencies:
Linux/OSX-64
Install mamba using conda
conda install mamba
[!NOTE]
Make sure you install mamba in your base conda environment unless you have OSX with ARM architecture (M1/M2 Macs). Follow the OSX-ARM instructions below if you have a Mac with ARM architecture.
[!NOTE]
Mamba is the fastest installer; Anaconda or Miniconda can be slow. Also, install mamba from conda, not from pip: the pip version of mamba does not work for this installation.
Option 2) Anaconda - Linux/OSX-64 Only
Anaconda install from bioconda with all dependencies:
Cerberus accepts three types of input:
- Raw read data from any sequencing platform (Illumina, PacBio, or Oxford Nanopore)
- Assembled contigs, as MAGs, vMAGs, isolate genomes, or a collection of contigs
- Amino acid FASTA (.faa), previously called pORFs
We offer customization, including running all databases together, individually or specifying select databases. For example, if a user wants to run prokaryotic or eukaryotic-specific KOfams, or an individual database alone such as dbCAN, both are easily customized within Cerberus.
In QC mode, raw reads are quality checked with FastQC both before and after trimming. Reads are trimmed according to data type: Illumina and PacBio data are trimmed with fastp; otherwise the data is assumed to be Oxford Nanopore and Porechop is used.
If Illumina reads are used, an optional bbmap step is available to remove the phiX174 genome or a user-provided contaminant genome. Phage phiX174 is a common contaminant on the Illumina platform, as it is used as a library spike-in control. We highly recommend this removal when conducting viral analysis, as phiX174 can produce false-positive hits to ssDNA microviruses within a sample. We also include a --skip_decon option to skip phiX174 filtration, since that filtration may remove common k-mers shared with ssDNA phages.
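The platform-based trimming choice above can be sketched as follows. This is an illustrative sketch, not Cerberus's internal code, and the file names are placeholders.

```python
# Illustrative sketch of the trimmer selection described above; this is
# not Cerberus's internal code, and the file names are placeholders.

def trim_command(platform: str, reads_in: str, reads_out: str) -> list:
    """Return the trimming command for a given sequencing platform.

    Illumina and PacBio reads go to fastp; anything else is assumed
    to be Oxford Nanopore and goes to Porechop.
    """
    if platform.lower() in ("illumina", "pacbio"):
        return ["fastp", "-i", reads_in, "-o", reads_out]
    return ["porechop", "-i", reads_in, "-o", reads_out]
```

The returned list can be handed to a process runner such as `subprocess.run`.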
In the formatting and gene prediction stage, contigs and genomes are checked for N repeats. These N repeats are removed by default.
We compute contig/genome statistics (e.g., N50, N90, max contig length) via our custom module Metaome Stats.
Scaffold annotation is not recommended, because runs of Ns produce ambiguous annotations.
Contigs can be converted to pORFs using Prodigal, FragGeneScanRs, or Prodigal-gv, as specified by user preference. Both Prodigal and FragGeneScanRs can be run together via our --super option, and we recommend FragGeneScanRs for samples rich in eukaryotes: FragGeneScanRs found more ORFs and KOs than Prodigal on a simulated eukaryote-rich metagenome. HMMER then searches the above databases using user-specified bit-score and e-value cut-offs, or our minimum defaults (i.e., bit score = 25, e-value = 1e-9).
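As a concrete illustration of these cut-offs, a hit-filtering step might look like the sketch below. The hit tuples here are hypothetical (accession, bit score, e-value) triples, not Cerberus's actual data structure.

```python
# Minimal sketch of score-based hit filtering using the minimum
# defaults quoted above (bit score >= 25, e-value <= 1e-9).
# The hit tuples are hypothetical, not Cerberus's internal format.

def passes_filter(bitscore, evalue, min_score=25.0, max_evalue=1e-9):
    """Keep a hit only if it meets both score thresholds."""
    return bitscore >= min_score and evalue <= max_evalue

hits = [("K00001", 80.2, 1e-30), ("K00002", 12.1, 1e-3)]
kept = [h for h in hits if passes_filter(h[1], h[2])]
```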
Input file formats
From any next-generation sequencing technology (Illumina, PacBio, or Oxford Nanopore):
- Type 1: raw reads (.fastq format)
- Type 2: nucleotide FASTA (.fasta, .fa, .fna, .ffn format), raw reads assembled into contigs
- Type 3: protein FASTA (.faa format), genes from assembled contigs translated to amino acid sequences
Output Files
If an output directory is given, that folder will be created and all files will be stored there. If no output directory is specified, a 'results_cerberus' subfolder will be created in the current directory. GAGE/Pathview analysis is provided as separate R scripts.
Visualization of Outputs
We use Plotly to visualize the data. Once the program finishes running, HTML reports containing the visuals are saved in the folder of the last pipeline step. The HTML files require plotly.js; a copy is provided in the package and saved to the report folder.
Annotation Rules
Rule 1 finds high-quality matches across databases. It is a score pre-filtering module for pORF thresholds: each pORF match to an HMM is recorded according to the default or a user-selected cut-off (i.e., e-value/bit score), either per database independently, across all default databases (e.g., to find the best hit), or for a user-specified selection of databases.
Rule 2 avoids missing genes encoding proteins with dual, non-overlapping domains. It is the non-overlapping dual-domain module for pORF thresholds: if two HMM hits from the same database do not overlap, both are counted, as long as each meets the default or user-selected score cut-off (i.e., e-value/bit score).
Rule 3 ensures overlapping dual domains are not missed. This is the dual independent overlapping-domain module for convergent binary-domain pORFs: if two domains within a pORF overlap by fewer than 10 amino acids (e.g., COG1 and COG4), both domains are counted and reported, addressing the dual-domain issue within a single pORF. If a function hits multiple pathways within an accession, it is counted in each pathway during pathway roll-up, as many proteins function in multiple pathways.
Rule 4 is the equal-match counter, which avoids missing equally high-quality matches within the same protein. This is an independent accession module for a single pORF: if two hits from the same database have identical e-values and bit scores but different accessions (e.g., KO1 and KO3), both are reported.
Rule 5 is the 'winner takes all' rule for reporting the best match. It is computed as the winner-takes-all module for overlapping pORF hits: if two HMM hits from the same database overlap by more than 10 amino acids, the hit with the lowest e-value and highest bit score wins.
Rule 6 prevents partial or fractional hits from being counted: only whole, discrete integer counts (e.g., 0, 1, 2, ... n) are computed, and partial or fractional counting is excluded.
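The overlap logic of Rules 2, 3, and 5 can be sketched for a pair of hits as follows. This is illustrative pseudologic only; the hit representation is an assumption, not the actual Cerberus implementation.

```python
# Illustrative sketch of Rules 2, 3, and 5 for two same-database hits
# on one pORF; hits are hypothetical (start, end, evalue, bitscore)
# tuples with coordinates in amino acids.

def resolve_pair(hit_a, hit_b, max_overlap=10):
    """Return the hits to count for a pair of same-database HMM hits.

    Non-overlapping hits (Rule 2) or hits overlapping by fewer than
    `max_overlap` amino acids (Rule 3) are both counted; larger
    overlaps are resolved winner-takes-all by lowest e-value, then
    highest bit score (Rule 5).
    """
    overlap = min(hit_a[1], hit_b[1]) - max(hit_a[0], hit_b[0])
    if overlap < max_overlap:  # covers overlap <= 0, i.e., no overlap
        return [hit_a, hit_b]
    return [min(hit_a, hit_b, key=lambda h: (h[2], -h[3]))]
```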
Quick start examples
Genome examples
All databases
```bash
conda activate cerberus
cerberus.py --prodigal lambda.fna --hmm ALL --dir_out lambda_dir
```
NOTE: You can pick any single database for your analysis, including KOFam_all, COG, VOG, PHROG, CAZy, or the domain-specific KO databases for eukaryotes and prokaryotes (KOFam_eukaryote or KOFam_prokaryote).
[!NOTE]
The KEGG database contains KOs related to human disease; these may show up in the results even when analyzing microbes. eggNOG and FunGene databases are coming soon. If you want a custom HMM built, please let us know by email or by opening an issue.
Custom Database
To run a custom database, you need an HMM file containing the protein family of interest and a metadata sheet describing the HMM, which is required for look-up tables and downstream analysis. The metadata needs an ID that matches the HMM and a function or hierarchy. See the example below.
Example metadata sheet:

| ID   | Function |
|------|----------|
| HMM1 | Sugarase |
| HMM2 | Coffease |
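The metadata sheet is a plain tab-separated file, so it can be written with the standard library. A minimal sketch follows; the output file name is arbitrary, and the IDs must match those in your HMM file.

```python
# Write the example custom-database metadata sheet as a TSV file.
# "custom_db.tsv" is an arbitrary name; IDs must match the HMM file.
import csv

rows = [("HMM1", "Sugarase"), ("HMM2", "Coffease")]
with open("custom_db.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["ID", "Function"])  # header row
    writer.writerows(rows)
```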
Cerberus Options
[!Important]
If the Cerberus environment is not used, make sure the dependencies are in PATH or specified in the config file.
Run cerberus.py with the options required for your project.
Usage of cerberus.py:
[!Note]
The following are different options/arguments to modify the execution of Cerberus.
• Setup arguments:

| Argument/Option | Function [Default] | Usage Format | Accepted format | Example (Type as one line) |
|---|---|---|---|---|
| --setup | Setup additional dependencies [False] | --setup | N/A | cerberus.py --setup |
| --update | Update downloaded databases [False] | --update | N/A | cerberus.py --update |
| --list-db | List available and downloaded databases [False] | --list-db | N/A | cerberus.py --list-db |
| --download | Downloads selected HMMs; use --list-db for a list of available databases. Default is to download all available databases | --download [DOWNLOAD ...] | --download [.HMM FILE] | --download path/to/example/directory.hmm |
| --uninstall | Remove downloaded databases and FragGeneScan+ [False] | --uninstall | N/A | cerberus.py --uninstall |

• Input File Arguments:

| Argument/Option | Function | Usage Format | Accepted format | # Options Accepted | Example (Type as one line) |
|---|---|---|---|---|---|
| -c or --config | Path to a config file | -c CONFIG or --config CONFIG | config file | 1 | -c path/to/config/file |
| --prodigal | Nucleotide sequence (run through Prodigal) | --prodigal PRODIGAL [PRODIGAL ...] | Sequence file | >=1 | --prodigal FILE1 FILE2... |
| --fraggenescan | Eukaryote nucleotide sequence (includes other viruses, works all around for everything) | --fraggenescan FRAGGENESCAN [FRAGGENESCAN ...] | Sequence file | >=1 | --fraggenescan FILE1 FILE2... |
| --super | Run sequences in both --prodigal and --fraggenescan modes | --super SUPER [SUPER ...] | Sequence file | >=1 | --super FILE1 FILE2... |
| --prodigalgv | Giant virus nucleotide sequence | --prodigalgv PRODIGALGV [PRODIGALGV ...] | Sequence file | >=1 | --prodigalgv FILE1 FILE2... |
| --phanotate | Phage sequence | --phanotate PHANOTATE [PHANOTATE ...] | Sequence file | >=1 | --phanotate FILE1 FILE2... |
| --protein or --amino | Protein amino acid sequence | --protein PROTEIN [PROTEIN ...] or --amino PROTEIN [PROTEIN ...] | Sequence file | >=1 | --protein FILE1 FILE2... or --amino FILE1 FILE2... |
| --hmmer-tsv | Annotations TSV file from HMMER (experimental) | --hmmer-tsv HMMER_TSV [HMMER_TSV ...] | Sequence file | >=1 | --hmmer-tsv FILE1 FILE2... |
| --class | Path to a TSV file with class information for the samples; if included, scripts to run Pathview in R will be generated | --class CLASS | Path to TSV file | 1 | --class TSV_FILE1 |
| --illumina | Specifies that the given FASTQ files are from Illumina | --illumina | N/A | N/A | cerberus.py --illumina |
| --nanopore | Specifies that the given FASTQ files are from Nanopore | --nanopore | N/A | N/A | cerberus.py --nanopore |
| --pacbio | Specifies that the given FASTQ files are from PacBio | --pacbio | N/A | N/A | cerberus.py --pacbio |
• Output options:

| Argument/Option | Function [Default] | Usage Format | Accepted format | # Options Accepted | Example (Type as one line) |
|---|---|---|---|---|---|
| --dir-out | Path to output directory [./results-cerberus] | --dir-out DIR_OUT | output file path | 1 | --dir-out path/to/output/file |
| --replace | Flag to replace existing files [False] | --replace | cerberus.py option | N/A | cerberus.py --replace |
| --keep | Flag to keep temporary files [False] | --keep | cerberus.py option | N/A | cerberus.py --keep |
| --tmpdir | Temp directory for Ray (experimental) [system tmp dir] | --tmpdir TMPDIR | cerberus.py option | 1 | --tmpdir TEMPFILE1 |
• Database options:

| Argument/Option | Function [Default] | Usage Format | Accepted format | # Options Accepted | Example (Type as one line) |
|---|---|---|---|---|---|
| --hmm | A list of databases for HMMER; use --list-db for a list of available databases [KOFam_all] | --hmm HMM [HMM ...] | cerberus.py option | >=1 | cerberus.py --hmm DATABASE1 DATABASE2... |
| --db-path | Path to folder of databases [under the library path of Cerberus] | --db-path DB_PATH | path to databases folder | 1 | --db-path path/to/databases/folder |
• Optional Arguments:

| Argument/Option | Function [Default] | Usage Format | Accepted format | # Options Accepted | Example (Type as one line) |
|---|---|---|---|---|---|
| --scaffolds | Sequences are treated as scaffolds [False] | --scaffolds | cerberus.py option | N/A | cerberus.py --scaffolds |
| --minscore | Score cutoff for parsing HMMER results [60] | --minscore MINSCORE | whole integer value | 1 | cerberus.py --minscore 50 |
| --evalue | E-value cutoff for parsing HMMER results [1e-09] | --evalue EVALUE | E-value | 1 | cerberus.py --evalue [E-value] |
| --skip-decon | Skip decontamination step [False] | --skip-decon | cerberus.py option | N/A | cerberus.py --skip-decon |
| --skip-pca | Skip PCA [False] | --skip-pca | cerberus.py option | N/A | cerberus.py --skip-pca |
| --cpus | Number of CPUs to use per task; available CPUs are detected if not specified [Auto Detect] | --cpus CPUS | whole integer value | 1 | cerberus.py --cpus 16 |
| --chunker | Split files into smaller chunks, in megabytes [Disabled] | --chunker CHUNKER | whole integer value | 1 | cerberus.py --chunker 300 |
| --grouped | Group multiple FASTA files into a single file before processing; combined with --chunker this can improve speed | --grouped | cerberus.py option | N/A | cerberus.py --grouped |
| --version or -v | Show the version number and exit | --version or -v | cerberus.py option | N/A | cerberus.py --version |
| -h or --help | Show this help message and exit | -h or --help | cerberus.py option | N/A | cerberus.py -h |
| --adapters | FASTA file containing adapter sequences for trimming | --adapters ADAPTERS | FASTA file | 1 | cerberus.py --adapters /path/to/FASTA/file |
| --qc_seq | FASTA file containing control sequences for decontamination | --qc_seq QC_SEQ | FASTA file | 1 | cerberus.py --qc_seq /path/to/FASTA/file |
[!NOTE]
Arguments/options that start with -- can also be set in a config file (specified via -c). The config file syntax allows key=value, flag=true, and stuff=[a,b,c]. In general, command-line values override config-file values, which override defaults.
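For illustration, a config file using this syntax might look like the sketch below. The keys shown mirror the long options documented above, but treat the exact spellings as assumptions and check `cerberus.py --help` for the authoritative names.

```
# Hypothetical Cerberus config file, passed with: cerberus.py -c cerberus.config
minscore=60
evalue=1e-09
keep=true
hmm=[KOFam_all,COG]
```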
OUTPUTS (/final folder)

| File Extension | Description Summary | Cerberus Update Version |
|---|---|---|
| .gff | General Feature Format | 1.3 |
| .gbk | GenBank format | 1.3 |
| .fna | Nucleotide FASTA file of the input contig sequences | 1.3 |
| .faa | Protein FASTA file of the translated CDS/ORF sequences | 1.3 |
| .ffn | FASTA feature nucleotide file: the nucleotide sequences of translated CDSs/ORFs | 1.3 |
| .html | Summary statistics and/or visualizations, in the step 10 folder | 1.3 |
| .txt | Statistics relating to the annotated features found | 1.3 |
| level.tsv | Tab-separated file of individual hierarchical levels from the various databases | 1.3 |
| rollup.tsv | Tab-separated file of all hierarchical levels from the various databases | 1.3 |
| .tsv | Final annotation summary: tab-separated file of all features from the various databases | 1.3 |
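As a post-processing sketch, the final annotation TSV can be tallied per database with the standard library. The column names here ("target", "database") and the toy in-memory table are assumptions for illustration; check the header of your own output file first.

```python
# Tally annotated features per database from a toy, in-memory
# annotation TSV; column names are assumptions for illustration.
import collections
import csv
import io

toy_tsv = "target\tdatabase\nORF_1\tKOFam_all\nORF_2\tCOG\nORF_3\tKOFam_all\n"
counts = collections.Counter(
    row["database"] for row in csv.DictReader(io.StringIO(toy_tsv), delimiter="\t")
)
```

For a real run, replace `io.StringIO(toy_tsv)` with an open file handle to the final .tsv.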
GAGE / PathView
After processing the HMM results, Cerberus calculates a KO (KEGG Orthology) counts table from KEGG/FOAM for processing through GAGE and PathView. GAGE is recommended for pathway enrichment, followed by PathView to visualize the metabolic pathways. A "class" file, supplied via the --class option, is required to run this analysis.
[!Tip]
Because Cerberus cannot know which comparisons you want to make, you must provide a class.tsv that tells the code which comparisons to run.
For example (class.tsv):

| Sample | Class |
|--------|-------|
| 1A | rhizobium |
| 1B | non-rhizobium |
The output is saved under the step_10-visualizeData/combined/pathview folder. At least 4 samples are needed for this type of analysis. GAGE and PathView also require internet access to download information from a database. Cerberus saves a bash script, run_pathview.sh, in the step_10-visualizeData/combined/pathview directory, along with the KO counts TSV files and the class file, for running manually in case Cerberus was run on a cluster without internet access.
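Building a class.tsv is simple; the sketch below writes one and warns when fewer than 4 samples are given. The file name and sample labels are illustrative, not required by Cerberus.

```python
# Write a class.tsv mapping samples to classes for GAGE/PathView and
# warn if fewer than 4 samples are provided (the minimum noted above).
import csv

def write_class_file(path, samples):
    """samples: dict mapping sample name -> class label."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["Sample", "Class"])
        for sample, cls in samples.items():
            writer.writerow([sample, cls])
    if len(samples) < 4:
        print("warning: GAGE/PathView needs at least 4 samples")
```

Pass the resulting file to Cerberus via the --class option.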
Multiprocessing / Multi-Computing with RAY
Cerberus uses Ray for distributed processing. This is compatible with both multiprocessing on a single node (computer) or multiple nodes in a cluster.
Cerberus has been tested on a cluster using Slurm.
[!Important]
A script has been included to facilitate running Cerberus on Slurm. To use Cerberus on a Slurm cluster, setup your slurm script and run it using sbatch.
sbatch example_script.sh
example script:
```bash
#!/usr/bin/env bash
#SBATCH --job-name=test-job
#SBATCH --nodes=3
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128MB
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out
#SBATCH --mail-type=END,FAIL,REQUEUE

echo "====================================================="
echo "Start Time  : $(date)"
echo "Submit Dir  : $SLURM_SUBMIT_DIR"
echo "Job ID/Name : $SLURM_JOBID / $SLURM_JOB_NAME"
echo "Node List   : $SLURM_JOB_NODELIST"
echo "Num Tasks   : $SLURM_NTASKS total [$SLURM_NNODES nodes @ $SLURM_CPUS_ON_NODE CPUs/node]"
echo "======================================================"
echo ""

# Load any modules or resources here
conda activate cerberus

# source the slurm script to initialize the Ray worker nodes
source ray-slurm-cerberus.sh

# run Cerberus
cerberus.py --prodigal [input_folder] --illumina --dir_out [out_folder]

echo ""
echo "======================================================"
echo "End Time    : $(date)"
echo "======================================================"
echo ""
```
DESeq2 and edgeR Type I errors
Both edgeR and DESeq2 have the highest sensitivity among algorithms that control type-I error when the FDR is at or below 0.1. edgeR and DESeq2 both perform fairly well in simulation and via data splitting (so no parametric assumptions). Typical benchmarks show limma having stronger FDR control across all types of datasets (it is hard to beat the moderated t-test), and edgeR and DESeq2 having higher sensitivity for low counts (which makes sense, as limma has to filter these out or down-weight them to use the normal model on log counts). Further information about type I errors is available in Mike Love's vignette.
Contributing to Cerberus and FunGene
Cerberus is a community resource that has recently acquired FunGene, and we welcome contributions from other experts to expand annotation across all domains of life (viruses, bacteria, archaea, eukaryotes). Please open an issue on the Cerberus GitHub or email us; we will fully annotate your genome, add suggested pathways/metabolisms of interest, and build custom HMMs to be added to Cerberus and FunGene.
Copyright
This software is copyrighted by the University of North Carolina at Charlotte, Jose L Figueroa III, Eliza Dhungel, Madeline Bellanger, Cory R Brouwer, and Richard Allen White III. All rights reserved. Cerberus is a bioinformatic tool that may be distributed freely for academic use only. Please contact us for commercial use. The software is provided "as is", and the copyright owners or contributors are not liable for any direct, indirect, incidental, special, or consequential damages, including but not limited to procurement of goods or services, or loss of use, data, or profits, arising in any way out of the use of this software.
Citing Cerberus
If you are publishing results obtained using Cerberus, please cite:
Publication
Figueroa III JL, Dhungel E, Bellanger M, Brouwer CR, White III RA. 2024.
Cerberus: distributed highly parallelized HMM-based processing for robust functional annotation across the tree of life. Bioinformatics
Pre-print
Figueroa III JL, Dhungel E, Brouwer CR, White III RA. 2023.
Cerberus: distributed highly parallelized HMM-based processing for robust functional annotation across the tree of life. bioRxiv
Prerequisites and dependencies
All external tool dependencies are available from Bioconda.
Cerberus databases
All pre-formatted databases are available at OSF.
CONTACT
The informatics point-of-contact for this project is Dr. Richard Allen White III. If you have any questions or feedback, please feel free to get in touch by email:
Dr. Richard Allen White III
Jose Luis Figueroa III
Or open an issue on GitHub.