Cayman (Carbohydrate active enzymes profiling of metagenomes) is a command-line tool for profiling CAZyme abundances in metagenomic datasets. It takes metagenomic shotgun reads as input, preferably cleaned (quality-filtered and host-filtered), and produces a matrix of CAZyme Reads-Per-Kilobase-Million (RPKM) abundances for your sample. Cayman makes heavy use of the functional profiling library gqlib.
Prerequisites
Dependencies
python>=3.7,<3.11
bwa
The following Python libraries need to be installed:
numpy
pandas
pysam
intervaltree
gqlib>=2.14.3 (which should take care of all python library requirements)
pyhmmer (for protein set annotation)
You will need a bwa installation. If you did not install cayman via bioconda and are not using a container, one way to obtain it is to create a conda environment from the provided environment.yml with conda env create -f environment.yml.
Metagenomics reference datasets and CAZyme catalogues
Cayman uses prevalence-filtered reference data sets from the Global Microbial Gene Catalog (GMGC). We annotated these datasets with our dedicated CAZyme annotation method (cf. Ducarmon & Karcher et al.). The filtered GMGC datasets and their CAZyme annotations can be downloaded from Zenodo.
Prior to your first profiling run, you will have to build a bwa index from the respective GMGC reference dataset.
$ bwa index -p <index_name> [-b blocksize] /path/to/dataset
If you have enough memory available, setting -b to a higher value than the default (10000000), e.g. 100000000, may speed up the index generation.
Installation
Cayman can most easily be installed via
Bioconda: conda install -c bioconda cayman
PyPI: pip install cayman (note that you still require your own bwa installation)
Docker: docker pull ghcr.io/zellerlab/cayman:latest (or build your own with the supplied Dockerfile)
HPC container aficionado? – here’s a Singularity recipe (but you can also just use docker://ghcr.io/zellerlab/cayman:latest)
Dev? git clone https://github.com/zellerlab/cayman && cd cayman && pip install . (also requires a bwa installation)
Typical installation time is a couple of minutes. This mostly depends on the availability of the bioconda repository (conda installation), the GitHub container registry (pulling the container), PyPI (installation via pip, or dependency installation when installing from source), and/or github.com (installation from source).
Running Cayman
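Cayman is run from the command line via the cayman profile subcommand, which takes the mandatory and optional parameters described below.
Attention: As of version 0.10.0, cayman profiling is invoked with cayman profile instead of cayman.
Mandatory parameters
<input_options>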
Read files need to be in fastq format (best with fastq or fq file ending) and can be gzip compressed.
The <input_options> parameters depend on the library layout of your samples:
Paired-end data can be specified with -1 </path/to/reads1> -2 </path/to/reads2>. Each read will be counted as 0.5.
Single-end data can be specified with --singles </path/to/reads>. Each read will be counted as 1.
Orphaned reads, i.e. paired-end reads that have lost their mate during an upstream quality control step, can be specified with --orphans </path/to/orphans>. Each read will be counted as 0.5.
Samples comprising multiple fastq files (e.g. from multiple lanes) can be provided as space-separated lists. In the case of paired-end reads, ensure that the order of the files matches (e.g. -1 sampleX_lane1_R1.fq sampleX_lane2_R1.fq -2 sampleX_lane1_R2.fq sampleX_lane2_R2.fq)!
The choice of assigning an unpaired read set to be “true” single-end reads or orphan reads influences the read count distribution.
A read pair gets assigned a count of 2 x 0.5 = 1 (as both reads of a pair are derived from the same sequenced nucleic acid fragment.)
An orphan read gets assigned a count of 1 x 0.5 = 0.5.
A read from a single-end library gets assigned a count of 1.
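The weighting scheme described above can be made concrete with a small Python sketch. The function name and the dictionary labels are hypothetical and chosen only for illustration; this is not Cayman's internal code, just the arithmetic implied by the per-read weights.

```python
# Per-read weights as described above; the keys are illustrative labels only.
READ_WEIGHTS = {"paired": 0.5, "orphan": 0.5, "single": 1.0}


def fragment_count(n_paired_reads=0, n_orphans=0, n_singles=0):
    """Return the total count contributed by a set of aligned reads."""
    return (
        n_paired_reads * READ_WEIGHTS["paired"]
        + n_orphans * READ_WEIGHTS["orphan"]
        + n_singles * READ_WEIGHTS["single"]
    )


# A complete pair consists of 2 reads, each weighted 0.5.
print(fragment_count(n_paired_reads=2))
# An orphan contributes half of what a true single-end read contributes.
print(fragment_count(n_orphans=1), fragment_count(n_singles=1))
```

Declaring a set of unpaired reads as orphans rather than as single-end reads therefore halves their contribution to every count, which is why the choice affects the resulting profile.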
</path/to/annotation_db> is the path to a 4-column text file containing the reference domain annotation in bed4 format (contig, start, end, domain-type). It contains all the CAZy domain annotations for all ORFs in our gene catalog.
</path/to/bwa_index> refers to the path to the gene catalog bwa index.
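To illustrate the 4-column layout of the annotation file, here is a minimal Python sketch that reads bed4-style records. It assumes tab-separated columns; the helper name read_bed4 and the example records are made up for illustration and do not come from the actual catalogue.

```python
import csv
import io
from collections import namedtuple

# One annotated domain: contig, start, end, domain-type (the four bed4 columns).
Domain = namedtuple("Domain", ["contig", "start", "end", "domain_type"])


def read_bed4(handle):
    """Yield one Domain per non-empty, non-comment line of a tab-separated stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        contig, start, end, domain_type = row[:4]
        yield Domain(contig, int(start), int(end), domain_type)


# Made-up records, for illustration only.
example = io.StringIO("gene_0001\t10\t250\tGH13\ngene_0002\t0\t180\tCBM50\n")
domains = list(read_bed4(example))
for domain in domains:
    print(domain.contig, domain.domain_type, domain.end - domain.start)
```

To inspect a real annotation file, the same function could be applied to an opened file handle instead of the in-memory example.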
Optional parameters
--out_prefix is a string that will be prepended to the output files (default: "cayman"). If you want to store the output in a specific folder, then provide a path such as "/path/to/folder/some_prefix". Without "some_prefix" (i.e. with "/path/to/folder/" alone), the output file names will start with a "." and the files will therefore be hidden.
--min_identity is the minimum sequence identity level (default: 0.97) for an alignment of your read to a CAZyme domain to be included.
--min_seqlen is the minimum alignment length (actually aligned bases without soft/hard-clipping) to be included (default: 45[bp]).
--cpus_for_alignment is the number of CPUs to use for alignment (default: 1).
--db_format [DEPRECATED] determines the format of the cazy annotation db. This can either be hmmer (comma-separated with 1-based coordinates) or bed (tab-separated with 0-based start coordinate and 1-based end coordinate). As of v0.10.2, this parameter is no longer necessary and is only included to maintain backwards-compatibility with existing scripts.
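The difference between the two coordinate conventions mentioned for --db_format can be sketched as follows. This assumes that the 1-based hmmer-style end coordinate is inclusive, which the text above does not state explicitly; the function names are hypothetical.

```python
def hmmer_to_bed(start, end):
    """Convert 1-based, inclusive coordinates to a 0-based start and 1-based end."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_to_hmmer(start, end):
    """Convert a 0-based start and 1-based end back to 1-based, inclusive coordinates."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# A domain covering the first 150 positions of a contig.
bed_start, bed_end = hmmer_to_bed(1, 150)
print(bed_start, bed_end, bed_end - bed_start)
```

Under this convention only the start coordinate shifts by one, and the domain length is simply end minus start in the bed representation.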
Running with test data
A test dataset can be downloaded from Zenodo. It comprises 1 million paired-end reads derived from SRA record SRR7658598. On a system with 16 GB RAM and 4 CPU cores, this dataset can be processed within 5 minutes.
Results
<out_prefix>.cazy.txt contains the CAZy profile of the sample
The first line is the header, followed by the counts of the total aligned reads and filtered reads.
The following lines contain the counts for each CAZy family present in the sample: family name (feature), unique counts, unique counts rpkm-normalised, unique counts + ambiguous counts, unique counts + ambiguous counts rpkm-normalised.
<out_prefix>.gene_counts.txt contains the gene profiles of the sample. The format is identical to that of the CAZy profiles, except that the features are the detected genes from the respective gene catalogue.
<out_prefix>.aln_stats.txt contains statistics on the alignments in the sample.
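As a rough illustration of how a profile file with the layout described above could be read, the following Python sketch separates the read-count summary from the per-family rows. It assumes a tab-separated file with two summary rows directly after the header; the column labels, row names, and values in the example are placeholders and may differ from what Cayman actually writes.

```python
import csv
import io

# Made-up example content; labels and numbers are placeholders for illustration.
example = io.StringIO(
    "feature\tuniq_raw\tuniq_rpkm\tcombined_raw\tcombined_rpkm\n"
    "total_reads\t1000.0\t0.0\t1000.0\t0.0\n"
    "filtered_reads\t900.0\t0.0\t900.0\t0.0\n"
    "GH13\t12.0\t3.5\t15.5\t4.2\n"
    "CBM50\t4.0\t1.1\t4.0\t1.1\n"
)


def read_profile(handle, n_summary_rows=2):
    """Split a profile into summary counts and per-feature values."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    # Summary rows: keep only the first numeric column.
    summary = {row[0]: float(row[1]) for row in rows[:n_summary_rows]}
    # Feature rows: map each column label to its numeric value.
    features = {
        row[0]: dict(zip(header[1:], map(float, row[1:])))
        for row in rows[n_summary_rows:]
    }
    return summary, features


summary, families = read_profile(example)
print(summary)
print(families["GH13"])
```

Because the gene profile is described as having the same format, the same function should in principle apply to it as well, with gene identifiers taking the place of family names.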
Annotating protein sets with Cayman hmms
The default hmm_database can be obtained from Zenodo.