GMSC-mapper

GMSC-mapper is a command line tool to query the Global Microbial smORFs Catalog (GMSC).

GMSC-mapper can be used to:

- Find query smORFs (< 100 aa) homologous to sequences in the Global Microbial smORFs Catalog (GMSC) by alignment. Three types of input are supported:
  - contigs (GMSC-mapper will predict smORFs from contigs first)
  - amino acid sequences
  - nucleotide gene sequences
- Annotate query / predicted smORFs with manually constructed, detailed quality, habitat and taxonomy information.
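
To check which sequences in a protein FASTA file fall within the smORF size range (< 100 aa) before using it as input, a helper along the following lines may be useful. This is a minimal sketch and not part of GMSC-mapper: the function name list_small_proteins is hypothetical, the input is assumed to be an uncompressed FASTA file, and a trailing * stop symbol is not counted as a residue.

```bash
# Hypothetical helper, not part of GMSC-mapper.
# Prints "<id><TAB><length>" for every record shorter than 100 residues.
list_small_proteins() {
    awk '
        # Print the record collected so far if it is below the size limit.
        function emit() { if (id != "" && len < 100) printf "%s\t%d\n", id, len }
        # Header line: flush the previous record and start a new one.
        /^>/ { emit(); id = substr($1, 2); len = 0; next }
        # Sequence line: drop whitespace and stop symbols, then add its length.
        { gsub(/[ \t\r*]/, ""); len += length($0) }
        END { emit() }
    ' "$1"
}
```

For example, `list_small_proteins ./examples/example.faa` would print one line per sequence shorter than 100 residues.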

If you use this tool in a publication, please cite:

Duan, Y., Santos-Júnior, C.D., Schmidt, T.S. et al. A catalog of small proteins from the global microbiome. Nat Commun 15, 7563 (2024). https://doi.org/10.1038/s41467-024-51894-6

You can print this citation with gmsc-mapper citation.

Installation

Clone GMSC-mapper repository

git clone https://github.com/BigDataBiology/GMSC-mapper.git

Create conda environment

conda create -n gmscmapper python
conda activate gmscmapper

You will need external dependencies such as MMseqs2 and DIAMOND. The easiest way to install them is with conda:

conda install -c bioconda -c conda-forge mmseqs2
conda install -c bioconda -c conda-forge diamond

Once the dependencies are installed, you can install GMSC-mapper by running:

cd GMSC-mapper
pip install .

Example test

As the whole GMSC database is large and takes some minutes to process, you can test with mock datasets to check if the installation works well.

Please use the GMSC-mapper directory as your working directory.

cd GMSC-mapper

Create GMSC database index

The default alignment tool is DIAMOND.

gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m diamond

When input is genome contig sequences:

gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --dbdir ./examples/

When input is amino acid sequences:

gmsc-mapper --aa-genes ./examples/example.faa -o ./examples_output/ --dbdir ./examples/

When input is nucleotide gene sequences:

gmsc-mapper --nt-genes ./examples/example.fna -o ./examples_output/ --dbdir ./examples/

Check another alignment tool: MMseqs2

The default alignment tool is DIAMOND. To use MMseqs2 instead, you need to create the GMSC database index in MMseqs2 format.

gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m mmseqs

After the index is created, set --tool to mmseqs. The rest of the usage is the same as above.

gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --dbdir ./examples/ --tool mmseqs
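
The example test above can be wrapped in one function that exercises both alignment tools in turn. This is a sketch only: the function name run_example_test and the per-tool output directories (./examples_output_diamond/ and ./examples_output_mmseqs/) are assumptions made for illustration, while the gmsc-mapper options are the ones shown above.

```bash
# Hypothetical wrapper around the example test commands shown above.
# Builds the index and maps the example contigs once per alignment tool.
run_example_test() {
    local tool
    for tool in diamond mmseqs; do
        # Create the index for the mock database in the format of this tool.
        gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m "$tool" || return 1
        # Map the example contigs; a separate output directory per tool
        # (an assumption of this sketch) keeps the results apart.
        gmsc-mapper -i ./examples/example.fa -o "./examples_output_${tool}/" \
            --dbdir ./examples/ --tool "$tool" || return 1
    done
}
```

Run it from the GMSC-mapper directory with `run_example_test`. It stops at the first command that fails.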

Usage

Default usage

Download GMSC database and create index

We recommend using the GMSC-mapper directory as your current working directory, so that you can follow the commands below directly.

cd GMSC-mapper

Download GMSC database

gmsc-mapper downloaddb --dbdir ./db

The default --dbdir is ./db. If you use a custom --dbdir directory, pass the same directory as -o in the next step, when creating the database index.

Create GMSC database index

gmsc-mapper createdb -i ./db/GMSC10.90AA.faa.gz -o ./db -m diamond

The input (-i) is the fasta file (GMSC10.90AA.faa.gz) downloaded to the dbdir (default: ./db) in the downloading step.

The default -o is ./db. If you use a custom -o directory, it should be the same directory that was given as --dbdir in the previous download step.
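
One way to keep --dbdir and -o consistent is to pass a single directory variable to both steps. The sketch below assumes this approach; the function name prepare_gmsc_db is hypothetical, and the commands and options are the ones shown above.

```bash
# Hypothetical wrapper: download the database and build its index
# in the same directory, so that --dbdir and -o always agree.
#   $1: database directory (default: ./db)
#   $2: alignment tool, diamond or mmseqs (default: diamond)
prepare_gmsc_db() {
    local db_dir="${1:-./db}" tool="${2:-diamond}"
    gmsc-mapper downloaddb --dbdir "$db_dir" || return 1
    gmsc-mapper createdb -i "$db_dir/GMSC10.90AA.faa.gz" -o "$db_dir" -m "$tool"
}
```

For example, `prepare_gmsc_db /data/gmsc` would download to and index within /data/gmsc (an illustrative path); later mapping commands then need `--dbdir /data/gmsc`.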

GMSC Annotation

The GMSC database directory (--dbdir) and the output directory (-o) can be set to any location. The defaults are ./db and ./output, respectively.

If GMSC-mapper is your current working directory, you can follow the commands below directly. Otherwise, set --dbdir to the directory that contains the database files.

cd GMSC-mapper

Input is genome contig sequences.

gmsc-mapper -i ./examples/example.fa --dbdir ./db

Input is amino acid sequences.

gmsc-mapper --aa-genes ./examples/example.faa --dbdir ./db

Input is nucleotide gene sequences.

gmsc-mapper --nt-genes ./examples/example.fna --dbdir ./db

Further usage

Habitat / taxonomy / quality / domain annotation is optional

If you don’t want to annotate habitat / taxonomy / quality / domain, you can use --no-habitat / --no-taxonomy / --no-quality / --no-domain.

gmsc-mapper -i ./examples/example.fa --dbdir ./db --no-habitat --no-taxonomy --no-quality --no-domain

Alignment tool: DIAMOND / MMseqs2 is optional

The default alignment tool is DIAMOND. To use MMseqs2 instead, you need to create the GMSC database index in MMseqs2 format.

gmsc-mapper createdb -i ./db/GMSC10.90AA.faa.gz -o ./db -m mmseqs

Then you can set --tool to mmseqs.

gmsc-mapper -i ./examples/example.fa --dbdir ./db --tool mmseqs

Output files

The output folder will contain

Outputs of smORFs prediction (predicted.filtered.smorf.faa)

A FASTA file with the sequences of the predicted smORFs. It is generated when the input file contains contigs.

Complete alignment result table (alignment.out.smorfs.tsv)

A file listing all hits of the query sequences against GMSC, as reported by DIAMOND or MMseqs2.

The file contains the following tab-separated columns:

- qseqid: Query seq id
- sseqid: Target seq id (in GMSC)
- full_qseq: Query sequences
- full_sseq: Target sequences (in GMSC)
- qlen: Query sequences length
- slen: Target sequences length
- length: Alignment length
- qstart: Start of alignment in query
- qend: End of alignment in query
- sstart: Start of alignment in target
- send: End of alignment in target
- bitscore: Bit score
- pident: Percentage of identical matches
- evalue: Expect value
- qcovhsp: Query Coverage
- scovhsp: Target Coverage
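
The table can be filtered further with standard tools. The sketch below keeps hits at or above a minimum percentage identity. It rests on two assumptions that should be checked against your own output: that the columns appear in the order listed above (so pident is column 13 and qcovhsp is column 15), and that a header line, if present, starts with qseqid. The function name filter_hits_by_identity is hypothetical.

```bash
# Hypothetical helper for alignment.out.smorfs.tsv.
#   $1: path to the alignment table
#   $2: minimum pident, in the same units as the pident column (default: 0)
# Prints qseqid, sseqid, pident and qcovhsp for the hits that pass.
filter_hits_by_identity() {
    awk -F '\t' -v min_id="${2:-0}" '
        # Skip a header line if there is one (assumed to start with "qseqid").
        NR == 1 && $1 == "qseqid" { next }
        # Column 13 = pident, column 15 = qcovhsp (assumed column order).
        $13 + 0 >= min_id + 0 { printf "%s\t%s\t%s\t%s\n", $1, $2, $13, $15 }
    ' "$1"
}
```

For example, `filter_hits_by_identity ./output/alignment.out.smorfs.tsv 90` would list the hits with at least 90% identical matches.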

Total smORFs homologous to GMSC (mapped.smorfs.faa)

A FASTA file with the sequences of query/predicted smORFs homologous to GMSC.

Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv)

This file lists the habitat annotations of the query/predicted sequences. The habitat is taken from the annotations of the homologous sequences in GMSC.

There are two columns in the file:

- qseqid: Query seq id
- habitat: Habitat, ‘,’ separated if the sequence is from multiple habitats
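
Because one query can carry several habitats in a single field, counting habitats requires splitting that field first. A minimal sketch, assuming the two-column layout described above and an optional header line starting with qseqid (the function name count_habitats is hypothetical):

```bash
# Hypothetical helper for habitat.out.smorfs.tsv.
# Prints "<habitat><TAB><number of queries>" sorted by habitat name.
count_habitats() {
    awk -F '\t' '
        # Skip a header line if there is one (assumed to start with "qseqid").
        NR == 1 && $1 == "qseqid" { next }
        # Split the comma-separated habitat field and count each habitat once per query line.
        { n = split($2, parts, ","); for (i = 1; i <= n; i++) count[parts[i]]++ }
        END { for (h in count) printf "%s\t%d\n", h, count[h] }
    ' "$1" | sort
}
```

For example, `count_habitats ./output/habitat.out.smorfs.tsv` gives a quick overview of which habitats the mapped smORFs come from.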

Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)

This file lists the taxonomy annotations of the query/predicted sequences. The taxonomy is taken from the annotations of the homologous sequences in GMSC.

There are two columns in the file:

- qseqid: Query seq id
- taxonomy: Taxonomy, ‘;’ separated between each taxonomy rank

Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)

This file lists the quality annotations of the query/predicted sequences. The quality is taken from the annotations of the homologous sequences in GMSC.

- qseqid: Query seq id
- quality: Quality label

Conserved domain annotation of smORFs (optional) (domain.out.smorfs.tsv)

This file lists the conserved domain annotations of the query/predicted sequences. The conserved domains are taken from the annotations of the homologous sequences in GMSC.

- qseqid: Query seq id
- cdd: Identifiers from Conserved domain database, ‘,’ separated if the sequence is annotated with multiple conserved domains.

Summary (summary.txt)

A file providing a human-readable summary of the results.
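
After a run, it can be handy to see at a glance which of the files described above are present. The sketch below only checks for the file names listed in this section; the function name check_outputs is hypothetical. A missing file is not necessarily an error, since the predicted smORFs file is only written for contig input and the annotation files are optional.

```bash
# Hypothetical helper: report which documented output files exist.
#   $1: output directory (default: ./output)
check_outputs() {
    local out_dir="${1:-./output}" f
    for f in predicted.filtered.smorf.faa alignment.out.smorfs.tsv \
             mapped.smorfs.faa habitat.out.smorfs.tsv taxonomy.out.smorfs.tsv \
             quality.out.smorfs.tsv domain.out.smorfs.tsv summary.txt; do
        if [ -e "$out_dir/$f" ]; then
            echo "found $f"
        else
            echo "missing $f"
        fi
    done
}
```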

Parameters

-i/--input: Path to the input genome contig sequence FASTA file (possibly .gz compressed).
--aa-genes: Path to the input amino acid sequence FASTA file (possibly .gz compressed).
--nt-genes: Path to the input nucleotide gene sequence FASTA file (possibly .gz compressed).
--dbdir: Path to the GMSC database directory. (default: ./db)
-o/--output: Output directory (will be created if non-existent). (default: ./output)
--tool: Sequence alignment tool (diamond / mmseqs). (default: diamond)
-s/--sensitivity: Sensitivity. (default: --sensitive for DIAMOND, 5.7 for MMseqs2)
--id: Minimum identity to report an alignment (range 0.0-1.0). (default: 0.0)
--cov: Minimum coverage to report an alignment (range 0.0-1.0). (default: 0.9)
-e/--evalue: Maximum e-value to report alignments. (default: 1e-05)
-t/--threads: Number of CPU threads. (default: 1)
--filter: Use this to filter <100 aa or <303 nt input sequences. (default: False)
--no-habitat: Use this if no need to annotate habitat. (default: False)
--no-taxonomy: Use this if no need to annotate taxonomy. (default: False)
--no-quality: Use this if no need to annotate quality. (default: False)
--no-domain: Use this if no need to annotate conserved domain. (default: False)
--quiet: Disable alignment console output. (default: False)
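
Several of these parameters can be combined in one call. The sketch below maps an amino acid FASTA file with stricter reporting thresholds than the defaults. The function name map_proteins_strict and the chosen values (--id 0.9, -t 4) are illustrative assumptions, not recommendations from the GMSC-mapper authors; the options themselves are the ones listed above.

```bash
# Hypothetical wrapper combining several of the parameters listed above.
#   $1: amino acid FASTA file
#   $2: output directory (default: ./output)
#   $3: database directory (default: ./db)
map_proteins_strict() {
    local query="$1" out_dir="${2:-./output}" db_dir="${3:-./db}"
    # --id and --cov are fractions in the range 0.0-1.0; --filter enables
    # the length filter on the input sequences; -t sets the CPU threads.
    gmsc-mapper --aa-genes "$query" -o "$out_dir" --dbdir "$db_dir" \
        --id 0.9 --cov 0.9 -e 1e-05 -t 4 --filter
}
```

For example, `map_proteins_strict ./examples/example.faa` uses the default ./output and ./db directories.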

Subcommands and Parameters

Download GMSC database annotation index files

Subcommand: gmsc-mapper downloaddb

--dbdir: Path to GMSC database annotation index files. (default: ./db. If GMSC-mapper is your current work directory, the database files will be downloaded at GMSC-mapper/db)
--all: Download all database files
-f: Force download even if the files exist

Create database index for DIAMOND and MMseqs2

Subcommand: gmsc-mapper createdb

-i: Path to the GMSC FASTA file.
-o/--output: Output directory for the DIAMOND or MMseqs2 database index. (default: ./db. If GMSC-mapper is your current work directory, the database files will be created at GMSC-mapper/db)
-m/--mode: Alignment tool (diamond / mmseqs).
--quiet: Disable alignment console output. (default: False)