GMSC-mapper is a command line tool to query the Global Microbial smORFs Catalog (GMSC).
GMSC-mapper can be used to:
Find query smORFs (< 100 aa) homologous to sequences in the Global Microbial smORFs Catalog (GMSC) by alignment. Three types of input are supported: contigs (GMSC-mapper will predict smORFs from contigs first), amino acid sequences, and nucleotide gene sequences.
Annotate query / predicted smORFs with quality, habitat and taxonomy information that was constructed manually in detail.
If you use this tool in a publication, please cite:
Duan, Y., Santos-Júnior, C.D., Schmidt, T.S. et al. A catalog of small proteins from the global microbiome. Nat Commun 15, 7563 (2024). https://doi.org/10.1038/s41467-024-51894-6
You can print this citation with gmsc-mapper citation.
Installation
Clone GMSC-mapper repository
Create conda environment
You will need the following dependencies:
The easiest way to install the dependencies is with conda:
Once the dependencies are installed, you can install GMSC-mapper by running:
Example test
Because the whole GMSC database is large and takes some minutes to process, you can first test with mock datasets to check that the installation works.
Please use GMSC-mapper as your working directory.
Default alignment tool is DIAMOND.
The default alignment tool is DIAMOND. If you want to use MMseqs2 as your alignment tool, you need to create the GMSC database index in MMseqs2 format.
After index creation, you can specify the tool as mmseqs; other usage is the same as above.
Usage
Default usage
Download GMSC database and create index
We recommend using GMSC-mapper as your current working directory, so that you can directly follow the commands below.
Download GMSC database
The default --dbdir is ./db. If you want to use a custom --dbdir directory, it should be consistent with -o in the next step, which creates the database index.
Create GMSC database index
The input (-i) is the FASTA file (GMSC10.90AA.faa.gz) downloaded to the dbdir (default: ./db) in the download step.
The default -o is ./db. If you want to use a custom -o directory, it should be consistent with --dbdir in the previous download step.
GMSC Annotation
The GMSC database directory (--dbdir) and the output directory (-o) can be set to any location. The defaults are ./db and ./output.
If you use GMSC-mapper as your current working directory, you can directly follow the commands below. Otherwise, you need to set --dbdir to the directory that contains the database files.
Further usage
Habitat / taxonomy / quality / domain annotation is optional
If you don't want to annotate habitat / taxonomy / quality / conserved domain, you can use --no-habitat / --no-taxonomy / --no-quality / --no-domain.
Alignment tool: DIAMOND / MMseqs2 is optional
The default alignment tool is DIAMOND. If you want to use MMseqs2 as your alignment tool, you need to create the GMSC database index in MMseqs2 format.
Then you can set --tool to mmseqs.
Output files
The output folder will contain
Outputs of smORFs prediction (predicted.filtered.smorf.faa)
A FASTA file with the sequences of the predicted smORFs. It is generated when the input file is contigs.
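As a minimal sketch of working with this file, the snippet below parses FASTA text and reports records that are not shorter than 100 amino acids, the smORF length limit used by GMSC-mapper. The function names and the sample records are hypothetical, and it is an assumption that a trailing `*` stop symbol, if present, should not count towards the length.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence id: sequence}."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current] += line
    return records


def is_smorf(seq: str, max_len: int = 100) -> bool:
    """True if the sequence is shorter than max_len residues.

    Assumption: a trailing '*' stop symbol is not counted.
    """
    return len(seq.rstrip("*")) < max_len


# Hypothetical records standing in for predicted.filtered.smorf.faa
example = [">smorf_1", "MKKLLA", "VVGA*", ">long_1", "M" * 120]
records = read_fasta(example)
too_long = [name for name, seq in records.items() if not is_smorf(seq)]
print(too_long)
```

To read a real file, the same function can be given an open text handle, for example `open(path)` or `gzip.open(path, "rt")` for a .gz compressed file.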
Complete alignment result table (alignment.out.smorfs.tsv)
A file listing all the query hits of GMSC, from Diamond or MMseqs2.
The file contains the following tab-separated columns:
qseqid: Query seq id
sseqid: Target seq id (in GMSC)
full_qseq: Query sequences
full_sseq: Target sequences (in GMSC)
qlen: Query sequences length
slen: Target sequences length
length: Alignment length
qstart: Start of alignment in query
qend: End of alignment in query
sstart: Start of alignment in target
send: End of alignment in target
bitscore: Bit score
pident: Percentage of identical matches
evalue: Expect value
qcovhsp: Query Coverage
scovhsp: Target Coverage
Total smORFs homologous to GMSC (mapped.smorfs.faa)
A FASTA file with the sequences of query/predicted smORFs homologous to GMSC.
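The alignment table (alignment.out.smorfs.tsv) can be post-processed with a few lines of Python. The sketch below reads the tab-separated rows and keeps the highest-scoring hit per query. It assumes the columns appear in the order listed for that file and that a header row, if present, starts with `qseqid`; the function names and sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List

# Column names as listed for alignment.out.smorfs.tsv (order is an assumption).
COLUMNS = ["qseqid", "sseqid", "full_qseq", "full_sseq", "qlen", "slen",
           "length", "qstart", "qend", "sstart", "send", "bitscore",
           "pident", "evalue", "qcovhsp", "scovhsp"]


def read_alignments(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read tab-separated alignment rows into dicts keyed by column name."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and a header row, if the file has one (assumption).
        if not fields or fields[0] == "qseqid":
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def best_hits(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Keep, for each query, the hit with the highest bit score."""
    best: Dict[str, Dict[str, str]] = {}
    for row in rows:
        query = row["qseqid"]
        if query not in best or float(row["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = row
    return best


# Hypothetical rows; real target ids come from GMSC.
sample_rows = [
    ["q1", "target_A", "MKK", "MKR", "3", "3", "3", "1", "3", "1", "3", "5.0", "66.7", "1e-3", "100.0", "100.0"],
    ["q1", "target_B", "MKK", "MKK", "3", "3", "3", "1", "3", "1", "3", "9.5", "100.0", "1e-6", "100.0", "100.0"],
    ["q2", "target_C", "MAL", "MAL", "3", "3", "3", "1", "3", "1", "3", "8.1", "100.0", "1e-5", "100.0", "100.0"],
]
sample_text = "\n".join("\t".join(r) for r in sample_rows) + "\n"
best = best_hits(read_alignments(io.StringIO(sample_text)))
print({query: hit["sseqid"] for query, hit in best.items()})
```

Ranking by bit score is only one possible choice; the same loop could rank by `pident` or `evalue` instead.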
Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv)
This file lists the habitat annotations of the query/predicted sequence, where the habitat is obtained from the sequence annotations of its homologous origin in GMSC.
There are two columns in the file:
qseqid: Query seq id
habitat: Habitat, ‘,’ separated if the sequence is from multiple habitats
Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)
This file lists the taxonomy annotations of the query/predicted sequence, where the taxonomy is obtained from the sequence annotations of its homologous origin in GMSC.
There are two columns in the file:
qseqid: Query seq id
taxonomy: Taxonomy, ‘;’ separated between each taxonomy rank
Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)
This file lists the quality annotations of the query/predicted sequence, where the quality is obtained from the sequence annotations of its homologous origin in GMSC.
qseqid: Query seq id
quality: Quality label
Conserved domain annotation of smORFs (optional) (domain.out.smorfs.tsv)
This file lists the conserved domain annotations of the query/predicted sequence, where the conserved domain is obtained from the sequence annotations of its homologous origin in GMSC.
qseqid: Query seq id
cdd: Identifiers from Conserved domain database, ‘,’ separated if the sequence is annotated with multiple conserved domains.
Summary (summary.txt)
A file providing a human-readable summary of the results.
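The two-column annotation tables described above (habitat, taxonomy, conserved domain) share the same layout, so one small reader can cover them. The sketch below splits the second column on the separator given for each file (‘,’ for habitat and cdd, ‘;’ for taxonomy). It assumes the columns are tab-separated and that a header row, if present, starts with `qseqid`; the function name and sample values are hypothetical.

```python
import io
from typing import Dict, Iterable, List


def read_annotation(handle: Iterable[str], sep: str) -> Dict[str, List[str]]:
    """Map each query id to its annotation values, split on `sep`."""
    result: Dict[str, List[str]] = {}
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # First column is the query id, the rest is the annotation string.
        qseqid, _, value = line.partition("\t")
        if qseqid == "qseqid":  # header row, if present (assumption)
            continue
        result[qseqid] = [v.strip() for v in value.split(sep) if v.strip()]
    return result


# Hypothetical contents standing in for habitat.out.smorfs.tsv
habitat_text = "q1\tmarine,soil\nq2\thuman gut\n"
habitats = read_annotation(io.StringIO(habitat_text), sep=",")

# Hypothetical contents standing in for taxonomy.out.smorfs.tsv
taxonomy_text = "q1\td__Bacteria;p__Proteobacteria\n"
taxonomy = read_annotation(io.StringIO(taxonomy_text), sep=";")

print(habitats)
print(taxonomy)
```

Queries present in several tables can then be joined on their ids, for example with `habitats.keys() & taxonomy.keys()`.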
Parameters
-i/--input: Path to the input genome contig sequence FASTA file (possibly .gz compressed).
--aa-genes: Path to the input amino acid sequence FASTA file (possibly .gz compressed).
--nt-genes: Path to the input nucleotide gene sequence FASTA file (possibly .gz compressed).
--dbdir: Path to the GMSC database directory. (default: ./db)
-o/--output: Output directory (will be created if non-existent). (default: ./output)
--tool: Sequence alignment tool (Diamond / MMseqs). (default: diamond)
-s/--sensitivity: Sensitivity. (default: --sensitive (Diamond), 5.7 (mmseqs))
--id: Minimum identity to report an alignment (range 0.0-1.0). (default: 0.0)
--cov: Minimum coverage to report an alignment (range 0.0-1.0). (default: 0.9)
-e/--evalue: Maximum e-value to report alignments. (default: 1e-05)
-t/--threads: Number of CPU threads. (default: 1)
--filter: Use this to filter <100 aa or <303 nt input sequences. (default: False)
--no-habitat: Use this if no need to annotate habitat. (default: False)
--no-taxonomy: Use this if no need to annotate taxonomy. (default: False)
--no-quality: Use this if no need to annotate quality. (default: False)
--no-domain: Use this if no need to annotate conserved domain. (default: False)
--quiet: Disable alignment console output. (default: False)
Subcommands and Parameters
Download GMSC database annotation index files
Subcommands: gmsc-mapper downloaddb
--dbdir: Path to GMSC database annotation index files. (default: ./db. If GMSC-mapper is your current work directory, the database files will be downloaded at GMSC-mapper/db)
--all: Download all database
-f: Force download even if the files exist
Create database index of Diamond and mmseqs
Subcommands: gmsc-mapper createdb
-i: Path to the GMSC FASTA file.
-o/--output: Path to database index output of Diamond and MMseqs2. (default: ./db. If GMSC-mapper is your current work directory, the database files will be created at GMSC-mapper/db)
-m/--mode: Alignment tool (Diamond / MMseqs2).
--quiet: Disable alignment console output. (default: False)
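When GMSC-mapper is driven from a script, it can help to assemble the three steps (download, index creation, annotation) as argument lists. The sketch below only uses flags listed in this document; the helper names and the input file name `contigs.fasta` are hypothetical, and it is an assumption that `-m` accepts the same value spelling (`mmseqs`) as `--tool`.

```python
import shlex
from typing import List, Optional


def downloaddb_cmd(dbdir: str = "./db", all_files: bool = False) -> List[str]:
    """Arguments for downloading the GMSC database files."""
    cmd = ["gmsc-mapper", "downloaddb", "--dbdir", dbdir]
    if all_files:
        cmd.append("--all")
    return cmd


def createdb_cmd(fasta: str, outdir: str = "./db",
                 mode: Optional[str] = None) -> List[str]:
    """Arguments for creating the alignment index from the GMSC FASTA file."""
    cmd = ["gmsc-mapper", "createdb", "-i", fasta, "-o", outdir]
    if mode is not None:
        cmd += ["-m", mode]  # value spelling is an assumption
    return cmd


def annotate_cmd(contigs: str, dbdir: str = "./db", outdir: str = "./output",
                 tool: Optional[str] = None,
                 threads: Optional[int] = None) -> List[str]:
    """Arguments for annotating contigs against GMSC."""
    cmd = ["gmsc-mapper", "-i", contigs, "--dbdir", dbdir, "-o", outdir]
    if tool is not None:
        cmd += ["--tool", tool]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


# The index output directory (-o) is kept consistent with --dbdir.
steps = [
    downloaddb_cmd("./db"),
    createdb_cmd("./db/GMSC10.90AA.faa.gz", outdir="./db", mode="mmseqs"),
    annotate_cmd("contigs.fasta", dbdir="./db", tool="mmseqs", threads=4),
]
for step in steps:
    print(shlex.join(step))
```

Each list can be passed to `subprocess.run(step, check=True)` once gmsc-mapper is installed; printing the commands first makes it easy to check them by eye.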