mGEMS bin --groups group-3,group-4 --themisto-alns pseudoalignments_1.aln.gz,pseudoalignments_2.aln.gz -i reference_grouping.txt -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt --index themisto_index --write-unassigned
-r Comma-separated list of input read file(s).
-i Group identifiers file used with the mSWEEP call.
--themisto-alns Comma-separated list of pseudoalignment file(s) for the reads from Themisto.
-o Output directory (must exist before running!).
--probs Comma-separated posterior probability matrix (output from mSWEEP with the --write-probs flag).
-a Relative abundance estimates from mSWEEP (tab-separated; the 1st column has the group names and the 2nd column the estimates).
--index Themisto pseudoalignment index directory.
--groups (Optional) Which groups to extract from the input reads.
--min-abundance (Optional) Extract only groups that have a relative abundance higher than this value.
--compress (Optional) Toggle compressing the output files (default: compress).
--write-unassigned (Optional) Extract reads that pseudoaligned to a reference sequence but were not assigned to any group (default: off).
--write-assignment-table (Optional) Write the read-to-group assignments table to `reads_to_groups.tsv` in the output directory (default: off).
--unique-only (Optional) Write only the reads that are assigned to a single group.
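Before choosing a value for --min-abundance, it can help to preview which groups would pass the threshold. The sketch below is not part of mGEMS: it reads a toy abundance table in the layout described for -a (tab-separated, group name in the 1st column, estimate in the 2nd) and prints the groups whose estimate is higher than the threshold as a comma-separated list suitable for --groups. The toy values, the threshold, and the assumption that header lines start with `#` are illustrative only.

```shell
#!/bin/sh
# Toy abundance table for illustration only (the values are made up).
abundances=$(mktemp)
printf '# toy header\ngroup-1\t0.70\ngroup-2\t0.005\ngroup-3\t0.25\ngroup-4\t0.045\n' > "$abundances"

# Hypothetical threshold; pick one that suits your data.
threshold=0.01

# Keep the names (column 1) of groups whose abundance (column 2) is
# strictly higher than the threshold, skipping lines that start with '#',
# and join the names with commas.
groups=$(awk -F '\t' -v min="$threshold" \
    '!/^#/ && $2 + 0 > min + 0 { print $1 }' "$abundances" | paste -sd, -)

echo "$groups"
rm -f "$abundances"
```

With a real mSWEEP_abundances.txt in place of the toy file, the printed list can be passed to --groups, or the same threshold can be given directly to --min-abundance.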
mGEMS
Strain-level binning of bacterial sequencing data based on probabilistic taxonomic classification.
For more about mGEMS, see the article “Bacterial genomic epidemiology with mixed samples” in Microbial Genomics.
Installation
In addition to mGEMS, running the binning pipeline will likely require a program that performs pseudoalignment and another program that estimates an assignment probability matrix for the reads to the alignment targets. Please see the Dependencies subsection for more details.
Conda
Install mGEMS from bioconda with
check that the installation succeeded by running
mGEMS binaries
Precompiled binaries are available for
Compiling from source
Requirements
Compilation
Clone the repository
enter the directory and run
This will compile the mGEMS executable in the build/bin/ directory.
Dependencies
We recommend using Themisto (v2.0.0 or newer) for pseudoalignment and mSWEEP (v1.3.2 or newer) for estimating the probability matrix.
For assembling the bins output by mGEMS, we recommend shovill for typical use cases, but metagenomic assemblers like MEGAHIT may perform better when the differences between the bins are especially small (see Supplementary Figure 2 of the mGEMS preprint). Shovill comes with an option to use different assemblers as the backend (the default is SPAdes).
mSWEEP and shovill can be easily installed from bioconda.
Usage
mGEMS
The mGEMS executable provides three commands: mGEMS, mGEMS bin, and mGEMS extract. The first command (mGEMS) is shorthand for running both mGEMS bin and mGEMS extract, which bin the reads in the input pseudoalignment (mGEMS bin) and extract the binned reads from the original mixed samples (mGEMS extract).
Tutorial — E. coli ST131 sublineages
A tutorial for reproducing the E. coli ST131 sublineage phylogenetic tree presented in Mäklin et al. 2020 using mGEMS is available in the docs folder of this repository.
Quickstart — full pipeline
Index the reference sequences
Build a Themisto index to align against.
Pseudoalign the reads
Align the paired-end reads ‘reads_1.fastq.gz’ and ‘reads_2.fastq.gz’ with Themisto (note that the --sort-output flag must be used!)
Estimate the relative abundances with mSWEEP. The file reference_grouping.txt should contain the groups that the sequences in ‘example.fasta’ are assigned to; see the mSWEEP usage instructions for details.
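A mismatch between the reference sequences and the grouping file is easy to miss. The sketch below is a simple sanity check and not part of mGEMS or mSWEEP. It assumes that reference_grouping.txt holds one group label per line, one line per sequence in the FASTA file, which is our reading of the mSWEEP usage instructions; the toy files and their contents are made up for illustration.

```shell
#!/bin/sh
# Toy inputs for illustration only; replace them with your own files.
tmp=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$tmp/example.fasta"
printf 'group-1\ngroup-1\ngroup-2\n' > "$tmp/reference_grouping.txt"

# Count FASTA records (header lines start with '>') and grouping lines.
n_seqs=$(grep -c '^>' "$tmp/example.fasta")
n_groups=$(wc -l < "$tmp/reference_grouping.txt" | tr -d ' ')

# Assumption: one group label per reference sequence.
if [ "$n_seqs" -eq "$n_groups" ]; then
    echo "OK: $n_seqs sequences and $n_groups group labels"
else
    echo "Mismatch: $n_seqs sequences but $n_groups group labels" >&2
fi
rm -rf "$tmp"
```

If the two counts differ, fix the grouping file before estimating the abundances.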
Bin the reads and write all bins to the ‘mGEMS-out’ folder
This will write the binned paired-end reads for all groups listed in the mSWEEP_abundances.txt file to the mGEMS-out folder (compressed with zlib).
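The binning step can be sketched as a single mGEMS call that uses the input flags documented in this README. The file names follow the quickstart text and the mGEMS bin example; treat them as placeholders for your own files. Because the output directory must exist before mGEMS runs, the sketch creates it first, and it only calls mGEMS if the executable is found on the PATH.

```shell
#!/bin/sh
# Placeholder name taken from the quickstart; adjust it to your setup.
out_dir="mGEMS-out"

# The output directory must exist before mGEMS is run.
mkdir -p "$out_dir"

if command -v mGEMS >/dev/null 2>&1; then
    # Bin the reads and extract all bins in one go.
    mGEMS -r reads_1.fastq.gz,reads_2.fastq.gz \
        -i reference_grouping.txt \
        --themisto-alns pseudoalignments_1.aln.gz,pseudoalignments_2.aln.gz \
        -o "$out_dir" \
        --probs mSWEEP_probs.csv \
        -a mSWEEP_abundances.txt \
        --index themisto_index
    # List the files that were written.
    ls "$out_dir"
else
    echo "mGEMS was not found on the PATH; install it before running this step"
fi
```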
Advanced use
You can also extract the read-to-group assignments table that mGEMS uses internally by adding the --write-assignment-table toggle to the call to mGEMS or mGEMS bin.
… or bin and write only the reads that are assigned to “group-3” or “group-4” by adding the --groups group-3,group-4 flag.
… or write the reads that pseudoaligned to a reference sequence but were not assigned to any group by adding the --write-unassigned flag.
Alternatively, find and write only the read bins for “group-3”, “group-4”, and the reads that pseudoaligned but were not assigned to any group, skipping the read extraction step.
The binned reads can then be extracted later with mGEMS extract.
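The three flag variants described in this section differ from the basic call only in the last argument. The sketch below assembles each command as a string and prints it for review rather than running it; the file names are the same placeholders used elsewhere in this README, and the variable names are ours.

```shell
#!/bin/sh
# Arguments shared by every variant; file names are placeholders.
base="-r reads_1.fastq.gz,reads_2.fastq.gz -i reference_grouping.txt"
base="$base --themisto-alns pseudoalignments_1.aln.gz,pseudoalignments_2.aln.gz"
base="$base -o mGEMS-out --probs mSWEEP_probs.csv -a mSWEEP_abundances.txt"
base="$base --index themisto_index"

# Also write the read-to-group assignments table (reads_to_groups.tsv).
with_table="mGEMS $base --write-assignment-table"
# Bin and write only the reads assigned to group-3 or group-4.
only_groups="mGEMS $base --groups group-3,group-4"
# Also write reads that pseudoaligned but were not assigned to any group.
with_unassigned="mGEMS $base --write-unassigned"

# Print the commands; copy one into the shell to run it.
printf '%s\n' "$with_table" "$only_groups" "$with_unassigned"
```

The flags can also be combined in one call, as the mGEMS bin example in this README does with --groups and --write-unassigned.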
Accepted input flags
mGEMS accepts the following input flags
Citation
If you use mGEMS, please cite us as “Mäklin T, Kallonen T, Alanko J et al. Bacterial genomic epidemiology with mixed samples. Microb Genom 2021, 7:11 (https://doi.org/10.1099/mgen.0.000691)”.
You should also cite the method that you used to estimate the input probability matrix to mGEMS, which is likely to be mSWEEP.
To cite a specific version of mGEMS, visit the releases page and find the DOI for the version of the program that you used. Then, cite the version (v1.1.0 in the example) as “Tommi Mäklin. (2021). PROBIC/mGEMS: mGEMS-v1.1.0 (20 October 2021) (v1.1.0). Zenodo. (https://doi.org/10.5281/zenodo.5583245)”. Citing the source code properly helps ensure that your analyses are reproducible. Please also cite the article if you use mGEMS.
License
The source code from this project is subject to the terms of the MIT license. A copy of the MIT license is supplied with the project, or can be obtained at https://opensource.org/licenses/MIT.