Trans-Kingdom Marker Gene Pipeline for Taxonomic Profiling of Human Metagenomes
If you want to detect and quantify phages, bacteria, and archaea in whole genome shotgun reads from human-derived samples, Marker-MAGu is the tool for you.
Basically, this tool uses the strategy (marker gene detection) and database (marker gene sequences) from Metaphlan4, then:
adds marker genes from 10s of thousands of phages derived from human metagenomes
tweaks the thresholds so that phages and bacteria are detected with similar specificity and sensitivity
The relative abundance of bacteria/archaea with be highly similar to Metaphlan4(default settings) if Marker-MAGu is run with --detection relaxed (33% of marker genes required for detection). With Marker-MAGu --detection default output will be a bit less sensitive and a bit more specific, using a stricter threshold (75% of marker genes required for detection). Changes in settings and databases will likely change the output
Also, as in Metaphlan4SGBs, or Species-level Genome Bins are genomically-distinct species.
Set the Marker-MAGu database directory variable MARKERMAGU_DB, e.g.:
conda env config vars set MARKERMAGU_DB=/path/to/DBs/v1.1
Container (Docker, Singularity, Podman)
Basic Instructions
Please note that, while I WAS able to get this to run using Docker/Docker Desktop on my Mac, I am not a Docker expert, and I may be unable to troubleshoot issues.
* be sure to mount your volumes/directories with the `Marker-MAGu` database as well as those with input read files
* I believe you can save environmental variables like MARKERMAGU_DB in `Docker` containers
(OPTIONAL) Database for filtering out host reads and spike-ins
You could filter unwanted sequences out upstream of this tool, but this will allow you to do it within Marker-MAGu using minimap2. The pipeline script will look for a file at filter_seqs/filter_seqs.fna which could be any fasta-formatted sequence file you want to use to remove matching reads (e.g. from host or spike-in).
Here are instructions for downloading and formatting the human genome and phiX spike-in (3 GB decompressed).
cd Marker-MAGu ### or `cd` to where you want the filter_seqs to reside
mkdir filter_seqs && cd filter_seqs
## download phiX genome and gunzip
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/Sinsheimervirus_phiX174/latest_assembly_versions/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz
gunzip GCF_000819615.1_ViralProj14015_genomic.fna.gz
## download human genome and gunzip
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
gunzip GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
## concatenate files
cat GCF_000819615.1_ViralProj14015_genomic.fna GCF_009914755.1_T2T-CHM13v2.0_genomic.fna > filter_seqs.fna
## set the MARKERMAGU_FILTER variable in your conda environment
conda env config vars set MARKERMAGU_FILTER=/path/to/filter_seqs
## optionally delete separate files
rm GCF_000819615.1_ViralProj14015_genomic.fna GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
Remember to set -f True to run the filtering step.
Running the tool
Linux-only for the conda environment
You might run this as part of a bash script, do your own upstream read processing, etc, but these are the basic instructions.
Required inputs:
-r reads file(s) (.fastq)
-s sample name (no space characters)
-o output directory (may be shared with other samples)
Activate the conda environment:
conda activate marker-magu
Individual samples can be run with the python script. E.g.:
usage: markermagu [-h] -r READS [READS ...] -s SAMPLE -o OUTPUT_DIR [--version] [-t CPU] [-q QUAL]
[-f FILTER_SEQS] [--filter_dir FILTER_DIR] [--temp TEMP_DIR] [--keep KEEP] [--db DB]
[--detection {default,relaxed}]
Marker-MAGu is a read mapping pipeline which uses marker genes to detect and measure bacteria, phages,
archaea, and microeukaryotes. Version 0.4.0
options:
-h, --help show this help message and exit
REQUIRED ARGUMENTS for Marker-MAGu :
-r READS [READS ...], --reads READS [READS ...]
read file(s) in .fastq format. You can specify more than one separated by a
space
-s SAMPLE, --sample SAMPLE
Sample name. No space characters, please.
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Output directory name. Will be created if it does not exist. Can be shared
with other samples. No space characters, please.
OPTIONAL ARGUMENTS for Marker-MAGu.:
--version show program's version number and exit
-t CPU, --cpu CPU Default: 48 -- Example: 32 -- Number of CPUs available for Marker-MAGu.
-q QUAL, --qual QUAL True or False. Remove low-quality reads with fastp?
-f FILTER_SEQS, --filter_seqs FILTER_SEQS
True or False. Remove reads aligning to sequences at
filter_seqs/filter_seqs.fna ?
--filter_dir FILTER_DIR
path to directory of sequences to filter. If not set, Marker-MAGu looks for
environmental variable MARKERMAGU_FILTER. Then, if this variable is unset, it
this is unset, DB path is assumed to be
/cmmr/apps/conda_users/u241374/envs/marker-magu/lib/python3.11/site-
packages/markermagu
--temp TEMP_DIR path of temporary directory. Default is {OUTPUT_DIR}/{SAMPLE}_temp/
--keep KEEP True of False. Keep the intermediate files, located in the temporary
directory? These can add up, so it is not recommended if space is a concern.
--db DB DB path. If not set, Marker-MAGu looks for environmental variable
MARKERMAGU_DB. Then, if this variable is unset, it this is unset, DB path is
assumed to be /cmmr/apps/conda_users/u241374/envs/marker-
magu/lib/python3.11/site-packages/markermagu
--detection {default,relaxed}
Stringency of SGB detection. "default" setting requires >=75 percent of marker
genes with at least 1 read mapped. "relaxed" setting requires >= 33.3 percent
of marker genes with at least 1 read mapped AND at least 3 marker genes
detected.
Combine Output tables in a Project Directory
The project directory should have multiple *.detected_species.tsv tables.
This command will generate the table myproject_MM1.combined_profile.tsv.
Notes for Users
Resource Usage
In my experience, the step which uses minimap2 to align reads to the marker gene database uses about 66GB of memory.
Output table
Marker-MAGu outputs a long form table, and the combined outputs (from combine_sample_tables1.R) are long form. This allows for a smaller file size and nimbler downstream analysis, in my opinion. Major bioinformatics coding languages can convert long form tables to wide form.
Please see the file Marker-MAGu_virus_DB_v1.1_metadata.tsv which is downloaded with the Marker-MAGu database. You can merge this table and any Marker-MAGu output table on the “lineage” column.
Source for virus taxonomy
Level
Source
Kingdom
‘Virus’ for all
Phylum
Genomad
Class
Genomad
Order
Genomad
Family
Genomad
Genus
Vcontact2 (arbitrary names)
Species
ANIclust (arbitrary names)
Other notes
Relative abundance values from each sample will add up to 1, so there is no “unknown fraction estimate”
Currently, Marker-MAGu only profiles phages with 4 or more unique marker genes, so small, ssDNA phages such as microviruses and inoviruses may be undetectable. I’m working on it.
Should any issues arise, please leave an issue in this GitHub Repo.
Adding your own virus hallmark genes to Marker-MAGu
A command line walkthrough for users wishing to add their own viruses to the Marker-MAGu DB. While Marker-MAGu itself is pretty simple, it can be a bit involved to make and format marker genes. But, hey, let’s do it.
Input
Multi-fasta file with genome sequences of viruses. If you have a very large file (thousands of genomes), the hmmscan and BLASTN steps may take a while.
In this walkthrough, we will use my_viruses1.fna, so we will set a variable with the stem of this file name:
MYSEQS="my_viruses1"
Package/tool versions used
It might be easiest to install these in a new conda environment. Tool versions might not matter too much.
### Parse genomes
**1) Get list of genomes with minimum number of hallmark genes**
This command will only keep genomes with 4 or more hallmark genes as is recommended for `Marker-MAGu`. See awk expression `if ($1>=4)`.
HMGS=( find indiv_genomes/ -type f -name "*__hmg.fna" | sed 's/\.fna//g' )
if [ -n "HMGS” ] ; then
for HM in HMGS;dobioawk−vgenq="{HM#indiv_genomes/}” -c fastx ‘{if (NR == 1) {print “>”genq ; print seq} else {printseq }}’ HM.fna>{HM}.concat.fna ;
done
else
echo “hallmark gene files for each genome NOT found”
fi
### Compare concatenated hallmark gene sequences against each other to dereplicate redundant sequences
NOTE: If all-vs-all BLASTN only returns self-alignments, the `anicalc.py` script will return an error. In this case, just use the `${MYSEQS}.marker_genomes_list.txt` file instead of the `${MYSEQS}.hallmark_genes.nucl.concat.exemplars1.txt` for step 3).
**1) BLASTN**
### Add marker genes to Marker-MAGu
**1) Format marker genes for `Marker-MAGu`**
This is kind of an unrefined piece of code as it does not assign meaningful taxonomical assignments at any level except species ("s\_\_"), which it merely assigns the sequence name. However, it does fulfill the requirements for `Marker-MAGu`, which is to format the header as:
*gene_name;hierarchical_taxonomy;genome_of_origin_name*
sed ‘s/[^_]*/@_&/ ; s/^@_//g'{MYSEQS}.hallmark_genes.nucl.exemplars1.fna | bioawk -c fastx ‘{ split(name, a, "_@") ; print ">"a[1]a[2]";k__Viruses|p__Unclassified_viruses|c__Unclassified_viruses|o__Unclassified_viruses|f__Unclassified_viruses|g__Unclassified_viruses|s__vOTU_"a[1]";"a[1] ; printseq}’ > ${MYSEQS}.hallmark_genes.nucl.exemplars1.fmt.fna
If you want to assign more complex taxonomy to your sequences based on some other tool, please make an Issue and I can upload some additional scripts to the repo to facilitate this.
**2) Concatenate your formatted marker genes to the existing `Marker-MAGu` database**
Marker-MAGu
Trans-Kingdom Marker Gene Pipeline for Taxonomic Profiling of Human Metagenomes
If you want to detect and quantify phages, bacteria, and archaea in whole genome shotgun reads from human-derived samples,
Marker-MAGuis the tool for you.Basically, this tool uses the strategy (marker gene detection) and database (marker gene sequences) from Metaphlan4, then:
The relative abundance of bacteria/archaea with be highly similar to
Metaphlan4(default settings) ifMarker-MAGuis run with--detection relaxed(33% of marker genes required for detection). WithMarker-MAGu --detection defaultoutput will be a bit less sensitive and a bit more specific, using a stricter threshold (75% of marker genes required for detection). Changes in settings and databases will likely change the outputAlso, as in
Metaphlan4SGBs, or Species-level Genome Bins are genomically-distinct species.Logo by Adrien Assie
Schematic
Installation
Conda steps
This will only work in Linux. If you only have access to another OS, you must use the Container steps (e.g. Docker, Singularity, Podman). See below.
conda. Tested withcondaversion 23.5.0conda create -n marker-magu -c bioconda marker-magucd Marker-MAGuorcdto where you want the database to residemkdir DBs && cd DBswget https://zenodo.org/records/8342581/files/Marker-MAGu_markerDB_v1.1.tar.gzmd5sum Marker-MAGu_markerDB_v1.1.tar.gzshould return
e0947cb1d4a3df09829e98627021e0ddtar -xvf Marker-MAGu_markerDB_v1.1.tar.gzrm Marker-MAGu_markerDB_v1.1.tar.gzMarker-MAGudatabase directory variable MARKERMAGU_DB, e.g.:conda env config vars set MARKERMAGU_DB=/path/to/DBs/v1.1Container (Docker, Singularity, Podman)
Basic Instructions
Please note that, while I WAS able to get this to run using
Docker/Docker Desktopon my Mac, I am not aDockerexpert, and I may be unable to troubleshoot issues.docker pull quay.io/biocontainers/marker-magu:0.4.0--pyhdfd78af_0Notes:
(OPTIONAL) Database for filtering out host reads and spike-ins
You could filter unwanted sequences out upstream of this tool, but this will allow you to do it within
Marker-MAGuusingminimap2. The pipeline script will look for a file atfilter_seqs/filter_seqs.fnawhich could be any fasta-formatted sequence file you want to use to remove matching reads (e.g. from host or spike-in).Here are instructions for downloading and formatting the human genome and phiX spike-in (3 GB decompressed).
Remember to set
-f Trueto run the filtering step.Running the tool
Linux-only for the conda environment
You might run this as part of a bash script, do your own upstream read processing, etc, but these are the basic instructions.
Required inputs:
-r reads file(s) (.fastq)-s sample name (no space characters)-o output directory (may be shared with other samples)Activate the conda environment:
conda activate marker-maguIndividual samples can be run with the python script. E.g.:
Basic run with 1 .fastq file:
Using multiple input .fastq files (
Marker-MAGudoesn’t used paired-end info)Using multiple input .fastq files with wildcard
With pre-filtering steps:
With relaxed detection
Help menu
Full help menu
Combine Output tables in a Project Directory
The project directory should have multiple
*.detected_species.tsvtables.Activate conda environment:
conda activate marker-maguThen, run Rscript with the project directory as the first and only argument:
This command will generate the table
myproject_MM1.combined_profile.tsv.Notes for Users
Resource Usage
In my experience, the step which uses
minimap2to align reads to the marker gene database uses about 66GB of memory.Output table
Marker-MAGuoutputs a long form table, and the combined outputs (fromcombine_sample_tables1.R) are long form. This allows for a smaller file size and nimbler downstream analysis, in my opinion. Major bioinformatics coding languages can convert long form tables to wide form.An example in
Rusing Tidyverse packages:Trove of Gut Virus Genomes
See details and download full genome database on Zenodo:
Link
Host predictions, virulence, etc. for phages
Please see the file
Marker-MAGu_virus_DB_v1.1_metadata.tsvwhich is downloaded with theMarker-MAGudatabase. You can merge this table and anyMarker-MAGuoutput table on the “lineage” column.Source for virus taxonomy
Other notes
Relative abundance values from each sample will add up to 1, so there is no “unknown fraction estimate”
Currently,
Marker-MAGuonly profiles phages with 4 or more unique marker genes, so small, ssDNA phages such as microviruses and inoviruses may be undetectable. I’m working on it.Should any issues arise, please leave an issue in this GitHub Repo.
Adding your own virus hallmark genes to Marker-MAGu
A command line walkthrough for users wishing to add their own viruses to the
Marker-MAGuDB. WhileMarker-MAGuitself is pretty simple, it can be a bit involved to make and format marker genes. But, hey, let’s do it.Input
Multi-fasta file with genome sequences of viruses. If you have a very large file (thousands of genomes), the hmmscan and BLASTN steps may take a while.
In this walkthrough, we will use
my_viruses1.fna, so we will set a variable with the stem of this file name:MYSEQS="my_viruses1"Package/tool versions used
It might be easiest to install these in a new conda environment. Tool versions might not matter too much.
HMMER database for virus hallmark gene identification
This is the HMMER database used by
Cenote-Taker 2for virus identification1) navigate to a new directory for processing data
2) download DB
wget https://zenodo.org/records/4966268/files/hmmscan_DBs.tgz3) unpack
tar -xvf hmmscan_DBs.tgzCall ORFs, Identify hallmark genes
1) call genes
make sure variable is set for your file
MYSEQS="my_viruses1"2) hmmscan virus virion and replication HMMER databases
then
3) combine tables
cut -f3 MYSEQS.allhmmscanfmt.out>{MYSEQS}.hallmark_gene_list.txt
seqkit grep -j 16 -f MYSEQS.hallmarkgenelist.txt{MYSEQS}.genes.fna > ${MYSEQS}.hallmark_genes.nucl.fna
seqkit grep -j 16 -f MYSEQS.hallmarkgenelist.txt{MYSEQS}.prot.faa > ${MYSEQS}.hallmark_genes.prot.faa
sed ‘s/[^_]*//′{MYSEQS}.hallmark_gene_list.txt | sort | uniq -c | awk ‘{if (1>=4) {print2}}’ > ${MYSEQS}.marker_genomes_list.txt
mkdir indiv_genomes
cat MYSEQS.markergenomeslist.txt∣xargs−n1−I−P16seqkitgrep−j1−−quiet−r−p""{MYSEQS}.hallmark_genes.nucl.fna -o indiv_genomes/{}_hmg.fna
HMGS=( find indiv_genomes/ -type f -name "*__hmg.fna" | sed 's/\.fna//g' ) if [ -n "HMGS” ] ; then for HM in HMGS;dobioawk−vgenq="{HM#indiv_genomes/}” -c fastx ‘{if (NR == 1) {print “>”genq ; print seq} else {printseq }}’ HM.fna>{HM}.concat.fna ; done else echo “hallmark gene files for each genome NOT found” fi
cat indiv_genomes/*_hmg.concat.fna > ${MYSEQS}.hallmark_genes.nucl.concat.fna
mkdir blast_DBs
makeblastdb -in MYSEQS.hallmarkgenes.nucl.concat.fna−dbtypenucl−outblastDBs/{MYSEQS}.hallmark_genes.nucl.concat
blastn -query MYSEQS.hallmarkgenes.nucl.concat.fna−dbblastDBs/{MYSEQS}.hallmark_genes.nucl.concat -outfmt ‘6 std qlen slen’ -max_target_seqs 10000 -perc_identity 90 -out ${MYSEQS}.hallmark_genes.nucl.concat.blastn.tsv -num_threads 16
python /path/to/Marker-MAGu/src/markermagu/utils/anicalc.py -i MYSEQS.hallmarkgenes.nucl.concat.blastn.tsv−o{MYSEQS}.hallmark_genes.nucl.concat.anicalc.tsv
python /path/to/Marker-MAGu/src/markermagu/utils/aniclust.py –fna MYSEQS.hallmarkgenes.nucl.concat.fna−−ani{MYSEQS}.hallmark_genes.nucl.concat.anicalc.tsv –out ${MYSEQS}.hallmark_genes.nucl.concat.aniclust.ANI95_TCOV85.tsv –min_ani 95 –min_tcov 85 –min_qcov 0
cut -f1 MYSEQS.hallmarkgenes.nucl.concat.aniclust.ANI95TCOV85.tsv>{MYSEQS}.hallmark_genes.nucl.concat.exemplars1.txt
bioawk -c fastx ‘{print name,seq}’ MYSEQS.hallmarkgenes.nucl.fna∣grep−F−f{MYSEQS}.hallmark_genes.nucl.concat.exemplars1.txt | awk ‘{OFS=FS=”\t”}{print “>” 1;print2}’ > ${MYSEQS}.hallmark_genes.nucl.exemplars1.fna
sed ‘s/[^_]*/@_&/ ; s/^@_//g'{MYSEQS}.hallmark_genes.nucl.exemplars1.fna | bioawk -c fastx ‘{ split(name, a, "_@") ; print ">"a[1]a[2]";k__Viruses|p__Unclassified_viruses|c__Unclassified_viruses|o__Unclassified_viruses|f__Unclassified_viruses|g__Unclassified_viruses|s__vOTU_"a[1]";"a[1] ; printseq}’ > ${MYSEQS}.hallmark_genes.nucl.exemplars1.fmt.fna
cd Marker-MAGu/DBs
mkdir v_${MYSEQS}
cat v1.1/Marker-MAGu_markerDB.fna /path/to/{MYSEQS}.hallmark_genes.nucl.exemplars1.fmt.fna > v_{MYSEQS}/Marker-MAGu_markerDB.fna