## NCBI
wget -qN --show-progress https://zenodo.org/records/15231351/files/database.tar.gz
tar -zxvf database.tar.gz
## GTDB
# wget -qN --show-progress https://zenodo.org/records/15231379/files/database.tar.gz
# tar -zxvf database.tar.gz
Index the files:
## if you encounter memory issue please consider manually lowering cpu_count or simply set cpu_count=1
cpu_count=$(python -c 'import os; print(os.cpu_count())')
diamond makedb --in database/prot.fa --db database/prot --quiet
ls database/nucl.*.fa | sort | xargs -P $cpu_count -I {} bash -c '
filename=${1%.fa*};
filename=${filename##*/};
minimap2 -x map-ont -d database/$filename.mmi ${1} 2> /dev/null;
echo "Indexed <database/$filename.fa>.";' - {}
## remove unnecessary files to save space
rm -rf database/*.fa
Run Melon
[!NOTE]
Melon takes quality-controlled long reads as input. We suggest removing low-quality raw reads before running Melon with e.g., nanoq -q 10 -l 1000 (min. quality score 10; min. read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or other eukaryotes/viruses and you want to estimate the mean genome size of prokaryotes, please consider removing them via proper mapping, or enabling the simple pre-filtering module. See Run Melon with pre-filtering of non-prokaryotic reads for more details.
We provide an example file comprising 10,000 quality-controlled (processed with Porechop and nanoq) prokaryotic reads (fungal and other reads removed with minimap2), randomly selected from the R10.3 mock sample of Loman Lab Mock Community Experiments.
INFO: Estimating genome copies ...
INFO: ... found 27.375 copies of genomes (bacteria: 27.375; archaea: 0).
INFO: Assigning taxonomy ...
INFO: Reassigning taxonomy ...
INFO: ... found 8 unique species (bacteria: 8; archaea: 0).
INFO: Done.
The output file *.tsv contains the estimated genome copies for individual species, their corresponding relative abundances and gap-compressed/gap-uncompressed ANI (average nucleotide identity between marker-gene-containing reads and reference genome clusters) values:
Run Melon with pre-filtering of non-prokaryotic reads
To enable the pre-filtering module, you need to download a database of Kraken2 that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
## https://benlangmead.github.io/aws-indexes/k2
mkdir database_kraken
wget -q --show-progress https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08gb_20230605.tar.gz -O database_kraken/db.tar.gz
tar -zxvf database_kraken/db.tar.gz -C database_kraken
## remove temporary files to save space
rm -rf database_kraken/db.tar.gz
Run with argument -k/-db-kraken:
melon *.fa -d database -o . -k database_kraken
Miscellaneous
Statistics of influent/effluent samples with/without pre-filter
Both samples were collected from the Shatin wastewater treatment plant, sequenced with ONT SQK-NBD114 & R10.4.1 on PromethION, basecalled with Guppy v6.5.7.
no pre-filter
PlusPF-8
PlusPF-16
PlusPF
influent
read: 1,091,772
bp: 7,815,976,467
n50: 9,500
mean quality: 18.3
genome copy
1,809
1,801
1,800
1,797
species richness
2,101
2,098
2,099
2,097
number of filtered reads
-
8,312
10,212
14,988
mean genome size
4.322
4.304
4.300
4.292
real time (sec)
2,056
2,271
2,279
-
peak resident set size (GB)
10.629
10.665
17.294
-
effluent
read: 1,118,502
bp: 5,157,996,045
n50: 6,577
mean quality: 18.2
genome copy
1,348
1,336
1,331
1,315
species richness
1,704
1,697
1,700
1,696
number of filtered reads
-
16,602
29,300
54,774
mean genome size
3.826
3.789
3.757
3.715
real time (sec)
1,341
1,496
1,495
-
peak resident set size (GB)
10.893
10.902
17.067
-
Tested on MacBook Pro 2021, Apple M1 Max, 64 GB memory, macOS Sonoma. Melon v0.1.0, NCBI database (ver. RefSeq R219) and Kraken database (ver. 2023-06-05). Mean genome size is in unit of Mb. Real time and peak resident set size are measured with time.
Citation
Chen, X., Yin, X., Shi, X., Yan, W., Yang, Y., Liu, L., & Zhang, T. (2024). Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes. Genome Biology, 25(1), 226. https://doi.org/10.1186/s13059-024-03363-y
Melon
Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes
Quick start
Installation
Create a new conda environment and install:
Database setup
Download either the NCBI or the GTDB database:
Index the files:
Run Melon
We provide an example file comprising 10,000 quality-controlled (processed with
Porechopandnanoq) prokaryotic reads (fungal and other reads removed withminimap2), randomly selected from the R10.3 mock sample of Loman Lab Mock Community Experiments.You should see:
The output file
*.tsvcontains the estimated genome copies for individual species, their corresponding relative abundances and gap-compressed/gap-uncompressed ANI (average nucleotide identity between marker-gene-containing reads and reference genome clusters) values:The output file
*.jsoncontains the lineage and remark of each processed read.Run Melon with pre-filtering of non-prokaryotic reads
To enable the pre-filtering module, you need to download a database of Kraken2 that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
Run with argument
-k/-db-kraken:Miscellaneous
Statistics of influent/effluent samples with/without pre-filter
Both samples were collected from the Shatin wastewater treatment plant, sequenced with ONT SQK-NBD114 & R10.4.1 on PromethION, basecalled with Guppy v6.5.7.
Tested on MacBook Pro 2021, Apple M1 Max, 64 GB memory, macOS Sonoma. Melon v0.1.0, NCBI database (ver. RefSeq R219) and Kraken database (ver. 2023-06-05). Mean genome size is in unit of Mb. Real time and peak resident set size are measured with
time.Citation
Chen, X., Yin, X., Shi, X., Yan, W., Yang, Y., Liu, L., & Zhang, T. (2024). Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes. Genome Biology, 25(1), 226. https://doi.org/10.1186/s13059-024-03363-y