Argo: species-resolved profiling of antibiotic resistance genes in complex metagenomes through long-read overlapping
Introduction
Argo is a long-read-based profiler developed for environmental surveillance of antibiotic resistance genes (ARGs) with species-level resolution. It uses minimap2’s base-level alignment with GTDB to obtain raw species assignments and consolidates these assignments on a read cluster basis (determined through decomposing the read-overlap graph) by solving a set cover problem. Argo takes quality-controlled long reads (either Nanopore or PacBio) as input and returns a table listing predicted ARGs (types and subtypes), their potential hosts, and estimated abundances, expressed as ARG copies per genome (cpg), which is equivalent to ARG copies per cell (cpc), assuming each cell contains a single genome.
Argo uses SARG+ as its default database, which augments experimentally validated sequences with RefSeq sequences that share annotation evidence. However, it also accepts customized databases for compatibility with NDARO and CARD. See https://github.com/xinehc/argo-supplementary for more details.
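As a rough illustration of the consolidation step described above, the sketch below applies a generic greedy set-cover heuristic to a single read cluster: each candidate species "covers" the reads that aligned to it, and species are picked until every read is explained. The read IDs, the toy assignments, the `greedy_set_cover` helper, and the greedy strategy itself are assumptions made for illustration; Argo's actual formulation and solver may differ.

```python
def greedy_set_cover(reads, candidates):
    """Pick a small set of species whose aligned reads cover `reads`.

    reads: set of read IDs belonging to one read cluster.
    candidates: dict mapping species name -> set of read IDs aligned to it.
    Returns (chosen species in pick order, reads left unexplained).
    """
    uncovered = set(reads)
    chosen = []
    while uncovered:
        # Take the species explaining the most still-uncovered reads.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(candidates), key=lambda s: len(candidates[s] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # the remaining reads have no candidate species
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical raw assignments for one cluster of four reads.
cluster_reads = {"r1", "r2", "r3", "r4"}
raw_assignments = {
    "Escherichia coli": {"r1", "r2", "r3"},
    "Shigella flexneri": {"r2", "r3"},
    "Klebsiella pneumoniae": {"r4"},
}
species, unexplained = greedy_set_cover(cluster_reads, raw_assignments)
print(species, unexplained)
```

In this toy cluster, Shigella flexneri is dropped because every read it explains is already explained by Escherichia coli, which is the kind of redundancy a set-cover formulation is meant to remove.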
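Quick Start
Installation
Create a new conda environment and install Argo in it.
Database setup
Download the database from Zenodo:
wget -qN --show-progress https://zenodo.org/records/15356208/files/database.tar.gz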
tar -xvf database.tar.gz
Index the files:
## If you run into memory issues, lower cpu_count manually or simply set cpu_count=1
cpu_count=$(python -c 'import os; print(os.cpu_count())')
diamond makedb --in database/prot.fa --db database/prot --quiet
diamond makedb --in database/sarg.fa --db database/sarg --quiet
ls database/*.*.fa | sort | xargs -P $cpu_count -I {} bash -c '
filename=${1%.fa*};
filename=${filename##*/};
minimap2 -x map-ont -d database/$filename.mmi ${1} 2> /dev/null;
echo "Indexed <database/$filename.fa>.";' - {}
## remove temporary files to save space
rm -rf database/*.fa
Run Argo
Note: By default, Argo classifies every read that carries at least one ARG into its “most likely” lineage, resolving ties by the estimated genome copies of the species present. Because plasmid reads can have multiple hosts in a sample (e.g., NZ_OW968330.1), interpret host assignments with caution. The --plasmid option splits ARGs by carrier (chromosome or plasmid), but chimeric reads and uncharacterized plasmids may interfere with the identification.
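To make the tie-breaking rule concrete, the hypothetical helper below chooses, among equally supported candidate lineages, the species with the highest estimated genome copies. The function name, the species, and the copy numbers are invented for illustration, and the sketch does not reproduce how Argo scores lineages before a tie arises.

```python
def resolve_tie(candidates, genome_copies):
    """Return the tied candidate with the highest estimated genome copies.

    candidates: species names that are equally likely hosts for a read.
    genome_copies: dict mapping species name -> estimated genome copies.
    Species missing from genome_copies are treated as having 0 copies.
    """
    return max(sorted(candidates), key=lambda s: genome_copies.get(s, 0.0))


# Hypothetical genome copy estimates for two species in one sample.
genome_copies = {"Escherichia coli": 12.4, "Klebsiella pneumoniae": 3.1}
host = resolve_tie(["Klebsiella pneumoniae", "Escherichia coli"], genome_copies)
print(host)  # Escherichia coli
```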
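We provide an example file of 10,000 quality-controlled prokaryotic reads, randomly selected from the R10.3 mock sample of the Loman Lab Mock Community Experiments. The reads were processed with Porechop and nanoq, and fungal and other reads were removed with minimap2. Running Argo on this file (Argo v0.2.0 and SARG+ ver. 2025-02-11) produces two output files: example.sarg.tsv lists ARG abundance estimates (cpg) by species, and example.sarg.json contains detailed annotation information for ARG-containing reads.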
Run Argo with fixed identity cutoffs
By default, Argo infers the median sequence divergence of a sample from its read overlaps and derives an adaptive identity cutoff for identifying ARGs, calculated as 90 - 2.5 * 100 * median sequence divergence. This keeps samples generated by different platforms or kits comparable despite their differences in read quality. To disable the adaptive cutoff and use fixed values instead (for instance, an 80% identity cutoff and 80% subject cover), you can run:
argo *.fa -d database -o . --plasmid -i 80 -s 80
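The adaptive cutoff formula above can be worked through numerically. The short sketch below assumes the median sequence divergence is expressed as a fraction (for example 0.02 for 2%), which is what the factor of 100 in the formula suggests; the function name is hypothetical and not part of Argo's interface.

```python
def adaptive_identity_cutoff(median_divergence):
    """Identity cutoff in percent: 90 - 2.5 * 100 * median sequence divergence.

    median_divergence is assumed to be a fraction (e.g. 0.02 for 2%).
    """
    return 90 - 2.5 * 100 * median_divergence


for divergence in (0.01, 0.02, 0.04):
    cutoff = adaptive_identity_cutoff(divergence)
    print(f"divergence={divergence:.2f} -> identity cutoff={cutoff:.1f}%")
```

Under this reading, a sample with 1% median divergence gets a cutoff of 87.5%, while a noisier sample with 4% divergence gets 80%, so lower-quality reads are held to a proportionally looser identity threshold.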
A complete list of arguments and their default values is shown below:
Usage: argo -d DIR -o DIR [-t INT] [--plasmid] [--skip-melon] [--skip-clean] [-m INT] [-e FLOAT] [-i FLOAT] [-s FLOAT] [-n INT] [-p FLOAT] [-z FLOAT] [-u INT] [-b INT] [-x FLOAT] [-y FLOAT] file [file ...]

Argo: species-resolved profiling of antibiotic resistance genes in complex metagenomes through long-read overlapping

Positional Arguments:
  file                Input fasta <*.fa|*.fasta> or fastq <*.fq|*.fastq> file, gzip optional <*.gz>.

Required Arguments:
  -d, --db DIR        Unzipped database folder, should contain <prot.fa|sarg.fa>, <nucl.*.fa|sarg.*.fa> and metadata files. (default: None)
  -o, --output DIR    Output folder. (default: None)

Optional Arguments:
  -t, --threads INT   Number of threads. [10] (default: 10)
  --plasmid           List ARGs carried by plasmids. (default: False)
  --skip-melon        Skip Melon for genome copy estimation. (default: False)
  --skip-clean        Skip cleaning, keep all temporary <*.tmp> files. (default: False)

Additional Arguments - Filtering:
  -m INT              Max. number of target sequences to report (--max-target-seqs/-k in diamond). (default: 25)
  -e FLOAT            Max. expected value to report alignments (--evalue/-e in diamond). (default: 1e-05)
  -i FLOAT            Min. identity in percentage to report alignments. If "0" then set 90 - 2.5 * 100 * median sequence divergence. (default: 0)
  -s FLOAT            Min. subject cover within a read cluster to report alignments. (default: 90)
  -n INT              Max. number of secondary alignments to report (-N in minimap2). (default: 2147483647)
  -p FLOAT            Min. secondary-to-primary score ratio to report secondary alignments (-p in minimap2). (default: 0.9)
  -z FLOAT            Min. estimated genome copies of a species to report its ARG copies and abundances. (default: 1)
  -u INT              Max. number of ARG-containing reads per chunk for overlapping. If "0" then use a single chunk. (default: 0)

Additional Arguments - Graph Clustering:
  -b INT              Terminal condition - max. iterations. (default: 1000)
  -x FLOAT            MCL parameter - inflation. (default: 2)
  -y FLOAT            MCL parameter - expansion. (default: 2)
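For readers unfamiliar with the graph clustering options, the following is a minimal pure-Python sketch of the standard Markov Cluster (MCL) procedure. It assumes that -b, -x and -y correspond to the `max_iter`, `inflation` and `expansion` arguments below, which is inferred from the help text rather than from Argo's source code. The sketch supports integer expansion only, and the toy overlap graph is invented.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, expansion=2, inflation=2.0, max_iter=1000, tol=1e-9):
    """Cluster an undirected graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn the graph into a column-stochastic matrix.
    matrix = normalise_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        previous = matrix
        # Expansion: integer matrix power, spreading flow along longer walks.
        for _step in range(expansion - 1):
            matrix = matmul(matrix, previous)
        # Inflation: element-wise power, then renormalise each column.
        matrix = normalise_columns([[v ** inflation for v in row] for row in matrix])
        change = max(abs(a - b) for ra, rb in zip(matrix, previous)
                     for a, b in zip(ra, rb))
        if change < tol:
            break  # converged before reaching max_iter
    # Rows that keep flow are attractors; their non-zero columns form a cluster.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in matrix}
    clusters.discard(frozenset())
    return clusters


# Toy overlap graph: reads 0-2 overlap each other, reads 3-4 overlap each other.
overlaps = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = mcl(overlaps)
print(sorted(sorted(c) for c in clusters))
```

In MCL generally, a higher inflation value tends to produce more, smaller clusters, while expansion controls how far flow spreads between inflation steps.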
FAQ
Does Argo work with isolates?
Yes, Argo can provide rough estimates of ARG abundances (cpg) for isolates. However, the computational time may be longer for pathogenic species (e.g., Escherichia coli, Salmonella enterica), which typically carry many ARG copies in their genomes and are highly redundant in GTDB.
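As a small worked example of the cpg unit for an isolate, the sketch below divides estimated ARG copies by estimated genome copies. The function name, the subtype labels, and all numbers are hypothetical, and using the isolate's own genome copies as the denominator is an assumption based on the definition of cpg in the Introduction.

```python
def copies_per_genome(arg_copies, genome_copies):
    """ARG abundance in copies per genome (cpg)."""
    if genome_copies <= 0:
        raise ValueError("genome_copies must be positive")
    return arg_copies / genome_copies


# Hypothetical isolate sequenced to roughly 30 genome copies.
genome_copies = 30.0
arg_copies = {"subtype_A": 30.0, "subtype_B": 60.0}
abundances = {name: copies_per_genome(c, genome_copies) for name, c in arg_copies.items()}
print(abundances)  # {'subtype_A': 1.0, 'subtype_B': 2.0}
```

Here a value of 2.0 cpg would suggest about two copies of that ARG per genome, for example one chromosomal copy plus one plasmid-borne copy.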
Does Argo work with assembled contigs?
No, Argo is inherently read-based and does not work with contigs. You may consider using diamond blastx/p directly with SARG+/NDARO/CARD for ARG annotation.
Why is Argo running slowly for certain samples?
The computational time increases not only with the size of the sample but also with the number of ARG-containing reads and the redundancy of the database. If your sample contains a large proportion of Escherichia coli (see above), the computational time is likely to be longer than usual.
Citation
Chen, X., Yin, X., Xu, X., & Zhang, T. (2025). Species-resolved profiling of antibiotic resistance genes in complex metagenomes through long-read overlapping with Argo. Nature Communications, 16(1), 1744. https://doi.org/10.1038/s41467-025-57088-y