ReferenceSeeker determines closely related reference genomes following a scalable hierarchical approach that combines a fast kmer profile-based database lookup of candidate reference genomes with the subsequent computation of specific average nucleotide identity (ANI) values for the rapid determination of suitable reference genomes.
ReferenceSeeker computes kmer-based genome distances between a query genome and potential reference genome candidates via Mash (Ondov et al. 2016). For the resulting candidates, ReferenceSeeker subsequently computes (bidirectional) ANI values and, by default, picks genomes meeting community standard thresholds (ANI >= 95 % & conserved DNA >= 69 %) (Goris, Konstantinidis et al. 2007). Hits are ranked by the product of ANI and conserved DNA values to take into account both genome coverage and identity.
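The filter and ranking step described above can be sketched in a few lines of shell. This is only an illustration of the idea, not ReferenceSeeker's actual implementation: the first two rows are copied from the test output in this README, the third row (`HYPOTHETICAL_1`) is invented to show a candidate that fails the conserved DNA cutoff, and the function name `rank_candidates` is an assumption. Values are given in percent, as in the tool's output.

```shell
# Columns (tab-separated): ID, Mash distance, ANI [%], conserved DNA [%].
# Rows 1-2 are copied from the test output in this README;
# row 3 is a made-up (hypothetical) candidate with too little conserved DNA.
candidates=$(printf '%s\t%s\t%s\t%s\n' \
  GCF_900205275.1 0.01522 98.61 83.13 \
  GCF_000439415.1 0.00003 100.00 99.55 \
  HYPOTHETICAL_1 0.02000 97.50 60.00)

tab=$(printf '\t')

# Keep rows with ANI >= 95 % and conserved DNA >= 69 %, append the product
# of ANI and conserved DNA (both as fractions), and sort best score first.
rank_candidates() {
  awk -F '\t' -v min_ani=95 -v min_con=69 '
    $3 >= min_ani && $4 >= min_con {
      printf "%s\t%s\t%s\t%.4f\n", $1, $3, $4, ($3 / 100) * ($4 / 100)
    }' | LC_ALL=C sort -t "$tab" -k4,4nr
}

ranked=$(printf '%s\n' "$candidates" | rank_candidates)
printf '%s\n' "$ranked"
```

Because the score multiplies identity by coverage, a genome with high ANI over only a small shared fraction of the sequence ranks below one that is slightly less identical but conserved across most of the genome.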
Custom databases can be built with local genomes. For further convenience, we provide pre-built databases with sequences from RefSeq (https://www.ncbi.nlm.nih.gov/refseq), GTDB and PLSDB comprising the following taxa:
bacteria
archaea
fungi
protozoa
viruses
as well as plasmids.
The reasoning for subsequently calculating both ANI and conserved DNA values is that Mash distance values correlate well with ANI values for closely related genomes; however, the same is not true for conserved DNA values. A kmer fingerprint-based comparison alone cannot distinguish whether a kmer is missing due to a SNP, for instance, or due to the absence of the kmer-comprising subsequence. As DNA conservation (next to DNA identity) is very important for many kinds of analyses, e.g. reference-based SNP detection, ranking potential reference genomes by Mash distance alone is often not sufficient to select the most appropriate reference genomes. If desired, ANI and conserved DNA values can be computed bidirectionally.
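A small worked example may make this concrete. It uses the two hits from the test output in this README and the common approximation that ANI is roughly 100 × (1 − Mash distance); that approximation and the function name `estimate_ani` are assumptions made for illustration and are not something ReferenceSeeker reports.

```shell
# ID, Mash distance, ANI [%], conserved DNA [%]; rows copied from the test output.
hits=$(printf '%s\t%s\t%s\t%s\n' \
  GCF_000439415.1 0.00003 100.00 99.55 \
  GCF_900205275.1 0.01522 98.61 83.13)

# Estimate ANI as 100 * (1 - Mash distance) and print it next to the
# ANI and conserved DNA values that were actually computed.
estimate_ani() {
  awk -F '\t' '{ printf "%s\t%.2f\t%s\t%s\n", $1, 100 * (1 - $2), $3, $4 }'
}

estimates=$(printf '%s\n' "$hits" | estimate_ani)
printf 'ID\tANI (Mash estimate)\tANI\tCon. DNA\n%s\n' "$estimates"
```

The Mash-based estimates (100.00 and 98.48) stay close to the computed ANI values (100.00 and 98.61), but nothing in the Mash distance hints at the gap in conserved DNA between the two hits (99.55 versus 83.13).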
Input & Output
Input
Path to a taxon database and a draft or finished genome in (zipped) fasta format:
$ referenceseeker ~/bacteria GCF_000013425.1.fna
Output
Tab separated lines to STDOUT comprising the following columns:
ReferenceSeeker can be installed via Conda and Git(Hub). In either case, a taxon database must be downloaded which we provide for download at Zenodo:
For more information have a look at Databases.
BioConda
The preferred way to install and run ReferenceSeeker is Conda using the Bioconda channel:
To test your installation we prepared a tiny mock database comprising 4 Salmonella spp genomes and a query assembly (SRA: SRR498276) in the tests directory:
#ID	Mash Distance	ANI	Con. DNA	Taxonomy ID	Assembly Status	Organism
GCF_000439415.1	0.00003	100.00	99.55	1173427	complete	Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189
GCF_900205275.1	0.01522	98.61	83.13	90370	complete	Salmonella enterica subsp. enterica serovar Typhi
Usage
Usage:
usage: referenceseeker [--crg CRG] [--ani ANI] [--conserved-dna CONSERVED_DNA]
                       [--unfiltered] [--bidirectional] [--help] [--version]
                       [--verbose] [--threads THREADS]
                       <database> <genome>

Rapid determination of appropriate reference genomes.

positional arguments:
  <database>            ReferenceSeeker database path
  <genome>              target draft genome in fasta format

Filter options / thresholds:
  These options control the filtering and alignment workflow.

  --crg CRG, -r CRG     Max number of candidate reference genomes to pass kmer
                        prefilter (default = 100)
  --ani ANI, -a ANI     ANI threshold (default = 0.95)
  --conserved-dna CONSERVED_DNA, -c CONSERVED_DNA
                        Conserved DNA threshold (default = 0.69)
  --unfiltered, -u      Set kmer prefilter to extremely conservative values
                        and skip species level ANI cutoffs (ANI >= 0.95 and
                        conserved DNA >= 0.69)
  --bidirectional, -b   Compute bidirectional ANI/conserved DNA values
                        (default = False)

Runtime & auxiliary options:
  --help, -h            Show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Print verbose information
  --threads THREADS, -t THREADS
                        Number of used threads (default = number of available
                        CPU cores)
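As a sketch of how these options combine, the following wrapper tightens both thresholds, requests bidirectional values, and limits the number of candidates and threads. Only flags listed in the usage text above are used; the helper names `run` and `strict_search`, the `DRY_RUN` switch, and the chosen values are assumptions for illustration. Note that the thresholds are passed as fractions (for example 0.97), while the result table reports percentages.

```shell
# Print the command by default; set DRY_RUN=0 to execute it instead
# (this requires referenceseeker on the PATH and a downloaded database).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Stricter cutoffs than the defaults (0.95 / 0.69), bidirectional values,
# at most 200 Mash candidates and 4 threads.
# $1 = database path, $2 = query genome in fasta format.
strict_search() {
  run referenceseeker --crg 200 --ani 0.97 --conserved-dna 0.80 \
    --bidirectional --threads 4 "$1" "$2"
}

# Database path and genome file taken from the Input example in this README.
strict_search "$HOME/bacteria" GCF_000013425.1.fna
```

Keeping the command behind a dry-run switch makes it easy to check the assembled options before starting a long search against a large database.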
If the above-mentioned RefSeq-based databases do not contain sufficiently closely related genomes or are just too large, ReferenceSeeker provides auxiliary commands to either create databases from scratch or expand existing ones. For this purpose, a second executable, referenceseeker_db, accepts init and import subcommands:
Usage:
usage: referenceseeker_db [--help] [--version] {init,import} ...

Rapid determination of appropriate reference genomes.

positional arguments:
  {init,import}  sub-command help
    init         Initialize a new database
    import       Add a new genome to database

Runtime & auxiliary options:
  --help, -h     Show this help message and exit
  --version, -V  show program's version number and exit
If a new database should be created, use referenceseeker_db init:
usage: referenceseeker_db init [-h] [--output OUTPUT] --db DB

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        output directory (default = current working directory)
  --db DB, -d DB        Name of the new ReferenceSeeker database
This new database or an existing one can be used to import genomes in Fasta, GenBank or EMBL format:
usage: referenceseeker_db import [-h] --db DB --genome GENOME [--id ID]
                                 [--taxonomy TAXONOMY]
                                 [--status {complete,chromosome,scaffold,contig}]
                                 [--organism ORGANISM]

optional arguments:
  -h, --help            show this help message and exit
  --db DB, -d DB        ReferenceSeeker database path
  --genome GENOME, -g GENOME
                        Genome path [Fasta, GenBank, EMBL]
  --id ID, -i ID        Unique genome identifier (default sequence id of first
                        record)
  --taxonomy TAXONOMY, -t TAXONOMY
                        Taxonomy ID (default = 12908 [unclassified sequences])
  --status {complete,chromosome,scaffold,contig}, -s {complete,chromosome,scaffold,contig}
                        Assembly level (default = contig)
  --organism ORGANISM, -o ORGANISM
                        Organism name (default = "NA")
Example:
If ReferenceSeeker is properly installed, clone this repository and change into its parent directory.
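A minimal sketch of the init and import workflow, using only the flags from the usage texts above. The directory, database name, genome file and identifier (`databases`, `my_db`, `my_genome.fna`, `my_isolate_01`) are hypothetical placeholders, and the sketch assumes that init creates the database as a directory named after --db inside --output, so that the import step can address it as `<output>/<db>`. The helper names `run` and `build_custom_db` and the `DRY_RUN` switch are likewise assumptions.

```shell
# Print each command by default; set DRY_RUN=0 to execute it instead
# (this requires referenceseeker_db on the PATH).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 = output directory, $2 = database name, $3 = genome file.
build_custom_db() {
  db_dir=$1
  db_name=$2
  genome=$3
  # Step 1: create an empty database.
  run referenceseeker_db init --output "$db_dir" --db "$db_name"
  # Step 2: add one genome; the ID is a hypothetical placeholder and the
  # assembly level is set explicitly (contig is also the documented default).
  run referenceseeker_db import --db "$db_dir/$db_name" --genome "$genome" \
    --id my_isolate_01 --status contig
}

build_custom_db databases my_db my_genome.fna
```

The optional --taxonomy and --organism flags can be added to the import call in the same way; without them the documented defaults (12908 and "NA") apply.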
ReferenceSeeker has been tested against aforementioned versions.
Citation
Schwengers et al., (2020). ReferenceSeeker: rapid determination of appropriate reference genomes. Journal of Open Source Software, 5(46), 1994, https://doi.org/10.21105/joss.01994
Feedback
We highly welcome and appreciate feedback of all kinds!
So, if you run into any issues with ReferenceSeeker, we’d be happy to hear about it! Please start ReferenceSeeker with -v (verbose) and do not hesitate to file an issue here on GitHub including as much of the following as possible:
a detailed description of the issue
the ReferenceSeeker command line output
a reproducible example of the issue with a small dataset that you can share (helps us identify whether the issue is specific to a particular computer, operating system, and/or dataset).
The maintenance of ReferenceSeeker is supported by de.NBI. If you would like to provide (non-technical) feedback, please find a service monitoring survey here.
ReferenceSeeker: rapid determination of appropriate reference genomes
Input & Output
Input
Path to a taxon database and a draft or finished genome in (zipped) fasta format:
Output
Tab separated lines to STDOUT comprising the following columns:
Unidirectionally (query -> references):
Bidirectionally (query -> references [QR] & references -> query [RQ]):
Installation
ReferenceSeeker can be installed via Conda and Git(Hub). In either case, a taxon database must be downloaded which we provide for download at Zenodo:
For more information have a look at Databases.
BioConda
The preferred way to install and run ReferenceSeeker is Conda using the Bioconda channel:
GitHub
Alternatively, you can use this raw GitHub repository:
Test
To test your installation we prepared a tiny mock database comprising 4 Salmonella spp genomes and a query assembly (SRA: SRR498276) in the tests directory:
Expected output:
Examples
Installation:
Simple:
Expert: verbose output and increased output of candidate reference genomes using a defined number of threads:
Databases
ReferenceSeeker depends on databases comprising taxonomic genome information as well as kmer hash profiles for each entry.
Pre-built
We provide pre-built databases based on public genome data hosted at Zenodo:
RefSeq
release: 221 (2024-01-09)
GTDB
release: v214 (2024-01-11)
Plasmids
In addition to the genome based databases, we provide the following plasmid databases based on RefSeq and PLSDB:
Custom database
If the above-mentioned RefSeq-based databases do not contain sufficiently closely related genomes or are just too large, ReferenceSeeker provides auxiliary commands to either create databases from scratch or expand existing ones. For this purpose, a second executable, referenceseeker_db, accepts init and import subcommands:
Usage:
If a new database should be created, use referenceseeker_db init:
This new database or an existing one can be used to import genomes in Fasta, GenBank or EMBL format:
Example:
If ReferenceSeeker is properly installed, clone this repository and change into its parent directory.
Dependencies
ReferenceSeeker needs the following dependencies:
ReferenceSeeker has been tested against aforementioned versions.