RefSeq Masher
Find what NCBI RefSeq genomes match or are contained within your sequence data using Mash MinHash with a Mash sketch database of 54,925 NCBI RefSeq Genomes.
Installation
The easiest way to install refseq_masher and all its dependencies is with Conda through the Bioconda channel:
conda install -c bioconda refseq_masher
Alternatively, you can install refseq_masher from PyPI with pip install refseq_masher, but you will then need to install Mash v2.0+ manually.
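Because a pip install does not bring Mash with it, it can help to confirm that a suitable mash binary is on your PATH before running refseq_masher. The following is a minimal sketch, assuming that `mash --version` prints a dotted version string such as `2.0`. The helper names `parse_version` and `mash_is_usable` are hypothetical and are not part of refseq_masher.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def mash_is_usable(mash_bin="mash", minimum=(2, 0)):
    """Return True if mash_bin is on PATH and reports a version >= minimum."""
    path = shutil.which(mash_bin)
    if path is None:
        return False
    # Assumption: `mash --version` writes the version to stdout or stderr.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    version = parse_version(result.stdout)
    return version is not None and version >= minimum
```

Calling `mash_is_usable()` returns False when no mash binary is found, so it can be used as a quick pre-flight check after a pip-based install.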
Dependencies
Other than Python 3.5/3.6, the only external dependency of refseq_masher is Mash v2.0+.
Python dependencies
Pandas
NumPy
Click
pytest for running tests
Usage
If you run refseq_masher without any arguments, you should see the following usage info:
Usage: refseq_masher [OPTIONS] COMMAND [ARGS]...

  Find the closest matching NCBI RefSeq genomes or the genomes contained in
  your contigs or reads.

Options:
  --version      Show the version and exit.
  -v, --verbose  Logging verbosity (-v for logging warnings; -vvv for logging
                 debug info)
  -h, --help     Show this message and exit.

Commands:
  contains  Find the NCBI RefSeq genomes contained in...
  matches   Find NCBI RefSeq genome matches for an input...

refseq_masher has two commands:
matches: finds the closest NCBI RefSeq genome matches to your input sequences
contains: finds the RefSeq genomes contained within your input sequences, which is useful for checking what genomes may be present in a metagenomic sample
matches - find the closest matching NCBI RefSeq genomes to your input sequences
Usage: refseq_masher matches [OPTIONS] INPUT...

  Find NCBI RefSeq genome matches for an input genome fasta file

  Input is expected to be one or more FASTA/FASTQ files or one or more
  directories containing FASTA/FASTQ files. Files can be Gzipped.

Options:
  --mash-bin TEXT                 Mash binary path (default="mash")
  -o, --output PATH               Output file path (default="-"/stdout)
  --output-type [tab|csv]         Output file type (tab|csv)
  -n, --top-n-results INTEGER     Output top N results sorted by distance in
                                  ascending order (default=5)
  -m, --min-kmer-threshold INTEGER
                                  Mash sketch of reads: "Minimum copies of
                                  each k-mer required to pass noise filter for
                                  reads" (default=8)
  -h, --help                      Show this message and exit.

Example
Running refseq_masher matches on the FNA.GZ file for Salmonella enterica subsp. enterica serovar Enteritidis str. CHS44 returns that same genome as the top match, with a distance of 0.0 and 400/400 sketch hashes matching, which is what we expected. The results table also includes other taxonomic information that may be useful.
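The distance of 0.0 and the 400/400 matching value are two views of the same result. As a rough sketch, the matching value can be read as a Jaccard estimate j and converted with the Mash distance formula D = -(1/k) * ln(2j / (1 + j)). The function names below are hypothetical, and the k-mer size of the refseq_masher sketch database is not stated here, so `kmer_size=21` (the Mash default) is only a placeholder.

```python
import math


def matching_fraction(matching):
    """Convert a 'shared/total' string such as '400/400' to a float in [0, 1]."""
    shared, total = (int(part) for part in matching.split("/"))
    if total <= 0:
        raise ValueError("sketch size must be positive")
    return shared / total


def mash_distance(jaccard, kmer_size=21):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when nothing is shared."""
    if jaccard <= 0:
        return 1.0
    # kmer_size is an assumed placeholder, not the database's documented value.
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / kmer_size)


# 400 of 400 shared hashes gives j = 1, so the distance is zero.
print(mash_distance(matching_fraction("400/400")))
```

With every hash shared, j is 1 and ln(1) is 0, which is why a 400/400 match is reported with a distance of 0.0 regardless of the k-mer size.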
contains - find what NCBI RefSeq Genomes are contained in your input sequences
If you have a metagenomic sample or maybe a sample with some contamination, you may be interested in seeing what’s in your sample. You can do this with refseq_masher contains <INPUT>.
Usage: refseq_masher contains [OPTIONS] INPUT...

  Find the NCBI RefSeq genomes contained in your sequence files using Mash
  Screen

  Input is expected to be one or more FASTA/FASTQ files or one or more
  directories containing FASTA/FASTQ files. Files can be Gzipped.

Options:
  --mash-bin TEXT              Mash binary path (default="mash")
  -o, --output PATH            Output file path (default="-"/stdout)
  --output-type [tab|csv]      Output file type (tab|csv)
  -n, --top-n-results INTEGER  Output top N results sorted by identity in
                               ascending order (default=0/all)
  -i, --min-identity FLOAT     Mash screen min identity to report
                               (default=0.9)
  -v, --max-pvalue FLOAT       Mash screen max p-value to report
                               (default=0.01)
  -p, --parallelism INTEGER    Mash screen parallelism; number of threads to
                               spawn (default=1)
  -h, --help                   Show this message and exit.

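If you drive refseq_masher from a script, the options listed above can be assembled into an argument list for Python's subprocess module. This is a minimal sketch that uses only the long options shown in the help text; the function name `build_contains_command` and the input file names in the usage note are hypothetical.

```python
import shlex


def build_contains_command(inputs, min_identity=0.9, max_pvalue=0.01,
                           parallelism=1, top_n=0, output=None,
                           output_type="tab", mash_bin="mash"):
    """Assemble an argv list for `refseq_masher contains`."""
    if not inputs:
        raise ValueError("at least one input file or directory is required")
    if output_type not in ("tab", "csv"):
        raise ValueError("output_type must be 'tab' or 'csv'")
    command = [
        "refseq_masher", "contains",
        "--mash-bin", mash_bin,
        "--output-type", output_type,
        "--top-n-results", str(top_n),
        "--min-identity", str(min_identity),
        "--max-pvalue", str(max_pvalue),
        "--parallelism", str(parallelism),
    ]
    # Leave --output unset to keep the documented default of stdout.
    if output is not None:
        command.extend(["--output", output])
    command.extend(inputs)
    return command


def format_command(command):
    """Render the argv list as a shell-safe string for logging."""
    return " ".join(shlex.quote(arg) for arg in command)
```

For example, `subprocess.run(build_contains_command(["reads_1.fastq.gz", "reads_2.fastq.gz"], parallelism=4), check=True)` would run the screen with four threads, assuming refseq_masher is installed and those placeholder files exist.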
Example - metagenomic sample SAMEA1877339
For this example, we’re going to see what RefSeq genomes are contained within sample SAMEA1877340 from BioProject PRJEB1775.
Description from BioProject PRJEB1775:
Design, Setting and Patients: Forty-five samples were selected from a set of fecal specimens obtained from patients with diarrhea during the 2011 outbreak of STEC O104:H4 in Germany. Samples were chosen to represent STEC-positive patients with a range of clinical conditions and colony counts together with a small number of patients with other infections (Campylobacter jejuni, Clostridium difficile and Salmonella enterica). Samples were subjected to high-throughput sequencing on the Illumina MiSeq and HiSeq 2500, followed by bioinformatics analysis.
We download the FASTQ files for ERR260489 and run refseq_masher contains against them.
Some of the top genomes contained in this sample, sorted by identity and median multiplicity, are:
Bacteroides fragilis - fully contained (400/400) with a high median multiplicity of 768
Escherichia coli O104:H4 - fully contained (400/400) and median multiplicity of 48
Kingella kingae - fully contained (400/400) and median multiplicity of 5
Klebsiella pneumoniae - 399/400 sketches contained with median multiplicity of 4
So with Mash we are able to find that the sample contained the expected genomic data (especially E. coli O104:H4).
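The ordering of the list above can be reproduced with a two-key sort. The sketch below uses only the four hits quoted above and ranks them by the fraction of shared hashes, used here as a stand-in for Mash screen identity, and then by median multiplicity. The `ScreenHit` tuple and its field names are hypothetical and do not reflect the actual refseq_masher output columns.

```python
from collections import namedtuple

# Hypothetical record type; field names are illustrative only.
ScreenHit = namedtuple(
    "ScreenHit", ["name", "shared", "sketch_size", "median_multiplicity"]
)

hits = [
    ScreenHit("Klebsiella pneumoniae", 399, 400, 4),
    ScreenHit("Escherichia coli O104:H4", 400, 400, 48),
    ScreenHit("Kingella kingae", 400, 400, 5),
    ScreenHit("Bacteroides fragilis", 400, 400, 768),
]


def rank_hits(hits):
    """Sort by shared-hash fraction, then median multiplicity, both descending."""
    return sorted(
        hits,
        key=lambda h: (h.shared / h.sketch_size, h.median_multiplicity),
        reverse=True,
    )


ranked = rank_hits(hits)
for hit in ranked:
    print("{}\t{}/{}\t{}".format(
        hit.name, hit.shared, hit.sketch_size, hit.median_multiplicity))
```

Sorting on multiplicity as a secondary key puts abundant, fully contained genomes first, which is why E. coli O104:H4 ranks above the low-multiplicity hits.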
Legal
Copyright Government of Canada 2017
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Contact
Gary van Domselaar: gary.vandomselaar@phac-aspc.gc.ca