HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data
Introduction
haystac is a light-weight, fast, and user-friendly species identification tool. It evaluates the presence of a
particular species of interest in a metagenomic sample, and provides statistical support for the species assignment.
The method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set
of sequencing reads and a database of reference genomes. It works equally well with both modern and ancient DNA sequence
data.
Setup
haystac can be run on either macOS or Linux based systems.
The recommended way to install haystac, and all if its dependencies, is via the mamba
package manager (a fast replacement for conda).
You can install haystac using conda, however, it will take significantly longer to install and analyses will run slower.
Install mamba
If you do not have either mamba or conda already installed, please refer to the install instructions for mambaforge.
If you have conda installed, but not mamba, then install mamba into the base environment:
conda install -n base -c conda-forge mamba
Install haystac
Then use mamba to install haystac into a new environment:
We recommend that you install haystac into a new environment to avoid dependency conflicts with other software.
Quick Start
haystac consists of three main modules:
database for building a database of reference genomes
sample for pre-processing of samples prior to analysis
analyse for analysing a sample against a database
1. Build a database
To begin using haystac we firstly need to construct a database containing all species of interest to our study. In our
preprint, we show that haystac makes robust species
identifications with genus specific databases (for prokaryotes), allowing for very fast hypothesis driven analyses.
In this example, we will build a database containing all species in the Yersinia genus, by supplying haystac with a
simple NBCI search query.
To construct an NCBI search query for your area of interest, visit the NCBI Nucleotide database and use the search feature to obtain a correctly formatted query string from
the “Search details” box. This search query can be used directly with haystac to automatically download and build
a reference database based on the accession codes present in the resultset returned by the query.
For more exhaustive analyses, you can build a database containing the 5,681 species present in the RefSeq
representative database of prokaryotic species by running:
Note: Building a database this big is not recommended on a laptop computer.
2. Prepare a sample for analysis
The second step in using haystac is to prepare a sample for analysis.
In this example, we will download an aDNA library from Rasmussen et al. (2015),
by giving haystac the SRA accession code ERR1018966.
Most published genomics papers include a BioProject code (e.g. PRJEB10885)), from which you can obtain SRA accessions for each sequencing
library.
To prepare a sample of your own, you will need either single-end or paired-end short read sequencing data in
fastq format.
For a paired-end library, you specify the location of the fastq files and the name of
the output directory. You may also choose to collapse overlapping mate pairs (e.g. for an aDNA library).
When the analysis is complete, there will be several new sub-folders in the output directory yersinia_ERR1018966/. To
determine if sample ERR1018966 contains Yersinia pestis (i.e. the plague) we can consult the spreadsheet containing
the mean posterior abundance estimates for all species in the Yersinia database (i.e.,
yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv). From this, we can see that 3,266
reads were uniquely assigned to Yersinia pestis, with an overall abundance of 0.047%, and that the chi-squared test
indicates that the reads are spread evenly across the genome.
Before we can confidently conclude that ERR1018966 contains ancient Yersinia pestis, we may want to perform a
damage pattern analysis.
haystac has many features and potential uses, and we encourage you to use module help menus (e.g. haystac database --help)
to explore these options. The full user documentation is available here: https://haystac.readthedocs.io/en/master/
Reporting errors
haystac is under active development and we encourage you to report any issues you encounter via the GitHub issue
tracker.
Citation
A preprint describing haystac is available on bioRxiv:
Dimopoulos, E.A.*, Carmagnini, A.*, Velsko, I.M., Warinner, C., Larson, G., Frantz, L.A.F., Irving-Pease, E.K., 2020.
HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data.
bioRxiv 2020.12.16.419085. https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1
HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data
Introduction
haystacis a light-weight, fast, and user-friendly species identification tool. It evaluates the presence of a particular species of interest in a metagenomic sample, and provides statistical support for the species assignment. The method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set of sequencing reads and a database of reference genomes. It works equally well with both modern and ancient DNA sequence data.Setup
haystaccan be run on either macOS or Linux based systems.The recommended way to install
haystac, and all if its dependencies, is via the mamba package manager (a fast replacement for conda).You can install
haystacusingconda, however, it will take significantly longer to install and analyses will run slower.Install mamba
If you do not have either
mambaorcondaalready installed, please refer to the install instructions formambaforge.If you have
condainstalled, but notmamba, then installmambainto the base environment:Install haystac
Then use
mambato installhaystacinto a new environment:And activate the environment:
We recommend that you install
haystacinto a new environment to avoid dependency conflicts with other software.Quick Start
haystacconsists of three main modules:databasefor building a database of reference genomessamplefor pre-processing of samples prior to analysisanalysefor analysing a sample against a database1. Build a database
To begin using
haystacwe firstly need to construct a database containing all species of interest to our study. In our preprint, we show thathaystacmakes robust species identifications with genus specific databases (for prokaryotes), allowing for very fast hypothesis driven analyses.In this example, we will build a database containing all species in the Yersinia genus, by supplying
haystacwith a simple NBCI search query.To construct an NCBI search query for your area of interest, visit the NCBI Nucleotide database and use the search feature to obtain a correctly formatted query string from the “Search details” box. This search query can be used directly with
haystacto automatically download and build a reference database based on the accession codes present in the resultset returned by the query.For more exhaustive analyses, you can build a database containing the 5,681 species present in the RefSeq representative database of prokaryotic species by running:
Note: Building a database this big is not recommended on a laptop computer.
2. Prepare a sample for analysis
The second step in using
haystacis to prepare a sample for analysis.In this example, we will download an aDNA library from Rasmussen et al. (2015), by giving
haystacthe SRA accession code ERR1018966. Most published genomics papers include a BioProject code (e.g. PRJEB10885)), from which you can obtain SRA accessions for each sequencing library.To prepare a sample of your own, you will need either single-end or paired-end short read sequencing data in
fastqformat.For a paired-end library, you specify the location of the
fastqfiles and the name of the output directory. You may also choose to collapse overlapping mate pairs (e.g. for an aDNA library).By default,
haystacwill scan the supplied library, identify adapter sequences, and automatically remove them.3. Analyse a sample against a database
The third step in using
haystacis to perform an analysis of a sample against a database.Here, we will use
haystacto calculate the mean posterior abundance of all species in the Yersinia genus found within the sampleERR1018966.When the analysis is complete, there will be several new sub-folders in the output directory
yersinia_ERR1018966/. To determine if sampleERR1018966contains Yersinia pestis (i.e. the plague) we can consult the spreadsheet containing the mean posterior abundance estimates for all species in the Yersinia database (i.e.,yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv). From this, we can see that 3,266 reads were uniquely assigned to Yersinia pestis, with an overall abundance of 0.047%, and that the chi-squared test indicates that the reads are spread evenly across the genome.Before we can confidently conclude that
ERR1018966contains ancient Yersinia pestis, we may want to perform a damage pattern analysis.User documentation
haystachas many features and potential uses, and we encourage you to use module help menus (e.g.haystac database --help) to explore these options. The full user documentation is available here: https://haystac.readthedocs.io/en/master/Reporting errors
haystacis under active development and we encourage you to report any issues you encounter via the GitHub issue tracker.Citation
A preprint describing
haystacis available on bioRxiv:License
MIT