HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

Introduction

haystac is a light-weight, fast, and user-friendly species identification tool. It evaluates the presence of a particular species of interest in a metagenomic sample, and provides statistical support for the species assignment. The method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set of sequencing reads and a database of reference genomes. It works equally well with both modern and ancient DNA sequence data.

Setup

haystac can be run on either macOS or Linux based systems.

The recommended way to install haystac, and all if its dependencies, is via the mamba package manager (a fast replacement for conda).

You can install haystac using conda, however, it will take significantly longer to install and analyses will run slower.

Install mamba

If you do not have either mamba or conda already installed, please refer to the install instructions for mambaforge.

If you have conda installed, but not mamba, then install mamba into the base environment:

conda install -n base -c conda-forge mamba

Install haystac

Then use mamba to install haystac into a new environment:

mamba create -c conda-forge -c bioconda -n haystac haystac

And activate the environment:

conda activate haystac

We recommend that you install haystac into a new environment to avoid dependency conflicts with other software.

Quick Start

haystac consists of three main modules:

database for building a database of reference genomes
sample for pre-processing of samples prior to analysis
analyse for analysing a sample against a database

1. Build a database

To begin using haystac we firstly need to construct a database containing all species of interest to our study. In our preprint, we show that haystac makes robust species identifications with genus specific databases (for prokaryotes), allowing for very fast hypothesis driven analyses.

In this example, we will build a database containing all species in the Yersinia genus, by supplying haystac with a simple NBCI search query.

haystac database \
    --mode build \
    --query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
    --output yersinia_db

To construct an NCBI search query for your area of interest, visit the NCBI Nucleotide database and use the search feature to obtain a correctly formatted query string from the “Search details” box. This search query can be used directly with haystac to automatically download and build a reference database based on the accession codes present in the resultset returned by the query.

For more exhaustive analyses, you can build a database containing the 5,681 species present in the RefSeq representative database of prokaryotic species by running:

haystac database \
    --mode build \
    --refseq-rep prokaryote_rep \
    --output refseq_db

Note: Building a database this big is not recommended on a laptop computer.

2. Prepare a sample for analysis

The second step in using haystac is to prepare a sample for analysis.

In this example, we will download an aDNA library from Rasmussen et al. (2015), by giving haystac the SRA accession code ERR1018966. Most published genomics papers include a BioProject code (e.g. PRJEB10885)), from which you can obtain SRA accessions for each sequencing library.

haystac sample \
    --sra ERR1018966 \
    --output ERR1018966

To prepare a sample of your own, you will need either single-end or paired-end short read sequencing data in fastq format.

For a paired-end library, you specify the location of the fastq files and the name of the output directory. You may also choose to collapse overlapping mate pairs (e.g. for an aDNA library).

haystac sample \
    --fastq-r1 /path/to/sample1_R1.fq.gz \
    --fastq-r2 /path/to/sample1_R2.fq.gz \
    --collapse True \
    --output sample1

By default, haystac will scan the supplied library, identify adapter sequences, and automatically remove them.

3. Analyse a sample against a database

The third step in using haystac is to perform an analysis of a sample against a database.

Here, we will use haystac to calculate the mean posterior abundance of all species in the Yersinia genus found within the sample ERR1018966.

haystac analyse \
    --mode abundances \
    --database yersinia_db\
    --sample ERR1018966 \
    --output yersinia_ERR1018966

When the analysis is complete, there will be several new sub-folders in the output directory yersinia_ERR1018966/. To determine if sample ERR1018966 contains Yersinia pestis (i.e. the plague) we can consult the spreadsheet containing the mean posterior abundance estimates for all species in the Yersinia database (i.e., yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv). From this, we can see that 3,266 reads were uniquely assigned to Yersinia pestis, with an overall abundance of 0.047%, and that the chi-squared test indicates that the reads are spread evenly across the genome.

Before we can confidently conclude that ERR1018966 contains ancient Yersinia pestis, we may want to perform a damage pattern analysis.

haystac analyse \
    --mode abundances \
    --database yersinia_db\
    --sample ERR1018966 \
    --output yersinia_ERR1018966 \
    --mapdamage True

User documentation

haystac has many features and potential uses, and we encourage you to use module help menus (e.g. haystac database --help) to explore these options. The full user documentation is available here: https://haystac.readthedocs.io/en/master/

Reporting errors

haystac is under active development and we encourage you to report any issues you encounter via the GitHub issue tracker.

Citation

A preprint describing haystac is available on bioRxiv:

Dimopoulos, E.A.*, Carmagnini, A.*, Velsko, I.M., Warinner, C., Larson, G., Frantz, L.A.F., Irving-Pease, E.K., 2020. HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. bioRxiv 2020.12.16.419085. https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1

License

MIT