unitig-caller
Determines presence/absence of sequence elements in bacterial sequence data. Uses assemblies and/or reads as inputs.
unitig-caller wraps the Bifrost API and formats its output for use with pyseer. It also includes a separate implementation that calls sequences using an FM-index.
Call mode builds a Bifrost de Bruijn graph (DBG) and calls the colours for each unitig within it. Query mode queries the colours of existing unitigs within a new population.
Simple mode finds the presence of existing unitigs in a new population using an FM-index.
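Install
Use unitig-caller if installed through pip/conda, or python unitig_caller-runner.py if using a clone of the code.
With conda (recommended)
Get it from bioconda. If you haven’t set this up, first install miniconda, then add the correct channels.
From source
Requires cmake, pthreads, pybind11 and a C++17 compiler (e.g. gcc >=7.3), in addition to the conda requirements (see environment.yml).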
Usage
There are three ways to use this package:
1. Build a population graph to extract unitigs for GWAS with pyseer, like unitig-counter (--call).
2. Find existing unitigs in a new population using a graph (--query).
3. Find existing unitigs in a new population using an index (--simple).
For 1), run --call mode.
Both 2) and 3) give the same results using different index tools; both find unitigs so that pyseer models can be applied to a new population.
For 2), run --query mode, listing the new population's input FASTA file names in a text file (one file per line), with --unitigs from the original population.
For 3), run --simple mode, giving the new genomes as --refs and the --unitigs from the original population.
These modes are detailed below.
Generating an input file
To generate an input file for --refs or --reads, it is best to use ls to produce absolute file paths to assembly or read files.
For example:
ls -d -1 $PWD/*.fa > input.txt
This will generate a file input.txt containing the absolute file paths for all .fa files present in the current directory.
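The same list can also be written from Python, which avoids relying on shell globbing. This is a minimal sketch, assuming the assemblies end in .fa; the function name write_input_list is illustrative and not part of unitig-caller.

```python
from pathlib import Path


def write_input_list(directory, out_file="input.txt", suffix=".fa"):
    """Write absolute paths of files ending in `suffix`, one per line, no header."""
    # Collect and sort the matching files before writing, so the output
    # file itself can never end up in its own list.
    paths = sorted(
        p.resolve()
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
    Path(out_file).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling write_input_list(".") in the directory holding the assemblies writes input.txt in the form expected by --refs or --reads.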
Running Call mode
This uses Bifrost Build to generate a compact coloured de Bruijn graph, and returns the colours of the unitigs within it.
If no pre-built Bifrost graph exists
--refs and --reads are .txt files listing paths of input ASSEMBLIES and READS respectively (.fasta or .fastq), one per line, with no header row. Either or both can be specified.
NOTE: ensure reads and references are correctly assigned. Bifrost discards k-mers that occur only once in READS files, to remove sequencing errors.
--kmer sets the k-mer size used to build the graph. By default this is 31 bp.
If pre-built Bifrost graph exists
--graph is a pre-built Bifrost graph (.gfa), and --colours is its associated colours file.
For both call modes
--out is the prefix for output files.
Call mode automatically generates a .pyseer file containing the unitigs found within the graph and their colours. Rtab or pyseer formats can be specified with --rtab and --pyseer respectively.
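Running Query mode
Queries existing unitigs in a Bifrost graph. This is useful when identical unitig definitions need to be used between populations, for example when using pyseer’s prediction mode.
If no pre-built Bifrost graph exists
--refs and --reads are the same arguments as in --call. --kmer sets the k-mer size used to build the graph. By default this is 31 bp.
If pre-built Bifrost graph exists
--graph and --colours are the same arguments as in --call.
For both query modes
--unitigs is a .fasta file or a text file of unitig sequences (one sequence per line, with a header line).
--out is the prefix for output files.
Query mode automatically generates a .pyseer file containing the queried unitigs found within the graph and their colours. Rtab or pyseer formats can be specified with --rtab and --pyseer respectively.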
Running simple mode
This uses an FM-index (a compressed index built on suffix arrays) provided by SeqAn3 to perform string matches:
--refs is a required file listing input assemblies, the same as --refs in call mode.
--unitigs is a required list of the unitig sequences to call. The unitigs need to be in the first column (tab separated). A header row is assumed, so output from pyseer etc. can be used directly.
calls_pyseer.txt will contain unitig calls in seer/pyseer k-mer format.
By default FM-indexes are saved in the same location as the assembly files so that they can be quickly loaded by subsequent runs. To turn this off use --no-save-idx.
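The --unitigs layout described above (sequence in the first tab-separated column, header row first) can be checked before a run. The sketch below is illustrative only: read_unitigs is a hypothetical helper, not part of unitig-caller, and the column names in the example are assumed for demonstration.

```python
import csv
import io


def read_unitigs(handle):
    """Return unitig sequences from the first tab-separated column."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row, which the format assumes
    # Ignore blank lines and rows whose first column is empty.
    return [row[0] for row in reader if row and row[0]]


# Illustrative table with a header row and two unitigs.
example = io.StringIO(
    "variant\taf\tlrt-pvalue\n"
    "ACGTACGT\t0.2\t1e-5\n"
    "TTGACCA\t0.4\t3e-4\n"
)
print(read_unitigs(example))
```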
Option reference
usage: unitig-caller [-h] (--call | --query | --simple) [--refs REFS]
                     [--reads READS] [--graph GRAPH] [--colours COLOURS]
                     [--unitigs UNITIGS] [--pyseer] [--rtab] [--out OUT]
                     [--kmer KMER] [--write-graph]
                     [--no-save-idx] [--threads THREADS] [--version]

Call unitigs in a population dataset

optional arguments:
  -h, --help         show this help message and exit

Mode of operation:
  --call             Build a DBG and call colours of unitigs within
  --query            Query unitig colours in reference genomes/DBG
  --simple           Use FM-index to make calls

Unitig-caller input/output:
  --refs REFS        Ref file to used to build DBG or use with --simple
  --reads READS      Read file to used to build DBG
  --graph GRAPH      Existing graph in GFA format
  --colours COLOURS  Existing bifrost colours file in .bfg_colors format
  --unitigs UNITIGS  Text or fasta file of unitigs to query (--query or --simple)
  --pyseer           Output pyseer format
  --rtab             Output rtab format
  --out OUT          Prefix for output [default = 'unitig_caller']

Bifrost options:
  --kmer KMER        K-mer size for graph building/querying [default = 31]
  --write-graph      Output DBG built with unitig-caller

Simple mode options:
  --no-save-idx      Do not save FM-indexes for reuse

Other:
  --threads THREADS  Number of threads to use [default = 1]
  --version          show program's version number and exit
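When unitig-caller is driven from a script, the options above can be assembled programmatically. The following is a rough sketch using only flags listed in the usage text; build_command is a hypothetical helper, and its checks reflect the requirements stated in this document rather than the tool's own argument validation.

```python
import shlex


def build_command(mode, out="unitig_caller", refs=None, reads=None,
                  unitigs=None, kmer=None, threads=1):
    """Assemble a unitig-caller argument list from the documented options."""
    if mode not in ("call", "query", "simple"):
        raise ValueError("mode must be 'call', 'query' or 'simple'")
    # Query and simple modes look up existing unitigs, so they need --unitigs.
    if mode in ("query", "simple") and unitigs is None:
        raise ValueError(f"--{mode} needs --unitigs")
    # Simple mode also requires the list of assemblies.
    if mode == "simple" and refs is None:
        raise ValueError("--simple needs --refs")
    cmd = ["unitig-caller", f"--{mode}", "--out", out, "--threads", str(threads)]
    for flag, value in (("--refs", refs), ("--reads", reads),
                        ("--unitigs", unitigs), ("--kmer", kmer)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Print the command line for a query run (file names are placeholders).
print(shlex.join(build_command("query", refs="input.txt", unitigs="unitigs.txt")))
```

The returned list can be passed to subprocess.run(cmd, check=True) once unitig-caller is installed.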
Interpreting output files
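Pyseer format lists each unitig sequence followed by the file names of the genomes in which it is found. If a unitig is not found in any genome, it will have no associated file names.
Rtab format lists the unitig sequences along with a presence/absence matrix across the input files (1 present, 0 not).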
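To illustrate how pyseer-format calls relate to a presence/absence matrix, here is a small sketch. It assumes the seer/pyseer k-mer layout of "UNITIG | name1:1 name2:1"; check this against real output before relying on it. Both function names are hypothetical.

```python
def parse_pyseer_line(line):
    """Split 'UNITIG | name1:1 name2:1' into (unitig, [names]); assumed layout."""
    unitig, _, rest = line.rstrip("\n").partition("|")
    # Drop the trailing ':count' from each token to recover the file name.
    names = [tok.rsplit(":", 1)[0] for tok in rest.split()]
    return unitig.strip(), names


def to_presence_matrix(lines, samples):
    """Return {unitig: [1 or 0 per sample]}, ordered as in `samples`."""
    matrix = {}
    for line in lines:
        if not line.strip():
            continue
        unitig, names = parse_pyseer_line(line)
        present = set(names)
        matrix[unitig] = [int(s in present) for s in samples]
    return matrix


# The second unitig is found in no genome, so it has no file names.
calls = ["ACGTACGT | genomeA:1 genomeB:1", "TTGACCA |"]
print(to_presence_matrix(calls, ["genomeA", "genomeB", "genomeC"]))
```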
Citation
If you use this, please cite the Bifrost paper:
Holley G., Melsted P. Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs. bioRxiv 695338 (2019). doi: https://doi.org/10.1101/695338