Longshot is a variant calling tool for diploid genomes using long error prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT). It takes as input an aligned BAM/CRAM file and outputs a phased VCF file with variants and haplotype information. It can also genotype and phase input VCF files. It can output haplotype-separated BAM files that can be used for downstream analysis. Currently, it only calls single nucleotide variants (SNVs), but it can genotype indels if they are given in an input VCF.
Longshot has been tested using Ubuntu 16.04 and 18.04, CentOS 6.6, Manjaro Linux 17.1.11, and Mac OS 10.14.2 Mojave.
It should work on any linux-based system that has Rust and Cargo installed.
dependencies
rust >= 1.40.0
zlib >= 1.2.11
xz >= 5.2.3
clangdev >= 7.0.1
gcc >= 7.3.0
libc-dev
make
various rust dependencies (automatically managed by cargo)
(older versions may work but have not been tested)
installation
installation using Bioconda
It is recommended to install Longshot using Bioconda:
conda install longshot
This method supports Linux and Mac.
If you do not have Bioconda, you can install it with these steps:
First, install Miniconda (or Anaconda). Miniconda can be installed using the
scripts here.
The Bioconda channel can then be added using these commands:
Installation should take around 4 minutes on a typical desktop machine and will use between 400 MB (counting cargo) and 1.2 GB (counting all dependencies) of disk space.
It is recommended to add the line export PATH=$PATH:/home/$USER/.cargo/bin to the end of your ~/.bashrc file so that the longshot binary is in the PATH for future shell sessions.
usage:
After installation, execute the longshot binary as so:
Execution should take around 5 to 10 seconds on a typical desktop machine. The output can be compared to ground_truth_variants.vcf for accuracy.
command line options
$ longshot --help
Longshot: variant caller (SNVs) for long-read sequencing data
USAGE:
longshot [FLAGS] [OPTIONS] --bam <BAM/CRAM> --ref <FASTA> --out <VCF>
FLAGS:
-A, --auto_max_cov Automatically calculate mean coverage for region and set max coverage to mean_coverage +
5*sqrt(mean_coverage). (SLOWER)
-S, --stable_alignment Use numerically-stable (logspace) pair HMM forward algorithm. Is significantly slower but
may be more accurate. Tests have shown this not to be necessary for highly error prone
reads (PacBio CLR).
-F, --force_overwrite If output files (VCF or variant debug directory) exist, delete and overwrite them.
-x, --max_alignment Use max scoring alignment algorithm rather than pair HMM forward algorithm.
-n, --no_haps Don't call HapCUT2 to phase variants.
--output-ref print reference genotypes (non-variant), use this option only in combination with -v
option.
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-b, --bam <BAM> sorted, indexed BAM file with error-prone reads (CRAM files also supported)
-f, --ref <FASTA> indexed FASTA reference that BAM file is aligned to
-o, --out <VCF> output VCF file with called variants.
-r, --region <string> Region in format <chrom> or <chrom:start-stop> in which to call variants
(1-based, inclusive).
-v, --potential_variants <VCF> Genotype and phase the variants in this VCF instead of using pileup
method to find variants. NOTES: VCF must be gzipped and tabix indexed or
contain contig information. Use with caution because excessive false
potential variants can lead to inaccurate results. Every variant is used
and only the allele fields are considered -- Genotypes, filters,
qualities etc are ignored. Indel variants will be genotyped but not
phased. Structural variants (length > 50 bp) are currently not supported.
-O, --out_bam <BAM> Write new bam file with haplotype tags (HP:i:1 and HP:i:2) for reads
assigned to each haplotype, any existing HP and PS tags are removed
-c, --min_cov <int> Minimum coverage (of reads passing filters) to consider position as a
potential SNV. [default: 6]
-C, --max_cov <int> Maximum coverage (of reads passing filters) to consider position as a
potential SNV. [default: 8000]
-q, --min_mapq <int> Minimum mapping quality to use a read. [default: 20]
-a, --min_allele_qual <float> Minimum estimated quality (Phred-scaled) of allele observation on read to
use for genotyping/haplotyping. [default: 7.0]
-y, --hap_assignment_qual <float> Minimum quality (Phred-scaled) of read->haplotype assignment (for read
separation). [default: 20.0]
-Q, --potential_snv_cutoff <float> Consider a site as a potential SNV if the original PHRED-scaled QUAL
score for 0/0 genotype is below this amount (a larger value considers
more potential SNV sites). [default: 20.0]
-e, --min_alt_count <int> Require a potential SNV to have at least this many alternate allele
observations. [default: 3]
-E, --min_alt_frac <float> Require a potential SNV to have at least this fraction of alternate
allele observations. [default: 0.125]
-L, --hap_converge_delta <float> Terminate the haplotype/genotype iteration when the relative change in
log-likelihood falls below this amount. Setting a larger value results in
faster termination but potentially less accurate results. [default:
0.0001]
-l, --anchor_length <int> Length of indel-free anchor sequence on the left and right side of read
realignment window. [default: 6]
-m, --max_snvs <int> Cut off variant clusters after this many variants. 2^m haplotypes must be
aligned against per read for a variant cluster of size m. [default: 3]
-W, --max_window <int> Maximum "padding" bases on either side of variant realignment window
[default: 50]
-I, --max_cigar_indel <int> Throw away a read-variant during allelotyping if there is a CIGAR indel
(I/D/N) longer than this amount in its window. [default: 20]
-B, --band_width <Band width> Minimum width of alignment band. Band will increase in size if sequences
are different lengths. [default: 20]
-D, --density_params <string> Parameters to flag a variant as part of a "dense cluster". Format
<n>:<l>:<gq>. If there are at least n variants within l base pairs with
genotype quality >=gq, then these variants are flagged as "dn" [default:
10:500:50]
-s, --sample_id <string> Specify a sample ID to write to the output VCF [default: SAMPLE]
--hom_snv_rate <float> Specify the homozygous SNV Rate for genotype prior estimation [default:
0.0005]
--het_snv_rate <float> Specify the heterozygous SNV Rate for genotype prior estimation [default:
0.001]
--ts_tv_ratio <float> Specify the transition/transversion rate for genotype grior estimation
[default: 0.5]
-P, --strand_bias_pvalue_cutoff <float> Remove a variant if the allele observations are biased toward one strand
(forward or reverse) according to Fisher's exact test. Use this cutoff
for the two-tailed P-value. [default: 0.01]
-d, --variant_debug_dir <path> write out current information about variants at each step of algorithm to
files in this directory
If a read has an assigned haplotype, it will get a tag HP:i:1 or HP:i:2 and tag PS:i:x where x is a phase set number of the variants it covers.
important considerations
It is highly recommended to use reads with at least 30x coverage.
It is recommended to process chromosomes separately using the --region option.
Longshot has only been tested using data from humans. Results may vary with organisms with significantly higher or lower SNV rate.
It is important to set a reasonable max read coverage cutoff (-C option) to filter out sites coinciding with genomic features such as CNVs which can be problematic for variant calling. If the -A option is used, Longshot will estimate the mean read coverage and set the max coverage to mean_cov+5*sqrt(mean_cov), which we have found to be a reasonable filter in practice for humans.
CNVs and mapping issues can result in dense clusters of false positive SNVs. Longshot will attempt to find clusters like this and mark them as “dn” in the FILTER field. The --density_params option is used to control which variants are flagged as “dn”. The default parameters have been found to be effective for human sequencing data, but this option may need to be tweaked for other organisms with SNV rates significantly different from human.
Oxford Nanopore Technology (ONT) SMS reads are now officially supported. It is recommended to use the default --strand_bias_pvalue_cutoff of 0.01 for ONT reads, since this option filters out false SNV sites prior to variant calling.
installation troubleshooting
older version of Rust
Check that the Rust version is 1.30.0 or higher:
rustc --version
If not, update Rust using this command:
rustup update
linker errors
For example:
error: linking with `cc` failed: exit code: 1
...
...
...
= note: Non-UTF-8 output: /usr/bin/ld: /home/pedge/temp/longshot/target/release/build/longshot-347f3774e75b380c/out/libhapcut2.a(common.o)(.text.fprintf_time+0x81): unresolvable H\x89\$\xe8H\x89l$\xf0H\x89\xf3L\x89d$\xf8H\x83\xec\x18H\x8bG\x10H\x89\xfdI\x89\xd4H\x89\xd6H\x8b;\xffPxH\x8bE\x10I\x8dt$\x08H\x8b{\x08\xffPxH\x8bE\x10H\x8b{\x10I\x8dt$\x10H\x8b\x1c$H\x8bl$\x08L\x8bd$\x10H\x8b@xH\x83\xc4\x18\xff\xe0f\x90H\x89\$\xe8H\x89l$\xf0H\x89\xfbL\x89d$\xf8H\x83\xec\x18H\x8bG\x10I\x89\xd4H\x89\xf5H\x89\xf7\xffPhI\x89\x04$H\x8bC\x10H\x8d}\x08\xffPhH\x8b\x1c$I\x89D$\x08H\x8bl$\x08L\x8bd$\x10H\x83\xc4\x18\xc3\x0f\x1f relocation against symbol `time@@GLIBC_2.2.5```
Your system may have multiple versions of your linker that are causing a conflict. Rustc may be calling to a different or old version of the linker. In this case, specify the linker (in linux, gcc) as follows:
rustc -vV
Note the build target after "host: ", i.e. "x86_64-unknown-linux-gnu".
\n/usr/bin/ld: BFD version 2.20.51.0.2-5.42.el6 20100205 internal error, aborting at reloc.c line 443 in bfd_get_reloc_size\n\n/usr/bin/ld: Please report this bug.\n\ncollect2: ld returned 1 exit status\n
...
...
...
Your system may have multiple versions of your linker that are causing a conflict. Rustc may be calling to a different or old version of the linker. In this case, specify the linker (in linux, gcc) as follows:
rustc -vV
Note the build target after “host: “, i.e. “x86_64-unknown-linux-gnu”.
longshot
Longshot is a variant calling tool for diploid genomes using long error prone reads such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore Technologies (ONT). It takes as input an aligned BAM/CRAM file and outputs a phased VCF file with variants and haplotype information. It can also genotype and phase input VCF files. It can output haplotype-separated BAM files that can be used for downstream analysis. Currently, it only calls single nucleotide variants (SNVs), but it can genotype indels if they are given in an input VCF.
citation
If you use Longshot, please cite the publication:
Edge, P. and Bansal, V., 2019. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nature communications, 10(1), pp.1-10.
supported operating systems
Longshot has been tested using Ubuntu 16.04 and 18.04, CentOS 6.6, Manjaro Linux 17.1.11, and Mac OS 10.14.2 Mojave. It should work on any linux-based system that has Rust and Cargo installed.
dependencies
(older versions may work but have not been tested)
installation
installation using Bioconda
It is recommended to install Longshot using Bioconda:
This method supports Linux and Mac. If you do not have Bioconda, you can install it with these steps: First, install Miniconda (or Anaconda). Miniconda can be installed using the scripts here.
The Bioconda channel can then be added using these commands:
manual installation using apt for dependencies (Ubuntu 18.04)
If you are using Ubuntu 18.04, you can install the dependencies using apt. Then, the Rust cargo package manager is used to compile Longshot.
Installation should take around 4 minutes on a typical desktop machine and will use between 400 MB (counting cargo) and 1.2 GB (counting all dependencies) of disk space. It is recommended to add the line
export PATH=$PATH:/home/$USER/.cargo/binto the end of your~/.bashrcfile so that the longshot binary is in the PATH for future shell sessions.usage:
After installation, execute the longshot binary as so:
execution on an example dataset
The directory
example_datacontains a simulated toy dataset that can be used to test out Longshot:example_data/genome.fa)example_data/pacbio_reads_30x.bam)example_data/ground_truth_variants.vcf)Run Longshot on the example data as so:
Execution should take around 5 to 10 seconds on a typical desktop machine. The output can be compared to
ground_truth_variants.vcffor accuracy.command line options
usage examples
Call variants with default parameters:
Call variants for chromosome 1 only using the automatic max coverage cutoff:
Call variants in a 500 kb region and then output the reads into
reads.bamusing a haplotype assignment threshold of 30:If a read has an assigned haplotype, it will get a tag
HP:i:1orHP:i:2and tagPS:i:xwherexis a phase set number of the variants it covers.important considerations
--regionoption.-Coption) to filter out sites coinciding with genomic features such as CNVs which can be problematic for variant calling. If the-Aoption is used, Longshot will estimate the mean read coverage and set the max coverage tomean_cov+5*sqrt(mean_cov), which we have found to be a reasonable filter in practice for humans.--density_paramsoption is used to control which variants are flagged as “dn”. The default parameters have been found to be effective for human sequencing data, but this option may need to be tweaked for other organisms with SNV rates significantly different from human.--strand_bias_pvalue_cutoffof 0.01 for ONT reads, since this option filters out false SNV sites prior to variant calling.installation troubleshooting
older version of Rust
Check that the Rust version is 1.30.0 or higher:
If not, update Rust using this command:
linker errors
For example:
rustc -vV
mkdir .cargo nano .cargo/config
[target.] linker = “</path/to/linker>”
[target.x86_64-unknown-linux-gnu] linker = “/opt/gnu/gcc/bin/gcc”
cargo clean cargo build –release
Your system may have multiple versions of your linker that are causing a conflict. Rustc may be calling to a different or old version of the linker. In this case, specify the linker (in linux, gcc) as follows:
Note the build target after “host: “, i.e. “x86_64-unknown-linux-gnu”.
edit the config file to have these contents:
for example,
then,