A genome assembly correction and scaffolding pipeline using long reads, consisting of up to three steps:
Tigmint cuts the draft assembly at potentially misassembled regions
ntLink is then used to scaffold the corrected assembly
followed by ARKS for further scaffolding (optional extra step of scaffolding)
Credits
LongStitch was developed and designed by Lauren Coombe, Janet Li, Theodora Lo and Rene Warren.
Citing LongStitch
If you use LongStitch in your research, please cite:
Coombe L, Li JX, Lo T, Wong J, Nikolic V, Warren RL and Birol I. LongStitch: high-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics 22, 534 (2021). https://doi.org/10.1186/s12859-021-04451-7
For example, to run the default pipeline on a draft assembly draft-assembly.fa with the reads reads.fa.gz and a genome size of gsize:
longstitch run draft=draft-assembly reads=reads G=gsize
Note that specifying G is required when span=auto for Tigmint-long, and that all input sequences files should be in single-line fasta/fastq format.
The output scaffolds can be found in soft-links with the suffix longstitch-scaffolds.fa
LongStitch demo
To test your LongStitch installation and see examples of how to run the pipeline, see tests/run_longstitch_demo.sh
To run the demo script, ensure all dependencies are in your PATH, and run the bash script:
cd tests
./run_longstitch_demo.sh
Full help page
To run the LongStitch pipeline, you can use the Makefile driver script longstitch.
Usage: ./longstitch [COMMAND] [OPTION=VALUE]…
Commands:
run run default LongStitch pipeline: Tigmint, then ntLink
tigmint-ntLink-arks run full LongStitch pipeline: Tigmint, ntLink, then ARCS in kmer mode
tigmint-ntLink run Tigmint, then ntLink (Same as 'run' target)
ntLink-arks run ntLink, then run ARCS in kmer mode
General options (required):
draft draft name [draft]. File must have .fa extension
reads read name [reads]. The reads file can be uncompressed or gzipped.
Accepted read file extensions: .fq, .fq.gz, .fastq, .fastq.gz, .fa, .fa.gz, .fasta, .fasta.gz
General options (optional):
t number of threads [8]
z minimum size of contig (bp) to scaffold [1000]
out_prefix if supplied, final scaffolds will be soft-linked to <out_prefix>.scaffolds.fa
Tigmint options:
span min number of spanning molecules to be considered correctly assembled [auto]
dist maximum distance between alignments to be considered the same molecule [auto]
G haploid genome size (bp) for calculating span parameter (e.g. '3e9' for human genome). Required when span=auto [0]
longmap long read technology - used for minimap2 preset. 'ont' for nanopore, 'pb' for pacbio, 'hifi' for pacbio HiFi reads [ont]
ntLink options:
k_ntLink k-mer size for minimizers [32]
w window size for minimizers [100]
gap_fill use gap-filling feature [False]
rounds number of ntLink rounds [1]
ARCS+LINKS options:
j minimum fraction of read kmers matching a contigId (used in kmer mode) [0.05]
k_arks size of a k-mer (used in kmer mode) [20]
c minimum aligned read pairs per molecule [4]
l minimum number of links to compute scaffold [4]
a maximum link ratio between two best contain pairs [0.3]
Notes:
- by default, span is automatically calculated as 1/4 of the sequence coverage of the input long reads
- G (genome size) must be specified if span=auto
- by default, dist is automatically calculated as p5 of the input long read lengths
- Ensure that all input files are in the current working directory, making soft-links if needed
Tips for running LongStitch
Optimizing k/w for ntLink step
The default k (k_ntLink) and w (w) values for ntLink generally work well, but (depending on your input data) you may get better results by tuning these parameters
Generally, we suggest setting the k-mer and window size to values in these approximate ranges:
k_ntLink (k-mer size): 24-40
w (window size): 100-500
These values can be optimized using a grid search
For example, trying all combinations of k-mer sizes 24, 32, 40 and window sizes 100, 250, 500
Running the default pipeline or including ARKS-long
The default LongStitch pipeline consists of Tigmint-long + ntLink, but you can also run an additional scaffolding step with ARKS-long by specifying tigmint-ntLink-arks as the target in your command
Different results from these steps are expected for different input data
Some datasets will show more gains with the additional scaffolding step than others
Generally, if you want to be more conservative in terms of minimizing misassemblies and faster runtimes, the default pipeline (run, Tigmint-long + ntLink) is recommended. However, if you want to maximize scaffolding and contiguity, running the additional ARKS-long step (tigmint-ntLink-arks) is often valuable
minimap2 is used for mapping reads in the Tigmint step
To change the (-x) preset used for mapping, specify longmap=<mode>
For example, to use the nanopore mapping preset, use longmap=ont (default), or for PacBio use longmap=pb
Running LongStitch in pipelines
To change to a particular before running LongStitch, you can use the -C dir option with the longstitch command
All input files must be in the working directory for longstitch - these can either be created manually or using the longstitch make_links command
This command only requires the parameters reads_path and draft_path to be set - full paths to the reads file and draft fasta file, respectively
License
LongStitch Copyright (c) 2020 British Columbia Cancer Agency Branch. All rights reserved.
LongStitch is released under the GNU General Public License v3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
LongStitch
A genome assembly correction and scaffolding pipeline using long reads, consisting of up to three steps:
Credits
LongStitch was developed and designed by Lauren Coombe, Janet Li, Theodora Lo and Rene Warren.
Citing LongStitch
If you use LongStitch in your research, please cite:
Coombe L, Li JX, Lo T, Wong J, Nikolic V, Warren RL and Birol I. LongStitch: high-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics 22, 534 (2021). https://doi.org/10.1186/s12859-021-04451-7
Installing LongStitch
LongStitch is available from conda:
All dependencies for LongStitch are also available from homebrew:
Alternatively, use the latest release tarball:
Dependencies
Example command
For example, to run the default pipeline on a draft assembly
draft-assembly.fawith the readsreads.fa.gzand a genome size ofgsize:Note that specifying
Gis required whenspan=autofor Tigmint-long, and that all input sequences files should be in single-line fasta/fastq format.The output scaffolds can be found in soft-links with the suffix
longstitch-scaffolds.faLongStitch demo
To test your LongStitch installation and see examples of how to run the pipeline, see
tests/run_longstitch_demo.shTo run the demo script, ensure all dependencies are in your PATH, and run the bash script:
Full help page
To run the LongStitch pipeline, you can use the Makefile driver script
longstitch.Tips for running LongStitch
Optimizing k/w for ntLink step
k_ntLink) and w (w) values for ntLink generally work well, but (depending on your input data) you may get better results by tuning these parametersk_ntLink(k-mer size): 24-40w(window size): 100-500Running the default pipeline or including ARKS-long
tigmint-ntLink-arksas the target in your commandrun, Tigmint-long + ntLink) is recommended. However, if you want to maximize scaffolding and contiguity, running the additional ARKS-long step (tigmint-ntLink-arks) is often valuableChanging parameters for minimap2
minimap2is used for mapping reads in the Tigmint step-x) preset used for mapping, specifylongmap=<mode>longmap=ont(default), or for PacBio uselongmap=pbRunning LongStitch in pipelines
-C diroption with thelongstitchcommandlongstitch- these can either be created manually or using thelongstitch make_linkscommandreads_pathanddraft_pathto be set - full paths to the reads file and draft fasta file, respectivelyLicense
LongStitch Copyright (c) 2020 British Columbia Cancer Agency Branch. All rights reserved.
LongStitch is released under the GNU General Public License v3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
For commercial licensing options, please contact Patrick Rebstein (prebstein@bccancer.bc.ca).