Continuity, correctness and completeness of genome assemblies are important for
many biological projects. Long reads represent a major driver towards delivering
high-quality genomes, but not everybody can achieve the necessary coverage for good
long-read-only assemblies. Therefore, improving existing assemblies with low-coverage
long reads is a promising alternative. The improvements include correction, scaffolding
and gap filling. However, most tools perform only one of these tasks and the
useful information of reads that supported the scaffolding is lost when running separate
programs successively. Therefore, we propose a new tool for combined execution
of all three tasks using PacBio or Oxford Nanopore reads.
Requirements
The first list contains the Python requirements for gapless.py and the second list contains the external programs called in gapless.sh.
The final output is linked to gapless_run/gapless.fa. Adjust the -j parameter to the number of available cores. With 30 threads, a human-sized genome with 30x coverage finishes the default 3 iterations in approximately half a day.
The pipeline essentially makes the following calls for each iteration:
{read_type} is map-pb, asm20 or map-ont depending on the type of long reads.
{read_type2} is ava-pb, asm20 --min-occ-floor=0 -X -m100 -g10000 --max-chain-skip 25 or ava-ont depending on the type of long reads.
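The relation between the long-read type and the two minimap2 presets can be sketched as a small lookup. This is only an illustration: it assumes that the presets are listed in the same order as the read types accepted by -t (pb_clr, pb_hifi, nanopore), and the names PRESETS and minimap2_presets are hypothetical, not part of gapless.

```python
# Hypothetical lookup. The pairing of read types and presets assumes that both
# lists in the text are given in the same order (pb_clr, pb_hifi, nanopore).
PRESETS = {
    "pb_clr": ("map-pb", "ava-pb"),
    "pb_hifi": (
        "asm20",
        "asm20 --min-occ-floor=0 -X -m100 -g10000 --max-chain-skip 25",
    ),
    "nanopore": ("map-ont", "ava-ont"),
}


def minimap2_presets(long_read_type):
    """Return ({read_type}, {read_type2}) for a value of the -t parameter."""
    try:
        return PRESETS[long_read_type]
    except KeyError:
        raise ValueError(f"unknown long-read type: {long_read_type!r}") from None
```

The first element is the preset for mapping reads to the assembly, the second is the preset for the all-vs-all alignment of the extending reads.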
Parameter
gapless.sh [OPTIONS] {long_reads}.fq
| Parameter | Default | Description |
|---|---|---|
| -h -? | | Display this help and exit |
| -i | (mandatory) | Input assembly (fasta) |
| -j | 4 | Number of threads |
| -n | 3 | Number of iterations |
| -o | gapless_run | Output directory (improved assembly is written to gapless.fa in this directory) |
| -r | | Restart at the start iteration and overwrite instead of incorporating already present files |
| -s | 1 | Start iteration (previous runs must be present in output directory) |
| -t | (mandatory) | Type of long reads (pb_clr, pb_hifi, nanopore) |
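When gapless.sh is launched from a script, the options above can be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper gapless_sh_command and the file names assembly.fa and reads.fq are hypothetical, and only the options listed above are used.

```python
import shlex


def gapless_sh_command(assembly, long_reads, read_type,
                       threads=4, iterations=3, out_dir="gapless_run"):
    """Build the argument list for gapless.sh from the documented options."""
    if read_type not in ("pb_clr", "pb_hifi", "nanopore"):
        raise ValueError(f"unsupported read type: {read_type!r}")
    return [
        "gapless.sh",
        "-j", str(threads),     # number of threads
        "-i", assembly,         # input assembly (mandatory)
        "-t", read_type,        # type of long reads (mandatory)
        "-n", str(iterations),  # number of iterations
        "-o", out_dir,          # output directory
        long_reads,             # positional argument: long reads
    ]


# Hypothetical file names, for illustration only.
cmd = gapless_sh_command("assembly.fa", "reads.fq", "pb_hifi", threads=30)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, provided gapless.sh is on the PATH.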
gapless.py split [OPTIONS] {assembly}.fa
| Parameter | Default | Description |
|---|---|---|
| -h --help | | Display this help and exit |
| -n --minN | 1 | Minimum number of N’s to split at that position |
| -o --output | {assembly}_split.fa | File to which the split sequences should be written |
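The effect of --minN can be illustrated with a short sketch that splits a scaffold at runs of N's. This is not the implementation used by gapless.py split. The function name split_at_n_runs is hypothetical, and treating lowercase n like N is an assumption.

```python
import re


def split_at_n_runs(sequence, min_n=1):
    """Split a scaffold into contigs at every run of at least min_n N's.

    Illustrative only. Runs shorter than min_n stay inside the contig.
    Lowercase n is treated like N, which is an assumption.
    """
    if min_n < 1:
        raise ValueError("min_n must be at least 1")
    gap = re.compile(r"[Nn]{%d,}" % min_n)
    # Drop empty pieces that arise from leading or trailing N runs.
    return [contig for contig in gap.split(sequence) if contig]
```

With the default of 1, every N separates two contigs. A larger value keeps short N runs inside a contig and splits only at longer gaps.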
-s --scaffolds (gapless.py finish): CSV file from previous steps describing the scaffolding
Intermediate and final output files in the pipeline
| File | Program | Type | Information |
|---|---|---|---|
| gapless_split.fa | gapless.py split | temporary | Input assembly with the scaffolds split into contigs |
| gapless_split_repeats.paf | minimap2 | temporary | Mapping of the split assembly to itself |
| gapless_reads.paf | minimap2 | temporary | Mapping of the long reads to the split assembly |
| gapless_scaffold_paths.csv | gapless.py scaffold | intermediate | Table summarising the first and last base included for contigs and reads in the new scaffolds as well as their order and orientation for all haplotypes. Positions start at 0 and end positions are one after the last included position. The orientation is encoded with +/-. The first and last contig/read in a scaffold contain information about the distance to the next scaffold in case this information is present. Otherwise, the fields are encoded with -1. Identical phases mean that contigs/reads are phased to be on the same haplotype. Negative phases mark contigs/reads identical to the main haplotype (0). Empty contig/read names in combination with positive phases mark deletions. |
| gapless_extensions.csv | gapless.py scaffold | intermediate | Table summarising the mapping of extending reads to the assembly: how much they extend the new scaffolds, how far from the end they stop aligning (trim), and the read distance to the next alignment (unmap_ext). |
| gapless_extending_reads.lst | gapless.py scaffold | intermediate | List of extending reads with one read name per line |
| gapless_stats.pdf | gapless.py scaffold | final | File containing plots with information about the run |
| gapless_extending_reads.paf | minimap2 | temporary | All-vs.-all alignment of the extending reads |
| gapless_extended_scaffold_paths.csv | gapless.py extend | intermediate | Table including the extensions in the same format as gapless_scaffold_paths.csv |
| gapless_used_reads.lst | gapless.py extend | intermediate | List of reads included in gapless_extended_scaffold_paths.csv with one read name per line |
| gapless_raw.fa | gapless.py finish | temporary | Unpolished output assembly |
| gapless_consensus.paf | minimap2 | temporary | Mapping of the long reads to the unpolished assembly |
| gapless.fa | racon | final | Polished output assembly |
Intermediate files are required for the following steps in the pipeline, but are not removed when they are not needed anymore (in contrast to temporary files).
Log files are stored in the logs folder and the resource usage acquired with GNU time is written to the timing folder.
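The coordinate convention described for gapless_scaffold_paths.csv (positions start at 0, end positions are one after the last included position, orientation is + or -) corresponds directly to Python slicing. The sketch below is an illustration under assumptions: the function extract_segment is hypothetical, it does not read the CSV file, and it assumes that the orientation - denotes the reverse complement.

```python
# Complement table for DNA, keeping N and the case of each base.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_segment(sequence, start, end, orientation):
    """Return sequence[start:end], reverse-complemented for orientation '-'.

    start is 0-based and end is one past the last included base.
    Interpreting '-' as the reverse complement is an assumption.
    """
    if orientation not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    segment = sequence[start:end]
    if orientation == "-":
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment
```

Because the end position is exclusive, the length of an included segment is simply end minus start, and adjacent segments can share a boundary value without overlapping.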
FAQ
Coming soon …
Publication
Schmeing, S., Robinson, M.D. Gapless provides combined scaffolding, gap filling and assembly correction with long reads. bioRxiv (2022). https://doi.org/10.1101/2022.03.08.483466
gapless
Combined scaffolding, gap-closing and assembly correction with long reads
Installation
No installation except for the requirements is necessary. The program can be directly called from its folder after downloading:
You may want to add the folder to your PATH variable to be able to call it from everywhere:
If you insert this command into ~/.bashrc, it will be called automatically when you log in.
An alternative is to create links to these files:
Bioconda
Gapless can also be downloaded together with all Python requirements automatically via anaconda/miniconda (https://docs.conda.io/projects/continuumio-conda/en/latest/user-guide/install/index.html). However, updates will not be as frequent, and the option to switch to the devel branch to get the most recent bug fixes is missing.
To add the additional software used in gapless.sh from conda use:
Quick start examples
The pipeline can be run with one of the following three commands depending on the type of long reads:
Parameter
gapless.py scaffold [OPTIONS] {assembly}.fa {mapping}.paf {repeat}.paf
Options: -h --help, -p --prefix (default: {assembly}), -s --stats, --minLenBreak, --minMapLength, --minMapQ, --largeGenome
gapless.py extend -p {prefix} {all_vs_all}.paf
Options: -h --help, -p --prefix, --minLenBreak
gapless.py finish [OPTIONS] -s {scaffolds}.csv {assembly}.fa {reads}.fq
Options: -h --help, -f --format (fasta/fastq), -H --hap, --hap[1-9], -o --output (default: {assembly}_gapless.fa), --out[1-9], -p --polishing, -s --scaffolds