Miniasm is a great long-read assembly tool: straightforward, effective and very fast. However, it does not include a polishing step, so its assemblies have a high error rate – they are essentially made of stitched-together pieces of long reads.
Racon is a great polishing tool that can be used to clean up assembly errors. It’s also very fast and well suited for long-read data. However, it operates on FASTA files, not the GFA graphs that miniasm makes.
That’s where Minipolish comes in. With a single command, it will use Racon to polish up a miniasm assembly, while keeping the assembly in graph form.
It also takes care of some of the other nuances of polishing a miniasm assembly:
Adding read depth information to contigs
Fixing sequence truncation that can occur in Racon
Adding circularising links to circular contigs if not already present (so they display better in Bandage)
‘Rotating’ circular contigs between polishing rounds to ensure clean circularisation
Requirements
Minipolish assumes that you have minimap2 and Racon installed and available in your PATH. If you can run minimap2 --version and racon --version on the command line, you should be good to go!
You’ll need Python 3.6 or later to run Minipolish (check with python3 --version). The only Python package requirement is Edlib. If you don’t already have this package, it will be installed as part of the Minipolish installation process. You’ll also need pytest if you want to run Minipolish’s unit tests.
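If you'd rather check these requirements from Python than by hand, something like the following would do it. This is only a rough sketch and not part of Minipolish: the function names `missing_tools` and `python_is_new_enough` are made up for illustration, and it only checks that the executables can be found on your PATH, not that they run correctly.

```python
import shutil
import sys


def missing_tools(tools=("minimap2", "racon")):
    """Return the names of the given tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_new_enough(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    if not python_is_new_enough():
        print("Python 3.6 or later is required")
```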
Installation
Install from source
You can install Minipolish using pip, either from a local copy of the repository or directly from GitHub.
If these installation commands aren’t working for you (e.g. an error message like Command 'pip3' not found or command 'gcc' failed with exit status 1), then check out the installation issues page on the Badread wiki (Badread is a different tool of mine, but that page covers the same problems).
Run without installation
Minipolish can also be run directly from its repository by using the minipolish-runner.py script, no installation required.
If you run Minipolish this way, it’s up to you to make sure that Edlib is installed for your Python.
Method
Step 1: initial Racon polish with constituent reads
Miniasm’s assembled contigs are made up of pieces of long reads and therefore have a high error rate: their sequence identity is probably around 90% or so, depending on the input reads.
The miniasm GFA file indicates specifically which reads contributed to each contig on its "a" lines. For example:
a utg000001c 0 1834c7d5-151e-d9af-fe1d-6bd9f68d355e:19-126885 + 31415
a utg000001c 31415 28c30dac-ce92-b0f7-af70-4fd95c448f5b:32-107841 + 4527
a utg000001c 35942 668e8adb-6b7d-67ba-684f-44d78fc3fe32:85-164862 + 14938
a utg000001c 50880 0ed28b14-d384-ee7b-d12f-88512a2a829c:43-218068 + 81190
Therefore, the first thing Minipolish does is to run Racon on each contig independently, only using the reads which were used to create that contig. This step is typically quite fast because it does not involve high read depths, and it can bring the percent identity up to the high 90s.
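To make the idea of "constituent reads" concrete, here is a small sketch of how the "a" lines could be grouped by contig. It is an illustration and not Minipolish's actual code: `reads_per_contig` is a hypothetical name, and it assumes that the fourth field is a read name followed by a `:start-end` range, as in the example lines above.

```python
from collections import defaultdict


def reads_per_contig(gfa_lines):
    """Map each contig name to the read names listed on its 'a' lines."""
    contig_reads = defaultdict(list)
    for line in gfa_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "a":
            continue  # not an 'a' line
        contig, read_region = fields[1], fields[3]
        # Assumption: the text after the last colon is a start-end range
        # within the read, so everything before it is the read name.
        read_name = read_region.rsplit(":", 1)[0]
        contig_reads[contig].append(read_name)
    return dict(contig_reads)


example = [
    "a utg000001c 0 1834c7d5-151e-d9af-fe1d-6bd9f68d355e:19-126885 + 31415",
    "a utg000001c 31415 28c30dac-ce92-b0f7-af70-4fd95c448f5b:32-107841 + 4527",
]
contig_to_reads = reads_per_contig(example)
```

Each contig's read list could then be used to pull just those reads out of the full read set before running Racon on that contig.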
Step 2: full Racon polish rounds
Now that the assembly is in better shape, Minipolish does full Racon-polishing rounds – aligning the full read set to the whole assembly and getting a Racon consensus. The default number of polishing rounds is two, but this is configurable with the --rounds option.
Minipolish does two things here to ensure that contigs can circularise cleanly. First, it repairs sequence ends as Racon can sometimes truncate them. I.e. if Racon dropped a handful of bases from the start or end of a contig, Minipolish will put them back on. Second, it rotates (i.e. changes the starting position) circular contigs between polishing rounds. If all goes well, this means that the first base of a circular contig immediately follows the last base – clean circularisation.
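Rotation itself is a simple operation on a circular sequence, which the sketch below illustrates. The function name `rotate_circular` is hypothetical, and how Minipolish actually chooses the new starting position is not described here, so rotating by half the contig length is just an assumption made for the example.

```python
def rotate_circular(sequence, new_start):
    """Return the circular sequence rewritten to begin at index new_start."""
    if not sequence:
        return sequence
    new_start %= len(sequence)
    return sequence[new_start:] + sequence[:new_start]


contig = "ACGTAC"
rotated = rotate_circular(contig, 2)

# Assumed choice for illustration: move the old start/end junction to the
# middle of the contig so the next polishing round covers it like any
# other position.
halfway = rotate_circular(contig, len(contig) // 2)
```

The rotated sequence contains exactly the same bases in the same circular order. Only the position of the start/end junction changes.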
Step 3: contig read depth
Minipolish finishes by doing one more read-to-assembly alignment, this time not to polish but to calculate read depths. These depths are added to the GFA line for each contig (e.g. dp:f:77.179) and they will be recognised if the graph is loaded in Bandage.
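As a rough picture of what a depth value like this means, the sketch below computes mean depth as total aligned bases divided by contig length and writes it as a `dp:f:` tag. This is an assumption about one reasonable way to do it, not a description of Minipolish's exact calculation, and the names `mean_depth` and `add_depth_tag` are made up for the example. It also assumes a tab-delimited GFA segment line.

```python
def mean_depth(aligned_lengths, contig_length):
    """Mean read depth: total aligned bases divided by the contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(aligned_lengths) / contig_length


def add_depth_tag(segment_line, depth):
    """Return a tab-delimited GFA segment line with a dp:f: depth tag."""
    fields = segment_line.rstrip("\n").split("\t")
    # Drop any depth tag that is already present so it isn't duplicated.
    fields = [field for field in fields if not field.startswith("dp:f:")]
    fields.append(f"dp:f:{depth:.3f}")
    return "\t".join(fields)


depth = mean_depth([500, 250, 250], 100)
tagged = add_depth_tag("S\tutg000001c\tACGT", 77.179)
```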
CIGARs
It is important to note here something that Minipolish does not do: change/fix the CIGAR strings indicating contig overlap. While circular contigs will be connected with an overlap-free link (i.e. a CIGAR of 0M), links between linear contigs will have overlap.
For example, if miniasm created a graph with this link…
L utg000001l + utg000020l + 77073M SD:i:86773
…then that link will have the same CIGAR in the polished assembly. However, since the sequence was polished, the overlap value (77073) will no longer be quite right.
So take CIGAR overlaps between polished contigs with a grain of salt. They will still indicate the approximate amount of overlap, not the exact amount.
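If you want to read these overlap values programmatically, a sketch like the following would pull the number out of a link line. The function name `link_overlap` is hypothetical, and it assumes the CIGAR is a single match operation (like 77073M or 0M), as in the examples above. The value it returns for a polished assembly is only approximate, for the reasons just described.

```python
import re


def link_overlap(link_line):
    """Return the overlap length from the CIGAR of a GFA 'L' line."""
    fields = link_line.split()
    if len(fields) < 6 or fields[0] != "L":
        raise ValueError("not a GFA link line")
    # Assumption: the CIGAR is a single match operation such as 77073M.
    match = re.fullmatch(r"(\d+)M", fields[5])
    if match is None:
        raise ValueError("unexpected CIGAR: " + fields[5])
    return int(match.group(1))


overlap = link_overlap("L utg000001l + utg000020l + 77073M SD:i:86773")
```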
Quick usage
First use minimap2 and miniasm to make an assembly, then polish it with Minipolish.
This repo contains a small Bash script (miniasm_and_minipolish.sh) to do those three steps in a single command. It takes two positional arguments: the long reads file and the number of threads.
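For anyone who would rather drive the three steps from Python, here is a hedged sketch. The exact commands are assumptions: it uses minimap2's ava-ont preset for all-vs-all overlaps, miniasm's -f option for the reads, and it assumes that all three tools write their results to stdout. The file names (overlaps.paf, assembly.gfa, polished.gfa) and the function names are just placeholders for the example.

```python
import subprocess


def build_commands(reads, threads=8):
    """Return (command, output_path) pairs for the three pipeline steps."""
    t = str(threads)
    return [
        # Step 1: all-vs-all read overlaps (assumes Oxford Nanopore reads).
        (["minimap2", "-t", t, "-x", "ava-ont", reads, reads], "overlaps.paf"),
        # Step 2: miniasm assembly from the overlaps.
        (["miniasm", "-f", reads, "overlaps.paf"], "assembly.gfa"),
        # Step 3: polish the miniasm graph with Minipolish.
        (["minipolish", "-t", t, reads, "assembly.gfa"], "polished.gfa"),
    ]


def run_pipeline(reads, threads=8):
    """Run each step in order, saving its stdout to the output file."""
    for command, output_path in build_commands(reads, threads):
        with open(output_path, "w") as output_file:
            subprocess.run(command, stdout=output_file, check=True)


commands = build_commands("long_reads.fastq.gz", threads=4)
```

Calling `run_pipeline("long_reads.fastq.gz")` would then leave the polished graph in polished.gfa, provided all three tools are installed.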
Full usage
usage: minipolish [-t THREADS] [--rounds ROUNDS]
                  [--minimap2-preset {map-ont,lr:hq,map-pb,map-hifi}] [--pacbio]
                  [--skip_initial] [-h] [--version]
                  reads assembly

Minipolish

Positional arguments:
  reads                 Long reads for polishing (FASTA or FASTQ format)
  assembly              Miniasm assembly to be polished (GFA format)

Settings:
  -t THREADS, --threads THREADS
                        Number of threads to use for alignment and polishing
                        (default: 16)
  --rounds ROUNDS       Number of full Racon polishing rounds (default: 2)
  --minimap2-preset {map-ont,lr:hq,map-pb,map-hifi}
                        minimap2 preset to use: "map-ont" for Oxford Nanopore
                        reads with <Q20 accuracy, "lr:hq" for Oxford Nanopore
                        reads with Q20+ accuracy, "map-pb" for PacBio CLR or
                        "map-hifi" for PacBio HiFi/CCS (default: map-ont)
  --pacbio              Deprecated: equivalent to --minimap2-preset map-pb.
                        Retained for backwards compatibility.
  --skip_initial        Skip the initial polishing round - appropriate if the
                        input GFA does not have "a" lines (default: do the
                        initial polishing round)

Other:
  -h, --help            Show this help message and exit
  --version             Show program's version number and exit
Citation
If you use Minipolish in your research, you can cite the following paper in which it was introduced:
Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2019;8(2138).
License
GNU General Public License, version 3