New options for ptGAUL.sh in version 1.0.5: -o (output directory), -g (genome size), and -c (coverage).
New argument for the Python script in version 1.0.5: -o (output directory).
Fixed combine_gfa.py in version 1.0.5, so it can now be run automatically.
Usage: ptGAUL.sh -r (REFERENCE FILE) -l (LONG READ FILE)
       [-t threads int] [-g genome size int]
       [-c coverage int] [-f filter threshold int]
       [-o output directory string]
This pipeline is used for plastome assembly using long read data.
optional arguments:
  -h, --help <show this help message and exit>
  -r, --reference <MANDATORY: reference contigs or scaffolds in fasta format>
  -l, --longreads <MANDATORY: raw long reads in fasta/fastq/fq.gz format>
  -t, --threads <number of threads, default:1>
  -g, --genomesize <expected genome size of plastome (bp), default:160000>
  -c, --coverage <a rough coverage of data used for plastome assembly, default:50>
  -f, --filtered <the raw long reads will be filtered if the lengths are less than this number (bp); default: 3000>
  -o, --outputdir <output directory of results, default is current directory>
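To get a feel for what the -g, -c, and -f defaults mean, the following minimal Python sketch works through the arithmetic. It is only an illustration of the parameter descriptions above, not ptGAUL's own code: the function names and the example read lengths are made up, and how ptGAUL uses the coverage value internally is not shown here.

```python
def target_bases(genome_size=160_000, coverage=50):
    """Number of bases that corresponds to the requested coverage (-g times -c)."""
    return genome_size * coverage


def filter_by_length(read_lengths, threshold=3000):
    """Keep reads that are at least `threshold` bp long (-f); shorter reads are dropped."""
    return [length for length in read_lengths if length >= threshold]


# Hypothetical read lengths, used only for illustration.
lengths = [1200, 3000, 8500, 15000, 2999]
kept = filter_by_length(lengths)

print(target_bases())  # 8000000 bases for a 160 kb plastome at 50x
print(kept)            # [3000, 8500, 15000]
```

With the defaults, roughly 8 Mbp of plastid reads corresponds to 50x coverage of a 160 kb plastome, and reads shorter than 3000 bp are discarded.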
PlasTid Genome Assembly Using Long reads data (ptGAUL)
This pipeline assembles plastid (chloroplast) genomes from long-read data, including both Nanopore and PacBio reads. It helps assemble complex plastomes with many long repeat regions, which cannot be resolved with short-read data alone: short-read assemblies can produce many alternative paths because of the long repeat regions (e.g. Juncus). The pipeline is straightforward and has two mandatory arguments (a reference and the raw long-read data). It usually takes about 10 minutes to assemble a plastome with 16 GB of memory and less than 10 Gbp of sequence data. The method is described in Zhou et al. (2023); see Citation. We introduced this pipeline at the BAGGs workshop at UNC-Chapel Hill.
Latest updates
ptGAUL 1.0.5 release (Feb 9, 2023)
ptGAUL 1.0.4 release (Oct 31, 2022)
Installation
Create a conda environment
Use conda to install.
Environment
The examples run on Linux and macOS.
Quick run
The basic arguments of ptGAUL.sh are 1) -r: a plastome from a closely related species (a reference from the same genus or the same family should work) and 2) -l: your long-read data (any sequence file in fasta, fastq, or fq.gz format).
If you use version 1.0.4, run the command in the ptGAUL_version directory. Otherwise, combine_gfa.py will not be able to run automatically.
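If you prefer to launch ptGAUL from a script, a minimal Python sketch like the one below can assemble the command line. The flags are the ones listed in the ptGAUL.sh help message (the -g, -c, and -o options assume version 1.0.5). The file names reference.fasta and long_reads.fq.gz, the output directory ptgaul_out, and the helper name build_ptgaul_command are placeholders, not part of ptGAUL.

```python
import shlex


def build_ptgaul_command(reference, long_reads, threads=1, genome_size=160000,
                         coverage=50, min_length=3000, output_dir="."):
    """Build the argument list for ptGAUL.sh from the documented options."""
    return [
        "ptGAUL.sh",
        "-r", reference,         # plastome of a closely related species
        "-l", long_reads,        # raw long reads (fasta, fastq, or fq.gz)
        "-t", str(threads),      # number of threads
        "-g", str(genome_size),  # expected plastome size in bp
        "-c", str(coverage),     # rough coverage used for the assembly
        "-f", str(min_length),   # reads shorter than this are filtered out
        "-o", output_dir,        # output directory
    ]


# Placeholder file names; replace them with your own data.
cmd = build_ptgaul_command("reference.fasta", "long_reads.fq.gz",
                           threads=8, output_dir="ptgaul_out")
print(shlex.join(cmd))
```

To actually run the assembly, the list can be passed to `subprocess.run(cmd, check=True)`, provided ptGAUL.sh is on your PATH.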
EXAMPLE
The command for the example data.
To check all parameters in ptGAUL, run ptGAUL.sh with the -h (--help) option.
Parameters in details
Check your results before using it
If the edge number is not 1 or 3 and the plastid length looks abnormal, you should manually check the assembled data using Bandage. Once you have confirmed that there are three edges, you can manually run the Python script again to get the assembly results, which include two paths.
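Before opening Bandage, a quick count of the edges can tell you whether a manual check is needed. The sketch below is one possible way to do that in Python, under the assumption that the assembly graph is a GFA file in which each edge is a segment (S) line; the function names and the tiny example graph are made up for illustration, and this is not the logic of combine_gfa.py.

```python
def summarize_gfa(gfa_text):
    """Return (edge_count, total_length) over the segment (S) lines of GFA text."""
    count = 0
    total = 0
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue  # skip headers, links, and malformed lines
        count += 1
        sequence = fields[2]
        if sequence != "*":
            total += len(sequence)
        else:
            # Sequence omitted: fall back to the optional LN:i:<length> tag.
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    total += int(tag[5:])
                    break
    return count, total


def needs_manual_check(edge_count):
    """True when the edge number is neither 1 nor 3."""
    return edge_count not in (1, 3)


# Tiny made-up graph with three segments, one of them without sequence.
example = (
    "H\tVN:Z:1.0\n"
    "S\tedge_1\tACGTACGT\n"
    "S\tedge_2\t*\tLN:i:12\n"
    "S\tedge_3\tACGT\n"
    "L\tedge_1\t+\tedge_2\t+\t0M\n"
)
count, total = summarize_gfa(example)
print(count, total, needs_manual_check(count))  # 3 24 False
```

For a real run, read your own GFA file into a string and compare the total length against the size you expect for your plastome; what counts as abnormal depends on the species, so the sketch only reports the number.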
(Optional) Final assembly polish using long-read data
This step improves the assembly only slightly. Polishing with short reads is highly recommended (see the next section).
Install racon using conda.
(Optional) Final assembly polish using short-read data
Software for the polishing step (this needs a separate python2 environment)
Highly recommended: use fmlrc for the polishing step. It outperforms other polishers.
The illumina_* files are the fq.gz files of Illumina reads. Change the output directory “/PATH/msbwt” to your own path.
Once the msbwt run has finished, run the fmlrc step. N is the number of threads, and assembled_cp is the plastome assembled by ptGAUL. Change the output path “/PATH/fmlrc/corrected.fasta” to your own path.
Citation
Zhou, W., Armijos, C.E., Lee, C., Lu, R., Wang, J., Ruhlman, T.A., Jansen, R.K., Jones, A.M. and Jones, C.D., 2023. Plastid genome assembly using long‐read data. Molecular Ecology Resources, 23(6), pp.1442-1457.
If you are using fmlrc, please cite Wang, J.R., Holt, J., McMillan, L. and Jones, C.D., 2018. FMLRC: Hybrid long read error correction using an FM-index. BMC Bioinformatics, 19(1), 50.