GFF3sort: a Perl script to sort GFF3 files and produce output suitable for tabix indexing
Usage
gff3sort.pl [input GFF3 file] > output.sort.gff3
Optional parameters:
    --precise        Run in precise mode, about 2-3x slower than the default mode.
                     Only needed if your original GFF3 file has parent features
                     appearing after their child features.
    --chr_order      Select how the chromosome IDs should be sorted.
                     Acceptable values are: alphabet, natural, original
                     [Default: alphabet]
    --extract_FASTA  If the input GFF3 file contains FASTA sequences at the end, use
                     this option to extract them into a separate file with the
                     extension '.fasta'. By default, the FASTA sequences are
                     discarded.
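To make the three --chr_order values concrete, the following Python sketch shows one plausible reading of them. It is an illustration, not the script's Perl implementation: the function names order_chromosomes and natural_key are hypothetical, and the exact comparison rules that gff3sort.pl applies in natural mode are an assumption here.

```python
import re


def natural_key(chrom):
    """Split an ID into text and integer chunks so 'chr2' sorts before 'chr10'."""
    # Assumption: digit runs compare numerically, other runs compare as text.
    # Tuples keep the chunks comparable even when a number meets a string.
    return [
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in re.split(r"(\d+)", chrom)
        if part
    ]


def order_chromosomes(chroms, mode="alphabet"):
    """Return the unique chromosome IDs in the requested order."""
    seen = list(dict.fromkeys(chroms))  # unique IDs in first-appearance order
    if mode == "original":
        return seen
    if mode == "alphabet":
        return sorted(seen)
    if mode == "natural":
        return sorted(seen, key=natural_key)
    raise ValueError(f"unknown chr_order mode: {mode}")


ids = ["chr10", "chr2", "chr1", "chr2"]
print(order_chromosomes(ids, "alphabet"))  # ['chr1', 'chr10', 'chr2']
print(order_chromosomes(ids, "natural"))   # ['chr1', 'chr2', 'chr10']
print(order_chromosomes(ids, "original"))  # ['chr10', 'chr2', 'chr1']
```

The practical difference shows up with numbered chromosomes: alphabetical order places chr10 before chr2, while natural order follows the numeric value.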
Publication
Zhu T, Liang C, Meng Z, Guo S, Zhang R: GFF3sort: A novel tool to sort GFF3 files for tabix indexing. BMC Bioinformatics 2017, 18:482, https://doi.org/10.1186/s12859-017-1930-3
Background
The tabix tool from htslib requires files sorted by chromosome and position. For GFF3 files, this means sorting by column 1 (chromosome) and column 4 (start position), for example with GNU sort or the gt tool. The sorted GFF3 file can then be indexed with tabix.
However, neither GNU sort nor the gt tool handles ties reliably: lines with the same chromosome and start position are placed in an arbitrary order. As a result, parent feature lines may end up after their child lines. For example, if two mRNA lines and two exon lines all start at position 473, the mRNA lines may be placed after the exon lines. This triggers bugs such as https://github.com/GMOD/jbrowse/issues/780
This script adjusts the order of lines that share a start position. It moves lines with a "Parent=" attribute (matched case-insensitively) behind lines without a "Parent=" attribute, so that parent features come before their children.
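The reordering described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the script's actual Perl code: sort_gff3_lines is a hypothetical name, chromosomes are compared alphabetically (the documented default), and comment or directive lines are assumed to have been removed already. The sample records are made up for the demonstration.

```python
import re


def sort_gff3_lines(lines):
    """Sort GFF3 feature lines by chromosome and start, parents first on ties."""

    def key(line):
        cols = line.rstrip("\n").split("\t")
        chrom, start, attrs = cols[0], int(cols[3]), cols[8]
        # Lines carrying a Parent= attribute (any letter case) sort after
        # lines without one when chromosome and start position are equal.
        has_parent = bool(re.search(r"(^|;)\s*parent=", attrs, re.IGNORECASE))
        return (chrom, start, has_parent)

    # sorted() is stable, so lines with identical keys keep their input order.
    return sorted(lines, key=key)


# Hypothetical records: exons listed before the mRNA they belong to.
records = [
    "chr1\tdemo\texon\t700\t900\t.\t+\t.\tParent=m1",
    "chr1\tdemo\texon\t473\t600\t.\t+\t.\tParent=m1",
    "chr1\tdemo\tmRNA\t473\t900\t.\t+\t.\tID=m1",
]
for line in sort_gff3_lines(records):
    cols = line.split("\t")
    print(cols[2], cols[3])
# mRNA 473
# exon 473
# exon 700
```

This sketch only separates lines without a "Parent=" attribute from lines with one. It does not by itself order deeper hierarchies in which a child feature has children of its own at the same position.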