gsort is a tool to sort genomic files according to a genomefile.
For example, for some reason, you may want to sort your VCF to
have order: X,Y,2,1,3,... and you want to keep the header at the top.
As a more likely example, you may want to sort your file to match GATK order
(1 … X, Y, MT) which is not possible with any other sorting tool. With gsort
one can simply place MT as the last chrom in the .genome file.
Given a genome file (lines of chrom\tlength) With this tool, you can
sort a BED/VCF/GTF/… in the order dictated by that file with:
where here, memory-use will be limited to 1500 megabytes.
We will use this to enforce chromosome ordering in ggd.
It will also be useful for getting your files ready for use in bedtools.
GFF parent
In GFF, the Parent attribute may refer to a row that would otherwise be sorted after it (based on the end position).
But, some programs require that the row referenced in a Parent attribute be sorted first. If this is required, used
the --parent flag introduced in version 0.0.6.
Performance
gsort can sort the 2 million variants in ESP in 15 seconds. It takes a few minutes to sort
the ~10 million ExAC variants because of the huuuuge INFO strings in that file.
Usage
gsort will error if your genome file has ‘chr’ prefix and your file does not (or vice-versa).
It will write temporary files to your $TMPDIR (usually /tmp/) as needed to avoid using too
much memory.
TODO
Specify a VCF for the genome file and pull order from the @SQ tags
Avoid temp file when everything can fit in memory. (more universally, last chunk can always be kept in memory).
API Documentation
–
import “github.com/brentp/gsort”
Package gsort is a library for sorting a stream of tab-delimited lines ([]bytes)
(from a reader) using the amount of memory requested.
Instead of using a compare function as most sorts do, this accepts a
user-defined function with signature: func(line []byte) []int where the []ints
are used to determine ordering. For example if we were sorting on 2 columns, one
of months and another of day of months, the function would replace “Jan” with 1
and “Feb” with 2 for the first column and just return the Atoi of the 2nd
column.
gsort
Binaries Available Here
gsortis a tool to sort genomic files according to a genomefile.For example, for some reason, you may want to sort your VCF to have order:
X,Y,2,1,3,...and you want to keep the header at the top.As a more likely example, you may want to sort your file to match GATK order (1 … X, Y, MT) which is not possible with any other sorting tool. With
gsortone can simply place MT as the last chrom in the .genome file.Given a genome file (lines of chrom\tlength) With this tool, you can sort a BED/VCF/GTF/… in the order dictated by that file with:
where here, memory-use will be limited to 1500 megabytes.
We will use this to enforce chromosome ordering in ggd.
It will also be useful for getting your files ready for use in bedtools.
GFF parent
In GFF, the
Parentattribute may refer to a row that would otherwise be sorted after it (based on the end position). But, some programs require that the row referenced in aParentattribute be sorted first. If this is required, used the--parentflag introduced in version 0.0.6.Performance
gsort can sort the 2 million variants in ESP in 15 seconds. It takes a few minutes to sort the ~10 million ExAC variants because of the huuuuge INFO strings in that file.
Usage
gsortwill error if your genome file has ‘chr’ prefix and your file does not (or vice-versa).It will write temporary files to your $TMPDIR (usually /tmp/) as needed to avoid using too much memory.
TODO
API Documentation
– import “github.com/brentp/gsort”
Package gsort is a library for sorting a stream of tab-delimited lines ([]bytes) (from a reader) using the amount of memory requested.
Instead of using a compare function as most sorts do, this accepts a user-defined function with signature:
func(line []byte) []intwhere the []ints are used to determine ordering. For example if we were sorting on 2 columns, one of months and another of day of months, the function would replace “Jan” with 1 and “Feb” with 2 for the first column and just return the Atoi of the 2nd column.func Sort
Sort accepts a tab-delimited io.Reader and writes to wtr using prepocess to determine ordering
type Processor
Processor is a function that takes a line and return a slice of ints that determine ordering