The chunked-scatter tool takes a bed file, fasta index, sequence dictionary
or vcf file as input and divides the
contigs/chromosomes into overlapping chunks of a given size. These chunks will
then be placed in new bed files, one chromosome per file. Small chromosomes
will be put together to avoid the creation of thousands of files.
The scatter-regions tool works in a similar way but with defaults and flags
tuned towards creating genome scatters for GATK tools.
The safe-scatter tool produces a more even distribution of sizes in the
output bed files, and guarantees that none of the scatters are smaller than
--min-scatter-size.
Installation
Install using pip: pip install chunked-scatter
Install using conda: conda install chunked-scatter
Usage
chunked-scatter
usage: chunked-scatter [-h] [-p PREFIX] [-S] [-P] [-c SIZE]
[-m MINIMUM_BP_PER_FILE] [-o OVERLAP]
INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Each contig/region will be split into multiple overlapping
regions, which will be written to a new bed file. Each contig will be placed
in a new file, unless it is smaller than MINIMUM_BP_PER_FILE, in which case
the following contigs/regions are added to the same file.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. Default 'scatter-'.
-S, --split-contigs If set, contigs are allowed to be split up over
multiple files.
-P, --print-paths If set, prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows.
-c SIZE, --chunk-size SIZE
The size of the chunks. The first chunk in a region or
contig will be exactly length SIZE, subsequent chunks
will be SIZE + OVERLAP and the final chunk may be
anywhere from 0.5 to 1.5 times SIZE plus overlap. If a
region (or contig) is smaller than SIZE the original
region will be returned. Defaults to 1e6.
-m MINIMUM_BP_PER_FILE, --minimum-bp-per-file MINIMUM_BP_PER_FILE
The minimum number of bases represented within a
single output bed file. If an input contig or region
is smaller than MINIMUM_BP_PER_FILE, then the
next contigs/regions will be placed in the same file
until this minimum is met. Defaults to 45e6.
-o OVERLAP, --overlap OVERLAP
The number of bases which each chunk should overlap
with the preceding one. Defaults to 150.
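The chunking rule described for --chunk-size and --overlap can be sketched in a few lines of Python. This is a minimal illustration of the behaviour stated in the help text, not the tool's actual source: the function name `chunk_region` and the contig name `chr1` are hypothetical, and coordinates are assumed to be 0-based and half-open as in bed files.

```python
from typing import Iterator, Tuple


def chunk_region(start: int, end: int, size: int = 1_000_000,
                 overlap: int = 150) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) chunks for one region (hypothetical helper)."""
    position = start
    # Keep stepping by SIZE while more than 1.5 * SIZE remains, so the final
    # chunk ends up between 0.5 and 1.5 times SIZE (plus the overlap).
    while end - position > 1.5 * size:
        # Every chunk except the first reaches OVERLAP bases back.
        chunk_start = position if position == start else position - overlap
        yield chunk_start, position + size
        position += size
    # The final chunk takes whatever is left; a region smaller than SIZE
    # never enters the loop and is returned unchanged.
    final_start = position if position == start else position - overlap
    yield final_start, end


chunks = list(chunk_region(0, 3_200_000))
for chunk_start, chunk_end in chunks:
    print(f"chr1\t{chunk_start}\t{chunk_end}")
```

With the defaults, a 3.2 Mb region yields a first chunk of exactly 1,000,000 bases, a second of 1,000,150 bases, and a final chunk covering the remaining 1.2 Mb plus the 150 base overlap.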
scatter-regions
usage: scatter-regions [-h] [-p PREFIX] [-S] [-P] [-s SCATTER_SIZE] INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up approximately to
the given scatter size.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. Default 'scatter-'.
-S, --split-contigs If set, contigs are allowed to be split up over
multiple files.
-P, --print-paths If set, prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows.
-s SCATTER_SIZE, --scatter-size SCATTER_SIZE
The maximum size for the regions over which to
scatter. If contigs are not split, and a contig is
bigger than the maximum size, the contig will be
placed in its own file. Default: 1000000000.
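The grouping behaviour when -S is not set can be pictured as a greedy pass over the contigs in input order. The sketch below is an assumption about how such grouping could work, based only on the description above; `group_regions` and the toy contig sizes are hypothetical and not taken from the tool.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based half-open


def group_regions(regions: List[Region],
                  scatter_size: int = 1_000_000_000) -> List[List[Region]]:
    """Group whole contigs so each group stays near scatter_size (sketch)."""
    groups: List[List[Region]] = []
    current: List[Region] = []
    current_size = 0
    for contig, start, end in regions:
        length = end - start
        # Start a new group when adding this contig would exceed the limit.
        if current and current_size + length > scatter_size:
            groups.append(current)
            current, current_size = [], 0
        current.append((contig, start, end))
        current_size += length
    if current:
        groups.append(current)
    return groups


example = [("chr1", 0, 600), ("chr2", 0, 300), ("chr3", 0, 1500),
           ("chr4", 0, 200), ("chr5", 0, 100)]
groups = group_regions(example, scatter_size=1000)
for index, group in enumerate(groups):
    print(index, [contig for contig, _, _ in group])
```

In this toy input, `chr3` is larger than the scatter size and therefore ends up alone in its own group, matching the behaviour described for unsplit contigs.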
safe-scatter
usage: safe-scatter [-h] [-p PREFIX] [-P] [-c SCATTER_COUNT]
[-m MIN_SCATTER_SIZE] [--mix-small-regions]
INPUT
Given a sequence dict, fasta index or a bed file, scatter over the defined
contigs/regions. Creates a bed file where the contigs add up to the average
scatter size to within min_scatter_size. NOTE, this tool always splits up
contigs.
positional arguments:
INPUT The input file. The format is detected by the
extension. Supported extensions are: '.bed', '.dict',
'.fai', '.vcf', '.vcf.gz', '.bcf'.
optional arguments:
-h, --help show this help message and exit
-p PREFIX, --prefix PREFIX
The prefix of the output files. Output will be named
like: <PREFIX><N>.bed, in which N is an incrementing
number. (default: scatter-)
-P, --print-paths If set, prints paths of the output files to STDOUT.
This makes the program usable in scripts and
workflows. (default: False)
-c SCATTER_COUNT, --scatter-count SCATTER_COUNT
The number of chunks to scatter the regions in. All
chunks will be within --min-scatter-size of each other
except for the final chunk. (default: 50)
-m MIN_SCATTER_SIZE, --min-scatter-size MIN_SCATTER_SIZE
The minimum size of a scatter. This tool will never
generate regions smaller than this value, unless the
original regions are smaller. (default: 10000)
--mix-small-regions Mix small regions with regular regions in the input
regions. This can be useful in case there is a bias in
the composition of the regions. For example, the human
reference genome has all unplaced contigs (which are
small and difficult to process) at the end of the
file, which means they all end up in the same bedfile.
Enabling mixing prevents this (default: False)
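Because --print-paths writes the output file names to STDOUT, the tools can be driven from a script. The following sketch wraps safe-scatter with Python's `subprocess` module using only flags listed in the help text above. The helper names `build_command` and `scatter_paths` are hypothetical, and the code assumes that each path is printed on its own line, which the help text does not state explicitly.

```python
import subprocess
from typing import List


def build_command(input_path: str, prefix: str,
                  scatter_count: int = 50) -> List[str]:
    """Assemble a safe-scatter call from flags documented in the help text."""
    return ["safe-scatter", "--print-paths", "--prefix", prefix,
            "--scatter-count", str(scatter_count), input_path]


def scatter_paths(input_path: str, prefix: str,
                  scatter_count: int = 50) -> List[str]:
    """Run safe-scatter and return the bed file paths it prints."""
    result = subprocess.run(build_command(input_path, prefix, scatter_count),
                            check=True, capture_output=True, text=True)
    # Assumption: one output path per line on STDOUT.
    return [line for line in result.stdout.splitlines() if line.strip()]


# Example call (requires safe-scatter to be installed; paths are illustrative):
# paths = scatter_paths("/data/ref.dict", "/data/scatter-", scatter_count=10)
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact invocation before running it in a workflow.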
Examples
bed file
Given a bed file located at /data/regions.bed:
The command:
Will produce the following two output files:
/data/scatter_0.bed:
/data/scatter_1.bed:
dict file
Given a dict file located at /data/ref.dict:
The command:
Will produce the following output file at /data/scatter_0.bed: