Straglr - Short-tandem repeat genotyping using long reads
Straglr is a tool that can be used for genome-wide scans for tandem repeat (TR) expansions or targeted genotyping using long-read alignments.
Installation
Straglr is implemented in Python 3.8 and has been tested in a Linux environment.
Straglr depends on Tandem Repeats Finder (TRF) for identifying TRs and blastn for motif matching (TRF and blastn executables must be in $PATH). Other Python dependencies are listed in requirements.txt.
The file environment.yaml can be used by conda to create an environment with all dependencies installed. Straglr can be added to the environment via pip (for example, to install v1.3.0), or run directly from the cloned repository:
conda activate straglr
./straglr.py
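Because TRF and blastn must be discoverable on $PATH, a quick pre-flight check can save a failed run. The following is a minimal sketch, not part of Straglr; the helper name `missing_executables` is hypothetical, and the binary names "trf" and "blastn" are assumptions (TRF in particular is often installed under a versioned file name).

```python
import shutil


def missing_executables(names):
    """Return the subset of `names` that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


# ASSUMPTION: the executables are called "trf" and "blastn" on this system.
missing = missing_executables(["trf", "blastn"])
if missing:
    print("Not found on $PATH:", ", ".join(missing))
else:
    print("All external dependencies were found.")
```

If TRF is installed under a different name, pass that name instead or create a symlink called trf in a directory that is on $PATH.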
Input
Long-read alignments against the reference genome, sorted by genomic coordinates, in BAM format. Suggested aligner: Minimap2. Please use the option -Y to enable soft-clipping so that read sequences can be assessed directly from the BAM file.
Example application: genome scan to detect TRs longer than the reference genome by 100bp:
The most common use of Straglr is detecting TR expansions over the reference genome by a defined size threshold. This saves the computation that would otherwise be spent genotyping the majority of TRs in the human genome, which show no substantial change in length. The identified expanded alleles can then be screened for pathogenicity by comparing against known TR polymorphisms. A sample run of this kind detects expansions larger than the reference alleles by 100 bp at TR loci with motifs 2-100 bp long, which corresponds to --min_ins_size 100, --min_str_len 2 and --max_str_len 100.
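The run described above could be assembled programmatically as in the sketch below. The flags are the ones documented in this README; the positional order (BAM, reference FASTA, output prefix) and the helper name `build_scan_command` are assumptions made for illustration, so check `./straglr.py --help` before relying on them.

```python
import shlex


def build_scan_command(bam, reference, output_prefix, exclude_bed=None,
                       min_ins_size=100, min_str_len=2, max_str_len=100,
                       nprocs=1, genotype_in_size=False):
    """Build an argument list for a genome scan (hypothetical helper)."""
    # ASSUMPTION: positional arguments are BAM, reference FASTA, output prefix.
    cmd = [
        "./straglr.py", bam, reference, output_prefix,
        "--min_ins_size", str(min_ins_size),
        "--min_str_len", str(min_str_len),
        "--max_str_len", str(max_str_len),
        "--nprocs", str(nprocs),
    ]
    if genotype_in_size:
        cmd.append("--genotype_in_size")
    if exclude_bed is not None:
        cmd += ["--exclude", exclude_bed]
    return cmd


# File names below are placeholders, not files shipped with Straglr.
example = build_scan_command("sample.bam", "reference.fa", "sample_scan",
                             exclude_bed="exclude.bed", genotype_in_size=True)
print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell-quoting problems with unusual file names.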
Highly repetitive genomic regions may be problematic for aligners and give rise to questionable genotyping results. They can be skipped in Straglr’s genome scan by passing --exclude a bed file that contains all segmental duplications, centromeric and gap regions.
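One way to build such an exclusion file is to pool the regions from several annotation tracks and merge the overlaps. The sketch below does this in plain Python as an illustration only; the function names are hypothetical, and how the per-track BED lines are obtained is left to the reader.

```python
def parse_bed_lines(lines):
    """Yield (chrom, start, end) from BED-like lines, skipping headers and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        yield chrom, int(start), int(end)


def merge_regions(regions):
    """Merge overlapping or directly adjacent intervals on the same chromosome."""
    merged = []
    # Sorting is lexicographic on chromosome name, so "chr10" precedes "chr2".
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


def format_bed(regions):
    """Render intervals as tab-separated BED text."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in regions)
```

The merged output can be written to a file and supplied to --exclude. Coordinates are treated as plain integers here, so all input tracks should use the same reference build and coordinate convention.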
Output
<output_prefix>.tsv - detailed output, one support read per line
chrom - chromosome name
start - start coordinate of locus
end - end coordinate of locus
target_repeat - consensus (shortest) repeat motif from genome scan, or target motif in genotyping
locus - locus in UCSC format (chrom:start-end)
coverage - coverage depth of locus
genotype - copy numbers (default) or sizes (--genotype_in_size) of each allele detected for the given locus, separated by a semicolon (“;”) if multiple alleles are detected, with the number of support reads in parentheses following each allele copy number/size. An example of a heterozygous genotype in size: 990.8(10);30.9(10). Alleles preceded by > indicate minimum values, as full alleles are not captured in any support reads.
actual_repeat - actual repeat motif detected in mapped read
read_name - mapped read name
copy_number - number of copies of repeat in allele
size - size of allele
read_start - start position of repeat in support read
strand - strand of reference genome from which read originates
allele - allele to which support read is assigned
read_status - classification of mapped read
“full”: read captures entire repeat (counted as support read)
“partial”: read does not capture entire repeat (counted as support read)
“skipped”: “not_spanning” - read does not span across locus (NOT counted as support read)
“failed”: read not used for genotyping (NOT counted as support read). Reasons are indicated with the following descriptors:
Descriptor - Explanation
cannot_extract_sequence - cannot extract the repeat sequence, possibly because the repeat is deleted in the read in question or the regions flanking the motif are deleted
motif_size_out_of_range - motif size detected is outside the specified size range
insufficient_repeat_coverage - repeat detected does not cover enough (50%) of the expansion/insertion sequence
partial_and_insufficient_span - repeat does not cover enough (90%) of the query sequence minus flanking sequences
unmatched_motif - no repeat found matching the target motif
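The genotype field described above packs several values into one string, so downstream scripts usually need to unpack it. The following is a minimal parsing sketch based only on the format shown in this README (for example 990.8(10);30.9(10), with an optional leading > for minimum values); the function name and the returned dictionary keys are hypothetical.

```python
import re

# One allele token: optional ">", a number, then the support count in parentheses.
_ALLELE = re.compile(r"^(>?)([0-9]+(?:\.[0-9]+)?)\(([0-9]+)\)$")


def parse_genotype(field):
    """Split a genotype string into one dictionary per allele."""
    alleles = []
    for token in field.split(";"):
        token = token.strip()
        if not token:
            continue
        match = _ALLELE.match(token)
        if match is None:
            raise ValueError(f"unrecognised allele token: {token!r}")
        alleles.append({
            "value": float(match.group(2)),       # copy number or size
            "support": int(match.group(3)),       # number of support reads
            "is_minimum": match.group(1) == ">",  # allele not fully captured
        })
    return alleles


print(parse_genotype("990.8(10);30.9(10)"))
```

Whether "value" is a copy number or a size depends on whether the run used --genotype_in_size, so that setting should be recorded alongside the parsed values.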
<output_prefix>.bed - summarized genotypes one locus per line
chrom - chromosome name
start - start coordinate of locus
end - end coordinate of locus
repeat_unit - repeat motif
allele<N>.size, where N={1,2,3…} depending on --max_num_clusters, e.g. N={1,2} if --max_num_clusters==2 (default)
Usage
Some common parameters:
--loci: a BED file containing loci to be genotyped. 4-column BED format: chromosome start end repeat
--exclude: a BED file containing regions to be skipped in genome scan (e.g. long segmental duplications or pericentromeric regions)
--chroms: space-separated list of specific chromosomes for genome scan
--regions: a BED file containing the only regions to be used in genome scan
--include_alt_chroms: include ALT chromosomes (chromosomes with “_” in names) in genome scan (Default: NOT included)
--use_unpaired_clips: include examination of unpaired clipped alignments in genome scan to detect expansions beyond read size (Default: NOT used)
--min_support: minimum number of support reads for an expansion to be captured in genome scan (Default: 2)
--min_ins_size: minimum increase in size (relative to the reference genome) for an expansion to be captured in genome scan (Default: 100)
--min_str_len: minimum length of repeat motif for an expansion to be captured in genome scan (Default: 2)
--max_str_len: maximum length of repeat motif for an expansion to be captured in genome scan (Default: 50)
--nprocs: number of processes to use in Python’s multiprocessing (Default: 1)
--genotype_in_size: report genotype (the genotype column of the TSV output) in terms of allele sizes instead of copy numbers
--max_num_clusters: maximum number of clusters to be tried in Gaussian Mixture Model (GMM) clustering (Default: 2)
--min_cluster_size: minimum number of reads required to constitute a cluster (allele) in GMM clustering (Default: 2)
--trf_args: TRF arguments (Default: 2 5 5 80 10 10 500)
--tmpdir: user-specified directory for holding temporary files
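For targeted genotyping, --loci expects the 4-column layout listed above (chromosome, start, end, repeat). The sketch below formats such a file from Python tuples; it assumes tab-separated columns as in standard BED, the helper name is hypothetical, and the coordinates in the example are made up for illustration rather than taken from any real locus.

```python
def format_loci_bed(loci):
    """Render (chrom, start, end, repeat) tuples as 4-column BED text."""
    lines = []
    for chrom, start, end, repeat in loci:
        if start >= end:
            raise ValueError(f"start must be smaller than end: {chrom}:{start}-{end}")
        motif = repeat.upper()
        if not motif or set(motif) - set("ACGTN"):
            raise ValueError(f"unexpected characters in motif: {repeat!r}")
        lines.append(f"{chrom}\t{start}\t{end}\t{motif}")
    return "\n".join(lines) + "\n"


# Made-up coordinates, used only to show the layout.
example_loci = [("chr1", 1000, 1030, "CAG"), ("chr2", 5000, 5048, "ggcccc")]
print(format_loci_bed(example_loci), end="")
```

Writing the returned text to a file gives something that can be passed to --loci; the motif check is a convenience added for this sketch and is not a documented Straglr requirement.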
Example application: genome-wide genotyping
Loci can be retrieved in the output format “all fields from selected table” using the online Table Browser tool. For genome-wide genotyping, split the bed file of loci into batches with smaller numbers of loci (e.g. 10,000).
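Splitting a bed file into batches of about 10,000 loci can be done with any line-splitting tool. The sketch below shows one way in Python; the function names and the numbered output naming scheme are hypothetical choices, not something Straglr prescribes.

```python
def batch_lines(lines, batch_size=10000):
    """Yield lists of at most `batch_size` non-blank lines."""
    batch = []
    for line in lines:
        if not line.strip():
            continue
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def split_bed(path, out_prefix, batch_size=10000):
    """Write each batch to <out_prefix><NNNN>.bed and return the file names."""
    out_paths = []
    with open(path) as handle:
        for index, batch in enumerate(batch_lines(handle, batch_size)):
            out_path = f"{out_prefix}{index:04d}.bed"
            with open(out_path, "w") as out:
                out.writelines(batch)
            out_paths.append(out_path)
    return out_paths
```

Each resulting file can then be given to a separate Straglr run through --loci, which lets the batches be processed in parallel.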
A third output file, <output_prefix>.vcf, is listed alongside the .tsv and .bed files described above.
Utilities
straglr_compare.py: compare Straglr’s results
extract_repeats.py: extract repeat sequences from alignment BAM given Straglr’s TSV output
Contact
Readman Chiu
Citation
Chiu R, Rajan-Babu IS, Friedman JM, Birol I. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol 22, 224 (2021). https://doi.org/10.1186/s13059-021-02447-3