EnsembleTR is a tool for ensemble Tandem Repeat (TR) calling. It takes one or more VCF files with TR genotypes for a panel of samples and outputs a consensus set of genotypes.
Installation
pip install --upgrade pip
pip install ensembletr
Type ensembletr --help. You should see the help message appear.
Usage
Run ensembletr with the following required parameters:
--vcfs <file.vcf,[file2.vcf]> Comma separated list of input VCF files
--ref Reference genome (.fa)
--out Path to output VCF file
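As a rough sketch of how the three required parameters fit together, the snippet below assembles an argument list for the ensembletr executable. The helper name `build_ensembletr_command` and all file names are hypothetical; only the executable name and the `--vcfs`, `--ref` and `--out` flags come from this README.

```python
import shlex


def build_ensembletr_command(vcfs, ref, out):
    """Assemble an argv list for an EnsembleTR run (hypothetical helper)."""
    if not vcfs:
        raise ValueError("at least one input VCF is required")
    return [
        "ensembletr",
        "--vcfs", ",".join(vcfs),  # comma separated, no spaces
        "--ref", ref,
        "--out", out,
    ]


# Placeholder file names, for illustration only.
cmd = build_ensembletr_command(
    ["hipstr.vcf.gz", "gangstr.vcf.gz"], "hg38.fa", "ensemble.vcf"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.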
File formats
VCF (--vcfs)
Both zipped and unzipped VCF files are accepted as input. EnsembleTR can currently process VCF files generated by HipSTR, GangSTR, adVNTR, and ExpansionHunter.
FASTA Reference genome (--ref)
You must input a reference genome in FASTA format. This must be the same reference build used for TR calling in input files.
VCF (--out)
For more information on the VCF file format, see the VCF spec. The output VCF is not necessarily sorted; use vcf-sort or another VCF sorting tool to sort it before downstream analysis. The EnsembleTR output VCF contains several custom fields, described below.
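To illustrate what sorting means here, the following is a minimal in-memory sketch that orders VCF records by chromosome and position while leaving header lines first. It is not a replacement for vcf-sort: the function names are hypothetical, it assumes tab-separated records and contig names like chr1 or 1, and it is only suitable for files that fit in memory.

```python
def chrom_key(chrom):
    """Order numeric contigs numerically, then the rest alphabetically."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def sort_vcf_lines(lines):
    """Return header lines followed by records sorted by (CHROM, POS)."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def record_key(ln):
        chrom, pos = ln.split("\t", 2)[:2]
        return (chrom_key(chrom), int(pos))

    # sorted() is stable, so records with equal keys keep their input order.
    return header + sorted(records, key=record_key)
```

For whole-genome outputs a dedicated tool remains the safer choice, since it handles compression and indexing as well.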
INFO fields
INFO fields contain aggregated statistics about each TR. The following custom fields are added:
| FIELD | DESCRIPTION |
|---|---|
| START | Start position of the TR |
| END | End position of the TR |
| PERIOD | Length of the repeat unit |
| RU | Repeat motif |
| METHODS | Methods that attempted to genotype this locus (AdVNTR, EH, HipSTR, GangSTR) |
Note that RU shows the canonical sequence of the repeat unit: the alphabetically first sequence among all rotations of the motif on the + and - strands. For example, the canonical sequence of “TG” is “AC”.
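The canonical form described above can be sketched in a few lines. This is an illustration of the definition, not EnsembleTR's own code; the function names are hypothetical and the motif is assumed to contain only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(seq):
    """Return every cyclic rotation of seq."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]


def canonical_repeat_unit(motif):
    """Alphabetically first rotation across both strands (non-empty motif)."""
    motif = motif.upper()
    candidates = rotations(motif) + rotations(reverse_complement(motif))
    return min(candidates)


print(canonical_repeat_unit("TG"))  # AC
```

For “TG” the candidates are TG and GT on one strand and CA and AC on the other, so the minimum is “AC”, matching the example in the text.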
FORMAT fields
FORMAT fields contain information specific to each genotype call. The following custom fields are added:
| FIELD | DESCRIPTION |
|---|---|
| GT | Genotype |
| GB | Base pair difference from ref allele |
| NCOPY | Genotype given in number of copies of the repeat motif |
| EXP | Boolean showing if the genotype alleles were expanded |
| SCORE | Score of the consensus call |
| GTS | Method(s) that support the consensus call |
| ALS | Number of times each bp difference was seen across all calls |
| INPUTS | Raw calls |
SCORE is calculated by aggregating quality information from the calls that are merged at each locus.
Using statSTR on EnsembleTR files
You can use statSTR from TRTools to compute various per-locus statistics for EnsembleTR VCF files.
For example, to compute per-locus allele frequency use the following command:
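To make the statistic itself concrete, here is a small sketch of how a per-locus allele frequency can be derived from diploid genotypes. It illustrates the definition only and is not statSTR's implementation; the function name is hypothetical, and alleles are assumed to be given as repeat copy numbers with None marking a missing call.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Map each allele to its frequency among called alleles at one locus.

    genotypes: iterable of (allele1, allele2) pairs, or None for a no-call.
    """
    counts = Counter(
        allele
        for gt in genotypes
        if gt is not None
        for allele in gt
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# Four samples at one locus, one of them uncalled.
calls = [(10, 10), (10, 12), None, (12, 13)]
print(allele_frequencies(calls))
```

Missing calls are excluded from the denominator, so frequencies always sum to 1 over the called alleles.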
EnsembleTR data releases
Archived datasets, including the Version II calls and other versions of haplotype panel files can be found here.
Version II of EnsembleTR calls on samples from 1000 Genomes Project and H3Africa
Chromosome 1 VCF file and tbi file
Chromosome 2 VCF file and tbi file
Chromosome 3 VCF file and tbi file
Chromosome 4 VCF file and tbi file
Chromosome 5 VCF file and tbi file
Chromosome 6 VCF file and tbi file
Chromosome 7 VCF file and tbi file
Chromosome 8 VCF file and tbi file
Chromosome 9 VCF file and tbi file
Chromosome 10 VCF file and tbi file
Chromosome 11 VCF file and tbi file
Chromosome 12 VCF file and tbi file
Chromosome 13 VCF file and tbi file
Chromosome 14 VCF file and tbi file
Chromosome 15 VCF file and tbi file
Chromosome 16 VCF file and tbi file
Chromosome 17 VCF file and tbi file
Chromosome 18 VCF file and tbi file
Chromosome 19 VCF file and tbi file
Chromosome 20 VCF file and tbi file
Chromosome 21 VCF file and tbi file
Chromosome 22 VCF file and tbi file
Version IV of reference SNP+TR haplotype panel for imputation of TR variants
These files contain:
TRs phased/imputed from 3,202 1kGP samples based on EnsembleTR calls.
There are in total 1,070,762 TRs and 70,692,015 SNPs/indels.
All the coordinates are based on hg38 human reference genome.
These files contain the same data as Version II, with the following updates to facilitate use in downstream imputation pipelines:
Remove TRs for which the REF allele does not match the expected sequence based on CHR:POS.
For each TR, remove alleles with 0 count.
If the reference allele has 0 count, keep the reference allele.
Remove TRs which have more than 100 alleles.
Remove TRs which have fewer than 2 alleles.
Remove the DS/GP fields, which are large and not used by downstream steps.
Add unique IDs for each TR of the format EnsTR:CHROM:POS. For TRs with the same CHR:POS, add the duplicate number of the TR following format: EnsTR:CHROM:POS:Duplicate_num. Duplicated loci with identical alleles are removed.
Add VT field, set to VT=TR for TRs and VT=OTHER for other variant types.
Add the bref format files, which have the same information as the VCFs but can improve Beagle imputation performance.
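Two of the panel updates, allele pruning and ID assignment, can be sketched as follows. This is an illustration under stated assumptions rather than the script used to build the panel: the function names are hypothetical, the allele-count limits are assumed to apply after zero-count alleles are pruned and to include REF, and duplicate numbers are assumed to start at 1 with every member of a duplicated CHR:POS group receiving a suffix.

```python
from collections import Counter


def prune_alleles(alleles, counts, max_alleles=100, min_alleles=2):
    """Drop zero-count alleles but always keep REF (alleles[0]).

    Returns the retained alleles, or None if the TR would be removed.
    """
    kept = [
        allele
        for i, (allele, count) in enumerate(zip(alleles, counts))
        if i == 0 or count > 0
    ]
    if len(kept) > max_alleles or len(kept) < min_alleles:
        return None
    return kept


def assign_ids(loci):
    """loci: list of (chrom, pos) in file order. Returns one ID per locus."""
    totals = Counter(loci)
    seen = Counter()
    ids = []
    for chrom, pos in loci:
        base = f"EnsTR:{chrom}:{pos}"
        if totals[(chrom, pos)] == 1:
            ids.append(base)
        else:
            # Assumption: numbering starts at 1 within each CHR:POS group.
            seen[(chrom, pos)] += 1
            ids.append(f"{base}:{seen[(chrom, pos)]}")
    return ids
```

For example, a TR whose only observed allele is a single ALT keeps REF plus that ALT, while a TR with only REF remaining falls below the two-allele minimum and is dropped.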
All file descriptions and download links can be found here. Data and links for each chromosome of the Version IV panel are also provided below.
Chromosome 1 [VCF] [tbi] [bref] SNPs/indels=5,759,060 TRs=92,378
Chromosome 2 [VCF] [tbi] [bref] SNPs/indels=6,088,598 TRs=91,137
Chromosome 3 [VCF] [tbi] [bref] SNPs/indels=4,983,185 TRs=75,243
Chromosome 4 [VCF] [tbi] [bref] SNPs/indels=4,875,465 TRs=69,327
Chromosome 5 [VCF] [tbi] [bref] SNPs/indels=4,536,819 TRs=66,492
Chromosome 6 [VCF] [tbi] [bref] SNPs/indels=4,315,217 TRs=65,940
Chromosome 7 [VCF] [tbi] [bref] SNPs/indels=4,137,254 TRs=59,422
Chromosome 8 [VCF] [tbi] [bref] SNPs/indels=3,886,222 TRs=55,144
Chromosome 9 [VCF] [tbi] [bref] SNPs/indels=3,165,513 TRs=44,189
Chromosome 10 [VCF] [tbi] [bref] SNPs/indels=3,495,473 TRs=51,640
Chromosome 11 [VCF] [tbi] [bref] SNPs/indels=3,423,341 TRs=49,603
Chromosome 12 [VCF] [tbi] [bref] SNPs/indels=3,332,788 TRs=55,887
Chromosome 13 [VCF] [tbi] [bref] SNPs/indels=2,509,179 TRs=35,720
Chromosome 14 [VCF] [tbi] [bref] SNPs/indels=2,290,400 TRs=36,203
Chromosome 15 [VCF] [tbi] [bref] SNPs/indels=2,109,285 TRs=32,338
Chromosome 16 [VCF] [tbi] [bref] SNPs/indels=2,362,361 TRs=35,452
Chromosome 17 [VCF] [tbi] [bref] SNPs/indels=2,073,624 TRs=38,382
Chromosome 18 [VCF] [tbi] [bref] SNPs/indels=1,963,845 TRs=28,446
Chromosome 19 [VCF] [tbi] [bref] SNPs/indels=1,670,692 TRs=33,536
Chromosome 20 [VCF] [tbi] [bref] SNPs/indels=1,644,384 TRs=25,745
Chromosome 21 [VCF] [tbi] [bref] SNPs/indels=1,002,753 TRs=12,894
Chromosome 22 [VCF] [tbi] [bref] SNPs/indels=1,066,557 TRs=15,644
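As a quick consistency check, the per-chromosome counts listed above can be summed and compared with the panel totals quoted earlier (1,070,762 TRs and 70,692,015 SNPs/indels). The dictionary below simply copies the numbers from the list; the variable names are illustrative.

```python
# (SNPs/indels, TRs) per chromosome, copied from the Version IV list.
PANEL_COUNTS = {
    1: (5_759_060, 92_378),
    2: (6_088_598, 91_137),
    3: (4_983_185, 75_243),
    4: (4_875_465, 69_327),
    5: (4_536_819, 66_492),
    6: (4_315_217, 65_940),
    7: (4_137_254, 59_422),
    8: (3_886_222, 55_144),
    9: (3_165_513, 44_189),
    10: (3_495_473, 51_640),
    11: (3_423_341, 49_603),
    12: (3_332_788, 55_887),
    13: (2_509_179, 35_720),
    14: (2_290_400, 36_203),
    15: (2_109_285, 32_338),
    16: (2_362_361, 35_452),
    17: (2_073_624, 38_382),
    18: (1_963_845, 28_446),
    19: (1_670_692, 33_536),
    20: (1_644_384, 25_745),
    21: (1_002_753, 12_894),
    22: (1_066_557, 15_644),
}

total_snps_indels = sum(snps for snps, _ in PANEL_COUNTS.values())
total_trs = sum(trs for _, trs in PANEL_COUNTS.values())

print(total_snps_indels)  # 70692015
print(total_trs)          # 1070762
```

Both sums agree with the stated totals, so the per-chromosome list covers the whole panel.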
Usage
Use Beagle to impute TRs into SNP data:
We have tested this with Beagle jar file beagle.27May24.118.jar. Earlier releases of Beagle 5.4 had problems imputing from this panel due to a file decompression issue.
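A Beagle run of the kind described here might be assembled as in the sketch below. The helper name, the memory setting and every file name except the tested jar are hypothetical; `gt=`, `ref=` and `out=` are standard Beagle arguments, and the reference panel is assumed to be one of the per-chromosome VCF or bref files listed above.

```python
def build_beagle_command(jar, gt, ref, out, java_mem="4g"):
    """Assemble an argv list for imputing TRs with Beagle (hypothetical helper)."""
    return [
        "java", f"-Xmx{java_mem}", "-jar", jar,
        f"gt={gt}",    # phased SNP genotypes of the target samples
        f"ref={ref}",  # SNP+TR reference panel for the same chromosome
        f"out={out}",  # output prefix
    ]


# Placeholder inputs, one chromosome at a time.
cmd = build_beagle_command(
    "beagle.27May24.118.jar",
    "target_snps.chr21.vcf.gz",
    "panel.chr21.vcf.gz",
    "imputed.chr21",
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Running one chromosome per invocation matches the way the panel files are split.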
Additional resources
Per-locus summary statistics can be downloaded from here. Each table has information on coordinates, repeat unit sequence, and potential overlap with genes listed in GENCODE v22 for repeats in the EnsembleTR catalog.
Population-specific per-locus statistics on allele frequency, heterozygosity, and the number of called samples can be found here. Statistics are computed using statSTR from the TRTools package.