GangSTR is a tool for genome-wide profiling of tandem repeats from short reads. A key advantage of GangSTR over existing genome-wide TR tools (e.g. lobSTR or HipSTR) is that it can handle repeats that are longer than the read length.
GangSTR takes aligned reads (BAM) and a set of repeats in the reference genome as input and outputs a VCF file containing genotypes for each locus.
The latest GangSTR release is available on the releases page.
For a list of TR references available, see references below.
Prerequisites
A recent C/C++ compiler supporting the C++11 standard
CMake version 3.16 or above
The following development files on the build system: libz-dev, libbz2-dev, and liblzma-dev (required by htslib)
Basic Install
If you are installing from the tarball (which for most purposes you should be), the following instructions will install all dependencies as well as GangSTR itself. Both UNIX and Mac OSX are supported.
If you are running as root:
tar -xzvf GangSTR-X.X.tar.gz
cd GangSTR-X.X
mkdir build
cd build
cmake ..
make
sudo cmake --install .
If you are installing locally (e.g. on a cluster where you don’t have root access):
tar -xzvf GangSTR-X.X.tar.gz
cd GangSTR-X.X
mkdir build
cd build
cmake ..
make
cmake --install . --prefix PREFIX
where PREFIX is a place you have write permissions. In most cases this will be your home directory, e.g. $HOME. If you install locally, make sure $PREFIX/bin is on your PATH.
Typing GangSTR --help should show a help message if GangSTR was successfully installed.
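The check above can also be scripted. The following is a minimal Python sketch, assuming only that the executable is named GangSTR and lives on your PATH; the helper names are illustrative and not part of GangSTR.

```python
import shutil
import subprocess
from typing import Optional


def find_gangstr(binary: str = "GangSTR") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary)


def help_works(binary: str = "GangSTR") -> bool:
    """Report whether `<binary> --help` can be launched at all."""
    path = find_gangstr(binary)
    if path is None:
        return False
    try:
        # Only launch failures count as errors here; the exit code is ignored
        # because some tools return non-zero after printing their help text.
        subprocess.run([path, "--help"], capture_output=True, check=False, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return True
```

If `find_gangstr()` returns None after a local install, the most likely cause is that $PREFIX/bin is not on your PATH.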
Compiling from git source
To compile from git source:
# Clone the repo
git clone https://github.com/gymreklab/GangSTR
cd GangSTR/
mkdir build
cd build
cmake ..
make
cmake --install . --prefix PREFIX
Install using conda
You can install GangSTR v2.5.0 using the conda (or mamba) package manager.
conda install -c bioconda -c conda-forge gangstr
Special thanks to the users in this thread for help in setting this up.
Basic usage
To run GangSTR with default parameters, supply the following required parameters:
--bam <file.bam,[file2.bam]> Comma separated list of input BAM files
--ref Reference genome (.fa)
--regions Target TR loci (regions) (.bed)
--out Output prefix
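As an illustration of how the four required parameters fit together, here is a small Python sketch that assembles a GangSTR command line. The file names (sample1.bam, ref.fa, regions.bed) and the output prefix are placeholders, and the helper function is hypothetical rather than part of GangSTR.

```python
import shlex
from typing import List, Sequence


def build_gangstr_command(bams: Sequence[str], ref: str, regions: str,
                          out_prefix: str, extra: Sequence[str] = ()) -> List[str]:
    """Assemble an argument list using the required GangSTR parameters."""
    if not bams:
        raise ValueError("at least one BAM file is required")
    cmd = [
        "GangSTR",
        "--bam", ",".join(bams),   # comma separated list of BAM files
        "--ref", ref,              # reference genome (.fa)
        "--regions", regions,      # target TR loci (.bed)
        "--out", out_prefix,       # output prefix
    ]
    cmd.extend(extra)              # any optional flags, passed through unchanged
    return cmd


cmd = build_gangstr_command(["sample1.bam"], "ref.fa", "regions.bed", "sample1_out")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with unusual file names.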
Additional general options:
--targeted Run GangSTR in targeted mode. Use this mode when targeting disease loci, as opposed to a genome-wide run.
--chrom <string> Only genotype regions on this chromosome.
--bam-samps <string> Comma separated list of sample IDs for --bam
--samp-sex <string> Comma separated list of sample sex for each sample ID (--bam-samps must be provided; see the notes on calling sex chromosomes below)
--str-info <string> Tab file with additional per-STR info (e.g., expansion cutoff; see below for format)
--period <string> Only genotype loci with periods (motif lengths) in this comma-separated list.
--skip-qscore Skip calculation of Q-score (see Q field in VCF output).
Options for different sequencing settings
--readlength <int> Preset read length (default: extract from alignments if not provided)
--coverage <float> Preset average coverage, should be set for exome/targeted data. Comma separated list to specify for each BAM. (default: calculate if not provided)
--model-gc-coverage Model coverage as a function of GC content. Requires genome-wide data. Experimental feature.
--insertmean <float> Fragment length mean. (default: calculate if not provided)
--insertsdev <float> Fragment length standard deviation. (default: calculate if not provided)
--nonuniform Indicates non-uniform coverage in the alignment file (e.g., exome sequencing data). Using this flag removes the likelihood term corresponding to FRR count.
--min-sample-reads <int> Minimum number of reads per sample.
Advanced parameters for likelihood model:
--frrweight <float> Reset weight for FRR class in likelihood model. (default 1.0)
--spanweight <float> Reset weight for Spanning class in likelihood model. (default 1.0)
--enclweight <float> Reset weight for Enclosing class in likelihood model. (default 1.0)
--flankweight <float> Reset weight for Flanking class in likelihood model. (default 1.0)
--ploidy [1,2] Haploid (1) or diploid (2) genotyping. (default 2)
--skipofftarget Skip off target regions included in the regions file.
--readprobmode Only use read probabilities in likelihood model. (ignore class probability)
--numbstrap <int> Number of bootstrap samples for calculating confidence intervals. (default 100)
--grid-theshold <int> Use optimization rather than grid search to find MLE if search space (grid) contains more alleles than this threshold. Default: 10000
--rescue-count <int> Number of regions that GangSTR attempts to rescue mates from (excluding off-target regions). Default: 0
--max-proc-read <int> Maximum number of processed reads per sample before a region is skipped.
Parameters for local realignment:
--minscore <int> Minimum alignment score for accepting reads (default 75).
--minmatch <int> Minimum matching basepairs required at the edge of the repeat region to accept flanking and enclosing reads (default 5).
Stutter model parameters:
--stutterup <float> Stutter insertion probability (default 0.05)
--stutterdown <float> Stutter deletion probability (default 0.05)
--stutterprob <float> Stutter step size parameter (default 0.90)
Parameters for more detailed info about each locus:
--output-readinfo Output a file containing extracted read information.
--output-bootstraps Output a file containing bootstrap samples.
--include-ggl Output GGL (special GL field) in VCF.
Additional optional parameters:
-h,--help display help screen
--quiet Don’t print out anything
--seed Random number generator initial seed
-v,--verbose Print progress information (major steps)
--very Print detailed progress information
--version Print out the version of this software
File formats
GangSTR takes as input a BAM file of short read alignments, a reference set of TRs, and a reference genome, and outputs genotypes in a VCF file. Each of these formats is described below.
BAM (--bam)
GangSTR requires a BAM file produced by an indel-sensitive aligner. The BAM file must be sorted and indexed e.g. by using samtools sort and samtools index. GangSTR currently only processes a single sample at a time.
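The sorting and indexing step can be scripted as follows. This is a sketch that only builds the two samtools command lines; the file names are placeholders, and it assumes samtools is installed and on your PATH.

```python
from typing import List, Tuple


def samtools_prepare_commands(input_bam: str, sorted_bam: str) -> Tuple[List[str], List[str]]:
    """Return the `samtools sort` and `samtools index` commands for one BAM file."""
    sort_cmd = ["samtools", "sort", "-o", sorted_bam, input_bam]
    index_cmd = ["samtools", "index", sorted_bam]  # index the sorted file
    return sort_cmd, index_cmd


sort_cmd, index_cmd = samtools_prepare_commands("sample1.bam", "sample1.sorted.bam")
```

Run the two commands in order, for example with `subprocess.run(sort_cmd, check=True)` followed by `subprocess.run(index_cmd, check=True)`, and pass the sorted file to --bam.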
FASTA Reference genome (--ref)
You must input a reference genome in FASTA format. This must be the same reference build used to align the sequences in the BAM file.
TR regions (--regions)
GangSTR requires a reference set of regions to genotype. This is a BED-like file with the following columns:
The name of the chromosome on which the STR is located
The start position of the STR on its chromosome
The end position of the STR on its chromosome
The motif length
The repeat motif
An optional 6th column may contain a comma-separated list of off-target regions for each TR. These are regions where misaligned reads for a given TR may be incorrectly mapped.
Below is an example file which contains 5 TR loci. Standard references for hg19 and GRCh38 can be obtained below.
NOTE: The table header is for descriptive purposes. The BED file should not have a header
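To make the column layout concrete, the Python sketch below formats loci into the tab-separated, header-free layout described above. The loci in the usage line are made up for illustration and are not taken from any GangSTR reference. The optional off-target column is passed through as an opaque string because its exact region syntax is not specified here.

```python
from typing import Iterable, List, NamedTuple


class TRLocus(NamedTuple):
    chrom: str
    start: int
    end: int
    motif: str
    off_target: str = ""  # optional 6th column, written only when non-empty


def format_regions(loci: Iterable[TRLocus]) -> str:
    """Format loci as BED-like lines: chrom, start, end, motif length, motif."""
    lines: List[str] = []
    for locus in loci:
        if locus.end < locus.start:
            raise ValueError(f"end before start for {locus.chrom}:{locus.start}")
        if not locus.motif:
            raise ValueError("motif must not be empty")
        fields = [locus.chrom, str(locus.start), str(locus.end),
                  str(len(locus.motif)), locus.motif]  # motif length derived from motif
        if locus.off_target:
            fields.append(locus.off_target)
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"  # no header line


# Hypothetical loci, for illustration only.
text = format_regions([TRLocus("chr1", 1000, 1011, "GT"), TRLocus("chr2", 5000, 5014, "CAG")])
```

Deriving the motif length from the motif itself keeps columns 4 and 5 consistent by construction.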
--str-info
A tab-delimited file with the following header and format can be used to specify additional per-locus information.
GangSTR currently supports an expansion threshold through --str-info. The threshold is specified in number of repeat copies, and it is used to calculate the expansion probability (see the QEXP field in the VCF format).
Note: Each locus in this file must be unique; remove any duplicates.
| chrom | pos | end | thresh |
|---|---|---|---|
| chr1 | 26454 | 26465 | 50 |
| chr1 | 31556 | 31570 | 20 |
| chr1 | 35489 | 35504 | 25 |
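A file in this layout can be generated with a few lines of Python. The sketch below writes the header and rows shown above and rejects duplicate loci; the function name is hypothetical, and the duplicate check is one possible way to enforce the uniqueness requirement.

```python
from typing import Iterable, List, Set, Tuple

StrInfoRow = Tuple[str, int, int, int]  # chrom, pos, end, thresh


def format_str_info(rows: Iterable[StrInfoRow]) -> str:
    """Format per-locus expansion thresholds as a tab-delimited table with a header."""
    seen: Set[Tuple[str, int, int]] = set()
    lines: List[str] = ["\t".join(["chrom", "pos", "end", "thresh"])]
    for chrom, pos, end, thresh in rows:
        key = (chrom, pos, end)
        if key in seen:
            raise ValueError(f"duplicate locus {chrom}:{pos}-{end}")
        seen.add(key)
        lines.append("\t".join([chrom, str(pos), str(end), str(thresh)]))
    return "\n".join(lines) + "\n"


text = format_str_info([
    ("chr1", 26454, 26465, 50),
    ("chr1", 31556, 31570, 20),
    ("chr1", 35489, 35504, 25),
])
```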
VCF (output)
For more information on VCF file format, see the VCF spec. In addition to standard VCF fields, GangSTR adds custom fields described below.
INFO fields
INFO fields contain aggregated statistics about each TR. The following custom fields are added:
| FIELD | DESCRIPTION |
|---|---|
| END | End position of the TR |
| PERIOD | Length of the repeat unit |
| GRID | Range of the optimization grid. Gives min and max repeat copy number considered |
| EXPTHRESH | The threshold copy number used to test for repeat expansions |
| STUTTERUP | The model probability to observe a stutter error increasing the repeat number |
| STUTTERDOWN | The model probability to observe a stutter error decreasing the repeat number |
| STUTTERP | The geometric parameter for modeling the stutter step size distribution |
| RU | Repeat motif |
| REF | Reference copy number (number of repeat units) |
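To show how the three stutter fields could combine, the Python sketch below implements a generic geometric stutter model: an error adds copies with probability STUTTERUP, removes copies with probability STUTTERDOWN, and the size of the change follows a geometric distribution with parameter STUTTERP. This is an assumed textbook form for illustration and may differ from the exact likelihood GangSTR uses. The values 0.05, 0.05, and 0.90 are illustrative defaults.

```python
def stutter_prob(delta: int, up: float = 0.05, down: float = 0.05,
                 step: float = 0.90) -> float:
    """Probability that a read shows `delta` extra repeat copies due to stutter.

    delta > 0 is an insertion, delta < 0 a deletion, delta == 0 no stutter.
    """
    if delta == 0:
        return 1.0 - up - down
    direction = up if delta > 0 else down
    k = abs(delta)
    # Geometric step size: P(|delta| = k) = step * (1 - step)^(k - 1)
    return direction * step * (1.0 - step) ** (k - 1)
```

With these values most of the stutter mass sits on single-copy changes, and each additional copy is ten times less likely than the previous one.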
FORMAT fields
FORMAT fields contain information specific to each genotype call. The following custom fields are added:
| FIELD | DESCRIPTION |
|---|---|
| GT | Genotype |
| DP | Read depth (number of informative reads) |
| Q | Quality score |
| REPCN | Genotype given in number of copies of the repeat motif |
| REPCI | 95% confidence interval for each allele based on bootstrapping |
| RC | Number of reads in each class (enclosing, spanning, FRR, flanking) |
| ENCLREADS | Summary of reads in the enclosing class as pipe-separated (\|) key-value pairs. Keys are number of copies and values show number of reads with that many copies. |
| FLNKREADS | Summary of reads in the flanking class as pipe-separated (\|) key-value pairs. Keys are number of copies and values show number of reads with that many copies. |
| ML | Maximum likelihood |
| INS | Insert size mean and standard deviation at the locus |
| STDERR | Bootstrap standard error of each allele |
| QEXP | Probability of no expansion, 1 expanded allele, both expanded alleles |
| GGL | Genotype likelihood of all pairs of alleles in the search space. Formatted similarly to standard GL fields but with the allele space defined by the INFO/GRID field |
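The per-class read fields are easy to unpack once extracted from a VCF record. The Python sketch below parses RC, ENCLREADS, and FLNKREADS values. It assumes that RC values are comma-separated in the order listed above, and that each key-value pair joins key and value with a comma; the pair separator is a parameter in case your output differs. The sample strings in the usage lines are invented.

```python
from typing import Dict

READ_CLASSES = ("enclosing", "spanning", "frr", "flanking")


def parse_rc(value: str) -> Dict[str, int]:
    """Map an RC string such as '20,15,0,3' to per-class read counts."""
    counts = [int(x) for x in value.split(",")]
    if len(counts) != len(READ_CLASSES):
        raise ValueError(f"expected {len(READ_CLASSES)} counts, got {len(counts)}")
    return dict(zip(READ_CLASSES, counts))


def parse_read_summary(value: str, kv_sep: str = ",") -> Dict[int, int]:
    """Parse ENCLREADS/FLNKREADS into {copy number: read count}.

    Pairs are separated by '|'; kv_sep (assumed to be ',') splits key from value.
    """
    if value in ("", "."):  # empty or VCF missing value
        return {}
    summary: Dict[int, int] = {}
    for pair in value.split("|"):
        copies, reads = pair.split(kv_sep)
        summary[int(copies)] = int(reads)
    return summary


rc = parse_rc("20,15,0,3")
flanking = parse_read_summary("6,1|8,1|11,1")
```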
Q: Quality score of the estimated alleles (REPCN), between 0 and 1. It measures GangSTR's confidence in short allele calls (shorter than the read length). It is the likelihood of the maximum likelihood genotype divided by the sum of the likelihoods of all possible genotypes, which can be interpreted as a posterior probability under a uniform prior over all possible genotypes. Calculating the Q-score can be slow if the estimation search space (grid) is large. To skip this step, use the --skip-qscore option.
STDERR: Standard error of the estimated alleles, computed using the bootstrap method.
QEXP: Given the estimated alleles, the likelihood plane, and an expansion threshold, this field shows three numbers: the probability that both alleles are smaller than the threshold, that one allele is larger and one smaller than the threshold, and that both alleles are larger than the threshold. The expansion threshold should be provided using the --str-info option.
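The Q and QEXP definitions above can be illustrated with a toy likelihood plane. The Python sketch below normalizes genotype log-likelihoods into posteriors under a uniform prior, takes the maximum as Q, and sums posteriors by the number of alleles above the threshold for QEXP. It assumes each unordered genotype appears once in the input and treats an allele as expanded only when it is strictly larger than the threshold. Both are assumptions, and the numbers are invented.

```python
import math
from typing import Dict, Tuple

Genotype = Tuple[int, int]  # repeat copy numbers of the two alleles


def genotype_posteriors(log_liks: Dict[Genotype, float]) -> Dict[Genotype, float]:
    """Normalize log-likelihoods into posteriors assuming a uniform prior."""
    peak = max(log_liks.values())  # subtract the peak for numerical stability
    weights = {g: math.exp(ll - peak) for g, ll in log_liks.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def q_score(log_liks: Dict[Genotype, float]) -> float:
    """Likelihood of the best genotype divided by the sum over all genotypes."""
    return max(genotype_posteriors(log_liks).values())


def expansion_probs(log_liks: Dict[Genotype, float],
                    threshold: int) -> Tuple[float, float, float]:
    """Return P(no allele expanded), P(one expanded), P(both expanded)."""
    probs = [0.0, 0.0, 0.0]
    for (a, b), p in genotype_posteriors(log_liks).items():
        n_expanded = int(a > threshold) + int(b > threshold)
        probs[n_expanded] += p
    return probs[0], probs[1], probs[2]


# Toy likelihood plane with three candidate genotypes (invented values).
toy = {(10, 10): math.log(0.6), (10, 30): math.log(0.3), (30, 30): math.log(0.1)}
q = q_score(toy)
none, one, both = expansion_probs(toy, threshold=20)
```

The cost of the normalization grows with the number of genotypes in the grid, which is why skipping the Q-score can save time on large search spaces.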
Read info file (output)
With --output-readinfo, GangSTR generates a file with the .readinfo.tab extension containing information from the reads extracted for each locus. The columns are ordered as follows:
| Column number | Description |
|---|---|
| 1 | Chromosome |
| 2 | Repeat start position |
| 3 | Repeat end position |
| 4 | Read ID (originated from BAM file) |
| 5 | Read class {**} |
| 6 | Read class data field {**} |
| 7 | Found mate (boolean flag) |
{**} Read class codes and their corresponding data field
Each read in the .readinfo.tab file belongs to one of 5 classes. The following table shows what each read class code means and how to interpret the read class data field column. For more information on read classes, please refer to the manuscript: https://doi.org/10.1093/nar/gkz501.
| Read Class Code | Description | Data field |
|---|---|---|
| SPAN | Spanning read pair | Fragment length (insert size) of the spanning read pair |
| SPFLNK | A flanking read that creates a spanning read pair with its mate | Number of repeat copies on the flanking read |
| BOUND | A flanking read | Number of repeat copies on the flanking read |
| ENCLOSE | Enclosing read | Number of repeat copies enclosed in the read |
| FRR | Fully repetitive read | Distance of mate from the repeat region (set to -(read_length) if mate is also FRR) |
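A short Python sketch for reading this file is shown below. It assumes the columns are tab-separated, as the .tab extension suggests, and keeps the data field and the found-mate flag as raw strings because their exact encoding is not documented here. The example lines are invented.

```python
from collections import Counter
from typing import Iterable, NamedTuple

READ_CLASSES = {"SPAN", "SPFLNK", "BOUND", "ENCLOSE", "FRR"}


class ReadInfo(NamedTuple):
    chrom: str
    start: int
    end: int
    read_id: str
    read_class: str
    data: str        # meaning depends on read_class (see the table above)
    found_mate: str  # boolean flag, kept as written in the file


def parse_readinfo_line(line: str) -> ReadInfo:
    """Parse one line of a .readinfo.tab file into a ReadInfo record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    chrom, start, end, read_id, read_class, data, found_mate = fields
    if read_class not in READ_CLASSES:
        raise ValueError(f"unknown read class: {read_class}")
    return ReadInfo(chrom, int(start), int(end), read_id, read_class, data, found_mate)


def count_read_classes(lines: Iterable[str]) -> Counter:
    """Count reads per class, skipping blank lines."""
    return Counter(parse_readinfo_line(l).read_class for l in lines if l.strip())


example = ["chr1\t100\t111\tread1\tENCLOSE\t6\t1", "chr1\t100\t111\tread2\tFRR\t-150\t0"]
counts = count_read_classes(example)
```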
GangSTR reference files
The following lists available references created using Tandem Repeats Finder. We update the reference periodically with additional loci or annotation changes. Note references must be unzipped before using with GangSTR. The file listed in bold is the current recommended version.
Calling on sex chromosomes
You can call TRs on chrX and chrY using a combination of --bam-samps and --samp-sex. --samp-sex is a list of sex assignments (‘F’ or ‘M’) for the list of samples in --bam-samps, in the same order. For example, if sample1 and sample2 are male and female respectively, pass --bam-samps sample1,sample2 --samp-sex M,F.
Currently, GangSTR is not capable of extracting sample sex automatically.
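Because the two lists must stay in the same order, it can help to build them from a single mapping. The Python sketch below does this; the helper name is hypothetical and the sample IDs are the ones from the example above.

```python
from typing import Dict, List


def sex_chromosome_args(sample_sex: Dict[str, str]) -> List[str]:
    """Build matching --bam-samps and --samp-sex arguments from one mapping."""
    for sample, sex in sample_sex.items():
        if sex not in ("M", "F"):
            raise ValueError(f"sex for {sample} must be 'M' or 'F', got {sex!r}")
    samples = list(sample_sex)  # dicts preserve insertion order
    return [
        "--bam-samps", ",".join(samples),
        "--samp-sex", ",".join(sample_sex[s] for s in samples),
    ]


args = sex_chromosome_args({"sample1": "M", "sample2": "F"})
```

The returned list can be appended to the rest of a GangSTR argument list.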
GangSTR
Manuscript: https://doi.org/10.1093/nar/gkz501
For questions on installation or usage, please open an issue, submit a pull request, or contact Nima Mousavi (mousavi@ucsd.edu).
For advanced topics such as those below, see the GangSTR wiki.
A Docker image with GangSTR and the dumpSTR filtering tool installed is available as gymreklab/str-toolkit on Docker Hub.
The references below contain pre-defined off-target loci for target pathogenic loci (hg38 coordinates):
Non-human reference builds:
GangSTR callsets
GangSTR callsets on publicly available datasets.