TOMM40_WGS: Genotyping TOMM40’523 Poly-T Polymorphisms from Whole-Genome Sequencing
TOMM40_WGS is a Snakemake pipeline that enables automated genotyping of the TOMM40’523 poly-T repeat from WGS data. It integrates STR genotyping tools, k-mer analysis, and machine learning to provide accurate repeat length and genotype predictions.
Background
The TOMM40’523 poly-T repeat polymorphism (rs10524523) is a variable-length poly-T repeat located in intron 6 of the TOMM40 gene on chromosome 19. This polymorphism has been associated with age of onset of Alzheimer’s disease (AD) and cognitive decline.
The repeat length is typically categorized into three classes:
Clone the repository and create a testing directory
git clone https://github.com/RushAlz/TOMM40_WGS.git
chmod u+x TOMM40_WGS/TOMM40_WGS
# Make a temporary working directory
mkdir -p tomm40_wgs_test
cd tomm40_wgs_test
Test TOMM40_WGS with GRCh38 CRAMs
# Download the reference genome
REF_FASTA="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa"
REF_FASTA_LOCAL=$PWD/"GRCh38_full_analysis_set_plus_decoy_hla.fa"
wget -O ${REF_FASTA_LOCAL} ${REF_FASTA}
wget -O ${REF_FASTA_LOCAL}.fai ${REF_FASTA}.fai
# Subset the CRAM file to the TOMM40 region
CRAMFILE_1="https://ftp.sra.ebi.ac.uk/vol1/run/ERR324/ERR3240114/HG00096.final.cram"
CRAM_1_LOCAL=$PWD/"HG00096.GRCh38.cram"
samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa -C ${CRAMFILE_1} "chr19:44399792-45399826" > ${CRAM_1_LOCAL}
samtools index ${CRAM_1_LOCAL}
# Download a second CRAM file for testing
CRAMFILE_2="https://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239334/NA12878.final.cram"
CRAM_2_LOCAL=$PWD/"NA12878.GRCh38.cram"
samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa -C ${CRAMFILE_2} "chr19:44399792-45399826" > ${CRAM_2_LOCAL}
samtools index ${CRAM_2_LOCAL}
# Run the pipeline specifying the input files as a glob (quotes are required)
../TOMM40_WGS/TOMM40_WGS --input_wgs "*.cram" \
--ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
--output_dir results_GRCh38 \
--cores 2
# Alternatively, run the pipeline specifying the input files as a TSV file
echo -e "sample_id\tpath\nHG00096\t${CRAM_1_LOCAL}\nNA12878\t${CRAM_2_LOCAL}" > samples.tsv
../TOMM40_WGS/TOMM40_WGS --input_table samples.tsv \
--ref_fasta GRCh38_full_analysis_set_plus_decoy_hla.fa \
--output_dir results_GRCh38 \
--cores 2
# Check results
cat results_GRCh38/predictions/tomm40_predictions_summary.tsv | column -t
Test TOMM40_WGS with b37 BAMs
# Download the reference genome
REF_FASTA="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
REF_FAI="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai"
wget ${REF_FASTA}
wget ${REF_FAI}
REF_FASTA_LOCAL=$PWD/"hs37d5.fa.gz"
# Subset the BAM file to the TOMM40 region
BAMFILE_1="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/high_coverage_alignment/HG00096.wgs.ILLUMINA.bwa.GBR.high_cov_pcr_free.20140203.bam"
BAM_1_LOCAL=$PWD/"HG00096.hg37.bam"
samtools view -h -b ${BAMFILE_1} "19:44903049-45903083" > ${BAM_1_LOCAL}.tmp
samtools addreplacerg -r '@RG\tID:HG00096\tSM:HG00096\tLB:lib1' -o ${BAM_1_LOCAL} ${BAM_1_LOCAL}.tmp
samtools index ${BAM_1_LOCAL}
# Download a second CRAM file for testing
BAMFILE_2="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/high_coverage_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam"
BAM_2_LOCAL=$PWD/"NA12878.hg37.bam"
samtools view -h -b ${BAMFILE_2} "19:44903049-45903083" > ${BAM_2_LOCAL}.tmp
samtools addreplacerg -r '@RG\tID:NA12878\tSM:NA12878\tLB:lib1' -o ${BAM_2_LOCAL} ${BAM_2_LOCAL}.tmp
samtools index ${BAM_2_LOCAL}
# Run the pipeline specifying the input files as a glob (quotes are required)
../TOMM40_WGS/TOMM40_WGS --input_wgs "*.bam" \
--ref_fasta hs37d5.fa.gz \
--genome_build b37 \
--output_dir results_b37 \
--cores 2
# Check results
cat results_b37/predictions/tomm40_predictions_summary.tsv | column -t
Pipeline Overview
The pipeline performs the following steps for each input sample:
Subset CRAM/BAM to the TOMM40 region
STR genotyping using:
HipSTR
GangSTR
ExpansionHunter
K-mer extraction from poly-T region using Jellyfish
Feature processing using R scripts
ML-based prediction of repeat lengths and TOMM40 genotype class
Aggregate predictions into a summary table
Input Configuration
Input data can be specified in two ways:
Using command line parameters:
--input_wgs: A single file, a glob pattern (must be quoted, e.g., "*.cram")
--input_table: Path to a TSV file with sample_id and path columns
Using configuration file:
input_glob: A glob pattern to CRAM/BAM files
input_table: Path to a TSV file with sample_id and path columns
When both --input_wgs and --input_table are specified, --input_table takes precedence.
TOMM40_WGS is licensed under the GPL-2.0 License - see the LICENSE file for details.
Important licensing information is available here.
Citation
If you use TOMM40_WGS, please cite:
Vialle RA et al. (2025). Genotyping TOMM40’523 poly-T polymorphisms using whole-genome sequencing. Human Genetics and Genomics Advances. https://doi.org/10.1016/j.xhgg.2025.100488
TOMM40_WGS: Genotyping TOMM40’523 Poly-T Polymorphisms from Whole-Genome Sequencing
TOMM40_WGSis a Snakemake pipeline that enables automated genotyping of the TOMM40’523 poly-T repeat from WGS data. It integrates STR genotyping tools, k-mer analysis, and machine learning to provide accurate repeat length and genotype predictions.Background
The TOMM40’523 poly-T repeat polymorphism (
rs10524523) is a variable-length poly-T repeat located in intron 6 of the TOMM40 gene on chromosome 19. This polymorphism has been associated with age of onset of Alzheimer’s disease (AD) and cognitive decline.The repeat length is typically categorized into three classes:
🚀 Quick Start
Software Dependencies
Install with bioconda
Use the
TOMM40_WGSlauncher script to run the pipeline with minimal setup.Alternatively: Create a conda environment
Clone the repository and install dependencies
Basic usage (recommended)
This script wraps Snakemake, applies default values when needed, and accepts any extra Snakemake flags (e.g.,
--dryrun).End-to-end testing with real data (data from 1000 Genomes Project)
Reference for the data
Pipeline Overview
The pipeline performs the following steps for each input sample:
Input Configuration
Input data can be specified in two ways:
Using command line parameters:
--input_wgs: A single file, a glob pattern (must be quoted, e.g.,"*.cram")--input_table: Path to a TSV file withsample_idandpathcolumnsUsing configuration file:
input_glob: A glob pattern to CRAM/BAM filesinput_table: Path to a TSV file withsample_idandpathcolumnsWhen both
--input_wgsand--input_tableare specified,--input_tabletakes precedence.input_tableformat:genome_build: Specify the genome build according to the reference genome used for alignment. The default isGRCh38.Supported builds include:
Important currently only GRCh38 has been extensively tested.
Outputs
Each sample generates:
results/GangSTR/,HipSTR/,ExpansionHunter/: Genotyping VCFs for each STR toolresults/jellyfish/{sample}.polyT_kmer.txt: k-mer countsresults/features/{sample}_features.txt: Combined features for MLresults/predictions/{sample}.tomm40_prediction.txt: Length/genotype predictions for {sample}results/predictions/tomm40_predictions_summary.tsv: Combined summaryLicense
TOMM40_WGS is licensed under the GPL-2.0 License - see the LICENSE file for details. Important licensing information is available here.
Citation
If you use TOMM40_WGS, please cite: