Please have a look at Haplogrep 3 which also includes a command-line version and uses the same core functionality as this version! This repository is not maintained anymore.
Download and Install
curl -sL haplogrep.now.sh | bash
./haplogrep
If you want to use our web service, please click here.
Classify assigns input profiles (hsd, FASTA, VCF) to haplogroups.
Distance calculates the distance between two haplogroups.
HaploGrep Classify
Run HaploGrep Classification with test data
# Download test data
wget https://github.com/seppinho/haplogrep-cmd/raw/master/test-data/vcf/HG00097.vcf.gz
# Run Haplogrep Classification
./haplogrep classify --in HG00097.vcf.gz --format vcf --out haplogroups.txt
Input File Formats
VCF or Fasta
The recommended input format is a single-sample/multi-sample VCF (*.vcf.gz or *.vcf).
FASTA
For alignment, bwa version 0.7.17 is used. For each input sequence, HaploGrep excludes positions from the tested range that are (1) not covered by the input fragment or (2) marked with an N in the sequence.
hsd Format
You can also specify your profiles in the original HaploGrep hsd format, which is a simple tab-delimited file format consisting of 4 columns (ID, Range, Haplogroup and Polymorphisms).
For readability, the polymorphisms are also tab-delimited, so a row can have more than 4 columns. An hsd example can be found here.
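As a rough illustration of the layout described above, the following sketch writes a one-sample hsd file and then classifies it if the haplogrep launcher is present. The sample ID, the `?` placeholder in the haplogroup column, and the three polymorphisms are made-up illustration values, not taken from the HaploGrep test data. Whether a header row is expected is not stated here, so the sketch writes none.

```shell
# Sketch: write a minimal tab-delimited hsd profile.
# Assumptions (illustrative only): the sample name, the "?" haplogroup
# placeholder, and the listed polymorphisms.
hsd_file="example.hsd"

# Columns: ID, Range, Haplogroup, then one polymorphism per further column.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  "Sample1" "1-16569" "?" "73G" "263G" "315.1C" > "$hsd_file"

# Count the tab-delimited columns of the first row (should be >= 4).
hsd_columns=$(awk -F'\t' 'NR == 1 { print NF }' "$hsd_file")
echo "columns: $hsd_columns"

# Classify the profile only if the haplogrep launcher exists in this directory.
if [ -x ./haplogrep ]; then
  ./haplogrep classify --in "$hsd_file" --format hsd --out haplogroups.txt
fi
```

The classify call reuses only the --in, --format, and --out parameters documented below.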
Required Parameters
| Parameter | Description |
| --- | --- |
| --in | Please provide the input file name |
| --format | Please provide the input format of your data. Valid options are hsd, vcf, or fasta |
| --out | Please provide an output name |
Additional Parameters
| Parameter | Description |
| --- | --- |
| --rsrs | By default, HaploGrep expects that your data is aligned against rCRS (which is included in the human references hg19 and hg38). If your data is aligned against RSRS, add the --rsrs parameter (Default: off). Please read this blog post carefully before adding this option. |
| --metric | To change the classification metric to Hamming Distance (hamming) or Jaccard (jaccard), add the --metric parameter (Default: Kulczynski Measure). |
| --extend-report | For additional information on mtSNPs (e.g. found or remaining polymorphisms), please add the --extend-report flag (Default: off). |
| --phylotree | The Phylotree version can be changed using the --phylotree parameter (Default: 17_FU1, allowed numbers from 10,11,12,..,17 (latest version)). |
| --chip | If you are using genotyping arrays, please add the --chip parameter to limit the range to array SNPs only (Default: off, VCF only). To get the same behaviour for hsd files, add to the range only the variants that are included on the array or in the region you have sequenced (e.g. control region). Range entries are separated by a semicolon (;); both ranges and single positions are allowed (e.g. 16024-16569;1-576;8860). |
| --skip-alignment-rules | Add this option to skip our rules that fix the mtDNA nomenclature for fasta import. Click here for further information. Applying the rules is the default since v2.4.0. |
| --hits | To export the best n hits for each sample, add the --hits parameter. By default, only the top hit is exported. |
| --lineage | Create a graph of all input samples by using the --lineage parameter (Default: 0). 0 = no tree, 1 = tree with genotypes, 2 = only structure, no genotypes. The output is a Graphviz DOT file. You can then use Graphviz (sudo apt-get install graphviz) to convert the DOT file to, e.g., a PDF (dot <dot-file> -Tpdf > graph.pdf). |
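The range syntax from the --chip description (semicolon-separated ranges and single positions) can be checked before it is written into an hsd file. The sketch below is only a syntax check based on that description. `valid_range` is a hypothetical helper, not a HaploGrep command, and it does not verify that positions fall within the mitochondrial genome.

```shell
# Sketch: syntax check for a range string such as "16024-16569;1-576;8860".
# valid_range is a hypothetical helper name, not part of HaploGrep.
valid_range() {
  # Each entry is either a single position (digits) or start-end,
  # and entries are joined by semicolons.
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(-[0-9]+)?(;[0-9]+(-[0-9]+)?)*$'
}

for range in "16024-16569;1-576;8860" "16024-;1-576"; do
  if valid_range "$range"; then
    echo "ok: $range"
  else
    echo "malformed: $range"
  fi
done
```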
HaploGrep Distance
This tool calculates the distance between two haplogroups.
Required Parameters
| Parameter | Description |
| --- | --- |
| --in | Input file must include 2 columns named “hg1” and “hg2”, separated by “;” |
| --out | Output location of distance file |
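A minimal sketch of an input file for the distance tool, following the column description above. The haplogroup pairs are illustrative values, and the subcommand name `distance` is an assumption made by analogy with `classify`; please check the tool's help output before relying on it.

```shell
# Sketch: build a two-column, semicolon-separated input file.
# Assumptions: the haplogroup pairs are examples only, and "distance"
# as the subcommand name is inferred by analogy with "classify".
dist_in="pairs.txt"

{
  printf 'hg1;hg2\n'
  printf 'H1a;H1b\n'
  printf 'H1a;U5a1\n'
} > "$dist_in"

# Report the header and the number of haplogroup pairs.
dist_header=$(head -n 1 "$dist_in")
dist_pairs=$(( $(wc -l < "$dist_in") - 1 ))
echo "header: $dist_header, pairs: $dist_pairs"

# Run the tool only if the haplogrep launcher exists in this directory.
if [ -x ./haplogrep ]; then
  ./haplogrep distance --in "$dist_in" --out distances.txt
fi
```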
mtDNA reference sequences
Several mtDNA references exist; HaploGrep supports rCRS and RSRS. Please check out our blog post to learn more about this topic.
Genotyping arrays
If you are using HaploGrep for genotyping array data, please have a look at the --chip parameter above.
mtDNA Nomenclature
When using fasta as an input format, HaploGrep uses bwa mem to align data. Since the mitochondrial phylogeny uses a 3′ alignment, indels are often not placed correctly for haplogroup classification when using a standard aligner designed for nuclear DNA. In some cases where haplogroup-defining indels are expected (e.g. missing 8281d-8289d), this can lead to a lower haplogroup quality. To adjust for that, we provide a set of currently 66 rules that can be applied prior to classification. The rules have been estimated based on 7,848 fasta files in 4 steps:
1. Downloading Phylotree defining sequences from GenBank
2. Aligning data with bwa mem
3. Classifying the profiles using HaploGrep
4. Comparing final fasta profiles with the Phylotree input profiles (remaining vs. not found) in a txt format (derived from parsing Phylotree).
For example, the following rule changes the input polymorphisms 309.1CCT 310C to 309.1C 309.2C 315.1C.
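To make the effect of such a rule concrete, the sketch below rewrites a space-separated profile with sed. It only mimics the outcome of the example rule. HaploGrep's actual rule format and matching logic are not described here, and `apply_rule` and the sample profile are hypothetical.

```shell
# Sketch: mimic the example rule on a space-separated profile.
# apply_rule is a hypothetical helper; HaploGrep's real rule engine
# is not shown in this document.
apply_rule() {
  printf '%s\n' "$1" | sed 's/309\.1CCT 310C/309.1C 309.2C 315.1C/'
}

profile="263G 309.1CCT 310C 750G"   # made-up input profile
fixed=$(apply_rule "$profile")
echo "$fixed"
```

Profiles that do not contain the pattern pass through unchanged.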
Heteroplasmies (VCF only)
Heteroplasmies are often stored as heterozygous genotypes (0/1). If an AF tag (= Allele Frequency) is specified in the VCF file, we add variants with an AF > 0.90 to the input profile. Mutation Server is able to create a valid VCF including heteroplasmies starting from BAM or CRAM.
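The threshold described above can be previewed with a small awk filter. This is a sketch, not HaploGrep's own code: it assumes a single-sample VCF in which AF is a key of the FORMAT column, and the three records are invented for illustration.

```shell
# Sketch: list heterozygous variants with AF > 0.90 from a single-sample VCF.
# Assumptions: AF is stored in the FORMAT column, and the records
# below are invented example data.
vcf_file="het_example.vcf"

{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\n'
  printf 'chrM\t152\t.\tT\tC\t.\tPASS\t.\tGT:AF\t0/1:0.95\n'
  printf 'chrM\t195\t.\tT\tC\t.\tPASS\t.\tGT:AF\t0/1:0.12\n'
  printf 'chrM\t16189\t.\tT\tC\t.\tPASS\t.\tGT:AF\t0/1:0.45\n'
} > "$vcf_file"

kept=$(awk -F'\t' '
  /^#/ { next }                       # skip meta and header lines
  {
    n = split($9, keys, ":")          # FORMAT keys, e.g. GT:AF
    split($10, vals, ":")             # values of the first sample
    for (i = 1; i <= n; i++)
      if (keys[i] == "AF" && vals[i] + 0 > 0.90)
        print $2 $5                   # position followed by the ALT base
  }
' "$vcf_file")

echo "$kept"
```

Only the variant at position 152 exceeds the threshold in this example.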
Related work
Please have a look at mitoverse to check for heteroplasmies and contamination in your NGS data.
Phylogenetic Trees
The haplogroup classifications in Haplogrep are based on the revised tree by Dür et al, 2021, which is an update of the latest PhyloTree version 17 by van Oven, 2016 based on the work of van Oven & Kayser, 2009.
Cite us
If you use HaploGrep, please cite our latest Haplogrep2 paper or the initial Haplogrep paper.
Additionally, please cite (1) Dür et al, 2021 if you use the latest Phylotree17_FU1 tree, (2) van Oven, 2016 for PhyloTree 17, or (3) van Oven & Kayser, 2009 if an older PhyloTree version has been used.
Available Tools
Currently two tools are available: Classify and Distance.
Blog
Check out our blog regarding mtDNA topics.
Contact
Sebastian Schoenherr (@seppinho)
Hansi Weissensteiner (@haansi)
Institute of Genetic Epidemiology, Medical University of Innsbruck