MultiVCFAnalyzer is a SNP filtering and SNP alignment generation tool, designed around (but not limited to) low coverage ancient DNA data. MultiVCFanalyzer reads multiple VCF files as produced by GATK UnifiedGenotyper, performs filtering based on a number of criteria, and provides the combined genotype calls in a number of formats that are suitable for follow-up analyses such as phylogenetic reconstruction, SNP effect analyses, population genetic analyses etc.
Furthermore, the results are provided in the form of various tables for manual inspection and presentation/publication purposes.
Citation
If you use MultiVCFAnalyzer please cite:
Bos, Harkins, Herbig, Coscolla et al. ‘Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis’ Nature 514, 494–497 (23 October 2014) doi:10.1038/nature13591
A more detailed description of the program can be seen in the supplementary information under “SNP Calling and Phylogenetic Analysis”.
The tool is a java program and requires openJDK 8. Input VCF files must be generated from GATK UnifiedGenotyper (<= 3.5), and ploidy must be set to 2 (to give allele frequency values).
To get the help message run
java -jar MultiVCFAnalyzer_X-XX-X.jar <OPTIONS>
Please start the program with the following parameters in strictly this order:
SNP effect analysis result file (from SnpEff; txt format) [OPTIONAL]
Reference genome file (fasta) - the same as used for VCF construction
Reference genome gene annotation (gff) [OPTIONAL]
Output directory - location of where to put output files
Write allele frequencies (‘T’ or ‘F’) - whether to include the percentage of reads a given allele is present in in the SNP table e.g. A (70%). In haploid microbial contexts, this can be used to assess cross-strain mapping.
Minimal genotyping quality (GATK) - a threshold of which a SNP call falling under is ‘discarded’
Minimal coverage for base call - the minimum number of a reads that a position must be covered by to be reported
Minimal allele frequency for homozygous call - the fraction of reads a base must have to be called ‘homozygous’
Minimal allele frequency for heterozygous call - a fraction of which whereby if a call falls above this value, and lower than the homozygous threshold, a base will be called ‘heterozygous’ and reported with a IUPAC uncertainity code
List of positions to exclude (gff) [OPTIONAL] - a file listing positions that will be ‘filtered’ (i.e. ignored)
[vcf_files …] input vcf files as generated by the GATK UnifiedGenotyper
To omit an optional input file put NA as the file name.
Example
The following is an example of running MultiVCFAnalyzer requiring a minimum genotyping quality of 30, a minimum fold coverage threshold of 5, a homozygous Ref/Allele being called if the base is >= 90% of
java -jar MultiVCFanalyzer_X-XX-X.jar \
NA \
/<PATH>/<TO>/<REFERENCE>.fasta \
NA \
/<PATH>/<TO>/<OUTPUTDIRECTORY>/ \
T \
30 \
5 \
0.9 \
0.1 \
NA \
/<PATH>/<TO>/Sample_1/<VCFFILE1>.vcf.gz \
/<PATH>/<TO>/Sample_2/<VCFFILE2>.vcf.gz \
/<PATH>/<TO>/Sample_3/<VCFFILE3>.vcf.gz
Output
Output files
The following files are created in output directory:
fullAlignment.fasta a fasta file of all positions contained in the VCF files i.e. including ref calls
info.txt information about the run
snpAlignment.fasta a fasta file of just SNP positions with samples only
snpAlignmentIncludingRefGenome.fasta a fasta file of just SNP positions with reference genome
snpStatistics.tsv some basic statistics about the SNP calls of each sample
The snpTable.tsv file will not display positions that are the same call across all samples (e.g. if all samples have N, or . [a reference call])!
snpStatistics.tsv
The snpStatistics column definitions are as follows
Sample: folder name of VCF file
allPos: length of FASTA file in base pairs (bp)
noCall: Number of positions with no call made as reported by GATK
discardedRefCall: Number of positions with a discarded reference call for not passing genotyping or coverage thresholds
discardedVarCall: Number of positions with a discarded variant call for not passing genotyping or coverage thresholds
filteredVarCall: Number of positions with discarded calls based on user filter list
unhandledGenotype: Number of positions with discarded calls for having more than two possible alleles (e.g. Ref: A, ALT: G,T)
refCall: Number of reference calls made
SNP Calls (all): Total number of non-reference homozygous and heterozgyous calls made
SNP Calls (het): Total number of non-reference calls not passing user-supplied heterozygosity/homozygosity thresholds
coverage(fold): Average number of reads covering final calls
coverage(percent): Percent coverage of all positions with final calls
Even if ‘EMIT_ALL_SITES’ is turned on, GATK will ignore any non-ACGT positions in the reference entirely (e.g. Ns), and will not export that position in the VCF file. Thus, refCall + SNP Calls (all) may not match allPos - noCall - discardedRefCall - discardedVarCall - filteredVarCall - unhandledGenotype as these positions are not supplied to MultiVCFAnalyzer’. You can check for this by checking the discrepancy of the two previous calculations is equal across every sample.
FAQs
How do I restrict to only ‘homozygous’ SNPs?
Set both homozygous and heterozygous frequency to the same value. Only ACTGs passing your thresholds will then be reported.
How do I increase the amount of memory?
Increase the amount of memory allocated to java with the -Xmx parameter (here 16 gigabytes)
MultiVCFAnalyzer
Table of Contents
Description
MultiVCFAnalyzer is a SNP filtering and SNP alignment generation tool, designed around (but not limited to) low coverage ancient DNA data. MultiVCFanalyzer reads multiple VCF files as produced by GATK UnifiedGenotyper, performs filtering based on a number of criteria, and provides the combined genotype calls in a number of formats that are suitable for follow-up analyses such as phylogenetic reconstruction, SNP effect analyses, population genetic analyses etc.
Furthermore, the results are provided in the form of various tables for manual inspection and presentation/publication purposes.
Citation
If you use MultiVCFAnalyzer please cite:
Bos, Harkins, Herbig, Coscolla et al. ‘Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis’ Nature 514, 494–497 (23 October 2014) doi:10.1038/nature13591
A more detailed description of the program can be seen in the supplementary information under “SNP Calling and Phylogenetic Analysis”.
In case of any questions please contact Alexander Herbig (alexander_herbig@eva.mpg.de).
Usage
The tool is a java program and requires openJDK 8. Input VCF files must be generated from GATK UnifiedGenotyper (<= 3.5), and ploidy must be set to 2 (to give allele frequency values).
To get the help message run
Please start the program with the following parameters in strictly this order:
Example
The following is an example of running MultiVCFAnalyzer requiring a minimum genotyping quality of 30, a minimum fold coverage threshold of 5, a homozygous Ref/Allele being called if the base is >= 90% of
Output
Output files
The following files are created in output directory:
fullAlignment.fastaa fasta file of all positions contained in the VCF files i.e. including ref callsinfo.txtinformation about the runsnpAlignment.fastaa fasta file of just SNP positions with samples onlysnpAlignmentIncludingRefGenome.fastaa fasta file of just SNP positions with reference genomesnpStatistics.tsvsome basic statistics about the SNP calls of each samplesnpTableForSnpEff.tsvinput file for SnpEffsnpTable.tsvbasic SNP table of combined positions taken from each VCF filesnpTableWithUncertaintyCalls.tsvsame as above, but with lower case characters indicating uncertain callstructureGenoypes_noMissingData-Columns.tsvalternate input file for STRUCTUREstructureGenotypes.tsvinput file for STRUCTUREsnpStatistics.tsvThe snpStatistics column definitions are as follows
FAQs
How do I restrict to only ‘homozygous’ SNPs?
Set both homozygous and heterozygous frequency to the same value. Only ACTGs passing your thresholds will then be reported.
How do I increase the amount of memory?
Increase the amount of memory allocated to java with the
-Xmxparameter (here 16 gigabytes)How to build the JAR file from source?
If you want to create a JAR file and use the tool, simply install Gradle and follow this:
This will create a folder
build/libs/that contains the compiled JAR file automatically for you. You can then use the tool as described above using: