To convert a VCF into a MAF, each variant must be mapped to only one of all the possible gene transcripts/isoforms that it might affect. But even within a single isoform, a Missense_Mutation close enough to a Splice_Site can be labeled as either in MAF format, but not as both. This selection of a single effect per variant is often subjective, and that’s what this project attempts to standardize. The vcf2maf and maf2maf scripts leave most of that responsibility to Ensembl’s VEP, but allow you to override its “canonical” isoforms, or use a custom ExAC VCF for annotation. The most useful feature, though, is the extensive support for parsing a wide range of crappy MAF-like or VCF-like formats we’ve seen out in the wild.
Quick start
Find the latest release, download it, and view the detailed usage manuals for vcf2maf and maf2maf:
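Once a release is unpacked, the usage text can be printed from its top-level directory. This is a minimal sketch that assumes vcf2maf.pl and maf2maf.pl sit in the current directory; it uses the --help flag, and --man should print the longer manual.

```shell
# Assumes the current directory is the unpacked vcf2maf release.
for script in vcf2maf.pl maf2maf.pl; do
    if [ -f "$script" ]; then
        perl "$script" --help
    else
        echo "$script not found; cd into the release directory first" >&2
    fi
done
```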
If you don’t have VEP installed, then follow this gist. Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant HGVS formats. After installing VEP, test out vcf2maf like this:
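A first test run might look like the sketch below. The file names are assumptions: tests/test.vcf is taken to be a sample VCF shipped with the release, and tests/test.vep.maf is just a convenient output name.

```shell
# Assumed paths; adjust them to your checkout.
INPUT_VCF="tests/test.vcf"
OUTPUT_MAF="tests/test.vep.maf"

if [ -f vcf2maf.pl ] && [ -f "$INPUT_VCF" ]; then
    perl vcf2maf.pl --input-vcf "$INPUT_VCF" --output-maf "$OUTPUT_MAF"
else
    echo "vcf2maf.pl or $INPUT_VCF not found; run this from the release directory" >&2
fi
```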
To fill columns 16 and 17 of the output MAF with tumor/normal sample IDs, and to parse out genotypes and allele counts from matched genotype columns in the VCF, use options --tumor-id and --normal-id. Skip option --normal-id if you don’t have a matched normal:
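As a sketch, the sample IDs can be passed as below and then checked in the output. The IDs and file paths are hypothetical placeholders, and show_sample_ids is a small helper written for this example, not part of vcf2maf.

```shell
# Hypothetical sample IDs and assumed file paths, for illustration only.
TUMOR_ID="TUMOR_SAMPLE_01"
NORMAL_ID="NORMAL_SAMPLE_01"

if [ -f vcf2maf.pl ] && [ -f tests/test.vcf ]; then
    perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf \
        --tumor-id "$TUMOR_ID" --normal-id "$NORMAL_ID"
fi

# Print columns 16 and 17 of a tab-delimited MAF, skipping "#" comment lines.
show_sample_ids() {
    grep -v '^#' "$1" | cut -f16,17
}
```

Running `show_sample_ids tests/test.vep.maf` afterwards should list the two column headers followed by the IDs given on the command line.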
VCFs from variant callers like VarScan use hardcoded sample IDs TUMOR/NORMAL to name genotype columns. To have vcf2maf correctly locate the columns to parse genotypes, while still printing proper sample IDs in the output MAF:
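A sketch of that case follows. It assumes the --vcf-tumor-id and --vcf-normal-id options name the genotype columns as they appear in the VCF, while --tumor-id and --normal-id set what is printed in the MAF. The file name tests/test_varscan.vcf and the sample IDs are placeholders, and vcf_sample_columns is a helper written for this example to show which column names a VCF actually uses.

```shell
# List the genotype (sample) column names of a VCF: fields 10+ of the #CHROM line.
vcf_sample_columns() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Assumed input file and hypothetical sample IDs.
if [ -f vcf2maf.pl ] && [ -f tests/test_varscan.vcf ]; then
    vcf_sample_columns tests/test_varscan.vcf
    perl vcf2maf.pl --input-vcf tests/test_varscan.vcf --output-maf tests/test_varscan.vep.maf \
        --tumor-id TUMOR_SAMPLE_01 --normal-id NORMAL_SAMPLE_01 \
        --vcf-tumor-id TUMOR --vcf-normal-id NORMAL
fi
```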
If VEP is installed under /opt/vep and the VEP cache is under /srv/vep, there are options available to tell vcf2maf where to find them (--vep-path and --vep-data).
If you want to skip running VEP and need a minimalist MAF-like file listing data from the input VCF only, then use the --inhibit-vep option. If your input VCF already contains VEP annotation, then vcf2maf will try to extract it. But be warned that the accuracy of your resulting MAF depends on how VEP was run upstream. In standard operation, vcf2maf runs VEP with very specific parameters to make sure everyone produces comparable MAFs. So it is strongly recommended to avoid --inhibit-vep unless you know what you’re doing.
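The two variations can be sketched as follows. /opt/vep and /srv/vep are example locations only, the --vep-path and --vep-data option names are my understanding of how vcf2maf is pointed at a non-default install, and the input and output paths are assumptions.

```shell
VEP_PATH="/opt/vep"   # example VEP install location
VEP_DATA="/srv/vep"   # example VEP cache location

if [ -f vcf2maf.pl ] && [ -f tests/test.vcf ]; then
    # Standard run, with explicit VEP locations.
    perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf \
        --vep-path "$VEP_PATH" --vep-data "$VEP_DATA"

    # Minimal MAF built from the VCF alone, without running VEP.
    perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.novep.maf \
        --inhibit-vep
fi
```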
maf2maf
If you have a MAF or a MAF-like file that you want to reannotate, then use maf2maf, which simply runs maf2vcf followed by vcf2maf:
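A minimal sketch of a reannotation run, assuming a sample MAF at tests/test.maf (the path is a placeholder) and the --input-maf and --output-maf options:

```shell
# Assumed paths; adjust them to your checkout.
INPUT_MAF="tests/test.maf"
OUTPUT_MAF="tests/test.vep.maf"

if [ -f maf2maf.pl ] && [ -f "$INPUT_MAF" ]; then
    perl maf2maf.pl --input-maf "$INPUT_MAF" --output-maf "$OUTPUT_MAF"
else
    echo "maf2maf.pl or $INPUT_MAF not found; run this from the release directory" >&2
fi
```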
After tests on variant lists from many sources, maf2vcf and maf2maf are quite good at dealing with formatting errors or “MAF-like” files. They even support VCF-style alleles, as long as Start_Position == POS. So it’s OK if the input format is imperfect. Any variants with a reference allele mismatch are set aside in a separate file for debugging. The bare minimum columns that maf2maf expects as input are:
Chromosome  Start_Position  Reference_Allele  Tumor_Seq_Allele2  Tumor_Sample_Barcode
1           3599659         C                 T                  TCGA-A1-A0SF-01
1           6676836         A                 AGC                TCGA-A1-A0SF-01
1           7886690         G                 A                  TCGA-A1-A0SI-01
See data/minimalist_test_maf.tsv for a sampler. If Tumor_Seq_Allele1 is also present, it is used to determine zygosity. Otherwise, maf2maf tries to determine zygosity from variant allele fractions, assuming that arguments --tum-vad-col and --tum-depth-col are set correctly to the names of the columns containing those read counts. Specifying Matched_Norm_Sample_Barcode, along with the columns containing its read counts, is also strongly recommended. Columns containing normal allele read counts can be specified using arguments --nrm-vad-col and --nrm-depth-col.
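To make the minimum input concrete, the sketch below writes the three example variants to a tab-delimited file and checks that its header carries all five required columns. The output path and the missing_maf_columns helper are made up for this example; they are not part of vcf2maf.

```shell
# Write a tab-delimited MAF holding only the minimum columns.
MIN_MAF="${TMPDIR:-/tmp}/minimal_example.maf"   # hypothetical output path
{
    printf 'Chromosome\tStart_Position\tReference_Allele\tTumor_Seq_Allele2\tTumor_Sample_Barcode\n'
    printf '1\t3599659\tC\tT\tTCGA-A1-A0SF-01\n'
    printf '1\t6676836\tA\tAGC\tTCGA-A1-A0SF-01\n'
    printf '1\t7886690\tG\tA\tTCGA-A1-A0SI-01\n'
} > "$MIN_MAF"

# Print the name of every required column that is absent from the header,
# where the header is the first line not starting with "#".
missing_maf_columns() {
    awk -F'\t' '
        /^#/ { next }
        {
            for (i = 1; i <= NF; i++) seen[$i] = 1
            n = split("Chromosome Start_Position Reference_Allele Tumor_Seq_Allele2 Tumor_Sample_Barcode", req, " ")
            for (j = 1; j <= n; j++) if (!(req[j] in seen)) print req[j]
            exit
        }' "$1"
}

missing_maf_columns "$MIN_MAF"   # prints nothing when all five columns are present
```

If the file also carried read counts, they could be named on the maf2maf command line, for example `--tum-depth-col t_depth --tum-vad-col t_alt_count`, where t_depth and t_alt_count are hypothetical column names standing in for whatever your file uses.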
Docker
Assuming you have a recent version of docker, clone the main branch and build an image as follows:
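A sketch of the clone-and-build step, wrapped in a function so that nothing runs until you call it. The repository URL is an assumption (substitute your fork or mirror as needed), and the vcf2maf:main tag is chosen to match the image name used when running the scripts.

```shell
# Clone the main branch and build an image tagged vcf2maf:main.
# The subshell body keeps the caller's working directory unchanged.
build_vcf2maf_image() (
    git clone --branch main https://github.com/mskcc/vcf2maf.git &&
    cd vcf2maf &&
    docker build -t vcf2maf:main .
)
```

Call `build_vcf2maf_image` from the directory where the clone should be created.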
Now you can run the scripts in Docker as follows:
docker run --rm vcf2maf:main perl vcf2maf.pl --help
docker run --rm vcf2maf:main perl maf2maf.pl --help
Testing
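A small standalone test dataset was created by restricting the VEP v112 cache/fasta to chr21 in GRCh38 and hosting it on a private server for download by CI services. We can manually fetch those as follows:
And the following scripts test the docker image on predefined inputs and compare outputs against expected outputs: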
License
Citation