Genotypes and sample-aware queries: per-sample zygosity on multi-sample VCFs (Genotype, Zygosity, VariantCollection.for_sample, .heterozygous_in, .homozygous_alt_in). New in 2.3.
CSV round-trip and metadata headers: to_csv / from_csv on both collection types, with #-prefixed provenance headers. New in 2.1, refined in 2.2.
Error handling: ReferenceMismatchError, SampleNotFoundError, and the raise_on_error=False escape hatch.
Replace annotated start codon with alternative start codon (e.g. “ATG>CAG”).
ComplexSubstitution
Insertion and deletion of multiple amino acids.
Deletion
Coding mutation which causes deletion of amino acid(s).
ExonLoss
Deletion of entire exon, significantly disrupts protein.
ExonicSpliceSite
Mutation at the beginning or end of an exon, may affect splicing.
FivePrimeUTR
Variant affects 5’ untranslated region before start codon.
FrameShiftTruncation
A frameshift which leads immediately to a stop codon (no novel amino acids created).
FrameShift
Out-of-frame insertion or deletion of nucleotides, causes novel protein sequence and often premature stop codon.
IncompleteTranscript
Can’t determine effect since transcript annotation is incomplete (often missing either the start or stop codon).
Insertion
Coding mutation which causes insertion of amino acid(s).
Intergenic
Occurs outside of any annotated gene.
Intragenic
Within the annotated boundaries of a gene but not in a region that’s transcribed into pre-mRNA.
IntronicSpliceSite
Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations.
Intronic
Variant occurs between exons and is unlikely to affect splicing.
NoncodingTranscript
Transcript doesn’t code for a protein.
PrematureStop
Insertion of stop codon, truncates protein.
Silent
Mutation in coding sequence which does not change the amino acid sequence of the translated protein.
SpliceAcceptor
Mutation in the last two nucleotides of an intron, likely to affect splicing.
SpliceDonor
Mutation in the first two nucleotides of an intron, likely to affect splicing.
StartLoss
Mutation causes loss of the start codon; the likely result is that an alternate start codon downstream will be used (possibly in a different frame).
StopLoss
Loss of stop codon, causes extension of protein by translation of nucleotides from 3’ UTR.
Substitution
Coding mutation which causes simple substitution of one amino acid for another.
ThreePrimeUTR
Variant affects 3’ untranslated region after stop codon of mRNA.
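The four intronic effect types above differ only in where the variant falls inside the intron. The sketch below illustrates that distinction using nothing but the descriptions in the list. It is not Varcode's implementation: the function name `classify_intronic_position` is hypothetical, and the `near_window` size is an assumption, since the list does not say how close to an intron boundary counts as "near".

```python
def classify_intronic_position(offset: int, intron_length: int, near_window: int = 6) -> str:
    """Label a position inside an intron with one of the intronic effect names.

    offset is the 0-based distance from the intron's first nucleotide,
    measured in transcript orientation (so offset 0 is next to the upstream exon).
    near_window is an assumed cutoff for "near the beginning or end".
    """
    if not 0 <= offset < intron_length:
        raise ValueError("offset must fall inside the intron")

    distance_from_end = intron_length - 1 - offset

    if offset < 2:
        # First two nucleotides of the intron.
        return "SpliceDonor"
    if distance_from_end < 2:
        # Last two nucleotides of the intron.
        return "SpliceAcceptor"
    if offset < near_window or distance_from_end < near_window:
        # Close to a boundary, but outside the donor/acceptor dinucleotides.
        return "IntronicSpliceSite"
    return "Intronic"


labels = [classify_intronic_position(o, 100) for o in (0, 2, 50, 95, 99)]
print(labels)
```

For a 100-nucleotide intron, these five offsets yield SpliceDonor, IntronicSpliceSite, Intronic, IntronicSpliceSite and SpliceAcceptor, in that order.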
Coordinate System
Varcode currently uses a “base counted, one start” genomic coordinate system, to match the Ensembl annotation database. We are planning to switch over to “space counted, zero start” (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this blog post.
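To make the difference between the two systems concrete, here is a minimal sketch of converting between one-based inclusive intervals and zero-based interbase intervals. The helper names are hypothetical and are not part of Varcode; the arithmetic is the standard conversion between the two conventions.

```python
from typing import Tuple


def one_based_to_interbase(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive interval [start, end] to interbase [start - 1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def interbase_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert an interbase interval [start, end) to 1-based inclusive [start + 1, end]."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    if start == end:
        # A zero-length interval names a gap between bases, which a
        # base-counted inclusive interval cannot express directly.
        raise ValueError("zero-length interval has no 1-based inclusive equivalent")
    return start + 1, end


# Bases 5, 6 and 7 in base-counted coordinates.
span = one_based_to_interbase(5, 7)
round_trip = interbase_to_one_based(*span)

# An insertion between bases 7 and 8 is simply the empty interbase interval (7, 7).
insertion_site = (7, 7)
insertion_length = insertion_site[1] - insertion_site[0]
```

In interbase coordinates the length of any interval is `end - start`, and an insertion site is an interval of length zero. In base-counted coordinates the same site has to be described relative to a neighbouring base, which is the special case mentioned above.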
Varcode
Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.
Installation
You can install varcode using pip by running pip install varcode.
You can install the required reference genome data through PyEnsembl using its pyensembl install command, specifying the Ensembl release(s) you need with --release.
Example
If you are looking for a quick start guide, see this IPython notebook, which demonstrates simple use cases of Varcode.
Further reading
Feature guides live in docs/ and cover genotypes and sample-aware queries, CSV round-trip and metadata headers, and error handling.
See CHANGELOG.md for the release history.
Effect Types