Citation
The package is currently being submitted to Bioconductor. Please use it once
it is accepted there and a suitable citation is provided.
General design
Introduction
The concept of mutational signatures was introduced in a series of papers by
Ludmil Alexandrov, Serena Nik-Zainal, Michael Stratton and others (for precise
citations please refer to the vignette of the package). The general
approach is as follows:
1. The SNVs are categorized by their nucleotide exchange. In total there are
4 x 3 = 12 different nucleotide exchanges, but summing over reverse
complements leaves only 12 / 2 = 6 different categories. For every SNV
detected, the motif context around the position of the SNV is extracted. This
may be a trinucleotide context if taking one base upstream and one base
downstream of the position of the SNV, but larger motifs may be taken as well
(e.g. pentamers). Taking into account the motif context increases combinatorial
complexity: in the case of the trinucleotide context, there are
4 x 6 x 4 = 96 different variant categories. These categories are
called features in the following text. The number of features will be
called n.
2. A cohort consists of different samples with the number of samples denoted by
m. For each sample we can count the occurrences of each feature, yielding an
n-dimensional vector (n being the number of features) per sample. For a
cohort, we thus get an n x m matrix, called the
mutational catalogue V. It can be understood as a summary indicating
which sample has how many variants of which category, but omitting the
information of the genomic coordinates of the variants.
3. The mutational catalogue V is quite big and still carries a lot of
complexity. For many analyses a reduction of complexity is desirable. One way
to achieve such a reduction is a matrix decomposition: we would like
to find two smaller matrices W and H whose product approximates the big
matrix V (the mutational catalogue) and captures a large fraction of its
complexity. Remember that V is an n x m matrix, n being the number
of features and m being the number of samples. W in this setting is an
n x l matrix and H is an l x m matrix. The columns of
W are called the mutational signatures and the columns of H are called
exposures. l denotes the number of mutational signatures. Hence the
signatures are n-dimensional vectors (with n being the number of features),
while the exposures are l-dimensional vectors (l being the number of
signatures). Note that as we are dealing with count data, we would like to have
only non-negative entries in W and H. A mathematical method which is able to do
such a decomposition is the NMF (nonnegative matrix factorization). It
basically solves the problem as illustrated in the following figure:
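In symbols, the decomposition described above can be written as follows. This only restates the definitions of V, W, H, n, m and l given in the text; the column notation v_j and h_kj is introduced here for illustration.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{\,n \times m}, \quad
W \in \mathbb{R}_{\ge 0}^{\,n \times l}, \quad
H \in \mathbb{R}_{\ge 0}^{\,l \times m}
% per sample j: the catalogue column v_j is approximated by a
% non-negative combination of the l signature columns w_k of W
v_j \approx \sum_{k=1}^{l} h_{kj}\, w_k , \qquad h_{kj} \ge 0
```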
Of course the whole concept only leads to a reduction in complexity if l < n,
i.e. if the number of signatures is smaller than the number of features, as
indicated in the above figure. Note that the NMF itself solves the above
problem for a given number of signatures l. Additional criteria exist to
estimate the true number of signatures.
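The feature categories and the mutational catalogue V described above can be sketched in base R. This is only an illustration: the feature labels (e.g. "CT A.G"), the data frame toy_variants and its columns are hypothetical and do not come from the package.

```r
# Six exchange classes left after summing over reverse complements
exchanges <- c("CA", "CG", "CT", "TA", "TC", "TG")
bases <- c("A", "C", "G", "T")

# Upstream base x exchange x downstream base: 4 x 6 x 4 combinations
grid <- expand.grid(up = bases, exchange = exchanges, down = bases,
                    stringsAsFactors = FALSE)
features <- paste(grid$exchange, paste0(grid$up, ".", grid$down))
n <- length(features)  # number of features

# Hypothetical, already annotated variants: one row per SNV
toy_variants <- data.frame(
  PID     = c("s1", "s1", "s1", "s2", "s2"),
  feature = c("CT A.G", "CT A.G", "TA C.C", "CA T.T", "CT A.G"),
  stringsAsFactors = FALSE
)

# Mutational catalogue: rows = features (n), columns = samples (m)
V <- unclass(table(factor(toy_variants$feature, levels = features),
                   toy_variants$PID))
dim(V)
```

Genomic coordinates do not appear in V; only the counts per feature and sample remain.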
The YAPSA package
In a context where mutational signatures W are already known (because they
were described and published or they are available in a
database such as http://cancer.sanger.ac.uk/cosmic/signatures), we might
want to just find the exposures H for these known signatures in the
mutational catalogue V of a given cohort.
The YAPSA package (Yet Another Package for Signature Analysis) presented
here provides the function LCD (linear combination decomposition)
to perform this task. The advantage of this method is that there are no
constraints on the cohort size, so LCD can be run for as few as one
sample and thus be used e.g. for signature analysis in personalized oncology.
In contrast to NMF, LCD is very fast and requires very few computational
resources. The YAPSA package provides additional functions for signature
analysis, e.g. for stratifying the mutational catalogue to determine signature
exposures in different strata, part of which is discussed in the vignette of
the package.
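One way to picture the task of finding exposures for known signatures is as a least-squares fit with non-negativity constraints. The sketch below uses only base R (optim with the L-BFGS-B method). The toy matrix W, the vector h_true and the helper lcd_sketch are hypothetical, and the formulation is an assumption made for illustration, not a description of how LCD is implemented.

```r
# Toy signatures: n = 4 features, l = 2 signatures, columns sum to 1
W <- cbind(S1 = c(0.7, 0.2, 0.1, 0.0),
           S2 = c(0.0, 0.1, 0.3, 0.6))

# Catalogue of a single sample, generated from known exposures
h_true <- c(S1 = 30, S2 = 10)
v <- as.vector(W %*% h_true)

# Minimise ||v - W h||^2 subject to h >= 0
lcd_sketch <- function(v, W) {
  obj  <- function(h) sum((v - W %*% h)^2)
  grad <- function(h) as.vector(-2 * t(W) %*% (v - W %*% h))
  fit <- optim(par = rep(sum(v) / ncol(W), ncol(W)), fn = obj, gr = grad,
               method = "L-BFGS-B", lower = 0)
  setNames(fit$par, colnames(W))
}

h_hat <- lcd_sketch(v, W)
round(h_hat, 2)
```

Because the fit is done per sample, the same helper works for a single sample as well as for each column of a larger catalogue.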
Install
As long as YAPSA is not yet accepted on
Bioconductor, it may be downloaded and installed
from GitHub:
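The installation command itself is missing at this point. A minimal sketch, assuming the devtools package and the repository name huebschm/YAPSA that appears in the install_github call quoted elsewhere in this document, could look like this:

```r
# devtools provides install_github(); install it first if it is missing
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Repository name as quoted in this README
devtools::install_github("huebschm/YAPSA")
```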
YAPSA is already prepared to use the newest versions of the packages
circlize and ComplexHeatmap by Zuguang Gu. These are currently not in the
release branch of Bioconductor. If you want your system to be ready for the
next update of YAPSA, you may already install the newest versions of
these packages from GitHub as well:
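A possible form of these commands is sketched below. The repository paths are assumptions (the packages are expected under Zuguang Gu's GitHub account) and should be checked before use.

```r
# Assumed repository paths for the development versions of both packages
devtools::install_github("jokergoo/circlize")
devtools::install_github("jokergoo/ComplexHeatmap")
```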
##   CHROM       POS REF ALT      PID
## 1     1 183502381   G   A 07-35482
## 2    18  60985506   T   A 07-35482
## 3    18  60985748   G   T 07-35482
## 4    18  60985799   T   C 07-35482
## 5     2 242077457   A   G 07-35482
## 6     6  13470412   C   T 07-35482
For convenience later on, we annotate subgroup information to every variant
(indirectly through the sample it occurs in). For reasons of simplicity, we
also restrict the analysis to the Whole Genome Sequencing (WGS) datasets:
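The annotation and filtering step can be pictured with a small base R sketch. The table sample_info, its columns subgroup and seq_type, and the third variant row are hypothetical and only serve to show the mechanics.

```r
# Variants in the vcf-like layout shown above (third row is made up)
variants <- data.frame(
  CHROM = c("1", "18", "3"),
  POS   = c(183502381, 60985506, 1000),
  REF   = c("G", "T", "C"),
  ALT   = c("A", "A", "T"),
  PID   = c("07-35482", "07-35482", "hypothetical-02"),
  stringsAsFactors = FALSE
)

# Hypothetical per-sample annotation
sample_info <- data.frame(
  PID      = c("07-35482", "hypothetical-02"),
  subgroup = c("groupA", "groupB"),
  seq_type = c("WGS", "WES"),
  stringsAsFactors = FALSE
)

# Every variant inherits the subgroup of the sample it occurs in
annotated <- merge(variants, sample_info, by = "PID")

# Keep only variants from whole genome sequenced samples
wgs_only <- annotated[annotated$seq_type == "WGS", ]
```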
As stated above, one of the functions in the YAPSA package (LCD) is
designed to perform mutational signature analysis with known signatures. There
are (at least) two possible sources for signature data: i) the signatures
published initially by Alexandrov, and ii) an updated and curated
current set of mutational signatures maintained by Ludmil Alexandrov at http://cancer.sanger.ac.uk/cosmic/signatures.
When using LCD_complex_cutoff, we have to supply a vector of cutoffs with as
many entries as there are signatures. It may make sense to provide different
cutoffs for different signatures.
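A cutoff vector of this kind can be built as a named numeric vector. The sketch below is an assumption about the general idea only: the signature names AC1 to AC8 are a hypothetical subset, rel_exposure holds invented values, and the comparison at the end merely illustrates what a signature-specific cutoff means, not how LCD_complex_cutoff works internally.

```r
# Hypothetical subset of signature names
sig_names <- paste0("AC", 1:8)

# Default cutoff for every signature, then signature-specific exceptions
cutoffs <- setNames(rep(0.06, length(sig_names)), sig_names)
cutoffs[c("AC1", "AC5")] <- 0

# Invented relative exposures of a cohort (fractions summing to 1)
rel_exposure <- setNames(c(0.30, 0.02, 0.10, 0.01, 0.04, 0.03, 0.45, 0.05),
                         sig_names)

# A signature is kept only if its exposure exceeds its own cutoff
keep <- rel_exposure > cutoffs
names(keep)[keep]
```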
In this example, the cutoff for signatures AC1 and AC5 is thus set to 0, whereas
the cutoffs for all other signatures remain at 0.06. Running the function
LCD_complex_cutoff:
Note that the signatures extracted with the signature-specific cutoffs are the
same in the example displayed here. Depending on the analyzed cohort and the
choice of cutoffs, the extracted signatures may vary considerably.
Cluster samples based on their signature exposures
To identify groups of samples which were exposed to similar mutational
processes, the exposure vectors of the samples can be compared. The YAPSA
package provides a custom function for this task: complex_heatmap_exposures,
which uses the package ComplexHeatmap by Zuguang Gu. It produces
output as follows:
The dendrogram produced by either the function complex_heatmap_exposures or
the function hclust_exposures can be cut to yield signature exposure specific
subgroups of the PIDs.
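The principle of clustering samples by their exposures and cutting the dendrogram can be shown with base R alone. This sketch uses hclust and cutree from the stats package on an invented exposure matrix H; it does not call complex_heatmap_exposures or hclust_exposures, whose arguments are not shown here.

```r
# Invented exposures: rows = signatures, columns = samples (PIDs)
H <- matrix(c(50, 2,
              48, 3,
               1, 40,
               2, 45),
            nrow = 2,
            dimnames = list(c("S1", "S2"), c("p1", "p2", "p3", "p4")))

# Compare samples on relative exposures so mutation load does not dominate
rel <- sweep(H, 2, colSums(H), "/")

# Hierarchical clustering of the samples (rows of the transposed matrix)
hc <- hclust(dist(t(rel)))

# Cut the dendrogram into two exposure-specific subgroups
groups <- cutree(hc, k = 2)
groups
```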
Performing a stratification based on mutation density
This type of analysis is performed using the function run_SMC where SMC stands
for stratification of the mutational catalogue. For details on this
function please consult the vignette.
YAPSA
Daniel Huebschmann
26/08/2015
Of course, devtools has to be installed before install_github can be used.
If you ran into dependency conflicts before, try rerunning install_github("huebschm/YAPSA") after installing the newest versions of circlize and ComplexHeatmap.
Usage
Example: a cohort of B-cell lymphomas
Loading example data
Load data in a vcf-like format:
Adapt the data structure:
Note that there are 48 different samples:
Now we can start using the main functions of the YAPSA package: LCD and LCD_complex_cutoff.
Building a mutational catalogue
Prepare.
This section uses functions which are to a large extent wrappers for functions in the package SomaticSignatures by Julian Gehring.
The function create_mutation_catalogue_from_df returns a list object with several entries. We will use the one called matrix.
LCD analysis with signature-specific cutoffs
Some adaptation (extracting and reformatting the information on which sample belongs to which subgroup):
Plotting absolute exposures for visualization:
And relative exposures:
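The difference between absolute and relative exposures can be sketched with base R graphics. The matrix abs_exposures is invented, and barplot is used here only as a stand-in for the plotting functions of the package.

```r
# Invented absolute exposures: rows = signatures, columns = samples
abs_exposures <- matrix(c(120, 30, 50,
                           40, 160, 0),
                        nrow = 3,
                        dimnames = list(c("AC1", "AC3", "AC5"),
                                        c("p1", "p2")))

# Relative exposures: every sample (column) is scaled to sum to 1
rel_exposures <- sweep(abs_exposures, 2, colSums(abs_exposures), "/")

# Stacked bars per sample, one colour per signature
barplot(abs_exposures, legend.text = TRUE, ylab = "absolute exposure")
barplot(rel_exposures, legend.text = TRUE, ylab = "relative exposure")
```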
If you are interested only in the clustering and not in the heatmap information, you could also use hclust_exposures: