doppelgängR
Introduction
doppelgangR is a
package for identifying duplicate samples within or between datasets of
transcriptome profiles. It is intended for microarray and RNA-seq gene
expression profiles where biological replicates are ordinarily more
distinct than technical replicates, as is the case for cancer types with
“noisy” genomes. It is intended for cases where per-gene summaries are
available but full genotypes are not, which is typical of public
databases such as the Gene Expression Omnibus.
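Results from running doppelgangR on colorectal (CRC), bladder, and
ovarian cancer datasets are available on Dropbox. For the manuscript
vignette, visit https://github.com/waldronlab/doppelgangR_paper.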
The doppelgangR() function identifies duplicates in three different
ways:
“expression” doppelgängers have highly similar expression
profiles. By default, these are sample pairs whose Pearson correlation
is higher than expected from an empirical distribution of Pearson
correlations between biological replicates. The type of correlation,
and the default use of ComBat batch correction, can be changed using
the “corFinder.args” argument.
“phenotype” doppelgängers have highly similar clinical or
phenotype data, as contained in the phenoData slot of the
ExpressionSet. To identify duplicates this way, you must curate the
phenoData of each ExpressionSet so that they have identical column
names and encode phenotypes in the same way. For example, if each
dataset provides information on age, this column of the phenoData
could be called “age” in every dataset and encoded as an integer
number of years. If the phenoData slots are NULL, this type of
checking is turned off automatically. If they are not NULL but are
also not curated, you should turn off phenotype checking by setting
phenoFinder.args=NULL.
“smoking gun” doppelgängers have the same value for an identifier
that should be unique. You can enable this type of check by setting
the argument “manual.smokingguns” to the names of columns containing
supposedly unique identifiers, or by setting “automatic.smokingguns”
to TRUE, in which case the function assumes that any column whose
values are unique within a dataset should also be unique across
datasets.
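As a minimal base-R sketch of what curation means for the phenotype and smoking-gun checks, the two hypothetical phenotype tables below (pheno_a and pheno_b, with made-up column names and values) are brought to a common encoding, and a supposedly unique identifier is then compared across them. This only illustrates the idea; it is not how doppelgangR() performs these checks internally.

```r
# Two hypothetical phenotype tables with inconsistent column names and encodings
pheno_a <- data.frame(patient_id = c("P1", "P2", "P3"),
                      Age = c("61", "48", "70"),
                      stringsAsFactors = FALSE)
pheno_b <- data.frame(id = c("P3", "P9"),
                      age_years = c(70.0, 55.0),
                      stringsAsFactors = FALSE)

# Curate: identical column names and the same encoding (integer years)
curated_a <- data.frame(unique_id = pheno_a$patient_id,
                        age = as.integer(pheno_a$Age),
                        stringsAsFactors = FALSE)
curated_b <- data.frame(unique_id = pheno_b$id,
                        age = as.integer(pheno_b$age_years),
                        stringsAsFactors = FALSE)
stopifnot(identical(colnames(curated_a), colnames(curated_b)))

# "Smoking gun" idea: an identifier that should be unique occurs in both tables
shared_ids <- intersect(curated_a$unique_id, curated_b$unique_id)
shared_ids
```

In a real analysis the curated tables would be stored back in the phenoData of each ExpressionSet before calling doppelgangR().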
This vignette focuses on the “expression” type of doppelgänger.
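To make the “expression” idea concrete, here is a small base-R simulation. All names and numbers are invented for illustration, and the simple standardisation on the Fisher z scale is an assumption that stands in for the package's own outlier detection. Two studies share a gene-specific baseline, so unrelated samples correlate moderately, and one planted duplicate correlates far more strongly.

```r
set.seed(1)
n_genes <- 200

# Gene-specific baseline shared by all samples
baseline <- rnorm(n_genes, mean = 8, sd = 2)

# Simulate one study: each sample is the baseline plus biological variation
make_study <- function(n_samples, prefix) {
  mat <- sapply(seq_len(n_samples),
                function(i) baseline + rnorm(n_genes, sd = 1))
  dimnames(mat) <- list(paste0("gene", seq_len(n_genes)),
                        paste0(prefix, seq_len(n_samples)))
  mat
}
study_a <- make_study(10, "A")
study_b <- make_study(10, "B")

# Plant a duplicate: B4 is A7 re-measured with small technical noise
study_b[, "B4"] <- study_a[, "A7"] + rnorm(n_genes, sd = 0.1)

# Pearson correlation for every between-study sample pair (rows A, columns B)
cors <- cor(study_a, study_b, method = "pearson")

# Fisher z-transform, then standardise against all pairs
z <- atanh(cors)
z_scores <- (z - mean(z)) / sd(as.vector(z))

# The most extreme pair is the candidate doppelgänger
top <- which(z_scores == max(z_scores), arr.ind = TRUE)
flagged <- c(rownames(cors)[top[1, "row"]], colnames(cors)[top[1, "col"]])
flagged
```

The planted pair stands out because technical replicates are much closer to each other than biological replicates, which is the situation the package is intended for.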
Data types
Identification of doppelgängers is effective for both microarray and
log-transformed RNA-seq data, and even for matching a sample profiled
by microarray to the same sample profiled by RNA-seq.
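If your RNA-seq data are raw counts, a log transformation could look like the following sketch. The count matrix is hypothetical, and the pseudocount of 1 is an assumption made here to avoid taking the log of zero; other transformations may suit your data better.

```r
# Hypothetical raw RNA-seq counts: 3 genes x 2 samples
counts <- matrix(c(0, 10, 250, 3, 12, 300), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# log2 transform with a pseudocount of 1 (an assumption) to avoid log2(0)
log_expr <- log2(counts + 1)
log_expr
```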
Case Study: Batch correction in Japanese datasets
We load datasets by Yoshihara et al. that have been curated in
curatedOvarianData.
These are objects of class ExpressionSet. The doppelgangR function
requires a list of ExpressionSet objects as input.
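Now run doppelgangR with default arguments, except for setting
phenoFinder.args=NULL, which turns off checking for similar clinical
data in the phenoData slot of the ExpressionSet objects.
This creates an object of class DoppelGang, which has print, summary,
and plot methods. The summary output is voluminous and is not shown
here. The plot method draws a histogram of pairwise sample correlations
within and between each study.
Doppelgängers identified on the basis of similar expression profiles.
The vertical red lines indicate samples that were flagged.
A single one of these histograms can be drawn using the plot.pair
argument.
Pair plot of JapaneseA:JapaneseA doppelgängers identified. The vertical
red lines indicate samples that were flagged.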
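The steps of this case study can be sketched as follows. The dataset object names (GSE32062.GPL6480_eset and GSE17260_eset) and the list label Yoshihara2010 are assumptions about what curatedOvarianData provides and how the studies are labelled. Only the label JapaneseA appears in the text above. Treat this as an outline under those assumptions, and note that the run can take a while.

```r
library(curatedOvarianData)  # assumed source of the curated ExpressionSets
library(doppelgangR)

# Assumed object names for the Yoshihara et al. datasets
data(GSE32062.GPL6480_eset)
data(GSE17260_eset)

# Named list of ExpressionSet objects; the names label the studies in the output
testesets <- list(JapaneseA = GSE32062.GPL6480_eset,
                  Yoshihara2010 = GSE17260_eset)  # hypothetical label

# Default arguments, with phenotype checking turned off
results1 <- doppelgangR(testesets, phenoFinder.args = NULL)

results1         # print method
plot(results1)   # histograms within and between each study

# Draw only the histogram for one pair of datasets
plot(results1, plot.pair = c("JapaneseA", "JapaneseA"))
```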
Important options
Changing sensitivity
If, after inspecting the histograms, you see that some visible outliers
were not caught, or that non-outliers exceeded the threshold, you can
change the sensitivity through the outlierFinder.expr.args argument,
whose bonf.prob element defaults to 0.5.
The default of 0.5 is a reasonable but arbitrary trade-off between
sensitivity and specificity. We have found that it often selects the
dataset pairs that contain duplicates, but often does not find all of
the duplicate samples. Sensitivity can be increased by raising the
bonf.prob value.
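The effect of bonf.prob can be illustrated with a simplified base-R sketch. It assumes a plain normal fit on the Fisher z scale and a Bonferroni-style division of the probability by the number of comparisons. The package's actual fitting procedure may differ, and the correlations and the helper outlier_threshold() are hypothetical.

```r
# Hypothetical pairwise sample correlations between two datasets
set.seed(42)
cors <- tanh(rnorm(500, mean = 1.0, sd = 0.1))

# Simplified threshold: upper quantile of a normal fit on the Fisher z scale,
# with the tail probability divided by the number of comparisons
outlier_threshold <- function(cors, bonf.prob) {
  z <- atanh(cors)
  raw.prob <- bonf.prob / length(z)
  tanh(qnorm(1 - raw.prob, mean = mean(z), sd = sd(z)))
}

thr_default   <- outlier_threshold(cors, bonf.prob = 0.5)
thr_sensitive <- outlier_threshold(cors, bonf.prob = 1.0)

# A larger bonf.prob gives a lower threshold, so more pairs are flagged
c(default = thr_default, sensitive = thr_sensitive)

# A value that could be passed as outlierFinder.expr.args to doppelgangR()
sensitive_args <- list(bonf.prob = 1.0, transFun = atanh, tail = "upper")
```

Under these assumptions, raising bonf.prob from 0.5 to 1.0 lowers the correlation a pair must exceed to be flagged, which trades specificity for sensitivity.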
Use of the ExpressionSet
The doppelgangR() function takes as its main argument a list of
ExpressionSet objects. If you just have matrices, you can easily
convert these to ExpressionSet objects, for example:
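library(Biobase)
# A toy 2 x 2 expression matrix (rows: features, columns: samples)
mat <- matrix(1:4, ncol=2)
# Wrap the matrix in an ExpressionSet
eset <- ExpressionSet(mat)
class(eset)
#> [1] "ExpressionSet"
#> attr(,"package")
#> [1] "Biobase"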
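Because doppelgangR() expects a list, several matrices can be converted in one step. The sketch below uses two hypothetical matrices (mat_a and mat_b) with made-up gene and sample names. Naming the list elements is a choice made here so that each study has a label, as with JapaneseA in the case study.

```r
library(Biobase)

# Two hypothetical expression matrices: rows are genes, columns are samples
genes <- paste0("gene", 1:5)
mat_a <- matrix(rnorm(15), nrow = 5, dimnames = list(genes, paste0("a", 1:3)))
mat_b <- matrix(rnorm(20), nrow = 5, dimnames = list(genes, paste0("b", 1:4)))

# Convert each matrix and keep the list names as study labels
esets <- lapply(list(studyA = mat_a, studyB = mat_b), ExpressionSet)

# Number of samples per study
sapply(esets, function(x) ncol(exprs(x)))
```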
Parallelizing
The doppelgangR() function checks all pairwise combinations of
datasets in a list of ExpressionSet objects, and these dataset pairs
can be checked in parallel using multiple processing cores using the
BPPARAM argument. This functionality is imported from the
(“BiocParallel”) package. Please see
“?BiocParallel::`BiocParallelParam-class`” documentation.
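A possible way to set this up is sketched below. The choice of SnowParam with two workers is an assumption for illustration (other BiocParallelParam classes may suit your platform better), and run_in_parallel() is a hypothetical wrapper, not part of the package.

```r
library(BiocParallel)

# Two local worker processes; the number of workers is an arbitrary choice
param <- SnowParam(workers = 2)
bpnworkers(param)

# Hypothetical wrapper: esets is a named list of ExpressionSet objects
run_in_parallel <- function(esets, BPPARAM = param) {
  doppelgangR::doppelgangR(esets, phenoFinder.args = NULL, BPPARAM = BPPARAM)
}
```

Since the unit of parallel work is a pair of datasets, additional workers are only likely to help when the list contains several datasets.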
Caching
By default, the doppelgangR() function caches intermediate results to
make re-running with different arguments faster. Turn caching off by
setting the argument cache.dir=NULL.