Standard regression functions in R enabled for parallel processing over
large data-frames
================
Kevin Blighe, Jessica Lasky-Su
2021-08-01
Introduction
In many analyses, a large amount of variables have to be tested
independently against the trait/endpoint of interest, and also adjusted
for covariates and confounding factors at the same time. The major
bottleneck in these is the amount of time that it takes to complete
these analyses.
With RegParallel, a large number of tests can be performed
simultaneously. On a 12-core system, 144 variables can be tested
simultaneously, with 1000s of variables processed in a matter of seconds
via ‘nested’ parallel processing.
Works for logistic regression, linear regression, conditional logistic
regression, Cox proportional hazards and survival models, and Bayesian
logistic regression. Also caters for generalised linear models that
utilise survey weights created by the ‘survey’ CRAN package and that
utilise ‘survey::svyglm’.
Installation
1. Download the package from Bioconductor
if (!requireNamespace('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('RegParallel')
Perform the most basic logistic regression analysis
Here, we fit a binomial logistic regression model to the data via
glmParallel, with dexamethasone as the dependent variable.
## NOT RUN
res1 <- RegParallel(
data = rlddata[ ,1:3000],
formula = 'dex ~ [*]',
FUN = function(formula, data)
glm(formula = formula,
data = data,
family = binomial(link = 'logit')),
FUNtype = 'glm',
variables = colnames(rlddata)[10:3000])
res1[order(res1$P, decreasing=FALSE),]
Perform a basic linear regression
Here, we will perform the linear regression using both
glmParallel and lmParallel. We will appreciate that a
linear regression is the same using either function with the default
settings.
Regularised log expression levels from our DESeq2 data will be
used.
rlddata <- rlddata[ ,1:2000]
res2 <- RegParallel(
data = rlddata,
formula = '[*] ~ cell',
FUN = function(formula, data)
glm(formula = formula,
data = data,
method = 'glm.fit'),
FUNtype = 'glm',
variables = colnames(rlddata)[10:ncol(rlddata)],
p.adjust = "none")
res3 <- RegParallel(
data = rlddata,
formula = '[*] ~ cell',
FUN = function(formula, data)
lm(formula = formula,
data = data),
FUNtype = 'lm',
variables = colnames(rlddata)[10:ncol(rlddata)],
p.adjust = "none")
subset(res2, P<0.05)
library(Biobase)
library(GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
x <- exprs(gset[[1]])
# remove Affymetrix control probes
x <- x[-grep('^AFFX', rownames(x)),]
# transform the expression data to Z scores
x <- t(scale(t(x)))
# extract information of interest from the phenotype data (pdata)
idx <- which(colnames(pData(gset[[1]])) %in%
c('age:ch1', 'distant rfs:ch1', 'er:ch1',
'ggi:ch1', 'grade:ch1', 'node:ch1',
'size:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any(is.na(x)))
metadata <- metadata[!discard,]
# filter the Z-scores expression data to match the samples in our pdata
x <- x[,which(colnames(x) %in% rownames(metadata))]
# check that sample names match exactly between pdata and Z-scores
all((colnames(x) == rownames(metadata)) == TRUE)
Two of the top hits include CXCL12 and MMP10. High
expression of CXCL12 was previously associated with good
progression free and overall survival in breast cancer in (doi:
10.1016/j.cca.2018.05.041.)[https://www.ncbi.nlm.nih.gov/pubmed/29800557]
, whilst high expression of MMP10 was associated with poor
prognosis in colon cancer in (doi:
10.1186/s12885-016-2515-7)[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4950722/].
We can further explore the role of these genes to RFS by dividing their
gene expression Z-scores into tertiles for low, mid, and high
expression:
# extract RFS and probe data for downstream analysis
survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
'X203666_at', 'X205680_at')]
colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
'CXCL12', 'MMP10')
# set Z-scale cut-offs for high and low expression
highExpr <- 1.0
lowExpr <- 1.0
# encode the expression for CXCL12 and MMP10 as low, mid, and high
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
ifelse(x <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
ifelse(x <= lowExpr, 'Low', 'Mid'))
# relevel the factors to have mid as the reference level
survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
levels = c('Mid', 'Low', 'High'))
survplotdata$MMP10 <- factor(survplotdata$MMP10,
levels = c('Mid', 'Low', 'High'))
Plot the survival curves and place Log Rank p-value in the plots:
In this example, we will re-use the Cox data for the purpose of
performing conditional logistic regression with tumour grade as our
grouping / matching factor. For this example, we will use ER status as
the dependent variable and also adjust for age.
x <- exprs(gset[[1]])
x <- x[-grep('^AFFX', rownames(x)),]
x <- scale(x)
x <- x[,which(colnames(x) %in% rownames(metadata))]
coxdata <- data.frame(metadata, t(x))
colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
'GGI', 'Grade', 'Node',
'Size', 'Time.RFS')
coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
coxdata$ER <- as.numeric(coxdata$ER)
coxdata <- coxdata[!is.na(coxdata$ER),]
res6 <- RegParallel(
data = coxdata,
formula = 'ER ~ [*] + Age + strata(Grade)',
FUN = function(formula, data)
clogit(formula = formula,
data = data,
method = 'breslow'),
FUNtype = 'clogit',
variables = colnames(coxdata)[9:ncol(coxdata)],
blocksize = 2000)
subset(res6, P < 0.01)
Oestrogen receptor (ESR1) comes out - makes sense! Also,
although 204667_at is not listed in biomaRt, it overlaps an exon
of FOXA1, which also makes sense in relation to oestrogen
signalling.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 7061891 377.2 11504125 614.4 11504125 614.4
## Vcells 28488913 217.4 72389158 552.3 72389120 552.3
Advanced features
Advanced features include the ability to modify block size, choose
different numbers of cores, enable ‘nested’ parallel processing, modify
limits for confidence intervals, and exclude certain model terms from
output.
Speed up processing
First create some test data for the purpose of benchmarking:
options(scipen=10)
options(digits=6)
# create a data-matrix of 20 x 60000 (rows x cols) random numbers
col <- 60000
row <- 20
mat <- matrix(
rexp(col*row, rate = .1),
ncol = col)
# add fake gene and sample names
colnames(mat) <- paste0('gene', 1:ncol(mat))
rownames(mat) <- paste0('sample', 1:nrow(mat))
# add some fake metadata
modelling <- data.frame(
cell = rep(c('B', 'T'), nrow(mat) / 2),
group = c(rep(c('treatment'), nrow(mat) / 2), rep(c('control'), nrow(mat) / 2)),
dosage = t(data.frame(matrix(rexp(row, rate = 1), ncol = row))),
mat,
row.names = rownames(mat))
Focusing on the elapsed time (as system time only reports time from the
last core that finished), we can see that nested processing has
negligible improvement or may actually be slower under certain
conditions when tested over a small number of variables. This is likely
due to the system being slowed by simply managing the larger number of
threads. Nested processing’s benefits can only be gained when processing
a large number of variables:
Performance is system-dependent and even increasing cores may not result
in huge gains in time. Performance is a trade-off between cores, forked
threads, blocksize, and the number of terms in each model.
Thanks to Horácio Montenegro and GenoMax for testing cross-platform
differences, and Wolfgang Huber for providing the nudge that FDR
correction needed to be implemented.
Thanks to Michael Barnes in London for introducing me to parallel
processing in R.
Finally, thanks to Juan Celedón at Children’s Hospital of Pittsburgh.
Sarega Gurudas, whose suggestion led to the implementation of survey
weights via svyglm.
Blighe, K, and J Lasky-Su. 2018. “RegParallel: Standard regression
functions in R enabled for parallel processing over large data-frames.”
https://github.com/kevinblighe/RegParallel.
Standard regression functions in R enabled for parallel processing over large data-frames ================ Kevin Blighe, Jessica Lasky-Su 2021-08-01
Introduction
In many analyses, a large amount of variables have to be tested independently against the trait/endpoint of interest, and also adjusted for covariates and confounding factors at the same time. The major bottleneck in these is the amount of time that it takes to complete these analyses.
With RegParallel, a large number of tests can be performed simultaneously. On a 12-core system, 144 variables can be tested simultaneously, with 1000s of variables processed in a matter of seconds via ‘nested’ parallel processing.
Works for logistic regression, linear regression, conditional logistic regression, Cox proportional hazards and survival models, and Bayesian logistic regression. Also caters for generalised linear models that utilise survey weights created by the ‘survey’ CRAN package and that utilise ‘survey::svyglm’.
Installation
1. Download the package from Bioconductor
Note: to install development version:
2. Load the package into R session
Quick start
For this quick start, we will follow the tutorial (from Section 3.1) of RNA-seq workflow: gene-level exploratory analysis and differential expression. Specifically, we will load the ‘airway’ data, where different airway smooth muscle cells were treated with dexamethasone.
Normalise the raw counts in DESeq2 and produce regularised log expression levels:
Perform the most basic logistic regression analysis
Here, we fit a binomial logistic regression model to the data via glmParallel, with dexamethasone as the dependent variable.
Perform a basic linear regression
Here, we will perform the linear regression using both glmParallel and lmParallel. We will appreciate that a linear regression is the same using either function with the default settings.
Regularised log expression levels from our DESeq2 data will be used.
Survival analysis via Cox Proportional Hazards regression
For this example, we will load breast cancer gene expression data with recurrence free survival (RFS) from Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. Specifically, we will encode each gene’s expression into Low|Mid|High based on Z-scores and compare these against RFS while adjusting for tumour grade in a Cox Proportional Hazards model.
First, let’s read in and prepare the data:
With the data prepared, we can now apply a Cox Proportional Hazards model independently for each probe in the dataset against RFS.
In this we also increase the default blocksize to 2000 in order to speed up the analysis.
We now take the top probes from the model by Log Rank p-value and use biomaRt to look up the corresponding gene symbols.
not run
Two of the top hits include CXCL12 and MMP10. High expression of CXCL12 was previously associated with good progression free and overall survival in breast cancer in (doi: 10.1016/j.cca.2018.05.041.)[https://www.ncbi.nlm.nih.gov/pubmed/29800557] , whilst high expression of MMP10 was associated with poor prognosis in colon cancer in (doi: 10.1186/s12885-016-2515-7)[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4950722/].
We can further explore the role of these genes to RFS by dividing their gene expression Z-scores into tertiles for low, mid, and high expression:
Plot the survival curves and place Log Rank p-value in the plots:
Perform a conditional logistic regression
In this example, we will re-use the Cox data for the purpose of performing conditional logistic regression with tumour grade as our grouping / matching factor. For this example, we will use ER status as the dependent variable and also adjust for age.
not run
Oestrogen receptor (ESR1) comes out - makes sense! Also, although 204667_at is not listed in biomaRt, it overlaps an exon of FOXA1, which also makes sense in relation to oestrogen signalling.
Advanced features
Advanced features include the ability to modify block size, choose different numbers of cores, enable ‘nested’ parallel processing, modify limits for confidence intervals, and exclude certain model terms from output.
Speed up processing
First create some test data for the purpose of benchmarking:
~2000 tests; blocksize, 500; cores, 2; nestedParallel, TRUE
With 2 cores instead of the default of 3, coupled with nestedParallel being enabled, a total of 2 x 2 = 4 threads will be used.
~2000 tests; blocksize, 500; cores, 2; nestedParallel, FALSE
Focusing on the elapsed time (as system time only reports time from the last core that finished), we can see that nested processing has negligible improvement or may actually be slower under certain conditions when tested over a small number of variables. This is likely due to the system being slowed by simply managing the larger number of threads. Nested processing’s benefits can only be gained when processing a large number of variables:
~40000 tests; blocksize, 2000; cores, 2; nestedParallel, TRUE
~40000 tests; blocksize, 2000; cores, 2; nestedParallel, FALSE
Performance is system-dependent and even increasing cores may not result in huge gains in time. Performance is a trade-off between cores, forked threads, blocksize, and the number of terms in each model.
~40000 tests; blocksize, 5000; cores, 3; nestedParallel, TRUE
In this example, we choose a large blocksize and 3 cores. With nestedParallel enabled, this translates to 9 simultaneous threads.
Modify confidence intervals
Remove some terms from output / include the intercept
Acknowledgments
Thanks to Horácio Montenegro and GenoMax for testing cross-platform differences, and Wolfgang Huber for providing the nudge that FDR correction needed to be implemented.
Thanks to Michael Barnes in London for introducing me to parallel processing in R.
Finally, thanks to Juan Celedón at Children’s Hospital of Pittsburgh.
Sarega Gurudas, whose suggestion led to the implementation of survey weights via svyglm.
Session info
References
Blighe and Lasky-Su (2018)
Blighe, K, and J Lasky-Su. 2018. “RegParallel: Standard regression functions in R enabled for parallel processing over large data-frames.” https://github.com/kevinblighe/RegParallel.