quantro
Why use quantro?
Quantile normalization is one of the most widely used multi-sample normalization tools for the analysis of noisy high-throughput data. Although it was originally developed for gene expression microarrays, it is now used across many different high-throughput applications, including RNAseq and ChIPseq. However, quantile normalization relies on assumptions about the data generation process that are not appropriate in some contexts. Unfortunately, no method exists to check for the appropriateness of these assumptions.
For example, in gene expression studies, we assume that observed differences between the distributions of the samples are due only to technical variation that is unrelated to biological variation. To normalize the samples, their distributions are forced to be the same. In general, this assumption is justified because only a minority of genes are expected to be differentially expressed between samples. If the samples are expected to show global differences across a high percentage of genes, however, quantile normalization may not be appropriate, because it may remove interesting global biological variation.
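To make "the distributions are forced to be the same" concrete, here is a minimal base R sketch of quantile normalization. It is only an illustration and not the implementation used by any particular package; the function name `quantile_normalize` and the simulated `toy` data are assumptions introduced for this example.

```r
# Minimal quantile normalization sketch (base R only). Ties are broken by
# order of appearance, which real implementations may handle differently.
quantile_normalize <- function(x) {
  x <- as.matrix(x)
  # Rank of each value within its own sample (column).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across samples.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value that has the same rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(1)
# Hypothetical data: 100 observations (rows) for 4 samples (columns),
# with the last two samples shifted upward.
toy <- cbind(matrix(rnorm(200), ncol = 2),
             matrix(rnorm(200, mean = 2), ncol = 2))
toy_qn <- quantile_normalize(toy)

# After normalization every column contains the same set of values,
# so the upward shift in the last two samples has been removed.
apply(toy_qn, 2, median)
```

If the shift in the last two samples were biological rather than technical, this step would erase it, which is the situation the paragraph above warns about.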
The quantro R-package can be used to test, prior to the data analysis, whether global normalization methods such as quantile normalization should be applied. Our method uses the raw, unprocessed high-throughput data to test for global differences in the distributions across a set of groups.
For help with the quantro R-package, there is a vignette available in the /vignettes folder.
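Installation
The R-package quantro can be installed from Bioconductor:
    # Install BiocManager first if it is not already available
    if (!requireNamespace("BiocManager", quietly=TRUE))
        install.packages("BiocManager")
    BiocManager::install("quantro")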
After installation, the package can be loaded into R.
    library(quantro)
Using quantro
The main function in the quantro package is quantro(). The quantro() function needs two objects: (1) a data frame containing the samples whose distributions are to be compared, with observations in rows and samples in columns (let’s call it mySamps), and (2) a group-level factor passed as groupFactor (let’s call it outcome). The order of this factor variable must match the order of the columns in the mySamps object, because it records which group each sample is from.
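As a hedged illustration of these two inputs, the sketch below builds a simulated mySamps data frame and a matching outcome factor, then calls quantro() only if the package is installed. The simulated data and the sample and group names are assumptions made for this example. The argument names object and groupFactor are the ones I believe the package documentation uses, so check them against ?quantro.

```r
set.seed(1)
# Hypothetical data: 500 observations (rows) for 6 samples (columns).
mySamps <- as.data.frame(matrix(rnorm(500 * 6), nrow = 500))
colnames(mySamps) <- paste0("sample", 1:6)

# One group label per column of mySamps, in the same order as the columns.
outcome <- factor(rep(c("groupA", "groupB"), each = 3))
stopifnot(length(outcome) == ncol(mySamps))

# Run the test only if the quantro package is installed.
if (requireNamespace("quantro", quietly = TRUE)) {
  library(quantro)
  qtest <- quantro(object = mySamps, groupFactor = outcome)
  qtest
}
```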
Individual slots can be extracted using accessor methods:
    summary(qtest)
    quantroStat(qtest)
When the argument B is not equal to 0, a permutation test is performed to assess the statistical significance of the test statistic quantroStat from quantro().
Elements in the output from quantro() include:

| Element | Description |
| --- | --- |
| summary | A list that contains (1) the number of groups (nGroups), (2) the total number of samples (nTotSamples), and (3) the number of samples in each group (nSamplesinGroups) |
| anova | ANOVA to test if the average medians of the distributions are different across groups |
| MSbetween | Mean squared error between groups |
| MSwithin | Mean squared error within groups |
| quantroStat | Test statistic, which is the ratio of the mean squared error between groups of distributions to the mean squared error within groups of distributions |
| quantroStatPerm | If B is not equal to 0, then a permutation test was performed to assess the statistical significance of quantroStat. These are the test statistics resulting from the permuted samples |
| quantroPvalPerm | If B is not equal to 0, then this is the p-value associated with the proportion of times the test statistics (quantroStatPerm) resulting from the permuted samples were larger than quantroStat |
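The base R sketch below shows the general idea behind a between/within ratio and its permutation p-value. It is a simplified stand-in, not quantro's implementation: it computes an ordinary one-way ANOVA F ratio on one summary value per sample (the median), and the names `ms_ratio`, `permutation_pvalue`, and the simulated data are all assumptions introduced here.

```r
# Illustrative statistic: mean square between groups divided by mean
# square within groups, computed on one summary value per sample.
ms_ratio <- function(values, groups) {
  groups <- as.factor(groups)
  grand_mean <- mean(values)
  group_means <- tapply(values, groups, mean)
  group_sizes <- tapply(values, groups, length)
  ms_between <- sum(group_sizes * (group_means - grand_mean)^2) /
    (nlevels(groups) - 1)
  ms_within <- sum((values - group_means[as.character(groups)])^2) /
    (length(values) - nlevels(groups))
  ms_between / ms_within
}

permutation_pvalue <- function(values, groups, B = 1000) {
  observed <- ms_ratio(values, groups)
  # Recompute the statistic B times after shuffling the group labels.
  permuted <- replicate(B, ms_ratio(values, sample(groups)))
  # p-value: proportion of permuted statistics larger than the observed one.
  list(stat = observed, statPerm = permuted, pval = mean(permuted > observed))
}

set.seed(1)
# Hypothetical data: 3 samples per group, second group shifted upward.
toy <- cbind(matrix(rnorm(300), ncol = 3),
             matrix(rnorm(300, mean = 1), ncol = 3))
toy_groups <- factor(rep(c("groupA", "groupB"), each = 3))
toy_medians <- apply(toy, 2, median)
perm_result <- permutation_pvalue(toy_medians, toy_groups, B = 200)
perm_result$pval
```

With only a few samples per group, the number of distinct label arrangements is small, so the permutation p-value can only take a limited set of values.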
Visualizing the results from the permutation test
A second function in the package, quantroPlot(), plots the results of the permutation test. The plot is a histogram of the test statistics quantroStatPerm from the permuted samples, with a red line marking the observed test statistic quantroStat from quantro().
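For readers who want to see the shape of such a plot without running the package, here is a rough base-graphics sketch. It is not quantroPlot() itself. The function name `plot_permutation_stats` and the simulated values standing in for quantroStatPerm and quantroStat are assumptions made for this example.

```r
# Illustrative plot: histogram of permuted test statistics with a red
# vertical line at the observed statistic.
plot_permutation_stats <- function(stat_perm, stat_observed) {
  h <- hist(stat_perm,
            breaks = 30,
            xlim = range(c(stat_perm, stat_observed)),
            main = "Permutation test",
            xlab = "Test statistic from permuted samples")
  abline(v = stat_observed, col = "red", lwd = 2)
  invisible(h)
}

set.seed(1)
# Hypothetical values standing in for quantroStatPerm and quantroStat.
fake_perm_stats <- rf(1000, df1 = 2, df2 = 20)
fake_observed <- 6.5
plot_permutation_stats(fake_perm_stats, fake_observed)
```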
Additional options for the quantroPlot() function are described on its help page (?quantroPlot).
Bug reports
Report bugs as issues on the GitHub repository
Contributors