Ballgown is a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. It also provides functions to organize, visualize, and analyze the expression measurements for your transcriptome assembly.
Before using the Ballgown R package, a few preprocessing steps are necessary:
RNA-Seq reads should be aligned to a reference genome.
A transcriptome should be assembled, or a reference transcriptome should be downloaded.
Expression for the features (transcript, exon, and intron junctions) in the transcriptome should be estimated in a Ballgown readable format.
Two sample pipelines for preprocessing are as follows:
Pipeline 1:TopHat2 (1) + Stringtie (2,3)
TopHat2 [Trapnell et al. (2009)] is built on the ultrafast short read mapping program Bowtie and aligns RNA-Seq reads to a genome while identifying exonic splice junctions. Sample command:
tophat2 -G reference.gff -o outputDirectory -p 6 referenceIndex reads
Stringtie [M. Pertea et al. (2015)] is a highly efficient assembler for RNA-Seq alignments using a novel network flow algorithm. It simultaneously assembles and quantifies expression levels for the features of the transcriptome in a Ballgown readable format (by using the option -B). One command to Stringtie satisfies steps 2 and 3 above. Sample command: stringtie -B -G reference.gff -p 6 accepted_hits.bam -o stringtie.gff
Cufflinks [Trapnell et al. (2010)] also assembles transcriptomes from RNA-Seq data and quantifies their expression. Sample command:
cufflinks -g reference.gff -o outputDirectory accepted_hits.bam
Tablemaker calls Cufflinks to estimate feature expressions in a Ballgown readable format. Tablemaker access and instructions can be found here.
Installation
Start R and run:
if (!requireNamespace("BiocManager", quietly=TaRUE))
install.packages("BiocManager")
BiocManager::install("ballgown")
Ballgown readable expression output
The files that Stringtie and Tablemaker produce for Ballgown to load is as follows:
e_data.ctab: exon-level expression measurements. One row per exon. Columns are e_id (numeric exon id), chr, strand, start, end (genomic location of the exon), and the following expression measurements for each sample:
rcount: reads overlapping the exon
ucount: uniquely mapped reads overlapping the exon
mrcount: multi-map-corrected number of reads overlapping the exon
cov average per-base read coverage
cov_sd: standard deviation of per-base read coverage
mcov: multi-map-corrected average per-base read coverage
mcov_sd: standard deviation of multi-map-corrected per-base coverage
i_data.ctab: intron- (i.e., junction-) level expression measurements. One row per intron. Columns are i_id (numeric intron id), chr, strand, start, end (genomic location of the intron), and the following expression measurements for each sample:
rcount: number of reads supporting the intron
ucount: number of uniquely mapped reads supporting the intron
mrcount: multi-map-corrected number of reads supporting the intron
t_data.ctab: transcript-level expression measurements. One row per transcript. Columns are:
t_id: numeric transcript id
chr, strand, start, end: genomic location of the transcript
t_name: Cufflinks-generated transcript id
num_exons: number of exons comprising the transcript
length: transcript length, including both exons and introns
gene_id: gene the transcript belongs to
gene_name: HUGO gene name for the transcript, if known
cov: per-base coverage for the transcript (available for each sample)
FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)
e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts. These ids match the ids in the e_data and t_data tables.
i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts. These ids match the ids in the i_data and t_data tables.
Loading data into R
The default directory structure produced by Stringtie and Tablemaker should mirror the extdata folder in the Ballgown pacakge:
If your data is stored in directories matching the above structure (one root folder, subfolders named by sample, and .ctab files in the subfolders), you can use the dataDir and samplePattern arguments to load the data. samplePattern takes a regular expressions specifying the subfolders that should be included in the ballgown object:
# make the ballgown object:
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
bg
## ballgown instance with 100 transcripts and 20 samples
If your data is stored in a directory structure other than the one specified above, you can use the samples argument in the ballgown function: samples should be a vector (1-d array) with one entry per sample, where the entry gives the path to the folder containing that sample’s .ctab files.
The result from either of these approaches is an object of class ballgown (named bg in these examples).
In the rest of this document, we use bg to refer to the first example, where samples are named sample01 through sample20.
A note for large experiments (with many samples or with large genomes): loading the data might require a lot of time and memory. In these cases, it’s often useful to do the data loading in non-interactive mode. More specifically, you could create a script called load.R that contains these lines:
You could then run this script non-interactively using R CMD BATCH: from the command line, run:
R CMD BATCH load.R
This may take some time, but when it finishes, the file bg.rda will be saved in the current directory, and you can read it back into R using the load() function. Rda files are usually only a few Gb on disk, even for large experiments. It is also possible to load only a subset of all the expression measurements by changing the meas argument to the ballgown function. For example, to only load transcript-level FPKMs, set meas = 'FPKM' and to load average coverage values and read counts, set meas=c('cov', 'rcount').
See ?ballgown for detailed information on creating Ballgown objects.
Accessing assembly data
A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.
structure
The structure slot depends heavily on the GenomicRanges Bioconductor package (Lawrence et al. (2013)). The slot specifies the structure, i.e., genomic locations and relationships between exons, introns, and transcripts, of the transcriptome assembly. It is convenient to represent exons and introns as intervals and to represent transcripts as a set of intervals (exons), so assembled exons and introns are available as GRanges objects, and the assembled transcripts are available as a GRangesList object. This means that useful range operations, such as findOverlaps and reduce, are readily available for assembled features.
Exon, intron, and transcript structures are easily extracted from the main ballgown object:
The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctabTablemaker output files. Ballgown implements the following syntax to access components of the expr slot:
where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:
Calculating the gene-level expression measurements can be slow for large experiments.
The *expr functions return matrices unless meas = 'all', in which case some additional feature metadata is returned and the result is a data.frame.
indexes
The indexes slot of a ballgown object connects the pieces of the assembly and provides other experimental information. indexes(bg) is a list with several components that can be extracted with the $ operator.
Perhaps most importantly, there is a component called pData that should hold a data frame of phenotype information for the samples in the experiment. This must be created manually. It is very important that the rows of pData are in the correct order. Each row corresponds to a sample, and the rows of pData should be ordered the same as the tables in the expr slot. You can check that order by running sampleNames(bg). The pData component can be added during construction (you can pass a data frame to the ballgown function), or you can add it later:
The other components of indexes are the e2t and i2t tables described in the Tablemaker section, as well as a t2g table denoting which transcripts belong to which genes. There is also a bamfiles component, designed to hold paths to the read alignment files for each sample. The bamfiles component isn’t currently used by any ballgown functions, but it could come in handy for users of RSamtools or similar packages. Here are some examples of how to extract indexes components from ballgown objects:
The mergedDate slot indicates when the ballgown object was created:
bg@mergedDate
## [1] "Sun Feb 22 23:51:45 2015"
And the meas slot gives the expression measurements present in the object:
bg@meas
## [1] "all"
Plotting transcript structures
Visualization of the assembled transcripts is done with the plotTranscripts function. Transcripts or exons can be colored by expression level. This plot colors transcripts by expression level:
Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).
The default statistical test in ballgown is a parametric F-test comparing nested linear models; details are available in the Ballgown manuscript (Frazee et al. (2014)). These models are conceptually simialar to the models used by Smyth (2005) in the limma package. In limma, more sophisticated empirical Bayes shrinkage methods are used, and generally a single linear model is fit per feature instead of doing a nested model comparison, but the flavor is similar (and in fact, limma can easily be run on any of the data matrices in a ballgown object).
Ballgown’s statistical models are implemented with the stattest function. Two models are fit to each feature, using expression as the outcome: one including the covariate of interest (e.g., case/control status or time) and one not including that covariate. An F statistic and p-value are calculated using the fits of the two models. A significant p-value means the model including the covariate of interest fits significantly better than the model without that covariate, indicating differential expression. We adjust for multiple testing by reporting q-values (Storey & Tibshirani (2003)) for each transcript in addition to p-values: reporting features with, say, q < 0.05 means the false discovery rate should be controlled at about 5%.
stattest automatically handles two-group (e.g. case/control) comparisons, multi-group comparisons (e.g. comparison of several tissue types), and “timecourse” comparisons (with the scare quotes meaning that these comparisons are also applicable to continuous covariates that aren’t time). For two- and multi-group comparisons, a significant result indicates that the feature is differentially expressed in at least one of the groups. For timecourse comparisons, significant results mean the feature has an expression profile that varies significantly over time (i.e., values of the continuous covariate) as opposed to being flat over time.
The example dataset bg contains two group labels, 0 and 1. We can test each transcript for differential expression with stattest:
The result is a data frame containing the feature tested, feature ids, and corresponding p- and q-values. See ?stattest for further usage details.
timecourse experiments
For timecourse experiments, a smooth curve is fit to time (or the continuous covariate) using natural splines. The default degrees of freedom used for the spline model is 4, but this can be adjusted with the df option. The model for expression including these spline terms is compared to a model without any spline terms for the F-test. The results indicate which features’ expression levels change significantly over time. For our example, we can define a “time” covariate and then demonstrate a typical call to stattest for a timecourse experiment:
The timecourse option assumes that “time” in your study is truly continuous, i.e., that it takes several values along a time scale. If you have very few timepoints (e.g., fewer than 5), we recommend treating time as a categorical variable, since having very few values does not give much granularity for fitting a smooth curve using splines. You can do this by setting covariate equal to ‘time’ (or whatever your time variable is named) and simply leaving timecourse as FALSE, its default. If you don’t have more timepoints than degrees of freedom in the spline model, a warning will be printed and time will be coerced to categorical.
adjusting for confounders
You can adjust for any or all variables in pData when testing for differential expression. Ballgown automatically adjusts for library size using the sum of all logged nonzero expression measurements below the 75th percentile of those measurements, for each sample. If you would like to adjust for other variables, just provide those confounders as the adjustvars argument to stattest:
It is also possible to explicitly provide the design matrices for the models to be compared. For example, suppose we had sex and age information available, in addition to group and time, and we wanted to compare a model including all information (sex, age, group, time) to a model including only group and time. Code to do this with ballgown is:
# create example data:
set.seed(43)
sex = sample(c('M','F'), size=nrow(pData(bg)), replace=TRUE)
age = sample(21:52, size=nrow(pData(bg)), replace=TRUE)
# create design matrices:
mod = model.matrix(~ sex + age + pData(bg)$group + pData(bg)$time)
mod0 = model.matrix(~ pData(bg)$group + pData(bg)$time)
# run differential expression tests:
adjusted_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod)
head(adjusted_results)
Ballgown’s statistical methods for differential expression testing are straightforward and accurate (Frazee et al. (2014)), but users may wish to use one of the many existing packages for differential expression. Ballgown’s data structures make it easy to use table-based packages like limma (Smyth (2005)), limma Voom (Law et al. (2014)), DESeq (Anders & Huber (2010)), DEXSeq (Anders et al. (2012)), or EdgeR (Robinson et al. (2010)) for differential expression analysis. A feature-by-sample expression table can be easily created with a *expr function and used directly as input to these or other differential expression packages.
Simple transcript clustering
Sometimes several very similar transcripts are assembled for the same gene, which might cause expression estimates for those transcripts to be unreliable: statistically, it can very difficult or impossible to tell which of two very similar transcript a read came from. This means differential expression results might also be unreliable.
As a preliminary attempt at addressing this issue, Ballgown provides some simple transcript clustering functions. The idea is that similar assembled transcripts can be grouped together in clusters, and differential expression analysis could be performed on the cluster, whose expression measurement aggregates the expression estimates of the transcripts that compose it.
These functions measure the distance between transcripts using Jaccard distance, where each transcript’s “set” is the nucleotides included in its exons. Transcripts can be clustered using either k-means clustering or hierarchical clustering.
And you can calculate aggregate cluster expression measurements for some gene using collapseTranscripts. The tab result of collapseTranscripts can be passed to stattest as the gowntable argument, for differential expression analysis of the clusters:
This example clustered only three transcripts, but we imagine clustering could be useful when many more than three transcripts have been assembled for a single gene.
References
Simon Anders, Wolfgang Huber (2010). Differential Expression Analysis For Sequence Count Data. Genome Biology11 R106-NA 10.1186/gb-2010-11-10-r106
Simon Anders, Alejandro Reyes, Wolfgang Huber (2012). Detecting Differential Usage of Exons From RNA-Seq Data. Genome Research22 2008-2017 10.1101/gr.133744.111
Alyssa C. Frazee, Geo Pertea, Andrew E. Jaffe, Ben Langmead, Steven L. Salzberg, Jeffrey T. Leek (2014). Flexible isoform-level differential expression analysis with Ballgown. bioRxivhttp://biorxiv.org/content/early/2014/03/30/003665
Charity W Law, Yunshun Chen, Wei Shi, Gordon K Smyth, (2014) Voom: Precision Weights Unlock Linear Model Analysis Tools For RNA-Seq Read Counts. Genome Biology15 R29-NA 10.1186/gb-2014-15-2-r29
Michael Lawrence, Wolfgang Huber, Herve Pages, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T. Morgan, Vincent J. Carey, Andreas Prlic (2013). Software For Computing And Annotating Genomic Ranges. Plos Computational Biology9 e1003118-NA 10.1371/journal.pcbi.1003118
Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell, Steven L Salzberg (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnologyhttp://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html
Mark D. Robinson, Davis J. McCarthy and Gordon K. Smyth (2009) “edgeR: a
Bioconductor package for differential expression analysis of
digital gene expression data”. Bioinformatics26.1 pp. 139-140. 10.1093/bioinformatics/btp616
Gordon Smyth, (2005) Limma: linear models for microarray data. 397-420
John D. Storey, Robert Tibshirani (2003). Statistical Significance For Genomewide Studies. Proceedings of The National Academy of Sciences100 9440-9445 10.1073/pnas.1530509100
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, Lior Pachter (2010) Transcript Assembly And Quantification by Rna-Seq Reveals Unannotated Transcripts And Isoform Switching During Cell Differentiation. Nature Biotechnology28 511-515 10.1038/nbt.1621
Introduction and Preprocessing
Ballgown is a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. It also provides functions to organize, visualize, and analyze the expression measurements for your transcriptome assembly.
Before using the Ballgown R package, a few preprocessing steps are necessary:
Two sample pipelines for preprocessing are as follows:
tophat2 -G reference.gff -o outputDirectory -p 6 referenceIndex readsstringtie -B -G reference.gff -p 6 accepted_hits.bam -o stringtie.gffcufflinks -g reference.gff -o outputDirectory accepted_hits.bamInstallation
Start R and run:
Ballgown readable expression output
The files that Stringtie and Tablemaker produce for Ballgown to load is as follows:
e_data.ctab: exon-level expression measurements. One row per exon. Columns aree_id(numeric exon id),chr,strand,start,end(genomic location of the exon), and the following expression measurements for each sample:rcount: reads overlapping the exonucount: uniquely mapped reads overlapping the exonmrcount: multi-map-corrected number of reads overlapping the exoncovaverage per-base read coveragecov_sd: standard deviation of per-base read coveragemcov: multi-map-corrected average per-base read coveragemcov_sd: standard deviation of multi-map-corrected per-base coveragei_data.ctab: intron- (i.e., junction-) level expression measurements. One row per intron. Columns arei_id(numeric intron id),chr,strand,start,end(genomic location of the intron), and the following expression measurements for each sample:rcount: number of reads supporting the intronucount: number of uniquely mapped reads supporting the intronmrcount: multi-map-corrected number of reads supporting the intront_data.ctab: transcript-level expression measurements. One row per transcript. Columns are:t_id: numeric transcript idchr,strand,start,end: genomic location of the transcriptt_name: Cufflinks-generated transcript idnum_exons: number of exons comprising the transcriptlength: transcript length, including both exons and intronsgene_id: gene the transcript belongs togene_name: HUGO gene name for the transcript, if knowncov: per-base coverage for the transcript (available for each sample)FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)e2t.ctab: table with two columns,e_idandt_id, denoting which exons belong to which transcripts. These ids match the ids in thee_dataandt_datatables.i2t.ctab: table with two columns,i_idandt_id, denoting which introns belong to which transcripts. These ids match the ids in thei_dataandt_datatables.Loading data into R
The default directory structure produced by Stringtie and Tablemaker should mirror the
extdatafolder in the Ballgown pacakge:Data is loaded using the
ballgownfunction.If your data is stored in directories matching the above structure (one root folder, subfolders named by sample, and
.ctabfiles in the subfolders), you can use thedataDirandsamplePatternarguments to load the data.samplePatterntakes a regular expressions specifying the subfolders that should be included in the ballgown object:If your data is stored in a directory structure other than the one specified above, you can use the
samplesargument in theballgownfunction:samplesshould be a vector (1-d array) with one entry per sample, where the entry gives the path to the folder containing that sample’s.ctabfiles.The result from either of these approaches is an object of class
ballgown(namedbgin these examples).In the rest of this document, we use
bgto refer to the first example, where samples are namedsample01throughsample20.A note for large experiments (with many samples or with large genomes): loading the data might require a lot of time and memory. In these cases, it’s often useful to do the data loading in non-interactive mode. More specifically, you could create a script called
load.Rthat contains these lines:You could then run this script non-interactively using
R CMD BATCH: from the command line, run:This may take some time, but when it finishes, the file
bg.rdawill be saved in the current directory, and you can read it back into R using theload()function. Rda files are usually only a few Gb on disk, even for large experiments. It is also possible to load only a subset of all the expression measurements by changing themeasargument to theballgownfunction. For example, to only load transcript-level FPKMs, setmeas = 'FPKM'and to load average coverage values and read counts, setmeas=c('cov', 'rcount').See
?ballgownfor detailed information on creating Ballgown objects.Accessing assembly data
A
ballgownobject has six slots:structure,expr,indexes,dirs,mergedDate, andmeas.structure
The
structureslot depends heavily on theGenomicRangesBioconductor package (Lawrence et al. (2013)). The slot specifies the structure, i.e., genomic locations and relationships between exons, introns, and transcripts, of the transcriptome assembly. It is convenient to represent exons and introns as intervals and to represent transcripts as a set of intervals (exons), so assembled exons and introns are available asGRangesobjects, and the assembled transcripts are available as aGRangesListobject. This means that useful range operations, such asfindOverlapsandreduce, are readily available for assembled features.Exon, intron, and transcript structures are easily extracted from the main
ballgownobject:expr
The
exprslot is a list that contains tables of expression data for the genomic features. These tables are very similar to the*_data.ctabTablemaker output files. Ballgown implements the following syntax to access components of theexprslot:where
*is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate.ctabfile. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from thebgballgown object:Calculating the gene-level expression measurements can be slow for large experiments.
The
*exprfunctions return matrices unlessmeas = 'all', in which case some additional feature metadata is returned and the result is adata.frame.indexes
The
indexesslot of a ballgown object connects the pieces of the assembly and provides other experimental information.indexes(bg)is a list with several components that can be extracted with the$operator.Perhaps most importantly, there is a component called
pDatathat should hold a data frame of phenotype information for the samples in the experiment. This must be created manually. It is very important that the rows of pData are in the correct order. Each row corresponds to a sample, and the rows of pData should be ordered the same as the tables in theexprslot. You can check that order by runningsampleNames(bg). ThepDatacomponent can be added during construction (you can pass a data frame to theballgownfunction), or you can add it later:The other components of
indexesare thee2tandi2ttables described in the Tablemaker section, as well as at2gtable denoting which transcripts belong to which genes. There is also abamfilescomponent, designed to hold paths to the read alignment files for each sample. Thebamfilescomponent isn’t currently used by any ballgown functions, but it could come in handy for users ofRSamtoolsor similar packages. Here are some examples of how to extractindexescomponents from ballgown objects:other slots
The
dirsslot gives full filepaths to Tablemaker output:The
mergedDateslot indicates when theballgownobject was created:And the
measslot gives the expression measurements present in the object:Plotting transcript structures
Visualization of the assembled transcripts is done with the
plotTranscriptsfunction. Transcripts or exons can be colored by expression level. This plot colors transcripts by expression level:It is also possible to plot several samples at once:
You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):
Differential expression analysis
Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).
The default statistical test in ballgown is a parametric F-test comparing nested linear models; details are available in the Ballgown manuscript (Frazee et al. (2014)). These models are conceptually simialar to the models used by Smyth (2005) in the
limmapackage. Inlimma, more sophisticated empirical Bayes shrinkage methods are used, and generally a single linear model is fit per feature instead of doing a nested model comparison, but the flavor is similar (and in fact,limmacan easily be run on any of the data matrices in aballgownobject).Ballgown’s statistical models are implemented with the
stattestfunction. Two models are fit to each feature, using expression as the outcome: one including the covariate of interest (e.g., case/control status or time) and one not including that covariate. An F statistic and p-value are calculated using the fits of the two models. A significant p-value means the model including the covariate of interest fits significantly better than the model without that covariate, indicating differential expression. We adjust for multiple testing by reporting q-values (Storey & Tibshirani (2003)) for each transcript in addition to p-values: reporting features with, say, q < 0.05 means the false discovery rate should be controlled at about 5%.stattestautomatically handles two-group (e.g. case/control) comparisons, multi-group comparisons (e.g. comparison of several tissue types), and “timecourse” comparisons (with the scare quotes meaning that these comparisons are also applicable to continuous covariates that aren’t time). For two- and multi-group comparisons, a significant result indicates that the feature is differentially expressed in at least one of the groups. For timecourse comparisons, significant results mean the feature has an expression profile that varies significantly over time (i.e., values of the continuous covariate) as opposed to being flat over time.The example dataset
bgcontains two group labels, 0 and 1. We can test each transcript for differential expression withstattest:The result is a data frame containing the feature tested, feature ids, and corresponding p- and q-values. See
?stattestfor further usage details.timecourse experiments
For timecourse experiments, a smooth curve is fit to time (or the continuous covariate) using natural splines. The default degrees of freedom used for the spline model is 4, but this can be adjusted with the
dfoption. The model for expression including these spline terms is compared to a model without any spline terms for the F-test. The results indicate which features’ expression levels change significantly over time. For our example, we can define a “time” covariate and then demonstrate a typical call tostattestfor a timecourse experiment:The timecourse option assumes that “time” in your study is truly continuous, i.e., that it takes several values along a time scale. If you have very few timepoints (e.g., fewer than 5), we recommend treating time as a categorical variable, since having very few values does not give much granularity for fitting a smooth curve using splines. You can do this by setting covariate equal to ‘time’ (or whatever your time variable is named) and simply leaving timecourse as FALSE, its default. If you don’t have more timepoints than degrees of freedom in the spline model, a warning will be printed and time will be coerced to categorical.
adjusting for confounders
You can adjust for any or all variables in
pDatawhen testing for differential expression. Ballgown automatically adjusts for library size using the sum of all logged nonzero expression measurements below the 75th percentile of those measurements, for each sample. If you would like to adjust for other variables, just provide those confounders as theadjustvarsargument tostattest:defining custom models
It is also possible to explicitly provide the design matrices for the models to be compared. For example, suppose we had sex and age information available, in addition to group and time, and we wanted to compare a model including all information (sex, age, group, time) to a model including only group and time. Code to do this with
ballgownis:Using alternative statistical methods
Ballgown’s statistical methods for differential expression testing are straightforward and accurate (Frazee et al. (2014)), but users may wish to use one of the many existing packages for differential expression. Ballgown’s data structures make it easy to use table-based packages like limma (Smyth (2005)), limma Voom (Law et al. (2014)), DESeq (Anders & Huber (2010)), DEXSeq (Anders et al. (2012)), or EdgeR (Robinson et al. (2010)) for differential expression analysis. A feature-by-sample expression table can be easily created with a
*exprfunction and used directly as input to these or other differential expression packages.Simple transcript clustering
Sometimes several very similar transcripts are assembled for the same gene, which might cause expression estimates for those transcripts to be unreliable: statistically, it can very difficult or impossible to tell which of two very similar transcript a read came from. This means differential expression results might also be unreliable.
As a preliminary attempt at addressing this issue, Ballgown provides some simple transcript clustering functions. The idea is that similar assembled transcripts can be grouped together in clusters, and differential expression analysis could be performed on the cluster, whose expression measurement aggregates the expression estimates of the transcripts that compose it.
These functions measure the distance between transcripts using Jaccard distance, where each transcript’s “set” is the nucleotides included in its exons. Transcripts can be clustered using either k-means clustering or hierarchical clustering.
You can also visualize the transcript clusters:
And you can calculate aggregate cluster expression measurements for some gene using
collapseTranscripts. Thetabresult ofcollapseTranscriptscan be passed tostattestas thegowntableargument, for differential expression analysis of the clusters:This example clustered only three transcripts, but we imagine clustering could be useful when many more than three transcripts have been assembled for a single gene.
References
Session Information