terraTCGAdata
Installation
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("terraTCGAdata")
Overview
The terraTCGAdata R package imports TCGA datasets that are available on
the Terra platform as MultiAssayExperiment objects. It also provides
functions for discovering relevant datasets. There is one main function
and two helper functions:
terraTCGAdata creates the MultiAssayExperiment object from the
indicated resources.
The getClinicalTable and getAssayTable functions allow for the
discovery of datasets within the Terra data model. The column names
from these tables can be provided as inputs to the terraTCGAdata
function.
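As a rough illustration of working with such column names, the sketch below filters a character vector with base R. The vector `assay_columns` is a hypothetical stand-in for the column names of a getAssayTable result; its two entries are copied from the example call later in this document, and the variable names are assumptions made for illustration only.

```r
# Hypothetical stand-in for the column names of a getAssayTable() result.
# The two entries are the assay names used in the example later in this document.
assay_columns <- c(
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

# Keep only the columns whose name starts with the RNA-seq prefix.
rnaseq_columns <- grep("^rnaseqv2__", assay_columns, value = TRUE)
print(rnaseq_columns)
```

The selected names could then be passed on to the assays argument of terraTCGAdata.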
Data
Some public Terra workspaces come pre-packaged with TCGA data (i.e.,
cloud data resources are linked within the data model), particularly
the workspaces labelled OpenAccess_V1-0. Datasets harmonized to the
hg38 genome, such as those from the Genomic Data Commons data
repository, use a different data model / workflow and are not compatible
with the functions in this package. For compatible workspaces, we make
use of the Terra data model and represent the data as
MultiAssayExperiment. For more information on MultiAssayExperiment,
please see the vignette in that package.
Requirements
Loading packages
library(AnVIL)
library(terraTCGAdata)
gcloud sdk installation
A valid GCloud SDK installation is required to use the package. To get
set up, see the Bioconductor tutorials for running RStudio on Terra. Use
the gcloud_exists() function from the AnVIL package to check whether it
is installed on your system:
gcloud_exists()
#> [1] TRUE
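If the AnVIL package is not yet installed, a similar check can be sketched in base R. This is not how gcloud_exists() is implemented; it only assumes that the SDK exposes a `gcloud` executable on the PATH, and the variable names are illustrative.

```r
# Base-R sketch (not the AnVIL implementation): look for a gcloud binary on the PATH.
gcloud_path <- Sys.which("gcloud")
has_gcloud <- unname(nzchar(gcloud_path))

if (has_gcloud) {
    message("gcloud found at: ", gcloud_path)
} else {
    message("gcloud not found on the PATH; install the Google Cloud SDK first")
}
```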
You can also use the gcloud_project function to set a project name by
specifying the project argument. Called without arguments, it reports
the current project:
gcloud_project()
#> [1] "bioconductor-rpci-anvil"
Default Data Workspace
To get a table of available TCGA workspaces, use the
selectTCGAworkspace() function.
You can also set the package-wide option with the terraTCGAworkspace
function and check the setting with
getOption('terraTCGAdata.workspace') or by running the
terraTCGAworkspace function.
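The option mechanism can be sketched with base R alone. The example below sets the option directly with options() as a stand-in for calling terraTCGAworkspace; it assumes the default workspace is stored as an ordinary R option under the name quoted above, and the workspace name is the one used in the example later in this document.

```r
# Assumption: the default workspace lives in an ordinary R option named
# 'terraTCGAdata.workspace'. options() is used here in place of
# terraTCGAworkspace() so the sketch runs without Terra access.
old_options <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

# Check the current setting.
current_workspace <- getOption("terraTCGAdata.workspace")
print(current_workspace)

# Restore whatever was set before.
options(old_options)
```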
Clinical data resources
In order to determine what datasets to download, use the
getClinicalTable function to list all of the columns that correspond
to clinical data from the different collection centers.
Assay data resources
We use the same approach for assay data. We first produce a list of
assays with the getAssayTable function and then select one along with
any sample codes of interest.
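For orientation, the two-digit sample type code occupies characters 14 and 15 of a TCGA barcode (for example, "01" is a primary solid tumor and "10" is a blood derived normal). The sketch below extracts the codes with base R; the barcodes are illustrative values, not taken from the workspace, and the variable names are assumptions.

```r
# Illustrative TCGA-style barcodes (not read from any Terra workspace).
barcodes <- c(
    "TCGA-02-0001-01C-01D-0182-01",
    "TCGA-02-0001-10A-01D-0182-01"
)

# Characters 14-15 of a TCGA barcode hold the two-digit sample type code.
sample_codes <- substr(barcodes, 14, 15)
print(sample_codes)

# Keep only primary solid tumor samples (code "01").
primary_tumor <- barcodes[sample_codes == "01"]
print(primary_tumor)
```

Codes obtained this way correspond to the kind of values accepted by the sampleCode argument in the call that follows.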
MultiAssayExperiment
Finally, once you have collected all the relevant column names, pass
them to the main terraTCGAdata function:
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays = c(
        "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
    ),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
#> Using namespace/workspace: broad-firecloud-tcga/TCGA_COAD_OpenAccess_V1-0_DATA
#> Using namespace/workspace: broad-firecloud-tcga/TCGA_COAD_OpenAccess_V1-0_DATA
#> Warning in .checkBarcodes(barcodes): Inconsistent barcode lengths: 27, 28
#> Using namespace/workspace: broad-firecloud-tcga/TCGA_COAD_OpenAccess_V1-0_DATA
#>
#> ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> cols(
#> .default = col_character(),
#> admin.day_of_dcc_upload = col_double(),
#> admin.month_of_dcc_upload = col_double(),
#> admin.year_of_dcc_upload = col_double(),
#> patient.additional_studies = col_logical(),
#> patient.days_to_index = col_double(),
#> patient.samples.sample.additional_studies = col_logical(),
#> patient.samples.sample.biospecimen_sequence = col_logical(),
#> patient.samples.sample.longest_dimension = col_double(),
#> patient.samples.sample.intermediate_dimension = col_double(),
#> patient.samples.sample.shortest_dimension = col_double(),
#> patient.samples.sample.initial_weight = col_double(),
#> patient.samples.sample.current_weight = col_logical(),
#> patient.samples.sample.freezing_method = col_logical(),
#> patient.samples.sample.oct_embedded = col_logical(),
#> patient.samples.sample.preservation_method = col_logical(),
#> patient.samples.sample.tissue_type = col_logical(),
#> patient.samples.sample.composition = col_logical(),
#> patient.samples.sample.tumor_descriptor = col_logical(),
#> patient.samples.sample.days_to_collection = col_double(),
#> patient.samples.sample.time_between_clamping_and_freezing = col_logical()
#> # ... with 1225 more columns
#> )
#> ℹ Use `spec()` for the full column specifications.
#> Warning in .checkBarcodes(barcodes): Inconsistent barcode lengths: 27, 28
#> harmonizing input:
#> removing 455 colData rownames not in sampleMap 'primary'
mae
#> A MultiAssayExperiment object of 2 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 2:
#> [1] protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data: matrix with 200 rows and 4 columns
#> [2] rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data: matrix with 20531 rows and 4 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
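The accessors listed in the printed summary can be tried without Terra access on a small object. The sketch below builds a toy MultiAssayExperiment as a stand-in for the object returned by terraTCGAdata; all names and values in it are made up for illustration, and it assumes the MultiAssayExperiment package is installed.

```r
library(MultiAssayExperiment)

# Toy stand-in for the object returned by terraTCGAdata(); all names and
# values here are invented for illustration.
samples <- paste0("sample", 1:4)
toy_protein <- matrix(
    seq_len(8), nrow = 2,
    dimnames = list(c("proteinA", "proteinB"), samples)
)
toy_rna <- matrix(
    seq_len(12), nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), samples)
)
toy_mae <- MultiAssayExperiment(
    experiments = ExperimentList(list(protein = toy_protein, rnaseq = toy_rna)),
    colData = S4Vectors::DataFrame(
        group = c("a", "a", "b", "b"),
        row.names = samples
    )
)

# Accessors named in the printed summary above.
experiment_names <- names(experiments(toy_mae))  # names of the assays
rna_dims <- dim(toy_mae[["rnaseq"]])             # extract one experiment
groups <- toy_mae$group                          # a colData column
map_rows <- nrow(sampleMap(toy_mae))             # one row per assay column
```

The same calls apply to the mae object created above, with the long assay names in place of the toy ones.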
We expect that most OpenAccess_V1-0 cancer datasets follow this data
model. If you encounter any errors, please provide a minimal
reproducible example at https://github.com/waldronlab/terraTCGAdata.
Clinical data download
After picking a column from the getClinicalTable output, use the column
name as input to the getClinical function to obtain the data.
Summary of sample types in the data
You can get a summary table of all the samples in the data by using the
sampleTypesTable function.
Intermediate function for obtaining only the data
Note that if you have the package-wide option set, the workspace
argument is not needed in the function call.