The goal of tcgaCleaneR is to provide a user-friendly R package that
gives bioinformaticians easy access to data wrangling and data analysis
tools for the TCGA Pan Cancer dataset. The package contains a subset of
the original TCGA Breast Cancer data. It also provides a set of
functions that let users identify and handle unwanted variation in
TCGA datasets.
Installation
You can install the development version of tcgaCleaneR from
GitHub with:
TCGA Functionality
This is a quick walkthrough of the tcgaCleaneR functionality. For
detailed information on the data wrangling and data analysis functions
and their arguments, see the package vignette.
At present, cancer biology information is available for only four of
the TCGA Pan Cancer datasets: Breast Cancer (BRCA), Lung Cancer (LUAD),
Colon Cancer (COAD) and Rectum Cancer (READ). As a result, RUV-III
analysis can only be performed for these four cancer types, because the
RUV-III approach used here requires at least one roughly known,
biologically homogeneous subclass of samples shared across the sources
of unwanted variation. Similarly, the vector correlation between
biology and the PCs can only be viewed for these four cancer types.
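To make the idea of a vector correlation between biology and the PCs concrete, here is a minimal sketch in base R. It assumes one common definition, one minus the product of (1 - squared canonical correlation), computed between the first k PCs and dummy-coded subclass labels; whether tcgaCleaneR computes it exactly this way is an assumption. The data, the `subtype` labels and all variable names are hypothetical.

```r
set.seed(4)
n <- 60

# Hypothetical biological subclasses for 60 samples.
subtype <- factor(rep(c("A", "B", "C"), each = 20))

# Hypothetical PC scores (60 samples x 5 PCs); PC1 is made to track subclass B.
pcs <- matrix(rnorm(n * 5), nrow = n)
pcs[, 1] <- pcs[, 1] + 2 * (subtype == "B")

# Dummy-code the subclasses, dropping the intercept column.
dummies <- model.matrix(~ subtype)[, -1, drop = FALSE]

# Vector correlation between the first k PCs and biology, for k = 1..5.
vector_cor <- sapply(seq_len(ncol(pcs)), function(k) {
  cc <- cancor(pcs[, seq_len(k), drop = FALSE], dummies)
  1 - prod(1 - cc$cor^2)
})
```

Values near 1 would suggest that the leading PCs largely reflect the biological subclasses; values near 0 would suggest they do not.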
Study Design Plot
The study design plot presents summarized information about the
filtered dataset as heatmaps.
plotStudyOutline(data = filtered.data3)
PCA
Generate PCA
The principal components (in this context also called singular vectors)
of the sample × transcript array of log-counts are the linear
combinations of the transcript measurements with the largest, second
largest, third largest, etc. variation, each standardized to unit
length and orthogonal to the preceding components. Each component gives
a single value for each sample.
# Check whether the PCA input data is a logical object
is.logical(filtered.data3)
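The definition above can be sketched with base R's `svd()` on a small synthetic matrix. This is an illustration of the general technique, not the package's PCA function; the matrix values and all variable names are made up for the example.

```r
# Toy log-count matrix: 6 samples (rows) x 4 transcripts (columns).
# The values are synthetic and only illustrate the computation.
set.seed(1)
log_counts <- matrix(rnorm(24, mean = 5, sd = 2), nrow = 6,
                     dimnames = list(paste0("sample", 1:6), paste0("gene", 1:4)))

# Centre each transcript so the components describe variation around the mean.
centred <- scale(log_counts, center = TRUE, scale = FALSE)

# Singular value decomposition: centred = U D V'.
sv <- svd(centred)

# Columns of V are the unit-length, mutually orthogonal linear combinations
# of the transcript measurements.
loadings <- sv$v

# Projecting the samples onto them gives one value per sample per component.
scores <- centred %*% loadings

# Share of the total variance carried by each component, largest first.
var_explained <- sv$d^2 / sum(sv$d^2)
```

Each column of `scores` holds the single value per sample that the corresponding component assigns.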
Plot PCA
Once the PCs have been generated with the PCA function, the next step
is to visualize them against sample features such as Time, Tissue and
Plate, and to look for patterns by feature that indicate unwanted
variation.
library(ggplot2)
library(cowplot)
pca.plot.data <- plotPC(
  pca.data = pca_data,
  data = filtered.data3,
  group = "Time",
  plot_type = "DensityPlot",
  pcs.no = c(1, 2, 3)
)
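As a rough illustration of what a per-feature density view shows, the sketch below overlays one density curve of PC1 per group using base R graphics. It does not reproduce the output of `plotPC`; the `time` labels and `pc1` scores are simulated, and one group is shifted on purpose.

```r
set.seed(2)

# Hypothetical inputs: a made-up "Time" label and PC1 scores for 60 samples.
time <- factor(rep(c("2010", "2011", "2012"), each = 20))
pc1 <- rnorm(60, mean = c(0, 0, 3)[as.integer(time)])  # the 2012 group is shifted

# One kernel density estimate of PC1 per group.
dens <- lapply(split(pc1, time), density)

# Overlay the curves on an empty canvas sized to fit all of them.
plot(0, 0, type = "n",
     xlim = range(pc1) + c(-1, 1),
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
     xlab = "PC1", ylab = "Density")
for (i in seq_along(dens)) lines(dens[[i]], col = i)
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1,
       title = "Time")

# Group means summarize the same shift numerically.
group_means <- tapply(pc1, time, mean)
```

A group whose curve sits apart from the others, as the simulated 2012 group does here, is the kind of pattern that points to variation associated with that feature.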
PCs correlation with unwanted variations
library(tidyverse)
corr_data <- plotPCsVar(pca.data = pca_data, data = filtered.data3, type = "purity", nPCs = 7)
corr_data
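For a continuous variable such as tumor purity, one common way to measure its association with the PCs is the R² from regressing the variable on the first k PCs, for increasing k. The sketch below assumes that approach; whether `plotPCsVar` computes it this way is an assumption, and the `pcs` matrix and `purity` values are simulated.

```r
set.seed(3)
n <- 50

# Hypothetical inputs: PC scores for 50 samples (7 PCs) and one purity value each.
pcs <- matrix(rnorm(n * 7), nrow = n, dimnames = list(NULL, paste0("PC", 1:7)))
purity <- 0.6 + 0.1 * pcs[, 1] + rnorm(n, sd = 0.05)  # purity made to track PC1

# R-squared from regressing purity on the first k PCs, for k = 1..7.
r_squared <- sapply(seq_len(ncol(pcs)), function(k) {
  fit <- lm(purity ~ pcs[, seq_len(k), drop = FALSE])
  summary(fit)$r.squared
})
names(r_squared) <- colnames(pcs)
```

A high R² for the first few PCs would indicate that the variable explains a large part of the leading variation in the data.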
tcgaCleaneR
TCGA Functionality
Data
Data Wrangling
Gene Filter
Removing lowly expressed genes
Purity Filter - Filter Samples based on Tumor Purity
Library Size Filter
Determine Library Size
Filter samples based on library size
Data Analysis
Study Design Plot
PCA
Generate PCA
Plot PCA
PCs correlation with unwanted variations
RUV - III
Pseudo Replicate Pseudo Sample(PRPS) Map
PRPS Generation
RUV-III
Combined Analysis
Combined data
PCA on Combined Data
PCs correlation with unwanted variations in Combined Data