SGCP: a spectral self-learning method for clustering genes in co-expression networks, link

SGCP Introduction

The Self-training Gene Clustering Pipeline (SGCP) is an innovative framework for constructing and analyzing gene co-expression networks. Its primary objective is to group genes with similar expression patterns into cohesive clusters, often referred to as modules. SGCP introduces several novel steps that enable the computation of highly enriched gene modules in an unsupervised manner. What sets SGCP apart from existing frameworks is its integration of a semi-supervised clustering approach, which leverages Gene Ontology (GO) information. This unique step significantly enhances the quality of the resulting modules, producing highly enriched and biologically relevant clusters.

SGCP Publication

SGCP is available at BMC Bioinformatics.

SGCP Installation

For detailed instructions and steps, please refer to the SGCP manual on Bioconductor page. To install the latest version of SGCP, you can access the GitHub repository using the following command:

#install.packages("devtools")
#devtools::install_github("na396/SGCP")

SGCP license

GPL-3

SGCP encoding

UTF-8

SGCP Input

SGCP requires three main inputs; expData , geneID, and annotation_db.

expData: This is a matrix or dataframe of size m*n where m represents the number of genes and n represents the number of samples. It can contain data from either DNA-microarray or RNA-seq experiments . Note that SGCP assumes that pre-processing steps, such as normalization and batch effect corection, have already been performed, as these are not handled by the pipeline.
geneID: A vector of gene identifier corresponding to the rows in expData.
anotation_db: The name of a genome-wide annotation package for the organism of interest, used in the gene ontology (GO) enrichment step. The annotation_db package must be installed by user prior to using SGCP.

Below are some commonly used annotation_db packages along with their corresponding gene identifiers for different organisms.

organism	annotation_db	gene identifier
Homo sapiens (Hs)	org.Hs.eg.db	Entrez Gene identifiers
Drosophila melanogaster (Dm)	org.Dm.eg.db	Entrez Gene identifiers
Rattus norvegicus (Rn)	org.Rn.eg.db	Entrez Gene identifiers
Mus musculus (Mm)	org.Mm.eg.db	Entrez Gene identifiers
Arabidopsis thaliana (At)	org.At.tair.db	TAIR identifiers

Gene expression datasets for your analysis can be obtained from the Gene Expression Omnibus, a public repository of high-throughput gene expression data.

SGCP Input Cleaning

In SGCP, the following assumptions are made about the input genes:

Genes must have expression values available across all samples, with no missing values.
Genes must exhibit non-zero variance in expression across all samples.
ach gene must have exactly one unique identifier, specified by geneID.
Genes must be annotated with Gene Ontology (GO) terms.

SGCP Input Example

Here, we give a brief example of the SGCP input. For this documentation, we use the gene expression GSE181225. For more information visit its Gene Expression Omnibus page).

Throughout this section, several Bioconductor packages will be required. Make sure to install and load them as needed to follow the example.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("org.Hs.eg.db", "GEOquery", "AnnotationDbi"))

First, set the directory

# Display the current working directory
print(getwd())

# If necessary, change the path below to the directory where the data files are stored.
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = "."
setwd(workingDir)

First, we need to download the gene expression file. The R package GEOquery is used to obtain gene expression data from the Gene Expression Omnibus. For detailed information on how to use GEOquery, refer to the GEOquery guide.

To download the expression file for GSE181225, visit its Gene Expression Omnibus page. On the page, locate the file GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz in the Supplementary files section, which contains the normalized gene expression data. Download this supplementary file and save it to the directory specified by baseDir.


library(GEOquery)

gse = getGEOSuppFiles("GSE181225", baseDir = getwd())

After downloading the file, you should find a new directory named GSE181225, which contains the gene expression file. To proceed, read the gene expression file into R. The file has the following structure:

The Symbol column contains the gene symbols.
The remaining four columns represent different samples.

df = read.delim("GSE181225/GSE181225_LNCaP_p57_VO_and_p57_PIM1_RNA_Seq_normalizedCounts.txt.gz")
head(df)

Next, create the expData, geneID, and annotation_db.

geneID = df[,1]

expData = df[, 2:ncol(df)]
rownames(expData) = geneID

library(org.Hs.eg.db)

To map gene symbols to Entrez identifiers using the annotation_db, you can use the select function from the AnnotationDbi package. Here’s how you can do it in R:

library(AnnotationDbi)

genes = AnnotationDbi::select(org.Hs.eg.db, keys = rownames(expData), 
                      columns=c("ENTREZID"), 
                      keytype="SYMBOL")
# initial dimension
print(dim(genes))
head(genes)

Remove genes with missing SYMBOL or ENTREZID.

genes = genes[!is.na(genes$SYMBOL), ]
genes = genes[!is.na(genes$ENTREZID), ]

#dimension after dropping missing values
print(dim(genes))
head(genes)

Remove genes with duplicated SYMBOL or ENTREZID.

genes = genes[!duplicated(genes$SYMBOL),]
genes = genes[!duplicated(genes$ENTREZID), ]
#dimension after dropping missing values
print(dim(genes))
print(head(genes))

Keep only rows in expData that have corresponding gene identifiers present in genes.

expData = data.frame(expData, SYMBOL = rownames(expData))
expData =  merge(expData, genes, by = "SYMBOL")

Produce expData.

rownames(expData) = expData$ENTREZID
expData = expData[, c(2:6)]
print(head(expData))

Remove genes with zero variance from expData.

# Dropping zero variance genes

vars = apply(expData, 1, var)
zeroInd = which(vars == 0)

if(length(zeroInd) != 0) {
  print(paste0("number of zero variance genes ", length(zeroInd)))
  expData = expData[-zeroInd, ]
  genes = genes[-zeroInd, ]
}

print(paste0("number of genes after dropping ", dim(genes)[1]))

Remove genes with no gene ontology mapping.

## Remove genes with no GO mapping

xx = as.list(org.Hs.egGO[genes$ENTREZID])
haveGO  = sapply(xx,
                 function(x) {if (length(x) == 1 && is.na(x)) FALSE else TRUE })
numNoGO  = sum(!haveGO)
if(numNoGO != 0){
  print(paste0("number of genes with no GO mapping ", length(zeroInd)))
  expData = expData[haveGO, ]
  genes = genes[haveGO, ]
  
}
print(paste0("number of genes after dropping ", dim(genes)[1]))

Produce the final expData, geneID, annotation_db. Now, the input is ready for SGCP. Refer to SGCP Bioconductor page in order to see how to use this input in SGCP.

expData = expData
print(head(expData))

geneID = genes$ENTREZID
print(head(geneID))

annotation_db = "org.Hs.eg.db"