ggcoverage - Visualize and annotate omics coverage with ggplot2
Introduction
The goal of ggcoverage is to visualize coverage tracks from genomics,
transcriptomics or proteomics data. It contains functions to load data
from BAM, BigWig, BedGraph, txt, or xlsx files, create genome/protein
coverage plots, and add various annotations including base and amino
acid composition, GC content, copy number variation (CNV), genes,
transcripts, ideograms, peak highlights, HiC contact maps, contact links
and protein features. It is based on and integrates well with ggplot2.
It contains three main parts:
Load the data: ggcoverage can load BAM, BigWig (.bw), BedGraph,
txt/xlsx files from various omics data, including WGS, RNA-seq,
ChIP-seq, ATAC-seq, proteomics, et al.
Create omics coverage plot
Add annotations: ggcoverage supports six different annotations:
base and amino acid annotation: Visualize genome coverage at
single-nucleotide level with bases and amino acids.
GC annotation: Visualize genome coverage with GC content
CNV annotation: Visualize genome coverage with copy number
variation (CNV)
gene annotation: Visualize genome coverage across genes
transcription annotation: Visualize genome coverage across
different transcripts
ideogram annotation: Visualize the region showing on whole
chromosome
peak annotation: Visualize genome coverage and peak identified
contact map annotation: Visualize genome coverage with Hi-C
contact map
link annotation: Visualize genome coverage with contacts
peotein feature annotation: Visualize protein coverage with
features
Installation
ggcoverage is an R package distributed as part of the CRAN
repository. To install the package, start
R and enter one of the following commands:
# install via CRAN (not yet available)
install.packages("ggcoverage")
# OR install via Github
install.package("remotes")
remotes::install_github("showteeth/ggcoverage")
In general, it is recommended to install from the Github
repository (updated more
regularly).
Once ggcoverage is installed, it can be loaded like every other
package:
# load metadata
meta_file <-
system.file("extdata", "RNA-seq", "meta_info.csv", package = "ggcoverage")
sample_meta <- read.csv(meta_file)
sample_meta
#> SampleName Type Group
#> 1 ERR127302_chr14 KO_rep1 KO
#> 2 ERR127303_chr14 KO_rep2 KO
#> 3 ERR127306_chr14 WT_rep1 WT
#> 4 ERR127307_chr14 WT_rep2 WT
Load track files:
# track folder
track_folder <- system.file("extdata", "RNA-seq", package = "ggcoverage")
# load bigwig file
track_df <- LoadTrackFile(
track.folder = track_folder,
format = "bw",
region = "chr14:21,677,306-21,737,601",
extend = 2000,
meta.info = sample_meta
)
# check data
head(track_df)
#> seqnames start end width strand score Type Group
#> 1 chr14 21675306 21675950 645 * 0 KO_rep1 KO
#> 2 chr14 21675951 21676000 50 * 1 KO_rep1 KO
#> 3 chr14 21676001 21676100 100 * 2 KO_rep1 KO
#> 4 chr14 21676101 21676150 50 * 1 KO_rep1 KO
#> 5 chr14 21676151 21677100 950 * 0 KO_rep1 KO
#> 6 chr14 21677101 21677200 100 * 2 KO_rep1 KO
Prepare mark region:
# create mark region
mark_region <- data.frame(
start = c(21678900, 21732001, 21737590),
end = c(21679900, 21732400, 21737650),
label = c("M1", "M2", "M3")
)
# check data
mark_region
#> start end label
#> 1 21678900 21679900 M1
#> 2 21732001 21732400 M2
#> 3 21737590 21737650 M3
Load GTF
To add gene annotation, the gtf file should contain gene_type
and gene_name attributes in column 9; to add transcript
annotation, the gtf file should contain a transcript_name
attribute in column 9.
The ideogram is an overview plot about the respective position on a
chromosome. The plotting of the ideogram is implemented by the ggbio
package. This package needs to be installed separately (it is only
‘Suggested’ by ggcoverage).
library(ggbio)
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#> get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#> match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: ggplot2
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
#> Need specific help about ggbio? try mailing
#> the maintainer or visit https://lawremi.github.io/ggbio/
#>
#> Attaching package: 'ggbio'
#> The following objects are masked from 'package:ggplot2':
#>
#> geom_bar, geom_rect, geom_segment, ggsave, stat_bin, stat_identity,
#> xlim
basic_coverage +
geom_gene(gtf.gr = gtf_gr) +
geom_ideogram(genome = "hg19", plot.space = 0)
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
basic_coverage +
geom_transcript(gtf.gr = gtf_gr, label.vjust = 1.5) +
geom_ideogram(genome = "hg19", plot.space = 0)
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
DNA-seq data
CNV
Example 1
Load the data
The DNA-seq data used here are from Copy number work
flow,
we select tumor sample, and get bin counts with
cn.mops::getReadCountsFromBAM with WL 1000.
basic_coverage <- ggcoverage(
data = track_df,
color = "grey",
mark.region = NULL,
range.position = "out"
)
basic_coverage
Add GC annotations
Add GC, ideogram and gene annotations. The plotting of the
GC content requires the genome annotation package
BSgenome.Hsapiens.UCSC.hg19. This package needs to be installed
separately (it is only ‘Suggested’ by ggcoverage).
# load genome data
library("BSgenome.Hsapiens.UCSC.hg19")
#> Loading required package: BSgenome
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#>
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#>
#> FileForFormat
# create plot
basic_coverage +
geom_gc(bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19) +
geom_gene(gtf.gr = gtf_gr) +
geom_ideogram(genome = "hg19")
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
For base and amino acid annotation, the package comes with the following
default color schemes. Color schemes can be changed with nuc.color and
aa.color parameters.
THe default color scheme for base annotation is Clustal-style, more
popular color schemes are available
here.
# color scheme
nuc_color <- c(
"A" = "#ff2b08", "C" = "#009aff", "G" = "#ffb507", "T" = "#00bc0d"
)
# create plot
graphics::image(
seq_along(nuc_color),
1,
as.matrix(seq_along(nuc_color)),
col = nuc_color,
xlab = "",
ylab = "",
xaxt = "n",
yaxt = "n",
bty = "n"
)
graphics::text(seq_along(nuc_color), 1, names(nuc_color))
graphics::mtext(
text = "Base",
adj = 1,
las = 1,
side = 2
)
# create plot with twill mark
ggcoverage(
data = track_df,
color = "grey",
range.position = "out",
single.nuc = TRUE,
rect.color = "white"
) +
geom_base(
bam.file = bam_file,
bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19,
mark.type = "twill"
) +
geom_ideogram(genome = "hg19", plot.space = 0)
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
Use star to mark position with SNV:
# create plot with star mark
ggcoverage(
data = track_df,
color = "grey",
range.position = "out",
single.nuc = TRUE,
rect.color = "white"
) +
geom_base(
bam.file = bam_file,
bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19,
mark.type = "star"
) +
geom_ideogram(genome = "hg19", plot.space = 0)
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
Highlight position with SNV:
# highlight one base
ggcoverage(
data = track_df,
color = "grey",
range.position = "out",
single.nuc = TRUE,
rect.color = "white"
) +
geom_base(
bam.file = bam_file,
bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19,
mark.type = "highlight"
) +
geom_ideogram(genome = "hg19", plot.space = 0)
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
ChIP-seq data
The ChIP-seq data used here is from
DiffBind.
Four samples are selected as examples: Chr18_MCF7_input,
Chr18_MCF7_ER_1, Chr18_MCF7_ER_3, Chr18_MCF7_ER_2, and all bam
files were converted to bigwig files with
deeptools.
Create metadata:
# load metadata
sample_meta <- data.frame(
SampleName = c(
"Chr18_MCF7_ER_1",
"Chr18_MCF7_ER_2",
"Chr18_MCF7_ER_3",
"Chr18_MCF7_input"
),
Type = c("MCF7_ER_1", "MCF7_ER_2", "MCF7_ER_3", "MCF7_input"),
Group = c("IP", "IP", "IP", "Input")
)
sample_meta
#> SampleName Type Group
#> 1 Chr18_MCF7_ER_1 MCF7_ER_1 IP
#> 2 Chr18_MCF7_ER_2 MCF7_ER_2 IP
#> 3 Chr18_MCF7_ER_3 MCF7_ER_3 IP
#> 4 Chr18_MCF7_input MCF7_input Input
Load track files:
# track folder
track_folder <- system.file("extdata", "ChIP-seq", package = "ggcoverage")
# load bigwig file
track_df <- LoadTrackFile(
track.folder = track_folder,
format = "bw",
region = "chr18:76822285-76900000",
meta.info = sample_meta
)
# check data
head(track_df)
#> seqnames start end width strand score Type Group
#> 1 chr18 76820285 76820400 116 * 219.658005 MCF7_ER_1 IP
#> 2 chr18 76820401 76820700 300 * 0.000000 MCF7_ER_1 IP
#> 3 chr18 76820701 76821000 300 * 439.316010 MCF7_ER_1 IP
#> 4 chr18 76821001 76821300 300 * 219.658005 MCF7_ER_1 IP
#> 5 chr18 76821301 76821600 300 * 0.000000 MCF7_ER_1 IP
#> 6 chr18 76821601 76821900 300 * 219.658005 MCF7_ER_1 IP
Prepare mark region:
# create mark region
mark_region <- data.frame(
start = c(76822533),
end = c(76823743),
label = c("Promoter")
)
# check data
mark_region
#> start end label
#> 1 76822533 76823743 Promoter
Add gene, ideogram and peak annotations. To create peak
annotation, we first get consensus peaks with
MSPC.
# get consensus peak file
peak_file <- system.file("extdata",
"ChIP-seq",
"consensus.peak",
package = "ggcoverage"
)
basic_coverage +
geom_gene(gtf.gr = gtf_gr) +
geom_peak(bed.file = peak_file) +
geom_ideogram(genome = "hg19", plot.space = 0)
#> Loading ideogram...
#> Loading ranges...
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
Hi-C data
The Hi-C method maps chromosome contacts in eukaryotic cells. For this
purpose, DNA and protein complexes are cross-linked and DNA fragments
then purified. As a result, even distant chromatin fragments can be
found to interact due to the spatial organization of the DNA and
histones in the cell. Hi-C data shows these interactions for example as
a contact map.
basic_coverage <-
ggcoverage(
data = track_df,
color = "grey",
mark.region = NULL,
range.position = "out"
)
basic_coverage
Add annotations
Add link, contact mapannotations:
library(HiCBricks)
#> Loading required package: curl
#> Using libcurl 7.81.0 with OpenSSL/3.0.2
#> Loading required package: rhdf5
#> Loading required package: R6
#> Loading required package: grid
#>
#> Attaching package: 'grid'
#> The following object is masked from 'package:Biostrings':
#>
#> pattern
basic_coverage +
geom_tad(
matrix = hic_mat,
granges = hic_bin_gr,
value.cut = 0.99,
color.palette = "viridis",
transform.fun = failsafe_log10,
top = FALSE,
show.rect = TRUE
) +
geom_link(
link.file = link_file,
file.type = "bedpe",
show.rect = TRUE
)
#> Read 534 lines after Skipping 0 lines
#> Inserting Data at location: 1
#> Data length: 534
#> Loaded 2315864 bytes of data...
#> Read 534 records...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
Mass spectrometry protein coverage
Mass
spectrometry
(MS) is an important method for the accurate mass determination and
characterization of proteins, and a variety of methods and instruments
have been developed for its many uses. With ggcoverage, we can easily
inspect the peptide coverage of a protein in order to learn about the
quality of the data.
We can obtain features of the protein from
UniProt. For example, the above protein
coverage plot shows that there is empty region in 1-24, and this empty
region in UniProt is
annotated as Signal peptide and Propeptide peptide. When the protein is
mature and released extracellular, these peptides will be cleaved. This
is the reason why there is empty region in 1-24.
Please note that the ggcoverage project is released with a
Contributor Code of
Conduct.
By contributing to this project, you agree to abide by its terms.
ggcoverage - Visualize and annotate omics coverage with ggplot2
Introduction
The goal of
ggcoverageis to visualize coverage tracks from genomics, transcriptomics or proteomics data. It contains functions to load data from BAM, BigWig, BedGraph, txt, or xlsx files, create genome/protein coverage plots, and add various annotations including base and amino acid composition, GC content, copy number variation (CNV), genes, transcripts, ideograms, peak highlights, HiC contact maps, contact links and protein features. It is based on and integrates well withggplot2.It contains three main parts:
ggcoveragecan load BAM, BigWig (.bw), BedGraph, txt/xlsx files from various omics data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, proteomics, et al.ggcoveragesupports six different annotations:Installation
ggcoverageis an R package distributed as part of the CRAN repository. To install the package, start R and enter one of the following commands:In general, it is recommended to install from the Github repository (updated more regularly).
Once
ggcoverageis installed, it can be loaded like every other package:Manual
ggcoverageprovides two vignettes:RNA-seq data
Load the data
The RNA-seq data used here is from Transcription profiling by high throughput sequencing of HNRNPC knockdown and control HeLa cells. We select four samples to use as example:
ERR127307_chr14,ERR127306_chr14,ERR127303_chr14,ERR127302_chr14, and all bam files were converted to bigwig files with deeptools.Load metadata:
Load track files:
Prepare mark region:
Load GTF
To add gene annotation, the gtf file should contain gene_type and gene_name attributes in column 9; to add transcript annotation, the gtf file should contain a transcript_name attribute in column 9.
Basic coverage
The basic coverage plot has two types:
facet.key). This is default.joint view
Create line plot for every sample (
facet.key = "Type") and color by every sample (group.key = "Type"):Create group average line plot (sample is indicated by
facet.key = "Type", group is indicated bygroup.key = "Group"):Facet view
Custom Y-axis style
Change the Y-axis scale label in/out of plot region with
range.position:Shared/Free Y-axis scale with
facet.y.scale:Add gene annotation
gene.size,exon.sizeandutr.sizeparametersarrow.numandarrow.gapparametersstrandby default, which can be changed using thecolor.byparameterAdd transcript annotation
In “loose” style (default style; each transcript occupies one line):
In “tight” style (attempted to place non-overlapping transcripts in one line):
Add ideogram
The ideogram is an overview plot about the respective position on a chromosome. The plotting of the ideogram is implemented by the
ggbiopackage. This package needs to be installed separately (it is only ‘Suggested’ byggcoverage).DNA-seq data
CNV
Example 1
Load the data
The DNA-seq data used here are from Copy number work flow, we select tumor sample, and get bin counts with
cn.mops::getReadCountsFromBAMwithWL1000.Basic coverage
Add GC annotations
Add GC, ideogram and gene annotations. The plotting of the GC content requires the genome annotation package
BSgenome.Hsapiens.UCSC.hg19. This package needs to be installed separately (it is only ‘Suggested’ byggcoverage).Example 2
Load the data
The DNA-seq data used here are from Genome-wide copy number analysis of single cells, and the accession number is SRR054616.
Basic coverage
Load CNV file
Add annotations
Add GC, ideogram and CNV annotations.
Single-nucleotide level
Load the data
Default color scheme
For base and amino acid annotation, the package comes with the following default color schemes. Color schemes can be changed with
nuc.colorandaa.colorparameters.THe default color scheme for base annotation is
Clustal-style, more popular color schemes are available here.Default color scheme for amino acid annotation is from Residual colours: a proposal for aminochromography:
Add base and amino acid annotation
Use twill to mark position with SNV:
Use star to mark position with SNV:
Highlight position with SNV:
ChIP-seq data
The ChIP-seq data used here is from DiffBind. Four samples are selected as examples:
Chr18_MCF7_input,Chr18_MCF7_ER_1,Chr18_MCF7_ER_3,Chr18_MCF7_ER_2, and all bam files were converted to bigwig files with deeptools.Create metadata:
Load track files:
Prepare mark region:
Basic coverage
Add annotations
Add gene, ideogram and peak annotations. To create peak annotation, we first get consensus peaks with MSPC.
Hi-C data
The Hi-C method maps chromosome contacts in eukaryotic cells. For this purpose, DNA and protein complexes are cross-linked and DNA fragments then purified. As a result, even distant chromatin fragments can be found to interact due to the spatial organization of the DNA and histones in the cell. Hi-C data shows these interactions for example as a contact map.
The Hi-C data is taken from pyGenomeTracks: reproducible plots for multivariate genomic datasets.
The Hi-C matrix visualization is implemented by
HiCBricks. This package needs to be installed separately (it is only ‘Suggested’ byggcoverage).Load track data
Load Hi-C data
Matrix:
Bin table:
Data transfromation method:
Load link
Basic coverage
Add annotations
Add link, contact mapannotations:
Mass spectrometry protein coverage
Mass spectrometry (MS) is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instruments have been developed for its many uses. With
ggcoverage, we can easily inspect the peptide coverage of a protein in order to learn about the quality of the data.Load coverage
The exported coverage from Proteome Discoverer:
The input protein fasta:
Protein coverage
Add annotation
We can obtain features of the protein from UniProt. For example, the above protein coverage plot shows that there is empty region in 1-24, and this empty region in UniProt is annotated as Signal peptide and Propeptide peptide. When the protein is mature and released extracellular, these peptides will be cleaved. This is the reason why there is empty region in 1-24.
Code of Conduct
Please note that the
ggcoverageproject is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.