⚖EpiCompare⚖ QC and Benchmarking of Epigenomic Datasets
Authors: Sera Choi, Brian Schilder, Leyla Abbasova, Alan Murphy,
Nathan Skene, Thomas Roberts, Hiranyamaya Dash
Updated: Dec-01-2025
Introduction
EpiCompare is an R package for comparing multiple epigenomic datasets
for quality control and benchmarking purposes. Its main function,
EpiCompare(), outputs an HTML report consisting of three sections:
General Metrics: Metrics on peaks (percentage of blacklisted and
non-standard peaks, and peak widths) and fragments (duplication
rate) of samples.
Peak Overlap: Frequency, percentage and statistical significance of
overlapping and non-overlapping peaks. This also includes Upset,
precision-recall and correlation plots.
Functional Annotation: Functional annotation (ChromHMM,
ChIPseeker and enrichment analysis) of peaks. Also includes peak
enrichment around transcription start sites (TSS).
Note: Peaks located in blacklisted regions and non-standard
chromosomes are removed from the files prior to analysis.
Installation
Standard
To install EpiCompare use:
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("EpiCompare")
All dependencies
Installing all Imports and Suggests will allow you to use the full
functionality of EpiCompare right away, without having to stop and
install extra dependencies later on.
Note that this will increase installation time, but it means that you
won’t have to install any further R packages when using functions
that rely on suggested dependencies.
Development
To install the development version of EpiCompare, use:
if (!require("remotes")) install.packages("remotes")
remotes::install_github("neurogenomics/EpiCompare")
Citation
If you use EpiCompare, please cite:
EpiCompare: R package for the comparison and quality control of
epigenomic peak files (2022) Sera Choi, Brian M. Schilder, Leyla
Abbasova, Alan E. Murphy, Nathan G. Skene, bioRxiv, 2022.07.22.501149;
doi: https://doi.org/10.1101/2022.07.22.501149
The documentation in this README and on the GitHub Pages website
pertains to the development version of EpiCompare. Older versions of
EpiCompare may have slightly different documentation (e.g. available
functions, parameters). For documentation of older versions, please see
the Documentation section of the relevant version on Bioconductor.
Usage
Load package and example datasets.
library(EpiCompare)
data("encode_H3K27ac") # example peakfile
data("CnT_H3K27ac") # example peakfile
data("CnR_H3K27ac") # example peakfile
data("CnT_H3K27ac_picard") # example Picard summary output
data("CnR_H3K27ac_picard") # example Picard summary output
Prepare input files:
# create named list of peakfiles
peakfiles <- list("CnT"=CnT_H3K27ac,
                  "CnR"=CnR_H3K27ac)
# set ref file and name
reference <- list("ENCODE_H3K27ac" = encode_H3K27ac)
# create named list of Picard summary
picard_files <- list("CnT"=CnT_H3K27ac_picard,
                     "CnR"=CnR_H3K27ac_picard)
Tips on importing user-supplied files
EpiCompare::gather_files is helpful for identifying and importing peak
or Picard files.
# To import BED files as GRanges objects
peakfiles <- EpiCompare::gather_files(dir = "path/to/peaks/",
                                      type = "peaks.stringent")
# EpiCompare alternatively accepts paths (to BED files) as input
peakfiles <- list(sample1="/path/to/peaks/file1_peaks.stringent.bed",
                  sample2="/path/to/peaks/file2_peaks.stringent.bed")
# To import Picard summary output txt files as data frames
picard_files <- EpiCompare::gather_files(dir = "path/to/peaks",
                                         type = "picard")
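With the inputs prepared, the report is produced by a call to EpiCompare(). The sketch below wraps that call in a small helper so the arguments can be checked before a long run. `run_comparison` is a hypothetical name and not part of the package; the argument names are the ones documented in this README, and the NULL defaults for the optional inputs are an assumption.

```r
# Hypothetical convenience wrapper (not part of EpiCompare): forwards the
# inputs described in this README to EpiCompare::EpiCompare().
run_comparison <- function(peakfiles,
                           genome_build,
                           blacklist,
                           output_dir,
                           reference = NULL,      # optional input
                           picard_files = NULL,   # optional input
                           genome_build_output = "hg19",
                           run_all = FALSE) {
  # Peak files must be supplied as a named list.
  stopifnot(is.list(peakfiles), !is.null(names(peakfiles)))
  EpiCompare::EpiCompare(
    peakfiles = peakfiles,
    genome_build = genome_build,
    genome_build_output = genome_build_output,
    blacklist = blacklist,
    picard_files = picard_files,
    reference = reference,
    run_all = run_all,
    output_dir = output_dir
  )
}

# Not run here: needs EpiCompare installed and the objects created above.
# The genome builds below are an assumption; set them to match your data.
# data("hg19_blacklist")
# run_comparison(peakfiles,
#                genome_build = list(peakfiles = "hg19", reference = "hg19",
#                                    blacklist = "hg19"),
#                blacklist = hg19_blacklist,
#                output_dir = tempdir(),
#                reference = reference,
#                picard_files = picard_files,
#                run_all = TRUE)
```

Defining the wrapper does not load EpiCompare; the package is only needed once the function is actually called.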
Required Inputs
These input parameters must be provided:
peakfiles : Peakfiles you want to analyse. EpiCompare accepts
peakfiles as GRanges objects and/or as paths to BED files. Files must
be listed and named using list(), e.g.
list("name1"=peakfile1, "name2"=peakfile2).
genome_build : A named list indicating the human genome build used
to generate each of the following inputs:
  peakfiles : Genome build for the peakfiles input. Assumes the genome
  build is the same for each element in the peakfiles list.
  reference : Genome build for the reference input.
  blacklist : Genome build for the blacklist input.
E.g. genome_build = list(peakfiles="hg38", reference="hg19", blacklist="hg19")
genome_build_output : Genome build to standardise all inputs to.
Liftovers will be performed automatically as needed. Default is
“hg19”.
blacklist : Peakfile as a GRanges object specifying genomic regions
that have anomalous and/or unstructured signals independent of the
cell-line or experiment. For the human hg19 and hg38 genomes, use the
built-in data data(hg19_blacklist) and data(hg38_blacklist) respectively.
For the mouse mm10 genome, use the built-in data data(mm10_blacklist).
output_dir : Path to the directory where all EpiCompare outputs will
be saved.
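The shape of genome_build is easy to get wrong, so a quick check before running can help. The following is a minimal sketch in base R; `check_genome_build` is a hypothetical helper and not an EpiCompare function, and it only tests the structure described above, not whether a build name is supported.

```r
# Hypothetical helper (not part of EpiCompare): checks that genome_build is a
# named list whose names are among the three inputs described above.
check_genome_build <- function(genome_build) {
  allowed <- c("peakfiles", "reference", "blacklist")
  stopifnot(is.list(genome_build), !is.null(names(genome_build)))
  unknown <- setdiff(names(genome_build), allowed)
  if (length(unknown) > 0) {
    stop("Unexpected genome_build entries: ", paste(unknown, collapse = ", "))
  }
  # Each entry should be a single build name such as "hg19" or "hg38".
  is_single_string <- vapply(
    genome_build,
    function(x) is.character(x) && length(x) == 1L,
    logical(1)
  )
  if (!all(is_single_string)) {
    stop("Each genome_build entry must be a single character string")
  }
  invisible(genome_build)
}

genome_build <- list(peakfiles = "hg38", reference = "hg19", blacklist = "hg19")
check_genome_build(genome_build)

# Inputs whose build differs from genome_build_output would need a liftover.
genome_build_output <- "hg19"
needs_liftover <- names(genome_build)[unlist(genome_build) != genome_build_output]
needs_liftover
```

In this example only the peakfiles entry differs from the output build, so it is the only input that would be lifted over.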
Optional Inputs
The following input files are optional:
picard_files : A list of summary metrics output from Picard. Picard
MarkDuplicates can be used to identify duplicate reads in the
alignment. This tool generates a summary output, normally with the
ending .markdup.MarkDuplicates.metrics.txt. If this input is
provided, metrics on fragments (e.g. mapped fragments and duplication
rate) will be included in the report. Each file must be a data.frame,
and the files must be supplied as a named list, e.g.
list("CnT"=CnT_H3K27ac_picard). To import Picard duplication
metrics (.txt file) into R as a data frame, use
picard <- read.table("/path/to/picard/output", header = TRUE, fill = TRUE).
reference : Reference peak file(s) used in stat_plot and
chromHMM_plot. Each file must be a GRanges object, listed and named
using list("reference_name" = GRanges_object). If more than one
reference is specified, EpiCompare outputs an individual report for
each reference. Please note that this can take a while.
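To make the Picard import step concrete, the sketch below writes a tiny mock metrics file and reads it back with the read.table() call given above. The column names follow Picard's DuplicationMetrics header, but the file is heavily simplified and the values are made up for illustration; the sample name is a placeholder.

```r
# Write a tiny mock metrics file (illustrative values, not real data).
metrics_path <- tempfile(fileext = ".markdup.MarkDuplicates.metrics.txt")
writeLines(c(
  "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
  paste("LIBRARY", "READ_PAIRS_EXAMINED", "READ_PAIR_DUPLICATES",
        "PERCENT_DUPLICATION", sep = "\t"),
  paste("lib1", "1000", "50", "0.05", sep = "\t")
), metrics_path)

# Lines starting with "#" are skipped by read.table()'s default comment.char.
picard <- read.table(metrics_path, header = TRUE, fill = TRUE)

# EpiCompare expects a named list of such data frames.
picard_files <- list("sample1" = picard)
str(picard_files)
```

Real MarkDuplicates output also contains a histogram section after the metrics row, which is why fill = TRUE is used in the import call.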
Optional Plots
By default, these plots are not included in the report; set the
corresponding argument to TRUE to include a plot. To turn on all
features at once, use the run_all=TRUE argument.
upset_plot : Upset plot of overlapping peaks between samples.
stat_plot : included only if a reference dataset is provided. The
plot shows statistical significance (p/q-values) of sample peaks that
are overlapping/non-overlapping with the reference dataset.
chromHMM_plot : ChromHMM annotation of peaks. If a reference
dataset is provided, ChromHMM annotation of overlapping and
non-overlapping peaks with the reference is also included in the
report.
chipseeker_plot : ChIPseeker annotation of peaks.
enrichment_plot : KEGG pathway and GO enrichment analysis of peaks.
tss_plot : Peak frequency around (+/- 3000bp) transcription start
sites. Note that it may take a while to generate this plot for large
sample sizes.
precision_recall_plot : Plot showing the precision-recall score
across the peak calling stringency thresholds.
corr_plot : Plot showing the correlation between the quantiles when
the genome is binned at a set size. These quantiles are based on the
intensity of the peak, dependent on the peak caller used (q-value for
MACS2).
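When only some plots are wanted, it can be convenient to keep the switches together in one named list. The following is a sketch under assumptions: `plot_options` and `required_args` are hypothetical names, and the particular selection of plots is arbitrary; only the argument names come from the list above.

```r
# Hypothetical helper (not part of EpiCompare): returns the optional plot
# switches as a named list that can be spliced into an EpiCompare() call.
plot_options <- function(has_reference) {
  list(
    upset_plot = TRUE,
    stat_plot = has_reference,   # included only if a reference is provided
    chromHMM_plot = TRUE,
    chipseeker_plot = TRUE,
    enrichment_plot = FALSE,
    tss_plot = FALSE,            # may take a while for large sample sizes
    precision_recall_plot = FALSE,
    corr_plot = FALSE
  )
}

opts <- plot_options(has_reference = FALSE)
requested <- names(opts)[unlist(opts)]
requested

# Not run: combine with the required inputs, e.g.
# do.call(EpiCompare::EpiCompare, c(required_args, opts))
```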
Other Options
chromHMM_annotation : Cell-line annotation for ChromHMM. Default is
K562. Options are:
“K562” = K-562 cells
“Gm12878” = Cellosaurus cell-line GM12878
“H1hesc” = H1 Human Embryonic Stem Cell
“Hepg2” = Hep G2 cell
“Hmec” = Human Mammary Epithelial Cell
“Hsmm” = Human Skeletal Muscle Myoblasts
“Huvec” = Human Umbilical Vein Endothelial Cells
“Nhek” = Normal Human Epidermal Keratinocytes
“Nhlf” = Normal Human Lung Fibroblasts
interact : By default, all heatmaps (percentage overlap and ChromHMM
heatmaps) in the report will be interactive. If set to FALSE, all
heatmaps will be static. N.B. If interact=TRUE, interactive heatmaps
will be saved as HTML files, which may take time for larger sample
sizes.
output_filename : By default, the report is named EpiCompare.html.
You can specify the file name of the report here.
output_timestamp : By default FALSE. If TRUE, the filename of the
report includes the date.
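Because chromHMM_annotation takes one of a fixed set of labels, a typo is easy to catch up front. This is a minimal base R sketch; `pick_chromHMM_annotation` is a hypothetical helper, and its matching is case-sensitive, which may or may not mirror how EpiCompare itself treats the argument.

```r
# Cell-line labels as listed above.
chromHMM_cell_lines <- c("K562", "Gm12878", "H1hesc", "Hepg2", "Hmec",
                         "Hsmm", "Huvec", "Nhek", "Nhlf")

# Hypothetical helper: returns the chosen cell line, or the default (K562)
# when called without arguments; errors on anything not listed above.
pick_chromHMM_annotation <- function(cell_line = chromHMM_cell_lines) {
  match.arg(cell_line, choices = chromHMM_cell_lines)
}

pick_chromHMM_annotation()          # "K562"
pick_chromHMM_annotation("Hepg2")   # "Hepg2"
```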
Outputs
EpiCompare outputs the following:
HTML report: A summary of all analyses, saved in the specified
output_dir.
EpiCompare_file: If save_output=TRUE, all plots generated by
EpiCompare are also saved in the EpiCompare_file directory within the
specified output_dir.
An example report comparing ATAC-seq and DNase-seq can be found
here
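After a run, the report can be located programmatically, which is useful when output_filename or output_timestamp changes its name. The sketch below uses a mock directory rather than a real run; `find_reports` is a hypothetical helper, and the demo folder layout is an assumption based on the description above.

```r
# Hypothetical helper: lists HTML reports found directly in output_dir
# (files inside sub-directories such as EpiCompare_file are not searched).
find_reports <- function(output_dir) {
  list.files(output_dir, pattern = "\\.html$", full.names = TRUE)
}

# Demonstration on a mock directory; a real run would create these files.
output_dir <- file.path(tempdir(), "epicompare_demo")
dir.create(output_dir, showWarnings = FALSE)
file.create(file.path(output_dir, "EpiCompare.html"))
dir.create(file.path(output_dir, "EpiCompare_file"), showWarnings = FALSE)

basename(find_reports(output_dir))
```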
Datasets
EpiCompare includes several built-in datasets:
encode_H3K27ac: Human H3K27ac peak file generated with ChIP-seq
using the K562 cell line. Taken from the ENCODE project.
For more information, run ?encode_H3K27ac.
CnT_H3K27ac: Human H3K27ac peak file generated with CUT&Tag using
the K562 cell line, from Kaya-Okur et al. (2019).
For more information, run ?CnT_H3K27ac.
CnR_H3K27ac: Human H3K27ac peak file generated with CUT&Run using
the K562 cell line, from Meers et al. (2019).
For more information, run ?CnR_H3K27ac.
Documentation
EpiCompare website
Docker/Singularity container
Bioconductor page
Contact
Neurogenomics Lab
UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London
GitHub
DockerHub