HaMiP: DNA hydroxymethylation analysis of Cytosine-5-methylenesulfonate ImmunoPrecipitation sequencing
A scalable, accurate, and efficient solution for hydroxymethylation analysis of CMS-IP sequencing data.
Installation
HaMiP has been deployed in Bioconda at https://anaconda.org/bioconda/HaMiP. It is encouraged to install HaMiP from Bioconda due to most runtime dependencies will be installed automatically. The following channels should be added in Conda. Namely,
HaMiP takes in a configuration file for input data and program parameters. HaMiP can be run end-to-end, starting from raw FASTQ files to peak calling and differential hydroxymethylation identification. One can also start the pipeline from intermediate steps. For example, using alignment files as input so that mapping steps will be skipped.
Inspection of configuration
The configuration file is in a YAML format. Two example templates are config_fastq.yaml and config_bam.yaml under https://github.com/lijinbio/HaMiP/blob/master/config. config_fastq.yaml is used as a full HaMiP running from FASTQ inputs, while config_bam.yaml is adapted to input existing BAM files so that HaMiP will skip the long-time alignment step. The inspection of configuration is explained below.
sampleinfo
The sampleinfo section defines metadata information in analysis. Below metadata information can be specified.
Parameter
Description
sampleinfo.sampleid
the unique identifier to one sample
sampleinfo.group
the biological group of the sample, e.g., KO or WT
sampleinfo.filenames
the absolute path of raw FASTQ files
sampleinfo.reference
the absolute path of the reference BAM file when aligninfo.inputbam is True
sampleinfo.spikein
the absolute path of the spike-in BAM file when aligninfo.inputbam and aligninfo.usespikein are True
groupinfo
This section defines biological comparison group1 - group2, e.g., KO - WT.
Parameter
Description
groupinfo.group1
the first group in biological comparison
groupinfo.group2
the second group in biological comparison
resultdir
This directory is default working directory storing intemediate result files, such as BAM and BED files.
aligninfo
This section specifies parameters used in raw reads alignment.
Parameter
Description
aligninfo.inputbam
True for BAM inputs. Default: FASTQ inputs.
aligninfo.reference
FASTA file of the reference genome, e.g. hg38.fa.
aligninfo.usespikein
True for spike-in libraries, otherwise False. This option controls the normalization method, either a spike-in normalization using spike-in mapping, or reduced to WIG sum in the reference genome.
aligninfo.spikein
FASTA file of the spike-in genome, e.g. mm10.fa.
aligninfo.statfile
the output statistics file. This file includes quality control statistics as well as estimated normalization factors.
aligninfo.barplotinfo
a barplot of normalized WIG sums of samples.
aligninfo.numthreads
number of threads in alignment program.
aligninfo.verbose
Print verbose message
genomescaninfo
This section defines parameters for CMS measurement construction.
Parameter
Description
genomescaninfo.readextension
True to extend reads length before CMS measurement construction.
genomescaninfo.fragsize
the fixed fragment size to extend when readextension is True.
genomescaninfo.windowfile
an intermediate window file with fixed-size genomic regions.
genomescaninfo.referencename
the UCSC genome name to fetch reference genome size. E.g., hg38 or mm10.
genomescaninfo.windowsize
the window size
genomescaninfo.readscount
CMS measurement using read count (True) or mean WIG (False).
genomescaninfo.counttablefile
the result count table file.
genomescaninfo.verbose
Print verbose message
dhmrinfo
Parameters in this section is for DMR detection.
Parameter
Description
dhmrinfo.method
The statistical method used in DHMR detection. Available methods: ttest, chisq, gtest, nbtest, nbtest_sf. ttest is calling Student’s t-test to examine the mean difference of CMS measurements between two biological groups. chisq and gtest are Pearson’s Chi-squared and G-test to test if sums of CMS measurements fit the numbers of replicates between two biological condtions. nbtest applies negative binomial generalized linear model to formulate CMS measurements, and Wald test evaluates the significance of logarithmic fold change. By default, CMS measurement are adjusted by size factors using spike-in normalization. In nbtest_sf, CMS measurements are normalized by the median-ratio algorithm (previously used in DESeq2 for transcriptome measurements).
dhmrinfo.meandepth
Average depth to filter out low-depth windows. This step is essential to save computing resources and increase power of downstream statistical inference
dhmrinfo.testfile
The result file with statistical outputs for whole genome windows
dhmrinfo.qthr
q-value threshod for DHMW.
dhmrinfo.maxdistance
Maximum distance to merge adjacent DHMWs into DHMRs
dhmrinfo.dhmrfile
The final DHMR result file after merging adjacent DHMWs.
dhmrinfo.numthreads
The number of threads.
dhmrinfo.nsplit
The number of split of windows. This option controls parallelization with dhmrinfo.numthreads.
dhmrinfo.verbose
Print verbose message.
dhmrinfo.keepNA
Keep genome windows ruled out by independent filtering.
useinput
To indicate if the input data is used during CMS-IP sequencing.
inputinfo
If useinput is True, this section is required to specify input data. When input data is used, peak windows are identified first by comparing CMS measurements between group 1/2 and their input data. Then, the union of peak windows are tested for DHMR between group 1 and group 2.
Parameter
Description
inputinfo.group1
The label for the first group input data.
inputinfo.group2
The label for the second group input data. Group 1 and group 2 can share same set of input data.
inputinfo.method
The statistical method used in peak calling. See dhmrinfo.method.
inputinfo.qthr
q-value threshold for peak calling.
inputinfo.testfile1
Statistical test results for group 1 peaking calling.
inputinfo.dhmrfile1
Peak regions for group 1.
inputinfo.testfile2
Statistical test results for group 2 peaking calling.
inputinfo.dhmrfile2
Peak regions for group 2.
inputinfo.inputfilterfile
Union of peak regions in group 1 and group 2.
inputinfo.verbose
Print verbose message.
A toy example using BAM inputs
To facilitate the running of HaMiP, a toy example is generated using existing BAM inputs. The example is accessible at https://github.com/lijinbio/HaMiP/blob/master/example. The example directory consists of running scripts and example BAM files. Below commands will generate the configuration file and run the example.
$ ./config.sh ## Generate config.yaml
$ ./fasta.sh ## download the reference genome and the spike-in genome under ./fasta
$ ./run.sh ## run the example
config.sh
This script will generate the running configuration file. The inspection of configuration file has been explained above. This example includes small BAM files for 2 KO and 2 WT samples, together with 3 input samples. Spike-in BAM files are also included for spike-in normalization. These BAM files are under the ./bamfile directory. The gtest is used for peaking calling and DHMR detection.
fasta.sh
This script is to download required FASTA file for reference genome and spike-in genome. These FASTA files are used in MCALL for bisulfite conversion ratio (BCR) estimation. FASTA files are downloaded into a local directory ./fasta.
run.sh
The simple command to run CMSIP:
$ HaMiP -c config.yaml
Intermediate and results files are stored under ./outdir. The example quality control statistic file (e.g., qcstats.txt) is as below.
sample_id
total
unique_ref
ref/total
unique_spk
spk/total
comm
comm/total
comm/unique_ref
twss_spk
sizefactors
twss_ref
twss_ref_norm
bcr_ref
bcr_spk
T1
2328
2328
100.00%
141
6.06%
99
4.25%
4.25%
3053
0.52
171892
88620
0.022883
0.278465
T2
20414
20414
100.00%
2780
13.62%
2539
12.44%
12.44%
9522
0.17
704965
116532
0.033482
0.178238
W1
1588
1588
100.00%
97
6.11%
74
4.66%
4.66%
1786
0.88
118312
104268
0.043317
0.206362
W2
1182
1182
100.00%
89
7.53%
66
5.58%
5.58%
1574
1.00
85335
85335
0.034864
0.214435
I1
212
212
100.00%
17
8.02%
4
1.89%
1.89%
992
1.59
15984
25362
0.738502
0.811715
I2
150
150
100.00%
14
9.33%
1
0.67%
0.67%
1032
1.53
11586
17671
0.897633
0.994475
I3
52
52
100.00%
9
17.31%
0
0.00%
0.00%
356
4.42
2053
9077
0.910494
0.983871
Specifically, the sizefactors column is the size factors generated by spike-in normalization.
The example DHMR file is as below.
chrom
start
end
baseMean
lfc
statistic
pvalue
padj
chr4
105276100
105276400
95.88
-0.8072375451
27.63702666
1.463503254e-07
2.048904555e-06
chr4
105272500
105272700
86.7175
-0.837383035
19.45881901
1.027921214e-05
4.796965666e-05
chr4
105259600
105259800
42.4925
-0.5320182773
5.682020984
0.01713961427
0.03999243329
For example, three hypo-DHMRs are identified in chr4 between group T and group W.
HaMiP: DNA hydroxymethylation analysis of Cytosine-5-methylenesulfonate ImmunoPrecipitation sequencing
A scalable, accurate, and efficient solution for hydroxymethylation analysis of CMS-IP sequencing data.
Installation
HaMiP has been deployed in Bioconda at https://anaconda.org/bioconda/HaMiP. It is encouraged to install HaMiP from Bioconda due to most runtime dependencies will be installed automatically. The following channels should be added in Conda. Namely,
Alternatively, HaMiP has been also deployed in PyPI at https://pypi.org/project/HaMiP, and it can be installed via
pip.In some cases, users want to build HaMiP manually from source code at https://github.com/lijinbio/HaMiP. Below is an example installation steps.
In order to run HaMiP after a manual installation, the following dependent software are required.
Documentation
HaMiP takes in a configuration file for input data and program parameters. HaMiP can be run end-to-end, starting from raw FASTQ files to peak calling and differential hydroxymethylation identification. One can also start the pipeline from intermediate steps. For example, using alignment files as input so that mapping steps will be skipped.
Inspection of configuration
The configuration file is in a YAML format. Two example templates are
config_fastq.yamlandconfig_bam.yamlunder https://github.com/lijinbio/HaMiP/blob/master/config.config_fastq.yamlis used as a full HaMiP running from FASTQ inputs, whileconfig_bam.yamlis adapted to input existing BAM files so that HaMiP will skip the long-time alignment step. The inspection of configuration is explained below.The sampleinfo section defines metadata information in analysis. Below metadata information can be specified.
aligninfo.inputbamis Truealigninfo.inputbamandaligninfo.usespikeinare TrueThis section defines biological comparison
group1-group2, e.g., KO - WT.This directory is default working directory storing intemediate result files, such as BAM and BED files.
This section specifies parameters used in raw reads alignment.
hg38.fa.mm10.fa.This section defines parameters for CMS measurement construction.
Parameters in this section is for DMR detection.
ttest,chisq,gtest,nbtest,nbtest_sf.ttestis calling Student’s t-test to examine the mean difference of CMS measurements between two biological groups.chisqandgtestare Pearson’s Chi-squared and G-test to test if sums of CMS measurements fit the numbers of replicates between two biological condtions.nbtestapplies negative binomial generalized linear model to formulate CMS measurements, and Wald test evaluates the significance of logarithmic fold change. By default, CMS measurement are adjusted by size factors using spike-in normalization. Innbtest_sf, CMS measurements are normalized by the median-ratio algorithm (previously used in DESeq2 for transcriptome measurements).dhmrinfo.numthreads.To indicate if the input data is used during CMS-IP sequencing.
If
useinputis True, this section is required to specify input data. When input data is used, peak windows are identified first by comparing CMS measurements between group 1/2 and their input data. Then, the union of peak windows are tested for DHMR between group 1 and group 2.dhmrinfo.method.A toy example using BAM inputs
To facilitate the running of HaMiP, a toy example is generated using existing BAM inputs. The example is accessible at https://github.com/lijinbio/HaMiP/blob/master/example. The
exampledirectory consists of running scripts and example BAM files. Below commands will generate the configuration file and run the example.config.shThis script will generate the running configuration file. The inspection of configuration file has been explained above. This example includes small BAM files for 2 KO and 2 WT samples, together with 3 input samples. Spike-in BAM files are also included for spike-in normalization. These BAM files are under the
./bamfiledirectory. Thegtestis used for peaking calling and DHMR detection.fasta.shThis script is to download required FASTA file for reference genome and spike-in genome. These FASTA files are used in MCALL for bisulfite conversion ratio (BCR) estimation. FASTA files are downloaded into a local directory
./fasta.run.shThe simple command to run CMSIP:
Intermediate and results files are stored under
./outdir. The example quality control statistic file (e.g.,qcstats.txt) is as below.Specifically, the
sizefactorscolumn is the size factors generated by spike-in normalization.The example DHMR file is as below.
For example, three hypo-DHMRs are identified in chr4 between group T and group W.
Contact
Maintainer: Jin Li, lijin.abc@gmail.com. PI: De-Qiang Sun, sunnyisgalaxy@gmail.com.