git clone https://github.com/adamewing/methylartist-tests.git
cd methylartist-tests
source run_tests.sh
Input data
Alignments (.bam)
Alignments stored in .bam format should be sorted and indexed and should use the same read names as the associated methylation data.
Modified Base Calls
The easiest way to provide modified basecall data is through .bam files with the MM and ML tags for modified base calling such as those produced by guppy or dorado. Note that the mod_mappings.bam file output by megalodon will work for modified base calling, but is unsuitable for downstream applications involving sequence variation, including phasing.
If .bam files with modified base calls are not available, methylartist has functions for loading methylation data can from nanopolish, megalodon, or basecalled guppy fast5s, using the appropriate function below. Assays that use C/U conversion for methylation inference are also supported through the db-sub command as noted below.
Once coverted, the sqlite .db file can be input to methylartist functions (e.g. segplot, locus, etc).
Commands:
db-nanopolish
Load nanopolish methylation into sqlite db.
Example:
Loading results from nanopolish call-methylation to a database:
The input file is the output of megalodon_extras per_read_text modified_bases /path/to/megalodon_output, which needs to be run prior to this script.
The default filename (/path/to/megalodon_output/per_read_modified_base_calls.txt) is the same for all megalodon runs, so the --db option is recommended to make the output database more identifiable for downstream analysis.
Appending (-a) additional results to the above database:
methylartist db-megalodon -m MCF7_ATCC_REP2/per_read_modified_base_calls.txt --db MCF7_ATCC.megalodon.db -a
Input files can be uncompressed or .gzipped.
db-custom
This enables free-form parsing of modified basecall tables into methylartist .db files for tools where modified base .bam files are not available and certain requirements are met. The table must contain, at a minimum, the read names, genomic position (chromosome and position), strand, and probability of the target base being modified. If not specified by a column (--modbasecol), the modification is specified by --modbase. The probability is assumed to be a raw probability between 0 and 1 of a given base being modified i.e. 1-p(canonical), other schemes may be used but --canprob and --mincanprob must be specified to set a column for canoncial base scores and a cutoff for a base being canonical.
For example, modified basecalls from deepsignal-plant can be loaded as follows:
methylartist now supports C/T data if the .bam file is noted as being C/T substitution data via --ctbam (works for locus, region, segmeth, wgmeth)
As of 1.3.0, methylartist supports display of C/U base substitution data (i.e. WGBS or EM-seq data) via creation of a methylartist .db file, simply pass the .bam file as input and specify an output file:
Note that the .bam file has to include the MD tag (aligners for bisulfite/em-seq data should do this).
segmeth
Outputs aggregate methylation / demethylation call counts over intervals. Required before generating strip / violin plots with segplot
Requires whitespace-delimited list of segments in a BED3+2 format: chromosome, start, end, label, strand.
One or more .bam files may be supplied via the -b/--bams parameter. Multiple .bams may be comma-delimited.
Optional sample input file -d/--data has the following whitespace-delimited fields (one sample per line): BAM file, Methylation DB (generated with e.g. db-nanopolish)
Highly recommend parallelising with -p/--proc option if possible.
Can be used to generate genome-wide methylation stats aggregated over windows via bedtools makewindows.
Example:
Aggregate whole-genome CpG methylation in 10kbp bins, promoters (Eukaryotic Promoter Database, EPD), L1HS and SVA retrotransposons:
Note that default output is in .png format. For .svg vector output suitable for editing in inkscape or illustrator add the --svg option. Note that for strip plots, this is often inadvisable due to the large number of points.
New in 1.0.7, ridge plots (-g/--ridge):
methylartist segplot -s L1_FL.MCF7_data_megalodon.segmeth.tsv -c L1HS,L1PA2,L1PA3,L1PA4,L1PA5,L1PA6,L1PA7,L1PA8 -g --palette magma
Ridge plots can also be grouped by annotation (--ridge_group_by_annotation) rather than by sample as in the above example.
locus
Generates smoothed methylation profiles across specific loci with many configurable parameters for one or more samples.
One or more .bam files (with Mm/Ml tags) may be supplied via the -b/--bams parameter. Multiple .bams may be comma-delimited.
Optional sample input file -d/--data has the following whitespace-delimited fields (one sample per line): BAM file, Methylation DB (generated with e.g. db-nanopolish)
Example:
Plot of the GPER1 locus in hg38, highlighting the GeneHancer promoter/enhancer annotation (GH07J001085).
From top to bottom, the plot shows the genome coordinates, gene models (optional if --gtf is supplied), read mappings with modified bases as closed (modified) or open (unmodified) circles, translation from genome to CpG-only coordinate space, raw log-likelihood ratios, and smoothed methylated fraction plot. Note that --genes results in plotting only the genes specified after --genes (comma-delimited), leaving this out plots all genes across the window:
Since WGBS/EM-seq data is supported by specifying that the .bam is C/T substitution data via --ctbam, this data can be displayed via any of the other methylartist commands, e.g.:
Note the apparent differential imprinting between the ATCC and ECACC MCF7 lines.
variant highlighting
The --variants option can be used with a VCF file (bgzipped, tabix-indexed) to highlight variants in the reads:
methylartist locus -b sample.bam -i chr19:56800082-56860726 -m m --motif CG --ref hg38.fasta --nomask --gtf Homo_sapiens.GRCh38.97.chr.sorted.gtf.gz --variants sample.phased.vcf.gz --phased
split samples by a variant
Using --variants and specify a specific variant ID via the --splitvar option:
methylartist locus -b sample.bam -i chr19:56800082-56860726 -m m --motif CG --ref hg38.fasta --nomask --gtf Homo_sapiens.GRCh38.97.chr.sorted.gtf.gz --variants example_variant.txt.gz --splitvar mysnpid
Contents of example_variant.txt.gz:
chr19 56827839 mysnpid C A 25.11 PASS P GT:GQ:DP:AD:AF:PS 1|0:25:29:13,15:0.5172:54626833
Same region, without --splitvar mysnpid:
region
More tractable than “locus” for larger regions using binned methylation.
Example:
Plot of the PGR (Progesterone Receptor) region on chr11 in MCF7 cells. Highlighed region corresponds to the GeneHancer annotation GH11J101126. Note the parameter setting -n CG to normalised bins for content of CpG dinucleotides. Options --samplepalette and --highlightpalette are used to set colours. The number of windows and smoothing parameters are set automatically but can be modified from automatically set values via -w/--windows and -s/--smoothwindowsize.
methylartist region -i chr11:100956822-101228191 -d MCF7_data_megalodon.txt -g Homo_sapiens.GRCh38.97.chr.sorted.gtf.gz -p 32 -n CG -r Homo_sapiens_assembly38.fasta --samplepalette magma -l 101126888-101129371 --highlightpalette viridis
Expanded to a larger ~2Mbp region:
methylartist region -i chr11:98956822-103228191 -d MCF7_data_megalodon.txt -g Homo_sapiens.GRCh38.97.chr.sorted.gtf.gz -p 32 -n CG -r Homo_sapiens_assembly38.fasta --samplepalette magma -l 101126888-101129371 --highlightpalette viridis
For an even larger region, it helps to drop the read alignment panel:
It’s also possible to use region plots to explore methylation profiles across entire chromosomes. This can be accomplished through turning off the read alignment plot with --skip_align_plot and adjusting --panelratios to set the read alignment plot panel to 0. Setting the image height via --height 4.5 maintains similar panel heights to the default.
Outputs distribution of modified base statistics with current cutoffs for one or more methylartist databases
methylartist scoredist -d MCF7_ATCC.megalodon.db,MCF7_ECACC.megalodon.db -m m
Citation
Seth W Cheetham, Michaela Kindlova, Adam D Ewing, Methylartist: tools for visualizing modified bases from nanopore sequence data, Bioinformatics, Volume 38, Issue 11, 1 June 2022, Pages 3109–3112, https://doi.org/10.1093/bioinformatics/btac292
methylartist
Tools for parsing and plotting methylation patterns
Installation
Available through pip and conda:
pip install methylartistor
conda install -c bioconda methylartistAlternatively:
git clone https://github.com/adamewing/methylartist.gitor download a .zip file from GitHub.Various tests/examples are availble at methylartists-tests:
Input data
Alignments (.bam)
Alignments stored in .bam format should be sorted and indexed and should use the same read names as the associated methylation data.
Modified Base Calls
The easiest way to provide modified basecall data is through .bam files with the
MMandMLtags for modified base calling such as those produced by guppy or dorado. Note that themod_mappings.bamfile output by megalodon will work for modified base calling, but is unsuitable for downstream applications involving sequence variation, including phasing.If .bam files with modified base calls are not available, methylartist has functions for loading methylation data can from nanopolish, megalodon, or basecalled guppy fast5s, using the appropriate function below. Assays that use C/U conversion for methylation inference are also supported through the
db-subcommand as noted below.Once coverted, the sqlite .db file can be input to methylartist functions (e.g.
segplot,locus, etc).Commands:
db-nanopolish
Load nanopolish methylation into sqlite db.
Example:
Loading results from
nanopolish call-methylationto a database:Appending additional results to the above database:
Loading results with the current recommended cutoffs for nanopolish (abs(llr) > 2.0, scale grouped CpGs):
Inputs can be uncompressed or .gzipped.
db-megalodon
Load megalodon methylation into sqlite db.
The input file is the output of
megalodon_extras per_read_text modified_bases /path/to/megalodon_output, which needs to be run prior to this script.The default filename (
/path/to/megalodon_output/per_read_modified_base_calls.txt) is the same for all megalodon runs, so the--dboption is recommended to make the output database more identifiable for downstream analysis.Example:
Appending (
-a) additional results to the above database:Input files can be uncompressed or .gzipped.
db-custom
This enables free-form parsing of modified basecall tables into methylartist .db files for tools where modified base .bam files are not available and certain requirements are met. The table must contain, at a minimum, the read names, genomic position (chromosome and position), strand, and probability of the target base being modified. If not specified by a column (
--modbasecol), the modification is specified by--modbase. The probability is assumed to be a raw probability between 0 and 1 of a given base being modified i.e. 1-p(canonical), other schemes may be used but--canproband--mincanprobmust be specified to set a column for canoncial base scores and a cutoff for a base being canonical.For example, modified basecalls from deepsignal-plant can be loaded as follows:
/home/taewing/methylartist/methylartist db-custom -m deepsignal_example.C.call_mods.tsv --readname 4 --chrom 0 --pos 1 --strand 2 --modprob 7 --modbase m -d deepsignal_example.dbdb-sub
methylartist now supports C/T data if the .bam file is noted as being C/T substitution data via
--ctbam(works for locus, region, segmeth, wgmeth)As of 1.3.0, methylartist supports display of C/U base substitution data (i.e. WGBS or EM-seq data) via creation of a methylartist .db file, simply pass the .bam file as input and specify an output file:
Note that the .bam file has to include the
MDtag (aligners for bisulfite/em-seq data should do this).segmeth
Outputs aggregate methylation / demethylation call counts over intervals. Required before generating strip / violin plots with
segplotRequires whitespace-delimited list of segments in a BED3+2 format: chromosome, start, end, label, strand.
One or more .bam files may be supplied via the
-b/--bamsparameter. Multiple .bams may be comma-delimited.Optional sample input file
-d/--datahas the following whitespace-delimited fields (one sample per line): BAM file, Methylation DB (generated with e.g.db-nanopolish)Highly recommend parallelising with
-p/--procoption if possible.Can be used to generate genome-wide methylation stats aggregated over windows via
bedtools makewindows.Example:
Aggregate whole-genome CpG methylation in 10kbp bins, promoters (Eukaryotic Promoter Database, EPD), L1HS and SVA retrotransposons:
Contents of
MCF7_data_megalodon.txt:Contents of MCF7_megalodon_annotations.bed (first 10 lines):
Output in
MCF7_megalodon_annotations.segmeth.tsv(first 10 lines):segplot
Generates strip plots or violin plots (
-v/--violin) fromsegmethoutput.Examples:
Strip plot of whole-genome CpG methylation in 10kbp bins, promoters (Eukaryotic Promoter Database, EPD), L1HS and SVA retrotransposons:
As above, but use violin plots:
Note that default output is in .png format. For .svg vector output suitable for editing in inkscape or illustrator add the
--svgoption. Note that for strip plots, this is often inadvisable due to the large number of points.New in 1.0.7, ridge plots (
-g/--ridge):Ridge plots can also be grouped by annotation (
--ridge_group_by_annotation) rather than by sample as in the above example.locus
Generates smoothed methylation profiles across specific loci with many configurable parameters for one or more samples.
One or more .bam files (with Mm/Ml tags) may be supplied via the
-b/--bamsparameter. Multiple .bams may be comma-delimited.Optional sample input file
-d/--datahas the following whitespace-delimited fields (one sample per line): BAM file, Methylation DB (generated with e.g.db-nanopolish)Example:
Plot of the GPER1 locus in hg38, highlighting the GeneHancer promoter/enhancer annotation (
GH07J001085).From top to bottom, the plot shows the genome coordinates, gene models (optional if
--gtfis supplied), read mappings with modified bases as closed (modified) or open (unmodified) circles, translation from genome to CpG-only coordinate space, raw log-likelihood ratios, and smoothed methylated fraction plot. Note that--genesresults in plotting only the genes specified after--genes(comma-delimited), leaving this out plots all genes across the window:Custom sample and highlight colours
Examples:
Plot of the ESR2 locus using the
--samplepaletteoption:Plot of the ERBB2 (HER2) locus in hg38, using a data input file (
-d) with custom colour specification:Contents of
MCF7_data_megalodon.colours.txt:Contents of
erbb2.highlights.txt:Short-read data
Since WGBS/EM-seq data is supported by specifying that the .bam is C/T substitution data via
--ctbam, this data can be displayed via any of the other methylartist commands, e.g.:Note
locusmay take a long time to plot short read alignments, a fix is pending.phasing
Use the
--phasedflag to produce haplotype-aware methylation profiles. Requires haplotypes to be tagged with whatshap.Examples:
Plot of the PEG3 imprinted region on chromosome 19, hg38.
Note the apparent differential imprinting between the ATCC and ECACC MCF7 lines.
variant highlighting
The
--variantsoption can be used with a VCF file (bgzipped, tabix-indexed) to highlight variants in the reads:split samples by a variant
Using
--variantsand specify a specific variant ID via the--splitvaroption:Contents of
example_variant.txt.gz:Same region, without
--splitvar mysnpid:region
More tractable than “locus” for larger regions using binned methylation.
Example:
Plot of the PGR (Progesterone Receptor) region on chr11 in MCF7 cells. Highlighed region corresponds to the GeneHancer annotation
GH11J101126. Note the parameter setting-n CGto normalised bins for content of CpG dinucleotides. Options--samplepaletteand--highlightpaletteare used to set colours. The number of windows and smoothing parameters are set automatically but can be modified from automatically set values via-w/--windowsand-s/--smoothwindowsize.Expanded to a larger ~2Mbp region:
For an even larger region, it helps to drop the read alignment panel:
It’s also possible to use region plots to explore methylation profiles across entire chromosomes. This can be accomplished through turning off the read alignment plot with
--skip_align_plotand adjusting--panelratiosto set the read alignment plot panel to 0. Setting the image height via--height 4.5maintains similar panel heights to the default.Example:
composite
Generates “composite” methylation plots where multiple per-element profiles are aligned to and plotted against a reference sequence.
Example:
The contents of
L1HS.bedare formatted as follows (first 10 lines), derived from hg38 repeatmasker .out files:The consensus sequence for establishing a common coordinate system is in
L1.3.fa(fasta formatted with one sequence), available here: L19088.1The contents of
L1.3.highlights.bedused for optional highlighting of regions (ORF1 and ORF2 in this case):Command to
methylartist composite(note the parsing of individual profiles can be parallelised via-p):(adjust number of processes used for computing individual profiles via
-p/--procsto be appropriate for your system)wgmeth
Outputs genome-wide statistics on CpGs covered by at least one call. By default,
wgmethyields output in bedMethyl format (0-based coordinates):The
--dssoption yields genome-wide files suitable for input to DSS for statistical assessment of differential methylation (1-based coordinates):scoredist
Outputs distribution of modified base statistics with current cutoffs for one or more methylartist databases
Citation
Seth W Cheetham, Michaela Kindlova, Adam D Ewing, Methylartist: tools for visualizing modified bases from nanopore sequence data, Bioinformatics, Volume 38, Issue 11, 1 June 2022, Pages 3109–3112, https://doi.org/10.1093/bioinformatics/btac292