@article{doi:10.1093/bioinformatics/bty396,
author = {Quang, Daniel and Guan, Yuanfang and Parker, Stephen C J},
title = {YAMDA: thousandfold speedup of EM-based motif discovery using deep learning libraries and GPU},
journal = {Bioinformatics},
volume = {},
number = {},
pages = {bty396},
year = {2018},
doi = {10.1093/bioinformatics/bty396},
URL = {http://dx.doi.org/10.1093/bioinformatics/bty396},
eprint = {/oup/backfile/content_public/journal/bioinformatics/pap/10.1093_bioinformatics_bty396/1/bty396.pdf}
}
Installation
Clone a copy of the YAMDA repository:
git clone https://github.com/daquang/YAMDA.git
Or download a stable release version (v0.1 should reproduce the paper’s results exactly, but uses older libraries):
YAMDA relies on several open source software packages. Links and version numbers for the packages used to develop and
test YAMDA are listed below; however, typically any recent version of these packages should be fine for running YAMDA.
The best and easiest way to install all dependencies is with Anaconda (5.2, Python 3.6
version). Anaconda uses pre-built binaries for specific operating systems to allow simple installation of Python and
non-Python software packages. macOS High Sierra or Ubuntu 18.04 is recommended.
Required dependencies
Python (3.6.5). I chose Python 3.6 instead of Python 2.7 for initial YAMDA development
because the latter will no longer be supported in 2020. YAMDA imports the following standard Python packages:
sys, os, errno, re, argparse, pickle, and itertools.
numpy (1.15.1). Python scientific computing library. Comes pre-packaged in Anaconda.
scipy (1.1.0). Python scientific computing library. Comes pre-packaged in Anaconda.
pyfaidx (0.5.4.1). Python wrapper module for indexing, retrieval, and in-place
modification of FASTA files using a samtools compatible index. Easily installed in Anaconda with the following command
line:
pip install pyfaidx
tqdm (4.29.0). Progress bar. Easily installed in Anaconda with the following
command line:
pip install tqdm
PyTorch (1.0). Tensor computation library from Facebook AI that forms the backbone of YAMDA.
Both GPU and CPU versions are supported. It is recommended you check out the official
PyTorch website for foolproof methods of installation for specific operating systems and hardware
configurations.
tl;dr, the following command line should work most of the time for installing PyTorch.
conda install pytorch torchvision -c pytorch
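After installation, it can help to compare the installed versions against the ones listed above. The following is a minimal Python sketch and is not part of YAMDA. The helper name report_versions is hypothetical, it assumes Python 3.8 or newer for importlib.metadata (newer than the 3.6.5 used for development), and it assumes the packages are registered under the distribution names numpy, scipy, pyfaidx, tqdm, and torch.

```python
from importlib import metadata

# Versions used to develop and test YAMDA, as listed above.
TESTED_VERSIONS = {
    "numpy": "1.15.1",
    "scipy": "1.1.0",
    "pyfaidx": "0.5.4.1",
    "tqdm": "4.29.0",
    "torch": "1.0",
}


def report_versions(packages=TESTED_VERSIONS):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in report_versions().items():
        print(f"{name}: installed={installed}, tested with {TESTED_VERSIONS[name]}")
```

A value of None means the package was not found in the active environment. A version that differs from the tested one is not necessarily a problem, since any recent version should be fine.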
Optional dependencies
These are software packages and Python libraries that are not necessary to run YAMDA, but are nevertheless recommended.
They contain extra utilities that can extend the functionality of YAMDA or help preprocess data. Once again, I’ve put
links and version numbers of what I used, but any recent version of these packages should be fine.
The MEME suite (4.12.0). Appropriately enough, the MEME suite has many tools for
processing FASTA and motif files. Among these is the fasta-shuffle-letters utility, which is useful for generating
negative controls. MEME can be installed from its main website or through Anaconda:
conda install -c bioconda meme
However, for my MacBook Pro, this command line yielded some errors. I had to download a more specific set of binaries
for my specific operating system and version of Python, as follows:
biopython (1.7.0). Required to read bgzipped FASTA files. Convenient if you like storing
files compressed.
conda install -c anaconda biopython
BEDTools (0.7.10). Standard BEDTools suite is useful for extracting FASTA
sequences from BED files. Since I also needed the pybedtools wrapper library, I installed BEDTools with the following
conda command:
conda install -c bioconda pybedtools
Streamlined (can ignore this part if you already manually installed all dependencies)
Anaconda Install
cd /tmp && wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh -O ./anaconda3.sh && bash ./anaconda3.sh -u -b -p /opt/anaconda3 && export PATH="/opt/anaconda3/bin:$PATH" && cd -;
Docker (can ignore this part if you do not intend on doing a Docker installation)
cd /tmp && sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose && sudo chmod +x /usr/local/bin/docker-compose && cd -;
Make the Docker image using the makefile:
make yamda-dock
To have Docker run the CMD you put into the Dockerfile:
sudo docker run yamda-dock
To open a shell in a container, for debugging and so on:
sudo docker run -it yamda-dock bash
When in the container, don’t forget to activate the environment:
source activate YAMDA-env
To kill the image and clean up Docker:
make cleanup
Docker is screwy about importing global variables from your environment, which you’ll probably want now or later. So far, to do it easily and relatively conveniently, you need to enter the variable 4 times in 3 different places: twice in the Dockerfile, once in the .env file, and once in the docker-compose.yml file. I made an example VAR to make it clear how to do that.
Examples
In the examples folder, you will find the narrowPeak and masked FASTA files that are needed to reproduce results in the
manuscript. For your convenience, I have included the major preprocessing steps that typically comprise a de novo
motif discovery pipeline.
Making a masked genome FASTA
Motif discovery for DNA usually performs better on a FASTA sequence set with all repetitive sequences masked. This is
typically accomplished by first generating a masked genome where all repetitive sequence residues are replaced with
capital N’s. The following command lines will download masked hg19 chromosome FASTA files, assemble the individual
files into a single FASTA file (hg19.fa.masked), and remove all intermediate files:
Extracting BED interval FASTA sequences
BEDTools’ fastaFromBed utility is useful for extracting letter sequences from a reference FASTA file based on feature
coordinates. The following command lines demonstrate how to do this from an ENCODE narrowPeak file (H1 POU5F1) to
generate 100 bp sequences centered on peak summits. For simplicity, we will use the same masked genome FASTA file
generated in the previous example.
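To make the summit-centering step concrete, here is a minimal Python sketch of how 100 bp BED intervals can be derived from narrowPeak lines before they are passed to fastaFromBed. It is not the command sequence used in the repository. The function name summit_windows is hypothetical, and the fallback to the interval midpoint when no summit is called (column 10 equal to -1) is an assumption.

```python
def summit_windows(narrowpeak_lines, width=100):
    """Yield (chrom, start, end) intervals of `width` bp centered on each peak summit."""
    half = width // 2
    for line in narrowpeak_lines:
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Column 10 of narrowPeak is the summit offset from `start` (-1 if not called).
        offset = int(fields[9])
        if offset < 0:
            offset = (end - start) // 2  # assumption: fall back to the midpoint
        summit = start + offset
        win_start = summit - half
        if win_start < 0:
            continue  # window would run off the start of the chromosome
        yield chrom, win_start, win_start + width


if __name__ == "__main__":
    example = ["chr1\t1000\t1400\tpeak1\t500\t.\t10.0\t5.0\t3.0\t150"]
    for chrom, start, end in summit_windows(example):
        print(f"{chrom}\t{start}\t{end}")
```

The printed lines are in three-column BED format, so they can be written to a file and used as the interval input for sequence extraction.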
Motif discovery in ChIP-seq
This example demonstrates motif discovery on the H1 POU5F1 ChIP-seq data. YAMDA requires a positive FASTA file
and a negative FASTA file. The latter is typically a dinucleotide-shuffled control version of the positive file. The
fasta-shuffle-letters utility from the MEME suite is useful for this purpose.
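To illustrate what a dinucleotide-shuffled control means, here is a minimal Python sketch of a shuffle that preserves the exact dinucleotide counts of a sequence, in the style of the Altschul-Erickson Eulerian-walk approach. This is not the implementation inside fasta-shuffle-letters, and the function names are hypothetical.

```python
import random
from collections import Counter


def _reaches(vertex, target, final_edge):
    """Follow the chosen final edges from `vertex`; True if the walk ends at `target`."""
    seen = set()
    while vertex != target:
        if vertex in seen:
            return False  # ran into a cycle that never reaches the target
        seen.add(vertex)
        vertex = final_edge[vertex]
    return True


def dinucleotide_shuffle(seq, rng=random):
    """Return a shuffled copy of `seq` with identical dinucleotide counts."""
    if len(seq) < 3:
        return seq
    last = seq[-1]
    # successors[a] lists every letter that directly follows an occurrence of a.
    successors = {}
    for a, b in zip(seq, seq[1:]):
        successors.setdefault(a, []).append(b)
    # Choose a final outgoing edge per letter so that all final edges lead to `last`.
    while True:
        final_edge = {v: rng.choice(nxt) for v, nxt in successors.items() if v != last}
        if all(_reaches(v, last, final_edge) for v in final_edge):
            break
    # Shuffle the remaining edges of each letter and put the final edge at the end.
    for v, nxt in successors.items():
        if v in final_edge:
            nxt.remove(final_edge[v])
            rng.shuffle(nxt)
            nxt.append(final_edge[v])
        else:
            rng.shuffle(nxt)
    # Walk the edges in the chosen order, starting from the original first letter.
    out = [seq[0]]
    used = Counter()
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(successors[cur][used[cur]])
        used[cur] += 1
    return "".join(out)


def dinucleotide_counts(seq):
    """Count every overlapping pair of adjacent letters."""
    return Counter(zip(seq, seq[1:]))
```

Because the counts of adjacent letter pairs are unchanged, the control keeps low-order sequence composition while scrambling longer patterns such as motifs. Passing a seeded random.Random instance as rng makes the shuffle reproducible.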
The run_em.py script executes the motif discovery program on the FASTA pairs. Use python run_em.py -h to get a
detailed description of the script’s arguments. Note that to run this example, you do not necessarily need to run the
previous examples because all the necessary files have already been prepackaged with this repository.
The output folder H1_POU5F1_output contains the following files:
model.pkl. A saved/pickled version of the learned mixture model.
motifs.txt. The discovered motif(s) in Minimal MEME format. This file can be further processed with MEME utilities
such as meme2images and TOMTOM.
positive_seqs.fa. A FASTA of the positive sequences with all instances of the discovered motif(s) erased.
negative_seqs.fa. A FASTA of the negative sequences with all instances of the discovered motif(s) erased.
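For downstream work in Python, motifs.txt can be read with a few lines of code. The following is a minimal sketch of a reader for the Minimal MEME layout (a MOTIF line, a letter-probability matrix header, then one row of probabilities per position). The function names and the sample motif are hypothetical, and the exact header fields written by YAMDA are an assumption.

```python
def read_minimal_meme(text):
    """Parse Minimal MEME text into {motif name: list of probability rows}."""
    motifs = {}
    name = None
    in_matrix = False
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            in_matrix = False  # a blank line ends the current matrix
            continue
        if tokens[0] == "MOTIF":
            name = tokens[1]
            in_matrix = False
        elif tokens[0] == "letter-probability" and name is not None:
            motifs[name] = []
            in_matrix = True
        elif in_matrix:
            try:
                motifs[name].append([float(t) for t in tokens])
            except ValueError:
                in_matrix = False  # a non-numeric line ends the matrix
    return motifs


def consensus(rows, alphabet="ACGT"):
    """Return the most probable letter at each motif position."""
    return "".join(alphabet[row.index(max(row))] for row in rows)


# Made-up sample in Minimal MEME layout, for illustration only.
SAMPLE = """MEME version 4

ALPHABET= ACGT

MOTIF example_motif
letter-probability matrix: alength= 4 w= 3 nsites= 20 E= 0
0.7 0.1 0.1 0.1
0.1 0.1 0.1 0.7
0.1 0.7 0.1 0.1
"""

if __name__ == "__main__":
    for motif_name, rows in read_minimal_meme(SAMPLE).items():
        print(motif_name, consensus(rows))
```

With a real run, the text would come from the motifs.txt file in the output folder rather than from the SAMPLE string.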
Motif discovery in DGF
This example demonstrates motif discovery on the K562 Digital Genomic Footprinting dataset. This is the same example
from EXTREME.
Motif discovery in DGF is similar to motif discovery in ChIP-seq; however, due to the rarity of motifs in DGF datasets,
we found that it helps to erase all overlapping instances of repetitive sequences such as AAAAAA/TTTTTT and
CCCGCCC/GGGCGGG:
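The idea of erasing overlapping repeat instances can be sketched in a few lines of Python. This is only an illustration and not the script shipped with YAMDA. The function name erase_repeats is hypothetical, and replacing erased residues with N is an assumption. The patterns are the ones named above.

```python
import re


def erase_repeats(seq, patterns=("AAAAAA", "TTTTTT", "CCCGCCC", "GGGCGGG"), mask="N"):
    """Replace every occurrence of each pattern, including overlapping ones, with `mask`."""
    erased = list(seq)
    upper = seq.upper()
    for pattern in patterns:
        # A zero-width lookahead lets finditer report overlapping occurrences.
        for match in re.finditer("(?=" + re.escape(pattern) + ")", upper):
            start = match.start()
            erased[start:start + len(pattern)] = mask * len(pattern)
    return "".join(erased)


if __name__ == "__main__":
    print(erase_repeats("CG" + "A" * 7 + "GT"))
```

Matching is done against the original sequence rather than the partially erased one, so a run of seven A's is erased completely even though it contains two overlapping AAAAAA instances.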
Now we can run the YAMDA algorithm on the FASTA file:
The -f argument is one of the most difficult arguments to set, yet perhaps the most important. The closest corresponding argument
in MEME is wnsites. The closer the -f argument is to zero, the stronger the bias towards motifs with exactly the
expected number of sites. The default value of 0.1 works well for most ChIP-seq and some DGF datasets, but for
even rarer motifs, smaller values (e.g. 0.025) are necessary.
To-Do
Here is a list of features I plan to add. They will be added according to demand.
Test YAMDA on RNA and protein sequences
Python 2.7 compatibility
Cythonize seeding step and reduce its memory overhead
Add more examples (e.g. SELEX data)
Add ZOOPS (zero or one occurrence per sequence) and OOPS (one occurrence per sequence) models. YAMDA currently only supports the TCM (two component model), whereas
MEME supports all three. ZOOPS and OOPS may offer faster and more accurate performance for certain datasets, such as ChIP-seq.
In addition, I promise to update YAMDA as library dependencies are updated.
A highly scalable GPU-accelerated de novo motif discovery software package
Please post in the Issues board or e-mail me (daquang@umich.edu) if you have any questions, suggestions, or complaints :)
Install Detailed
Install Easy
Exit Env
source deactivate
Kill Env
conda env remove --name YAMDA-env