GitHub Conventions:
- Master branch is the current release
- Development branch is not guaranteed to be stable
- Source releases are available under the GitHub release tab and at
http://www.repeatmasker.org
RepeatModeler
RepeatModeler is a de novo transposable element (TE) family identification and
modeling package. At the heart of RepeatModeler are three de-novo repeat finding
programs ( RECON, RepeatScout and LtrHarvest/Ltr_retriever ) which employ
complementary computational methods for identifying repeat element boundaries
and family relationships from sequence data.
RepeatModeler assists in automating the runs of the various algorithms
given a genomic database, clustering redundant results, refining and
classifying the families and producing a high quality library of
TE families suitable for use with RepeatMasker and ultimately for submission
to the Dfam database ( http://dfam.org ).
Authors
RepeatModeler:
Robert Hubley, Arian Smit - Institute for Systems Biology
LTR Pipeline Extensions:
Jullien M. Flynn - Cornell University
Installation
There are two supported paths to installing RepeatModeler on a
UNIX-based server. RepeatModeler may be installed from source as
described in the “Source Distribution Installation” instructions
below, or using one of our Dfam-TETools container images ( Docker or
Singularity ). The containers include RepeatModeler, its
prerequisites and additional TE analysis tools/utilities used by
Dfam. Information on the Dfam-TETools container may be found here:
https://github.com/Dfam-consortium/TETools
RepeatMasker & Libraries
Developed and tested with 4.1.9. The program is available at
http://www.repeatmasker.org/RMDownload.html and is distributed with
Dfam - an open database of transposable element families.
RECON - De Novo Repeat Finder, Bao Z. and Eddy S.R.
Developed and tested with our patched version of RECON ( 1.08 ).
The 1.08 version fixes problems with running RECON on 64 bit machines and
supplies a workaround to a division by zero bug along with some buffer
overrun fixes. The program is available at:
http://www.repeatmasker.org/RECON-1.08.tar.gz.
The original version is available at http://eddylab.org/software/recon/.
RepeatScout - De Novo Repeat Finder, Price A.L., Jones N.C. and Pevzner P.A.
Developed and tested with our extended version of RepeatScout
( 1.0.7 ). This version is available at
https://github.com/Dfam-consortium/RepeatScout
TRF - Tandem Repeat Finder, G. Benson et al.
You can obtain a free copy at http://tandem.bu.edu/trf/trf.html
RepeatModeler requires version 4.0.9 or higher.
RMBlast - A modified version of NCBI Blast for use with RepeatMasker
and RepeatModeler. Precompiled binaries and source can be found at
http://www.repeatmasker.org/RMBlast.html
We highly recommend using 2.14.1 or higher.
UCSC genome browser command-line utilities - Some tools included with
RepeatModeler work with files in the ‘twobit’ file format using
these programs: twoBitToFa, faToTwoBit, and twoBitInfo.
Precompiled binaries and source for these programs can be found at
http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads.
Optional. Required for running LTR structural search pipeline:
LtrHarvest - The LtrHarvest program is part of the GenomeTools suite. We
have developed this release of RepeatModeler on GenomeTools version 1.5.9
available for download from here: http://genometools.org/pub/
NOTE: use the “make threads=yes” build option to enable multi-threaded
runs.
Ltr_retriever - An LTR discovery post-processing and filtering tool. We
currently require version 2.9.0 of LTR_retriever; newer versions
of LTR_retriever will not currently work with RepeatModeler.
https://github.com/oushujun/LTR_retriever/releases
MAFFT - A multiple sequence alignment program. We developed and tested
RepeatModeler using mafft version 7.505. Please use this version or
higher from here:
https://mafft.cbrc.jp/alignment/software/
CD-HIT - A sequence clustering package. We developed and tested
RepeatModeler using version 4.8.1. Please use this version or higher
from:
https://github.com/weizhongli/cdhit
Github : https://github.com/Dfam-consortium/RepeatModeler
Available by cloning the master branch of the RepeatModeler repository
( latest released version ) or by downloading a release from the
repository “releases” tab.
or
Run the “configure” script interactively with prompts
for each setting:
perl ./configure
Run the “configure” script with supplied parameters:
perl ./configure -rscout_dir .. -recon_dir ..
By Hand:
Edit the configuration file “RepModelConfig.pm”
Dynamically:
Use the “configuration overrides” command line options
with the RepeatModeler programs, e.g.:
./RepeatModeler -rscout_dir .. -recon_dir ..
Example Run
In this example we first downloaded elephant sequences
from Genbank ( approx 11MB ) into a file called elephant.fa.
Create a Database for RepeatModeler
RepeatModeler uses an NCBI BLASTDB as input to the
repeat modeling pipeline. A utility is provided to assist
the user in creating a single database from several
types of input structures.
Run “BuildDatabase” without any options in order to see the
full documentation on this utility. There are several options
which make it easier to import multiple sequence files into
one database.
TIP: It is a good idea to place your data files on, and run this
program suite from, a local disk rather than over NFS.
This will greatly improve runtime, as the filesystem
access is considerable.
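For the elephant example, the database creation step might look like the sketch below. The -name option and the positional FASTA argument reflect common BuildDatabase usage but are not spelled out in this text, so treat them as assumptions and confirm them by running BuildDatabase without options. The guard only keeps the snippet harmless where the tool or the input file is absent.

```shell
# Sketch: build a database named "elephant" from elephant.fa.
DB_NAME="elephant"          # name later passed to RepeatModeler
INPUT_FASTA="elephant.fa"   # the downloaded sequences from the example

if command -v BuildDatabase >/dev/null 2>&1 && [ -f "$INPUT_FASTA" ]; then
    # Assumed usage: -name <database name> followed by the input file.
    BuildDatabase -name "$DB_NAME" "$INPUT_FASTA"
else
    echo "BuildDatabase or $INPUT_FASTA not found; nothing done" >&2
fi
```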
Run RepeatModeler
RepeatModeler runs several compute-intensive programs on the
input sequence. For best results, run this on a single machine with
multiple processors and more than 32GB of memory. Our setup is a
Xeon(R) CPU E5-2680 v4 @ 2.40GHz with 28 cores and 128GB of RAM.
To specify a run using 20 threads (at most), and including the new
LTR discovery pipeline:
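A minimal sketch of such an invocation follows. It assumes a database named "elephant" has already been built; the -database and -LTRStruct options and the run.out log name are not spelled out in this text, so treat them as assumptions and check them against the program's built-in help. The guard only keeps the snippet harmless on machines where RepeatModeler is not on the PATH.

```shell
# Sketch: run RepeatModeler in the background with up to 20 threads and
# the LTR structural pipeline enabled, keeping the log for later use.
DB_NAME="elephant"   # assumed database name
THREADS=20           # upper bound on threads, as described above
LOG_FILE="run.out"   # hypothetical log file name

if command -v RepeatModeler >/dev/null 2>&1; then
    nohup RepeatModeler -database "$DB_NAME" -threads "$THREADS" -LTRStruct \
        > "$LOG_FILE" 2>&1 &
    echo "started RepeatModeler as PID $!; log in $LOG_FILE"
else
    echo "RepeatModeler not found on PATH; nothing started" >&2
fi
```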
The nohup (or screen) command is used on our machines when running long
jobs. The log output is saved to a file and the process is backgrounded.
For typical runtimes ( can be > 1-2 days with this configuration on a
well assembled mammalian genome ) see the run statistics section of
this file. It is important to save the log output for later use. It
contains the random number generator seed so that the sampling
process may be reproduced if necessary. In addition, the log file
contains details about the progress of the run for later assessment
of performance or for debugging problems.
Interpret the results
RepeatModeler produces a large number of temporary files, stored
in a directory created at runtime and named like:
RM_<PID>.<DATE> e.g. "RM_5098.MonMar141305172005"
This directory remains after each run for debugging purposes or for
resuming a run if a failure occurs. At the successful completion
of a run, three files are generated:
<database_name>-families.fa : Consensus sequences
<database_name>-families.stk : Seed alignments
<database_name>-rmod.log : A summarized log of the run
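As a quick sanity check after a run, the three result files can be looked up by name and the families counted. This is only a sketch: the database name "elephant" is assumed from the example run, and count_families is a hypothetical helper that counts FASTA header lines.

```shell
# Sketch: summarize the three result files of a finished run.
DB_NAME="elephant"   # assumed database name

# Hypothetical helper: one FASTA header line ('>') per family.
count_families() {
    grep -c '^>' "$1"
}

for suffix in families.fa families.stk rmod.log; do
    result="${DB_NAME}-${suffix}"
    if [ -s "$result" ]; then
        echo "found:   $result"
    else
        echo "missing: $result"
    fi
done

if [ -s "${DB_NAME}-families.fa" ]; then
    echo "families: $(count_families "${DB_NAME}-families.fa")"
fi
```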
The seed alignment file is in a Dfam-compatible Stockholm format and
may be uploaded to the Dfam database by submitting the data to
help@dfam.org or by going to dfam.org/login and creating an upload
account.
The FASTA file is useful for running quick custom library searches
using RepeatMasker.
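A custom library search of this kind might look like the sketch below. It uses RepeatMasker's -lib option to supply the families file; the file names are assumptions (elephant-families.fa follows from the example run, and mySequence.fa is a hypothetical query file).

```shell
# Sketch: mask a sequence file using the RepeatModeler-generated library.
LIBRARY="elephant-families.fa"   # consensus sequences from the example run
QUERY="mySequence.fa"            # hypothetical sequence file to search

if command -v RepeatMasker >/dev/null 2>&1 && [ -f "$LIBRARY" ] && [ -f "$QUERY" ]; then
    RepeatMasker -lib "$LIBRARY" "$QUERY"
else
    echo "RepeatMasker, $LIBRARY or $QUERY not found; nothing done" >&2
fi
```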
Other files produced in the working directory include:
RM_<PID>.<DATE>/
   consensi.fa
   families.stk
   round-1/
       sampleDB-#.fa                 : The genomic sample used in this round
       sampleDB-#.fa.lfreq           : The RepeatScout lmer table
       sampleDB-#.fa.rscons          : The RepeatScout generated consensi
       sampleDB-#.fa.rscons.filtered : The simple repeat/low complexity
                                       filtered version of *.rscons
       consensi.fa                   : The final consensi db for this round
       family-#-cons.html            : A visualization of the model
                                       refinement process. This can be opened
                                       in web browsers that support zooming
                                       ( such as Firefox ). It is used to
                                       track down problems with Refiner.pl.
       index.html                    : An HTML index to all the
                                       family-#-cons.html files.
   round-2/
       sampleDB-#.fa                 : The genomic sample used in this round
       msps.out                      : The output of the sample all-vs-all
                                       comparison
       summary/                      : The RECON output directory
          eles                       : The RECON family output
       consensi.fa                   : Same as above
       family-#-cons.html            : Same as above
       index.html                    : Same as above
   round-3/
       Same as round-2
   ..
   round-n/
Recover from a failure
If for some reason RepeatModeler fails, you may restart an
analysis starting from the last round it was working on. The
-recoverDir [ResultDir] option allows you to specify the
directory ( e.g. RM_<PID>.<DATE>/ ) where a previous run of
RepeatModeler was working, and it will automatically determine
how to continue the analysis.
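Resuming an interrupted run might look like the sketch below. The directory name reuses the example from the results section; passing -database alongside -recoverDir and the recover.out log name are assumptions to verify against the program's built-in help.

```shell
# Sketch: resume a failed run from its working directory.
DB_NAME="elephant"                         # assumed database name
RECOVER_DIR="RM_5098.MonMar141305172005"   # example directory name from above
LOG_FILE="recover.out"                     # hypothetical log file name

if command -v RepeatModeler >/dev/null 2>&1 && [ -d "$RECOVER_DIR" ]; then
    nohup RepeatModeler -database "$DB_NAME" -recoverDir "$RECOVER_DIR" \
        > "$LOG_FILE" 2>&1 &
    echo "resumed from $RECOVER_DIR as PID $!; log in $LOG_FILE"
else
    echo "RepeatModeler or $RECOVER_DIR not found; nothing resumed" >&2
fi
```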
Caveats
RMBlast uses the NCBI stat reporting mechanism to report
usage statistics back over the net. If RepeatModeler is
taking an inordinate amount of time to complete
( > 1 week on a multi-core machine ) or you do not have
an outside network connection to the machine, you should
disable this reporting feature by setting the environment
variable BLAST_USAGE_REPORT=false or by creating a .ncbirc
file in the user's home directory with the stanza:
[BLAST]
BLAST_USAGE_REPORT=false
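Both ways of disabling the reporting can be sketched in a few lines of shell. The choice to leave an existing .ncbirc untouched is an assumption made here for safety; if the file already exists, add the stanza by hand.

```shell
# Option 1: disable usage reporting for the current shell session
# (and for any RepeatModeler run started from it).
export BLAST_USAGE_REPORT=false

# Option 2: persist the setting in the user's home directory.
# An existing file is left alone so that no configuration is overwritten.
NCBIRC="$HOME/.ncbirc"
if [ ! -e "$NCBIRC" ]; then
    printf '[BLAST]\nBLAST_USAGE_REPORT=false\n' > "$NCBIRC"
fi
```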
RepeatModeler is designed to run on assemblies rather
than genome reads. At the start of a run a quick analysis
is performed on the input database to ascertain the
assembly N50. A histogram of contig size is also displayed.
RepeatModeler employs symmetric multiprocessing parallelism,
therefore it should be run on a single machine per-assembly.
It is not recommended that a genome be run in a batched fashion,
nor that the results of multiple RepeatModeler runs on the same
genome be naively combined. Doing so will generate a combined
library that is largely redundant. The -genomeSampleSizeMax
parameter is provided for the purpose of increasing the amount
of the genome sampled while avoiding rediscovery of families.
Please see the RELEASE-NOTES file for more details.
RepeatModeler Statistics
Benchmarks and statistics for runs of RepeatModeler on reference
genomes.
Analysis run on an Ubuntu 22.04.3 LTS Linux system with
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz processors. Both genomes
were run with “-threads 48”.
Credits
-------
Arnie Kas for the work done on the original MultAln.pm.
Andy Siegel for statistics consultations.
Thanks so much to Warren Gish for his invaluable assistance
and consultation on his ABBlast program suite.
Alkes Price and Pavel Pevzner for assistance with RepeatScout
and hosting my multi-sequence version of RepeatScout.
Shujun Ou and Ning Jiang for discussions and assistance with
using LTR_retriever.
This work was supported by the NIH ( R44 HG02244-02),
( RO1 HG002939 ), ( U24 HG010136 ), and the Institute
for Systems Biology.
License
-------
This work is licensed under the Open Source License v2.1.
To view a copy of this license, visit
http://www.opensource.org/licenses/osl-2.1.php or
see the LICENSE file contained in this distribution.
Source Distribution Installation
Prerequisites
Perl - Available at http://www.perl.org/get.html. Developed and tested with version 5.8.8.
RepeatAfterMe - An automated MSA extension program for TE families. This is available for download at: https://github.com/Dfam-consortium/RepeatAfterMe. We recommend 0.0.6 or higher.
Ninja - A tool for large-scale neighbor-joining phylogeny inference and clustering. We developed and tested RepeatModeler using Ninja version “0.98-cluster_only”. Please obtain a copy from: https://github.com/TravisWheelerLab/NINJA/releases/tag/0.98-cluster_only
Installation
Obtain the source distribution
Uncompress and expand the distribution archive:
Configure for your site:
Automatic:
Run the “configure” script interactively with prompts for each setting:
Run the “configure” script with supplied parameters:
By Hand:
Dynamically:
Use the “configuration overrides” command line options with the RepeatModeler programs. e.g:
RepeatModeler 2.0.7 ( RECON + RepeatScout + LTRStruct ) 48 threads: