Note: This repo is created just for ease of creating a conda package for TEsorter in python3, please refer to the original site at https://github.com/zhangrengang/TEsorter if you have any questions.
TEsorter
It is coded for LTR_retriever to classify long terminal repeat retrotransposons (LTR-RTs) at first. It can also be used to classify any other TE sequences, including Class I and Class II elements which are covered by the REXdb database.
For more details of methods and benchmarking results, see the preprint paper.
git clone https://github.com/NBISweden/TEsorter
cd TEsorter
python setup.py install
Quick Start
# run
TEsorter-test
By default, the newly released REXdb (viridiplantae_v3.0 + metazoa_v3) database is used, which is more sensitive and more common and thus is recommended.
For plants, it might be better to use only the plant database:
To improve sensitivity, reduce the criteria (coverage and E-value):
TEsorter input_file -p 20 -cov 10 -eval 1e-2
To improve specificity, increase the criteria and disable the pass2 mode:
TEsorter input_file -p 20 -cov 30 -eval 1e-5 -dp2
To improve sensitivity of pass-2, reduce the rule:
TEsorter input_file -p 20 -rule 70-30-80
To classify TE polyprotein sequences (an example) or gene protein seqeunces:
TEsorter RepeatPeps.lib -st prot -p 20
Usage
$ TEsorter -h
usage: TEsorter [-h] [-v] [-db {rexdb,rexdb-plant,rexdb-metazoa,gydb}]
[-st {nucl,prot}] [-pre PREFIX] [-fw] [-p PROCESSORS]
[-tmp TMP_DIR] [-cov MIN_COVERAGE] [-eval MAX_EVALUE]
[-dp2] [-rule PASS2_RULE] [-nolib] [-norc] [-nocln]
sequence
positional arguments:
sequence input TE sequences in fasta format [required]
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-db {rexdb,rexdb-plant,rexdb-metazoa,gydb}, --hmm-database {rexdb,rexdb-plant,rexdb-metazoa,gydb}
the database used [default=rexdb]
-st {nucl,prot}, --seq-type {nucl,prot}
'nucl' for DNA or 'prot' for protein [default=nucl]
-pre PREFIX, --prefix PREFIX
output prefix [default='{-s}.{-db}']
-fw, --force-write-hmmscan
if False, will use the existed hmmscan outfile and
skip hmmscan [default=False]
-p PROCESSORS, --processors PROCESSORS
processors to use [default=4]
-tmp TMP_DIR, --tmp-dir TMP_DIR
directory for temporary files [default=./tmp]
-cov MIN_COVERAGE, --min-coverage MIN_COVERAGE
mininum coverage for protein domains in HMMScan output
[default=20]
-eval MAX_EVALUE, --max-evalue MAX_EVALUE
maxinum E-value for protein domains in HMMScan output
[default=0.001]
-dp2, --disable-pass2
do not further classify the unclassified sequences
[default=False for `nucl`, True for `prot`]
-rule PASS2_RULE, --pass2-rule PASS2_RULE
classifying rule [identity-coverage-length] in pass-2
based on simliarity [default=80-80-80]
-nolib, --no-library do not generate a library file for RepeatMasker
[default=False]
-norc, --no-reverse do not reverse complement sequences if they are
detected in minus strand [default=False]
-nocln, --no-cleanup do not clean up the temporary directory
[default=False]
Outputs
rice6.9.5.liban.rexdb.domtbl HMMScan raw output
rice6.9.5.liban.rexdb.dom.faa protein sequences of domain, which can be used for phylogenetic analysis.
rice6.9.5.liban.rexdb.dom.tsv inner domains of TEs/LTR-RTs, which might be used to filter domains based on their scores and coverages.
rice6.9.5.liban.rexdb.dom.gff3 domain annotations in `gff3` format
rice6.9.5.liban.rexdb.cls.tsv TEs/LTR-RTs classifications
Column 1: raw id
Column 2: Order, e.g. LTR
Column 3: Superfamily, e.g. Copia
Column 4: Clade, e.g. SIRE
Column 5: Complete, "yes" means one LTR Copia/Gypsy element with full GAG-POL domains.
Column 6: Strand, + or - or ?
Column 7: Domains, e.g. GAG|SIRE PROT|SIRE INT|SIRE RT|SIRE RH|SIRE; `none` for pass-2 classifications
rice6.9.5.liban.rexdb.cls.lib fasta library for RepeatMasker
rice6.9.5.liban.rexdb.cls.pep the same sequences as `rice6.9.5.liban.rexdb.dom.faa`, but id is changed with classifications.
Limitations
For each domain (e.g. RT), only the best hit with the highest score will output, which means: 1) if frame is shifted, only one part can be annotated; 2) for example, if two or more RT domains are present in one query sequence, only one of these RT domains will be annotated.
Many LTR-RTs cannot be classified due to no hit, which might be because: 1) the database is still incompleted; 2) some LTR-RTs may have too many mutations such as frame shifts and stop gains or have lost protein domains; 3) some LTR-RTs may be false positive. For the test data set (rice6.9.5.liban), ~84% LTR-RTs (_INT sequences) are classified.
Non-autonomous TEs that lack protein domains, some un-active autonomous TEs that have lost their protein domains and any other elements that contain none protein domains, are excepted to be un-classified.
Further phylogenetic analyses
You may want to use the RT domains to analysis relationships among retrotransposons (LTR, LINE, DIRS, etc.). Here is an example (with mafft and iqtree installed):
# to extract RT domain sequences
cat rice6.9.5.liban.rexdb.dom.tsv | grep -P "\-RT\t" | get_record.py -i rice6.9.5.liban.rexdb.dom.faa -o rice6.9.5.liban.rexdb.dom.RT.faa -t fasta
# to align with MAFFT or other tools
mafft --auto rice6.9.5.liban.rexdb.dom.RT.faa > rice6.9.5.liban.rexdb.dom.RT.aln
# to reconduct the phylogenetic tree with IQTREE or other tools
iqtree -s rice6.9.5.liban.rexdb.dom.RT.aln -bb 1000 -nt AUTO
# Finally, visualize and edit the tree 'rice6.9.5.liban.rexdb.RT.faa.aln.treefile' with FigTree or other tools.
Note: the domain names between rexdb and gydb are different: PROT (rexdb) = AP (gydb), RH (rexdb) = RNaseH (gydb). You should use the actual domain name.
Extracting TE sequences from genome for TEsorter
Here are examples to extract TE sequences from outputs of wide-used softwares, when you have only genome sequences.
extract all TE sequences from RepeatMasker output:
```
run RepeatMasker, which will generate a *.out file.
Note: This repo is created just for ease of creating a conda package for TEsorter in python3, please refer to the original site at https://github.com/zhangrengang/TEsorter if you have any questions.
TEsorter
It is coded for LTR_retriever to classify long terminal repeat retrotransposons (LTR-RTs) at first. It can also be used to classify any other TE sequences, including Class I and Class II elements which are covered by the REXdb database.
For more details of methods and benchmarking results, see the preprint paper.
Table of Contents
Installation
Using bioconda
Old school
Dependencies:
pip install biopythonorconda install biopython+ parallel python v1.6.4.4: quickly install byconda install ppconda install hmmerconda install blastQuick Start
By default, the newly released REXdb (viridiplantae_v3.0 + metazoa_v3) database is used, which is more sensitive and more common and thus is recommended.For plants, it might be better to use only the plant database:
Classical GyDB can also be used:
To speed up, use more processors [default=4]:
To improve sensitivity, reduce the criteria (coverage and E-value):
To improve specificity, increase the criteria and disable the pass2 mode:
To improve sensitivity of pass-2, reduce the rule:
To classify TE polyprotein sequences (an example) or gene protein seqeunces:
Usage
Outputs
Limitations
Further phylogenetic analyses
You may want to use the RT domains to analysis relationships among retrotransposons (LTR, LINE, DIRS, etc.). Here is an example (with mafft and iqtree installed):
The alignments can also be generated by:
The alignments of LTR-RTs full domains can be generated by (align and concatenate):
The alignments of Class I INT and Class II TPase (DDE-transposases) can be generated by:
Note: the domain names between rexdb and gydb are different: PROT (rexdb) = AP (gydb), RH (rexdb) = RNaseH (gydb). You should use the actual domain name.
Extracting TE sequences from genome for TEsorter
Here are examples to extract TE sequences from outputs of wide-used softwares, when you have only genome sequences.
run RepeatMasker, which will generate a *.out file.
RepeatMasker [options] genome.faextract sequences
RepeatMasker.py out2seqs genome.fa.out genome.fa > whole_genome_te.fa
classify
TEsorter whole_genome_te.fa [options]
run LTR_retriever, which generate two *.pass.list files.
LTR_retriever -genome genome.fa [options]
extract sequences
LTR_retriever.py get_full_seqs genome.fa > intact_ltr.fa
classify
TEsorter intact_ltr.fa [options]