Transcript Identification and Selection (TIdeS) is a machine learning approach for identifying putative open reading frames (pORFs) in the correct reading frame, with substantial improvement over other popular tools and support for additional non-standard genetic codes. TIdeS can also classify ORFs into several user-defined categories from highly contaminated datasets (e.g., parasite + host, kleptoplasts, big “dirty” protists), or broadly into “eukaryotic” versus “non-eukaryotic” using the metagenomic classifier Kraken2.
ORF Prediction
Prepare a reference protein database
Feel free to use your own protein database if you choose! Alternatively, we provide a bash script that builds a broad yet compact database from six diverse eukaryotes:
./TIdeS/util/prep_tides_db.sh
ORF Prediction Inputs
FASTA formatted transcriptome assembly
Taxon name (e.g., Homo sapiens, Op_me_Hsap)
Protein database (can be prepared by “prep_tides_db.sh” in the util folder)
Note: example commands can be found in the orf_call_and_decontam.sh script in the examples folder.
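As a rough illustration of how the three inputs listed above might map onto command-line flags, the Python sketch below assembles (but does not run) an argument list. The flag names `-i`, `-o`, `-d`, and `-g` are taken from the list of options in this README; the executable name `tides`, the helper `build_orf_command`, and the example file names are assumptions made for illustration only, so check orf_call_and_decontam.sh for the commands actually used.

```python
import shlex


def build_orf_command(fasta, taxon, db, gencode=None, exe="tides"):
    """Assemble a hypothetical ORF-prediction command as an argument list.

    `exe` is an assumed executable name; the flags come from the options list.
    """
    cmd = [exe, "-i", fasta, "-o", taxon, "-d", db]
    if gencode is not None:
        # -g/--gencode appears in the options list with a default of 1
        cmd += ["-g", str(gencode)]
    return cmd


# Placeholder file names, not files shipped with TIdeS
cmd = build_orf_command("assembly.fasta", "Op_me_Hsap", "tides_db.fasta")
print(shlex.join(cmd))
```

Building the command as a list keeps each input as a separate argument, so taxon names or paths containing spaces do not need manual quoting if the list is later handed to `subprocess.run`.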
TIdeS supports several alternative genetic codes (i.e., codes in which stop codons are reassigned to sense codons). For example, the ‘ciliate’ genetic code is translation table 6 in the NCBI translation tables.
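To make the stop-to-sense reassignment concrete, here is a small, self-contained Python sketch that is independent of TIdeS itself. It builds the standard code (NCBI table 1) and derives the ciliate code (table 6), in which TAA and TAG encode glutamine (Q) and TGA remains the only stop codon. The names `STANDARD`, `CILIATE`, and `translate` are illustrative, not part of TIdeS.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering of bases within each codon position
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

STANDARD = dict(zip(CODONS, STANDARD_AAS))   # translation table 1
CILIATE = dict(STANDARD, TAA="Q", TAG="Q")   # translation table 6


def translate(seq, table):
    """Translate a nucleotide sequence in frame 1; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


orf = "ATGTAAGGATGA"
print(translate(orf, STANDARD))  # TAA is read as a stop
print(translate(orf, CILIATE))   # TAA is read as glutamine
```

Under the standard code this toy sequence appears to terminate after the first codon, whereas under table 6 the reading frame continues to the TGA stop, which is why the choice of genetic code matters when calling ORFs in such taxa.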
ORF Classification and Decontamination
Inputs
Taxon/project name (e.g., Durinskia baltica, Dinotoms)
Table of annotated sequence names (see examples folder) OR path to a formatted Kraken2 database
Python scripts for generating composition plots (orf_composition.py) and for selecting sequences based on composition metrics (seqs_by_composition.py) can be found in the util folder.
Note: example commands can be found in the orf_call_and_decontam.sh script in the examples folder.
Please note that there must be at least 25 annotated sequences for each class (this includes automatic classification with Kraken2).
Table of annotated sequences
The <annotated-seqs-table> should include sequence names and their category separated by tabs. Note that these sequences should be present within the input FASTA file as well. Please aim to include at least 25 sequences for each category, although more (up to ~200) is great!
seq1 human
seq2 lunch
seq3 lunch
seq4 human
seq5 lunch
...
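Before training a classifier, it can help to check a table like the one above against the two requirements stated in this section: every listed sequence is present in the input FASTA, and each category has at least 25 sequences. The Python sketch below is one way to do that; it is not part of TIdeS, and the function names and toy data are illustrative assumptions.

```python
from collections import Counter

MIN_PER_CLASS = 25  # minimum number of sequences per category stated above


def read_annotations(lines):
    """Parse tab-separated 'sequence-name<TAB>category' rows into a dict."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected two tab-separated fields: {line!r}")
        table[fields[0].strip()] = fields[1].strip()
    return table


def fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA header lines."""
    return {ln[1:].split()[0] for ln in lines if ln.startswith(">") and ln[1:].strip()}


def check_table(table, names, minimum=MIN_PER_CLASS):
    """Return (names missing from the FASTA, categories with too few usable rows)."""
    missing = sorted(name for name in table if name not in names)
    counts = Counter(cat for name, cat in table.items() if name in names)
    for cat in set(table.values()):
        counts.setdefault(cat, 0)  # report categories with no usable rows too
    too_small = {cat: n for cat, n in counts.items() if n < minimum}
    return missing, too_small


# Toy data: 30 "human" rows and 10 "lunch" rows, one of them absent from the FASTA
rows = [f"seq{i}\thuman" for i in range(30)] + [f"prey{i}\tlunch" for i in range(10)]
headers = [f">seq{i} len=900" for i in range(30)] + [f">prey{i}" for i in range(9)]
missing, too_small = check_table(read_annotations(rows), fasta_names(headers))
print("missing from FASTA:", missing)
print("categories below minimum:", too_small)
```

Both `read_annotations` and `fasta_names` accept any iterable of lines, so open file handles for the table and the FASTA file can be passed in directly.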
Deploy a previously trained TIdeS model
Inputs
FASTA formatted transcriptome assembly
Taxon/project name (e.g., Durinskia baltica, Dinotoms)
Installation
Note that TIdeS is only supported on UNIX systems (Linux and macOS).
Install with mamba (recommended)
Install with pip
Afterwards, ensure that the following dependencies are installed and in your path:
Dependencies
List of all options
-h, --help
-i, --fin <STRING>
-o, --taxon <STRING>
-t, --threads <INTEGER> (default: 4)
-d, --db <STRING>
-p, --partials
-id, --id <INTEGER> (default: 97)
-l, --min-orf <INTEGER> (default: 300)
-ml, --max-orf <INTEGER> (default: 10000)
-e, --evalue <REAL> (default: 1e-30)
--memory <INTEGER> (default: 2000; unlimited is 0)
-g, --gencode <STRING/INTEGER> (default: 1)
-s, --strand <STRING> (default: both)
-c, --contam <STRING>
-k, --kraken <STRING>
--no-filter
-m, --model <STRING>
--kmer <INTEGER> (default: 3)
--overlap
--step <INTEGER> (default: kmer-length/2)
--clean
-gz, --gzip
Additional uses/approaches
More on how to run TIdeS and its uses can be found in the examples folder, including <annotated-seqs-table> files.