RNAProt is a computational RBP binding site prediction framework based on recurrent neural networks (RNNs). Conceived as an end-to-end method, RNAProt includes all necessary functionalities, from dataset generation over model training to the evaluation of binding preferences and binding site prediction. Various input types and features are supported, accompanied by comprehensive statistics and visualizations to inform the user about datatset characteristics and learned model properties.
Citation
The RNAProt paper is available at GigaScience.
Supplementary data including benchmark and training data sets can be downloaded from Zenodo.
RNAProt utilizes RBP binding sites identified by CLIP-seq and related protocols to train an RNN-based model. The model is then used to predict new binding sites on given input RNA sequences. The following figure illustrates the RNAProt framework and its general workflow:
Yellow boxes mark necessary framework inputs, blue boxes the five program modes of RNAProt, and green boxes the framework outputs. Arrows show the dependencies between inputs, modes, and outputs. RNAProt accepts RBP binding sites in FASTA or BED format (transcript or genomic regions). The latter one also requires a genomic sequence file (.2bit format) and a genomic annotations file (GTF format).
Additional inputs are available, depending on the binding site input type as well as the selected features. For more details on inputs, modes, supported features, and outputs, see the Documentation.
Installation
RNAProt was tested on Ubuntu (18.04 LTS), with Nvidia driver >=440, CUDA >=10, and various Nvidia graphics cards (RTX 2080 Ti, RTX 2070, GTX 1060, GTX 1030). We thus assume that you have a similar system available and running. While RNAProt runs fine without a dedicated GPU, we definitely recommend having an Nvidia graphics card with CUDA support for speeding up model training (specifically we recommend a >= GTX 1060 or a similar newer model, with >= 4 GB RAM). Regarding main memory, we recommend at least 8 GB RAM.
In the following we show how to install RNAProt via Conda package (easiest way + recommended), or alternatively manually (not too difficult either). In any case, you first need Conda running on your computer.
Conda
If you do not have Conda yet, you can e.g. install miniconda, a free + lightweight Conda installer. Get miniconda here, choose the Python 3.8 Miniconda3 Linux 64-bit installer and follow the installation instructions. In the end, Conda should be evocable on the command line via (possibly in a different version):
$ conda --version
conda 4.10.0
Conda package installation
RNAProt is available as Conda package here. This is the most convenient way to install RNAProt, since Conda takes care of all the dependencies. Note however that the Conda package version might not always be the latest release (but we work hard to not let this happen).
We recommend to create a Conda environment inside which we will then install RNAProt:
Note that Python 3.8 is not a necessity (higher versions should work just as fine). If you are experiencing problems while running conda install -c bioconda rnaprot (e.g. complaints about conflicting dependencies), the following commands should do the trick:
This tells conda to explicitly look for packages in the specified channels, stored in the .condarcconda configuration file.
NOTE that the bioconda installation only includes the CPU version of RNAProt (no GPU support). If you have a GPU supporting CUDA and want to take advantage of it (very much recommended for fast model training!), just install in addition:
conda install -c conda-forge pytorch-gpu=1.8
RNAProt was initially written and tested with pytorch 1.8. However, higher version numbers should work as well (i.e., you can simply discard the version number). Now RNAProt should be available inside the environment:
rnaprot -h
Finally, if you have a compatible GPU, we want to check whether the GPU (CUDA) is available for RNAProt:
This is great news, meaning that we can RNAProt with GPU support.
Manual installation
To manually install RNAProt, we first create a Conda environment (as described above). Once inside the environment, we need to install the following dependencies for the GPU version (GPU with CUDA support required):
If you don’t have a CUDA supporting GPU (and you’re not planning on getting one any time soon either), you don’t need to install the additional GPU dependencies. To install pyTorch without GPU support, simply exchange the above pytorch call with:
conda install -c conda-forge pytorch-cpu=1.8
Concerning version numbers, RNAProt was tested with the following versions: pytorch=1.8.0, seaborn=0.11.1, viennarna=2.4.17, bedtools=2.30.0, logomaker=0.8, hpbandster=0.7.4, markdown=3.2.2, plotly=4.14.3, and scikit-learn=0.24.1. Note that so far we have not experienced problems with higher pytorch version numbers (i.e., simply discarding the version number should work just as well).
Finally, to install RNAProt, we simply clone the repository and execute the installation script inside the folder:
Now we can run RNAProt from any given folder (just remember to re-activate the environment once you open a new shell):
rnaprot -h
Test runs
Once installed, we can do some quick test runs.
Test example with FASTA sequences as input
We first train a sequence model, using a provided set of positive and negative FASTA sequences sampled from the PARCLIP PUM2 dataset (3,000 positives, 3,000 negatives, all sequences with length 81 nt). In the following we will mainly use default parameters, but note that there are many options available for each program mode. To learn more about the mode options, refer to the Documentation, or simply list all mode options, e.g. for rnaprot train, by typing:
rnaprot train -h
Before training a model, we need to generate an RNAProt training dataset. For this we go to the cloned repository folder (clone + enter if not already there), and use the FASTA sequences supplied in the test folder as training data. To get training set statistics, we also enable --report:
We can then take a look at the report.rnaprot_gt.html inside test_gt_out, informing us about similarities and differences between the positive and negative set. The content of the HTML report depends on selected features (e.g. structure, conservation scores, region annotations), and the input type given to rnaprot gt (FASTA sequences, genomic sites BED, or transcript sites BED). Here for example we can compare k-mer statistics of the positive and negative set, observing that the positives tend to contain more AA, UU, and AU repeat sites. This likely also contributes to the lower sequence complexity of the postive set.
Next we train a model on the created dataset, using the default hyperparameters. For this we simply run rnaprot train with the rnaprot gt output folder as input. We also enable --verbose-train, to see the learning progress over the number of epochs:
In the end we get a summary for the trained model, e.g. reporting the model validation AUC, the training runtime, and set hyperparameters. To obtain identical models from identical calls, you have to set a fixed random seed number (e.g. --seed 1). This guarantees that the same initial random weights are set when training the network, as well as the same train-validation-test splits are used.
Note that if you want to obtain the generalization peformance of the model on the given dataset, you need to run rnaprot train in cross validation mode (10-fold by default) by adding --cv:
This will plot a sequence logo informing us about global preferences, as well as profiles for the top 25 scoring sites (default setting). The profiles contain the saliency map and single mutations track, giving us an idea what local information the model regards as important for each of the 25 sites. As with the other modes, more options are available (e.g. --report for additional statistics, comparing two models, or specifying motif sizes and which profiles to plot).
Now that we have a model, we naturally want to use it for prediction. For this we first create a prediction dataset, choosing the lncRNA NORAD for window prediction. NORAD was shown to act as a decoy for PUMILIO proteins (PUM1/PUM2). We therefore use its provided FASTA sequence as input:
rnaprot gp --in test/NORAD_lncRNA.fa --train-in PUM2_PARCLIP_train_out --out PUM2_PARCLIP_gp_out --report
Note that the input can be any number of sequences, genomic regions, or transcript regions (also see examples below).
By default, RNAProt predicts whole sites, i.e., we would get one score returned for the whole lncRNA. To run the window prediction, we use --mode 2, and also plot the top window profiles containing the reported peak regions:
Now we can take a look at the predicted peak regions (BED, TSV), or observe the profiles just like for rnaprot eval. The predicted peak regions are stored in BED format, as well as in a table file with additional information (.tsv). For details on output formats, see the Documentation. Note that while model prediction itself is very fast, plotting (especially getting the single mutation infos) takes some time for each peak region. So if you predict on a large set of input sites or sequences, you might want to disable plotting (or just exercise patience and wait). Knowing this, you can also predict only on certain input sites or a small subset (e.g. site_id1, site_id2), by adding --site-id site_id1 site_id2.
Test example with genomic regions as input
To create datasets based on genomic or transcript regions, we first need to download two additional files. Specifically, we need a GTF file (containing genomic annotations), as well as a .2bit file (containing the genomic sequences). Note that we used Ensembl GTF files to test RNAProt, and therefore recommend using these. You can find them here for many model organisms. The .2bit genome file we will download from UCSC. For this example, we choose the human genome + annotations (hg38 assembly):
If the output is cryptic instead, you need to do it again. Next we download some genomic RBP binding regions identified by eCLIP from ENCODE. The ENCODE website contains a huge collection of eCLIP datasets for various RBPs. For this example, we again download PUM2 binding sites, choosing the IDR peaks identified by ENCODE’s CLIPper pipeline (PUM2 K562 eCLIP dataset ID: ENCFF880MWQ, PUM2 K562 IDR peaks dataset ID: ENCFF880MWQ). We unzip it and change the format to 6-column BED which RNAProt likes best:
Note that we move the log2 fold change value from column 7 (original file) to column 5, which is used by RNAProt to filter and select sites in case of overlaps. By default, rnaprot gt removes overlapping sites by selecting only the highest-scoring site from a set of overlapping sites. To disable this, set --allow-overlaps. If there are no column 5 scores given (or all the same), filtering of overlaps is disabled by default. Moreover, positive sites that do not overlap with gene regions are by default filtered out too. To disable this, set --no-gene-filter.
Next we create a training dataset, by supplying the downloaded GTF and .2bit file:
Thanks to the given GTF file, the HTML report will now also include information on target gene regions and biotypes. Note that by default, rnaprot gt centers the input BED regions, and extends them on both sides by the set --seq-ext (by default 40). If you want to keep the original site lengths, set --mode 2 --seq-ext 0. In this case, you might also want to filter the --in sites by --max-len or --min-len, e.g. --max-len 100. Of course you can also extend the original sites, e.g. by setting --mode 2 --seq-ext 10. Alternatively, you can set --mode 3 to use the region upstream ends and extend by --seq-ext. Note that the negatives are generated randomly, so the same rnaprot gt call will produce a different set of random negative sites (unless negatives are supplied). To obtain the same random negatives repeatedly from identical calls, you have to set a fixed random seed number (e.g. --seed 1).
Now we can train a model and evaluate it just like in the example above:
For prediction, we could again use the folder generated by rnaprot gp from the above FASTA sequences + NORAD example. However, since we now have the GTF + genome .2bit file, we can also get its genomic or transcript region directly from these files (no need to download FASTA sequences or a gene BED file). As with rnaprot gt, rnaprot gp accepts sequences, genomic regions, or transcript regions as input. To get its genomic or transcript region, we just need its gene or transcript ID (as long as it is in the GTF file), and then use one of the two helper scripts installed together with RNAProt:
The two scripts also accept > 1 ID (either on the command line or more practically given as a file with one ID per row). In the case of NORAD, both transcript and gene region have the same length, since NORAD contains no introns and only one annotated isoform:
We can now use any of the two as input to rnaprot gp. As one would expect, extracting and providing the transcript region will result in predictions only on the transcript sequence (always excluding introns), while providing the gene region will result in predictions on the whole gene region (usually including introns). In general, we can specify any genomic or transcript (sub)region for prediction, as long as it is annotated in the GTF file. For now we will use the gene region, on which rnaprot predict will then predict and return peak regions directly with genomic coordinates (also contained inside the profiles for easier orientation):
Note that in this example we did not filter out sites from PUM2_K562_IDR_peaks.bed that overlap with the NORAD gene region prior to training. These sites indeed exist, as we can see in the rnaprot gt HTML report (Target region overlap statistics), so for an unbiased prediction we need remove them first. For this we use bedtools intersectBed, and again run rnaprot gt, rnaprot train, and rnaprot predict:
RNAProt supports various additional position(nucleotide)-wise features to learn from, such as secondary structure, region annotations (including user-defined ones), or conservation scores (see Documentation for details). For this we have to specify what features to include in rnaprot gt and rnaprot gp, and depending on the feature also provide additional files. For model training (rnaprot train) we can then specify what features to use for training, from the features included in rnaprot gt. This has the advantage that features need to be extracted or computed only once, and that various feature combinations and parameter settings can be with rnaprot train.
In this example, we want to include secondary structure on top of the sequence information. We will again use a provided dataset, containing 2,274 potential Roquin binding sites (also termed CDEs for constitutive decay elements) from Braun et al. 2018.
The CDEs were predicted using a biologically verified consensus structure consisting of a 6-8 bp long stem capped with a YRN (Y: C or U, R: A or G, N: any base) tri-nucleotide loop. We also note that the sequence conservation is rather low (specifically for the stem portion), making it an ideal test case for the benefits of including secondary structure information. We first generate a training set, by enabling structure calculation (--str) and using the CDE sites provided in the cloned repository folder:
Structure calculation can be further customized by changing the RNAplfold parameters (--plfold-u 3 --plfold-l 50 --plfold-w 70 by default). Regarding the type of structure information, RNAplfold calculates the probabilities of structural elements for each site position, which are then used as feature channels for training and prediction. For genomic or transcript --in sites, RNAProt automatically extends the sites by --plfold-w on both sides (or less at transcript ends) for a precise structure calculation.
Whether to use the probabilities or a one-hot encoding can be further specified in training (--str-mode, four options). Note that we use --allow-overlaps and --no-gene-filter, disabling the filtering of sites based on no gene overlap or overlap with other sites. These two options guarantee that all --in sites will be part of the generated training set. Now we want to train a model on the generated dataset:
Here we increased the maximum number of epochs to 300, since for smaller datasets model performance can sometimes still improve beyond the default 200 epochs. This can be easily monitored with --verbose-train enabled. In addition, increasing --patience might sometimes be necessary, to prevent model training with the model stuck early on in the training process from stopping (although we experienced this only very rarely).
Also note that if we do not specify what features to use, RNAProt will use all features present in CDE_sites_gt_out for training. Thus, to train a sequence-only model, we would need to specify:
To create a prediction set for the structure model, we use the UCP3 gene (transcript ID ENST00000314032), for which the authors validated two CDE sites in its 3’UTR (see Fig.2A blue and red hairpin).
rnaprot gp --in test/ENST00000314032.fa --train-in CDE_sites_str_train_out --out CDE_sites_str_gp_out
Note that rnaprot gp automatically detects what features were used for training the model, enabling structure prediction with set parameters for the prediction set generation. Depending on the additional feature, we thus might have to supply additional input files for extracting the respective feature information. This would be the case for conservation scores (--phastcons, --phylop), or user-defined features (--feat-in). After creating the prediction set we run a window prediction on the transcript:
In our case the model successfully predicted the two verified binding sites (all together 4 sites predicted) on the transcript (using threshold level --thr 1). The first verified loop is at transcript position 1,371 to 1,373 (loop nucleotides), the second loop from 1,404 to 1,406, with the second hairpin having a higher folding probability and score. To check this, we take a look at the reported peak regions (BED, TSV), or have a look at the plotted profiles, which show the sites annotated with transcript coordinates (or genomic coordinates in case of genomic sites as input) to quickly identify regions of interest.
The test/ folder also includes the model we used to predict, which you can easily apply yourself to compare:
This also shows how easy it is to share models. Once the model is trained, the rnaprot train --out folder can be copied and reused. To zip the train folder you can run:
zip my_cde_model.zip CDE_sites_str_train_out/*
Note however that predictions between two models can vary, since negative sites generation is random. Moreover, even models trained on the same positive and negative sites are slightly different from one another and thus can lead to slightly different predictions. This is because model training incorporates stochastic processes, like the random initialization of network weights, or the application of dropout.
RNAProt is divided into five different program modes: training set generation (rnaprot gt), model training (rnaprot train), model evaluation (rnaprot eval), prediction set generation (rnaprot gp), and model prediction (rnaprot predict).
An overview of the modes can be obtained by:
$ rnaprot -h
usage: rnaprot [-h] [-v] {train,eval,predict,gt,gp} ...
Modelling RBP binding preferences to predict RPB binding sites.
positional arguments:
{train,eval,predict,gt,gp}
Program modes
train Train a binding site prediction model
eval Evaluate properties learned from training set
predict Predict binding sites (whole sites or profiles)
gt Generate training data set
gp Generate prediction data set
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
The following sections describe each mode in more detail.
Training set generation
The following command line arguments are available in rnaprot gt mode:
$ rnaprot gt -h
usage: rnaprot gt [-h] --in str --out str [--gtf str] [--gen str]
[--mode {1,2,3}] [--mask-bed str] [--seq-ext int]
[--thr float] [--rev-filter] [--max-len int] [--min-len int]
[--keep-ids] [--allow-overlaps] [--no-gene-filter]
[--neg-comp-thr float] [--neg-factor {2,3,4,5}]
[--keep-add-neg] [--neg-in str] [--shuffle-k {1,2,3}]
[--seed int] [--report] [--theme {1,2}] [--eia] [--eia-ib]
[--eia-n] [--eia-all-ex] [--tr-list str] [--phastcons str]
[--phylop str] [--tra] [--tra-codons] [--tra-borders]
[--rra] [--str] [--plfold-u int] [--plfold-l int]
[--plfold-w int] [--feat-in str] [--feat-in-1h]
optional arguments:
-h, --help show this help message and exit
--gtf str Genomic annotations GTF file (.gtf or .gtf.gz)
--gen str Genomic sequences .2bit file
--mode {1,2,3} Define mode for --in BED site extraction. (1) Take the
center of each site, (2) Take the complete site, (3)
Take the upstream end for each site. Note that --min-
len applies only for --mode 2 (default: 1)
--mask-bed str Additional BED regions file (6-column format) for
masking negatives (e.g. all positive RBP CLIP sites)
--seq-ext int Up- and downstream sequence extension length of sites
(site definition by --mode) (default: 40)
--thr float Minimum site score (--in BED column 5) for filtering
(assuming higher score == better site) (default: None)
--rev-filter Reverse --thr filtering (i.e. the lower the better,
e.g. for p-values) (default: False)
--max-len int Maximum length of --in sites (default: 300)
--min-len int Minimum length of --in sites (only effective for
--mode 2). If length < --min-len, take center and
extend to --min-len. Use uneven numbers for equal up-
and downstream extension (default: 21)
--keep-ids Keep --in BED column 4 site IDs. Note that site IDs
have to be unique (default: False)
--allow-overlaps Do not select for highest-scoring sites in case of
overlapping sites (default: False)
--no-gene-filter Do not filter positives based on gene coverage (gene
annotations from --gtf) (default: False)
--neg-comp-thr float Sequence complexity (Shannon entropy) threshold for
filtering random negative regions (default: 0.5)
--neg-factor {2,3,4,5}
Determines number of initial random negatives to be
extracted (== --neg-factor n times # positives)
(default: 2)
--keep-add-neg Keep additional negatives (# controlled by --neg-
factor) instead of outputting same numbers of positive
and negative sites (default: False)
--neg-in str Negative genomic or transcript sites in BED (6-column
format) or FASTA format (unique IDs required). Use
with --in BED/FASTA. If not set, negatives are
generated by shuffling --in sequences (if --in FASTA)
or random selection of genomic or transcript sites (if
--in BED)
--shuffle-k {1,2,3} Supply k for k-nucleotide shuffling of --in sequences
to generate negative sequences (if no --neg-in
supplied) (default: 2)
--seed int Set a fixed random seed number (e.g. --seed 1) to
obtain the same random negative set for identical
rnaprot gt runs
--report Output an .html report providing various training set
statistics and plots (default: False)
--theme {1,2} Set theme for .html report (1: palm beach, 2: midnight
sunset) (default: 1)
required arguments:
--in str Genomic or transcript RBP binding sites file in BED
(6-column format) or FASTA format. If --in FASTA, only
--str is supported as additional feature. If --in BED,
--gtf and --gen become mandatory
--out str Output training data folder (== input folder to
rnaprot train)
additional annotation arguments:
--eia Add exon-intron annotations to genomic regions
(default: False)
--eia-ib Add intron border annotations to genomic regions (in
combination with --eia) (default: False)
--eia-n Label regions not covered by intron or exon regions as
N instead of labelling them as introns (I) (in
combination with --eia) (default: False)
--eia-all-ex Use all annotated exons in --gtf file, instead of
exons of most prominent transcripts defined by --tr-
list. Set this and --tr-list will be effective only
for --tra (default: False)
--tr-list str Supply file with transcript IDs (one ID per row) for
exon-intron labeling (using the corresponding exon
regions from --gtf). By default, exon regions of the
most prominent transcripts (automatically selected
from --gtf) are used (default: False)
--phastcons str Genomic .bigWig file with phastCons conservation
scores to add as annotations
--phylop str Genomic .bigWig file with phyloP conservation scores
to add as annotations
--tra Add transcript region annotations (5'UTR, CDS, 3'UTR,
None) to genomic and transcript regions (default:
False)
--tra-codons Add start and stop codon annotations to genomic or
transcript regions (in combination with --tra)
(default: False)
--tra-borders Add transcript and exon border annotations to
transcript regions (in combination with --tra)
(default: False)
--rra Add repeat region annotations for genomic or
transcript regions retrieved from --gen .2bit
(default: False)
--str Add secondary structure probabilities feature
(calculate with RNAplfold) (default: False)
--plfold-u int RNAplfold -u parameter value (default: 3)
--plfold-l int RNAplfold -L parameter value (default: 50)
--plfold-w int RNAplfold -W parameter value (default: 70)
--feat-in str Provide tabular file with additional position-wise
genomic region features (infos and paths to BED files)
to add
--feat-in-1h Use one-hot encoding for all additional position-wise
features from --feat-in table, ignoring type
definitions in --feat-in table (default: False)
Note that if genomic or transcript regions are supplied via --in, --gen and --gtf become required arguments.
Model training
The following command line arguments are available in rnaprot train mode:
$ rnaprot train -h
usage: rnaprot train [-h] --in IN_FOLDER --out OUT_FOLDER [--only-seq]
[--use-phastcons] [--use-phylop] [--use-eia] [--use-tra]
[--use-rra] [--use-str] [--str-mode {1,2,3,4}]
[--use-add-feat] [--cv] [--cv-k {5,10}]
[--val-size float] [--add-test] [--test-ids str]
[--keep-order] [--seed int] [--plot-lc] [--verbose-train]
[--force-cpu] [--epochs int] [--patience int]
[--batch-size int] [--lr float] [--weight-decay float]
[--n-rnn-layers int] [--n-hidden-dim int] [--dr float]
[--n-fc-layers {1,2}] [--model-type {1,2,3,4}] [--embed]
[--embed-dim int] [--run-bohb] [--bohb-n int]
[--bohb-min-budget int] [--bohb-max-budget int]
[--bohb-workers int] [--verbose-bohb]
optional arguments:
-h, --help show this help message and exit
required arguments:
--in IN_FOLDER Input training data folder (output of rnaprot gt)
--out OUT_FOLDER Model training results output folder
feature definition arguments:
--only-seq Use only sequence feature. By default all features
present in --in are used (default: False)
--use-phastcons Add phastCons conservation scores. Set --use-xxx to
define which features to add on top of sequence
feature (by default all --in features are used)
--use-phylop Add phyloP conservation scores. Set --use-xxx to
define which features to add on top of sequence
feature (by default all --in features are used)
--use-eia Add exon-intron annotations. Set --use-xxx to define
which features to add on top of sequence feature (by
default all --in features are used)
--use-tra Add transcript region annotations. Set --use-xxx to
define which features to add on top of sequence
feature (by default all --in features are used)
--use-rra Add repeat region annotations. Set --use-xxx to define
which features to add on top of sequence feature (by
default all --in features are used)
--use-str Add secondary structure features (type defined by
--str-mode). Set --use-xxx to define which features to
add on top of sequence feature (by default all --in
features are used)
--str-mode {1,2,3,4} Define secondary structure feature representation: 1)
use probabilities of five structural elements
(E,I,H,M,S) 2) same as 1) but encoded as one-hot
(element with highest probability gets 1, others 0) 3)
use unpaired probabilities 4) same as 3) but encoded
as one-hot (default: 1)
--use-add-feat Add additional feature annotations. Set --use-xxx to
define which features to add on top of sequence
feature (by default all --in features are used)
model definition arguments:
--cv Run cross validation in combination with set
hyperparameters to evaluate model generalization
performance (default: False)
--cv-k {5,10} Cross validation k for evaluating generalization
performance (use together with --cv) (default: 10)
--val-size float Validation set size for training final model as
percentage of all training sites. NOTE that if --add-
test is set, the test set will have the same size (so
if --val-size 0.2, train on 60 percent, validate on 20
percent, and test on 20 percent) (default: 0.2)
--add-test Use a part of the training set as a test set to
evaluate final model. Test set size is controlled by
--val-size (default: False)
--test-ids str Provide file with test IDs to be used as a test set
for testing final model. Test IDs need to be part of
--in training set. Not compatible with --add-test,
--cv, or --gm-cv
--keep-order Use same train-validation(-test) split for each call
to train final model. Test split only if --add-test or
--test-ids (default: False)
--seed int Set a fixed random seed number (e.g. --seed 1) to
obtain identical model results for identical rnaprot
train runs
--plot-lc Plot learning curves (training vs validation loss) for
each tested hyperparameter combination (default:
False)
--verbose-train Enable verbose output during model training to show
performance over epochs (default: False)
--force-cpu Run on CPU regardless of CUDA available or not
(default: False)
--epochs int Maximum number of training epochs (default: 200)
--patience int Number of epochs to wait for further improvement on
validation set before stopping (default: 30)
--batch-size int Gradient descent batch size (default: 50)
--lr float Learning rate of optimizer (default: 0.001)
--weight-decay float Weight decay of optimizer (default: 0.0005)
--n-rnn-layers int Number of RNN layers (default: 1)
--n-hidden-dim int Number of RNN layer dimensions (default: 32)
--dr float Rate of dropout applied after RNN layers (default:
0.5)
--n-fc-layers {1,2} Number of fully connected layers following RNN layers
(default: 1)
--model-type {1,2,3,4}
RNN model type to use. 1: GRU, 2: LSTM, 3: biGRU, 4:
biLSTM (default: 1)
--embed Use embedding layer for sequence feature, instead of
one-hot encoding (default: False)
--embed-dim int Dimension of embedding layer (default: 10)
--run-bohb Use BOHB to run a hyperparameter optimization. NOTE
that this will overwrite set hyperparameters, and
trains the final model with the found best
hyperparameter setting. ALSO NOTE that this will take
some time (!) (default: False)
--bohb-n int Number of BOHB iterations (default: 80)
--bohb-min-budget int
BOHB minimum budget (default: 5)
--bohb-max-budget int
BOHB maximum budget (default: 40)
--bohb-workers int Number of BOHB worker threads for local multi-core
parallel computing (default: 1)
--verbose-bohb Enable verbose output for BOHB hyperparameter
optimization. By default only warnings are print out
(default: False)
Model evaluation
The following command line arguments are available in rnaprot eval mode:
$ rnaprot eval -h
usage: rnaprot eval [-h] --train-in IN_TRAIN_FOLDER --gt-in IN_GT_FOLDER --out
OUT_FOLDER [--nr-top-profiles int]
[--lookup-profile LIST_LOOKUP_IDS [LIST_LOOKUP_IDS ...]]
[--bottom-up]
[--nr-top-sites LIST_NR_TOP_SITES [LIST_NR_TOP_SITES ...]]
[--motif-size LIST_MOTIF_SIZES [LIST_MOTIF_SIZES ...]]
[--report] [--add-train-in str] [--theme {1,2}]
[--plot-format {1,2}]
optional arguments:
-h, --help show this help message and exit
--nr-top-profiles int
Specify number of top predicted sites to plot profiles
for (default: 25)
--lookup-profile LIST_LOOKUP_IDS [LIST_LOOKUP_IDS ...]
Provide site ID(s) for which to plot the feature
profile in addition to --nr-top-profiles (e.g.
--lookup-profile site_id1 site_id2 ). Site ID needs to
be in positive set from --gt-in
--bottom-up Plot bottom profiles as well (default: False)
--nr-top-sites LIST_NR_TOP_SITES [LIST_NR_TOP_SITES ...]
Specify number(s) of top-predicted sites used for
motif extraction. Provide multiple numbers (e.g. --nr-
top-sites 100 200 500) to extract one motif plot from
each site set (default: 200)
--motif-size LIST_MOTIF_SIZES [LIST_MOTIF_SIZES ...]
Motif size(s) (widths) for extracting and plotting
motifs. Provide multiple sizes (e.g. --motif-size 5 7
9) to extract a motif for each size (default: 7)
--report Generate an .html report containing various additional
statistics and plots (default: False)
--add-train-in str Second model training folder (output of rnaprot train)
for comparing prediction scores of both models on
--gt-in positive dataset. Note that if dataset
features of the two models are not identical,
comparison might be less informative
--theme {1,2} Set theme for .html report (1: palm beach, 2: midnight
sunset) (default: 1)
--plot-format {1,2} Plotting format of motifs and profiles (does not
affect plots generated for --report). 1: png, 2: pdf
(default: 1)
required arguments:
--train-in IN_TRAIN_FOLDER
Input model training folder (output of rnaprot train)
--gt-in IN_GT_FOLDER Input training data folder (output of rnaprot gt)
--out OUT_FOLDER Evaluation results output folder
Prediction set generation
The following command line arguments are available in rnaprot gp mode:
$ rnaprot gp -h
usage: rnaprot gp [-h] --in str --train-in str --out str [--gtf str]
[--gen str] [--mode {1,2,3}] [--seq-ext int] [--gene-filter]
[--report] [--theme {1,2}] [--tr-list str] [--eia-all-ex]
[--phastcons str] [--phylop str] [--feat-in str]
optional arguments:
-h, --help show this help message and exit
--gtf str Genomic annotations GTF file (.gtf or .gtf.gz)
--gen str Genomic sequences .2bit file
--mode {1,2,3} Define mode for --in BED site extraction. (1) Take the
center of each site, (2) Take the complete site, (3) Take
the upstream end for each site. Use --seq-ext to extend
center sites again (default: 2)
--seq-ext int Up- and downstream sequence extension length of --in sites
(if --in BED, site definition by --mode) (default: False)
--gene-filter Filter --in sites based on gene coverage (gene annotations
from --gtf) (default: False)
--report Output an .html report providing various training set
statistics and plots (default: False)
--theme {1,2} Set theme for .html report (1: palm beach, 2: midnight
sunset) (default: 1)
required arguments:
--in str Genomic or transcript RBP binding sites file in BED
(6-column format) or FASTA format. If --in FASTA, only
--str is supported as additional feature. If --in BED,
--gtf and --gen become mandatory
--train-in str Training input folder (output folder of rnaprot train) to
extract the same features for --in sites which were used to
train the model (info stored in --train-in folder)
--out str Output prediction dataset folder (== input folder to
rnaprot predict)
additional annotation arguments:
--tr-list str Supply file with transcript IDs (one ID per row) for exon-
intron labeling (using the corresponding exon regions from
--gtf). By default, exon regions of the most prominent
transcripts (automatically selected from --gtf) are used
(default: False)
--eia-all-ex Use all annotated exons in --gtf file, instead of exons of
most prominent transcripts or exon defined by --tr-list.
Set this and --tr-list will be effective only for --tra.
NOTE that by default --eia-all-ex is disabled, even if
--train-in model was trained with --eia-all-ex (default:
False)
--phastcons str Genomic .bigWig file with phastCons conservation scores to
add as annotations
--phylop str Genomic .bigWig file with phyloP conservation scores to add
as annotations
--feat-in str Provide tabular file with additional position-wise genomic
region features (infos and paths to BED files) to add. BE
SURE to use the same file as used for generating the
training dataset (rnaprot gt --feat-in) for training the
model from --train-in!
Note that rnaprot gp will try to reuse file paths used for training the --train-in model. This includes --tr-list, --phastcons, --phylop, and --feat-in. These can be overwritten by setting providing the paths on the command line.
Model prediction
The following command line arguments are available in rnaprot predict mode:
$ rnaprot predict -h
usage: rnaprot predict [-h] --in IN_FOLDER --train-in IN_TRAIN_FOLDER --out
str [--mode {1,2}] [--plot-top-profiles]
[--plot-format {1,2}] [--thr {1,2,3}]
[--site-id LIST_SITE_IDS [LIST_SITE_IDS ...]]
optional arguments:
-h, --help show this help message and exit
--mode {1,2} Define prediction mode. (1) predict whole sites, (2)
predict binding sites on longer sequences using moving
window predictions (default: 1)
--plot-top-profiles Plot top window profiles (default: False)
--plot-format {1,2} Plotting format of top window profiles. 1: png, 2: pdf
(default: 1)
--thr {1,2,3} Define site score threshold level for reporting peak
regions in --mode 2 (window prediction). 1: relaxed,
2: standard, 3: strict (default: 2)
--site-id LIST_SITE_IDS [LIST_SITE_IDS ...]
Provide site ID(s) on which to predict (e.g. --site-id
site_id1 site_id2). By default predict on all --in
sites
required arguments:
--in IN_FOLDER Input prediction data folder (output of rnaprot gp)
--train-in IN_TRAIN_FOLDER
Input model training folder containing model file and
parameters (output of rnaprot train)
--out str Prediction results output folder
Supported features
RNAProt supports the following position-wise features, which can be utilized for training and prediction in addition to the sequence feature: secondary structure information (structural element probabilities), conservation scores (phastCons and phyloP), exon-intron annotation, transcript region annotation, repeat region annotation, and also user-defined region annotations. The following table lists the features available for each of the three input types (FASTA sequences, genomic regions BED, transcript regions BED):
Feature
Sequences
Genomic regions
Transcript regions
structure
YES
YES
YES
conservation scores
NO
YES
YES
exon-intron annotation
NO
YES
NO
transcript region annotation
NO
YES
YES
repeat region annotation
NO
YES
YES
user-defined
NO
YES
YES
Secondary structure information
RNAProt can include structure information in the form of unpaired probabilities for different loop contexts (probabilities for the nucleotide being paired or inside external, hairpin, internal or multi loops). The probabilities are calculated using the ViennaRNA Python 3 API (ViennaRNA 2.4.17) and RNAplfold with its sliding window approach, with user-definable parameters (by default these are window size = 70, maximum base pair span length = 50, and probabilities for regions of length u = 3, --plfold-u 3 --plfold-l 50 --plfold-w 70). In order to include structure information, --str has to be set when calling rnaprot gt. For training, rnaprot train by default uses all features it can find in the training set folder. To specify specific features for training, one can add --use-str (or in general –use-xxx). rnaprot train also offers four different modes for including the structure information (--str-mode). Here one can select between using all five context probabilities, using only paired and unpaired probabilities, or use the first two but encoded as one-hot.
Conservation scores
RNAProt supports two scores measuring evolutionary conservation (phastCons and phyloP). Conservation scores can be downloaded from the UCSC website, e.g. using the phastCons and phyloP scores generated from multiple sequence alignments of 99 vertebrate genomes to the human genome (hg38, phastCons100way and phyloP100way datasets). RNAProt accepts scores in bigWig (.bw) format (modes rnaprot gt and rnaprot gp, options --phastcons and --phylop).
Note that rnaprot gp reuses the set file paths used for training (rnaprot train), as long as the file paths are still valid. If not, it will complain, or you can just overwrite the paths by setting --phastcons and --phylop.
To assign conservation scores to transcript regions, RNAProt first maps the transcript regions to the genome, using the provided GTF file. To use the conservation features for rnaprot train, set --use-phastcons or --use-phylop. Otherwise, if no other feature is specified, rnaprot train will train on all present features.
Exon-intron annotation
Exon-intron annotation in the form of one-hot encoded exon and intron labels can also be added to the node feature vectors.
Labels are assigned to each binding site position by taking a set of genomic exon regions and overlapping it with the genomic binding sites using bedtools. To unambiguously assign labels, RNAProt by default uses the most prominent isoform for each gene. The most prominent isoform for each gene gets selected through hierarchical filtering of the transcript information present in the input GTF file (tested with GTF files from Ensembl): given that the transcript is part of the GENCODE basic gene set, RNAProt selects transcripts based on their transcript support level (highest priority), and by transcript length (longer isoform preferred). The extracted isoform exons are then used for region type assignment.
Note that this feature is only available for genomic regions, as it is not informative for transcript regions, which would contain only exon labels. Optionally, a user-defined isoform list can be supplied (--tr-list), substituting the list of most prominent isoforms for annotation. Regions not overlapping with introns or exons can also be labelled separately (instead of labelled as intron). In addition, instead of using the most prominent transcripts, --eia-all-ex allows the use of all exon regions from the GTF file for exon-intron labelling. For more details see mode options --eia, --eia-ib, --eia-n, and --eia-all-ex. To use exon-intron annotations for rnaprot train, set --use-eia. Otherwise, if no other feature is specified, rnaprot train will train a model with all present features.
Transcript region annotation
Similarly to exon-intron annotations, binding regions can be labelled based on their overlap with transcript regions. Labels are assigned based on UTR or CDS region overlap (5’UTR, CDS, 3’UTR, None), by taking the isoform information in the input GTF file. Again the list of most prominent isoforms is used for annotation, or alternatively a list of user-defined isoforms (--tr-list). Additional annotation options include start and stop codon or transcript and exon border labelling. For more details see mode options --tra, --tra-codons, and --tra-borders.
To use transcript region annotations for rnaprot train, set --use-tra. Otherwise, if no other feature is specified, rnaprot train will train a model with all present features.
Repeat region annotation
Repeat region annotation (--rra option) can also be added to the binding regions, analogously to other region type annotations. This information is derived directly from the 2bit formatted genomic sequence file supplied by --gen, where repeat regions identified by RepeatMasker and Tandem Repeats Finder are stored in lowercase letters.
To use repeat region annotations for rnaprot train, set --use-rra. Otherwise, if no other feature is specified, rnaprot train will train a model with all present features.
User-defined region annotations
User-defined features in the form of region information (BED) for annotating transcript or genomic input regions can also be supplied. For this rnaprot gt and rnaprot gp require a table file with a specific format (feature ID, feature type, feature format, BED file path) provided via --feat-in. The table format is defined in the Inputs section below. rnaprot gt also offers an option to force one-hot-encoding of all user-defined features (see mode option --feat-in-1h).
To utilize user-defined region annotations for rnaprot train, set --use-add-feat. Otherwise, if no other feature is specified, rnaprot train will train a model with all present features. User-defined features for sequences are also supported, where the input format of the feature files changes from BED to tabular (see description below).
Inputs
RNAProt accepts RBP binding sites in FASTA or BED format (transcript or genomic regions). The latter one also requires a genomic sequence file (.2bit format) and a genomic annotations file (GTF format).
Additional input files include BED files (negative sites or regions for masking), conservation scores in bigWig format, transcript lists, or user-defined feature tables.
Binding sites
RBP binding sites can be input in three different formats:
Note that RNAProt by default creates unique site IDs. Optionally, the original sequence or site IDs (BED column 4) can be kept (--keep-ids in rnaprot gt). Also note that BED column 5 contains the site score, which can be used for filtering (--thr). In case of p-values, reverse filtering can be enabled with --rev-filter (smaller value == better). Filtering by site length is also possible (--max-len, --min-len). Note that by default, --in sites in rnaprot gt are filtered based on gene coverage and overlaps with other --in sites! To disable these filters, set --no-gene-filter and --allow-overlaps (see mode options for more details).
BED files with genomic RBP binding regions for example can be downloaded from ENCODE (eCLIP CLIPper peak regions).
Genome sequence
The human genome .2bit formatted genomic sequence file can be downloaded here. This file is used to extract genomic and transcript region sequences, as well as repeat region annotations. For other organisms, have a look here. Note that it is easy to generate your own .2bit files from any given FASTA file.
Genome annotations
RNAProt was tested with GTF files obtained from Ensembl. See Ensembl’s download page to download the latest GTF file with genomic annotations. Note that RNAProt can directly read from gzipped GTF. The GTF file is used to extract gene region, exon-intron, and transcript region annotations.
User-defined region annotations
To supply user-defined region annotations, rnaprot gt (rnaprot gp) accepts a table file via --feat-in. Inside this, each row stores the information of one single feature to be added, beginning with a unique feature ID, the type of feature (C: categorical, N: numerical), the feature format (0: score, 1: probability, 2: p-value), and the path to the feature regions BED file. Assuming we want to use the CDE sites from above as a region feature for annotation, we would create a text file with the following content:
$ cat add_feat.in
CDE N 1 test/CDE_sites.bed
Note that the columns have to be tab-separated. Here we put “CDE” as the feature ID, “N” as feature type for numerical (since CDE_sites.bed BED file column 5 contains probabilities which we want to use for annotation), and “1” to indicate that the values are probabilities. Last but not least, we specify the path to the BED file in column 4. Now all --in site positions (rnaprot gt) which overlap with CDE_sites.bed get the numerical value (BED column 5 value) of the overlapping region inside CDE_sites.bed assigned. If we want a one-hot encoding instead, we need to specify:
$ cat add_feat.in
CDE C 0 test/CDE_sites.bed
Alternatively, we can also run rnaprot gt with --feat-in-1h, to turn all add_feat.in features into one-hot encoding. This means that overlapping site positions get a “1” assigned, and non-overlapping a “0”, just like for the standard region annotations (exon-intron regions, transcript regions, repeat regions). Note that rnaprot gp reuses the set add_feat.in file used for training (rnaprot train), as long as the file path is valid. In general, we suggest to use either one-hot encoding (C) or normalized BED column 5 values (e.g. probabilities from 0 to 1). If you set “N” and “2” (column 1 and 2, telling RNAProt that these are p-values), RNAProt will automatically convert them to probabilities, by using 1-p-value for the respective regions. As said we do not recommend using raw column 5 BED scores, since the values are not normalized, which likely will be suboptimal for learning.
User-defined features for sequences
In case FASTA sequences are provided via --in (rnaprot gt and rnaprot gp), the format of input files specified in the --feat-in changes from BED to tabular. In order for this to work, we also need to supply the respective negative sequences via --neg-in, since the additional feature information has to be given for both positive and negative sequences. Currently only numerical features are supported. A valid --feat-in table file would thus look like this:
$ cat add_feat.in
featx N 0 test/another_feat.in
Moreover, the format of the feature file should look like this:
We see that the first column contains the sequence IDs, which need to be identical to the ones supplied via --in and --neg-in. The second column contains the position-wise numerical feature values, which need to be comma-separated. The number of numerical feature values needs to be equal to the respective sequence length. This way, each sequence nucleotide can get one numerical feature value assigned. Note that the values will be used for training as they are, meaning that there is no further normalization applied to the values. Also note that for rnaprot gp a second table file is needed, pointing to the files which store the numeric values of the test set. This is different from using --feat-in with BED regions, where the files defined in the table file are the same for rnaprot gt and rnaprot gp, since they are genomic regions.
Additional inputs
Additional input files are (depending on set mode):
BED files (negative sites or regions for masking, 6-column BED format as described)
A transcript ID list file for exon-intron or transcript region annotation
Conservation scores in bigWig format
The transcript IDs list file (--tr-list option, for rnaprot gt and rnaprot gp) has the following format (one ID per row):
$ head -5 tr_list.in
ENST00000360876
ENST00000325888
ENST00000360876
ENST00000325888
ENST00000359863
Files containing phastCons and phyloP conservation scores can be downloaded from UCSC (for hg38 e.g. phastCons100way and phyloP100way). RNAProt accepts the files as inputs (modes rnaprot gt and rnaprot gp), with --phastcons and --phylop. Note that rnaprot gp reuses the set file paths used for training (rnaprot train), as long as the file paths are still valid. If not, it will complain, or you can just overwrite the paths by setting --phastcons and --phylop.
Outputs
Depending on the executed program mode, various output files are generated:
Reports on dataset statistics (.html) for rnaprot gt, rnaprot eval, and rnaprot gp.
Sequence and additional feature profiles (png, pdf) for rnaprot eval and rnaprot predict
Sequence and additional feature logos (.png, .pdf) for rnaprot eval
Whole site predictions (.out) for rnaprot predict (--mode 1)
Moving window peak region predictions (.bed, .tsv) for rnaprot predict (--mode 2)
HTML reports
For the dataset generation modes (rnaprot gt, rnaprot gp), HTML reports can be output which include detailed statistics and visualizations regarding the positive, negative, or test dataset (--report option). There are various comparative statistics available on: site lengths, sequence complexity, di-nucleotide distributions, k-mer statistics, target region biotype and overlap statistics, as well as additional statistics and visualizations for each chosen feature. The .html report file can be found in the mode output folder. For rnaprot eval, comparative statistics regarding positive and negative site scores are output, as well as a saliency peak distribution, or if --add-train-in is given a model comparison plot.
Logos and profiles
In model evaluation mode (rnaprot eval), sequence and additional feature logos are output, as well as position-wise profiles (including saliency map and single mutation effects track) for a subset of training sites to illustrate binding preferences.
Prediction files
In model prediction mode (rnaprot predict), whole-site (--mode 1) or moving window peak region (--mode 2) predictions are output. Optionally, the top scoring windows can also be plotted as profiles just like for rnaprot eval.
In --mode 2, peak_regions.bed contains the predicted peak regions on the reference (depending on input sequence, transcript, or genomic coordinates), in 11-column BED format:
Here the columns are: reference ID, reference peak region start position (0-based), reference peak region end position (1-based), peak ID, window score, strand (for transcript or sequence input always “+”), reference window start position (0-based), reference window end position (1-based), window score p-value (calculated from the positive training set scores distribution), reference top saliency peak position (1-based), top saliency peak score (saliency).
The same information can also be found in the peak_regions.tsv file (there all coordinates 1-based), which in addition contains the top saliency peak sequence and the window sequence:
RNAProt
RNAProt is a computational RBP binding site prediction framework based on recurrent neural networks (RNNs). Conceived as an end-to-end method, RNAProt includes all necessary functionalities, from dataset generation over model training to the evaluation of binding preferences and binding site prediction. Various input types and features are supported, accompanied by comprehensive statistics and visualizations to inform the user about datatset characteristics and learned model properties.
Citation
The RNAProt paper is available at GigaScience. Supplementary data including benchmark and training data sets can be downloaded from Zenodo.
Table of contents
The RNAProt framework
RNAProt utilizes RBP binding sites identified by CLIP-seq and related protocols to train an RNN-based model. The model is then used to predict new binding sites on given input RNA sequences. The following figure illustrates the RNAProt framework and its general workflow:
Yellow boxes mark necessary framework inputs, blue boxes the five program modes of RNAProt, and green boxes the framework outputs. Arrows show the dependencies between inputs, modes, and outputs. RNAProt accepts RBP binding sites in FASTA or BED format (transcript or genomic regions). The latter one also requires a genomic sequence file (.2bit format) and a genomic annotations file (GTF format). Additional inputs are available, depending on the binding site input type as well as the selected features. For more details on inputs, modes, supported features, and outputs, see the Documentation.
Installation
RNAProt was tested on Ubuntu (18.04 LTS), with Nvidia driver >=440, CUDA >=10, and various Nvidia graphics cards (RTX 2080 Ti, RTX 2070, GTX 1060, GTX 1030). We thus assume that you have a similar system available and running. While RNAProt runs fine without a dedicated GPU, we definitely recommend having an Nvidia graphics card with CUDA support for speeding up model training (specifically we recommend a >= GTX 1060 or a similar newer model, with >= 4 GB RAM). Regarding main memory, we recommend at least 8 GB RAM. In the following we show how to install RNAProt via Conda package (easiest way + recommended), or alternatively manually (not too difficult either). In any case, you first need Conda running on your computer.
Conda
If you do not have Conda yet, you can e.g. install miniconda, a free + lightweight Conda installer. Get miniconda here, choose the Python 3.8 Miniconda3 Linux 64-bit installer and follow the installation instructions. In the end, Conda should be evocable on the command line via (possibly in a different version):
Conda package installation
RNAProt is available as Conda package here. This is the most convenient way to install RNAProt, since Conda takes care of all the dependencies. Note however that the Conda package version might not always be the latest release (but we work hard to not let this happen).
We recommend to create a Conda environment inside which we will then install RNAProt:
Note that Python 3.8 is not a necessity (higher versions should work just as fine). If you are experiencing problems while running
conda install -c bioconda rnaprot(e.g. complaints about conflicting dependencies), the following commands should do the trick:This tells conda to explicitly look for packages in the specified channels, stored in the
.condarcconda configuration file.NOTE that the bioconda installation only includes the CPU version of RNAProt (no GPU support). If you have a GPU supporting CUDA and want to take advantage of it (very much recommended for fast model training!), just install in addition:
RNAProt was initially written and tested with pytorch 1.8. However, higher version numbers should work as well (i.e., you can simply discard the version number). Now RNAProt should be available inside the environment:
Finally, if you have a compatible GPU, we want to check whether the GPU (CUDA) is available for RNAProt:
In our test case this delivers:
This is great news, meaning that we can RNAProt with GPU support.
Manual installation
To manually install RNAProt, we first create a Conda environment (as described above). Once inside the environment, we need to install the following dependencies for the GPU version (GPU with CUDA support required):
If you don’t have a CUDA supporting GPU (and you’re not planning on getting one any time soon either), you don’t need to install the additional GPU dependencies. To install pyTorch without GPU support, simply exchange the above pytorch call with:
Concerning version numbers, RNAProt was tested with the following versions: pytorch=1.8.0, seaborn=0.11.1, viennarna=2.4.17, bedtools=2.30.0, logomaker=0.8, hpbandster=0.7.4, markdown=3.2.2, plotly=4.14.3, and scikit-learn=0.24.1. Note that so far we have not experienced problems with higher pytorch version numbers (i.e., simply discarding the version number should work just as well).
Finally, to install RNAProt, we simply clone the repository and execute the installation script inside the folder:
Now we can run RNAProt from any given folder (just remember to re-activate the environment once you open a new shell):
Test runs
Once installed, we can do some quick test runs.
Test example with FASTA sequences as input
We first train a sequence model, using a provided set of positive and negative FASTA sequences sampled from the PARCLIP PUM2 dataset (3,000 positives, 3,000 negatives, all sequences with length 81 nt). In the following we will mainly use default parameters, but note that there are many options available for each program mode. To learn more about the mode options, refer to the Documentation, or simply list all mode options, e.g. for
rnaprot train, by typing:Before training a model, we need to generate an RNAProt training dataset. For this we go to the cloned repository folder (clone + enter if not already there), and use the FASTA sequences supplied in the
testfolder as training data. To get training set statistics, we also enable--report:We can then take a look at the
report.rnaprot_gt.htmlinsidetest_gt_out, informing us about similarities and differences between the positive and negative set. The content of the HTML report depends on selected features (e.g. structure, conservation scores, region annotations), and the input type given tornaprot gt(FASTA sequences, genomic sites BED, or transcript sites BED). Here for example we can compare k-mer statistics of the positive and negative set, observing that the positives tend to contain more AA, UU, and AU repeat sites. This likely also contributes to the lower sequence complexity of the postive set.Next we train a model on the created dataset, using the default hyperparameters. For this we simply run
rnaprot trainwith thernaprot gtoutput folder as input. We also enable--verbose-train, to see the learning progress over the number of epochs:In the end we get a summary for the trained model, e.g. reporting the model validation AUC, the training runtime, and set hyperparameters. To obtain identical models from identical calls, you have to set a fixed random seed number (e.g.
--seed 1). This guarantees that the same initial random weights are set when training the network, as well as the same train-validation-test splits are used. Note that if you want to obtain the generalization peformance of the model on the given dataset, you need to runrnaprot trainin cross validation mode (10-fold by default) by adding--cv:To visualize what our just-trained model has learned, we next run
rnaprot eval, which requires both thernaprot gtandrnaprot trainoutput folders:This will plot a sequence logo informing us about global preferences, as well as profiles for the top 25 scoring sites (default setting). The profiles contain the saliency map and single mutations track, giving us an idea what local information the model regards as important for each of the 25 sites. As with the other modes, more options are available (e.g.
--reportfor additional statistics, comparing two models, or specifying motif sizes and which profiles to plot).Now that we have a model, we naturally want to use it for prediction. For this we first create a prediction dataset, choosing the lncRNA NORAD for window prediction. NORAD was shown to act as a decoy for PUMILIO proteins (PUM1/PUM2). We therefore use its provided FASTA sequence as input:
Note that the input can be any number of sequences, genomic regions, or transcript regions (also see examples below).
By default, RNAProt predicts whole sites, i.e., we would get one score returned for the whole lncRNA. To run the window prediction, we use
--mode 2, and also plot the top window profiles containing the reported peak regions:Now we can take a look at the predicted peak regions (BED, TSV), or observe the profiles just like for
rnaprot eval. The predicted peak regions are stored in BED format, as well as in a table file with additional information (.tsv). For details on output formats, see the Documentation. Note that while model prediction itself is very fast, plotting (especially getting the single mutation infos) takes some time for each peak region. So if you predict on a large set of input sites or sequences, you might want to disable plotting (or just exercise patience and wait). Knowing this, you can also predict only on certain input sites or a small subset (e.g. site_id1, site_id2), by adding--site-id site_id1 site_id2.Test example with genomic regions as input
To create datasets based on genomic or transcript regions, we first need to download two additional files. Specifically, we need a GTF file (containing genomic annotations), as well as a .2bit file (containing the genomic sequences). Note that we used Ensembl GTF files to test RNAProt, and therefore recommend using these. You can find them here for many model organisms. The .2bit genome file we will download from UCSC. For this example, we choose the human genome + annotations (hg38 assembly):
Unfortunately, we sometimes experienced cases where the GTF file was not fully downloaded. You can check this by browsing the file with:
We would expect something like this appearing as first rows:
If the output is cryptic instead, you need to do it again. Next we download some genomic RBP binding regions identified by eCLIP from ENCODE. The ENCODE website contains a huge collection of eCLIP datasets for various RBPs. For this example, we again download PUM2 binding sites, choosing the IDR peaks identified by ENCODE’s CLIPper pipeline (PUM2 K562 eCLIP dataset ID: ENCFF880MWQ, PUM2 K562 IDR peaks dataset ID: ENCFF880MWQ). We unzip it and change the format to 6-column BED which RNAProt likes best:
Note that we move the log2 fold change value from column 7 (original file) to column 5, which is used by RNAProt to filter and select sites in case of overlaps. By default,
rnaprot gtremoves overlapping sites by selecting only the highest-scoring site from a set of overlapping sites. To disable this, set--allow-overlaps. If there are no column 5 scores given (or all the same), filtering of overlaps is disabled by default. Moreover, positive sites that do not overlap with gene regions are by default filtered out too. To disable this, set--no-gene-filter.Next we create a training dataset, by supplying the downloaded GTF and .2bit file:
Thanks to the given GTF file, the HTML report will now also include information on target gene regions and biotypes. Note that by default,
rnaprot gtcenters the input BED regions, and extends them on both sides by the set--seq-ext(by default 40). If you want to keep the original site lengths, set--mode 2 --seq-ext 0. In this case, you might also want to filter the--insites by--max-lenor--min-len, e.g.--max-len 100. Of course you can also extend the original sites, e.g. by setting--mode 2 --seq-ext 10. Alternatively, you can set--mode 3to use the region upstream ends and extend by--seq-ext. Note that the negatives are generated randomly, so the samernaprot gtcall will produce a different set of random negative sites (unless negatives are supplied). To obtain the same random negatives repeatedly from identical calls, you have to set a fixed random seed number (e.g.--seed 1).Now we can train a model and evaluate it just like in the example above:
For prediction, we could again use the folder generated by
rnaprot gpfrom the above FASTA sequences + NORAD example. However, since we now have the GTF + genome .2bit file, we can also get its genomic or transcript region directly from these files (no need to download FASTA sequences or a gene BED file). As withrnaprot gt,rnaprot gpaccepts sequences, genomic regions, or transcript regions as input. To get its genomic or transcript region, we just need its gene or transcript ID (as long as it is in the GTF file), and then use one of the two helper scripts installed together with RNAProt:The two scripts also accept > 1 ID (either on the command line or more practically given as a file with one ID per row). In the case of NORAD, both transcript and gene region have the same length, since NORAD contains no introns and only one annotated isoform:
We can now use any of the two as input to
rnaprot gp. As one would expect, extracting and providing the transcript region will result in predictions only on the transcript sequence (always excluding introns), while providing the gene region will result in predictions on the whole gene region (usually including introns). In general, we can specify any genomic or transcript (sub)region for prediction, as long as it is annotated in the GTF file. For now we will use the gene region, on whichrnaprot predictwill then predict and return peak regions directly with genomic coordinates (also contained inside the profiles for easier orientation):Note that in this example we did not filter out sites from
PUM2_K562_IDR_peaks.bedthat overlap with the NORAD gene region prior to training. These sites indeed exist, as we can see in thernaprot gtHTML report (Target region overlap statistics), so for an unbiased prediction we need remove them first. For this we use bedtools intersectBed, and again runrnaprot gt,rnaprot train, andrnaprot predict:Test example with additional features
RNAProt supports various additional position(nucleotide)-wise features to learn from, such as secondary structure, region annotations (including user-defined ones), or conservation scores (see Documentation for details). For this we have to specify what features to include in
rnaprot gtandrnaprot gp, and depending on the feature also provide additional files. For model training (rnaprot train) we can then specify what features to use for training, from the features included inrnaprot gt. This has the advantage that features need to be extracted or computed only once, and that various feature combinations and parameter settings can be withrnaprot train.In this example, we want to include secondary structure on top of the sequence information. We will again use a provided dataset, containing 2,274 potential Roquin binding sites (also termed CDEs for constitutive decay elements) from Braun et al. 2018. The CDEs were predicted using a biologically verified consensus structure consisting of a 6-8 bp long stem capped with a YRN (Y: C or U, R: A or G, N: any base) tri-nucleotide loop. We also note that the sequence conservation is rather low (specifically for the stem portion), making it an ideal test case for the benefits of including secondary structure information. We first generate a training set, by enabling structure calculation (
--str) and using the CDE sites provided in the cloned repository folder:Structure calculation can be further customized by changing the RNAplfold parameters (
--plfold-u 3 --plfold-l 50 --plfold-w 70by default). Regarding the type of structure information, RNAplfold calculates the probabilities of structural elements for each site position, which are then used as feature channels for training and prediction. For genomic or transcript--insites, RNAProt automatically extends the sites by--plfold-won both sides (or less at transcript ends) for a precise structure calculation. Whether to use the probabilities or a one-hot encoding can be further specified in training (--str-mode, four options). Note that we use--allow-overlapsand--no-gene-filter, disabling the filtering of sites based on no gene overlap or overlap with other sites. These two options guarantee that all--insites will be part of the generated training set. Now we want to train a model on the generated dataset:Here we increased the maximum number of epochs to 300, since for smaller datasets model performance can sometimes still improve beyond the default 200 epochs. This can be easily monitored with
--verbose-trainenabled. In addition, increasing--patiencemight sometimes be necessary, to prevent model training with the model stuck early on in the training process from stopping (although we experienced this only very rarely). Also note that if we do not specify what features to use, RNAProt will use all features present inCDE_sites_gt_outfor training. Thus, to train a sequence-only model, we would need to specify:To create a prediction set for the structure model, we use the UCP3 gene (transcript ID ENST00000314032), for which the authors validated two CDE sites in its 3’UTR (see Fig.2A blue and red hairpin).
Note that
rnaprot gpautomatically detects what features were used for training the model, enabling structure prediction with set parameters for the prediction set generation. Depending on the additional feature, we thus might have to supply additional input files for extracting the respective feature information. This would be the case for conservation scores (--phastcons,--phylop), or user-defined features (--feat-in). After creating the prediction set we run a window prediction on the transcript:In our case the model successfully predicted the two verified binding sites (all together 4 sites predicted) on the transcript (using threshold level
--thr 1). The first verified loop is at transcript position 1,371 to 1,373 (loop nucleotides), the second loop from 1,404 to 1,406, with the second hairpin having a higher folding probability and score. To check this, we take a look at the reported peak regions (BED, TSV), or have a look at the plotted profiles, which show the sites annotated with transcript coordinates (or genomic coordinates in case of genomic sites as input) to quickly identify regions of interest. Thetest/folder also includes the model we used to predict, which you can easily apply yourself to compare:This also shows how easy it is to share models. Once the model is trained, the
rnaprot train --outfolder can be copied and reused. To zip the train folder you can run:Note however that predictions between two models can vary, since negative sites generation is random. Moreover, even models trained on the same positive and negative sites are slightly different from one another and thus can lead to slightly different predictions. This is because model training incorporates stochastic processes, like the random initialization of network weights, or the application of dropout.
Documentation
This documentation provides details on all the RNAProt (version 0.5) framework parts: program modes, supported features, inputs, and outputs.
Program modes
RNAProt is divided into five different program modes: training set generation (
rnaprot gt), model training (rnaprot train), model evaluation (rnaprot eval), prediction set generation (rnaprot gp), and model prediction (rnaprot predict).An overview of the modes can be obtained by:
The following sections describe each mode in more detail.
Training set generation
The following command line arguments are available in
rnaprot gtmode:Note that if genomic or transcript regions are supplied via
--in,--genand--gtfbecome required arguments.Model training
The following command line arguments are available in
rnaprot trainmode:Model evaluation
The following command line arguments are available in
rnaprot evalmode:Prediction set generation
The following command line arguments are available in
rnaprot gpmode:Note that
rnaprot gpwill try to reuse file paths used for training the--train-inmodel. This includes--tr-list,--phastcons,--phylop, and--feat-in. These can be overwritten by setting providing the paths on the command line.Model prediction
The following command line arguments are available in
rnaprot predictmode:Supported features
RNAProt supports the following position-wise features, which can be utilized for training and prediction in addition to the sequence feature: secondary structure information (structural element probabilities), conservation scores (phastCons and phyloP), exon-intron annotation, transcript region annotation, repeat region annotation, and also user-defined region annotations. The following table lists the features available for each of the three input types (FASTA sequences, genomic regions BED, transcript regions BED):
Secondary structure information
RNAProt can include structure information in the form of unpaired probabilities for different loop contexts (probabilities for the nucleotide being paired or inside external, hairpin, internal or multi loops). The probabilities are calculated using the ViennaRNA Python 3 API (ViennaRNA 2.4.17) and RNAplfold with its sliding window approach, with user-definable parameters (by default these are window size = 70, maximum base pair span length = 50, and probabilities for regions of length u = 3,
--plfold-u 3 --plfold-l 50 --plfold-w 70). In order to include structure information,--strhas to be set when callingrnaprot gt. For training,rnaprot trainby default uses all features it can find in the training set folder. To specify specific features for training, one can add--use-str(or in general –use-xxx).rnaprot trainalso offers four different modes for including the structure information (--str-mode). Here one can select between using all five context probabilities, using only paired and unpaired probabilities, or use the first two but encoded as one-hot.Conservation scores
RNAProt supports two scores measuring evolutionary conservation (phastCons and phyloP). Conservation scores can be downloaded from the UCSC website, e.g. using the phastCons and phyloP scores generated from multiple sequence alignments of 99 vertebrate genomes to the human genome (hg38, phastCons100way and phyloP100way datasets). RNAProt accepts scores in bigWig (.bw) format (modes
rnaprot gtandrnaprot gp, options--phastconsand--phylop). Note thatrnaprot gpreuses the set file paths used for training (rnaprot train), as long as the file paths are still valid. If not, it will complain, or you can just overwrite the paths by setting--phastconsand--phylop. To assign conservation scores to transcript regions, RNAProt first maps the transcript regions to the genome, using the provided GTF file. To use the conservation features forrnaprot train, set--use-phastconsor--use-phylop. Otherwise, if no other feature is specified,rnaprot trainwill train on all present features.Exon-intron annotation
Exon-intron annotation in the form of one-hot encoded exon and intron labels can also be added to the node feature vectors. Labels are assigned to each binding site position by taking a set of genomic exon regions and overlapping it with the genomic binding sites using bedtools. To unambiguously assign labels, RNAProt by default uses the most prominent isoform for each gene. The most prominent isoform for each gene gets selected through hierarchical filtering of the transcript information present in the input GTF file (tested with GTF files from Ensembl): given that the transcript is part of the GENCODE basic gene set, RNAProt selects transcripts based on their transcript support level (highest priority), and by transcript length (longer isoform preferred). The extracted isoform exons are then used for region type assignment. Note that this feature is only available for genomic regions, as it is not informative for transcript regions, which would contain only exon labels. Optionally, a user-defined isoform list can be supplied (
--tr-list), substituting the list of most prominent isoforms for annotation. Regions not overlapping with introns or exons can also be labelled separately (instead of labelled as intron). In addition, instead of using the most prominent transcripts,--eia-all-exallows the use of all exon regions from the GTF file for exon-intron labelling. For more details see mode options--eia,--eia-ib,--eia-n, and--eia-all-ex. To use exon-intron annotations forrnaprot train, set--use-eia. Otherwise, if no other feature is specified,rnaprot trainwill train a model with all present features.Transcript region annotation
Similarly to exon-intron annotations, binding regions can be labelled based on their overlap with transcript regions. Labels are assigned based on UTR or CDS region overlap (5’UTR, CDS, 3’UTR, None), by taking the isoform information in the input GTF file. Again the list of most prominent isoforms is used for annotation, or alternatively a list of user-defined isoforms (
--tr-list). Additional annotation options include start and stop codon or transcript and exon border labelling. For more details see mode options--tra,--tra-codons, and--tra-borders. To use transcript region annotations forrnaprot train, set--use-tra. Otherwise, if no other feature is specified,rnaprot trainwill train a model with all present features.Repeat region annotation
Repeat region annotation (
--rraoption) can also be added to the binding regions, analogously to other region type annotations. This information is derived directly from the 2bit formatted genomic sequence file supplied by--gen, where repeat regions identified by RepeatMasker and Tandem Repeats Finder are stored in lowercase letters. To use repeat region annotations forrnaprot train, set--use-rra. Otherwise, if no other feature is specified,rnaprot trainwill train a model with all present features.User-defined region annotations
User-defined features in the form of region information (BED) for annotating transcript or genomic input regions can also be supplied. For this
rnaprot gtandrnaprot gprequire a table file with a specific format (feature ID, feature type, feature format, BED file path) provided via--feat-in. The table format is defined in the Inputs section below.rnaprot gtalso offers an option to force one-hot-encoding of all user-defined features (see mode option--feat-in-1h). To utilize user-defined region annotations forrnaprot train, set--use-add-feat. Otherwise, if no other feature is specified,rnaprot trainwill train a model with all present features. User-defined features for sequences are also supported, where the input format of the feature files changes from BED to tabular (see description below).Inputs
RNAProt accepts RBP binding sites in FASTA or BED format (transcript or genomic regions). The latter one also requires a genomic sequence file (.2bit format) and a genomic annotations file (GTF format). Additional input files include BED files (negative sites or regions for masking), conservation scores in bigWig format, transcript lists, or user-defined feature tables.
Binding sites
RBP binding sites can be input in three different formats:
The file content should thus look like:
Note that RNAProt by default creates unique site IDs. Optionally, the original sequence or site IDs (BED column 4) can be kept (
--keep-idsinrnaprot gt). Also note that BED column 5 contains the site score, which can be used for filtering (--thr). In case of p-values, reverse filtering can be enabled with--rev-filter(smaller value == better). Filtering by site length is also possible (--max-len,--min-len). Note that by default,--insites inrnaprot gtare filtered based on gene coverage and overlaps with other--insites! To disable these filters, set--no-gene-filterand--allow-overlaps(see mode options for more details). BED files with genomic RBP binding regions for example can be downloaded from ENCODE (eCLIP CLIPper peak regions).Genome sequence
The human genome .2bit formatted genomic sequence file can be downloaded here. This file is used to extract genomic and transcript region sequences, as well as repeat region annotations. For other organisms, have a look here. Note that it is easy to generate your own .2bit files from any given FASTA file.
Genome annotations
RNAProt was tested with GTF files obtained from Ensembl. See Ensembl’s download page to download the latest GTF file with genomic annotations. Note that RNAProt can directly read from gzipped GTF. The GTF file is used to extract gene region, exon-intron, and transcript region annotations.
User-defined region annotations
To supply user-defined region annotations,
rnaprot gt(rnaprot gp) accepts a table file via--feat-in. Inside this, each row stores the information of one single feature to be added, beginning with a unique feature ID, the type of feature (C: categorical, N: numerical), the feature format (0: score, 1: probability, 2: p-value), and the path to the feature regions BED file. Assuming we want to use the CDE sites from above as a region feature for annotation, we would create a text file with the following content:Note that the columns have to be tab-separated. Here we put “CDE” as the feature ID, “N” as feature type for numerical (since
CDE_sites.bedBED file column 5 contains probabilities which we want to use for annotation), and “1” to indicate that the values are probabilities. Last but not least, we specify the path to the BED file in column 4. Now all--insite positions (rnaprot gt) which overlap withCDE_sites.bedget the numerical value (BED column 5 value) of the overlapping region insideCDE_sites.bedassigned. If we want a one-hot encoding instead, we need to specify:Alternatively, we can also run
rnaprot gtwith--feat-in-1h, to turn alladd_feat.infeatures into one-hot encoding. This means that overlapping site positions get a “1” assigned, and non-overlapping a “0”, just like for the standard region annotations (exon-intron regions, transcript regions, repeat regions). Note thatrnaprot gpreuses the setadd_feat.infile used for training (rnaprot train), as long as the file path is valid. In general, we suggest to use either one-hot encoding (C) or normalized BED column 5 values (e.g. probabilities from 0 to 1). If you set “N” and “2” (column 1 and 2, telling RNAProt that these are p-values), RNAProt will automatically convert them to probabilities, by using 1-p-value for the respective regions. As said we do not recommend using raw column 5 BED scores, since the values are not normalized, which likely will be suboptimal for learning.User-defined features for sequences
In case FASTA sequences are provided via
--in(rnaprot gtandrnaprot gp), the format of input files specified in the--feat-inchanges from BED to tabular. In order for this to work, we also need to supply the respective negative sequences via--neg-in, since the additional feature information has to be given for both positive and negative sequences. Currently only numerical features are supported. A valid--feat-intable file would thus look like this:Moreover, the format of the feature file should look like this:
We see that the first column contains the sequence IDs, which need to be identical to the ones supplied via
--inand--neg-in. The second column contains the position-wise numerical feature values, which need to be comma-separated. The number of numerical feature values needs to be equal to the respective sequence length. This way, each sequence nucleotide can get one numerical feature value assigned. Note that the values will be used for training as they are, meaning that there is no further normalization applied to the values. Also note that forrnaprot gpa second table file is needed, pointing to the files which store the numeric values of the test set. This is different from using--feat-inwith BED regions, where the files defined in the table file are the same forrnaprot gtandrnaprot gp, since they are genomic regions.Additional inputs
Additional input files are (depending on set mode):
The transcript IDs list file (
--tr-listoption, forrnaprot gtandrnaprot gp) has the following format (one ID per row):Files containing phastCons and phyloP conservation scores can be downloaded from UCSC (for hg38 e.g. phastCons100way and phyloP100way). RNAProt accepts the files as inputs (modes
rnaprot gtandrnaprot gp), with--phastconsand--phylop. Note thatrnaprot gpreuses the set file paths used for training (rnaprot train), as long as the file paths are still valid. If not, it will complain, or you can just overwrite the paths by setting--phastconsand--phylop.Outputs
Depending on the executed program mode, various output files are generated:
rnaprot gt,rnaprot eval, andrnaprot gp.rnaprot evalandrnaprot predictrnaprot evalrnaprot predict(--mode 1)rnaprot predict(--mode 2)HTML reports
For the dataset generation modes (
rnaprot gt,rnaprot gp), HTML reports can be output which include detailed statistics and visualizations regarding the positive, negative, or test dataset (--reportoption). There are various comparative statistics available on: site lengths, sequence complexity, di-nucleotide distributions, k-mer statistics, target region biotype and overlap statistics, as well as additional statistics and visualizations for each chosen feature. The .html report file can be found in the mode output folder. Forrnaprot eval, comparative statistics regarding positive and negative site scores are output, as well as a saliency peak distribution, or if--add-train-inis given a model comparison plot.Logos and profiles
In model evaluation mode (
rnaprot eval), sequence and additional feature logos are output, as well as position-wise profiles (including saliency map and single mutation effects track) for a subset of training sites to illustrate binding preferences.Prediction files
In model prediction mode (
rnaprot predict), whole-site (--mode 1) or moving window peak region (--mode 2) predictions are output. Optionally, the top scoring windows can also be plotted as profiles just like forrnaprot eval.In
--mode 2,peak_regions.bedcontains the predicted peak regions on the reference (depending on input sequence, transcript, or genomic coordinates), in 11-column BED format:Here the columns are: reference ID, reference peak region start position (0-based), reference peak region end position (1-based), peak ID, window score, strand (for transcript or sequence input always “+”), reference window start position (0-based), reference window end position (1-based), window score p-value (calculated from the positive training set scores distribution), reference top saliency peak position (1-based), top saliency peak score (saliency).
The same information can also be found in the
peak_regions.tsvfile (there all coordinates 1-based), which in addition contains the top saliency peak sequence and the window sequence: