SVHIP

This repository contains svhip.py, a script for predicting functional RNA elements from multiple sequence alignments. It supports generating training data, training machine learning models, slicing alignments into windows, and predicting classes (coding, non-coding, other) for alignment windows.

Usage

python svhip.py [Task] [Options]
Tasks: data, train, windows, predict, hexcalibrate

Global options (available for all tasks)

--threads (int, default: max(CPU_COUNT-1, 1)): Number of threads to allocate.
--seed (int, default: random integer at startup): Seed controlling randomized behavior (e.g. shuffling).
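
The two defaults above can be sketched as follows. This is a minimal illustration, not the code in svhip.py: it assumes CPU_COUNT corresponds to os.cpu_count(), and the helper names default_threads and resolve_seed are hypothetical.

```python
import os
import random


def default_threads():
    """Leave one core free, but never go below one thread."""
    # os.cpu_count() can return None on some platforms; fall back to 1.
    cpu_count = os.cpu_count() or 1
    return max(cpu_count - 1, 1)


def resolve_seed(seed=None):
    """Seed the global RNG and return the seed that was used."""
    if seed is None:
        # No seed supplied: draw one so the run can be reproduced later.
        seed = random.SystemRandom().randrange(2**32)
    random.seed(seed)
    return seed
```

Printing the value returned by resolve_seed() at startup makes it possible to repeat a run later by passing the same number to --seed.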

Task overview:

data → Generation of training data from multiple sequence alignments.
train → Train a prediction model on data generated with the ‘data’ task.
windows → Cut an alignment into overlapping windows in preparation for prediction.
predict → Run a model prediction on windows generated with the ‘windows’ task.
hexcalibrate → Train a hexamer frequency model which can be used for coding potential assessment.

svhip data

Purpose

Generate training data from coding and/or noncoding input sequences.
Align sequences (requires Clustal Omega), optionally generate a negative set (SISSIz if available, otherwise column shuffling), slice alignments into windows, and compute features into a TSV.

Behavior

Requires at least one of --noncoding or --coding.
Checks for clustalo availability; checks SISSIz availability unless --shuffle-control is given.
Seeds RNG with --seed.
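
To illustrate the column-shuffling fallback and the role of the seed, here is a minimal sketch. It assumes an alignment is held as a list of equal-length strings; the function name shuffle_columns is hypothetical and the real implementation may differ.

```python
import random


def shuffle_columns(alignment, seed=None):
    """Return a copy of the alignment with its columns randomly permuted."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # A dedicated Random instance keeps the result reproducible per seed.
    rng = random.Random(seed)
    order = list(range(length))
    rng.shuffle(order)
    # Apply the same column order to every row so columns stay intact.
    return ["".join(row[i] for i in order) for row in alignment]
```

Each column is kept intact, so per-column conservation and gap patterns survive while the column order is randomized. Calling the function twice with the same seed returns the same shuffled alignment.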

Options

--noncoding (string): Input directory with FASTA file(s) of noncoding sequences (Requires at least 1 of ‘noncoding’, ‘coding’).
--coding (string): Input directory with FASTA file(s) of coding sequences (Requires at least 1 of ‘noncoding’, ‘coding’).
--other (string): Input directory with FASTA file(s) of random (intergenic) sequences (will be auto-generated via randomization if not supplied).
-o, --outfile (string): Name for the output file (Required).
-N, --negative (string): Path to a specific negative dataset; if empty, a negative set is auto-generated.
-d, --max-id (float, default: 0.95): Remove sequences above this identity threshold during preprocessing (interpreted as proportion; help text mentions percent).
-n, --num-sequences (int, default: 100): Number of sequences input alignments will be optimized towards.
-l, --window-length (int, default: 120): Window length for slicing alignments into overlapping windows.
-w, --windowslide (int, default: 40): Slide step size controlling window overlap.
-s, --samples (int, default: 10): Number of sampling runs per alignment/sequence count.
-a, --sample-attempts (int, default: 1000): Number of sampling attempts per alignment.
-c, --shuffle-control (store_true, default: False): Use simpler column-based shuffling instead of SISSIz.
-H, --hexamer-model (string, default: hexamer_models/Human_hexamer.tsv): Path to the statistical hexamer model to use.
-S, --no-structural-filter (string, default: False): Set to True to disable filtering windows by statistical significance of structure. Note: defined as action="store" (string) although conceptually a boolean toggle.
-T, --tree (string, default: None): Path to a Newick-formatted species tree for the alignment. If None, a tree may be estimated.

Example

python svhip.py data --coding CodingDir --noncoding NoncodingDir -o features.tsv -n 200 -l 120 -w 40
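
The --max-id threshold is read as a proportion between 0 and 1. The sketch below shows one way such a filter could work. It assumes identity is the fraction of matching positions among columns where neither sequence has a gap, and that sequences are filtered greedily in input order; both choices, and the function names, are assumptions rather than the documented behavior of svhip.py.

```python
def pairwise_identity(a, b):
    """Fraction of identical positions, ignoring columns with a gap in either row."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def filter_by_max_identity(sequences, max_id=0.95):
    """Keep a sequence only if it is not too similar to any sequence already kept."""
    kept = []
    for seq in sequences:
        if all(pairwise_identity(seq, other) <= max_id for other in kept):
            kept.append(seq)
    return kept
```

With the default of 0.95, an exact duplicate of an earlier sequence (identity 1.0) is dropped, while a sequence sharing half of its positions with it is kept.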

svhip train

Purpose

Train a machine-learning model (RF, SVM, or LR) on features generated by the data task.
Supports optional hyperparameter optimization for SVM and RF.

Options

-i, --input (string): Input features file generated with data (Required).
-o, --outfile (string): Prefix for output model files (Required).
-M, --model (string, default: RF): Model type. One of RF (Random Forest), SVM (Support Vector Machine), LR (Logistic Regression).
--optimize-hyperparameters (store_true, default: False): Perform hyperparameter optimization.
--optimizer (string, default: randomwalk): Hyperparameter search strategy: gridsearch (exhaustive) or randomwalk (faster).

SVM hyperparameters (when model=SVM and optimization enabled)

--low-c (int, default: 1): Lowest C value to try.
--high-c (int, default: 100): Highest C value to try.
--low-gamma (int, default: 1): Lowest gamma value to try.
--high-gamma (int, default: 100): Highest gamma value to try.
--hyperparameter-steps (int, default: 10): Number of values per hyperparameter (evenly spaced).
--logscale (store_true, default: False): Use logarithmic scaling for the parameter grid.
--logbase (int, default: 2): Logarithmic base if --logscale is set.
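
The candidate values implied by these options can be sketched as below. The linear grid follows directly from "evenly spaced". The logarithmic variant is an assumption: it spaces the exponents evenly between log(low) and log(high) in the chosen base, which may not match how svhip.py applies --logscale. The function names are hypothetical.

```python
import math


def linear_grid(low, high, steps):
    """Return `steps` evenly spaced values from low to high, inclusive."""
    if steps < 2:
        return [float(low)]
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def log_grid(low, high, steps, base=2):
    """Return `steps` values whose exponents in `base` are evenly spaced."""
    if low <= 0 or high <= 0:
        raise ValueError("logarithmic grids need strictly positive bounds")
    # Assumption: spacing is even in the exponent, not in the value itself.
    exponents = linear_grid(math.log(low, base), math.log(high, base), steps)
    return [base ** e for e in exponents]
```

With the defaults, linear_grid(1, 100, 10) gives 1, 12, 23, and so on up to 100 for both C and gamma. An exhaustive grid search over both parameters would then evaluate 10 x 10 = 100 combinations.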

Random Forest hyperparameters (when model=RF and optimization enabled)

--min-trees (int, default: 100): Minimum number of trees (n_estimators) to consider.
--max-trees (int, default: 500): Maximum number of trees (n_estimators) to consider.
--min-samples-split (int, default: 2): Lowest value to consider for the number of samples required to split an internal node.
--max-samples-split (int, default: 16): Highest value to consider for the number of samples required to split an internal node.
--min-samples-leaf (int, default: 1): Lowest value to consider for the number of samples required at a leaf node.
--max-samples-leaf (int, default: 16): Highest value to consider for the number of samples required at a leaf node.

Example

python svhip.py train -i features.tsv -o RF_classifier -M RF --optimize-hyperparameters --optimizer randomwalk

svhip windows

Purpose

Slice an existing alignment into overlapping windows, filtering sequences by identity and gaps.

Options

-i, --input (string): Input alignment file (Required).
-o, --outfile (string): Output alignment file for windows (Required).
-l, --length (int, default: 120): Window length.
-s, --slide (int, default: 80): Slide step size for overlap.
--min-id (float, default: 0.5): Minimum pairwise identity of sequences to keep.
--max-id (float, default: 0.95): Maximum pairwise identity of sequences to keep.
--opt-id (float, default: 0.8): Target identity to optimize sequence selection.
-n, --num-seqs (int, default: 6): Maximum number of sequences per window.
-g, --max-gaps (float, default: 0.75): Maximum fraction of gaps in the reference sequence.

Example

python svhip.py windows -i input.aln -o WINDOWS.aln -l 120 -s 80 --min-id 0.5 --opt-id 0.8 -n 6
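
How window length and slide interact can be sketched as follows. This is an illustration only: it assumes the alignment is a list of equal-length strings and that a trailing partial window is dropped, which may differ from what svhip.py does. The function names are hypothetical.

```python
def window_starts(alignment_length, length=120, slide=80):
    """Return the start column of every full-length window."""
    if length <= 0 or slide <= 0:
        raise ValueError("length and slide must be positive")
    if alignment_length <= length:
        # Alignment is shorter than one window: use it as a single window.
        return [0]
    return list(range(0, alignment_length - length + 1, slide))


def slice_windows(alignment, length=120, slide=80):
    """Cut every row of the alignment at the same window boundaries."""
    if not alignment:
        return []
    starts = window_starts(len(alignment[0]), length, slide)
    return [[row[s:s + length] for row in alignment] for s in starts]
```

Consecutive windows share length minus slide columns. With the defaults of this task (-l 120 -s 80) that is 40 columns; with the defaults of the data task (-l 120 -w 40) it is 80 columns. For a 300-column alignment, the sketch yields windows starting at columns 0, 80, and 160.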

svhip predict

Purpose

Predict class labels (coding, non-coding, other) for windows cut from an input alignment using a trained model and hexamer model.
Supports MAF or Clustal input; when input ends with .maf, genome coordinates are preserved and can be exported as BED.
Processes windows in blocks for efficiency; can scan both strands.

Options

-i, --input (string): Input alignment file, MAF or Clustal (Required).
-o, --outfile (string): Output TSV file (Required).
-M, --model-path (string, default: ""): Path to the trained model file (Required).
-T, --tree (string, default: None): Path to a Newick-formatted species tree; if None, one may be estimated.
-H, --hexamer-model (string, default: hexamer_models/Human_hexamer.tsv): Path to the hexamer score model.
--both-strands (store_true, default: False): Screen both forward and reverse strands.
--bed (store_true, default: False): Merge overlapping annotations and write a BED file. IMPORTANT: Requires MAF input for genomic coordinates.
--windows-per-block (int, default: 50): Number of windows processed per block before writing results.

Example

python svhip.py predict -i query.maf -o predictions.tsv -M RF_classifier.model -H hexamer_models/Human_hexamer.tsv --both-strands --bed
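
The merging step behind --bed can be pictured with the sketch below. It assumes each predicted window is a (chrom, start, end, label) tuple with 0-based, half-open coordinates as in BED, and that only windows with the same label on the same chromosome are merged. The merge criteria actually used by svhip.py are not documented here, and the function names are hypothetical.

```python
def merge_intervals(hits):
    """Merge overlapping or touching windows that share chromosome and label."""
    merged = []
    ordered = sorted(hits, key=lambda h: (h[0], h[3], h[1], h[2]))
    for chrom, start, end, label in ordered:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[3] == label and start <= last[2]:
            # Extend the previous interval instead of opening a new one.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end, label])
    return [tuple(m) for m in merged]


def to_bed_lines(intervals):
    """Format intervals as 4-column BED lines sorted by position."""
    ordered = sorted(intervals, key=lambda iv: (iv[0], iv[1], iv[2]))
    return ["\t".join(str(field) for field in iv) for iv in ordered]
```

For example, two coding windows at chr1:0-120 and chr1:80-200 collapse into a single BED record covering chr1:0-200, while a non-coding window overlapping the same region is reported separately.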

svhip hexcalibrate

Purpose

Calibrate a hexamer model from coding and noncoding sequences; writes a tab-delimited model file.

Options

-c, --coding (string): FASTA file of coding transcripts (must be in-frame).
-n, --noncoding (string): FASTA file of noncoding sequences.
-o, --outfile (string): Output TSV file for the calibrated hexamer model.

Example

python svhip.py hexcalibrate -c coding.fa -n noncoding.fa -o Human_hexamer.tsv
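
The in-frame requirement for coding input matters because hexamers are usually counted codon by codon. The sketch below shows a generic hexamer frequency table. It assumes coding sequences are read in steps of three from the first base and noncoding sequences in steps of one, and that the table has three columns (hexamer, coding frequency, noncoding frequency). None of this is confirmed for svhip.py, and all names are hypothetical.

```python
from collections import Counter


def hexamer_frequencies(sequences, step):
    """Return relative frequencies of unambiguous hexamers in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 5, step):
            word = seq[i:i + 6]
            # Skip hexamers containing N or other ambiguity codes.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def calibrate(coding, noncoding):
    """Build (hexamer, coding_freq, noncoding_freq) rows, sorted by hexamer."""
    cod = hexamer_frequencies(coding, step=3)      # assumption: in-frame steps
    non = hexamer_frequencies(noncoding, step=1)   # assumption: every offset
    words = sorted(set(cod) | set(non))
    return [(w, cod.get(w, 0.0), non.get(w, 0.0)) for w in words]


def format_tsv(rows):
    """Render the rows as tab-delimited text, one hexamer per line."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)
```

If a coding transcript starts out of frame, the step of three samples hexamers that straddle codon boundaries, and the resulting coding frequencies no longer reflect codon usage.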

External tools and notes

Clustal Omega (clustalo) must be available in PATH for data generation and alignment steps.
SISSIz is used for negative control generation when available; if not present or if --shuffle-control is set, a simpler column-shuffling approach is used instead.
Randomization is controlled by --seed. If not provided, a random seed is generated at start.
When using predict with --bed, ensure the input is MAF to include genomic coordinates.