SVHIP

This repository contains svhip.py, a script for predicting functional RNA elements from multiple sequence alignments. It supports generating training data, training machine learning models, slicing alignments into windows, and predicting classes (coding, non-coding, other) for alignment windows.

Usage

  • python svhip.py [Task] [Options]
  • Tasks: data, train, windows, predict, hexcalibrate

Global options (available for all tasks)

  • --threads (int, default: max(CPU_COUNT-1, 1)): Number of threads to allocate.
  • --seed (int, default: random integer at startup): Seed controlling randomized behavior (e.g. shuffling).
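
The two defaults above can be sketched with argparse. This is a minimal illustration, not the code in svhip.py: the helper names build_global_parser and resolve_seed are hypothetical, and the 32-bit range used for the random seed is an assumption.

```python
import argparse
import os
import random


def build_global_parser():
    """Hypothetical parser mirroring the documented global options."""
    parser = argparse.ArgumentParser(add_help=False)
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    # Leave one core free, but never go below a single thread.
    parser.add_argument("--threads", type=int, default=max(cpu_count - 1, 1))
    parser.add_argument("--seed", type=int, default=None)
    return parser


def resolve_seed(seed):
    """Draw a seed if none was given (range is an assumption), then seed the RNG."""
    if seed is None:
        seed = random.randrange(2**32)
    random.seed(seed)
    return seed


args = build_global_parser().parse_args(["--seed", "42"])
seed = resolve_seed(args.seed)
print(f"threads={args.threads} seed={seed}")
```

Recording the resolved seed in the output or log makes a run reproducible even when no seed was passed explicitly.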

Task overview:

  • data → Generation of training data from multiple sequence alignments.
  • train → Train a prediction model on data generated with the 'data' task.
  • windows → Cut an alignment into overlapping windows in preparation for prediction.
  • predict → Run a model prediction on windows generated with the 'windows' task.
  • hexcalibrate → Train a hexamer frequency model which can be used for coding potential assessment.

svhip data

Purpose

  • Generate training data from coding and/or noncoding input sequences.
  • Align sequences (requires Clustal Omega), optionally generate a negative set (SISSIz if available, otherwise column shuffling), slice alignments into windows, and compute features into a TSV.

Behavior

  • Requires at least one of --noncoding or --coding.
  • Checks for clustalo availability; checks SISSIz availability unless --shuffle-control is given.
  • Seeds RNG with --seed.
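
The availability checks described above could look roughly like the following sketch. It is not taken from svhip.py: the function name check_tools is hypothetical, the executable name "SISSIz" is an assumption, and the fallback to column shuffling follows the notes elsewhere in this README. The lookup function is passed in so the logic can be tried without either tool installed.

```python
import shutil


def check_tools(shuffle_control=False, which=shutil.which):
    """Return the negative-set method to use, or raise if clustalo is missing."""
    if which("clustalo") is None:
        raise RuntimeError("clustalo not found in PATH")
    if shuffle_control:
        # --shuffle-control given: SISSIz is not needed, so it is not checked.
        return "column-shuffle"
    # Executable name "SISSIz" is assumed here.
    return "SISSIz" if which("SISSIz") is not None else "column-shuffle"


# Demonstration with a fake lookup that pretends every tool is installed.
fake_which = lambda name: "/usr/bin/" + name
print(check_tools(shuffle_control=False, which=fake_which))
print(check_tools(shuffle_control=True, which=fake_which))
```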

Options

  • --noncoding (string): Input directory with FASTA file(s) of noncoding sequences (Requires at least 1 of 'noncoding', 'coding').
  • --coding (string): Input directory with FASTA file(s) of coding sequences (Requires at least 1 of 'noncoding', 'coding').
  • --other (string): Input directory with FASTA file(s) of random (intergenic) sequences (will be auto-generated via randomization if not supplied).
  • -o, --outfile (string): Name for the output file (Required).
  • -N, --negative (string): Path to a specific negative dataset; if empty, a negative set is auto-generated.
  • -d, --max-id (float, default: 0.95): Remove sequences above this identity threshold during preprocessing (interpreted as a proportion, although the help text mentions percent).
  • -n, --num-sequences (int, default: 100): Number of sequences input alignments will be optimized towards.
  • -l, --window-length (int, default: 120): Window length for slicing alignments into overlapping windows.
  • -w, --windowslide (int, default: 40): Slide step size controlling window overlap.
  • -s, --samples (int, default: 10): Number of sampling runs per alignment/sequence count.
  • -a, --sample-attempts (int, default: 1000): Number of sampling attempts per alignment.
  • -c, --shuffle-control (store_true, default: False): Use simpler column-based shuffling instead of SISSIz.
  • -H, --hexamer-model (string, default: hexamer_models/Human_hexamer.tsv): Path to the statistical hexamer model to use.
  • -S, --no-structural-filter (string, default: False): Set to True to disable filtering windows by statistical significance of structure. Note: defined as action="store" (string) although conceptually a boolean toggle.
  • -T, --tree (string, default: None): Path to a Newick-formatted species tree for the alignment. If None, a tree may be estimated.
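
The note on -S matters in practice: an argparse option with action="store" keeps whatever text was typed, and any non-empty string is truthy in Python. The sketch below shows that general behavior. How svhip.py itself interprets the value is not stated here, and the str_to_bool helper is a hypothetical converter, not part of the script.

```python
import argparse


def str_to_bool(value):
    """Hypothetical helper: convert common textual booleans explicitly."""
    text = str(value).strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
# action="store" is argparse's default, so the value arrives as a string.
parser.add_argument("-S", "--no-structural-filter", default=False)
args = parser.parse_args(["-S", "False"])

print(repr(args.no_structural_filter))         # 'False' (a non-empty string)
print(bool(args.no_structural_filter))         # True
print(str_to_bool(args.no_structural_filter))  # False
```

To stay on the safe side, pass -S True only when the filter should be disabled and omit the option otherwise.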

Example

python svhip.py data --coding CodingDir --noncoding NoncodingDir -o features.tsv -n 200 -l 120 -w 40

svhip train

Purpose

  • Train a machine-learning model (RF, SVM, or LR) on features generated by the data task.
  • Supports optional hyperparameter optimization for SVM and RF.

Options

  • -i, --input (string): Input features file generated with data (Required).
  • -o, --outfile (string): Prefix for output model files (Required).
  • -M, --model (string, default: RF): Model type. One of RF (Random Forest), SVM (Support Vector Machine), LR (Logistic Regression).
  • --optimize-hyperparameters (store_true, default: False): Perform hyperparameter optimization.
  • --optimizer (string, default: randomwalk): Hyperparameter search strategy: gridsearch (exhaustive) or randomwalk (faster).

SVM hyperparameters (when model=SVM and optimization enabled)

  • --low-c (int, default: 1): Lowest C value to try.
  • --high-c (int, default: 100): Highest C value to try.
  • --low-gamma (int, default: 1): Lowest gamma value to try.
  • --high-gamma (int, default: 100): Highest gamma value to try.
  • --hyperparameter-steps (int, default: 10): Number of values per hyperparameter (evenly spaced).
  • --logscale (store_true, default: False): Use logarithmic scaling for the parameter grid.
  • --logbase (int, default: 2): Logarithmic base if --logscale is set.
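
To make the grid options concrete, the sketch below generates candidate values between a low and a high bound. The linear case follows the "evenly spaced" wording above. The logarithmic case assumes that values are spaced evenly in the exponent between log(low) and log(high); svhip.py may interpret --logscale differently, so treat that branch as an illustration only. The function name parameter_values is hypothetical.

```python
import math


def parameter_values(low, high, steps=10, logscale=False, logbase=2):
    """Return `steps` candidate values from low to high (inclusive)."""
    if steps < 2:
        return [float(low)]
    if not logscale:
        width = (high - low) / (steps - 1)
        return [low + i * width for i in range(steps)]
    # Assumed interpretation: even spacing of the exponents.
    low_exp = math.log(low, logbase)
    high_exp = math.log(high, logbase)
    width = (high_exp - low_exp) / (steps - 1)
    return [logbase ** (low_exp + i * width) for i in range(steps)]


c_values = parameter_values(1, 100, steps=10)
gamma_values = parameter_values(1, 100, steps=10, logscale=True, logbase=2)
print(c_values)
print([round(g, 2) for g in gamma_values])
print(len(c_values) * len(gamma_values), "combinations in an exhaustive grid")
```

With the defaults, an exhaustive grid over C and gamma contains 10 x 10 = 100 combinations, which is why a non-exhaustive strategy can be noticeably faster.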

Random Forest hyperparameters (when model=RF and optimization enabled)

  • --min-trees (int, default: 100): Minimum number of trees (n_estimators) to consider.
  • --max-trees (int, default: 500): Maximum number of trees (n_estimators) to consider.
  • --min-samples-split (int, default: 2): Minimum samples required to split an internal node.
  • --max-samples-split (int, default: 16): Maximum samples to split an internal node.
  • --min-samples-leaf (int, default: 1): Minimum samples required at a leaf node.
  • --max-samples-leaf (int, default: 16): Maximum samples at a leaf node.

Example

python svhip.py train -i features.tsv -o RF_classifier -M RF --optimize-hyperparameters --optimizer randomwalk

svhip windows

Purpose

  • Slice an existing alignment into overlapping windows, filtering sequences by identity and gaps.

Options

  • -i, --input (string): Input alignment file (Required).
  • -o, --outfile (string): Output alignment file for windows (Required).
  • -l, --length (int, default: 120): Window length.
  • -s, --slide (int, default: 80): Slide step size for overlap.
  • --min-id (float, default: 0.5): Minimum pairwise identity of sequences to keep.
  • --max-id (float, default: 0.95): Maximum pairwise identity of sequences to keep.
  • --opt-id (float, default: 0.8): Target identity to optimize sequence selection.
  • -n, --num-seqs (int, default: 6): Maximum number of sequences per window.
  • -g, --max-gaps (float, default: 0.75): Maximum fraction of gaps in the reference sequence.
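
The identity and gap thresholds above operate on quantities like the ones sketched below. Definitions of pairwise identity vary between tools; this version counts matching columns and skips columns where both sequences have a gap, which is an assumption and not necessarily the definition used by svhip.py. Both function names are hypothetical.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of identical columns, ignoring columns that are gaps in both."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue  # shared gap column carries no information
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def gap_fraction(seq):
    """Fraction of gap characters in one aligned sequence."""
    return seq.count("-") / len(seq) if seq else 0.0


reference = "ACGU-ACGU"
other = "ACGA-ACGU"
identity = pairwise_identity(reference, other)
# Keep the pair only if it falls inside the documented default bounds.
keep = 0.5 <= identity <= 0.95 and gap_fraction(reference) <= 0.75
print(identity, gap_fraction(reference), keep)
```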

Example

python svhip.py windows -i input.aln -o WINDOWS.aln -l 120 -s 80 --min-id 0.5 --opt-id 0.8 -n 6
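
For intuition on how window length and slide interact, here is a minimal sketch of overlapping window coordinates. It assumes that a trailing stretch shorter than a full window is dropped and that an alignment shorter than one window is returned whole; the actual handling of alignment ends in svhip.py is not described in this README. The function names are hypothetical.

```python
def window_bounds(alignment_length, window_length=120, slide=80):
    """Return (start, end) column pairs, 0-based and end-exclusive."""
    if window_length <= 0 or slide <= 0:
        raise ValueError("window_length and slide must be positive")
    if alignment_length <= window_length:
        return [(0, alignment_length)]
    last_start = alignment_length - window_length
    return [(s, s + window_length) for s in range(0, last_start + 1, slide)]


def slice_alignment(alignment, window_length=120, slide=80):
    """alignment: dict mapping sequence name to an aligned sequence string."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (length,) = lengths
    return [
        {name: seq[start:end] for name, seq in alignment.items()}
        for start, end in window_bounds(length, window_length, slide)
    ]


print(window_bounds(280))  # [(0, 120), (80, 200), (160, 280)]
```

With a window length of 120 and a slide of 80, consecutive windows share 40 columns.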

svhip predict

Purpose

  • Predict class labels (coding, non-coding, other) for windows cut from an input alignment using a trained model and hexamer model.
  • Supports MAF or Clustal input; when input ends with .maf, genome coordinates are preserved and can be exported as BED.
  • Processes windows in blocks for efficiency; can scan both strands.

Options

  • -i, --input (string): Input alignment file, MAF or Clustal (Required).
  • -o, --outfile (string): Output TSV file (Required).
  • -M, --model-path (string, default: ""): Path to the trained model file (Required).
  • -T, --tree (string, default: None): Path to a Newick-formatted species tree; if None, one may be estimated.
  • -H, --hexamer-model (string, default: hexamer_models/Human_hexamer.tsv): Path to the hexamer score model.
  • --both-strands (store_true, default: False): Screen both forward and reverse strands.
  • --bed (store_true, default: False): Merge overlapping annotations and write a BED file. IMPORTANT: Requires MAF input for genomic coordinates.
  • --windows-per-block (int, default: 50): Number of windows processed per block before writing results.

Example

python svhip.py predict -i query.maf -o predictions.tsv -M RF_classifier.model -H hexamer_models/Human_hexamer.tsv --both-strands --bed
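
The merging step behind --bed can be pictured with the sketch below. It is an illustration under assumptions, not the implementation in svhip.py: records are (chrom, start, end, label) tuples in 0-based, end-exclusive BED coordinates, only records with the same chromosome and label are merged, and directly adjacent intervals are merged as well. The function names are hypothetical, and strand handling is left out.

```python
def merge_annotations(records):
    """Merge overlapping or adjacent records that share chrom and label."""
    merged = []
    ordered = sorted(records, key=lambda r: (r[0], r[3], r[1], r[2]))
    for chrom, start, end, label in ordered:
        if merged:
            p_chrom, p_start, p_end, p_label = merged[-1]
            if p_chrom == chrom and p_label == label and start <= p_end:
                # Extend the previous interval instead of adding a new one.
                merged[-1] = (chrom, p_start, max(p_end, end), label)
                continue
        merged.append((chrom, start, end, label))
    return merged


def to_bed_lines(records):
    """Format records as tab-separated BED lines (chrom, start, end, name)."""
    return ["\t".join(str(field) for field in record) for record in records]


predictions = [
    ("chr1", 80, 200, "coding"),
    ("chr1", 0, 120, "coding"),
    ("chr1", 400, 520, "coding"),
    ("chr1", 100, 220, "non-coding"),
]
for line in to_bed_lines(merge_annotations(predictions)):
    print(line)
```

Genomic start and end positions are only available when the input is MAF, which is why --bed depends on that format.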

svhip hexcalibrate

Purpose

  • Calibrate a hexamer model from coding and noncoding sequences; writes a tab-delimited model file.

Options

  • -c, --coding (string): FASTA file of coding transcripts (must be in-frame).
  • -n, --noncoding (string): FASTA file of noncoding sequences.
  • -o, --outfile (string): Output TSV file for the calibrated hexamer model.

Example

python svhip.py hexcalibrate -c coding.fa -n noncoding.fa -o Human_hexamer.tsv
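
One common way to build a hexamer table is to compare hexamer frequencies in coding and noncoding sequences. The sketch below illustrates that idea only. The exact counting scheme, the columns of the output file, and any scoring used by svhip.py are not described in this README, so the step sizes, the pseudocount, and the function names are all assumptions.

```python
import math
from collections import Counter


def hexamer_frequencies(sequences, step):
    """Relative frequency of each hexamer, sampled every `step` bases."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 5, step):
            hexamer = seq[i:i + 6]
            if set(hexamer) <= set("ACGT"):  # skip ambiguous bases
                counts[hexamer] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()} if total else {}


def hexamer_log_ratios(coding, noncoding, pseudocount=1e-6):
    """Log ratio of coding to noncoding frequency per hexamer (assumed scoring)."""
    # In-frame counting (step 3) is why coding input must be in-frame.
    coding_freq = hexamer_frequencies(coding, step=3)
    noncoding_freq = hexamer_frequencies(noncoding, step=1)
    hexamers = sorted(set(coding_freq) | set(noncoding_freq))
    return {
        h: math.log(
            (coding_freq.get(h, 0.0) + pseudocount)
            / (noncoding_freq.get(h, 0.0) + pseudocount)
        )
        for h in hexamers
    }


scores = hexamer_log_ratios(["ATGGCCATGGCC"], ["TTTTTTTT"])
for hexamer, score in scores.items():
    print(f"{hexamer}\t{score:.3f}")
```

A positive value marks a hexamer that is more frequent in the coding set, a negative value one that is more frequent in the noncoding set.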

External tools and notes

  • Clustal Omega (clustalo) must be available in PATH for data generation and alignment steps.
  • SISSIz is used for negative control generation when available; if not present or if --shuffle-control is set, a simpler column-shuffling approach is used instead.
  • Randomization is controlled by --seed. If not provided, a random seed is generated at start.
  • When using predict with --bed, ensure the input is MAF to include genomic coordinates.
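
Column shuffling, the fallback mentioned above, can be sketched as follows. The same random permutation is applied to every row, so each column (and with it the per-column conservation and gap pattern) stays intact while the column order is randomized. This is a generic illustration with a hypothetical function name, not the routine used in svhip.py.

```python
import random


def shuffle_columns(alignment, seed=None):
    """alignment: list of equal-length aligned strings; returns shuffled rows."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    order = list(range(length))
    # A dedicated Random instance keeps the result reproducible per seed.
    random.Random(seed).shuffle(order)
    return ["".join(row[i] for i in order) for row in alignment]


alignment = ["ACGU-ACGU", "ACGA-ACGU", "AC-UUACGU"]
control = shuffle_columns(alignment, seed=42)
for row in control:
    print(row)
```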