目录

RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes

Viruses are obligate intracellular parasites that depend on living organisms for their replication and survival. While the application of metagenomic high-throughput sequencing technologies have facilitated the discovery of the viral dark matter, how to determine the hosts of the metagenome-originated viruses remains challenging owing to the complex composition of the metagenomic sequencing samples. The high genetic diversity of RNA viruses poses a great challenge to the alignment-based methods.

Here, we introduce RNAVirHost, a machine learning-based tool that predicts the hosts of RNA viruses solely based on viral genomes. It takes complete RNA viral genomes as input and predicts the natural host groups from kingdom level to order level.

RNAVirHost is designed as a two-layer classification framework to hierarcically predict the host groups of query viruses. Layer 1 contains 5 branches (Chordata, Invertebrate, Viridiplantae, Fungi, Bacteria), which are categorized into kingdom and phylum level. Layer 2, designed for Chordata subtree, has 10 leaves, which are at the class and order level. RNAVirHost utilizes a hierarchical approach to predict the host lineage along the tree. It combines various factors, including virustaxonomic information, viral genomic traits, and sequence homology, to achieve accurate host prediction. To provide different levels of confidence in the predictions, RNAVirHost incorporates prediction score cutoffs. This cutoff allows users to classify the predictions into different confidence level.

Dependency:

  • python 3.x
  • BLAST 2.12.0
  • Prodigal 2.6.3
  • xgboost 1.7.4
  • pandas 2.0.3
  • scikit-learn 1.1.3
  • biopython 1.83
  • numpy 1.23.5

Quick Installation

  1. We highly recommend using conda/mamba to install all dependencies.

    conda install -c bioconda rnavirhost
  2. Or you could build the environment from GitHub:

    git clone https://github.com/GreyGuoweiChen/VirHost.git
    cd VirHost
    
    # Create the environment and install the dependencies using conda or mamba
    mamba env create -f environment.yml
    
    # Activate the environment
    conda activate rnavirhost
    
    # Distribute RNAVirHost to your conda environment
    pip install .
  3. (Alternative) You could install it using pip to avoid all the dependencies, but may fail.

    pip install rnavirhost
    # You can check the installation by calling:
    conda list rnavirhost

Usage:

To enable RNAVirHost’s prediction, the viral taxonomic information is required and its format (.csv) is as below. | | y|virus order | | ————- | ————- | | NC_000858.1 | Ortervirales | | NC_019922.1 | Norzivirales | | … | … |

The file comprises (r+1) rows and two columns, where (r+1) represents the r sequences in the query fasta file and an additional header row. The first column denotes the accession numbers of sequences (sequences’ ID), and the second column represents the corresponding order labels. If a label falls outside our designated range, it is acceptable to input them to the host prediction stage. We will output the corresponding sequence ID in a separate file.

  1. For user convenience, we provide a simple alignment-based method for classifying virus sequences at the order level using BLASTN. The code for this method is shown below. Generally, the order-level predictions provided by BLASTN are sufficiently accurate. However, if users desire to refine the classification using other programs, they can follow the file format outlined in the table above.

    rnavirhost classify_order [-i INPUT_CONTIG] [-o TAXONOMIC_RESULT]
  2. The input files include the fasta file and the corresponding taxonomic information table.

    rnavirhost predict [-i INPUT_CONTIG] [--taxa TAXONOMIC_INFORMATION_OF_INPUT] [-o OUTPUT_DIRECTORY]

Programs’ Option

classify_order:

-i: The input contig file in fasta format.
-o: The taxonoic information of query viruses at order level (default: RVH_taxa.csv).

predict:

-i, --input: The input contig file in fasta format.
--taxa: The input csv file corresponding to the virus order labels of the queries (default: RVH_taxa.csv).
-o, --output: The output directory (default: RVH_result).

Output

The output format is as: | | y|virus order | pred|L1 | pred|L2 | evidence | | ————- | ————- | ————- | ————- | ————- | | NC_000858.1 | Ortervirales | Chordata | Primates | pred_high_confidence | | NC_001426.1 | Norzivirales | Bacteria | | assign | | … | … | … | … | … |

The evidence list has 5 labels, including pred_high_confidence, pred_low_confidence, assign, BLASTn, and unclassified. The “pred” prefix represent that the prediction is made by the learning model. If the prediction score is higher than the built-in score cutoff, we regard it as a high-confidence prediction and thus give “pred_high_confidence”; otherwise, we will give “pred_low_confidence”. The “assign” evidence means that the reference viruses in the same order with the query virus infect a specific host group. So we assign the host group to the virus order without prediction.

Besides, RNAVirHost encodes the query sequences at the protein level. To obtain the protein level representation, we translate the sequences into proteins by prodigal first. For those sequences do not encode proteins, it is challenging to generate a reliable result, and RNAVirHost does not accept these non-coding sequences for higher confidence. However, for user’s convenience, we try to predict hosts of these sequences adopting the best alignment strategy by BLASTn. We aligned the query virus against our reference database, assign the host as the label of the best hit reference, and give “BLASTn” evidence. Finally, we give the “unclassified” evidence to query viruses which do not have order information.

Example

# create the taxonomic file first
rnavirhost classify_order -i test/test.fasta 

# predict the host of query viruses
rnavirhost predict -i test/test.fasta -o RVH_result
关于

用于识别RNA病毒与宿主相互作用的生物信息学工具

60.2 MB
邀请码
    Gitlink(确实开源)
  • 加入我们
  • 官网邮箱:gitlink@ccf.org.cn
  • QQ群
  • QQ群
  • 公众号
  • 公众号

版权所有:中国计算机学会技术支持:开源发展技术委员会
京ICP备13000930号-9 京公网安备 11010802032778号