NSCCN/virmet：病毒和微生物检测流程，用于测序数据中的病原体识别。

VirMet

VirMet is a software designed to help users running viral metagenomics (mNGS) experiments.

For full Documentation, please Read the Docs.

VirMet is called with a command-subcommand syntax. All the possible subcommands are:

fetch: download all databases
update: update viral database
index: index all genomes
wolfpack: analyze a Miseq run or file
covplot: plot coverage for a specific organism

Some help can be obtained with virmet <subcommand> -h or simply virmet -h:

virmet -h
usage: virmet <command> [options]

positional arguments:
  {fetch,update,index,wolfpack,covplot}
                        available sub-commands
    fetch               download databases
    update              update viral database
    index               index genomes
    wolfpack            analyze a Miseq run
    covplot             create coverage plot

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Run virmet subcommand -h for more help.

Installation

Bioconda

VirMet is available through Bioconda, a channel for the conda package manager. Once conda is installed and the channels are set up, conda install virmet installs the package with all its dependencies.

Preparation

VirMet contains programs to download and index the genome sequences, instructions here.

Running a virus scan

Once the fetch and index subcommands have ben run and the databases are downloaded and indexed, users can use Wolfpack, the main subcommand of VirMet.

A very simple way to use Wolfpack is the following: virmet wolfpack --run path_to_run_directory

With that, sequencing reads are filtered (quality-control), decontaminated, and finally blasted against a (large) set of viral sequences.

Virmet Wolfpack provides several outputs that can be found in the output folder virmet_output_RUN_NAME, located at the current working directory.

For example, if users have a directory named Exp01 with files:

exp_01/ABC-DNA_S9.fastq.gz
exp_01/DEF-RNA_S11.fastq.gz

and run virmet wolfpack --run Exp01, their results will appear into virmet_output_Exp01.

The most important output files are:

Orgs_species_found.csv: table showing the viral organisms identified per FASTQ file as well as the read counts matching each organism. It looks as follows:

species                accn        reads  stitle               ssciname                  covered_region  seq_len  sample        run
Tunavirus T1           MK213796.1  455      Phage T1             Escherichia phage T1      30152              48836       ABC-DNA_S9    Exp01
Varicellovirus a5       KY559403.2  79      Alphaherpesvirus 5   Alphaherpesvirus 5      321              137741   ABC-DNA_S9    Exp01
Lentivirus humimdef1   MT222957.1  13      HIV-1 isolate X        HIV-1                  141              9372       DEF-RNA_S11  Exp01
Emesvirus zinderi       EF204940.1  6265      Phage MS2 isolate Y  Escherichia phage MS2  2836              3569       DEF-RNA_S11  Exp01

with each column meaning:

species: scientific name of the species corresponding to the database sequence.
accn: accession number of the viral species corresponding to the database sequence.
reads: number of reads assigned to this specific sequence.
stitle: title of the sequence in the database (fasta header).
ssciname: scientific name of the sequence.
covered_region: number of nucleotides covered by at least one read.
seq_len: length of the sequence.
sample: name of the FASTQ file that lead to the results.
run: name of the sequencing run or main folder.

Run_reads_summary.tsv: summary of all reads analyzed per sample, showing the number of reads passing each step of the pipeline (e.g., QC, decontamination), and the number of reads matching the human, bovine, bacterial, fungal and viral database. It looks as follows:

category              reads        sample         run
raw_reads              2656448    ABC-DNA_S9     Exp01
trimmed_too_short      46        ABC-DNA_S9     Exp01
low_entropy              25        ABC-DNA_S9     Exp01
low_quality              1081        ABC-DNA_S9     Exp01
passing_filter          2655296    ABC-DNA_S9     Exp01
matching_humanGRCh38  339020    ABC-DNA_S9     Exp01
matching_bt_ref          1319509    ABC-DNA_S9     Exp01
matching_bacteria      46024        ABC-DNA_S9     Exp01
matching_fungi          10        ABC-DNA_S9     Exp01
matching_other_cells  7403        ABC-DNA_S9     Exp01
reads_to_blast          943330    ABC-DNA_S9     Exp01
viral_reads              70579        ABC-DNA_S9     Exp01
undetermined_reads      872751    ABC-DNA_S9     Exp01
raw_reads              2202164    DEF-RNA_S11     Exp01
trimmed_too_short      259441    DEF-RNA_S11     Exp01
low_entropy              18        DEF-RNA_S11     Exp01
low_quality              1277        DEF-RNA_S11     Exp01
passing_filter          1941428    DEF-RNA_S11     Exp01
matching_humanGRCh38  526874    DEF-RNA_S11     Exp01
matching_bt_ref          819466    DEF-RNA_S11     Exp01
matching_bacteria      8721        DEF-RNA_S11     Exp01
matching_fungi          21        DEF-RNA_S11     Exp01
matching_other_cells  299        DEF-RNA_S11     Exp01
reads_to_blast          586047    DEF-RNA_S11     Exp01
viral_reads              168247    DEF-RNA_S11     Exp01
undetermined_reads      417800    DEF-RNA_S11     Exp01

In addition to these main outputs, Wolfpack creates inside virmet_output_RUN_NAME a subdirectory for each sample (FASTQ file). There, users can find these additional files:

Unique.tsv.gz
Viral_reads.fastq.gz
Undetermined_reads.fastq.gz
Orgs_list.tsv
Stats.tsv
Fastp.html
Folders named as viral organisms
Other.err

Files orgs_list.tsv and stats.tsv report the main output of the tool for each sample. They are the same as Orgs_species_found.csv and Run_reads_summary.tsv, respectively, but containing information only of one sample. Therefore, the last two columns (sample and run) are missing.

The unique.tsv.gz reports all BLAST hits to the viral database. These are not the same as in orgs_list.tsv because VirMet further filters the BLAST hits (unique.tsv.gz) to ensure that only those with ≥75% pident and ≥75% qcov are ultimately considered viral reads for diagnosis.

As the names say, viral_reads.fastq.gz and undetermined_reads.fastq.gz contain, respectively, reads identified as of viral origin and reads not matching any of the considered genomes.

Finally, fastp.html shows the quality-control statistics and information about the QC-filtering step.

In the decontamination step, reads are aligned against the human genome first, those matching are discarded while those not matching are aligned against the bovine genome, and so on. In each step, some files ending with .err are generated, and they can be used for inspection, if needed. However, they do not provide valuable information for the users (unless there is any error) and can be removed if desired.

Besides all these files, Wolfpack automatically creates (unless disabled with --nocovplot) a few folders with the names of viral organisms. Such folders contain the coverage plots of the viral assignments, named Viral_organism_coverage.pdf. It is recommended to manually have a look at them before a final viral diagnosis.

Overall, users can expect the following structure for the outputs:

virmet_output_Exp01/
│
├── ABC-DNA_S9
│   ├── Virus_X/
│   │    └── Virus_X_coverage.pdf
│   ├── unique.tsv.gz, viral_reads.fastq.gz, undetermined_reads.fastq.gz
│   ├── orgs_list.tsv, stats.tsv, fastp.html
│   └── Others.err
│ 
├── DEF-RNA_S11
│   ├── Virus_Y/
│   │   └── Virus_Y_coverage.pdf
│   ├── Virus_Z/
│   │   └── Virus_Z_coverage.pdf
│   ├── unique.tsv.gz, viral_reads.fastq.gz, undetermined_reads.fastq.gz
│   ├── orgs_list.tsv, stats.tsv, fastp.html
│   └── Others.err
│
├── run_reads_summary.tsv
└── orgs_species_found.tsv

Please, see VirMet Documentation for a more extensive explanation on how to use fetch, index, wolfpack and covplot subcommands.