PHIST

Phage-Host Interaction Search Tool

A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of k-mers shared between their sequences.
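To make the idea of "shared k-mers" concrete, here is a toy sketch in Python. It is only an illustration: PHIST itself delegates k-mer counting to Kmer-db, and the function names below, the use of distinct k-mers, and the forward-strand-only comparison are assumptions made for this example.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    # An empty set is returned when the sequence is shorter than k.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(virus, host, k=25):
    """Count distinct k-mers present in both sequences (forward strand only)."""
    return len(kmers(virus, k) & kmers(host, k))


# Tiny made-up sequences; k is lowered to 4 so the toy input has any k-mers at all.
virus = "ACGTACGTGGA"
host = "TTACGTACGTCC"
print(shared_kmer_count(virus, host, k=4))
```

In this sketch the host with the highest count for a given virus would be the predicted host.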

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/PHIST

cd PHIST
make

./phist.py ./example/virus ./example/host ./out/

Installation

PHIST uses Kmer-db as a submodule, so the repository must be cloned recursively:

git clone --recurse-submodules https://github.com/refresh-bio/PHIST

Under Linux/OS X the package can be built by running make in the project directory (tested with G++ 5.3):

cd PHIST
make

Under Windows, build the Visual Studio 2015 solutions in the kmer-db and utils subdirectories. Use the Release 64-bit configuration, because the Python script depends on the default VS output directory structure.

Usage

PHIST takes as input genomic sequences of viruses and candidate hosts in FASTA files (gzipped or not). Virus genomes may be provided in a single FASTA file or in a directory containing multiple FASTA files (one genome per file). Candidate host genomes must be stored individually in a directory, one genome per FASTA file (see the example directory).

./phist.py [options] <virus_path> <host_dir> <out_dir>

Positional arguments:

virus_path Input FASTA file or directory with files (plain or gzip)
host_dir Input directory with host FASTA files (plain or gzip)
out_dir Output directory (will be created if it does not exist)

Options:

-k <kmer-length> k-mer length (default: 25, max: 30)
-t <num-threads> Number of threads (default: number of cores)
-h, --help Show this help message and exit
--keep_temp Keep temporary kmer-db files (default: False)
--version Show tool’s version number and exit

Usage example

./phist.py example/virus/ example/host/ out/

./phist.py example/virus_multifasta.fna example/host/ out/
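When PHIST is driven from another Python script, the command line can be assembled programmatically. The sketch below uses only the positional arguments and the -k and -t options documented above. The helper names build_phist_command and run_phist are hypothetical, and launching the script through the current Python interpreter is an assumption.

```python
import subprocess
import sys


def build_phist_command(virus_path, host_dir, out_dir, k=None, threads=None,
                        script="./phist.py"):
    """Assemble the argument list for phist.py (hypothetical helper)."""
    # Assumption: the script is started with the current interpreter.
    cmd = [sys.executable, script]
    if k is not None:
        if not 1 <= k <= 30:  # the documented maximum k-mer length is 30
            raise ValueError("k must be between 1 and 30")
        cmd += ["-k", str(k)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    # Options come first, followed by the three positional arguments.
    return cmd + [virus_path, host_dir, out_dir]


def run_phist(*args, **kwargs):
    """Run PHIST and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_phist_command(*args, **kwargs), check=True)


# Show the command that would be executed, without running it.
print(build_phist_command("example/virus/", "example/host/", "out/", k=25, threads=4))
```

Calling run_phist("example/virus/", "example/host/", "out/") from the project directory would then correspond to the first usage example above.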

Output format

PHIST outputs two CSV files: one with a table of common k-mers between phages and hosts, and one with virus-host predictions.

Common k-mers table

The common_kmers.csv file stores the numbers of common k-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero k-mer counts are represented as pairs (column_number : value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:


kmer-length: k fraction: f	phages	φ₁	φ₂	…	φ_n
hosts	total-kmers	|φ₁|	|φ₂|	…	|φ_n|
h₁	|h₁|	i₁₁ : |h₁ ∩ φ_i₁₁|	i₁₂ : |h₁ ∩ φ_i₁₂|
h₂	|h₂|	i₂₁ : |h₂ ∩ φ_i₂₁|	i₂₂ : |h₂ ∩ φ_i₂₂|	i₂₃ : |h₂ ∩ φ_i₂₃|
h₃	|h₃|
…	…	…
h_m	|h_m|	i_m1 : |h_m ∩ φ_{i_m1}|

where:

k - k-mer length,
φ₁, φ₂, …, φ_n - phage names,
h₁, h₂, …, h_m - host names,
|a| - number of k-mers in sample a,
|a ∩ b| - number of k-mers common for samples a and b.
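The sparse layout can be expanded into per-host dictionaries with a few lines of Python. This is a minimal sketch, not part of PHIST. It assumes that the file is comma-delimited, that the first two rows are laid out as in the table above, and that column numbers refer to positions in the phage list. The function name parse_common_kmers and the sample data are invented for illustration.

```python
import csv
import io


def parse_common_kmers(handle):
    """Parse the sparse table into (phages, phage_totals, host_totals, counts).

    counts maps host name -> {phage name: number of common k-mers}.
    """
    reader = csv.reader(handle)
    header = next(reader)   # "kmer-length: ...", "phages", phage names
    totals = next(reader)   # "hosts", "total-kmers", k-mer totals per phage
    phages = [name for name in header[2:] if name]
    phage_totals = {p: int(t) for p, t in zip(phages, totals[2:])}
    host_totals, counts = {}, {}
    for row in reader:
        if not row or not row[0]:
            continue  # skip blank lines
        host = row[0]
        host_totals[host] = int(row[1])
        counts[host] = {}
        for cell in row[2:]:
            if not cell.strip():
                continue  # tolerate trailing empty fields
            col, value = cell.split(":")
            # Assumption: 1-based column numbers index the phage list.
            counts[host][phages[int(col) - 1]] = int(value)
    return phages, phage_totals, host_totals, counts


# Made-up sample in the assumed layout.
sample = """kmer-length: 25 fraction: 1,phages,phiA,phiB,phiC
hosts,total-kmers,1000,2000,1500
hostX,500000,1:12,3:40
hostY,600000,2:7
hostZ,700000
"""
phages, phage_totals, host_totals, counts = parse_common_kmers(io.StringIO(sample))
print(counts)
```

For a real run, open the common_kmers.csv file in the output directory and pass the file object instead of the in-memory sample.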

Host predictions

The predictions.csv file assigns each phage to its most likely host (i.e., the one with the most k-mers in common). If multiple potential hosts share the same number of common k-mers, all of them are reported. Each virus-host interaction is followed by a p-value and a p-value adjusted for multiple comparisons.

phage	host	common k-mers	p-value	adj. p-value
φ₁	host(φ₁)	|φ₁ ∩ host(φ₁)|	…	…
φ₂	host(φ₂)	|φ₂ ∩ host(φ₂)|	…	…
φ₃	host₁(φ₃)	|φ₃ ∩ host₁(φ₃)|	…	…
φ₃	host₂(φ₃)	|φ₃ ∩ host₂(φ₃)|	…	…
…	…	…	…	…
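A common follow-up step is to keep only predictions below some adjusted p-value cutoff and group tied hosts by phage. The sketch below assumes a comma-delimited file with one header row and the five columns in the order shown above. It reads the columns by position, since the exact header names are not specified here. The function name read_predictions, the 0.05 cutoff, and the sample rows are all illustrative choices, not PHIST defaults or real results.

```python
import csv
import io
from collections import defaultdict


def read_predictions(handle, max_adj_pvalue=0.05):
    """Return {phage: [(host, common_kmers, adj_pvalue), ...]} for rows passing the cutoff."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    hits = defaultdict(list)
    for row in reader:
        if len(row) < 5:
            continue  # ignore blank or truncated lines
        # Assumed column order: phage, host, common k-mers, p-value, adj. p-value
        phage, host, common, _pvalue, adj = row[:5]
        if float(adj) <= max_adj_pvalue:
            hits[phage].append((host, int(common), float(adj)))
    return dict(hits)


# Made-up rows for illustration only.
sample = """phage,host,common k-mers,p-value,adj. p-value
phiA,hostX,120,1e-30,3e-28
phiB,hostY,4,0.02,0.6
phiC,hostX,15,1e-6,1e-4
phiC,hostZ,15,1e-6,1e-4
"""
predictions = read_predictions(io.StringIO(sample))
for phage, hosts in predictions.items():
    print(phage, [host for host, _, _ in hosts])
```

Phages with several entries in the returned dictionary are those for which multiple hosts tied on the number of common k-mers.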

Further analysis

The utils/matcher tool retrieves the list of all exact matches of length >= k for a given pair of phage and host FASTA sequences. Each match is reported with its coordinates in the viral genome and in the corresponding bacterial genome (a reversed interval in the latter indicates a reverse-complement match).

Usage

./utils/matcher [options] <virus> <host> <output>

Positional arguments:

virus Virus FASTA file (gzipped or not)
host Host FASTA file (gzipped or not)
output Output CSV file

Options:

-k, --k <kmer-length> k-mer length (default: 25, max: 30; may differ from the value used in the PHIST run)

Example

./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv

example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:52970-53009,NC_017548.1:1456845-1456806
NC_024123.1:53011-53102,NC_017548.1:1456804-1456713
NC_024123.1:53107-53147,NC_017548.1:1456708-1456668
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
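Lines in this format are easy to post-process, for example to compute match lengths or to flag reverse-complement matches. The following sketch is not part of PHIST. The Match class and parse_match function are hypothetical names, and it assumes that coordinates are inclusive, which is consistent with the shortest match above spanning exactly 25 positions.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    virus_id: str
    virus_start: int
    virus_end: int
    host_id: str
    host_start: int
    host_end: int

    @property
    def length(self):
        # Assumption: both coordinates are inclusive.
        return abs(self.virus_end - self.virus_start) + 1

    @property
    def reverse_complement(self):
        # A reversed host interval indicates a reverse-complement match.
        return self.host_start > self.host_end


_REGION = re.compile(r"^(.+):(\d+)-(\d+)$")


def parse_match(line):
    """Parse one 'virus_id:start-end,host_id:start-end' line into a Match."""
    virus_region, host_region = line.strip().split(",")
    v = _REGION.match(virus_region)
    h = _REGION.match(host_region)
    if v is None or h is None:
        raise ValueError("not a match line: " + line.strip())
    return Match(v.group(1), int(v.group(2)), int(v.group(3)),
                 h.group(1), int(h.group(2)), int(h.group(3)))


# Two match lines copied from the example output above.
lines = [
    "NC_024123.1:52942-52968,NC_017548.1:1456873-1456847",
    "NC_024123.1:53830-53854,NC_017548.1:2647971-2647947",
]
for m in map(parse_match, lines):
    print(m.virus_id, m.virus_start, m.length, m.reverse_complement)
```

When reading a whole output file, skip the first line, which lists the two input file paths rather than a match.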

Citing

Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics. 2022;38(5):1447-1449. doi:10.1093/bioinformatics/btab837.