A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of k-mers shared between their sequences.
Quick start
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
cd PHIST
make
./phist.py ./example/virus ./example/host ./out/
Installation
PHIST uses Kmer-db as a submodule, so the repository must be cloned recursively:
git clone --recurse-submodules https://github.com/refresh-bio/PHIST
Under Linux/OS X, the package can be built by running make in the project directory (tested with G++ 5.3):
cd PHIST
make
Under Windows, one has to build the Visual Studio 2015 solutions in the kmer-db and utils subdirectories (use the Release 64-bit configuration, as the Python script depends on the default VS output directory structure).
Usage
PHIST takes as input genomic sequences of viruses and candidate hosts in FASTA files (gzipped or not). Virus genomes may be provided in a single FASTA file or in a directory containing multiple FASTA files (one genome per file). Candidate host genomes should be stored individually in a directory, one genome per FASTA file (see the example directory).
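Because candidate hosts must be stored one genome per file, a multi-record FASTA has to be split before it can serve as the host directory. The following is a minimal sketch of such a split, not part of PHIST: the function name `split_fasta`, the `.fna` extension, and the use of the first header word as the file name are all assumptions made for illustration.

```python
import gzip
from pathlib import Path


def split_fasta(multi_fasta, out_dir):
    """Write each record of a (possibly gzipped) multi-FASTA to its own file.

    Hypothetical helper: file names come from the first word of each header,
    and records sharing that word would overwrite one another.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(multi_fasta).endswith(".gz") else open
    written = []
    handle = None
    with opener(multi_fasta, "rt") as src:
        for line in src:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                header = line[1:].split()
                name = header[0] if header else f"record_{len(written) + 1}"
                # Replace characters that are awkward in file names.
                safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
                path = out_dir / f"{safe}.fna"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written
```

The returned list of paths can be checked against the expected number of genomes before the directory is passed to PHIST as the host directory.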
Output format
PHIST outputs two CSV files: a table of the k-mers shared between phages and hosts, and a file with virus-host predictions.
Common k-mers table
The common_kmers.csv file stores the numbers of common k-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero k-mer counts are represented as pairs (column_number : value) with 1-based column indexing. Rows may therefore have different numbers of elements, e.g.:
kmer-length: k fraction: f    phages         φ1                    φ2                    …                     φn
hosts                         total-kmers    |φ1|                  |φ2|                  …                     |φn|
h1                            |h1|           i11 : |h1 ∩ φi11|     i12 : |h1 ∩ φi12|
h2                            |h2|           i21 : |h2 ∩ φi21|     i22 : |h2 ∩ φi22|     i23 : |h2 ∩ φi23|
h3                            |h3|
…                             …              …
hm                            |hm|           im1 : |hm ∩ φim1|
where:
k - k-mer length,
φ1, φ2, …, φn - phage names,
h1, h2, …, hm - host names,
|a| - number of k-mers in sample a,
|a ∩ b| - number of k-mers common for samples a and b.
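The sparse layout described above can be expanded back into per-phage counts. The sketch below is illustrative only: it assumes a row has already been split into cells (for example with Python's `csv` module), that each pair is written as `column : value`, and that column 1 refers to the first phage in the header. The function name `parse_sparse_row` is hypothetical.

```python
def parse_sparse_row(cells, phage_names):
    """Turn one host row into (host, total_kmers, {phage: common_kmers}).

    `cells` is one already-split row: host name, its total k-mer count,
    then zero or more "column : value" pairs (assumed layout).
    """
    host, total = cells[0].strip(), int(cells[1])
    common = {}
    for cell in cells[2:]:
        cell = cell.strip()
        if not cell:
            continue  # tolerate trailing empty cells
        column, value = cell.split(":")
        # Column numbers are 1-based, so column 1 is the first phage.
        common[phage_names[int(column) - 1]] = int(value)
    return host, total, common
```

Phages missing from the returned dictionary share no k-mers with that host, which is exactly the information the sparse form omits.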
Host predictions
The predictions.csv file assigns each phage to its most likely host (i.e., the one with the most k-mers in common). If multiple potential hosts share the same number of common k-mers, all of them are reported. Each virus-host interaction is followed by a p-value and a p-value adjusted for multiple comparisons.
phage    host          common k-mers         p-value    adj. p-value
φ1       host(φ1)      |φ1 ∩ host(φ1)|       …          …
φ2       host(φ2)      |φ2 ∩ host(φ2)|       …          …
φ3       host1(φ3)     |φ3 ∩ host1(φ3)|      …          …
φ3       host2(φ3)     |φ3 ∩ host2(φ3)|      …          …
…        …             …                     …          …
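The selection rule for predictions (most common k-mers, with ties reported together) can be sketched as follows. This is an illustration of the rule as described, not PHIST's own code; the function name `best_hosts` is hypothetical, and the p-value computation is deliberately left out because its details are not given here.

```python
def best_hosts(common_by_host):
    """Return (hosts, count): every host tied for the most common k-mers.

    `common_by_host` maps host name -> number of k-mers shared with one phage.
    """
    if not common_by_host:
        return [], 0
    top = max(common_by_host.values())
    # Sorting only makes the order of tied hosts deterministic.
    return sorted(h for h, n in common_by_host.items() if n == top), top
```

For a phage that shares 30 k-mers with each of two hosts, both hosts are returned, mirroring the two φ3 rows in the table above.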
Further analysis
The utils/matcher tool retrieves all exact matches of length >= k for a given pair of phage and host FASTA sequences. Each match is reported with its coordinates in the viral genome and in the corresponding bacterial genome (a reversed interval in the latter indicates a reverse-complement match).
Usage
./utils/matcher [options] <virus> <host> <output>
Positional arguments:
virus virus FASTA file (gzipped or not),
host host FASTA file (gzipped or not),
output output CSV file
Options:
-k --k <kmer-length> k-mer length (default: 25, max: 30; may differ from the one used in the PHIST execution)
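To make the idea of forward and reverse-complement matches concrete, here is a small sketch that lists every shared k-mer on both strands. It is not the matcher's algorithm: it reports fixed-length k-mer seeds rather than merging them into longer matches, and both the function names and the 0-based, end-exclusive coordinates are assumptions of this example.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_seeds(virus, host, k):
    """List (virus_start, virus_end, host_start, host_end) for each shared k-mer.

    A host interval with host_start > host_end marks a reverse-complement hit.
    """
    # Index every k-mer position on the host's forward strand.
    index = defaultdict(list)
    for j in range(len(host) - k + 1):
        index[host[j:j + k]].append(j)
    seeds = []
    for i in range(len(virus) - k + 1):
        kmer = virus[i:i + k]
        for j in index.get(kmer, []):
            seeds.append((i, i + k, j, j + k))
        for j in index.get(reverse_complement(kmer), []):
            seeds.append((i, i + k, j + k, j))  # reversed host interval
    return seeds
```

A k-mer that equals its own reverse complement is reported on both strands, which a real tool may or may not do.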
Citing
Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics. 2022, 38(5):1447-9. doi:10.1093/bioinformatics/btab837.
PHIST
Phage-Host Interaction Search Tool
Usage
Positional arguments:
virus_path Input FASTA file or directory with files (plain or gzip)
host_dir Input directory with host FASTA files (plain or gzip)
out_dir Output directory (will be created if it does not exist)
Options:
-k <kmer-length> k-mer length (default: 25, max: 30)
-t <num-threads> Number of threads (default: number of cores)
-h, --help Show this help message and exit
--keep_temp Keep temporary kmer-db files [False]
--version Show tool's version number and exit
Usage example
./phist.py ./example/virus ./example/host ./out/