Bouras G., Grigson S.R., Mirdita M., Heinzinger M., Papudeshi B., Mallawaarachchi V., Green R., Kim S.R., Mihalia V., Psaltis A.J., Wormald P-J., Vreugde S., Steinegger M., Edwards R.A.
Protein Structure Informed Bacteriophage Genome Annotation with Phold. Nucleic Acids Research, Volume 54, Issue 1, 13 January 2026. https://doi.org/10.1093/nar/gkaf1448
phold uses the ProstT5 protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by Foldseek. Foldseek is then used to search these against a database of over 1.36 million phage protein structures mostly predicted using Colabfold.
Alternatively, you can specify protein structures that you have pre-computed for your phage(s) instead of using ProstT5 using the parameters --structures and --structure_dir with phold compare.
phold strongly outperforms sequence-based homology phage annotation tools like Pharokka, particularly for less characterised phages such as those from metagenomic datasets.
If you have already annotated your phage(s) with Pharokka, phold takes the Genbank output of Pharokka as an input option, so you can easily update the annotation with more functional predictions!
phynteny uses phage synteny (the conserved gene order across phages) to assign hypothetical phage proteins to a PHROG category - it might help you add extra PHROG category annotations to hypothetical genes remaining after you run phold.
Pharokka, Phold and Phynteny are complementary tools. When used together, they substantially increase the annotation rate of your phage genome.
The plot below shows the annotation rate of different tools across 4 benchmarked datasets ((a) INPHARED 1419, (b) Cook, (c) Crass and (d) Tara; see the Phold preprint for more information).
The final Phynteny plots combine the benefits of annotation with Pharokka (with HMM, the second violin) followed by Phold (with structures, the fourth violin) followed by Phynteny
Phold plot Wasm App
We recommend running the web app to generate phold plot genomic maps using WebAssembly (Wasm) in your browser - no data ever leaves your machine!
You will need to first run Phold and upload the GenBank file via the button
This was built during the WebAssembly workshop at ABACBS2025 - for more, you can find the website here
Recent Updates
v1.2.0 Update (8 January 2026)
Improved ProstT5 3Di prediction throughput for phold run, phold predict and phold proteins-predict due to smarter batching implementations
Addition of phold autotune subcommand to detect an appropriate --batch_size for your hardware
You can also use --autotune with phold run, phold predict and phold proteins-predict to automatically detect and use the optimal --batch_size (only recommended for large datasets with thousands of proteins)
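As a rough illustration of what batch-size tuning involves, the sketch below doubles a candidate batch size until the measured throughput stops improving. This is not phold's actual autotune implementation: `pick_batch_size` and the `measure` callback are hypothetical names, and the toy throughput values are made up. In practice `phold autotune` performs the measurement for you.

```python
from typing import Callable


def pick_batch_size(measure: Callable[[int], float], max_batch: int = 64) -> int:
    """Return the batch size with the best measured throughput.

    `measure(batch_size)` is a hypothetical callback returning proteins
    per second; it should return 0.0 if the batch does not fit in memory.
    """
    best_size, best_rate = 1, measure(1)
    size = 2
    while size <= max_batch:
        rate = measure(size)
        if rate <= best_rate:
            break  # throughput stopped improving, so stop searching
        best_size, best_rate = size, rate
        size *= 2
    return best_size


# Toy throughput curve (made-up numbers) that peaks at a batch size of 8.
toy_rates = {1: 10.0, 2: 18.0, 4: 30.0, 8: 41.0, 16: 39.0, 32: 0.0, 64: 0.0}
print(pick_batch_size(lambda b: toy_rates[b]))
```

The value found this way would then be passed to `--batch_size`. For small inputs the search itself costs more time than it saves, which is why autotuning is only recommended for large datasets.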
For more details (particularly if you are using a non-NVIDIA GPU), check out the installation documentation.
The best way to install phold is using conda via miniforge, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies.
If you have a different non-NVIDIA GPU, or have trouble with pytorch, see this link for more instructions. If you have an older version of CUDA installed, then you might find this link useful.
Once phold is installed, to download and install the database run:
phold install -t 8
If you have an NVIDIA GPU and can take advantage of Foldseek’s GPU acceleration, instead run
phold install -t 8 --foldseek_gpu
Note: You will need at least 8GB of free space (the phold databases including ProstT5 are just over 8GB uncompressed).
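Before downloading the database, it can be worth checking that enough disk space is free. The following is a minimal sketch using only the Python standard library; `has_free_space` is a hypothetical helper (not part of phold), and the 8 GB figure is taken from the note above and interpreted here as 8 x 1024^3 bytes.

```python
import shutil


def has_free_space(path: str, required_gb: float = 8.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory where the database will live.
    if has_free_space("."):
        print("Enough space for the phold databases")
    else:
        print("Less than 8 GB free; free up space before running phold install")
```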
Quick Start
phold takes a GenBank format file output from pharokka or from NCBI Genbank as its input by default.
If you are running phold on a local workstation with a GPU available, using phold run is recommended. It runs both phold predict and phold compare.
phold run -i tests/test_data/NC_043029.gbk -o test_output_phold -t 8
If you have an NVIDIA GPU available, add --foldseek_gpu
If you do not have any GPU available, add --cpu.
phold run will run in a reasonable time for small datasets with CPU only (e.g. <5 minutes for a 50kbp phage). With GPU it should complete in under 1 minute.
phold predict will complete much faster if a GPU is available, and is necessary for large metagenomic datasets to run in a reasonable time.
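The hardware-dependent flags described above can be chosen programmatically. The sketch below assembles a `phold run` command using only options that appear in this README (`-i`, `-o`, `-t`, `--cpu`, `--foldseek_gpu`); `build_phold_run_cmd` and its `nvidia_gpu` / `any_gpu` arguments are hypothetical names introduced for illustration.

```python
import shlex
from typing import List


def build_phold_run_cmd(input_path: str, output_dir: str, threads: int = 8,
                        nvidia_gpu: bool = False, any_gpu: bool = True) -> List[str]:
    """Assemble a `phold run` command using only flags shown in this README."""
    cmd = ["phold", "run", "-i", input_path, "-o", output_dir, "-t", str(threads)]
    if not any_gpu:
        cmd.append("--cpu")           # no GPU of any kind available
    elif nvidia_gpu:
        cmd.append("--foldseek_gpu")  # NVIDIA GPU: enable Foldseek-GPU search
    return cmd


cmd = build_phold_run_cmd("tests/test_data/NC_043029.gbk", "test_output_phold", any_gpu=False)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` once phold is installed.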
In a cluster environment where GPUs are scarce, for large datasets it may be most efficient to run phold in 2 steps for optimal resource usage.
Predict the 3Di sequences with ProstT5 using phold predict. This is massively accelerated if a GPU is available.
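A two-step run could be scripted roughly as follows, so that the prediction step can be submitted to a GPU node and the comparison step to a CPU node. This is a sketch under assumptions: the `-i`, `-o` and `-t` options are assumed to behave as they do for `phold run`, and `--predictions_dir` is an assumed option name for pointing `phold compare` at the prediction output. Confirm both with `phold predict -h` and `phold compare -h`. The directory name `test_predictions` is a placeholder.

```python
import shlex
from typing import List


def two_step_commands(input_path: str, predictions_dir: str,
                      output_dir: str, threads: int = 8) -> List[List[str]]:
    """Return the predict (GPU) and compare (CPU) commands as argument lists."""
    predict = ["phold", "predict", "-i", input_path, "-o", predictions_dir]
    compare = [
        "phold", "compare", "-i", input_path,
        "--predictions_dir", predictions_dir,  # assumed flag name; check `phold compare -h`
        "-o", output_dir, "-t", str(threads),
    ]
    return [predict, compare]


steps = two_step_commands("tests/test_data/NC_043029.gbk", "test_predictions", "test_output_phold")
for step in steps:
    print(shlex.join(step))
```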
phold_3di.fasta containing the 3Di sequences for each CDS
phold_per_cds_predictions.tsv containing detailed annotation information on every CDS
phold_all_cds_functions.tsv containing counts per contig of CDS in each PHROGs category, VFDB, CARD, ACRDB and Defensefinder databases (similar to the pharokka_cds_functions.tsv from Pharokka)
phold.gbk, which contains a GenBank format file including these annotations, and keeps any other genomic features (tRNA, CRISPR repeats, tmRNAs) included from the pharokka Genbank input file if provided
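The 3Di output is an ordinary FASTA file, so it can be inspected with a few lines of standard-library Python. The sketch below is a generic FASTA reader, not phold code; the record IDs and sequences in the example are made up, and the exact header layout of phold_3di.fasta is not assumed beyond the ID being the first token after `>`.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            current = (line[1:].split() or [""])[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


# Made-up example records; real IDs and sequences come from phold_3di.fasta.
example = [">cds_0001 example", "DVVLVVLLVV", "QCPVD", ">cds_0002", "PPDDPV"]
records = read_fasta(example)
for cds_id, seq in records.items():
    print(cds_id, len(seq))
```

To read a real file, pass an open file handle instead of the list, for example `read_fasta(open("test_output_phold/phold_3di.fasta"))`, assuming the default `phold` prefix and the output directory used in the Quick Start.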
Usage
Usage: phold [OPTIONS] COMMAND [ARGS]...
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
Commands:
autotune Determines optimal batch size for 3Di prediction with
citation Print the citation(s) for this tool
compare Runs Foldseek vs phold db
createdb Creates foldseek DB from AA FASTA and 3Di FASTA input...
install Installs ProstT5 model and phold database
plot Creates Phold Circular Genome Plots
predict Uses ProstT5 to predict 3Di tokens - GPU recommended
proteins-compare Runs Foldseek vs phold db on proteins input
proteins-predict Runs ProstT5 on a multiFASTA input - GPU recommended
remote Uses Foldseek API to run ProstT5 then Foldseek locally
run phold predict then comapare all in one - GPU recommended
Usage: phold run [OPTIONS]
phold predict then comapare all in one - GPU recommended
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-i, --input PATH Path to input file in Genbank format or
nucleotide FASTA format [required]
-o, --output PATH Output directory [default: output_phold]
-t, --threads INTEGER Number of threads [default: 1]
-p, --prefix TEXT Prefix for output files [default: phold]
-d, --database TEXT Specific path to installed phold database
-f, --force Force overwrites the output directory
--autotune Run autotuning to detect and automatically
use best batch size for your hardware.
Recommended only if you have a large dataset
(e.g. thousands of proteins), or else
autotuning will add rather than save runtime.
--batch_size INTEGER batch size for ProstT5. [default: 1]
--cpu Use cpus only.
--omit_probs Do not output per residue 3Di probabilities
from ProstT5. Mean per protein 3Di
probabilities will always be output.
--save_per_residue_embeddings Save the ProstT5 embeddings per resuide in a
h5 file
--save_per_protein_embeddings Save the ProstT5 embeddings as means per
protein in a h5 file
--mask_threshold FLOAT Masks 3Di residues below this value of
ProstT5 confidence for Foldseek searches
[default: 25]
--finetune Use gbouras13/ProstT5Phold encoder + CNN
model both finetuned on phage proteins
--vanilla Use vanilla CNN model (trained on CASP14)
with ProstT5Phold encoder instead of the one
trained on phage proteins
--hyps Use this to only annotate hypothetical
proteins from a Pharokka GenBank input
-e, --evalue FLOAT Evalue threshold for Foldseek [default:
1e-3]
-s, --sensitivity FLOAT Sensitivity parameter for foldseek [default:
9.5]
--keep_tmp_files Keep temporary intermediate files,
particularly the large foldseek_results.tsv
of all Foldseek hits
--card_vfdb_evalue FLOAT Stricter E-value threshold for Foldseek CARD
and VFDB hits [default: 1e-10]
--separate Output separate GenBank files for each contig
--max_seqs INTEGER Maximum results per query sequence allowed to
pass the prefilter. You may want to reduce
this to save disk space for enormous datasets
[default: 1000]
--ultra_sensitive Runs phold with maximum sensitivity by
skipping Foldseek prefilter. Not recommended
for large datasets.
--extra_foldseek_params TEXT Extra foldseek search params
--custom_db TEXT Path to custom database
--foldseek_gpu Use this to enable compatibility with
Foldseek-GPU search acceleration
--restart Use this to restart phold from 'Processing
Foldseek output' after foldseek_results.tsv
is generated
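To make the interaction between `--evalue` and `--card_vfdb_evalue` concrete, the sketch below applies the two default thresholds from the help text to a list of hits. The hit records, the `database` labels and the `passes_threshold` helper are hypothetical and do not reflect phold's internal data structures or Foldseek's output format. Whether phold treats the cutoff as inclusive is also an assumption.

```python
from typing import Dict, List, Union

Hit = Dict[str, Union[str, float]]

EVALUE = 1e-3              # default of --evalue
CARD_VFDB_EVALUE = 1e-10   # default of --card_vfdb_evalue
STRICT_DATABASES = {"CARD", "VFDB"}


def passes_threshold(hit: Hit) -> bool:
    """Use the stricter cutoff for CARD/VFDB hits and the general one otherwise."""
    cutoff = CARD_VFDB_EVALUE if hit["database"] in STRICT_DATABASES else EVALUE
    return float(hit["evalue"]) <= cutoff  # inclusive comparison is an assumption


# Made-up hits for illustration only.
hits: List[Hit] = [
    {"target": "phrog_hit", "database": "PHROG", "evalue": 1e-5},
    {"target": "card_hit", "database": "CARD", "evalue": 1e-6},
    {"target": "vfdb_hit", "database": "VFDB", "evalue": 1e-12},
    {"target": "weak_hit", "database": "PHROG", "evalue": 0.5},
]
kept = [str(h["target"]) for h in hits if passes_threshold(h)]
print(kept)
```

Here the CARD hit at 1e-6 would pass the general threshold but is rejected by the stricter one, which is the point of having a separate cutoff for resistance and virulence annotations.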
Plotting
phold plot will allow you to create Circos plots with pyCirclize for all your phage(s). For example:
Bouras G, Grigson SR, Mirdita M, Heinzinger M, Papudeshi B, Mallawaarachchi V, Green R, Kim SR, Mihalia V, Psaltis AJ, Wormald P-J, Vreugde S, Steinegger M, Edwards RA: “Protein Structure Informed Bacteriophage Genome Annotation with Phold”, Nucleic Acids Research, Volume 54, Issue 1, 13 January 2026, gkaf1448, https://doi.org/10.1093/nar/gkaf1448
Please be sure to cite the following core dependencies and PHROGs database - citing all bioinformatics tools that you use helps us, so helps you get better bioinformatics tools:
ProstT5 - (https://github.com/mheinzinger/ProstT5) Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics (2024) doi:10.1101/2023.07.23.550085
PHROGs - (https://phrogs.lmge.uca.fr) Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.A., Enault F., “PHROG : families of prokaryotic virus proteins clustered using remote homology”, NAR Genomics and Bioinformatics, (2021) https://doi.org/10.1093/nargab/lqab067
Please also consider citing these supplementary databases where relevant:
CARD - Alcock B.P. et al, CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research (2022) https://doi.org/10.1093/nar/gkac920
VFDB - Chen L., Yang J., Yao Z., Sun L., Shen Y., Jin Q., “VFDB: a reference database for bacterial virulence factors”, Nucleic Acids Research (2005) https://doi.org/10.1093/nar/gki008
Defensefinder - F. Tesson, R. Planel, A. Egorov, H. Georjon, H. Vaysset, B. Brancotte, B. Néron, E. Mordret, A Bernheim, G. Atkinson, J. Cury. A Comprehensive Resource for Exploring Antiphage Defense: DefenseFinder Webservice, Wiki and Databases. bioRxiv (2024) https://doi.org/10.1101/2024.01.25.577194
Netflax - Karin Ernits, Chayan Kumar Saha, Tetiana Brodiazhenko, Bhanu Chouhan, Aditi Shenoy, Jessica A. Buttress, Julián J. Duque-Pedraza, Veda Bojar, Jose A. Nakamoto, Tatsuaki Kurata, Artyom A. Egorov, Lena Shyrokova, Marcus J. O. Johansson, Toomas Mets, Aytan Rustamova, Jelisaveta Džigurski, Tanel Tenson, Abel Garcia-Pino, Henrik Strahl, Arne Elofsson, Vasili Hauryliuk, and Gemma C. Atkinson, The structural basis of hyperpromiscuity in a core combinatorial network of type II toxin–antitoxin and related phage defense systems. PNAS (2023) https://doi.org/10.1073/pnas.2305393120
phold - Phage Annotation using Protein Structures
phold is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology.
To learn more about phold, please read our manuscript: https://academic.oup.com/nar/article/54/1/gkaf1448/8415830
Tutorial
Check out the phold tutorial at https://phold.readthedocs.io/en/latest/tutorial/.
Google Colab Notebooks
If you don’t want to install phold locally, you can run it without any code using one of the following Google Colab notebooks:
pharokka + phold + phynteny: use this link
phold
Documentation
Check out the full documentation at https://phold.readthedocs.io.
Installation
To install phold using conda:
To utilise phold with GPU, a GPU compatible version of pytorch must be installed. By default conda will install a CPU-only version.
If you have an NVIDIA GPU, please try:
If you have a Mac running an Apple Silicon chip (M1/M2/M3/M4), phold should be able to use the GPU. Please try:
When running phold in 2 steps, the second step compares the predicted 3Di sequences to the phold structure database with Foldseek using phold compare. This does not utilise a GPU.
ACRDB - Harutyun Sahakyan, Kira S. Makarova, and Eugene V. Koonin. Search for Origins of Anti-CRISPR Proteins by Structure Comparison. The CRISPR Journal (2023)