Foldseek enables fast and sensitive comparisons of large protein structure sets, supporting monomer and multimer searches, as well as clustering. It runs on CPU, supports GPU acceleration for faster searches, and optionally allows ultra-fast and sensitive comparisons directly from protein sequence inputs using a language model, bypassing the need for structures.
[!NOTE]
We recently added support for GPU-accelerated protein sequence and profile searches. Full speed requires an NVIDIA GPU of the Ampere generation or newer; Turing-generation GPUs also work, at reduced speed. The Bioconda package and the precompiled binaries will not work on older GPU generations (e.g. Volta or Pascal).
Memory requirements
For optimal performance, choose one of three options based on your available RAM and search requirements:
With Cα info (default).
Estimate RAM as (6 bytes Cα + 1 byte 3Di + 1 byte AA) * (database residues). The 54M AFDB50 entries require 151GB.
Without Cα info.
Setting --sort-by-structure-bits 0 disables sorting by structure bits and reduces the RAM requirement to 35GB. This alters hit rankings and final scores but not E-values. Structure bits mostly matter for ranking hits with E-value > 10^-1.
Single query searches.
Use --prefilter-mode 1, which is not memory-limited and computes all optimal ungapped alignments. This mode makes full use of Foldseek’s multithreading for single queries and supports GPU acceleration.
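The default-mode formula above can be turned into a quick estimate. The following is a minimal sketch, not part of Foldseek: the function names are hypothetical, and it assumes that "GB" means 10^9 bytes and that the residue count of your database is already known.

```python
# Bytes per residue taken from the formula above (default mode, with Cα info).
BYTES_CA = 6
BYTES_3DI = 1
BYTES_AA = 1
BYTES_PER_RESIDUE = BYTES_CA + BYTES_3DI + BYTES_AA


def estimate_ram_bytes(database_residues: int) -> int:
    """RAM estimate in bytes for a database with the given residue count."""
    return BYTES_PER_RESIDUE * database_residues


def residues_for_ram(ram_bytes: int) -> int:
    """Largest residue count that fits into ram_bytes under the same formula."""
    return ram_bytes // BYTES_PER_RESIDUE


if __name__ == "__main__":
    n_residues = 1_000_000_000  # hypothetical database size
    print(f"{estimate_ram_bytes(n_residues) / 1e9:.1f} GB")
```

Read backwards, the same formula implies that the 151GB quoted for AFDB50 corresponds to roughly 18.9 billion residues.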
Tutorial Video
A Foldseek tutorial covering the webserver and command-line usage is available here.
Documentation
Many of Foldseek’s modules (subprograms) rely on MMseqs2. For more information about these modules, refer to the MMseqs2 wiki. For documentation specific to Foldseek, check out the Foldseek wiki here.
Quick start
Search
The easy-search module queries one or more single-chain proteins, provided either as protein structures in PDB/mmCIF format (flat or gzipped) or as protein sequences in FASTA format, against a target database, a folder, or individual single-chain protein structures (for multi-chain proteins see Multimersearch). The default alignment output is a tab-separated file, but Foldseek also supports superposed Cα PDB files and HTML.
The default output fields are: query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits. They can be customized with the --format-output option; for example, --format-output "query,target,qaln,taln" returns the query and target accessions and the pairwise alignments in tab-separated format. Many other output columns are available.
| Code | Description |
| --- | --- |
| query | Query sequence identifier |
| target | Target sequence identifier |
| qca | Cα coordinates of the query |
| tca | Cα coordinates of the target |
| alntmscore | TM-score of the alignment |
| qtmscore | TM-score normalized by the query length |
| ttmscore | TM-score normalized by the target length |
| u | Rotation matrix (computed by TM-score) |
| t | Translation vector (computed by TM-score) |
| lddt | Average LDDT of the alignment |
| lddtfull | LDDT per aligned position |
| prob | Estimated probability for query and target to be homologous (e.g. being within the same SCOPe superfamily) |

Check out the MMseqs2 documentation for additional output format codes.
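As an illustration of working with the default tab-separated output, here is a minimal Python sketch that reads rows using the twelve default field names listed above. The helper name `parse_hits`, the target identifier and all values in the sample row are made up for the example. The type conversions are an assumption about how the columns are formatted, and the field list must be adjusted if --format-output is customized.

```python
import csv
import io

# Default column order, copied from the field list above.
DEFAULT_FIELDS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
# Assumed column types; everything else is kept as a string.
INT_FIELDS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_FIELDS = {"fident", "evalue", "bits"}


def parse_hits(handle, fields=DEFAULT_FIELDS):
    """Yield one dict per alignment row of a tab-separated result file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if len(row) != len(fields):
            raise ValueError(f"expected {len(fields)} columns, got {len(row)}")
        hit = {}
        for name, value in zip(fields, row):
            if name in INT_FIELDS:
                hit[name] = int(value)
            elif name in FLOAT_FIELDS:
                hit[name] = float(value)
            else:
                hit[name] = value
        yield hit


# Illustrative row only; not real Foldseek output.
sample = "d1asha_\ttargetA\t0.250\t144\t103\t2\t2\t145\t3\t141\t1.5E-10\t312\n"
hits = list(parse_hits(io.StringIO(sample)))
```

To read a real result file, pass an open file handle (for example `open("aln.m8")`) instead of the `io.StringIO` sample.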
Superpositioned Cα only PDB files
Foldseek’s --format-mode 5 generates PDB files with all target Cα atoms superimposed onto the query structure based on the aligned coordinates.
For each pairwise alignment it writes a separate PDB file, so be careful when using this option for large searches.
Interactive HTML
A locally run Foldseek can generate an HTML search result, similar to the one produced by the webserver, by specifying --format-mode 3.
Important search parameters
| Option | Category | Description |
| --- | --- | --- |
| -s | Sensitivity | Adjust sensitivity to speed trade-off; lower is faster, higher more sensitive (fast: 7.5, default: 9.5) |
| --num-iterations | Sensitivity | Enables iterative search to find more distantly related hits (default: off). Recommended: --num-iterations 0 (optimized version) |
| --exhaustive-search | Sensitivity | Skips prefilter and performs an all-vs-all alignment (more sensitive but much slower) |
| --max-seqs | Sensitivity | Adjust the number of prefilter hits passed on to alignment; increasing it can lead to more hits (default: 1000) |
| -e | Sensitivity | List matches below this E-value (range 0.0-inf, default: 0.001); increasing it reports more distant structures |
| --cluster-search | Sensitivity | For clustered databases like AFDB50 or CATH50, trigger a cluster search: 0: search only representatives (fast), 1: align and report also all members of a cluster (default: 0) |
| -c | Alignment | List matches above this fraction of aligned (covered) residues (see --cov-mode) (default: 0.0); higher coverage = more global alignment |
| --cov-mode | Alignment | 0: coverage of query and target, 1: coverage of target, 2: coverage of query |
| --gpu | Performance | Enables fast GPU-accelerated ungapped prefilter (--prefilter-mode 1) (default: off), ignores -s. Use --gpu 1 to enable. |
Alignment Mode
By default, Foldseek uses its local 3Di+AA structural alignment, but it also supports realigning hits using the global TMalign or local LoLalign, as well as rescoring alignments using TMscore or LoLscore respectively.
If alignment type is set to tmalign (--alignment-type 1), the results will be sorted by the TMscore normalized by query length. The TMscore is used for reporting two fields: the e-value=(qTMscore+tTMscore)/2 and the score=(qTMscore*100). All output fields (e.g., pident, fident, and alnlen) are calculated based on the TMalign alignment.
If alignment type is set to lolalign (--alignment-type 3), the result will be sorted by the LoLscore, a novel alignment log-odds score without length normalization. When set to single domain mode (--lolalign-multidomain 0) the query and target lengths are incorporated. The e-value is a normalized LoLscore (<= 1) while the score is unnormalized. All output fields (e.g., pident, fident, and alnlen) are calculated based on the LoLalign alignment.
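To make the TMalign reporting rule above concrete, the following short sketch reproduces the arithmetic in Python. It is not Foldseek code: the function name, target names and scores are hypothetical, and it only restates the two formulas given in the text, namely the e-value field as the mean of both TM-scores and the score field as the query-normalized TM-score times 100.

```python
def tmalign_reported_fields(q_tmscore: float, t_tmscore: float):
    """Return (e-value field, score field) as described for --alignment-type 1."""
    evalue_field = (q_tmscore + t_tmscore) / 2
    score_field = q_tmscore * 100
    return evalue_field, score_field


# Hypothetical hits: (target name, qTMscore, tTMscore).
hits = [("targetA", 0.50, 0.75), ("targetB", 0.90, 0.40)]

# Results are sorted by the TM-score normalized by query length, best first.
ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Note that in this mode the e-value column holds a TM-score average, so larger values indicate better matches.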
Databases
The databases command downloads pre-generated databases like PDB or AlphaFoldDB.
The target database can be pre-processed by createdb. This is useful when searching multiple times against the same set of target structures.
foldseek createdb example/ targetDB
foldseek createindex targetDB tmp #OPTIONAL generates and stores the index on disk
foldseek easy-search example/d1asha_ targetDB aln.m8 tmpFolder
Create custom database from protein sequence (FASTA)
Create a structural database from FASTA files using the ProstT5 protein language model. It runs on CPU by default and is about 400-4000x faster than predicting structures with ColabFold.
However, this database will contain only the predicted 3Di structural sequences without additional structural details.
As a result, it supports monomer search and clustering, but does not enable features requiring Cα information, such as --alignment-type 1, TM-score or LDDT output.
Accelerate inference by one to two orders of magnitude using GPU(s) (--gpu 1):
foldseek createdb db.fasta db --prostt5-model weights --gpu 1
Use the CUDA_VISIBLE_DEVICES variable to select the GPU device(s).
CUDA_VISIBLE_DEVICES=0 to use GPU 0.
CUDA_VISIBLE_DEVICES=0,1 to use GPUs 0 and 1.
Pad database for fast GPU search
GPU searches require the database to be reformatted, with padding added to each sequence using the makepaddedseqdb command. The padded database can be used for both CPU and GPU searches.
# Prepare the database for GPU search
foldseek makepaddedseqdb db db_pad
# Perform GPU search
foldseek search db db_pad result_dir --gpu 1
Cluster
The easy-cluster algorithm performs structural clustering by assigning structures to a representative protein structure using structural alignment. It accepts input either as protein structures in PDB/mmCIF format or as protein sequences in FASTA format, with support for both flat and gzipped files. By default, easy-cluster generates three output files with the following suffixes: (1) _clu.tsv, (2) _repseq.fasta, and (3) _allseq.fasta. The first file is a tab-separated file describing the mapping from representative to member, the second file contains only the representative sequences, and the third file includes all cluster member sequences.
foldseek easy-cluster example/ res tmp -c 0.9
Output Cluster
Tab-separated cluster
The provided format represents protein structure clustering in a tab-separated, two-column layout (representative and member). Each line denotes a cluster-representative and cluster-member relationship, signifying that the member shares significant structural similarity with the representative, and thus belongs to the same cluster.
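The two-column layout described above can be grouped into clusters with a few lines of Python. This is a minimal sketch with a hypothetical helper name. The sample rows reuse the identifiers from the FASTA example in this section plus one made-up member (C0W539), and they assume that each representative is also listed as a member of its own cluster.

```python
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each representative to the list of its members (in file order)."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


# Illustrative content of a *_clu.tsv file; not real output.
sample = "Q0KJ32\tQ0KJ32\nQ0KJ32\tC0W539\nE3HQM9\tE3HQM9\n"
clusters = read_clusters(io.StringIO(sample))
```

For a real run, open the _clu.tsv file written by easy-cluster and pass the file handle to `read_clusters`.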
Representative fasta
The _repseq.fasta file contains all representative protein sequences of the clustering.
>Q0KJ32
MAGA....R
>E3HQM9
MCAT...Q
All member fasta
In the _allseq.fasta file all sequences of the cluster are present. A new cluster is marked by two identical name lines of the representative sequence, where the first line stands for the cluster and the second is the name line of the first cluster sequence. It is followed by the fasta formatted sequences of all its members.
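The cluster-marker convention described above (two consecutive name lines open a new cluster) can be parsed as in the sketch below. The parser is a hypothetical helper, not part of Foldseek. The sequences are shortened placeholders and C0W539 is a made-up member identifier.

```python
def parse_allseq(lines):
    """Group FASTA records by cluster.

    Returns a dict: representative name -> list of (member name, sequence).
    """
    clusters = {}
    current = None  # representative of the cluster being filled
    header = None   # latest name line that has not been followed by sequence yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                # Two name lines in a row: the first one marks a new cluster.
                current = header
                clusters[current] = []
            header = line[1:]
        else:
            if current is None:
                raise ValueError("sequence found before any cluster marker")
            if header is not None:
                clusters[current].append((header, line))
                header = None
            else:
                # Continuation line of a sequence wrapped over several lines.
                name, seq = clusters[current][-1]
                clusters[current][-1] = (name, seq + line)
    return clusters


# Illustrative file content with placeholder sequences.
sample = """>Q0KJ32
>Q0KJ32
MAGA
>C0W539
MVGA
>E3HQM9
>E3HQM9
MCAT
""".splitlines()
allseq_clusters = parse_allseq(sample)
```

The same function accepts an open file handle, since it only iterates over lines.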
Important cluster parameters
| Option | Category | Description |
| --- | --- | --- |
| -c | Alignment | List matches above this fraction of aligned (covered) residues (see --cov-mode) (default: 0.0); higher coverage = more global alignment |
| --cov-mode | Alignment | 0: coverage of query and target, 1: coverage of target, 2: coverage of query |
| --min-seq-id | Alignment | The minimum sequence identity to be clustered |
| --tmscore-threshold | Alignment | Accept alignments with an alignment TMscore > thr |
| --tmscore-threshold-mode | Alignment | Normalize TMscore by 0: alignment, 1: representative, 2: member length |
| --lddt-threshold | Alignment | Accept alignments with an alignment LDDT score > thr |
Multimersearch
The easy-multimersearch module is designed for querying one or more protein complex (multi-chain) structures (supported input formats: PDB/mmCIF, flat or gzipped) against a target database of protein complex structures. It reports similarity metrics between the complexes (e.g., the TMscore).
Using Multimersearch
The examples below use files that can be found in the example directory, which is part of the Foldseek repo, if you clone it.
If you use the precompiled version of the software, you can download the files directly: 1tim.pdb.gz and 8tim.pdb.gz.
For a pairwise alignment of complexes using easy-multimersearch, run the following command:
foldseek easy-multimersearch example/1tim.pdb.gz example/8tim.pdb.gz result tmpFolder
Foldseek easy-multimersearch can also be used for searching one or more query complexes against a target database:
By default, easy-multimersearch reports the output alignment in a tab-separated file.
The default output fields are: query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,complexassignid but they can be customized with the --format-output option e.g., --format-output "query,target,complexqtmscore,complexttmscore,complexassignid" alters the output to show specific scores and identifiers.
| Code | Description |
| --- | --- |
| Commons | |
| query | Query sequence identifier |
| target | Target sequence identifier |
| Only for scorecomplex | |
| complexqtmscore | TM-score of the complex alignment normalized by the query length |
| complexttmscore | TM-score of the complex alignment normalized by the target length |
| complexu | Rotation matrix of the complex alignment (computed by TM-score) |
| complext | Translation vector of the complex alignment (computed by TM-score) |
| qcomplexcoverage | Average coverage of the complex alignment normalized by the query length |
| tcomplexcoverage | Average coverage of the complex alignment normalized by the target length |
| qchaintms | TM-score of each chain in the complex alignment, normalized by the query length |
| tchaintms | TM-score of each chain in the complex alignment, normalized by the target length |
Multimercluster
The easy-multimercluster module is designed for multimer-level structural clustering (supported input formats: PDB/mmCIF, flat or gzipped). By default, easy-multimercluster generates three output files with the following suffixes: (1) _cluster.tsv, (2) _rep_seq.fasta and (3) _cluster_report. The first file is a tab-separated file describing the mapping from representative multimer to member, the second file contains only the representative sequences, and the third file is a tab-separated file describing the filtered alignments.
Make sure chain names in PDB/mmCIF files do not contain underscores (_).
The _cluster_report contains qcoverage, tcoverage, multimer qTm, multimer tTm, interface lddt, ustring, tstring of alignments after filtering and before clustering.
The query and target coverages here represent the sum of the coverages of all aligned chains, divided by the total query and target multimer length respectively.
Important multimer cluster parameters
| Option | Category | Description |
| --- | --- | --- |
| -e | Sensitivity | List matches below this E-value (range 0.0-inf, default: 0.001); increasing it reports more distant structures |
| -c | Alignment | List matches above this fraction of aligned (covered) residues (see --cov-mode) (default: 0.0); higher coverage = more global alignment |
| --cov-mode | Alignment | 0: coverage of query and target (cluster multimers only with same chain numbers), 1: coverage of target, 2: coverage of query |
| --multimer-tm-threshold | Alignment | Accept alignments with multimer alignment TMscore > thr |
| --chain-tm-threshold | Alignment | Accept alignments if every single chain TMscore > thr |
| --interface-lddt-threshold | Alignment | Accept alignments with an interface LDDT score > thr |
Foldseek
Publications
van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist CLM, Wein T, Varadi M, Velankar S, Beltrao P and Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature, doi:10.1038/s41586-023-06510-w (2023)
Kim W, Mirdita M, Levy Karin E, Gilchrist CLM, Schweke H, Söding J, Levy E, and Steinegger M. Rapid and sensitive protein complex alignment with Foldseek-Multimer. Nature Methods, doi:10.1038/s41592-025-02593-7 (2025)
Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Cha S, Dallago C, Mirdita M, Schmidt B, and Steinegger M. GPU-accelerated homology search with MMseqs2. bioRxiv, doi:10.1101/2024.11.13.623350 (2024)
Webserver
Search your protein structures against the AlphaFoldDB and PDB in seconds using the Foldseek webserver (code): search.foldseek.com 🚀
Installation
Other precompiled binaries are available at https://mmseqs.com/foldseek.
Complex Report
easy-multimersearch also generates a report (prefixed _report), which provides a summary of the inter-complex chain matching, including identifiers, chains, TMscores, rotation matrices, translation vectors, and assignment IDs. The report includes the following fields:
| Column | Description |
| --- | --- |
| 1 | Identifier of the query complex |
| 2 | Identifier of the target complex |
| 3 | Comma separated matched chains in the query complex |
| 4 | Comma separated matched chains in the target complex |
| 5 | TM score normalized by query length [0-1] |
| 6 | TM score normalized by target length [0-1] |
| 7 | Comma separated nine rotation matrix (U) values |
| 8 | Comma separated three translation vector (T) values |
| 9 | Complex alignment ID |
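The nine report columns listed above map naturally onto a small record type. The sketch below is a hypothetical reader, not part of Foldseek. It assumes the columns are tab-separated and that the nine U values are stored row by row, neither of which is stated above, and the sample line contains made-up values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComplexReportRow:
    query: str
    target: str
    query_chains: List[str]
    target_chains: List[str]
    qtm: float
    ttm: float
    u: List[List[float]]  # 3x3 rotation matrix (row-major order is assumed)
    t: List[float]        # translation vector, 3 values
    assign_id: str


def parse_report_line(line: str) -> ComplexReportRow:
    """Parse one line of the complex report into a ComplexReportRow."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    u_values = [float(x) for x in cols[6].split(",")]
    t_values = [float(x) for x in cols[7].split(",")]
    if len(u_values) != 9 or len(t_values) != 3:
        raise ValueError("expected 9 rotation and 3 translation values")
    return ComplexReportRow(
        query=cols[0],
        target=cols[1],
        query_chains=cols[2].split(","),
        target_chains=cols[3].split(","),
        qtm=float(cols[4]),
        ttm=float(cols[5]),
        u=[u_values[i:i + 3] for i in range(0, 9, 3)],
        t=t_values,
        assign_id=cols[8],
    )


# Illustrative line only; the numbers are not real results.
sample_line = "1tim.pdb.gz\t8tim.pdb.gz\tA,B\tA,B\t0.98\t0.97\t1,0,0,0,1,0,0,0,1\t0.5,-1.0,2.0\t0\n"
row = parse_report_line(sample_line)
```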
Main Modules
easy-search fast protein structure search
easy-cluster fast protein structure clustering
easy-multimersearch fast protein multimer-level structure search
easy-multimercluster fast protein multimer-level structure clustering
createdb create a database from protein structures (PDB, mmCIF, mmJSON)
databases download pre-assembled databases
Examples
Faster Search with GPU Acceleration
Foldseek’s prefilter on a 4090 GPU is four times faster than a 64-core CPU. To use GPU-based ungapped alignment for faster prefiltering, ensure you have a CUDA-enabled GPU and specify the --gpu option:
Use the CUDA_VISIBLE_DEVICES variable to select the GPU device(s).
CUDA_VISIBLE_DEVICES=0 to use GPU 0.
CUDA_VISIBLE_DEVICES=0,1 to use GPUs 0 and 1.
Fast structure search from FASTA input
Protein sequences can be directly searched without requiring existing protein structures by using ProstT5, which is approximately 400–4000x faster than predicting structures with ColabFold. Read more here.
The translation with ProstT5 can be accelerated by using GPU(s) (--gpu 1), and multiple GPUs can be used by setting the CUDA_VISIBLE_DEVICES variable.
Rescore alignments using TMscore
The easiest way to get the alignment TMscore normalized by min(alnLen,qLen,targetLen) as well as a rotation matrix is through the following command:
Alternatively, it is possible to compute TMscores for the kind of alignment output (e.g., 3Di+AA) using the following commands:
Output format
aln_tmscore.tsv: query and target identifiers, TMscore, translation vector (3 values) and rotation matrix (3x3 values)
Query centered multiple sequence alignment
Foldseek can output multiple sequence alignments in a3m format using the following commands. To convert a3m to FASTA format, the reformat.pl script can be used (reformat.pl in.a3m out.fas).
For a non-query centered multiple sequence alignment please check out Foldmason.
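As a sketch of how the aln_tmscore.tsv layout from the rescoring section could be consumed, the Python snippet below splits one row into its parts and applies the transformation x' = U·x + t to a coordinate. Several points are assumptions rather than documented behaviour: that the fields are whitespace-separated, that the nine rotation values are stored row by row, and that the transformation is applied in this form. The helper names and the sample row are hypothetical.

```python
def parse_tmscore_line(line: str):
    """Split a row into (query, target, TMscore, t, U)."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    query, target = fields[0], fields[1]
    tmscore = float(fields[2])
    t = [float(x) for x in fields[3:6]]          # translation vector (3 values)
    u_flat = [float(x) for x in fields[6:15]]    # rotation matrix (9 values)
    u = [u_flat[0:3], u_flat[3:6], u_flat[6:9]]  # row-major order is assumed
    return query, target, tmscore, t, u


def transform(point, u, t):
    """Apply x' = U * x + t to a single 3D coordinate."""
    return tuple(
        sum(u[i][j] * point[j] for j in range(3)) + t[i]
        for i in range(3)
    )


# Illustrative row: a 90 degree rotation about the z axis plus a shift of (1, 2, 3).
sample_row = "queryA targetB 0.5 1 2 3 0 -1 0 1 0 0 0 0 1"
query, target, tmscore, t, u = parse_tmscore_line(sample_row)
moved = transform((1.0, 0.0, 0.0), u, t)
```

Which of the two structures the transformation should be applied to is not stated above, so check this on a known pair before relying on the result.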