Getting Started
# download and compile
git clone https://github.com/lh3/miniprot
cd miniprot && make
# test file
./miniprot test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz > aln.paf # PAF output
./miniprot --gff test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz > aln.gff # GFF3+PAF output
# general command line: index and align in one go (-I sets max intron size based on genome size)
./miniprot -Iut16 --gff genome.fna protein.faa > aln.gff
# general command line: index first and then align (recommended)
./miniprot -t16 -d genome.mpi genome.fna
./miniprot -Iut16 --gff genome.mpi protein.faa > aln.gff
# minisplice integration with pre-trained model for vertebrate and insect
wget -O- https://zenodo.org/records/15670304/files/vi2-7k.tgz | tar zxf -
minisplice predict -t16 -c vi2-7k.kan.cali vi2-7k.kan genome.fa.gz > score.tsv
miniprot -Iut16 --gff -j2 --spsc=score.tsv genome.fa.gz proteins.faa > align.gff
# output format
man ./miniprot.1
Introduction
Miniprot aligns a protein sequence against a genome with affine gap penalty,
splicing and frameshift. It is primarily intended for annotating protein-coding
genes in a new species using known genes from other species. Miniprot is
similar to GeneWise and Exonerate in functionality but
it can map proteins to whole genomes and is much faster at the residue
alignment step.
Miniprot is not optimized for mapping distant homologs because distant homologs
are less informative to gene annotations. Nonetheless, it is still possible to
tune seeding parameters to achieve higher sensitivity at the cost of
performance.
Users’ Guide
Installation
Miniprot requires SSE2 or NEON instructions and only works on x86_64 or ARM
CPUs. It depends on zlib for parsing gzip’d input files. To compile
miniprot, type make in the source code directory. This will produce a
standalone executable miniprot. This executable is all you need to invoke
miniprot.
For some unknown reason, the default gcc-4.8.5 on CentOS 7 may compile a binary
that is very slow on certain sequences but gcc-10.3.0 has more stable
performance. If possible, use a more recent gcc to compile miniprot.
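Before compiling, it can help to check which gcc is on the PATH. The following is a minimal sketch of such a pre-flight check. The helper name gcc_older_than and the major-version threshold of 10 are assumptions chosen for illustration from the two versions mentioned above; they are not a documented requirement of miniprot.

```sh
# Hypothetical helper: succeed when a dotted version string (e.g. "4.8.5")
# has a major version lower than the given threshold.
gcc_older_than() {
    major=${1%%.*}            # keep the text before the first dot
    [ "$major" -lt "$2" ]
}

# Warn (but do not fail) when the default gcc looks old.
# The threshold 10 is an assumption based on the gcc-10.3.0 report above.
if command -v gcc >/dev/null 2>&1; then
    ver=$(gcc -dumpversion)
    if gcc_older_than "$ver" 10; then
        echo "gcc $ver detected; consider compiling miniprot with a newer gcc" >&2
    fi
fi
```

The check only prints a warning; whether an older compiler actually produces a slow binary depends on the system.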
Usage
To run miniprot, use
miniprot -t8 ref-file protein.faa > output.paf
where ref-file can either be a genome in the FASTA format or a pre-built
index generated by
miniprot -t8 -d ref.mpi ref.fna
Because miniprot indexing is slow and memory intensive, it is recommended to
pre-build the index. FASTA input files can be optionally compressed with gzip.
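One way to follow this recommendation in a script is to build the index only when it is missing and reuse it afterwards. The sketch below uses only the miniprot options shown above (-t8 and -d). The function names mpi_name and align_with_index are hypothetical, and writing the index to the current directory is an assumption made for brevity.

```sh
# Hypothetical helper: derive an index file name from a genome FASTA name,
# e.g. /data/genome.fna.gz -> genome.mpi (written to the current directory).
mpi_name() {
    f=${1##*/}      # drop the directory part
    f=${f%.gz}      # drop an optional .gz suffix
    f=${f%.*}       # drop the FASTA extension
    printf '%s.mpi\n' "$f"
}

# Build the index if it does not exist yet, then align proteins against it.
# PAF output goes to stdout.
align_with_index() {
    genome=$1
    proteins=$2
    idx=$(mpi_name "$genome")
    [ -s "$idx" ] || miniprot -t8 -d "$idx" "$genome" || return 1
    miniprot -t8 "$idx" "$proteins"
}
```

A call such as align_with_index genome.fna.gz protein.faa > output.paf would index on the first run only; later runs reuse genome.mpi.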
Miniprot outputs alignments in the protein PAF format. Unlike the more
common nucleotide PAF format, protein PAF uses additional CIGAR operators to
encode introns and frameshifts. Please refer to the manpage for a detailed
explanation.
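For a quick look at where proteins were placed, the leading PAF columns can be reshaped into BED-like intervals. This sketch assumes that the first 12 columns follow the usual PAF layout (query name, query length, query start, query end, strand, target name, target length, target start, target end, matches, block length, mapping quality); check the manpage before relying on it. The function name paf_to_bed is hypothetical.

```sh
# Hypothetical helper: convert PAF records to 6-column BED-like lines:
# target name, target start, target end, query name, mapping quality, strand.
# Assumes the standard 12 leading PAF columns; shorter lines are skipped.
paf_to_bed() {
    awk -F'\t' -v OFS='\t' 'NF >= 12 { print $6, $8, $9, $1, $12, $5 }' "$@"
}
```

It reads files given as arguments or standard input, for example paf_to_bed output.paf > hits.bed.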
For convenience, miniprot can also output GFF3 with option --gff:
miniprot -t8 --gff ref.mpi protein.faa > out.gff
The detailed alignment is embedded in ##PAF lines in the GFF3 output. You can
also get detailed residue alignment with --aln.
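If a downstream tool expects plain PAF, the embedded records can be pulled back out of the GFF3 file. The sketch below assumes each embedded record is a line that starts with the ##PAF tag followed by a tab and then the PAF fields; the function name gff_to_paf is hypothetical.

```sh
# Hypothetical helper: print the PAF records embedded in miniprot GFF3 output.
# Assumed layout of an embedded record: "##PAF<TAB>field1<TAB>field2...".
gff_to_paf() {
    awk '/^##PAF\t/ { sub(/^##PAF\t/, ""); print }' "$@"
}
```

For example, gff_to_paf out.gff > out.paf keeps only the embedded alignment records and drops the GFF3 feature lines.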
If you are aligning proteins to a whole genome, it is recommended to add option
-I to let miniprot automatically set the maximum intron size. You can also
use -G to explicitly specify the max intron size.
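Miniprot can optionally take splice scores computed with minisplice.
For vertebrates and insects, which have pre-trained minisplice models,
you can compute splice scores with minisplice and feed the scores to miniprot:
minisplice predict -t16 -c vi2-7k.kan.cali vi2-7k.kan genome.fa.gz > score.tsv
miniprot -Iut16 --gff -j2 --spsc=score.tsv genome.fa.gz proteins.faa > align.gff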
Algorithm overview
1. Translate the reference genome to amino acids in six phases and filter out
   ORFs shorter than 45bp. Reduce 20 amino acids to 13 distinct integers and
   extract random open syncmers of 6aa in length. By default, miniprot selects
   20% of 6-mers on average. For a reduced 6-mer at reference position x,
   keep the 6-mer and floor(x/256) in a dense hash table. This concludes the
   indexing step.
2. Given a protein sequence as query, extract 6-mer syncmers on the protein,
   look up the index for seed matches and apply minimap2-like chaining. This
   first round of chaining is approximate as the reference positions have been
   binned during indexing.
3. For each chain in step 2, redo seeding and chaining with sliding 5-mers from
   both the reference and the protein in the original chain. Miniprot uses all
   reduced 5-mers for this second round of chaining.
4. Choose top 100 (see -N) chains. Filter out anchors around potential
   introns or long gaps. Perform striped dynamic programming between remaining
   anchors and also extend from the first or last anchors. This gives the final
   alignment.
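Two small details of the indexing step can be illustrated with shell arithmetic. This is only a sketch and not miniprot's implementation: the function names are hypothetical, and treating an ORF as a stop-to-stop stretch of a translated frame, with '*' marking a stop codon, is an assumption. The first helper shows the floor(x/256) binning that makes the first round of chaining approximate; the second applies a minimum length of 15 residues, which corresponds to 45bp.

```sh
# Bin a reference position as described for the index: floor(x/256).
bin_pos() {
    echo $(( $1 / 256 ))      # integer division floors for non-negative x
}

# Keep stretches between stop codons that are at least MIN residues long
# (default 15 aa = 45 bp). Reads one translated frame on stdin.
# Assumption: '*' marks a stop codon in the translated sequence.
filter_orfs() {
    tr '*' '\n' | awk -v min="${1:-15}" 'length($0) >= min'
}

bin_pos 300    # prints 1
bin_pos 511    # prints 1: same bin as 300, so the two positions are indistinguishable
bin_pos 512    # prints 2
```

Because all positions from 256 to 511 share bin 1, seed matches found through the index locate a hit only to within a 256-position block, which is why a second, exact round of seeding and chaining follows.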
Citing miniprot
If you use miniprot, please cite:
Li, H. (2023) Protein-to-genome alignment with miniprot. Bioinformatics, 39, btad014 [PMID: 36648328].
The preprint is available at
arXiv:2210.08052, which
additionally shows metrics on MetaEuk. Please note that the published paper
evaluated miniprot-0.7. The latest version may report different numbers.
Limitations
The initial conditions of dynamic programming are not technically correct,
which may result in suboptimal residue alignment in rare cases.
Support for non-splicing alignment needs to be improved.