Strobealign is a read mapper that is typically significantly faster than other read mappers while achieving comparable or better accuracy, see the performance evaluation.
Features
Map single-end and paired-end reads
Multithreading support
Fast indexing (<1 minute for a human-sized reference genome using four cores)
On-the-fly indexing by default. Optionally create an on-disk index.
Output in standard SAM format or produce even faster results by writing PAF (without alignments)
Strobealign is most suited for read lengths between 100 and 500 bp
Usage
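To align paired-end reads against a reference FASTA and produce a sorted BAM file,
pipe the output of strobealign into samtools sort. Use -t to set the number of
threads, for example -t 8 for eight threads.
Single-end reads and mixed reads (one input file containing both single-end and
paired-end reads) are also supported.
In alignment mode, strobealign produces SAM output. Piping the output
directly into samtools avoids creating potentially large
intermediate SAM files and also reduces disk I/O.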
To produce unsorted BAM, use samtools view instead of samtools sort.
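The pipeline described above can also be driven from a script. The following is a minimal Python sketch, not an official wrapper: the helper names and the file names are hypothetical, and placing the reference before the read files on the command line is an assumption. Only the -t option of strobealign and the -o option of samtools sort are used.

```python
import shlex
import subprocess


def build_alignment_pipeline(reference, reads, threads=8, output_bam="sorted.bam"):
    """Return the argument lists for `strobealign ... | samtools sort`.

    Assumption: the reference FASTA comes first, followed by the read files.
    """
    align_cmd = ["strobealign", "-t", str(threads), reference, *reads]
    sort_cmd = ["samtools", "sort", "-o", output_bam]
    return align_cmd, sort_cmd


def as_shell(align_cmd, sort_cmd):
    """Render the two commands as one shell pipeline, for logging."""
    return f"{shlex.join(align_cmd)} | {shlex.join(sort_cmd)}"


def run_pipeline(align_cmd, sort_cmd):
    """Run both commands, feeding strobealign's SAM output to samtools.

    No intermediate SAM file is written to disk.
    """
    with subprocess.Popen(align_cmd, stdout=subprocess.PIPE) as aligner:
        sorter = subprocess.run(sort_cmd, stdin=aligner.stdout)
    if aligner.returncode != 0 or sorter.returncode != 0:
        raise RuntimeError("alignment pipeline failed")


if __name__ == "__main__":
    # Hypothetical file names; this only prints the pipeline, it does not run it.
    align, sort = build_alignment_pipeline(
        "ref.fa", ["reads.1.fastq.gz", "reads.2.fastq.gz"]
    )
    print(as_shell(align, sort))
```

For unsorted BAM, the second command would be replaced by a samtools view invocation, as noted above.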
Mapping-only mode
The command-line option -x switches strobealign into mapping-only mode,
in which it outputs PAF instead of SAM.
The output can be compressed by piping it through igzip, a faster version of
gzip that is part of ISA-L.
If igzip is not available, use pigz or regular gzip instead.
PAF output includes only mapped reads. Unmapped reads are omitted. This is also
true for paired-end reads: If one of the reads is unmapped, only the mapped one
is output.
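Because unmapped reads do not appear in PAF output, finding them means comparing the PAF records against the full list of read names. The sketch below assumes the standard twelve mandatory PAF columns. The record type, the helper names, and the example read names are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf(lines: Iterable[str]) -> Iterator[PafRecord]:
    """Yield one record per non-empty line; optional tag columns are ignored."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        yield PafRecord(
            f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
            f[5], int(f[6]), int(f[7]), int(f[8]),
            int(f[9]), int(f[10]), int(f[11]),
        )


def mapped_read_names(records: Iterable[PafRecord]) -> set:
    """Names of all reads that have at least one PAF record."""
    return {record.query_name for record in records}


# Made-up example: two pairs, of which only one mate each was mapped.
paf_text = (
    "pairA/1\t150\t0\t150\t+\tchr1\t10000\t500\t650\t140\t150\t60\n"
    "pairB/2\t150\t5\t150\t-\tchr2\t8000\t1200\t1345\t120\t145\t30\n"
)
records = list(parse_paf(paf_text.splitlines()))
all_reads = {"pairA/1", "pairA/2", "pairB/1", "pairB/2"}
print(sorted(all_reads - mapped_read_names(records)))  # ['pairA/2', 'pairB/1']
```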
Abundance estimation mode (for metagenomic binning)
The command-line option --aemb switches strobealign into abundance estimation
mode, intended for metagenomic binning.
In this mode, strobealign outputs a single table with abundance values in
tab-separated value format instead of SAM or PAF.
The output table contains one row for each contig of the reference.
The first column is the reference/contig id and the second its abundance.
The abundance is the number of bases mapped to a contig, divided by the length
of the contig. When a read maps to multiple locations, each of the n locations
is weighted 1/n. Note that this is in contrast to alignment mode, where by
default strobealign outputs only a single alignment for each read
(this can be changed with -N).
Further columns may be added to this table in future versions of strobealign.
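The abundance definition above can be written out in a few lines. This is a toy illustration of the formula only, not strobealign's implementation. The input representation (a list of mapping locations per read, each with a number of mapped bases) and all names are assumptions made for the example.

```python
from collections import defaultdict


def estimate_abundance(contig_lengths, read_mappings):
    """Compute abundance = weighted mapped bases / contig length.

    contig_lengths: dict mapping contig id to its length in bases.
    read_mappings: one list per read, holding (contig id, mapped bases)
        for each of the n locations the read maps to. Each location
        contributes with weight 1/n.
    """
    weighted_bases = defaultdict(float)
    for locations in read_mappings:
        if not locations:
            continue  # unmapped read, contributes nothing
        weight = 1.0 / len(locations)
        for contig, mapped_bases in locations:
            weighted_bases[contig] += weight * mapped_bases
    return {
        contig: weighted_bases.get(contig, 0.0) / length
        for contig, length in contig_lengths.items()
    }


# Made-up example: one uniquely mapped read and one read with two locations.
contig_lengths = {"contigA": 1000, "contigB": 500}
read_mappings = [
    [("contigA", 100)],
    [("contigA", 100), ("contigB", 100)],
]
for contig, abundance in estimate_abundance(contig_lengths, read_mappings).items():
    # Two tab-separated columns, as in the output table described above.
    print(f"{contig}\t{abundance:.4f}")
```

In this example, contigA receives 100 + 50 weighted bases over 1000 bases of length, and contigB receives 50 weighted bases over 500.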
Command-line options
Please run strobealign --help to see the most up-to-date list of command-line
options. Some important ones are:
-r: Mean read length. If given, this overrides the read length estimated
from the input file(s). This is usually only required in combination with
--create-index, see index files.
-t N, --threads=N: Use N threads (both for mapping and indexing).
--eqx: Emit = and X CIGAR operations instead of M.
-x: Only map reads, do not perform base-level alignment. This switches the
output format from SAM to PAF.
--aemb: Output estimated abundance value of each contig, see section above.
--rg-id=ID: Add RG tag to each SAM record.
--rg=TAG:VALUE: Add read group metadata to the SAM header. This can be
specified multiple times. Example: --rg-id=1 --rg=SM:mysample --rg=LB:mylibrary.
-N INT: Output up to INT secondary alignments. By default, no secondary
alignments are output.
-U: Suppress output of unmapped single-end reads and pairs in which both
reads are unmapped.
--use-index: Use a pre-generated index instead of generating a new one.
--create-index, -i: Generate a strobemer index file (.sti) and write it
to disk next to the input reference FASTA. Do not map reads. If read files are
provided, they are used to estimate read length. See index files.
Read profiles and canonical read lengths
Strobealign needs to build an index of the reference before it can map reads to it.
The optimal indexing parameters depend on the type of reads that are to be mapped.
Strobealign by default uses pre-defined sets of parameters that are optimized
for different read lengths. These canonical read lengths are
50, 75, 100, 125, 150, 250 and 400. When deciding which of the pre-defined
indexing parameter sets to use, strobealign chooses one whose canonical
read length is close to the average read length of the input.
The average read length of the input is normally estimated from the first
500 reads, but can also be explicitly set with -r.
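As a rough illustration of the selection step, the sketch below estimates the mean length from the first 500 reads and picks the nearest canonical read length. The nearest-value rule is an assumption: the text only says the chosen length is "close to" the average, and the cut-offs strobealign actually uses may differ. The function names are hypothetical.

```python
from itertools import islice

# Canonical read lengths listed above.
CANONICAL_READ_LENGTHS = (50, 75, 100, 125, 150, 250, 400)


def estimate_mean_read_length(reads, sample_size=500):
    """Mean length of the first `sample_size` read sequences."""
    lengths = [len(seq) for seq in islice(reads, sample_size)]
    if not lengths:
        raise ValueError("no reads given")
    return sum(lengths) / len(lengths)


def closest_canonical_read_length(mean_length):
    """Assumed rule: pick the canonical length nearest to the mean.

    On a tie, the smaller canonical length wins.
    """
    return min(CANONICAL_READ_LENGTHS, key=lambda c: abs(c - mean_length))


# Made-up reads of length 148; the nearest canonical length is 150.
reads = ["A" * 148 for _ in range(1000)]
print(closest_canonical_read_length(estimate_mean_read_length(reads)))
```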
In addition, it is possible to choose a profile optimized for “noisy”, that is,
error-prone reads. Use -P noisy on the command line to select it.
This is equivalent to -k 16 -s 12 -l 2 -u 2 -m 100.
We found these settings to increase accuracy on error-prone reads,
but note this comes at the cost of runtime.
Index files
Pre-computing an index (.sti)
By default, strobealign creates a new index every time the program is run.
On current CPUs (and using multiple cores), indexing a human-sized genome takes
less than 1 minute, which is not long compared to mapping many millions of reads.
However, for repeatedly mapping small libraries, it may be faster to pre-generate
an index on disk and use that.
To create an index, use the --create-index option.
Since strobealign needs to know the read profile, you can provide some reads on
the command line as if you wanted to map them.
Strobealign then selects a profile based on the read length estimated from those reads.
You can also set the read length explicitly with -r:
strobealign --create-index -t 8 -r 150 ref.fa
Or use the “noisy” profile with -P noisy:
strobealign --create-index -t 8 -P noisy ref.fa
This creates a file named ref.fa.X.sti containing the strobemer index,
where X is an identifier for the read profile (such as r50, r100, noisy).
To use the index, add the --use-index option to the command line when mapping.
If you want to use the “noisy” profile, you also need to specify -P noisy
during mapping.
Note that the .sti files are usually tied to a specific strobealign version.
That is, when upgrading strobealign, the .sti files need to be regenerated.
Strobealign detects whether the index was created with an incompatible
version and refuses to load it.
Index files are about four times as large as the reference.
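The two steps (create the index once, then map with --use-index) can be sketched as command builders. This is an illustrative Python sketch with hypothetical helper names. The index commands follow the two examples shown above. Putting the read files after the reference, and the profile identifier r150, are assumptions extrapolated from the text.

```python
def index_filename(reference, profile_id):
    """Name of the index file written next to the reference (ref.fa.X.sti)."""
    return f"{reference}.{profile_id}.sti"


def build_index_command(reference, threads=8, read_length=None, noisy=False, reads=()):
    """Build a `strobealign --create-index` command line.

    Give exactly one source for the read profile: `noisy=True`,
    an explicit `read_length`, or some `reads` to estimate it from.
    """
    cmd = ["strobealign", "--create-index", "-t", str(threads)]
    if noisy:
        cmd += ["-P", "noisy"]
    elif read_length is not None:
        cmd += ["-r", str(read_length)]
    cmd.append(reference)
    cmd += list(reads)  # assumption: read files follow the reference
    return cmd


def build_mapping_command(reference, reads, threads=8, noisy=False):
    """Build a mapping command that loads the pre-generated index."""
    cmd = ["strobealign", "--use-index", "-t", str(threads)]
    if noisy:
        cmd += ["-P", "noisy"]  # must match the profile used for indexing
    cmd.append(reference)
    cmd += list(reads)
    return cmd


print(" ".join(build_index_command("ref.fa", read_length=150)))
print(index_filename("ref.fa", "r150"))
print(" ".join(build_mapping_command("ref.fa", ["reads.1.fastq.gz", "reads.2.fastq.gz"])))
```

Since index files are tied to a strobealign version, a script like this would typically regenerate the index after every upgrade.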
Explanation
Multi-context seeds
Strobealign uses randstrobes as seeds, which in our case consist of two k-mers
(“strobes”) that are somewhat close to each other. When a seed is looked up
in the index, it is only found if both strobes match. By changing the way in
which the index is stored in v0.15.0, it became possible to support
multi-context seeds. With those changes, strobealign falls back to looking
up only one of the strobes (a “partial seed”) if the full seed cannot be found.
This results in better mapping rate and accuracy for read lengths of up to
about 200 nt.
Usage of multi-context seeds is enabled by default in strobealign since v0.16.0.
The strategy is to first search for all full seeds of the query and fall back to
partial seeds if no seeds could be found.
A slightly more accurate, but slower mode of using multi-context seeds is
available by using option --mcs: With it, the strategy is changed to a
fallback per seed: If an individual full seed cannot be found, its partial
version is looked up in the index.
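The difference between the two strategies is easiest to see on a toy index. The sketch below is a simplified model, not strobealign's index structure: seeds are plain tuples of two strings, the index is a pair of dictionaries, and using the first strobe as the partial seed is an assumption.

```python
from collections import defaultdict


def build_toy_index(reference_seeds):
    """reference_seeds: iterable of ((strobe1, strobe2), position)."""
    full, partial = defaultdict(list), defaultdict(list)
    for (strobe1, strobe2), position in reference_seeds:
        full[(strobe1, strobe2)].append(position)
        partial[strobe1].append(position)  # assumption: first strobe is the partial seed
    return dict(full), dict(partial)


def find_hits_default(query_seeds, full, partial):
    """Default strategy: use partial seeds only if no full seed was found at all."""
    hits = [(seed, pos, "full") for seed in query_seeds for pos in full.get(seed, [])]
    if hits:
        return hits
    return [
        (seed, pos, "partial")
        for seed in query_seeds
        for pos in partial.get(seed[0], [])
    ]


def find_hits_mcs(query_seeds, full, partial):
    """Per-seed strategy (--mcs): fall back for each full seed that is not found."""
    hits = []
    for seed in query_seeds:
        if seed in full:
            hits += [(seed, pos, "full") for pos in full[seed]]
        else:
            hits += [(seed, pos, "partial") for pos in partial.get(seed[0], [])]
    return hits


# Made-up seeds: the second query seed has an error in its second strobe.
full, partial = build_toy_index([(("ACG", "TTA"), 10), (("GGC", "CAT"), 42)])
query = [("ACG", "TTA"), ("GGC", "CAA")]
print(find_hits_default(query, full, partial))  # only the full hit at position 10
print(find_hits_mcs(query, full, partial))      # additionally a partial hit at 42
```

The per-seed variant performs more lookups and returns more hits, which is consistent with it being slightly more accurate but slower.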
Collinear chaining
Strobealign uses collinear chaining as its default mapping and alignment method,
replacing the previous NAM approach. The collinear chaining algorithm reproduces
the method used in Minimap2.
Collinear chaining works by splitting strobemer hits into anchors (two anchors for full
hits, one for partial hits) and then constructing chains using these anchors with
a scoring function. Chains are created in O(N×h) time complexity, where N is the
number of anchors and h is a constant set at 50 by default.
Several command line options allow fine tuning of the chaining behavior: -H
controls the chaining look-back window heuristic, --gd sets the diagonal gap cost,
--gl sets the gap length cost, --vp determines the best-chain score threshold,
and --sg controls the maximum skip distance on the reference. The previous NAM method
remains available via the --nams flag.
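To make the O(N×h) bound concrete, here is a small dynamic-programming sketch of collinear chaining with a bounded look-back window. It is a toy model with a deliberately simple scoring function (a fixed score per anchor minus a penalty proportional to the diagonal shift). The scoring used by strobealign and Minimap2 is more elaborate, and all parameter names here are hypothetical.

```python
def chain_anchors(anchors, lookback=50, anchor_score=1.0, gap_cost=0.1):
    """Return (best score, best chain) for a list of (query_pos, ref_pos) anchors.

    Each anchor is compared with at most `lookback` predecessors in
    reference order, giving O(N * lookback) time for N anchors.
    """
    anchors = sorted(anchors, key=lambda a: (a[1], a[0]))
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [anchor_score] * n  # best chain score ending at anchor i
    prev = [-1] * n             # predecessor of anchor i in that chain
    for i in range(n):
        q_i, r_i = anchors[i]
        for j in range(max(0, i - lookback), i):
            q_j, r_j = anchors[j]
            if q_j >= q_i or r_j >= r_i:
                continue  # not collinear: must advance on query and reference
            diagonal_shift = abs((r_i - r_j) - (q_i - q_j))
            candidate = score[j] + anchor_score - gap_cost * diagonal_shift
            if candidate > score[i]:
                score[i], prev[i] = candidate, j
    best = max(range(n), key=score.__getitem__)
    chain = []
    k = best
    while k != -1:  # backtrack from the best-scoring anchor
        chain.append(anchors[k])
        k = prev[k]
    chain.reverse()
    return score[best], chain


# Made-up anchors: four roughly on one diagonal plus one far-away outlier.
anchors = [(0, 100), (20, 120), (40, 140), (25, 900), (60, 161)]
best_score, best_chain = chain_anchors(anchors)
print(best_chain)  # the outlier (25, 900) is left out of the chain
```

A smaller look-back window lowers the runtime but can miss predecessors that lie further back in the sorted anchor list.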
strobealign: A fast short-read aligner
Background
Strobealign achieves its speedup by using a dynamic seed size obtained from syncmer-thinned strobemers.
For details, refer to Strobealign: flexible seed size enables ultra-fast and accurate read alignment. The paper describes v0.7.1 of the program.
For an introduction, see also the 📺 RECOMB-Seq video from 2022: “Flexible seed size enables ultra-fast and accurate read alignment” (12 minutes). For a more detailed presentation of the underlying seeding mechanism in strobealign (strobemers) see 📺 “Efficient sequence similarity searches with strobemers” (73 minutes).
Installation
Conda
Strobealign is available from Bioconda.
From source
To compile strobealign from sources, you need a somewhat recent Rust version, which you can obtain via Rustup. At the time of writing, the Rust versions included in current Debian and Ubuntu releases are too old.
When Rust is available, you can compile strobealign with
cargo build --release
The resulting binary will then be available at target/release/strobealign.
See the contributing instructions for how to compile strobealign as a developer.
Python bindings
Experimental Python bindings were available in the C++ version of strobealign (until version 0.17.0) and have not been ported to Rust. We may add them back, but until then, you will need to get strobealign 0.17.0 and install the bindings by running pip install . in its source directory.
For the moment, the only documentation is the tests in tests/*.py.
Changelog
See Changelog.
Contributing
See Contributing.
Evaluation
See Performance evaluation for some measurements of mapping accuracy and runtime using strobealign 0.7.
Citation
If using v0.17.0 or later:
Tolstoganov, I., Martin, M., Buchin, N., and Sahlin, K. Multi-context seeds enable fast and high-accuracy read mapping. Genome Biol (2026). https://doi.org/10.1186/s13059-026-04017-x
If using v0.16.1 or earlier:
Sahlin, K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol 23, 260 (2022). https://doi.org/10.1186/s13059-022-02831-7
License
Strobealign is available under the MIT license, see LICENSE.