A minimap2 SMRT wrapper for PacBio data:
native PacBio data in ⇨ native PacBio BAM out.
pbmm2 is a SMRT C++ wrapper for minimap2's C API.
Its purpose is to support native PacBio in- and output, provide sets of
recommended parameters, generate sorted output on-the-fly, and postprocess alignments.
Sorted output can be used directly for polishing using GenomicConsensus,
if BAM has been used as input to pbmm2.
Benchmarks show that pbmm2 outperforms BLASR in sequence identity,
number of mapped bases, and especially runtime. pbmm2 is the official
replacement for BLASR.
Binary Availability
The latest version can be installed via the bioconda package pbmm2.
Please refer to our official pbbioconda page
for information on Installation, Support, License, Copyright, and Disclaimer.
Alignment Parallelization
The number of alignment threads can be specified with -j,--num-threads.
If not specified, the maximum number of threads will be used, minus one thread for BAM IO
and minus the number of threads specified for sorting.
Sorting
Sorted output can be generated using --sort.
Percentage: By default, 25% of threads specified with -j, maximum 8, are used for sorting. Example: --sort -j 12, 9 threads for alignment, 3 threads for sorting.
Manual override: To override the default percentage, -J,--sort-threads defines the explicit number of threads
used for on-the-fly sorting. Example: --sort -j 12 -J 4, 12 threads for alignment, 4 threads for sorting.
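The thread split described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (25% of -j, at most 8, or an explicit -J); the function name `split_threads` is hypothetical, and the rounding (round down, keep at least one sort thread) is an assumption that the text does not specify.

```python
def split_threads(num_threads, sort_threads=None, sort_fraction=0.25, max_sort_threads=8):
    """Return (alignment_threads, sort_threads) for a run with --sort.

    Assumption: the 25% share is rounded down and at least one sort
    thread is kept; the text above does not state the rounding rule.
    """
    if sort_threads is not None:
        # Manual override (-J): all -j threads are used for alignment.
        return num_threads, sort_threads
    auto_sort = min(max_sort_threads, max(1, int(num_threads * sort_fraction)))
    return num_threads - auto_sort, auto_sort


print(split_threads(12))     # --sort -j 12      -> (9, 3)
print(split_threads(12, 4))  # --sort -j 12 -J 4 -> (12, 4)
```

Both printed results match the two examples given in the text.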
The memory allocated per sort thread can be defined with -m,--sort-memory, accepting suffixes M,G.
Temporary files during sorting are stored in the current working directory,
unless explicitly defined with environment variable TMPDIR.
The path used for temporary files is also printed if --log-level DEBUG is set.
Benchmarks on human data have shown that 4 sort threads are recommended, but no more
than 8 threads can be effectively leveraged, even with 70 cores used for alignment.
To avoid disk IO pressure, it is better to give more memory to each of a few sort threads
than less memory to each of many sort threads.
Input file types
The following compatibility table shows allowed input file types, output file types,
compatibility with GenomicConsensus (GC), and the recommended --preset choice.
More information is available in the PacBio dataset XML specification.
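| Input                              | Output                               | GC | Preset |
|------------------------------------|--------------------------------------|----|--------|
| .bam (aligned or unaligned)        | .bam                                 | Y  |        |
| .fasta / .fa / .fasta.gz / .fa.gz  | .bam                                 | N  |        |
| .fastq / .fq / .fastq.gz / .fq.gz  | .bam                                 | N  |        |
| .Q20.fastq / Q20.fastq.gz          | .bam                                 | N  | CCS    |
| bam.fofn                           | .bam                                 | N  |        |
| fasta.fofn                         | .bam                                 | N  |        |
| fastq.fofn                         | .bam                                 | N  |        |
| .subreadset.xml                    | .bam or .alignmentset.xml            | Y  |        |
| .consensusreadset.xml              | .bam or .consensusalignmentset.xml   | Y  | CCS    |
| .transcriptset.xml                 | .bam or .transcriptalignmentset.xml  | Y  | ISOSEQ |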
FASTA/Q input
In addition to native PacBio BAM input, reads can also be provided in FASTA
and FASTQ formats, as shown above.
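With FASTA/Q input, option --rg sets the read group.
All three reference file formats .fasta, .referenceset.xml, and .mmi can be combined with FASTA/Q input.
Multiple input files
pbmm2 supports the .fofn file type (File Of File Names), listing files of the same data type. Supported are .fofn files with FASTA, FASTQ, or BAM. Example: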
ls *.subreads.bam > mymovies.fofn
pbmm2 align hg38.fasta mymovies.fofn hg38.mymovies.bam
FAQ
Which minimap2 version is used?
pbmm2 ≥v1.13.0: minimap2 v2.26
pbmm2 <v1.13.0: minimap2 v2.15
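For scripts that need to record which minimap2 version produced an alignment, the mapping above can be expressed as a small lookup. This is a sketch only; the helper name `minimap2_version` is hypothetical, and it assumes plain dotted numeric version strings such as those used in the changelog.

```python
def minimap2_version(pbmm2_version):
    """Map a pbmm2 version string to the bundled minimap2 version.

    Assumption: the version is dotted and purely numeric, e.g. "1.12.0".
    """
    parts = tuple(int(p) for p in pbmm2_version.split("."))
    # Pad to three components so that "1.13" compares like "1.13.0".
    parts = (parts + (0, 0, 0))[:3]
    return "2.26" if parts >= (1, 13, 0) else "2.15"


for version in ("1.12.0", "1.13.0", "26.1.99"):
    print(version, "->", "minimap2 v" + minimap2_version(version))
```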
When are pbi files created?
Whenever the output is of type xml, a pbi file is generated.
When are BAM index files created?
For sorted output via --sort, a bai file is generated by default.
You can switch to csi for larger genomes with --bam-index CSI or skip
index generation completely with --bam-index NONE.
What are parameter sets and how can I override them?
Per default, pbmm2 uses recommended parameter sets to simplify the plethora
of possible combinations. For this, we currently offer:
SUBREAD
CCS or HIFI (default)
ISOSEQ
UNROLLED
Parameter sets vary based on pbmm2 version and are explained in --help.
If you want to override any of the parameters of your chosen set,
please use the respective options:
-k k-mer size (no larger than 28). [-1]
-w Minimizer window size. [-1]
-u Disable homopolymer-compressed k-mer (compression is active for SUBREAD & UNROLLED presets).
-A Matching score. [-1]
-B Mismatch penalty. [-1]
-z Z-drop score. [-1]
-Z Z-drop inversion score. [-1]
-r Bandwidth used in chaining and DP-based alignment. [-1]
-g Stop chain elongation if there are no minimizers in N bp. [-1]
For the piece-wise linear gap penalties, use the following overrides, where
a gap of length k costs min{o+k*e, O+k*E}:
-o,--gap-open-1 Gap open penalty 1. [-1]
-O,--gap-open-2 Gap open penalty 2. [-1]
-e,--gap-extend-1 Gap extension penalty 1. [-1]
-E,--gap-extend-2 Gap extension penalty 2. [-1]
-L,--lj-min-ratio Long join flank ratio. [-1]
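The two-piece gap cost min{o+ke, O+kE} can be evaluated directly. The sketch below uses illustrative penalty values (o=4, e=2, O=24, E=1); these numbers are assumptions chosen for the example, not the values of any pbmm2 preset.

```python
def gap_cost(k, o, e, O, E):
    """Cost of a gap of length k under the two-piece affine model."""
    return min(o + k * e, O + k * E)


# Illustrative penalties only (assumed, not taken from a pbmm2 preset).
o, e, O, E = 4, 2, 24, 1

for k in (1, 10, 20, 30):
    print(k, gap_cost(k, o, e, O, E))
```

With these values the first piece (o, e) is cheaper for short gaps and the second piece (O, E) takes over beyond k = 20, which is how the model penalizes long gaps less steeply than a single affine function would.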
For ISOSEQ, you can override additional parameters:
-G Max intron length (changes -r). [-1]
-C Cost for a non-canonical GT-AG splicing. [-1]
--no-splice-flank Do not prefer splice flanks GT-AG.
If you have suggestions for our default parameters or ideas for a new
parameter set, please open a GitHub issue!
What other special parameters are used implicitly?
To achieve alignment behavior similar to BLASR, pbmm2 implicitly uses the following
minimap2 parameters:
soft clipping with -Y
long cigars for tag CG with -L
X/= cigars instead of M with --eqx
no secondary alignments are produced per default (overridable with --secondary)
What sequence identity filters does pbmm2 offer?
The idea of removing spurious or low-quality alignments is straightforward,
but the exact definition of a threshold is tricky and
varies between tools and applications. Heng Li has written more on sequence identity.
pbmm2 offers the following filters:
(1) --min-concordance-perc, legacy mapped concordance filter, inherited from its predecessor BLASR (hidden option)
(2) --min-id-perc, a sequence identity percentage filter defined as the BLAST identity (hidden option)
(3) --min-gap-comp-id-perc, a gap-compressed sequence identity filter counting insertions and deletions as single events only (default)
By default, (3) is set to 70%, (1) and (2) are deactivated.
The problem with (1), the mapped concordance filter, is that it also removes
biological structural variations, such as true insertions and deletions
with respect to the reference used; it is only appropriate if applied to resequencing
data of haploid organisms.
The (2) sequence identity is the BLAST identity, a very natural metric for filtering.
The (3) gap-compressed sequence identity filter is very similar to (2),
but counts insertions and deletions as single events only and
is the fairest metric when it comes to assessing the actual error rate.
All three filters are combined with AND, meaning an alignment has to pass all
three thresholds.
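To make the difference between (2) and (3) concrete, the sketch below computes both metrics from a CIGAR string that uses =/X operators. It follows the commonly used definitions (matches divided by alignment columns, with each gap counted either per base or per event); the exact formulas used inside pbmm2 are not shown in this text, so treat the function `identity_metrics` as an assumption-laden illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([=XIDSHNPM])")


def identity_metrics(cigar):
    """Return (blast_identity, gap_compressed_identity) in percent.

    Assumes an =/X CIGAR; the ambiguous M operator is rejected.
    """
    matches = mismatches = 0
    ins_bases = del_bases = 0
    ins_events = del_events = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op == "X":
            mismatches += n
        elif op == "I":
            ins_bases += n
            ins_events += 1
        elif op == "D":
            del_bases += n
            del_events += 1
        elif op == "M":
            raise ValueError("M is ambiguous; an =/X CIGAR is required")
        # S, H, N, and P do not contribute to either metric here.
    columns = matches + mismatches + ins_bases + del_bases
    events = matches + mismatches + ins_events + del_events
    if columns == 0:
        raise ValueError("CIGAR contains no aligned columns")
    return 100.0 * matches / columns, 100.0 * matches / events


blast, gap_compressed = identity_metrics("10=1X5=3I10=2D4=")
print(round(blast, 2), round(gap_compressed, 2))
```

In this example the 3 bp insertion and the 2 bp deletion each count once for the gap-compressed metric, so it comes out higher than the BLAST identity.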
How do you define mapped concordance?
The --min-concordance-perc option removes alignments whose mapped concordance does not pass the provided threshold in percent.
You can deactivate this filter with --min-concordance-perc 0.
What is repeated matches trimming?
A repeated match occurs when the same query interval is shared between a primary
and a supplementary alignment. This can happen for translocations, where breakends
share the same flanking sequence.
Sometimes, when a LINE gets inserted, the flanks are duplicated, leading
to complicated alignments in which a split read shares a duplication.
The inserted region itself, mapping to a random other LINE in the reference
genome, may also share sequence similarity with the flanks.
To get the best alignments, minimap2 allows two alignments to use up to
50% (default) of the same query bases. This does not work for PacBio, because we
see pbmm2 as a BLASR replacement and require that a single base may never be
aligned twice. minimap2 offers a feature to enforce a query interval overlap
of 0%. With that setting, if a query interval is used in two alignments,
one or both are flagged as secondary and filtered out.
This leads to yield loss and, more importantly, to missing SVs in the alignment.
Some papers present dynamic programming approaches to find the optimal split to
uniquely map query intervals while maximizing alignment scores. We don’t have
per-base alignment scores available, so our approach is much simpler.
We align the read, find overlapping query intervals, and trim non-primary alignments
in the order provided by minimap2; trimming here means that pbmm2 rewrites the CIGAR
and the reference coordinates on-the-fly. This increases the number
of mapped bases and boosts the SV recall rate, at the cost of slightly reduced identity.
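The idea of trimming in alignment order can be illustrated on query intervals alone. The following is a simplified sketch, not pbmm2's implementation: it works on half-open (start, end) query coordinates only, does not rewrite CIGARs or reference coordinates, and the function name `trim_repeated_matches` is hypothetical. When a later alignment extends past both ends of an earlier one, this sketch keeps only its left part, which is an arbitrary simplification.

```python
def trim_repeated_matches(intervals):
    """Trim later query intervals so that no query base is used twice.

    intervals: list of half-open (q_start, q_end) tuples in the order
    reported by the aligner, primary alignment first.
    """
    kept = []
    for start, end in intervals:
        for k_start, k_end in kept:
            if start < k_end and k_start < end:  # intervals overlap
                if k_start <= start and end <= k_end:
                    start = end  # fully covered, nothing remains
                    break
                if start < k_start:
                    end = k_start  # keep the part left of the kept interval
                else:
                    start = k_end  # keep the part right of the kept interval
        if start < end:
            kept.append((start, end))
    return kept


# A primary alignment and a supplementary one sharing 100 query bases.
print(trim_repeated_matches([(0, 1000), (900, 2000)]))
```

Here the supplementary alignment is shortened to start where the primary one ends, so both alignments survive instead of one being discarded.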
Why is the output different from BLASR?
As for any two alignments of the same data with different mappers, alignments
will differ. There are many reasons for this, but mainly a combination of
different scoring functions and seeding techniques.
How does sorting work?
We integrated samtools sort code into pbmm2 to use it as on-the-fly sorting.
This allows pbmm2 to skip writing unsorted BAM as output and thus save
one round-trip of writing and reading unsorted BAM to disk, minimizing disk IO
pressure.
Is pbmm2 unsorted + samtools sort faster than pbmm2 --sort?
This highly depends on your filesystem.
Our tests show that there is no clear winner;
runtimes differ by up to 10% in either direction, depending on read length distribution,
genome length and complexity, disk IO pressure, and possibly further unknown factors.
For very small genomes, post-alignment sorting is faster,
but for larger genomes like rice or human, on-the-fly sorting is faster.
Keep in mind that scalability is not only about runtime, but also about disk IO pressure.
We recommend using on-the-fly sorting via pbmm2 align --sort.
Can I get alignment statistics?
If you use --log-level INFO, you get the following alignment metrics
after alignment is done:
Mapped Reads: 1529671
Alignments: 3087717
Mapped Bases: 28020786811
Mean Sequence Identity: 88.4%
Max Mapped Read Length : 122989
Mean Mapped Read Length : 35597.9
Is there any benchmark information, like timings and peak memory consumption?
If you use --log-level INFO, you get the following timing and memory information
after alignment is done:
Index Build/Read Time: 22s 327ms
Alignment Time: 5s 523ms
Sort Merge Time: 344ms 927us
BAI Generation Time: 150ms
PBI Generation Time: 161ms 120us
Run Time: 28s 392ms
CPU Time: 39s 653ms
Peak RSS: 12.5847 GB
Can I get progress output?
If you use --log-level DEBUG, you will see progress reports.
Can I perform unrolled alignment?
If you are interested in unrolled alignments, that is, aligning the full-length
ZMW read or the HQ region of a ZMW against an unrolled template, please use
--zmw or --hqregion with *.subreadset.xml as input that contains
one *.subreads.bam and one *.scraps.bam file. Keep in mind that you have to unroll the
reference on your own.
This is a beta feature and still in development.
How can I set the sample name?
You can override the sample name (SM field in RG tag) for all read groups
with --sample.
If not provided, sample names derive from the dataset input with order of
precedence: SM field in input read group, biosample name, well sample name, UnnamedSample.
If the input is a BAM file and --sample has not been used, the SM field will
be populated with UnnamedSample.
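The order of precedence for dataset input can be written as a simple fallback chain. This is a minimal sketch of the rule as stated above; the function and argument names are hypothetical, and empty strings are treated as missing, which is an assumption.

```python
def resolve_sample_name(sample_override=None, rg_sm=None, biosample=None, well_sample=None):
    """Pick the sample name following the documented order of precedence.

    sample_override corresponds to --sample; the remaining arguments are
    the SM field of the input read group, the biosample name, and the
    well sample name of the input dataset.
    """
    for candidate in (sample_override, rg_sm, biosample, well_sample):
        if candidate:  # None and "" are both treated as "not provided"
            return candidate
    return "UnnamedSample"


print(resolve_sample_name(biosample="HG002", well_sample="well_A01"))
print(resolve_sample_name())
```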
Can I split output by sample name?
Yes, --split-by-sample generates one output BAM file per sample name, with
the sample name as file name infix, if there is more than one aligned sample name.
Can I remove all those extra per base and pulse tags?
Yes, --strip removes the following extraneous tags if the input is BAM,
but the resulting output BAM file cannot be used as input into GenomicConsensus:
dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st
Where are the unmapped reads?
Per default, unmapped reads are omitted. You can add them to the output BAM file
with --unmapped.
Can I output secondary alignments?
Use --secondary to enable secondary alignment output. Secondary alignments
are independent alternate mappings that skip repeated matches trimming and
do not participate in SA tag generation.
Use --max-secondary-alns N to retain at most N secondary alignments prior
to filtering (default: 5). This option is only effective with --secondary.
Can I output at maximum the N best alignments per read?
Use -N, --best-n. If set to 0 (the default), maximum filtering is disabled.
Is there a way to only align one subread per ZMW?
With --median-filter, only the subread closest to the median subread length
per ZMW is aligned.
Preferably, full-length subreads flanked by adapters are chosen.
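The selection rule can be sketched as follows. This is an illustration on subread lengths only; it ignores the stated preference for full-length subreads flanked by adapters, and the helper name `pick_median_subread` and the tie-breaking (first candidate wins) are assumptions.

```python
from statistics import median


def pick_median_subread(lengths):
    """Return the index of the subread whose length is closest to the median.

    lengths: subread lengths of a single ZMW, in any order.
    Ties are resolved in favor of the earliest subread (assumption).
    """
    target = median(lengths)
    return min(range(len(lengths)), key=lambda i: abs(lengths[i] - target))


# One ZMW with a short first pass, three full passes, and a truncated last pass.
subread_lengths = [100, 5000, 5200, 4800, 300]
print(pick_median_subread(subread_lengths))
```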
What is --collapse-homopolymers?
The idea behind --collapse-homopolymers is to collapse any two or more
consecutive bases of the same type. In this mode, the reference is collapsed and
written to disk with the same prefix as your output alignment and appended
with suffix .ref.collapsed.fasta. In addition, each read is collapsed
before alignment. This mode cannot be combined with .mmi input.
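Collapsing homopolymers is a small string transformation. The sketch below shows the idea on a single sequence; it is not pbmm2's code, and the function name `collapse_homopolymers` is hypothetical.

```python
from itertools import groupby


def collapse_homopolymers(sequence):
    """Replace every run of identical consecutive bases with a single base."""
    return "".join(base for base, _run in groupby(sequence))


print(collapse_homopolymers("AAACCGTTTA"))
```

Because both the reference and each read are collapsed in this way before alignment, coordinates in the resulting alignments refer to the collapsed reference written to disk, not to the original one.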
Known issues
Due to multithreading, the output alignment ordering can differ between multiple runs
with the same input parameters. The same can occur even with option --sort for
records that align to the same target sequence, the same position within that target,
and in the same orientation, which are the only fields that samtools sort uses.
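The effect can be reproduced with a toy example. The sketch assumes a sort key made of target, position, and orientation, as described above; records are plain dictionaries with hypothetical field names, not real BAM records. Two records that tie on the key keep whatever relative order they arrived in, so a different arrival order from the alignment threads yields a different, equally valid sorted file.

```python
def coordinate_key(record):
    """Sort key using only target, position, and orientation."""
    return (record["target"], record["pos"], record["reverse"])


read0 = {"name": "read0", "target": 0, "pos": 50, "reverse": False}
read1 = {"name": "read1", "target": 0, "pos": 100, "reverse": False}
read2 = {"name": "read2", "target": 0, "pos": 100, "reverse": False}

# Two runs in which the alignment threads emitted the records in different order.
run_a = [read1, read2, read0]
run_b = [read2, read1, read0]

order_a = [r["name"] for r in sorted(run_a, key=coordinate_key)]
order_b = [r["name"] for r in sorted(run_b, key=coordinate_key)]
print(order_a)
print(order_b)
```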
Full Changelog
26.1.99
Add --secondary
26.1.0
Update ISOSEQ preset parameters
Add AS Alignment score tag
Improve error messages
1.17.0
Support ultra-high memory Linux systems
Strip SA tags from input
1.16.0
SMRT Link 25.1 release
Do not warn about aligned input for minimizer sorted unaligned BAM input
There are currently no plans to update the source code on GitHub, but binary releases will continue regularly
1.14.99
Remove minimum DP length threshold for HiFi preset
1.14.0:
Add option --include-fail-reads for datasets
1.13.1:
Documentation changes, included in SMRT Link v13.0
1.13.0:
Update minimap2 to version 2.26
1.12.0:
Set --preset CCS as default
Change repeated matches trimming to adhere to minimap2 alignment ordering
1.11.0:
Strip HiFi kinetics tags
1.10.0:
Allow reverse-complemented unaligned records as input
Allow infix, but not flanking, spaces in sample name
Do not allow overwriting input files
Store number of mismatches, tag NM
1.9.0:
Print dependency versions
Set unaligned MAPQ to 0
1.8.0:
Add support for *.fsa files
1.7.0:
Set TLEN, for information only
Trim insertions, deletions, and mismatches from the alignment flanks
1.6.0:
SA tag contains full cigar; use --short-sa-cigar to use legacy version
Sanitize bio sample names
1.5.0:
Hide --min-concordance-perc and --min-id-perc
Change default identity filter to --min-gap-comp-id-perc
1.4.0:
Official SMRT Link v10 release
Case-insensitive --preset
Read groups without SM tag are labelled as UnnamedSample
1.3.0:
New internal features for HiFi assembly
htslib 1.10 support
1.2.1:
Abort if input fofn contains non-existing files
Add new filters --min-id-perc and --min-gap-comp-id-perc
Updated CLI UX
Add -g to control minimap2’s max_gap
Add --bam-index
1.1.0:
Add support for gzipped FASTA and FASTQ
Allow multiple input files via .fofn
Add --collapse-homopolymers
Use TMPDIR env variable to set path for temporary files
Minor memory leak fix, if you used the API directly
1.0.0:
First stable release, included in SMRT Link v7.0
Minor documentation changes
0.12.0:
Enable --unmapped to add unmapped records to output
Add repeated matches trimming
Add BAI for sorted output
Allow 0 value overrides
Abort if insufficient memory is available for sorting
0.11.0:
Add --lj-min-ratio, --rg, --split-by-sample, --strip
Fix SA tag
0.10.1:
Idempotence. Alignment of alignments results in identical alignments
Use different technique to get tmpfile pipe
Median filter does not log to DEBUG
0.10.0:
Add --preset CCS
Allow disabling of homopolymer-compressed k-mer -u
Adjust concordance metric to be identical to SMRT Link
Add reference fasta to dataset output
Output run timings and peak memory
Change CLI UX
No overlapping query intervals
Use BioSample or WellSample name from input dataset
Drop fake @SQ checksum
Add SA tag
0.9.0:
Add --sort
Add --preset ISOSEQ
Add --median-filter
Acknowledgements
Many thanks to Heng Li for a pleasant API experience and
to Lance Hepler for the initial idea and code.
Disclaimer
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED “AS IS,” WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.
Latest Version
Version 26.1.99; see the Full Changelog section.
Usage
pbmm2 offers the following tools: index and align.
Typical workflows
Index
Indexing is optional, but recommended if you use the same reference with the same --preset multiple times.
Notes:
If you use an index file, you cannot override parameters -k, -w, nor -u in pbmm2 align!
The minimap2 parameter -H (homopolymer-compressed k-mer) is always on for SUBREAD and UNROLLED presets and can be disabled with -u.
You can also use existing minimap2 .mmi files in pbmm2 align.
Align
The output argument is optional. If not provided, BAM output is streamed to stdout.
FAQ
How do you define identity?
The --min-id-perc option removes alignments whose sequence identity, defined as the BLAST identity, does not pass the provided threshold in percent.
You can deactivate this filter with --min-id-perc 0.
How do you define gap-compressed identity?
The --min-gap-comp-id-perc, -y option removes alignments whose gap-compressed identity does not pass the provided threshold in percent.
This is the default filter. You can deactivate this filter with --min-gap-comp-id-perc 0.
What SAM tags are added by pbmm2?
pbmm2 adds the following tags to each aligned record:
mc, stores mapped concordance percentage between 0.0 and 100.0, if the filter was used
mg, stores gap-compressed sequence identity percentage between 0.0 and 100.0, if the filter was used
mi, stores sequence identity percentage between 0.0 and 100.0, if the filter was used
rm, is set to 1 if an alignment has been manipulated by repeated matches trimming