midsv is a Python module that converts SAM files to MIDSV format.
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents differences between a reference and a query, with the same length as the reference.
[!CAUTION]
MIDSV is intended for targeted amplicon sequences (10-100 kbp). Using whole chromosomes as references may exhaust memory and crash.
[!IMPORTANT]
MIDSV requires long-format cs tags in the SAM file. Use minimap2 with the --cs=long option, or use the cstag tool to append long-format cs tags.
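As a quick pre-flight check, the sketch below inspects a single SAM alignment line for a long-format cs tag. The helper name `has_long_cstag` and the sample records are hypothetical and not part of midsv; the check relies only on the minimap2 convention that the short format encodes matches as `:<length>`, whereas the long format spells out matched bases after `=`.

```python
import re


def has_long_cstag(sam_line: str) -> bool:
    """Return True if the alignment line carries a cs tag that looks long-format."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM alignment line.
    for tag in fields[11:]:
        if tag.startswith("cs:Z:"):
            value = tag[len("cs:Z:"):]
            # Short-format cs tags encode matches as ":<length>".
            # A tag without any matched run cannot be told apart this way.
            return re.search(r":\d+", value) is None
    return False


# Hypothetical alignment lines, for illustration only.
long_record = "read1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tcs:Z:=ACGT"
short_record = "read1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tcs:Z::4"
untagged_record = "read1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII"

print(has_long_cstag(long_record))
print(has_long_cstag(short_record))
print(has_long_cstag(untagged_record))
```

Running a check like this on the first few alignments can catch a missing --cs=long option before a full conversion is attempted.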
The output includes MIDSV and, optionally, QSCORE.
MIDSV preserves original nucleotides while annotating mutations.
QSCORE provides Phred quality scores for each nucleotide.
Details of MIDSV (formerly MIDS) are described in our paper.
MIDSV uses | to separate nucleotides in insertion sites so +A|+C|+G|+T|=A can be easily split into [+A, +C, +G, +T, =A] by "+A|+C|+G|+T|=A".split("|").
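The splitting described above can be sketched for a whole MIDSV string as follows. The sample string and the helper `split_insertion` are hypothetical; treating the last element of an insertion site as the anchor at the reference position is an assumption based on the example above.

```python
# Hypothetical MIDSV string spanning three reference positions.
midsv_string = "=A,+A|+C|+G|+T|=A,=C"


def split_insertion(field: str) -> tuple[list[str], str]:
    """Split one comma-separated field into (inserted tokens, anchor token)."""
    *inserted, anchor = field.split("|")
    return inserted, anchor


# One comma-separated field per reference position.
fields = midsv_string.split(",")
parsed = [split_insertion(field) for field in fields]

for position, (inserted, anchor) in enumerate(parsed, start=1):
    print(position, inserted, anchor)
```

Fields without an insertion simply yield an empty list of inserted tokens, so the same code path handles every position.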
QSCORE
Op    Description
-1    Unknown
|     Separator for insertion sites
QSCORE uses -1 for deletions or unknown nucleotides.
As with MIDSV, QSCORE uses | to separate quality scores in insertion sites.
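A minimal sketch of reading a QSCORE string alongside its MIDSV string is shown below. Both strings are made-up values, `parse_qscore` is a hypothetical helper, and the `-C` deletion token is assumed from the MIDSV specification rather than shown in this section.

```python
# Hypothetical MIDSV and QSCORE strings for four reference positions.
midsv = "=A,-C,+T|+G|=A,=T"
qscore = "30,-1,20|25|31,40"


def parse_qscore(qscore_string: str) -> list[list[int]]:
    """Parse QSCORE into one list of integer scores per reference position."""
    return [
        [int(value) for value in field.split("|")]
        for field in qscore_string.split(",")
    ]


scores = parse_qscore(qscore)
midsv_fields = midsv.split(",")

# Both strings carry one field per reference position.
same_length = len(scores) == len(midsv_fields)

# Within an insertion site, each token has a matching score.
aligned = all(
    len(field.split("|")) == len(values)
    for field, values in zip(midsv_fields, scores)
)

# Drop the -1 placeholders before computing statistics.
known_scores = [value for values in scores for value in values if value != -1]
print(same_length, aligned, known_scores)
```

Filtering out -1 before averaging avoids letting deleted or unknown positions drag down a per-read quality estimate.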
Reverse complement MIDSV
midsv.formatter.revcomp returns the reverse complement of a MIDSV string. Insertions are reversed and complemented with their anchor moved to the new position, following the MIDSV specification.
Export VCF
from midsv import transform
from midsv.io import write_vcf
alignments = transform("examples/example_indels.sam", qscore=False)
write_vcf(alignments, "variants.vcf", large_sv_threshold=50)
midsv.io.write_vcf writes MIDSV output to VCF and supports insertion, deletion, substitution, large insertion, large deletion, and inversion. Insertions longer than large_sv_threshold are emitted as symbolic <INS>, large deletions (or =N padding) use <DEL>, and inversions use <INV>. The INFO field includes TYPE or SVTYPE, SVLEN, SEQ, and QNAME.
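The threshold rule for insertions can be illustrated with the sketch below. The function `classify_insertion` and its return shape are hypothetical and only mirror the rule stated above; how write_vcf actually builds the REF, ALT, and INFO columns is not shown here.

```python
def classify_insertion(inserted_seq: str, large_sv_threshold: int = 50) -> dict:
    """Decide whether an insertion would be reported literally or symbolically."""
    svlen = len(inserted_seq)
    # Only insertions strictly longer than the threshold become symbolic.
    symbolic = svlen > large_sv_threshold
    return {
        "symbolic": symbolic,
        "allele": "<INS>" if symbolic else inserted_seq,
        "svlen": svlen,
    }


small = classify_insertion("ACGT")
boundary = classify_insertion("A" * 50)
large = classify_insertion("A" * 51)
print(small["allele"], boundary["symbolic"], large["allele"])
```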
midsv
🛠️ Installation
From Bioconda (recommended):
From PyPI:
📘 Usage
path_sam: Path to a SAM file on disk.
qscore (bool, optional): Output QSCORE. Defaults to False.
keep: Subset of {'FLAG', 'POS', 'SEQ', 'QUAL', 'CIGAR', 'CSTAG'} to include from the SAM file. Defaults to None.
midsv.transform() returns a list of dictionaries containing QNAME, RNAME, MIDSV, and optionally QSCORE, plus any fields specified by keep.
MIDSV and QSCORE are comma-separated strings with the same length as the reference sequence.
🖍️ Examples
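To illustrate the shape of one record returned by midsv.transform(), the sketch below tallies the operations in a made-up record. The record values and the helper `count_operations` are hypothetical, and the operator prefixes (`=` match, `*` substitution, `-` deletion, `+` insertion) are assumed from the MIDSV specification.

```python
from collections import Counter

# Hypothetical record shaped like one element of the returned list.
record = {"QNAME": "read1", "RNAME": "ref", "MIDSV": "=A,*AG,-C,+T|+G|=A,=T"}


def count_operations(midsv_string: str) -> Counter:
    """Count MIDSV tokens by their leading operator character."""
    counts = Counter()
    for field in midsv_string.split(","):
        # An insertion site holds several tokens separated by "|".
        for token in field.split("|"):
            counts[token[0]] += 1
    return counts


counts = count_operations(record["MIDSV"])
reference_length = len(record["MIDSV"].split(","))
print(dict(counts), reference_length)
```

Because each comma-separated field corresponds to one reference position, the field count gives the reference length regardless of how many bases were inserted.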
Perfect match
Insertion, deletion, and substitution
Large deletion
Inversion
🧩 Helper functions
Read SAM file
midsv.io.read_sam reads a local SAM file into an iterator of string lists.
Read/Write JSON Line (JSONL)
Since midsv.transform returns a list of dictionaries, midsv.io.write_jsonl outputs it to a file in JSONL format.
Conversely, midsv.io.read_jsonl reads JSONL as an iterator of dictionaries.
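For readers unfamiliar with JSONL, the round trip performed by the JSONL helpers can be approximated with the standard library alone. The functions `write_jsonl_sketch` and `read_jsonl_sketch` below are hypothetical stand-ins, not the midsv.io implementations, and the records are made up.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_sketch(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl_sketch(path):
    """Yield one dictionary per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Hypothetical records shaped like the output of midsv.transform.
records = [
    {"QNAME": "read1", "RNAME": "ref", "MIDSV": "=A,=C"},
    {"QNAME": "read2", "RNAME": "ref", "MIDSV": "=A,-C"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.jsonl"
    write_jsonl_sketch(records, path)
    # Consume the iterator while the temporary file still exists.
    restored = list(read_jsonl_sketch(path))

print(restored == records)
```

Reading lazily, one line at a time, keeps memory use flat even when a file holds many alignments.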