nanomonsv

Introduction

nanomonsv is software for detecting somatic structural variations from paired (tumor and matched control) cancer genome sequence data. It is described in the following paper. If you use nanomonsv or any resource in this repository, please cite this paper.

Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Shiraishi et al., Nucleic Acids Research, 2023, [link].

Key features:

Single-nucleotide breakpoint resolution using consensus sequences from long-read alignments.
LINE1 insertion classification: Distinguishes Solo L1, Partnered L1 (transduction), and Orphan L1 (orphan transduction), and identifies source L1 elements.
Two detection modules: a Canonical SV module for standard SVs with high precision and recall, and a Single breakend SV module for complex SVs involving highly repetitive sequences (centromeres, LINE1, viruses) that can only be identified with long reads.
Haplotype-aware (v0.9.0+): Reports per-haplotype supporting read counts (HP1, HP2, unphased) using phasing information from the input BAM file. This enables phasing of SV breakpoints.

Installation

pip install nanomonsv

You can also install via conda (bioconda channel). Occasionally the conda releases lag behind PyPI.

conda create -n nanomonsv -c conda-forge -c bioconda nanomonsv

Dependencies

Tool	Required for	Notes
htslib (tabix, bgzip)	parse, get	Must be in PATH
racon	get	Consensus generation (default)
mafft	get (`--use_mafft`)	For backward compatibility
bwa	insert_classify
minimap2	insert_classify
bedtools	insert_classify
RepeatMasker	insert_classify

Python >=3.9, pysam, numpy, parasail
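
Before running a subcommand, it can help to confirm that the external tools listed above are on PATH. The following is a minimal sketch, not part of nanomonsv: the helper name missing_tools and the grouping dictionary are hypothetical, the executable names are assumed to match the tool names in the table, and it is assumed that mafft stands in for racon when --use_mafft is given.

```python
import shutil

# Executables grouped by the nanomonsv subcommand that needs them, following
# the dependency table above. Assumptions: the executable names match the
# table entries, and mafft replaces racon when --use_mafft is given.
REQUIRED_TOOLS = {
    "parse": ["tabix", "bgzip"],
    "get": ["tabix", "bgzip", "racon"],
    "get --use_mafft": ["tabix", "bgzip", "mafft"],
    "insert_classify": ["bwa", "minimap2", "bedtools", "RepeatMasker"],
}


def missing_tools(subcommand, which=shutil.which):
    """Return the executables needed by `subcommand` that `which` cannot find."""
    return [tool for tool in REQUIRED_TOOLS[subcommand] if which(tool) is None]


if __name__ == "__main__":
    for subcommand in REQUIRED_TOOLS:
        missing = missing_tools(subcommand)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{subcommand}: {status}")
```

The which argument exists only so the lookup can be replaced in tests. By default the helper uses shutil.which from the standard library.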

Input requirements

BAM or CRAM file aligned by minimap2
For CRAM files, specify --reference_fasta
S3 paths (e.g., s3://bucket/path.bam) are supported via pip install nanomonsv[s3]. Note that network latency may significantly slow down processing compared to local files.

Quick Start

Prepare the reference genome (here, GDC GRCh38 reference genome).

wget https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 -O GRCh38.d1.vd1.fa.tar.gz
tar xvf GRCh38.d1.vd1.fa.tar.gz

(Optional but highly recommended) Download a control panel from Zenodo. See Control Panel for available panels and how to choose.

wget https://zenodo.org/api/records/11470934/files/1kg-ont-vienna_hg38_no_singleton.tar.gz/content \
 -O 1kg-ont-vienna_hg38_no_singleton.tar.gz
tar xvf 1kg-ont-vienna_hg38_no_singleton.tar.gz

(Optional but highly recommended) Download a simple repeat BED file. Pre-built files for GRCh38 and CHM13 are included in this repository under resource/simple_repeats.

Parse putative SV supporting reads.

nanomonsv parse tumor.bam output/tumor
nanomonsv parse ctrl.bam output/ctrl

Get the final result.

nanomonsv get output/tumor tumor.bam GRCh38.d1.vd1.fa \
 --control_prefix output/ctrl --control_bam ctrl.bam \
 --control_panel_prefix 1kg-ont-vienna_hg38_no_singleton \
 --simple_repeat_bed resource/simple_repeats/human_GRCh38_simpleRepeat.bed.gz

You will find the result file output/tumor.nanomonsv.result.txt.
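
The Quick Start steps can also be driven from a script. The sketch below only assembles the argument lists for the three commands shown above, using the same positional arguments and options. The function name build_quickstart_commands and its parameters are hypothetical, and nothing is executed until each list is passed to something like subprocess.run(cmd, check=True).

```python
import shlex


def build_quickstart_commands(tumor_bam, ctrl_bam, reference_fasta,
                              out_dir="output",
                              control_panel_prefix=None,
                              simple_repeat_bed=None):
    """Return argument lists for parse (tumor), parse (control), and get."""
    tumor_prefix = f"{out_dir}/tumor"
    ctrl_prefix = f"{out_dir}/ctrl"
    commands = [
        ["nanomonsv", "parse", tumor_bam, tumor_prefix],
        ["nanomonsv", "parse", ctrl_bam, ctrl_prefix],
    ]
    get_cmd = [
        "nanomonsv", "get", tumor_prefix, tumor_bam, reference_fasta,
        "--control_prefix", ctrl_prefix,
        "--control_bam", ctrl_bam,
    ]
    # The two optional resources are added only when a path is supplied.
    if control_panel_prefix:
        get_cmd += ["--control_panel_prefix", control_panel_prefix]
    if simple_repeat_bed:
        get_cmd += ["--simple_repeat_bed", simple_repeat_bed]
    commands.append(get_cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_quickstart_commands(
            "tumor.bam", "ctrl.bam", "GRCh38.d1.vd1.fa",
            control_panel_prefix="1kg-ont-vienna_hg38_no_singleton",
            simple_repeat_bed="resource/simple_repeats/human_GRCh38_simpleRepeat.bed.gz"):
        print(shlex.join(cmd))
```

Keeping the commands as lists rather than shell strings avoids quoting problems when paths contain spaces.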

Usage

parse

Parses all supporting reads of putative somatic SVs.

nanomonsv parse [-h] [--reference_fasta reference.fa] [--debug]
                [--split_alignment_check_margin SPLIT_ALIGNMENT_CHECK_MARGIN]
                [--minimum_breakpoint_ambiguity MINIMUM_BREAKPOINT_AMBIGUITY]
                alignment_file output_prefix

alignment_file: Path to input indexed BAM or CRAM file
output_prefix: Output file prefix
--reference_fasta: Path to reference genome (recommended for CRAM files)

After successful completion, you will find: {output_prefix}.deletion.sorted.bed.gz, {output_prefix}.insertion.sorted.bed.gz, {output_prefix}.rearrangement.sorted.bedpe.gz, {output_prefix}.bp_info.sorted.bed.gz and their indexes (.tbi files).
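
A quick way to confirm that parse finished is to check for the files listed above. This is a small sketch with hypothetical helper names. It assumes each index sits next to its data file with .tbi appended, which is the usual tabix convention.

```python
from pathlib import Path

# Suffixes of the four files written by `nanomonsv parse`, as listed above.
PARSE_SUFFIXES = [
    ".deletion.sorted.bed.gz",
    ".insertion.sorted.bed.gz",
    ".rearrangement.sorted.bedpe.gz",
    ".bp_info.sorted.bed.gz",
]


def expected_parse_outputs(output_prefix):
    """Return the data files and their .tbi indexes expected for a prefix."""
    paths = []
    for suffix in PARSE_SUFFIXES:
        paths.append(Path(output_prefix + suffix))
        # Assumption: the index is the data file name plus ".tbi".
        paths.append(Path(output_prefix + suffix + ".tbi"))
    return paths


def missing_parse_outputs(output_prefix):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_parse_outputs(output_prefix) if not p.is_file()]


if __name__ == "__main__":
    for path in missing_parse_outputs("output/tumor"):
        print("missing:", path)
```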

get

Gets the SV result from parsed supporting reads.

nanomonsv get [-h] [--control_prefix CONTROL_PREFIX]
              [--control_bam CONTROL_BAM]
              [--control_panel_prefix CONTROL_PANEL_PREFIX]
              [--simple_repeat_bed SIMPLE_REPEAT_BED]
              [--min_tumor_variant_read_num MIN_TUMOR_VARIANT_READ_NUM]
              [--min_tumor_VAF MIN_TUMOR_VAF]
              [--max_control_variant_read_num MAX_CONTROL_VARIANT_READ_NUM]
              [--max_control_VAF MAX_CONTROL_VAF]
              [--cluster_margin_size CLUSTER_MARGIN_SIZE]
              [--median_mapQ_thres MEDIAN_MAPQ_THRES]
              [--max_overhang_size_thres MAX_OVERHANG_SIZE_THRES]
              [--var_read_min_mapq VAR_READ_MIN_MAPQ]
              [--qv10] [--qv15] [--qv20] [--qv25] [--use_mafft]
              [--no_single_bnd] [--processes PROCESSES]
              [--sort_option SORT_OPTION] [--max_memory_minimap2] [--debug]
              tumor_prefix tumor_bam reference.fa

tumor_prefix: Prefix to the tumor data set in the parse step
tumor_bam: Path to input indexed BAM file
reference.fa: Path to reference genome used for the alignment

Recommended options

Option	Recommendation	Description
`--control_prefix` / `--control_bam`	Strongly recommended	Matched control for somatic filtering. We strongly recommend using matched control data whenever possible.
`--control_panel_prefix`	Recommended	Non-matched control panel (see Control Panel)
`--simple_repeat_bed`	Strongly recommended	Filter indels in simple repeats. BED files provided in resource/simple_repeats
`--use_mafft`	Not recommended	Use mafft instead of racon for consensus generation (for backward compatibility)
`--no_single_bnd`	Not recommended	Disable single breakend SV detection. See wiki
`--processes N`	Optional	Multi-processing mode

Quality presets

Preset	Recommended for
`--qv10`	ONT data with median Q10 (e.g., Guppy before v5)
`--qv15`	ONT data with median Q15 (e.g., Guppy v5/v6)
`--qv20`	ONT data with median Q20+ (e.g., Dorado SUP, Q20+ chemistry)
`--qv25`	PacBio HiFi data
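
The table can be turned into a small lookup when presets are chosen programmatically. In this sketch the function name and the platform labels are hypothetical. The table gives only representative quality values, so the cut-offs between presets (15 and 20) are an assumption rather than documented behavior.

```python
def choose_quality_preset(platform, median_q=None):
    """Suggest a preset flag from the platform and median read quality.

    Assumption: values between the table's anchor points fall back to the
    nearest lower preset (e.g. Q17 -> --qv15).
    """
    platform = platform.lower()
    if platform in ("pacbio_hifi", "hifi"):
        return "--qv25"
    if platform != "ont":
        raise ValueError(f"unknown platform: {platform!r}")
    if median_q is None:
        raise ValueError("median_q is required for ONT data")
    if median_q >= 20:
        return "--qv20"
    if median_q >= 15:
        return "--qv15"
    return "--qv10"


if __name__ == "__main__":
    print(choose_quality_preset("ONT", 21.3))
    print(choose_quality_preset("hifi"))
```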

merge_control

Merges non-matched control panel supporting reads obtained by parse.

nanomonsv merge_control [-h] prefix_list_file output_prefix

prefix_list_file: List of output_prefix generated at the parse stage
output_prefix: Prefix to the merged control supporting reads
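
When building a custom panel from many samples, the prefix list can be generated rather than typed. The sketch below assumes that prefix_list_file holds one parse output prefix per line, which is not stated explicitly above. The function name and the example prefixes are hypothetical.

```python
def format_prefix_list(prefixes):
    """Return file content with one prefix per line, skipping blanks and repeats."""
    kept = []
    for prefix in prefixes:
        prefix = prefix.strip()
        if prefix and prefix not in kept:
            kept.append(prefix)
    return "".join(f"{prefix}\n" for prefix in kept)


if __name__ == "__main__":
    # Hypothetical prefixes produced by earlier `nanomonsv parse` runs.
    content = format_prefix_list(["panel/sample1", "panel/sample2"])
    with open("prefix_list.txt", "w") as handle:
        handle.write(content)
    # Then: nanomonsv merge_control prefix_list.txt panel/merged
```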

insert_classify

Classifies long insertions into mobile element insertions (LINE1, Alu, SVA, processed pseudogene).

nanomonsv insert_classify [-h] [--debug] sv_list_file output_file reference.fa gencode.gtf.gz LINE1_db

sv_list_file: VCF file or nanomonsv get result file (nanomonsv.result.txt)
output_file: Path to the output file
reference.fa: Path to the reference genome
gencode.gtf.gz: Path to gene annotation GTF file. We recommend Gencode basic annotation (e.g., gencode.v49.basic.annotation.gtf.gz)
LINE1_db: Path to LINE1 database. Use the files in resource/LINE1_db

validate

Validates candidate SVs by alignment of tumor and matched control BAM files. This may be helpful for evaluating SV tools on short-read platforms when pairs of short-read and long-read sequencing data are available.

nanomonsv validate [-h] [--control_bam CONTROL_BAM]
                   [--var_read_min_mapq VAR_READ_MIN_MAPQ] [--debug]
                   sv_list_file tumor_bam output reference.fa

sv_list_file: SV candidate list file (only Chr_1 to Inserted_Seq columns are necessary)
output: Path to the output file
reference.fa: Path to the reference genome
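
Since only the Chr_1 to Inserted_Seq columns are needed, a candidate list can be cut down from a larger table. This sketch assumes a tab-delimited input with a header row containing those seven column names. Whether validate expects the header line to be kept is also an assumption. The function name is hypothetical.

```python
SV_KEY_COLUMNS = ["Chr_1", "Pos_1", "Dir_1", "Chr_2", "Pos_2", "Dir_2", "Inserted_Seq"]


def to_sv_list(result_lines):
    """Keep only the Chr_1..Inserted_Seq columns of a tab-delimited table.

    The header line is kept as the first output line.
    """
    rows = [line.rstrip("\n").split("\t") for line in result_lines if line.strip()]
    header = rows[0]
    indexes = [header.index(name) for name in SV_KEY_COLUMNS]
    return ["\t".join(row[i] for i in indexes) for row in rows]


if __name__ == "__main__":
    # Made-up two-line table for illustration only.
    example = [
        "\t".join(SV_KEY_COLUMNS + ["SV_ID"]) + "\n",
        "\t".join(["chr1", "1000", "+", "chr1", "5000", "-", "---", "d_1"]) + "\n",
    ]
    print("\n".join(to_sv_list(example)))
```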

Output Format

Canonical SV result ({tumor_prefix}.nanomonsv.result.txt)

Column	Description
Chr_1	Chromosome for the 1st breakpoint
Pos_1	Coordinate for the 1st breakpoint
Dir_1	Direction of the 1st breakpoint
Chr_2	Chromosome for the 2nd breakpoint
Pos_2	Coordinate for the 2nd breakpoint
Dir_2	Direction of the 2nd breakpoint
Inserted_Seq	Inserted nucleotides within the breakpoints (`---` if none)
SV_ID	Identifier of SVs
Checked_Read_Num_Tumor	Total reads in the tumor used for validation alignment
Supporting_Read_Num_Tumor	Variant reads in the tumor from validation alignment
Supporting_Read_Num_Tumor_HP_BP1	Haplotype counts of variant reads at breakpoint 1 (HP1,HP2,unphased)
Supporting_Read_Num_Tumor_HP_BP2	Haplotype counts of variant reads at breakpoint 2 (HP1,HP2,unphased)
Checked_Read_Num_Control	Total reads in the matched control used for validation alignment
Supporting_Read_Num_Control	Variant reads in the matched control from validation alignment
Is_Filter	Filter status (`PASS` or filter reason such as `Simple_repeat`)

A VCF format file ({tumor_prefix}.nanomonsv.result.vcf) is also generated. See the wiki page for details on filtering.
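
The text result can be loaded with the standard csv module. The following sketch keeps only PASS records and splits the per-haplotype counts. It assumes the file is tab-delimited with a header row using the column names above, and that the haplotype fields are three comma-separated integers in the order HP1,HP2,unphased. The helper names and the example rows are made up for illustration.

```python
import csv
import io


def parse_hp_counts(field):
    """Split an 'HP1,HP2,unphased' string into a dictionary of integers."""
    hp1, hp2, unphased = (int(value) for value in field.split(","))
    return {"HP1": hp1, "HP2": hp2, "unphased": unphased}


def read_pass_svs(handle):
    """Return PASS records with haplotype counts parsed for both breakpoints."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Is_Filter"] != "PASS":
            continue
        row["HP_BP1"] = parse_hp_counts(row["Supporting_Read_Num_Tumor_HP_BP1"])
        row["HP_BP2"] = parse_hp_counts(row["Supporting_Read_Num_Tumor_HP_BP2"])
        records.append(row)
    return records


COLUMNS = [
    "Chr_1", "Pos_1", "Dir_1", "Chr_2", "Pos_2", "Dir_2", "Inserted_Seq", "SV_ID",
    "Checked_Read_Num_Tumor", "Supporting_Read_Num_Tumor",
    "Supporting_Read_Num_Tumor_HP_BP1", "Supporting_Read_Num_Tumor_HP_BP2",
    "Checked_Read_Num_Control", "Supporting_Read_Num_Control", "Is_Filter",
]

# Made-up rows, not real nanomonsv output.
EXAMPLE_ROWS = [
    ["chr1", "1000", "+", "chr1", "5000", "-", "---", "d_1",
     "30", "12", "10,0,2", "9,1,2", "28", "0", "PASS"],
    ["chr2", "200", "+", "chr2", "260", "-", "---", "d_2",
     "25", "4", "0,0,4", "0,0,4", "20", "0", "Simple_repeat"],
]
EXAMPLE_TEXT = "\n".join("\t".join(row) for row in [COLUMNS] + EXAMPLE_ROWS) + "\n"

if __name__ == "__main__":
    for record in read_pass_svs(io.StringIO(EXAMPLE_TEXT)):
        print(record["SV_ID"], record["HP_BP1"], record["HP_BP2"])
```

For a real file, pass an open handle such as open("output/tumor.nanomonsv.result.txt") in place of the StringIO object.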

Single breakend result ({tumor_prefix}.nanomonsv.sbnd.result.txt)

Generated by default. Use --no_single_bnd to disable.

Column	Description
Chr_1	Chromosome of the breakpoint
Pos_1	Coordinate of the breakpoint
Dir_1	Direction of the breakpoint
Contig	Assembled contig sequence at the breakpoint
SV_ID	Identifier of the single breakend
Checked_Read_Num_Tumor	Total reads in the tumor used for validation alignment
Supporting_Read_Num_Tumor	Variant reads in the tumor from validation alignment
Supporting_Read_Num_Tumor_HP	Haplotype counts of variant reads (HP1,HP2,unphased)
Checked_Read_Num_Control	Total reads in the matched control used for validation alignment
Supporting_Read_Num_Control	Variant reads in the matched control from validation alignment
Is_Filter	Filter status (`PASS`, `Simple_repeat`, `Canonical_SV_overlap`, or combinations)

A VCF format file ({tumor_prefix}.nanomonsv.sbnd.result.vcf) is also generated, using VCF single breakend notation (e.g., N. or .N in ALT field with SVTYPE=BND).
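
In VCF single breakend notation, the position of the dot in ALT indicates on which side of the reference base the unresolved sequence lies. The sketch below classifies an ALT string accordingly. The function name and return labels are hypothetical, and how these sides map to the Dir_1 column is not covered here.

```python
def single_breakend_side(alt):
    """Classify a VCF ALT string written in single breakend notation.

    Returns "after_base" for forms like "N." (novel sequence follows the base),
    "before_base" for forms like ".N" (novel sequence precedes the base),
    and None when ALT is not a single breakend.
    """
    if len(alt) < 2:
        return None
    starts = alt.startswith(".")
    ends = alt.endswith(".")
    if ends and not starts:
        return "after_base"
    if starts and not ends:
        return "before_base"
    return None


if __name__ == "__main__":
    for alt in ["N.", ".N", "<DEL>", "."]:
        print(alt, single_breakend_side(alt))
```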

insert_classify result

Column	Description
Insert_Type	Type of insertion (Solo_L1, Partnered_L1, Orphan_L1, Alu, SVA, PSD)
Is_Inversion	Inverted form for Solo LINE1 (Simple, Inverted, Other)
L1_Ratio	Match rate with LINE1 sequences
Alu_Ratio	Match rate with Alu sequences
SVA_Ratio	Match rate with SVA sequences
RMSK_Info	Summary information of RepeatMasker
Alignment_Info	Alignment information to the human genome
Inserted_Pos	Inserted position (for tandem duplication or nested LINE1 transduction)
Is_PolyA_T	Extracted poly-A or poly-T sequences
Target_Site_Duplication	Nucleotides of target site duplications
L1_Source_Info	Inferred source site of LINE1 transduction
PSD_Gene	Processed pseudogene name
PSD_Overlap_Ratio	Match rate with the pseudogene
PSD_Exon_Num	Number of pseudogene exons matched with the inserted sequence
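
A common first look at the insert_classify output is a tally of insertion classes. This sketch assumes the output is tab-delimited with a header row that includes the Insert_Type column. The function name and the example rows are made up.

```python
import csv
import io
from collections import Counter


def count_insert_types(handle):
    """Count rows per Insert_Type value in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Insert_Type"] for row in reader)


# Made-up rows for illustration; real output has many more columns.
EXAMPLE_CLASSIFY_TEXT = (
    "SV_ID\tInsert_Type\n"
    "i_1\tSolo_L1\n"
    "i_2\tAlu\n"
    "i_3\tSolo_L1\n"
    "i_4\tPartnered_L1\n"
)

if __name__ == "__main__":
    counts = count_insert_types(io.StringIO(EXAMPLE_CLASSIFY_TEXT))
    for insert_type, n in counts.most_common():
        print(insert_type, n)
```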

Control Panel

We strongly recommend using a control panel for filtering common SVs and sequencing noise. Pre-built control panels are available at Zenodo. You can also create your own from your sequencing data using merge_control.

Pre-built control panels

Panel	Samples	Reference	Source
1000G ONT Vienna	1,019	GRCh38 / CHM13	1000 Genomes Project
HPRC Nanopore (Guppy v4)	~30	GRCh38 / CHM13	HPRC release 1
HPRC Nanopore (Guppy v6)	~40	GRCh38 / CHM13	HPRC release 1
HPRC PacBio HiFi	~30	GRCh38 / CHM13	HPRC release 1

For ONT data, the 1000G ONT Vienna panel (1,019 samples) is recommended for its large sample size. In general, choose a control panel that matches your data as closely as possible in platform and basecalling quality. When unsure, a noisier panel (e.g., Guppy v4) tends to be more versatile.

When you use these control panels and publish, please cite:

Liao et al., Nature, 2023 (doi:10.1038/s41586-023-05896-x) for HPRC panels
Schloissnig et al., Nature, 2025 (doi:10.1038/s41586-025-09290-7) for 1000G ONT Vienna panels

Example Data

The Oxford Nanopore sequencing data used in the paper are available through the public sequence repository (BioProject ID: PRJDB10898):

COLO829: tumor, control
H2009: tumor, control
HCC1954: tumor, control

Results of nanomonsv for the above data are available here. Please kindly cite the NAR paper when you use these data.

See the tutorial wiki page for an example workflow on analyzing the COLO829 sample.

Citation

Shiraishi et al., Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Nucleic Acids Research, 2023, [link].