nanomonsv is software for detecting somatic structural variations from paired (tumor and matched control) cancer genome sequence data. It is described in the following paper; if you use nanomonsv or any resource in this repository, please cite it.
Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Shiraishi et al., Nucleic Acids Research, 2023, [link].
Key features:
Single-nucleotide breakpoint resolution using consensus sequences from long-read alignments.
LINE1 insertion classification: Distinguishes Solo L1, Partnered L1 (transduction), and Orphan L1 (orphan transduction), and identifies source L1 elements.
Two detection modules: a Canonical SV module for standard SVs with high precision and recall, and a Single breakend SV module for complex SVs involving highly repetitive sequences (centromeres, LINE1, viruses) that can only be identified with long reads.
Haplotype-aware (v0.9.0+): Reports per-haplotype supporting read counts (HP1, HP2, unphased) using phasing information from the input BAM file. This enables phasing of SV breakpoints.
Installation
pip install nanomonsv
You can also install via conda (bioconda channel). Occasionally the conda releases lag behind PyPI.
S3 paths (e.g., s3://bucket/path.bam) are supported via pip install nanomonsv[s3]. Note that network latency may significantly slow down processing compared to local files.
Quick Start
Prepare the reference genome (here, GDC GRCh38 reference genome).
wget https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 -O GRCh38.d1.vd1.fa.tar.gz
tar xvf GRCh38.d1.vd1.fa.tar.gz
(Optional but highly recommended) Download a control panel from Zenodo. See Control Panel for available panels and how to choose.
wget https://zenodo.org/api/records/11470934/files/1kg-ont-vienna_hg38_no_singleton.tar.gz/content \
-O 1kg-ont-vienna_hg38_no_singleton.tar.gz
tar xvf 1kg-ont-vienna_hg38_no_singleton.tar.gz
(Optional but highly recommended) Download a simple repeat BED file. Pre-built files for GRCh38 and CHM13 are included in this repository under resource/simple_repeats.
alignment_file: Path to input indexed BAM or CRAM file
output_prefix: Output file prefix
--reference_fasta: Path to reference genome (recommended for CRAM files)
After successful completion, you will find:
{output_prefix}.deletion.sorted.bed.gz, {output_prefix}.insertion.sorted.bed.gz, {output_prefix}.rearrangement.sorted.bedpe.gz, {output_prefix}.bp_info.sorted.bed.gz and their indexes (.tbi files).
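As a rough illustration of the file naming above, the following Python sketch derives the expected parse outputs from an output prefix and reports which of them are missing. The helper names (`expected_parse_outputs`, `missing_parse_outputs`) and the example prefix `output/tumor` are hypothetical and not part of nanomonsv; only the file suffixes come from the list above.

```python
from pathlib import Path

# Suffixes of the files listed above; each one also has a .tbi index.
PARSE_SUFFIXES = (
    ".deletion.sorted.bed.gz",
    ".insertion.sorted.bed.gz",
    ".rearrangement.sorted.bedpe.gz",
    ".bp_info.sorted.bed.gz",
)


def expected_parse_outputs(output_prefix: str) -> list[str]:
    """Return the data files and their .tbi indexes for a given prefix."""
    paths = []
    for suffix in PARSE_SUFFIXES:
        data_file = output_prefix + suffix
        paths.append(data_file)
        paths.append(data_file + ".tbi")
    return paths


def missing_parse_outputs(output_prefix: str) -> list[str]:
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_parse_outputs(output_prefix) if not Path(p).exists()]


if __name__ == "__main__":
    # "output/tumor" is a placeholder prefix, not a path used by the tool.
    for path in missing_parse_outputs("output/tumor"):
        print("missing:", path)
```

A check like this can be run between the parse step and later steps to catch an incomplete run early.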
LINE1_db: Path to LINE1 database. Use the files in resource/LINE1_db
validate
Validates candidate SVs by alignment of tumor and matched control BAM files.
This may be helpful for evaluating SV tools on short-read platforms
when pairs of short-read and long-read sequencing data are available.
sv_list_file: SV candidate list file (only Chr_1 to Inserted_Seq columns are necessary)
output_file: Path to the output file
reference.fa: Path to the reference genome
Output Format
Canonical SV result ({tumor_prefix}.nanomonsv.result.txt)
| Column | Description |
| --- | --- |
| Chr_1 | Chromosome for the 1st breakpoint |
| Pos_1 | Coordinate for the 1st breakpoint |
| Dir_1 | Direction of the 1st breakpoint |
| Chr_2 | Chromosome for the 2nd breakpoint |
| Pos_2 | Coordinate for the 2nd breakpoint |
| Dir_2 | Direction of the 2nd breakpoint |
| Inserted_Seq | Inserted nucleotides within the breakpoints (--- if none) |
| SV_ID | Identifier of SVs |
| Checked_Read_Num_Tumor | Total reads in the tumor used for validation alignment |
| Supporting_Read_Num_Tumor | Variant reads in the tumor from validation alignment |
| Supporting_Read_Num_Tumor_HP_BP1 | Haplotype counts of variant reads at breakpoint 1 (HP1,HP2,unphased) |
| Supporting_Read_Num_Tumor_HP_BP2 | Haplotype counts of variant reads at breakpoint 2 (HP1,HP2,unphased) |
| Checked_Read_Num_Control | Total reads in the matched control used for validation alignment |
| Supporting_Read_Num_Control | Variant reads in the matched control from validation alignment |
| Is_Filter | Filter status (PASS or filter reason such as Simple_repeat) |
A VCF format file ({tumor_prefix}.nanomonsv.result.vcf) is also generated.
See the wiki page for details on filtering.
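The canonical SV columns can be consumed with a few lines of Python. The sketch below keeps only rows whose Is_Filter is PASS and splits the haplotype field into its three counts. It assumes the result file is tab-delimited with a header line naming the columns; the function names and the two example rows are made up for illustration and are not real nanomonsv output.

```python
import csv
import io

COLUMNS = [
    "Chr_1", "Pos_1", "Dir_1", "Chr_2", "Pos_2", "Dir_2", "Inserted_Seq", "SV_ID",
    "Checked_Read_Num_Tumor", "Supporting_Read_Num_Tumor",
    "Supporting_Read_Num_Tumor_HP_BP1", "Supporting_Read_Num_Tumor_HP_BP2",
    "Checked_Read_Num_Control", "Supporting_Read_Num_Control", "Is_Filter",
]

# Made-up rows, used only to exercise the functions below.
ROWS = [
    ["chr1", "1000", "+", "chr1", "5000", "-", "---", "sv_1",
     "30", "12", "10,0,2", "9,1,2", "25", "0", "PASS"],
    ["chr2", "2000", "+", "chr2", "2100", "-", "---", "sv_2",
     "28", "5", "0,0,5", "0,0,5", "27", "0", "Simple_repeat"],
]
EXAMPLE = "\n".join("\t".join(r) for r in [COLUMNS] + ROWS) + "\n"


def parse_hp_counts(field: str) -> tuple[int, int, int]:
    """Split an 'HP1,HP2,unphased' field into three integers."""
    hp1, hp2, unphased = (int(x) for x in field.split(","))
    return hp1, hp2, unphased


def passing_svs(handle):
    """Yield rows (dicts keyed by column name) whose Is_Filter is PASS."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Is_Filter"] == "PASS":
            yield row


kept = list(passing_svs(io.StringIO(EXAMPLE)))
for row in kept:
    hp1, hp2, unphased = parse_hp_counts(row["Supporting_Read_Num_Tumor_HP_BP1"])
    print(row["SV_ID"], "BP1 haplotype counts:", hp1, hp2, unphased)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.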
Single breakend result ({tumor_prefix}.nanomonsv.sbnd.result.txt)
Generated by default. Use --no_single_bnd to disable.
| Column | Description |
| --- | --- |
| Chr_1 | Chromosome of the breakpoint |
| Pos_1 | Coordinate of the breakpoint |
| Dir_1 | Direction of the breakpoint |
| Contig | Assembled contig sequence at the breakpoint |
| SV_ID | Identifier of the single breakend |
| Checked_Read_Num_Tumor | Total reads in the tumor used for validation alignment |
| Supporting_Read_Num_Tumor | Variant reads in the tumor from validation alignment |
| Supporting_Read_Num_Tumor_HP | Haplotype counts of variant reads (HP1,HP2,unphased) |
| Checked_Read_Num_Control | Total reads in the matched control used for validation alignment |
| Supporting_Read_Num_Control | Variant reads in the matched control from validation alignment |
| Is_Filter | Filter status (PASS, Simple_repeat, Canonical_SV_overlap, or combinations) |
A VCF format file ({tumor_prefix}.nanomonsv.sbnd.result.vcf) is also generated,
using VCF single breakend notation (e.g., N. or .N in ALT field with SVTYPE=BND).
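To make the single breakend notation concrete, here is a small Python sketch that tells the two ALT forms apart. The function name `single_breakend_side` and its return labels are hypothetical; the sketch only inspects where the dot sits in the ALT string and makes no claim about how that maps to the Dir_1 column.

```python
from typing import Optional


def single_breakend_side(alt: str) -> Optional[str]:
    """Classify a VCF ALT value written in single breakend notation.

    Returns "dot_after" for the form 'N.', "dot_before" for '.N', and
    None for anything else (including the missing value '.').
    """
    if len(alt) < 2:
        return None
    if alt.endswith(".") and not alt.startswith("."):
        return "dot_after"
    if alt.startswith(".") and not alt.endswith("."):
        return "dot_before"
    return None


# The last two values are not single breakends: a paired-breakend ALT
# and the VCF missing value.
for alt in ["N.", ".N", "N[chr2:100[", "."]:
    print(repr(alt), "->", single_breakend_side(alt))
```

Paired breakend ALT values (with brackets) fall through to `None`, so the same function can be used to separate single breakends from other BND records.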
insert_classify result
| Column | Description |
| --- | --- |
| Insert_Type | Type of insertion (Solo_L1, Partnered_L1, Orphan_L1, Alu, SVA, PSD) |
| Is_Inversion | Inverted form for Solo LINE1 (Simple, Inverted, Other) |
| L1_Ratio | Match rate with LINE1 sequences |
| Alu_Ratio | Match rate with Alu sequences |
| SVA_Ratio | Match rate with SVA sequences |
| RMSK_Info | Summary information of RepeatMasker |
| Alignment_Info | Alignment information to the human genome |
| Inserted_Pos | Inserted position (for tandem duplication or nested LINE1 transduction) |
| Is_PolyA_T | Extracted poly-A or poly-T sequences |
| Target_Site_Duplication | Nucleotides of target site duplications |
| L1_Source_Info | Inferred source site of LINE1 transduction |
| PSD_Gene | Processed pseudogene name |
| PSD_Overlap_Ratio | Match rate with the pseudogene |
| PSD_Exon_Num | Number of pseudogene exons matched with the inserted sequence |
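A quick way to summarize an insert_classify result is to tally the Insert_Type column. The Python sketch below assumes the output is tab-delimited with a header line containing the column names listed above; the function name `count_insert_types` and the three example rows (which show only a subset of the columns) are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows with a subset of columns, used only to exercise the function.
EXAMPLE = (
    "Insert_Type\tIs_Inversion\tL1_Ratio\n"
    "Solo_L1\tSimple\t0.98\n"
    "Alu\t---\t0.01\n"
    "Solo_L1\tInverted\t0.95\n"
)


def count_insert_types(handle) -> Counter:
    """Count rows per Insert_Type in a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Insert_Type"] for row in reader)


counts = count_insert_types(io.StringIO(EXAMPLE))
for insert_type, n in counts.most_common():
    print(insert_type, n)
```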
Control Panel
We strongly recommend using a control panel for filtering common SVs and sequencing noise.
Pre-built control panels are available at Zenodo.
You can also create your own from your sequencing data using merge_control.
For ONT data, the 1000G ONT Vienna panel (1,019 samples) is recommended for its large sample size.
We recommend using a control panel that matches your data as closely as possible in platform and basecall quality.
When unsure, a noisier panel (e.g., Guppy v4) tends to be more versatile.
When you use these control panels and publish, please cite:
Results of nanomonsv for the above data are available here.
Please kindly cite the NAR paper when you use these data.
See the tutorial wiki page for an example workflow on analyzing the COLO829 sample.
Citation
Shiraishi et al., Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Nucleic Acids Research, 2023, [link].
Installation
Dependencies
Python >=3.9, pysam, numpy, parasail
An additional optional dependency is needed for --use_mafft.
Input requirements
CRAM input: --reference_fasta is recommended.
Quick Start
Parse putative SV supporting reads.
Get the final result.
You will find the result file tumor.nanomonsv.result.txt.
Usage
parse
Parses all supporting reads of putative somatic SVs.
get
Gets the SV result from parsed supporting reads.
Recommended options
--control_prefix / --control_bam
--control_panel_prefix
--simple_repeat_bed
--use_mafft
--no_single_bnd
--processes N
Quality presets
--qv10
--qv15
--qv20
--qv25
merge_control
Merges non-matched control panel supporting reads obtained by parse (the output prefixes generated at the parse stage).
insert_classify
Classifies long insertions into mobile element insertions (LINE1, Alu, SVA, processed pseudogene).
Example Data
The Oxford Nanopore sequencing data used in the paper are available through the public sequence repository (BioProject ID: PRJDB10898).