nanomonsv

License: GPL v3

Introduction

nanomonsv is a tool for detecting somatic structural variations (SVs) from paired (tumor and matched control) cancer genome sequence data. It is described in the following paper; if you use nanomonsv or any resource in this repository, please cite it.

Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Shiraishi et al., Nucleic Acids Research, 2023, [link].

Key features:

  • Single-nucleotide breakpoint resolution using consensus sequences from long-read alignments.
  • LINE1 insertion classification: Distinguishes Solo L1, Partnered L1 (transduction), and Orphan L1 (orphan transduction), and identifies source L1 elements.
  • Two detection modules: a Canonical SV module for standard SVs with high precision and recall, and a Single breakend SV module for complex SVs involving highly repetitive sequences (centromeres, LINE1, viruses) that can only be identified with long reads.
  • Haplotype-aware (v0.9.0+): Reports per-haplotype supporting read counts (HP1, HP2, unphased) using phasing information from the input BAM file. This enables phasing of SV breakpoints.

Installation

pip install nanomonsv

You can also install via conda (bioconda channel). Occasionally the conda releases lag behind PyPI.

conda create -n nanomonsv -c conda-forge -c bioconda nanomonsv

Dependencies

| Tool | Required for | Notes |
|---|---|---|
| htslib (tabix, bgzip) | parse, get | Must be in PATH |
| racon | get | Consensus generation (default) |
| mafft | get (--use_mafft) | For backward compatibility |
| bwa | insert_classify | |
| minimap2 | insert_classify | |
| bedtools | insert_classify | |
| RepeatMasker | insert_classify | |

Python >=3.9, pysam, numpy, parasail
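
The dependency table can be turned into a quick pre-flight check before running a subcommand. The following is a minimal sketch and not part of nanomonsv: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, the executable names are assumed to match your installation, and `--use_mafft` is assumed to replace racon rather than add to it.

```python
import shutil

# Executables per subcommand, transcribed from the dependency table.
# The binary names are assumptions; adjust them to your installation.
REQUIRED_TOOLS = {
    "parse": ["tabix", "bgzip"],
    "get": ["tabix", "bgzip", "racon"],
    "insert_classify": ["bwa", "minimap2", "bedtools", "RepeatMasker"],
}


def missing_tools(subcommand, use_mafft=False, which=shutil.which):
    """Return the executables for `subcommand` that are not found in PATH."""
    tools = list(REQUIRED_TOOLS[subcommand])
    if subcommand == "get" and use_mafft:
        # Assumption: with --use_mafft, mafft is needed instead of racon.
        tools[tools.index("racon")] = "mafft"
    return [tool for tool in tools if which(tool) is None]


for name in REQUIRED_TOOLS:
    print(name, "missing:", missing_tools(name) or "none")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` searches PATH.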

Input requirements

  • BAM or CRAM file aligned by minimap2
  • For CRAM files, specify --reference_fasta
  • S3 paths (e.g., s3://bucket/path.bam) are supported via pip install nanomonsv[s3]. Note that network latency may significantly slow down processing compared to local files.
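
As an illustration of the CRAM rule above, the sketch below builds a `nanomonsv parse` argument list and refuses CRAM input without a reference. `build_parse_command` is a hypothetical helper, and treating the reference as mandatory for CRAM is a deliberately strict choice made here, not a statement about how the tool itself behaves.

```python
from pathlib import Path


def build_parse_command(alignment_file, output_prefix, reference_fasta=None):
    """Build the argv list for `nanomonsv parse`, adding --reference_fasta when given."""
    is_cram = Path(str(alignment_file)).suffix.lower() == ".cram"
    if is_cram and reference_fasta is None:
        # Stricter than necessary on purpose: fail early instead of at runtime.
        raise ValueError("CRAM input needs a reference FASTA (--reference_fasta)")
    cmd = ["nanomonsv", "parse"]
    if reference_fasta is not None:
        cmd += ["--reference_fasta", str(reference_fasta)]
    cmd += [str(alignment_file), str(output_prefix)]
    return cmd


print(build_parse_command("tumor.cram", "output/tumor", "GRCh38.d1.vd1.fa"))
```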

Quick Start

  1. Prepare the reference genome (here, GDC GRCh38 reference genome).

    wget https://api.gdc.cancer.gov/data/254f697d-310d-4d7d-a27b-27fbf767a834 -O GRCh38.d1.vd1.fa.tar.gz
    tar xvf GRCh38.d1.vd1.fa.tar.gz
  2. (Optional but highly recommended) Download a control panel from Zenodo. See Control Panel for available panels and how to choose.

    wget https://zenodo.org/api/records/11470934/files/1kg-ont-vienna_hg38_no_singleton.tar.gz/content \
     -O 1kg-ont-vienna_hg38_no_singleton.tar.gz
    tar xvf 1kg-ont-vienna_hg38_no_singleton.tar.gz
  3. (Optional but highly recommended) Download a simple repeat BED file. Pre-built files for GRCh38 and CHM13 are included in this repository under resource/simple_repeats.

  4. Parse putative SV supporting reads.

    nanomonsv parse tumor.bam output/tumor
    nanomonsv parse ctrl.bam output/ctrl
  5. Get the final result.

    nanomonsv get output/tumor tumor.bam GRCh38.d1.vd1.fa \
     --control_prefix output/ctrl --control_bam ctrl.bam \
     --control_panel_prefix 1kg-ont-vienna_hg38_no_singleton \
     --simple_repeat_bed resource/simple_repeats/human_GRCh38_simpleRepeat.bed.gz

You will find the result file at output/tumor.nanomonsv.result.txt, that is, {tumor_prefix}.nanomonsv.result.txt.
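
Steps 4 and 5 can also be driven from a script. The sketch below only assembles the three commands shown above and runs them in order; `quick_start_commands` and `run_all` are hypothetical helper names, and the `output/tumor` and `output/ctrl` prefixes simply mirror the example.

```python
import subprocess


def quick_start_commands(tumor_bam, ctrl_bam, reference, out_dir="output",
                         panel_prefix=None, simple_repeat_bed=None):
    """Return the parse, parse, get commands of steps 4-5 as argv lists."""
    tumor_prefix = f"{out_dir}/tumor"
    ctrl_prefix = f"{out_dir}/ctrl"
    get_cmd = ["nanomonsv", "get", tumor_prefix, tumor_bam, reference,
               "--control_prefix", ctrl_prefix, "--control_bam", ctrl_bam]
    if panel_prefix:
        get_cmd += ["--control_panel_prefix", panel_prefix]
    if simple_repeat_bed:
        get_cmd += ["--simple_repeat_bed", simple_repeat_bed]
    return [
        ["nanomonsv", "parse", tumor_bam, tumor_prefix],
        ["nanomonsv", "parse", ctrl_bam, ctrl_prefix],
        get_cmd,
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = quick_start_commands("tumor.bam", "ctrl.bam", "GRCh38.d1.vd1.fa")
for cmd in commands:
    print(" ".join(cmd))
# run_all(commands)  # uncomment once nanomonsv and the input files are in place
```

Both `parse` runs must finish before `get`, because `get` reads the files written under the tumor and control prefixes.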

Usage

parse

Parses all supporting reads of putative somatic SVs.

nanomonsv parse [-h] [--reference_fasta reference.fa] [--debug]
                [--split_alignment_check_margin SPLIT_ALIGNMENT_CHECK_MARGIN]
                [--minimum_breakpoint_ambiguity MINIMUM_BREAKPOINT_AMBIGUITY]
                alignment_file output_prefix
  • alignment_file: Path to input indexed BAM or CRAM file
  • output_prefix: Output file prefix
  • --reference_fasta: Path to reference genome (recommended for CRAM files)

After successful completion, you will find: {output_prefix}.deletion.sorted.bed.gz, {output_prefix}.insertion.sorted.bed.gz, {output_prefix}.rearrangement.sorted.bedpe.gz, {output_prefix}.bp_info.sorted.bed.gz and their indexes (.tbi files).
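
To confirm that `parse` completed, the file names listed above can be checked programmatically. This is a small sketch; `expected_parse_outputs` and `missing_parse_outputs` are hypothetical helpers, and the index name is assumed to be the data file name plus `.tbi`.

```python
from pathlib import Path

# Suffixes taken from the list of files that parse produces.
PARSE_SUFFIXES = [
    ".deletion.sorted.bed.gz",
    ".insertion.sorted.bed.gz",
    ".rearrangement.sorted.bedpe.gz",
    ".bp_info.sorted.bed.gz",
]


def expected_parse_outputs(output_prefix):
    """List the data files and their .tbi indexes expected for a prefix."""
    files = []
    for suffix in PARSE_SUFFIXES:
        data = f"{output_prefix}{suffix}"
        files += [data, data + ".tbi"]
    return files


def missing_parse_outputs(output_prefix):
    """Return the expected files that do not exist on disk."""
    return [f for f in expected_parse_outputs(output_prefix) if not Path(f).is_file()]


print(missing_parse_outputs("output/tumor"))
```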

get

Gets the SV result from parsed supporting reads.

nanomonsv get [-h] [--control_prefix CONTROL_PREFIX]
              [--control_bam CONTROL_BAM]
              [--control_panel_prefix CONTROL_PANEL_PREFIX]
              [--simple_repeat_bed SIMPLE_REPEAT_BED]
              [--min_tumor_variant_read_num MIN_TUMOR_VARIANT_READ_NUM]
              [--min_tumor_VAF MIN_TUMOR_VAF]
              [--max_control_variant_read_num MAX_CONTROL_VARIANT_READ_NUM]
              [--max_control_VAF MAX_CONTROL_VAF]
              [--cluster_margin_size CLUSTER_MARGIN_SIZE]
              [--median_mapQ_thres MEDIAN_MAPQ_THRES]
              [--max_overhang_size_thres MAX_OVERHANG_SIZE_THRES]
              [--var_read_min_mapq VAR_READ_MIN_MAPQ]
              [--qv10] [--qv15] [--qv20] [--qv25] [--use_mafft]
              [--no_single_bnd] [--processes PROCESSES]
              [--sort_option SORT_OPTION] [--max_memory_minimap2] [--debug]
              tumor_prefix tumor_bam reference.fa
  • tumor_prefix: Prefix to the tumor data set in the parse step
  • tumor_bam: Path to input indexed BAM file
  • reference.fa: Path to reference genome used for the alignment

| Option | Recommendation | Description |
|---|---|---|
| --control_prefix / --control_bam | Strongly recommended | Matched control for somatic filtering. We strongly recommend using matched control data whenever possible. |
| --control_panel_prefix | Recommended | Non-matched control panel (see Control Panel) |
| --simple_repeat_bed | Strongly recommended | Filter indels in simple repeats. BED files provided in resource/simple_repeats |
| --use_mafft | Not recommended | Use mafft instead of racon for consensus generation (for backward compatibility) |
| --no_single_bnd | Not recommended | Disable single breakend SV detection. See wiki |
| --processes N | Optional | Multi-processing mode |
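
The recommendations in the option table can be encoded as a simple review of a planned `get` invocation. The sketch below is illustrative only: `review_get_options` is a hypothetical helper that inspects an argument list and returns advisory notes; it does not reproduce any check performed by nanomonsv itself.

```python
def review_get_options(args):
    """Return advisory notes for a `nanomonsv get` argument list."""
    notes = []
    if not ("--control_prefix" in args and "--control_bam" in args):
        notes.append("no matched control: add --control_prefix and --control_bam if available")
    if "--control_panel_prefix" not in args:
        notes.append("no control panel: consider --control_panel_prefix")
    if "--simple_repeat_bed" not in args:
        notes.append("no simple repeat filter: consider --simple_repeat_bed")
    for flag in ("--use_mafft", "--no_single_bnd"):
        if flag in args:
            notes.append(f"{flag} is not recommended")
    return notes


planned = ["nanomonsv", "get", "output/tumor", "tumor.bam", "GRCh38.d1.vd1.fa", "--use_mafft"]
for note in review_get_options(planned):
    print("note:", note)
```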

Quality presets

| Preset | Recommended for |
|---|---|
| --qv10 | ONT data with median Q10 (e.g., Guppy before v5) |
| --qv15 | ONT data with median Q15 (e.g., Guppy v5/v6) |
| --qv20 | ONT data with median Q20+ (e.g., Dorado SUP, Q20+ chemistry) |
| --qv25 | PacBio HiFi data |
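
A preset can be picked from the platform and the median read quality of a run. The following sketch is one possible reading of the table: `choose_quality_preset` and the platform labels `"ont"` and `"pacbio_hifi"` are hypothetical, and the cut-offs between presets are assumptions, since the table names Q10, Q15 and Q20+ without stating where one preset ends and the next begins.

```python
def choose_quality_preset(platform, median_q=None):
    """Map a platform and median read quality to a quality preset flag."""
    if platform == "pacbio_hifi":
        return "--qv25"
    if platform != "ont":
        raise ValueError(f"unknown platform: {platform!r}")
    if median_q is None:
        raise ValueError("median_q is required for ONT data")
    # Assumed boundaries: round down to the nearest preset named in the table.
    if median_q >= 20:
        return "--qv20"
    if median_q >= 15:
        return "--qv15"
    return "--qv10"


print(choose_quality_preset("ont", median_q=17.5))
```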

merge_control

Merges non-matched control panel supporting reads obtained by parse.

nanomonsv merge_control [-h] prefix_list_file output_prefix
  • prefix_list_file: List of output_prefix generated at the parse stage
  • output_prefix: Prefix to the merged control supporting reads
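
Building your own panel means running `parse` on each control sample and then listing the resulting prefixes in a file. The sketch below writes such a file and assembles the `merge_control` command. The one-prefix-per-line layout is an assumption about the expected file format, and `format_prefix_list`, `write_prefix_list` and `merge_control_command` are hypothetical helpers.

```python
from pathlib import Path


def format_prefix_list(prefixes):
    """Render parse output prefixes one per line (assumed file layout)."""
    return "".join(f"{prefix}\n" for prefix in prefixes)


def write_prefix_list(prefixes, list_path):
    """Write the prefix list file and return its path as a string."""
    Path(list_path).write_text(format_prefix_list(prefixes))
    return str(list_path)


def merge_control_command(list_path, output_prefix):
    """Build the argv list for `nanomonsv merge_control`."""
    return ["nanomonsv", "merge_control", str(list_path), str(output_prefix)]


# Placeholder sample prefixes for illustration.
print(format_prefix_list(["panel/sample1", "panel/sample2"]), end="")
print(merge_control_command("prefix_list.txt", "panel/merged"))
```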

insert_classify

Classifies long insertions into mobile element insertions (LINE1, Alu, SVA, processed pseudogene).

nanomonsv insert_classify [-h] [--debug] sv_list_file output_file reference.fa gencode.gtf.gz LINE1_db

validate

Validates candidate SVs by alignment of tumor and matched control BAM files. This may be helpful for evaluating SV tools on short-read platforms when pairs of short-read and long-read sequencing data are available.

nanomonsv validate [-h] [--control_bam CONTROL_BAM]
                   [--var_read_min_mapq VAR_READ_MIN_MAPQ] [--debug]
                   sv_list_file tumor_bam output reference.fa
  • sv_list_file: SV candidate list file (only Chr_1 to Inserted_Seq columns are necessary)
  • tumor_bam: Path to the tumor BAM file
  • output: Path to the output file
  • reference.fa: Path to the reference genome

Output Format

Canonical SV result ({tumor_prefix}.nanomonsv.result.txt)

| Column | Description |
|---|---|
| Chr_1 | Chromosome for the 1st breakpoint |
| Pos_1 | Coordinate for the 1st breakpoint |
| Dir_1 | Direction of the 1st breakpoint |
| Chr_2 | Chromosome for the 2nd breakpoint |
| Pos_2 | Coordinate for the 2nd breakpoint |
| Dir_2 | Direction of the 2nd breakpoint |
| Inserted_Seq | Inserted nucleotides within the breakpoints (--- if none) |
| SV_ID | Identifier of SVs |
| Checked_Read_Num_Tumor | Total reads in the tumor used for validation alignment |
| Supporting_Read_Num_Tumor | Variant reads in the tumor from validation alignment |
| Supporting_Read_Num_Tumor_HP_BP1 | Haplotype counts of variant reads at breakpoint 1 (HP1,HP2,unphased) |
| Supporting_Read_Num_Tumor_HP_BP2 | Haplotype counts of variant reads at breakpoint 2 (HP1,HP2,unphased) |
| Checked_Read_Num_Control | Total reads in the matched control used for validation alignment |
| Supporting_Read_Num_Control | Variant reads in the matched control from validation alignment |
| Is_Filter | Filter status (PASS or filter reason such as Simple_repeat) |

A VCF format file ({tumor_prefix}.nanomonsv.result.vcf) is also generated. See the wiki page for details on filtering.
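
For downstream analysis, the result table can be loaded with the standard library. The sketch below rests on several assumptions: the file is tab-delimited with a header row matching the column names above, the haplotype fields are three comma-separated integers, and the supporting/checked read ratio is a reasonable stand-in for a variant read fraction. The rows in `EXAMPLE` are made up for illustration and are not real nanomonsv output; `parse_hp` and `read_results` are hypothetical helpers.

```python
import csv
import io

# Made-up rows for illustration only; not real nanomonsv output.
EXAMPLE = (
    "Chr_1\tPos_1\tDir_1\tChr_2\tPos_2\tDir_2\tInserted_Seq\tSV_ID\t"
    "Checked_Read_Num_Tumor\tSupporting_Read_Num_Tumor\t"
    "Supporting_Read_Num_Tumor_HP_BP1\tSupporting_Read_Num_Tumor_HP_BP2\t"
    "Checked_Read_Num_Control\tSupporting_Read_Num_Control\tIs_Filter\n"
    "chr1\t1000\t+\tchr1\t5000\t-\t---\tsv_1\t30\t12\t7,3,2\t6,4,2\t28\t0\tPASS\n"
    "chr2\t2000\t+\tchr2\t2050\t-\t---\tsv_2\t20\t5\t0,0,5\t0,0,5\t25\t0\tSimple_repeat\n"
)


def parse_hp(field):
    """Split an 'HP1,HP2,unphased' field into a dict of integer counts."""
    hp1, hp2, unphased = (int(x) for x in field.split(","))
    return {"HP1": hp1, "HP2": hp2, "unphased": unphased}


def read_results(handle):
    """Return one dict per SV, adding a read fraction and parsed haplotype counts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        checked = int(row["Checked_Read_Num_Tumor"])
        supporting = int(row["Supporting_Read_Num_Tumor"])
        row["Tumor_Read_Fraction"] = supporting / checked if checked else 0.0
        row["HP_BP1"] = parse_hp(row["Supporting_Read_Num_Tumor_HP_BP1"])
        row["HP_BP2"] = parse_hp(row["Supporting_Read_Num_Tumor_HP_BP2"])
        rows.append(row)
    return rows


passing = [r for r in read_results(io.StringIO(EXAMPLE)) if r["Is_Filter"] == "PASS"]
for r in passing:
    print(r["SV_ID"], round(r["Tumor_Read_Fraction"], 2), r["HP_BP1"])
```

To read a real file, replace `io.StringIO(EXAMPLE)` with an open file handle for `{tumor_prefix}.nanomonsv.result.txt`.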

Single breakend result ({tumor_prefix}.nanomonsv.sbnd.result.txt)

Generated by default. Use --no_single_bnd to disable.

| Column | Description |
|---|---|
| Chr_1 | Chromosome of the breakpoint |
| Pos_1 | Coordinate of the breakpoint |
| Dir_1 | Direction of the breakpoint |
| Contig | Assembled contig sequence at the breakpoint |
| SV_ID | Identifier of the single breakend |
| Checked_Read_Num_Tumor | Total reads in the tumor used for validation alignment |
| Supporting_Read_Num_Tumor | Variant reads in the tumor from validation alignment |
| Supporting_Read_Num_Tumor_HP | Haplotype counts of variant reads (HP1,HP2,unphased) |
| Checked_Read_Num_Control | Total reads in the matched control used for validation alignment |
| Supporting_Read_Num_Control | Variant reads in the matched control from validation alignment |
| Is_Filter | Filter status (PASS, Simple_repeat, Canonical_SV_overlap, or combinations) |

A VCF format file ({tumor_prefix}.nanomonsv.sbnd.result.vcf) is also generated, using VCF single breakend notation (e.g., N. or .N in ALT field with SVTYPE=BND).
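
When reading the single breakend VCF, the two ALT shapes mentioned above can be told apart by where the `.` placeholder sits. The sketch below does only that; `single_breakend_side` is a hypothetical helper, and it makes no claim about how either shape maps to the Dir_1 column.

```python
def single_breakend_side(alt):
    """Report which end of a single breakend ALT string holds the '.' placeholder.

    Returns "dot_last" for forms like "N.", "dot_first" for forms like ".N",
    and None when the string is not a single breakend in that notation.
    """
    starts = alt.startswith(".")
    ends = alt.endswith(".")
    if starts == ends:
        # Either no placeholder at all, or a bare "." (missing value).
        return None
    return "dot_first" if starts else "dot_last"


for alt in ("N.", ".N", "N", "."):
    print(repr(alt), single_breakend_side(alt))
```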

insert_classify result

| Column | Description |
|---|---|
| Insert_Type | Type of insertion (Solo_L1, Partnered_L1, Orphan_L1, Alu, SVA, PSD) |
| Is_Inversion | Inverted form for Solo LINE1 (Simple, Inverted, Other) |
| L1_Ratio | Match rate with LINE1 sequences |
| Alu_Ratio | Match rate with Alu sequences |
| SVA_Ratio | Match rate with SVA sequences |
| RMSK_Info | Summary information of RepeatMasker |
| Alignment_Info | Alignment information to the human genome |
| Inserted_Pos | Inserted position (for tandem duplication or nested LINE1 transduction) |
| Is_PolyA_T | Extracted poly-A or poly-T sequences |
| Target_Site_Duplication | Nucleotides of target site duplications |
| L1_Source_Info | Inferred source site of LINE1 transduction |
| PSD_Gene | Processed pseudogene name |
| PSD_Overlap_Ratio | Match rate with the pseudogene |
| PSD_Exon_Num | Number of pseudogene exons matched with the inserted sequence |
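
A common first look at this output is a tally of insertion classes. The sketch below counts `Insert_Type` values and sums the three LINE1 categories, assuming a tab-delimited file with a header row. The single-column `EXAMPLE` is made up for illustration (real output has more columns), and `count_insert_types` and `line1_total` are hypothetical helpers.

```python
import csv
import io
from collections import Counter

# The three LINE1 categories listed under Insert_Type.
L1_TYPES = ("Solo_L1", "Partnered_L1", "Orphan_L1")

# Made-up excerpt for illustration only.
EXAMPLE = "Insert_Type\nSolo_L1\nAlu\nOrphan_L1\nSolo_L1\n"


def count_insert_types(handle):
    """Count rows per Insert_Type in a tab-delimited table with a header row."""
    return Counter(row["Insert_Type"] for row in csv.DictReader(handle, delimiter="\t"))


def line1_total(counts):
    """Sum the counts of all LINE1-derived insertion types."""
    return sum(counts[t] for t in L1_TYPES)


counts = count_insert_types(io.StringIO(EXAMPLE))
print(dict(counts), "LINE1 total:", line1_total(counts))
```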

Control Panel

We strongly recommend using a control panel to filter out common SVs and sequencing noise. Pre-built control panels are available at Zenodo. You can also create your own from your sequencing data using merge_control.

Pre-built control panels

| Panel | Samples | Reference | Source |
|---|---|---|---|
| 1000G ONT Vienna | 1,019 | GRCh38 / CHM13 | 1000 Genomes Project |
| HPRC Nanopore (Guppy v4) | ~30 | GRCh38 / CHM13 | HPRC release 1 |
| HPRC Nanopore (Guppy v6) | ~40 | GRCh38 / CHM13 | HPRC release 1 |
| HPRC PacBio HiFi | ~30 | GRCh38 / CHM13 | HPRC release 1 |

For ONT data, the 1000G ONT Vienna panel (1,019 samples) is recommended for its large sample size. We recommend using a control panel as close as possible in platform and basecall quality. When unsure, a noisier panel (e.g., Guppy v4) tends to be more versatile.

When you use these control panels and publish, please cite:

Example Data

The Oxford Nanopore Sequencing data used in the paper is available through the public sequence repository (BioProject ID: PRJDB10898):

Results of nanomonsv for the above data are available here. Please kindly cite the NAR paper when you use these data.

See the tutorial wiki page for an example workflow on analyzing the COLO829 sample.

Citation

Shiraishi et al., Precise characterization of somatic complex structural variations from tumor/control paired long-read sequencing data with nanomonsv, Nucleic Acids Research, 2023, [link].
