NSCCN/bioconductor-seq.hotspot: a bioinformatics tool for analyzing mutation hotspot regions in DNA sequences

Introduction
Installation & Setup
Dataset Formatting
Algorithm
Generation of Amplicon Pool
Forward Binning
Comprehensive Binning
Choosing between Forward and Comprehensive Binning Methods
Summary of Output Data

Introduction to seq.hotSPOT

seq.hotSPOT provides a resource for designing effective sequencing panels that improve mutation capture efficacy in ultradeep sequencing projects. Using SNV datasets, the package designs custom panels for any tissue of interest and identifies the genomic regions likely to contain the most mutations. Efficient targeted sequencing panels allow researchers to study mutation burden in tissues at high depth without the cost of whole-exome or whole-genome sequencing. The tool was developed to build high-depth sequencing panels for studying low-frequency clonal mutations in clinically normal and cancerous tissues.

Installation & Setup

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("seq.hotSPOT")

library(seq.hotSPOT)

Formatting of Input Data

The mutation dataset should include two columns containing the chromosome and genomic position of each mutation. The columns should be named “chr” and “pos” respectively. Optionally, the gene names for each mutation may be included under a column named “gene”.
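As a minimal sketch of the layout described above, a data frame like the following would have the required columns. The values and the name `example_mutations` are invented for illustration and are not shipped with the package. Whether chromosomes should be stored as numbers or as strings is not stated here, so the numeric encoding is an assumption.

```r
# Hypothetical toy input; all values are made up for illustration only.
example_mutations <- data.frame(
  chr  = c(1, 1, 17, 17),                          # chromosome of each mutation
  pos  = c(1000, 1040, 5000, 5210),                # genomic position
  gene = c("GENE_A", "GENE_A", "GENE_B", "GENE_B"), # optional column
  stringsAsFactors = FALSE
)

# Check that the two required columns are present.
stopifnot(all(c("chr", "pos") %in% colnames(example_mutations)))
```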

Loading example data:

data("mutation_data")

Overview of hotSPOT Algorithm

Generation of Amplicon Pool

This algorithm searches the mutational dataset (input) for mutational hotspot regions on each chromosome:

Starting at the mutation with the lowest chromosomal position (the primary mutation), the algorithm uses a modified rank and recovery system to search for the closest neighboring mutation.
If the neighboring mutation is less than one amplicon length away from the primary mutation, it is included in the hotspot region.

This rank and recovery step is repeated, integrating mutations into the hotspot region, until the neighboring mutation is at least one amplicon length away from the primary mutation.
At that point, incorporation into the hotspot region halts.
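The grouping step can be illustrated with a short base R sketch. It is not the package's implementation: `group_hotspots` is a hypothetical helper, and it assumes the distance is measured from the most recently incorporated mutation, so that a region can grow beyond one amplicon. If the distance were always measured from the first mutation of the region, no region could exceed one amplicon in length.

```r
# Hypothetical helper: group the positions on one chromosome into hotspot regions.
group_hotspots <- function(pos, amp) {
  pos <- sort(pos)
  # A gap of `amp` or more to the previous mutation starts a new region.
  region_id <- cumsum(c(TRUE, diff(pos) >= amp))
  unname(split(pos, region_id))
}

# Toy positions: the first four mutations chain into one region, the last is isolated.
regions <- group_hotspots(c(100, 130, 180, 260, 900), amp = 100)
regions
```

In this toy input the first region spans 160 bp, which is larger than one 100 bp amplicon, so it would go on to the multi-amplicon step.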

A hotspot that spans no more than one amplicon length, from its lowest to its highest mutation position, is covered by a single amplicon, which is added to the amplicon pool with a unique ID.

The center of these single amplicons is then defined by the weighted distribution of mutations.
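One possible reading of "weighted distribution" is the mean of all mutation positions, so that positions mutated more often pull the center toward them. The sketch below follows that reading. `center_amplicon` is a hypothetical helper, and the clamping step is an added assumption so that every mutation in the hotspot stays inside the amplicon.

```r
# Hypothetical helper: place one amplicon of length `amp` around a small hotspot.
center_amplicon <- function(pos, amp) {
  center <- round(mean(pos))        # repeated positions carry more weight
  lower <- center - amp %/% 2
  # Clamp so the amplicon still covers the lowest and highest mutation.
  lower <- min(max(lower, max(pos) - amp + 1), min(pos))
  c(lower = lower, upper = lower + amp - 1)
}

center_amplicon(c(100, 100, 100, 140), amp = 100)
```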

For all hotspots larger than one amplicon, the algorithm examines 5 potential amplicons at each covered mutation in the hotspot:

one amplicon directly upstream of the primary mutation
one amplicon directly downstream of the primary mutation
one amplicon including the mutation at the end of the read and base pairs (amplicon length - 1) upstream
one amplicon including the mutation at the beginning of the read and base pairs (amplicon length - 1) downstream
one amplicon with the mutation directly in the center.

All amplicons generated for each hotspot region of interest are assigned a unique ID and added to the amplicon pool.
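The five candidate placements can be written out as coordinate arithmetic. This is a sketch under assumptions: `candidate_amplicons` is a hypothetical helper, coordinates are treated as inclusive, and "directly upstream" and "directly downstream" are read as amplicons that end or start one base pair away from the mutation.

```r
# Hypothetical helper: the five candidate amplicons around one mutation position.
candidate_amplicons <- function(pos, amp) {
  half <- amp %/% 2
  data.frame(
    type  = c("upstream", "downstream", "mutation_at_end",
              "mutation_at_start", "centered"),
    lower = c(pos - amp, pos + 1, pos - amp + 1, pos, pos - half),
    upper = c(pos - 1, pos + amp, pos, pos + amp - 1, pos - half + amp - 1),
    stringsAsFactors = FALSE
  )
}

cands <- candidate_amplicons(pos = 1000, amp = 100)
cands
```

Every row has the same length (`upper - lower + 1` equals `amp`), so the candidates differ only in where the mutation sits relative to the amplicon.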

Running amplicon finder:

amps <- amp_pool(data = mutation_data, amp = 100)

Forward Selection Sequencing Panel Identifier

Amplicons covering hotspots less than or equal to one amplicon in length are added to the final sequencing panel dataset.

For amplicons covering larger hotspot regions, the algorithm uses a forward selection method to determine the optimal combination of amplicons to use in the sequencing panel:

the algorithm first identifies the amplicon containing the highest number of mutations
the algorithm then identifies the next amplicon, which contains the highest number of new mutations.
this process continues until all mutations are covered by at least one amplicon
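The three steps above describe a greedy selection, which can be sketched in base R as follows. `forward_select` and the column names `id`, `lower` and `upper` are hypothetical and do not reflect the package's internal data structures.

```r
# Hypothetical helper: greedy forward selection of amplicons for one hotspot.
forward_select <- function(amplicons, pos) {
  uncovered <- rep(TRUE, length(pos))
  chosen <- integer(0)
  while (any(uncovered)) {
    # Count the still-uncovered mutations inside each candidate amplicon.
    gain <- vapply(seq_len(nrow(amplicons)), function(i) {
      sum(uncovered & pos >= amplicons$lower[i] & pos <= amplicons$upper[i])
    }, integer(1))
    if (max(gain) == 0) break   # remaining mutations cannot be covered
    best <- which.max(gain)     # ties go to the first amplicon
    chosen <- c(chosen, amplicons$id[best])
    uncovered <- uncovered &
      !(pos >= amplicons$lower[best] & pos <= amplicons$upper[best])
  }
  chosen
}

pos <- c(100, 130, 180, 260)
amplicons <- data.frame(id = 1:3,
                        lower = c(100, 161, 85),
                        upper = c(199, 260, 184))
selected <- forward_select(amplicons, pos)
selected
```

In this toy example amplicon 1 is picked first because it covers three mutations, then amplicon 2 because it is the only one covering the remaining mutation at 260.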

Each of these amplicons is then added to the final sequencing panel with its own unique ID.
All amplicons in the final sequencing panel are ranked from highest to lowest based on the number of mutations they cover.
The algorithm then calculates the cumulative base-pair length and the cumulative number of mutations covered, amplicon by amplicon, down the ranked panel.
Depending on the desired length of the targeted panel, a cutoff may be applied to remove all amplicons ranked below the point where a set cumulative length is reached.
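The ranking and cutoff can be sketched on a toy panel. The column names and values are hypothetical, and the sketch assumes the cutoff keeps every amplicon whose cumulative length does not exceed the desired panel length `len`.

```r
# Hypothetical toy panel: four 100 bp amplicons with invented mutation counts.
panel <- data.frame(
  id        = c("a", "b", "c", "d"),
  lower     = c(100, 500, 900, 1300),
  upper     = c(199, 599, 999, 1399),
  mutations = c(3, 7, 1, 5),
  stringsAsFactors = FALSE
)

# Rank from most to fewest mutations, then accumulate length and mutations.
panel <- panel[order(panel$mutations, decreasing = TRUE), ]
panel$cum_length    <- cumsum(panel$upper - panel$lower + 1)
panel$cum_mutations <- cumsum(panel$mutations)

# Keep only the amplicons that fit within the desired panel length.
len <- 250
trimmed <- panel[panel$cum_length <= len, ]
trimmed
```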

Running forward selection sequencing panel identifier:

fw_bins <- fw_hotspot(bins = amps, data = mutation_data, amp = 100, len = 1000, include_genes = TRUE)

Comprehensive Selection Sequencing Panel Identifier

To conserve computational power, the forward selection sequencing panel identifier is run first to determine the lowest number of mutations per amplicon (mutation frequency) that must be included in a sequencing panel of the predetermined length.

any amplicon generated by the algorithm that covers fewer mutations than this threshold value is removed.

To keep exhaustive selection of amplicon combinations feasible, the algorithm breaks hotspot regions longer than a predefined number of amplicons into multiple smaller regions.

the amplicons covering these regions are pulled from the amplicon pool, based on their unique IDs.

The algorithm finds the minimum number of overlapping amplicons across the region and all positions with this value, then identifies the longest continuous stretch of positions at this minimum.

the region is split at the center of this longest continuous stretch of minimum values, and the splitting process continues until all smaller regions are shorter than the “n” amplicon length set by the user.
As this set number of amplicons decreases, the computation time required also often decreases.
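A single split of this kind can be sketched with run-length encoding. `split_point` is a hypothetical helper that computes the overlap depth at every base pair of a region and returns the center of the longest run at minimum depth. It illustrates the idea only and is not the package's code.

```r
# Hypothetical helper: find where to split a region covered by several amplicons.
split_point <- function(amplicons) {
  positions <- min(amplicons$lower):max(amplicons$upper)
  # Number of amplicons overlapping each base pair in the region.
  depth <- vapply(positions, function(p) {
    sum(amplicons$lower <= p & amplicons$upper >= p)
  }, integer(1))
  runs <- rle(depth == min(depth))
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1
  # Longest run among those at minimum depth (the first one on ties).
  candidates <- which(runs$values)
  longest <- candidates[which.max(runs$lengths[candidates])]
  positions[(run_start[longest] + run_end[longest]) %/% 2]
}

toy <- data.frame(lower = c(1, 5, 20), upper = c(10, 14, 29))
split_point(toy)
```

Here positions 15 to 19 are covered by no amplicon, so the split falls at their center, position 17. Repeating the split on each half until every piece is short enough would give the smaller regions described above.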

All amplicons contained in these bins are added back to the amplicon pool with a new unique ID.
Amplicons covering hotspots less than or equal to one amplicon length are added to the final sequencing panel dataset.
To determine the optimal combination of amplicons for each region, the number of amplicons necessary for full coverage of the bin is calculated.
A list is generated of every possible combination of n amplicons, where n is the number of amplicons needed. For each combination of amplicons:

amplicons that would not meet the threshold of unique mutations are filtered out, and the number of all mutations captured by these amplicons is calculated.
the combination of amplicons that yields the highest number of mutations is added to the final sequencing panel.
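The exhaustive step can be sketched with `combn`, which enumerates every combination of `n` amplicons. `best_combination` and the column names are hypothetical, and the threshold filter on unique mutations is left out to keep the example short.

```r
# Hypothetical helper: try every combination of n amplicons and keep the best.
best_combination <- function(amplicons, pos, n) {
  combos <- combn(nrow(amplicons), n, simplify = FALSE)
  # Number of distinct mutations covered by each combination.
  covered <- vapply(combos, function(idx) {
    hit <- vapply(pos, function(p) {
      any(amplicons$lower[idx] <= p & amplicons$upper[idx] >= p)
    }, logical(1))
    sum(hit)
  }, integer(1))
  amplicons$id[combos[[which.max(covered)]]]  # first combination wins ties
}

pos <- c(100, 130, 180, 260)
amplicons <- data.frame(id = 1:3,
                        lower = c(100, 161, 85),
                        upper = c(199, 260, 184))
best <- best_combination(amplicons, pos, n = 2)
best
```

The number of combinations grows quickly with the number of candidate amplicons, which is why regions are split into smaller pieces before this step.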

All amplicons in the final sequencing panel are ranked from highest to lowest based on the number of mutations they cover.
All amplicons capturing the number of mutations equal to the cutoff are further ranked to favor amplicons that have mutations closer in location to the center of the amplicon.
Cumulative base-pair length and cumulative mutations covered by each amplicon are calculated.
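The tie-breaking rule can be illustrated on a toy panel. The column names, including `mean_pos` for the mean mutation position inside each amplicon, are hypothetical. As a simplification, the sketch breaks every tie by distance to the amplicon center, not only ties at the cutoff value.

```r
# Hypothetical toy panel with two amplicons tied on mutation count.
panel <- data.frame(
  id        = c("a", "b", "c"),
  lower     = c(100, 500, 900),
  upper     = c(199, 599, 999),
  mutations = c(4, 2, 2),
  mean_pos  = c(150, 510, 948),  # mean mutation position per amplicon
  stringsAsFactors = FALSE
)

# Distance between the mutations and the amplicon center.
panel$offset <- abs(panel$mean_pos - (panel$lower + panel$upper) / 2)

# Rank by mutation count first, then favor mutations closer to the center.
panel <- panel[order(-panel$mutations, panel$offset), ]
panel$id
```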

Depending on the desired length of the targeted panel, a cutoff may be applied to remove all amplicons ranked below the point where a set cumulative length is reached.

Running comprehensive selection sequencing panel identifier:

com_bins <- com_hotspot(fw_panel = fw_bins, bins = amps, data = mutation_data,
                        amp = 100, len = 1000, size = 3, include_genes = TRUE)

Choosing between Forward and Comprehensive Methods

Although the sequencing panels produced by the forward and comprehensive methods will in most cases be very similar, the two methods differ in how they search for the optimal set of amplicons, which may lead to small differences in the output. While the comprehensive method may give a slight increase in mutation capture efficacy, it is much more computationally intensive than the forward method. Therefore, we recommend the comprehensive method for smaller mutation datasets (~500 data points or fewer) and the forward method for larger datasets.

Summary of Output Data

Both the forward and comprehensive methods output a dataframe of the same format. Each row of the dataframe contains the information for an individual hotspot. The dataframe is ranked with the most mutated hotspot at the top, continuing in descending order. The columns contain the following information:

Lowerbound: lowest base pair position of the hotspot
Upperbound: highest base pair position of the hotspot
Chromosome: chromosome number which the hotspot is located on
Mutation Count: number of mutations in input dataset which were found within the regions of this hotspot
Cumulative Panel Length: cumulative number of base pairs which are included in the panel starting from the most mutated hotspot and adding in descending order
Cumulative Mutations: cumulative number of mutations included in the panel starting with the most mutated hotspot and adding in descending order
Gene (optional): name of gene(s) which are affected by mutations within each hotspot region
