seq.hotSPOT provides a resource for designing effective sequencing panels to help improve mutation capture efficacy for ultradeep sequencing projects. Using SNV datasets, this package designs custom panels for any tissue of interest and identifies the genomic regions likely to contain the most mutations. Establishing efficient targeted sequencing panels allows researchers to study mutation burden in tissues at high depth without the economic burden of whole-exome or whole-genome sequencing. This tool was developed to make high-depth sequencing panels to study low-frequency clonal mutations in clinically normal and cancerous tissues.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("seq.hotSPOT")
library(seq.hotSPOT)
The mutation dataset should include two columns containing the chromosome and genomic position of each mutation. The columns should be named “chr” and “pos” respectively. Optionally, the gene names for each mutation may be included under a column named “gene”.
Loading example data:
data("mutation_data")
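For data that do not ship with the package, a table in the format described above can be assembled by hand. The following is a minimal sketch in base R: the object name `my_mutations`, the gene labels, and all values are made up for illustration, and storing `chr` as a string is an assumption, since the required column type is not stated here.

```r
# Hypothetical input table; every value below is invented for illustration.
my_mutations <- data.frame(
    chr  = c("1", "1", "1", "2"),           # chromosome (string type is an assumption)
    pos  = c(1000, 1040, 1500, 20000),      # genomic position of each mutation
    gene = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),  # optional column
    stringsAsFactors = FALSE
)

# Confirm that the two required columns are present before going further.
required_cols <- c("chr", "pos")
missing_cols <- setdiff(required_cols, colnames(my_mutations))
if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
}
```

Checking the column names up front gives a clearer error than a failure deeper inside the analysis.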
This algorithm searches the mutational dataset (input) for mutational hotspot regions on each chromosome:
Starting at the mutation with the lowest chromosomal position (primary mutation), using a modified rank and recovery system, the algorithm searches for the closest neighboring mutation.
If the neighboring mutation is less than one amplicon length away from the primary mutation, it is included in the hotspot region.
Running amplicon finder:
amps <- amp_pool(data = mutation_data, amp = 100)
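The grouping rule described above can be illustrated with a short base R sketch. This is not the package's implementation: it follows only the two stated steps (start at the lowest position, absorb neighbors that lie less than one amplicon length from the primary mutation) and leaves out whatever the modified rank and recovery system adds. The function name `group_hotspots` and the toy positions are hypothetical.

```r
# Simplified sketch: assign mutations on one chromosome to hotspot regions.
# `positions` and `amp` are hypothetical inputs for illustration only.
group_hotspots <- function(positions, amp) {
    positions <- sort(positions)
    region <- integer(length(positions))
    current <- 0L
    primary <- NA_real_
    for (i in seq_along(positions)) {
        # Open a new region when there is no primary mutation yet, or when
        # this mutation is one amplicon length or more from the primary.
        if (is.na(primary) || positions[i] - primary >= amp) {
            current <- current + 1L
            primary <- positions[i]
        }
        region[i] <- current
    }
    data.frame(pos = positions, region = region)
}

toy_positions <- c(100, 130, 180, 450, 470, 900)
toy_regions <- group_hotspots(toy_positions, amp = 100)
toy_regions$region
# 1 1 1 2 2 3
```

With a 100 bp amplicon, the first three mutations fall within 100 bp of position 100 and share a region, while the isolated mutation at 900 forms a region of its own.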
Amplicons covering hotspots less than or equal to one amplicon in length are added to the final sequencing panel dataset.
For amplicons covering larger hotspot regions, the algorithm uses a forward selection method to determine the optimal combination of amplicons to use in the sequencing panel:
Each of these amplicons is then added to the final sequencing panel with its own unique ID.
All amplicons in the final sequencing panel are ranked from highest to lowest based on the number of mutations they cover.
The algorithm then calculates the cumulative base-pair length and the cumulative mutations covered by each amplicon.
Depending on the desired length of the targeted panel, a cutoff may be applied to remove all amplicons ranked below the point at which the set cumulative length is reached.
Running forward selection sequencing panel identifier:
fw_bins <- fw_hotspot(bins = amps, data = mutation_data, amp = 100, len = 1000, include_genes = TRUE)
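The ranking and cutoff steps can be sketched on a toy table in base R. The column names (`id`, `length_bp`, `mut_count`) and all values are hypothetical and are not the package's output format; `max_len` stands in for the role that `len` appears to play in the call above, which is an assumption.

```r
# Hypothetical amplicon table; names and values are illustrative only.
toy_amplicons <- data.frame(
    id        = c("a1", "a2", "a3", "a4"),
    length_bp = c(100, 100, 100, 100),
    mut_count = c(3, 12, 7, 1),
    stringsAsFactors = FALSE
)

# Rank amplicons from most to fewest mutations covered.
ranked <- toy_amplicons[order(toy_amplicons$mut_count, decreasing = TRUE), ]

# Running totals of panel length and of mutations captured.
ranked$cum_length <- cumsum(ranked$length_bp)
ranked$cum_mut    <- cumsum(ranked$mut_count)

# Keep only the amplicons that fit within the desired panel length.
max_len <- 300
panel <- ranked[ranked$cum_length <= max_len, ]
panel$id
# "a2" "a3" "a1"
```

Because the table is sorted before the running totals are taken, the cutoff always discards the amplicons that capture the fewest mutations.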
All amplicons contained in these bins are added back to the amplicon pool, based on a new unique ID.
Amplicons covering hotspots less than or equal to one amplicon length are added to the final sequencing panel dataset.
To determine the optimal combination of amplicons for each region, the number of amplicons necessary for full coverage of the bin is calculated.
A list is generated of every possible combination of the n amplicons needed. For each combination of amplicons:
All amplicons capturing the number of mutations equal to the cutoff are further ranked to favor amplicons that have mutations closer in location to the center of the amplicon.
Cumulative base-pair length and cumulative mutations covered by each amplicon are calculated.
Running comprehensive selection sequencing panel identifier:
com_bins <- com_hotspot(fw_panel = fw_bins, bins = amps, data = mutation_data, amp = 100, len = 1000, size = 3, include_genes = TRUE)
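The idea of enumerating every combination of n amplicons for a region can be sketched in base R with `combn`. This is an illustration under stated assumptions rather than the package's code: the candidate amplicons are assumed to start at mutation positions, the score is simply the number of mutations covered, and the names `mut_pos`, `candidate_starts`, and `count_covered` are hypothetical.

```r
# Toy hotspot region; mutation positions and amplicon length are made up.
mut_pos <- c(10, 15, 40, 95, 120, 130, 210)
amp <- 50

# Number of amplicons needed to span the region end to end.
region_span <- max(mut_pos) - min(mut_pos) + 1
n_needed <- ceiling(region_span / amp)

# Assumption: each candidate amplicon starts at a mutation position.
candidate_starts <- unique(mut_pos)

# Count the mutations covered by a set of amplicon start positions.
# An amplicon starting at s covers positions s to s + amp - 1.
count_covered <- function(starts) {
    covered <- vapply(
        mut_pos,
        function(p) any(p >= starts & p < starts + amp),
        logical(1)
    )
    sum(covered)
}

# Enumerate every combination of n_needed candidates and score each one.
combos <- combn(candidate_starts, n_needed, simplify = FALSE)
scores <- vapply(combos, count_covered, integer(1))
best <- combos[[which.max(scores)]]
```

Even in this small example there are 21 combinations to score for 7 candidate positions, and the count grows quickly as regions get larger, which is consistent with the comprehensive method being the more computationally intensive of the two.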
Although the sequencing panels produced by the forward and comprehensive methods will in most cases be very similar, the two methods select amplicons differently, which may lead to small differences in the output. While the comprehensive method may lead to a slight increase in mutation capture efficacy, it is much more computationally intensive than the forward method. Therefore, we recommend the comprehensive method for smaller mutation datasets (about 500 data points or fewer) and the forward method for larger datasets.
Both the forward and comprehensive methods output a dataframe of the same format. Each row of the dataframe contains the information for an individual hotspot. The dataframe is ranked with the most mutated hotspot at the top, continuing in descending order. The columns contain the following information:
A bioinformatics tool for analyzing mutational hotspot regions in DNA sequences
Table of Contents
Introduction to seq.hotSPOT
Installation & Setup
Formatting of Input Data
Overview of hotSPOT Algorithm
Generation of Amplicon Pool
Forward Selection Sequencing Panel Identifier
Comprehensive Selection Sequencing Panel Identifier
Choosing between Forward and Comprehensive Methods
Summary of Output Data