Kernel Density Plot Tool: SNP Distance Analysis and Visualization
A Python tool for generating SNP density and closest neighbor plots from aligned FASTA files. The Kernel Density Plot Tool reveals population structure, transmission patterns, and evolutionary relationships through comprehensive distance analysis and visualization.
Figure 1: Kernel density and closest neighbor analysis of MTBC Lineage 2 isolates
🔍 Key Features
Population Structure Analysis: Identify genetic clusters and relationships within your dataset
Dual Visualization Approach: Generate both kernel density plots and closest neighbor histograms
Comprehensive Output: Produces distance matrices, statistical summaries, and publication-ready figures
Flexible Integration: Compatible with vSNP3, phylogenetic tools, and standard FASTA alignments
Statistical Analysis: Detailed bin data and summary statistics for further analysis
🧬 Understanding SNP Distance Analysis
Kernel Density Plots
Kernel density plots show the distribution of SNP distances between isolates, revealing population structure through peak patterns:
Single peaks: Homogeneous populations or recent expansions
Multiple distinct peaks: Structured populations with genetic subgroups
Peak separation: Genetic boundaries between populations
Peak width: Genetic diversity within clusters
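As a rough illustration of what a kernel density plot summarizes, the sketch below counts differing alignment columns for every pair of isolates and evaluates a Gaussian kernel density estimate over those distances. The toy sequences, the `bandwidth` value, and all function names are assumptions made for this example; the tool itself may count differences or choose a bandwidth differently.

```python
import math
from itertools import combinations


def snp_distance(a, b):
    """Count alignment columns where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def pairwise_distances(seqs):
    """Return the SNP distance for every unordered pair of isolates."""
    return [snp_distance(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)]


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]


# Hypothetical toy alignment: two close pairs separated by a larger distance.
seqs = {
    "iso1": "AAAAAAAAAA",
    "iso2": "AAAAAAAAAT",
    "iso3": "TTTTTTAAAA",
    "iso4": "TTTTTTAAAT",
}
distances = pairwise_distances(seqs)
grid = list(range(0, 11))
density = gaussian_kde(distances, grid, bandwidth=1.0)

for g, d in zip(grid, density):
    print(f"{g:2d} SNPs  density={d:.3f}")
```

With this toy input the density is high near 1 SNP (within-pair distances) and again near 6 to 7 SNPs (between-pair distances), which is the two-peak pattern described above for structured populations.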
Closest Neighbor Analysis
Closest neighbor histograms show the frequency distribution of minimum genetic distances, useful for:
Outbreak detection: Tight clusters at low SNP distances
Transmission analysis: Identifying closely related isolates
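The closest neighbor idea can be sketched in a few lines: for each isolate, take the minimum SNP distance to any other isolate, then count those minima per bin. The sequences and function names below are hypothetical, and treating the bin size as a fixed-width floor division is an assumption about how a `--bin-size` style parameter might behave, not a description of the tool's internals.

```python
from collections import Counter


def snp_distance(a, b):
    """Count alignment columns where two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def closest_neighbor_distances(seqs):
    """For each isolate, return the SNP distance to its nearest other isolate."""
    result = {}
    for name, seq in seqs.items():
        result[name] = min(
            snp_distance(seq, other)
            for other_name, other in seqs.items()
            if other_name != name
        )
    return result


def bin_counts(values, bin_size):
    """Count values per fixed-width bin; keys are the lower edge of each bin."""
    counts = Counter((v // bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))


# Hypothetical toy alignment: three near-identical isolates and one distant one.
seqs = {
    "iso1": "AAAAAAAAAAAA",
    "iso2": "AAAAAAAAAAAT",
    "iso3": "AAAAAAAAAATT",
    "iso4": "TTTTTTTTAAAA",
}
nearest = closest_neighbor_distances(seqs)
histogram = bin_counts(nearest.values(), bin_size=5)
print(nearest)
print(histogram)
```

Here three isolates fall in the lowest bin and one sits apart, the kind of tight low-distance cluster the list above associates with outbreak detection.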
# Set up test directory
cd ${HOME}
mkdir kernel_test
cd kernel_test
# Copy and run test data
cp -v $CONDA_PREFIX/share/kernel_density_plots/test/*fasta .
kernel_plot.py --lineage="Lineage L2" --bin-size=20 -f La2_test.fasta
📊 Interpreting Results
Kernel Density Plot Patterns
Unimodal Distribution
    Peak
    / \
   /   \
  /     \
Indicates homogeneous population structure
Bimodal Distribution
 Peak    Peak
 / \     / \
    \   /
     \ /
Suggests two distinct genetic groups
Multimodal Distribution
Multiple peaks indicate complex population structure with several genetic clusters
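One simple way to put a label on these patterns is to count local maxima in the sampled density curve. The sketch below does that with a strict neighbor comparison; the function names, the `min_height` noise cutoff, and the example curves are assumptions for illustration, and flat-topped peaks are deliberately not handled.

```python
def find_peaks(density, min_height=0.0):
    """Return indices of strict local maxima that rise above min_height."""
    return [
        i
        for i in range(1, len(density) - 1)
        if density[i] > density[i - 1]
        and density[i] > density[i + 1]
        and density[i] > min_height
    ]


def describe_modality(density, min_height=0.0):
    """Label a density curve by its number of detected peaks."""
    n_peaks = len(find_peaks(density, min_height))
    if n_peaks == 0:
        return "no peak"
    if n_peaks == 1:
        return "unimodal"
    if n_peaks == 2:
        return "bimodal"
    return "multimodal"


# Hypothetical density values sampled on an evenly spaced SNP distance grid.
unimodal = [0.0, 0.1, 0.4, 0.8, 0.4, 0.1, 0.0]
bimodal = [0.0, 0.5, 0.1, 0.05, 0.1, 0.6, 0.2, 0.0]

print(describe_modality(unimodal))
print(describe_modality(bimodal))
```

On real data, small bumps from sampling noise can register as peaks, so a nonzero `min_height` or a smoother curve is usually needed before the label is trusted.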
Closest Neighbor Histogram Analysis
Low distance peaks (0-10 SNPs): Recent transmission or clonal expansion
Intermediate distances: Background population diversity
High distance outliers: Genetically distinct strains or potential contamination
Median values: Useful for establishing relatedness thresholds
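The categories above can be applied to a set of closest neighbor distances with a short summary function. In this sketch the 10 SNP upper bound for the low range comes from the list above, while the 100 SNP cutoff for outliers, the isolate names, and the distances are hypothetical values chosen only for the example.

```python
from statistics import median


def summarize_nearest(nearest, low_max=10, high_min=100):
    """Summarize closest neighbor distances.

    low_max follows the 0-10 SNP range in the text; high_min is an
    illustrative assumption, not a recommended threshold.
    """
    return {
        "median": median(nearest.values()),
        "low": sorted(name for name, d in nearest.items() if d <= low_max),
        "high": sorted(name for name, d in nearest.items() if d >= high_min),
    }


# Hypothetical closest neighbor distances (in SNPs) for five isolates.
nearest = {"iso1": 2, "iso2": 2, "iso3": 35, "iso4": 41, "iso5": 250}
summary = summarize_nearest(nearest)
print(summary)
```

Appropriate cutoffs depend on the organism, the reference, and the sampling, so the thresholds are exposed as parameters rather than fixed.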
🔗 Integration with Other Tools
vSNP3 Workflow
# Generate SNP alignment with vSNP3
vsnp3_step2.py -a -t ReferenceType
# Analyze population structure
kernel_plot.py -f alignment.fasta --lineage="Analysis_Name"
📦 Installation
🚀 Quick Start
Test Dataset
See Figure 1 for expected output
Basic Usage
⚙️ Command Line Options
Key Parameters:
-f, --fasta: Input aligned FASTA file (required)
--lineage: Analysis identifier for output files
--bin-size: Histogram bin size (default: 10)
--output-dir: Output directory path
For complete options:
kernel_plot.py -h
📁 Output Files
Visualizations
*_density_plot.pdf - Kernel density plot of SNP distances
*_closest_neighbor.pdf - Closest neighbor distance histogram
*_combined_figure.pdf - Combined plot figure
Data Files
*_distances.tab - Complete pairwise distance matrix
*_no_root_YYYY-MM-DD.tab - Distance matrix with root sequence removed
*_lowertriangle.txt - Lower triangle distance values
*_density_bins.csv - Density plot bin data
*_neighbor_bins.csv - Neighbor histogram bin data
Statistics
*_density_stats.txt - Density distribution statistics
*_neighbor_stats.txt - Neighbor distance statistics
🔬 Applications
Outbreak Investigation
Analyze suspected transmission clusters by examining SNP distance distributions. Tight clusters at low distances may indicate recent transmission events, while scattered patterns suggest multiple introductions or endemic circulation.
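To make "tight clusters at low distances" concrete, the following sketch groups isolates by single-linkage clustering: any two isolates within a SNP threshold are linked, and chains of links form one cluster. The threshold of 1 SNP, the sequences, and the function names are assumptions for illustration; this is one common way to define candidate clusters, not necessarily how the tool or a given surveillance program defines them.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count alignment columns where two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def single_linkage_clusters(seqs, threshold):
    """Group isolates connected by chains of pairs within `threshold` SNPs."""
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy alignment: iso1-iso2 and iso2-iso3 differ by 1 SNP each.
seqs = {
    "iso1": "AAAAAAAAAAAA",
    "iso2": "AAAAAAAAAAAT",
    "iso3": "AAAAAAAAAATT",
    "iso4": "TTTTTTTTAAAA",
}
clusters = single_linkage_clusters(seqs, threshold=1)
print(clusters)
```

Note that iso1 and iso3 differ by 2 SNPs yet end up in the same cluster through iso2. That chaining is inherent to single linkage and is worth keeping in mind when a threshold is read as a transmission definition.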
Surveillance Programs
Monitor population structure changes over time by comparing density plots across sampling periods. Establish distance thresholds for surveillance definitions using closest neighbor distributions.
Population Genetics Research
Characterize genetic diversity and population structure. Multiple peaks may indicate distinct lineages or clades.
Quality Control
Assess alignment quality and identify potential outliers or contamination through extreme distance values or unexpected clustering patterns.
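A basic pre-check along these lines can be run on the input before any distances are computed. The sketch below parses FASTA text, confirms that all sequences have the same length, and flags records with a high fraction of non-ACGT characters (gaps and ambiguity codes both count). The parser, the 10% cutoff, and the example records are assumptions for illustration and are not part of the tool.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_alignment(records, max_ambiguous_fraction=0.1):
    """Return a list of human-readable problems found in the alignment."""
    problems = []
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    for name, seq in records.items():
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if seq and ambiguous / len(seq) > max_ambiguous_fraction:
            problems.append(f"{name}: {ambiguous}/{len(seq)} non-ACGT characters")
    return problems


# Hypothetical input: iso2 has many N calls and iso3 is shorter than the rest.
example = ">iso1\nACGTACGTAC\n>iso2\nACGTNNNTAC\n>iso3\nACGTACG\n"
records = parse_fasta(example)
problems = check_alignment(records)
for problem in problems:
    print(problem)
```

An empty result only means these two checks passed; unexpected clustering or extreme distances in the plots still need to be reviewed as described above.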
Additional Detail
⚠️ Important Considerations
Data Quality
Analytical Limitations
Statistical Interpretation
Use distance distributions in conjunction with phylogenetic analysis and epidemiological data for comprehensive interpretation. Consider confidence intervals and sampling effects when drawing conclusions.
🤝 Support
For questions, bug reports, or feature requests, please open an issue on GitHub or contact the development team directly.
📚 Citation
If you use this tool in your research, please cite.