Kernel Density Plot Tool: SNP Distance Analysis and Visualization

A Python tool for generating SNP distance density plots and closest neighbor plots from aligned FASTA files. The Kernel Density Plot Tool helps reveal population structure, transmission patterns, and evolutionary relationships through pairwise distance analysis and visualization.

Figure 1: Kernel density and closest neighbor analysis of MTBC Lineage 2 isolates

🔍 Key Features

  • Population Structure Analysis: Identify genetic clusters and relationships within your dataset
  • Dual Visualization Approach: Generate both kernel density plots and closest neighbor histograms
  • Comprehensive Output: Produces distance matrices, statistical summaries, and publication-ready figures
  • Flexible Integration: Compatible with vSNP3, phylogenetic tools, and standard FASTA alignments
  • Statistical Analysis: Detailed bin data and summary statistics for further analysis

🧬 Understanding SNP Distance Analysis

Kernel Density Plots

Kernel density plots show the distribution of SNP distances between isolates, revealing population structure through peak patterns:

  • Single peaks: Homogeneous populations or recent expansions
  • Multiple distinct peaks: Structured populations with genetic subgroups
  • Peak separation: Genetic boundaries between populations
  • Peak width: Genetic diversity within clusters
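To make the idea concrete, here is a minimal sketch of how pairwise SNP distances and a Gaussian kernel density estimate could be computed with the standard library only. It is not the code used by kernel_plot.py. The function names, the toy sequences, the fixed bandwidth, and the choice to ignore gap and N positions are all assumptions made for illustration.

```python
import math
from itertools import combinations


def snp_distance(a, b, skip=("-", "N")):
    """Count aligned positions where two sequences differ.

    Assumption: positions with a gap or N in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal aligned length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in skip and y not in skip
    )


def pairwise_distances(seqs):
    """Return one distance per unordered pair, in sorted-name order."""
    return [snp_distance(seqs[p], seqs[q]) for p, q in combinations(sorted(seqs), 2)]


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]


# Toy alignment (hypothetical isolates, not real data)
seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGTACGA",
    "D": "TGCAACGA",
}
distances = pairwise_distances(seqs)

# Evaluate the density on a grid wide enough to cover all distances
step = 0.1
grid = [i * step for i in range(-200, 301)]
density = gaussian_kde(distances, grid, bandwidth=1.0)
```

Peaks in `density` correspond to SNP distances that occur often among pairs. The bandwidth controls smoothing: a smaller value separates nearby peaks, a larger value merges them.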

Closest Neighbor Analysis

Closest neighbor histograms show the frequency distribution of minimum genetic distances, useful for:

  • Outbreak detection: Tight clusters at low SNP distances
  • Transmission analysis: Identifying closely related isolates
  • Threshold establishment: Setting epidemiological distance cutoffs
  • Diversity assessment: Understanding genetic variation ranges
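The closest neighbor calculation can be sketched as follows: for each isolate, take the smallest distance to any other isolate, then count those minima in fixed-width bins. This is an illustrative sketch rather than the tool's implementation. The nested-dictionary matrix layout, the function names, and the example values are assumptions.

```python
from collections import Counter


def closest_neighbor_distances(matrix):
    """Map each isolate to its smallest distance to any other isolate.

    matrix is a dict of dicts: matrix[name][other] -> SNP distance.
    Requires at least two isolates.
    """
    return {
        name: min(d for other, d in row.items() if other != name)
        for name, row in matrix.items()
    }


def bin_counts(values, bin_size):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = Counter((v // bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))


# Hypothetical symmetric distance matrix for four isolates
matrix = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 5},
    "B": {"A": 1, "B": 0, "C": 1, "D": 4},
    "C": {"A": 2, "B": 1, "C": 0, "D": 3},
    "D": {"A": 5, "B": 4, "C": 3, "D": 0},
}
nearest = closest_neighbor_distances(matrix)
histogram = bin_counts(nearest.values(), bin_size=2)
```

Here isolate D stands apart with a minimum distance of 3, while A, B, and C each have a neighbor 1 SNP away.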

📦 Installation

# Create and activate conda environment
conda create -n kernel_density_plots kernel_density_plots
conda activate kernel_density_plots

# Verify installation
kernel_plot.py -h

🚀 Quick Start

Test Dataset

# Set up test directory
cd ${HOME}
mkdir kernel_test
cd kernel_test

# Copy and run test data
cp -v $CONDA_PREFIX/share/kernel_density_plots/test/*fasta .
kernel_plot.py --lineage="Lineage L2" --bin-size=20 -f La2_test.fasta

See Figure 1 for the expected output.

Basic Usage

# Simple analysis
kernel_plot.py -f your_alignment.fasta

# Analysis with custom parameters
kernel_plot.py --lineage="Study Name" --bin-size=25 --output-dir results -f your_alignment.fasta

⚙️ Command Line Options

kernel_plot.py [options] -f input.fasta

Key Parameters:

  • -f, --fasta: Input aligned FASTA file (required)
  • --lineage: Analysis identifier for output files
  • --bin-size: Histogram bin size (default: 10)
  • --output-dir: Output directory path

For complete options: kernel_plot.py -h

📁 Output Files

Visualizations

  • *_density_plot.pdf - Kernel density plot of SNP distances
  • *_closest_neighbor.pdf - Closest neighbor distance histogram
  • *_combined_figure.pdf - Combined plot figure

Data Files

  • *_distances.tab - Complete pairwise distance matrix
  • *_no_root_YYYY-MM-DD.tab - Distance matrix with root sequence removed
  • *_lowertriangle.txt - Lower triangle distance values
  • *_density_bins.csv - Density plot bin data
  • *_neighbor_bins.csv - Neighbor histogram bin data
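For downstream analysis, the relationship between the full matrix and the lower triangle values can be sketched as below. The exact layout of the tool's .tab files is not documented here, so the layout assumed in this sketch (a tab-delimited table with a header row of isolate names and a leading label column) is hypothetical, as are the function names and the sample text.

```python
import csv
import io


def read_matrix(text):
    """Parse a tab-delimited square matrix with header row and label column.

    Assumed layout only; check the real file before relying on this.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    names = rows[0][1:]
    values = [[int(cell) for cell in row[1:]] for row in rows[1:]]
    return names, values


def lower_triangle(values):
    """Return entries strictly below the diagonal, row by row."""
    return [values[i][j] for i in range(len(values)) for j in range(i)]


# Hypothetical matrix text for four isolates
sample = (
    "\tA\tB\tC\tD\n"
    "A\t0\t1\t2\t5\n"
    "B\t1\t0\t1\t4\n"
    "C\t2\t1\t0\t3\n"
    "D\t5\t4\t3\t0\n"
)
names, values = read_matrix(sample)
pairs = lower_triangle(values)
```

Each unordered pair of isolates appears once in the lower triangle, so n isolates yield n(n-1)/2 values. These are the values whose distribution a density plot summarizes.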

Statistics

  • *_density_stats.txt - Density distribution statistics
  • *_neighbor_stats.txt - Neighbor distance statistics

🔬 Applications

Outbreak Investigation

Analyze suspected transmission clusters by examining SNP distance distributions. Tight clusters at low distances may indicate recent transmission events, while scattered patterns suggest multiple introductions or endemic circulation.

Surveillance Programs

Monitor population structure changes over time by comparing density plots across sampling periods. Establish distance thresholds for surveillance definitions using closest neighbor distributions.

Population Genetics Research

Characterize genetic diversity and population structure. Multiple peaks may indicate distinct lineages or clades.

Quality Control

Assess alignment quality and identify potential outliers or contamination through extreme distance values or unexpected clustering patterns.

📊 Interpreting Results

Kernel Density Plot Patterns

Unimodal Distribution

     Peak
    /    \
   /      \
  /        \

Indicates homogeneous population structure

Bimodal Distribution

Peak      Peak
/  \      /  \
    \    /
     \  /

Suggests two distinct genetic groups

Multimodal Distribution

Multiple peaks indicate complex population structure with several genetic clusters
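Deciding whether a curve is unimodal, bimodal, or multimodal can be roughly automated by counting local maxima in the density values. The sketch below is one simple way to do that and is not part of the tool. The function name, the `min_height` parameter, and the example curve are assumptions.

```python
def find_peaks(grid, density, min_height=0.0):
    """Return grid positions of local maxima at or above min_height.

    A point counts as a peak when it is higher than its left neighbor
    and at least as high as its right neighbor, so a flat top is
    reported once at its left edge.
    """
    peaks = []
    for i in range(1, len(density) - 1):
        rising = density[i] > density[i - 1]
        not_lower = density[i] >= density[i + 1]
        if rising and not_lower and density[i] >= min_height:
            peaks.append(grid[i])
    return peaks


# Hypothetical density values sampled every 10 SNPs
grid = list(range(0, 90, 10))
density = [0.0, 1.0, 3.0, 1.0, 0.5, 2.0, 4.0, 2.0, 0.0]
peaks = find_peaks(grid, density)
```

Raising `min_height` suppresses small bumps that may be sampling noise rather than real population structure.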

Closest Neighbor Histogram Analysis

  • Low distance peaks (0-10 SNPs): Recent transmission or clonal expansion
  • Intermediate distances: Background population diversity
  • High distance outliers: Genetically distinct strains or potential contamination
  • Median values: Useful for establishing relatedness thresholds
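The summary values mentioned above can be computed from the per-isolate closest neighbor distances. This is a minimal sketch and does not reproduce the tool's statistics output. The function name, the input values, and the threshold of 2 SNPs are placeholders and not a recommended cutoff.

```python
from statistics import median


def summarize_neighbors(nearest, threshold):
    """Summarize closest neighbor distances and list isolates within a threshold.

    nearest maps isolate name -> distance to its closest neighbor.
    """
    values = sorted(nearest.values())
    return {
        "median": median(values),
        "min": values[0],
        "max": values[-1],
        "within_threshold": sorted(
            name for name, d in nearest.items() if d <= threshold
        ),
    }


# Hypothetical closest neighbor distances for four isolates
nearest = {"A": 1, "B": 1, "C": 1, "D": 3}
summary = summarize_neighbors(nearest, threshold=2)
```

An appropriate threshold depends on the organism, the mutation rate, and the sampling window, so it should be chosen with epidemiological context rather than from the histogram alone.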

🔗 Integration with Other Tools

vSNP3 Workflow

# Generate SNP alignment with vSNP3
vsnp3_step2.py -a -t ReferenceType

# Analyze population structure
kernel_plot.py -f alignment.fasta --lineage="Analysis_Name"

Additional Detail

⚠️ Important Considerations

Data Quality

  • Ensure high-quality sequence alignments
  • Verify appropriate reference genome selection
  • Consider the impact of recombination on distance calculations
  • Account for potential sequencing artifacts
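Some of these checks can be done before running the analysis. The sketch below parses FASTA text, confirms that all sequences have the same aligned length, and reports the fraction of non-ACGT characters per sequence. It is an illustration only. The function names and the sample records are assumptions, and the tool may validate input differently.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}.

    The name is the first word after '>'; headers are assumed non-empty.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_report(records):
    """Check equal lengths and measure ambiguous content per sequence."""
    lengths = {len(seq) for seq in records.values()}
    ambiguous = {
        name: sum(1 for c in seq.upper() if c not in "ACGT") / len(seq)
        for name, seq in records.items()
        if seq
    }
    return {"aligned": len(lengths) == 1, "ambiguous_fraction": ambiguous}


# Hypothetical two-record alignment
sample = ">s1 example isolate\nACGT\nACGT\n>s2\nACGTAC-N\n"
records = read_fasta(sample)
report = alignment_report(records)
```

Sequences with a high ambiguous fraction can distort distances, so they are worth reviewing before interpreting the plots.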

Analytical Limitations

  • Results reflect the composition of the sample, not the entire population
  • Parsimony-based analysis may underestimate evolutionary distances
  • Population sampling bias can affect interpretation
  • Reference genome choice influences absolute distance values

Statistical Interpretation

Use distance distributions in conjunction with phylogenetic analysis and epidemiological data for comprehensive interpretation. Consider confidence intervals and sampling effects when drawing conclusions.

🤝 Support

For questions, bug reports, or feature requests, please open an issue on GitHub or contact the development team directly.

📚 Citation

If you use this tool in your research, please cite.
