vSNP3: High-Resolution SNP Analysis for Pathogen Surveillance

vSNP3 is a powerful tool for high-resolution bacterial and viral SNP analysis, designed specifically for disease tracing and outbreak investigations in diagnostic laboratories.

🌟 Why Choose vSNP3?

Superior Resolution: Precisely identifies with confidence strain differences down to the single nucleotide level
Flexible Database: Build, maintain, and update your strain database without rerunning all samples
Intelligent Sample Classification: Automatically group samples based on defining SNPs
Computational Efficiency: Focus analysis on relevant sample subsets, saving time and resources
Comprehensive Output: Complete suite of BAM, VCF, annotated SNP matrices, and phylogenetic trees
Zero Coverage Tracking: Unique capability to track regions with no sequence data
Mixed SNP Handling: Accurately represents positions with multiple alleles using IUPAC codes - ability to identify mixed strains

🔍 The vSNP3 Advantage: Two-Step Process Explained

Why Two Steps Are Better Than One

Most SNP callers force you to reprocess all your samples each time you add new ones. vSNP3’s two-step approach is different:

Step 1: Process Alignment Once - Align reads and call SNPs for each sample individually
Step 2: Combine and run VCF files - Generate matrices and trees from any combination of samples

Step 1 output

Step 2 output

This approach lets you:

Add new samples to your analysis without reprocessing existing ones
Create different sample groupings for different investigations
Save computational resources and time
Maintain a growing, curated database of SNP profiles

Defining SNPs: Sample Classification

A unique feature of vSNP3 is its use of defining SNPs to automatically categorize samples:

Full Dataset (100 samples)
   │
   ├── Group A (40 samples) - Defining SNP: position 123456 = T
   │    │
   │    ├── Subgroup A1 (15 samples) - Defining SNP: position 234567 = G
   │    │
   │    └── Subgroup A2 (25 samples) - Defining SNP: position 234567 = A
   │
   └── Group B (60 samples) - Defining SNP: position 123456 = C
        │
        ├── Subgroup B1 (20 samples) - Defining SNP: position 345678 = T
        │
        └── Subgroup B2 (40 samples) - Defining SNP: position 345678 = C

Benefits of defining SNPs:

Automatic Grouping: Samples are classified into groups based on specific SNP patterns
Focused Analysis: Quickly drill down to specific subsets of related samples
Computational Efficiency: Reduce analysis time by working with smaller, relevant sample sets

📦 Installation

conda create -c conda-forge -c bioconda -n vsnp3 vsnp3=3.35
conda activate vsnp3

For detailed setup instructions, see conda instructions.

🚀 Quick Start

# Verify installation
vsnp3_step1.py -h
vsnp3_step2.py -h

# Download test dataset and add reference types
cd ${HOME}
git clone https://github.com/USDA-VS/vsnp3_test_dataset.git
cd vsnp3_test_dataset/vsnp_dependencies
vsnp3_path_adder.py -d $(pwd)

# Run Step 1: Process a single sample (only needed once per sample)
cd ~/vsnp3_test_dataset/AF2122_test_files/step1
vsnp3_step1.py -r1 *_R1*.fastq.gz -r2 *_R2*.fastq.gz -t Mycobacterium_AF2122

# Run Step 2: Generate SNP matrix and tree (can be run with any sample combination)
cd ~/vsnp3_test_dataset/AF2122_test_files/step2
vsnp3_step2.py -a -t Mycobacterium_AF2122

📊 Real-World Example: Building Your Surveillance Database

Imagine you’re tracking a bacterial outbreak over time:

Initial Investigation: Process your first 10 samples through Step 1, then use Step 2 to generate a phylogenetic tree
New Sample Analysis: When you receive 5 new samples, only run Step 1 on these new samples
Updated Results: Run Step 2 again using all 15 samples to see how the new samples relate to the existing ones
Focused Investigation: Use defining SNPs to identify a specific cluster, then create a detailed analysis with just those samples

This workflow saves time and resources while maintaining a comprehensive database of all processed samples.

📘 Key Features in Detail

Step 1: Alignment and SNP Calling

Step 1 processes raw sequencing data for each sample individually:

Aligns reads to your reference genome
Calls high-quality SNPs
Tracks regions with zero coverage
Generates comprehensive quality metrics
Automatically assigns samples to groups based on defining SNPs

Sample output metrics: Step 1 alignment metrics

Step 2: Matrix and Tree Generation

Step 2 combines results from multiple samples:

Creates SNP matrices from any combination of processed samples
Builds phylogenetic trees showing evolutionary relationships
Handles mixed SNPs using IUPAC ambiguity codes
Generates HTML summary reports for easy interpretation

Sample outputs:

Step 2 tree Step 2 SNP matrix

Using Defining SNPs

vSNP3’s defining SNP capability allows you to:

Automatically classify samples into hierarchical groups
Focus your analysis on biologically relevant sample subsets
Quickly identify related samples in an outbreak scenario
Build a labeled sample database

🧬 Understanding Defining SNPs

One of vSNP3’s most powerful features is its ability to automatically classify samples using defining SNPs. Each reference type has its own defining SNP Excel file that defines these critical positions.

Locating Your Defining SNP Files

After installation, you can find the path to your defining SNP files with:

vsnp3_path_adder.py -s

This will show all installed reference types and their associated file paths.

Anatomy of a Defining SNP File

Defining SNPs Excel Structure

The defining SNP Excel file has a structured format:

Row 1: Contains chromosome:position identifiers for each SNP position
Row 2: Names of each group/subgroup (e.g., Mbovis-All, Mbovis-01, Mbovis-01A)
Remaining Rows: Positions to be filtered from the analysis for each specific group

When vSNP3 analyzes a sample:

It checks the sample’s nucleotides at each defining position
Based on the SNP pattern, it automatically assigns the sample to the appropriate group
During analysis, it filters out the problematic positions listed below each group’s column

Customizing Your Analysis

The beauty of this system is its flexibility:

You can define hierarchical groups based on evolutionary relationships
Each group can have its own set of filtered positions to improve analysis quality
As you discover new lineages, you can update the defining SNP file to reflect them

This classification system allows you to:

Automatically organize samples as they’re processed
Focus your analysis on specific groups of interest
Maintain consistent classifications across your entire database
Filter out positions known to be problematic for specific lineages

The defining SNP system transforms vSNP3 from a simple SNP caller into an intelligent analysis platform that grows more valuable as your sample database expands.

🧰 Reference Types

Reference types have key files that provide structure to your analysis:

Defining filter file: Identifies group-specific SNPs
Metadata file: Maps sample names
FASTA reference: For read alignment
GenBank file: For annotation

Adding a reference is simple:

vsnp3_path_adder.py -d /path/to/reference_files

Reference types are called based on their directory names once their parent directory is added.

🔄 Setting Up Reference Types

One of the most important first steps in using vSNP3 is setting up your reference types. This only needs to be done once, and it enables all the powerful features of vSNP3 including automatic sample classification and group-specific filtering. Reference types are called based on their directory names once their parent directory is added.

What Is a Reference Type?

A reference type in vSNP3 is a collection of files for a specific organism that includes:

A reference genome (FASTA)
Annotation information (GenBank)
Defining SNP positions (Excel file)
Sample name mapping (Excel file)

These files work together to provide the foundation for your analyses.

Adding Your First Reference Type

Adding a reference type is simple using the vsnp3_path_adder.py utility:

Parent directory contains the reference directory. This parent directory may contain many reference types, each a separate subfolder.

# Add a reference parent directory containing all necessary files.  
vsnp3_path_adder.py -d /path/to/parent_dictory

This command tells vSNP3 where to find the reference files for a particular organism. The reference type name is taken directly from the directory name. For example, if your files are in a directory called Mycobacterium_AF2122, that becomes the reference type name you’ll use in your commands.

Example Reference Type Setup

Let’s walk through a complete example:

Prepare your reference directory

Create a directory with these files:

Parent_Directory/
   └──Mycobacterium_AF2122/
      ├── defining_filter.xlsx    # Contains defining SNPs and filter positions
      ├── metadata.xlsx           # Sample name mapping
      ├── AF2122.fasta            # Reference genome
      └── AF2122.gbk              # GenBank annotation file

Add the reference type to vSNP3

vsnp3_path_adder.py -d /path/to/Parent_Directory

Verify the reference was added
```
vsnp3_path_adder.py -s
```
You should see your reference type listed, along with paths to all associated files.

Managing Multiple Reference Types

vSNP3 allows you to work with multiple reference types:

Adding additional references: Simply run the path adder for each new reference
```
vsnp3_path_adder.py -d /path/to/another_parent_directory
```
Viewing all references: Check which references are available
```
vsnp3_path_adder.py -s
```

Best Practices for Reference Management

Organize by organism: Keep reference files for each organism in separate directories
Use descriptive names: Choose reference type names that clearly identify the organism
Keep references consistent: Use the same reference across all related analyses
Back up your reference files: Save your defining SNP files especially, as they contain valuable classification information

By properly setting up your reference types, you’re creating a foundation for consistent, repeatable analyses that grow more valuable as your sample database expands.

🔧 Additional Tools

vSNP3 includes utility scripts for:

Adding reference paths
MLST typing
Downloading reference genomes
Filter optimization
Spoligotyping

For full details, see Additional Tools.

💡 Common Use Cases

Disease outbreak investigation: Track transmission chains in real time
Surveillance programs: Monitor pathogen evolution over time
Vaccine strain monitoring: Detect drift from vaccine strains
Mix strain evaluation: Identify mixed strains
Antimicrobial resistance tracking: Link resistance profiles to genetic markers