Some QUIck Reconstruction to Resolve Evolutionary Links
Data use
Squirrel used open data to train the clade assignment random forest model. This data was accessed through Pathoplexus and the Sequence Set used for training can be accessed at https://doi.org/10.62599/PP_SS_529.1.
How to cite:
If you use squirrel in your research, please cite the Squirrel paper, which is now published in Virus Evolution:
Áine O’Toole, Eddy Kinganda-Lusamaki, Rachel Colquhoun, Connor Chato, Emily Scher, Chris Kent, Sam Wilkinson, Joshua Quick, Nicholas Loman, Ana T Duggan, Placide Mbala, Andrew Rambaut, Human outbreak detection and best practice MPXV analysis and interpretation with squirrel, Virus Evolution, Volume 12, Issue 1, 2026, veaf095, https://doi.org/10.1093/ve/veaf095
where <your-sequences.fasta> is the name of your input FASTA sequence file. See FASTA format for file format details.
Mixed Clade Sequences
NEW FEATURE: If your FASTA file contains a mixture of Clade I and Clade II sequences, squirrel can now assign clade to your sequences, split them into separate alignments for Clade I and II, and run all downstream analysis per clade. A single report is produced at the end.
squirrel --clade split <your-sequences.fasta>
Note: An EPI2ME wrapper for squirrel with the same options is available at artic-network/squirrel-nf.
If mamba is not installed, run conda install -n base mamba
mamba env create -f environment.yml
conda activate squirrel
pip install .
Verify installation:
squirrel -v
You should see the version of squirrel.
How It Works
Alignment Process
squirrel provides a rapid way of producing reliable alignments for MPXV and enables maximum-likelihood phylogenetics pipeline tree estimation.
The reliability of tree estimation is determined by the quality of the input genome sequences. In QC mode, squirrel can flag potential issues in the MPXV sequences that have been provided for alignment (e.g. SNPs near tracts of N, clusters of unique SNPs, reversions to reference alleles and convergent mutations) and outputs these in a mask file for investigation. We suggest you use this information to examine the alignment and pay close attention to the regions flagged. Squirrel can then accept this file with suggested masks and apply it to the sequences before doing phylogenetics.
Enrichment of APOBEC3-mutations in the MPXV population are a signature of sustained human-to-human transmission. Identifying APOBEC3-like mutations in MPXV genomes from samples in a new outbreak can be a piece of evidence to support sustained human transmission of mpox. Squirrel can run an APOBEC3-reconstruction and map these mutations onto the phylogeny.
Squirrel produces a HTML report summarising some of the key outputs.
Generate a quick MPXV alignment
To align MPXV Clade II sequences, run:
squirrel <your-sequences.fasta>
To align MPXV Clade I sequences, run:
squirrel --clade cladei <your-sequences.fasta>
where <your-sequences.fasta> is the name of your input FASTA sequence file. Click here see what a FASTA formatted file looks like.
NEW FEATURE If your FASTA file contains a mixture of Clade I and Clade II sequences, squirrel can now assign clade to your sequences, split them into separate alignments for Clade I and II, and run all downstream analysis per clade. A single report is produced at the end. To align a mixed FASTA file, run:
squirrel --clade split <your-sequences.fasta>
Note, an EPI2ME wrapper for squirrel with the same options is available here.
How it works - alignment
Squirrel maps each query genome in the input file against a reference genome specific to each clade using minimap2. Using gofasta, the mapping file is then converted into a multiple sequence alignment.
For Clade II, the reference used is NC_063383 and for Clade I, we use NC_003310. This means that all coordinates within an alignment will be relative to these references. A benefit of this is that within a clade, alignment files and be combined without having to recalculate the alignment. Note however that insertions relative to the reference sequence will not be included in the alignment.
Squirrel by default creates a single alignment fasta file. Using the genbank coordinates for NC_063383 it also has the ability to extract the aligned coding sequences either as separate records or as a concatenated alignment. This can facilitate codon-aware phylogenetic or sequence analysis.
If you input sequences that are entirely N or very short, squirrel may fail to align these. If all input sequences are 100% ambiguous, squirrel will exit with an error.
Some of the phylogenetic software within squirrel (iqtree, baltic) will modify or break with particular special characters in the sequence names. Squirrel will check for these special characters in the sequence names and report back if it finds any to help you curate your alignment. Special characters that we check for are: ' " { ( [ ] ) } ; , ` % # : \n \t, however accented letters etc may also disrupt the software. For best practice, please exclude special characters from sequence names.
Phylogeny & Reconstruction
Squirrel has the optional --run-phylo and --run-apobec3-phylo modes that will take the newly generated alignment and build a maximum likelihood phylogeny using IQTREE2.
With --run-apobec3-phylo mode, it also runs IQTREE ancestral state reconstruction (-asr), and parses the output state files, providing a branch-mapped summary of SNPs that have occurred across the phylogeny, and an output phylogeny figure with SNPs plotted along branches, coloured by whether SNPs are consistent with APOBEC3-editing or not (red for APOBEC3-like mutations and yellow for other mutations).
An outgroup (or multiple outgroups) must be specified (although this is handled internally in --include-background mode) to ensure correct rooting for the ancestral state reconstruction. If --cns-qc mode is on in conjunction with phylogenetics, reversions to reference and convergent SNPs are also flagged using the reconstruction.
Squirrel can also run a pan-clade alignment option, which aligns against a pan-clade composite consensus reference generated from a pairwise alignment of PX667572 and NC_063383.
Masking
Default Masking Options
The final alignment from squirrel has additional processing steps that includes various masking options:
**Sequences with a high N content
Sequences that have many ambiguous bases in them are flagged that they may want to be excluded in further analysis. This may not always be appropriate, often genomes that have a lot of ambiguity can still be informative, however if there is something unusual about a sequence, having lots of ambiguities can be a flag for wider problems (like low read count during assembly).
Phylogenetics
Maximum-Likelihood Phylogeny
Low-complexity/Repetitive Regions (enabled by default)
Squirrel performs masking (replacement with N) on low-complexity or repetitive regions that have been characterised for Clade I and II
The relevant mask file will be selected depending on the --clade specified
Can be disabled with --no-mask
Custom Masking
Additional Column Masking (--additional-mask)
Squirrel can accept an additional mask file in CSV format for flagged sites that you wish to mask within the alignment
These sites will be masked in conjunction with the default masking files
The format fits with the features format from Geneious and at a minimum should contain: “Maximum”, “Minimum” fields
If squirrel is run in QC mode (-qc), it will flag sites that may need to be masked and produce a CSV mask file summarising the sites
Sequence-specific Masking (--sequence-mask)
Mask particular sites in a problematic sequence, rather than masking out the entire column across the alignment
Requires a CSV file with a sequence column (must match an id in the supplied FASTA file) and a site column (1-based position)
If a sequence needs multiple sites masked, specify them on separate lines with the same sequence name
Quality Control
Squirrel can run quality control (QC) on the alignment and flag certain sites to the user that may need to be masked. We recommend that the user looks at these sites in an alignment viewer to judge whether the sites should be masked or not. If -qc mode is toggled on, squirrel with check within the alignment for:
Mutations that are adjacent to N bases
The rationale for this is that N sites are usually a product of low coverage regions. Mutations that occur directly adjacent to low coverage regions may be a result of mis-alignment prior to the low coverage masking and may not be real SNPs. In squirrel, non-majority alleles that are present next to an N are flagged as potential sites for masking.
Unique mutations that clump together
If mutations are observed in only a single sequence in the genome, they are classed as unique mutations. Usually mutations do not clump closely together and may suggest an alignment or assembly issue. If these mutations are not shared with any other sequences, they are flagged for masking.
Build a phylogenetic tree with specified outgroups:
More details on --include-background can be found in the Background Mode section.
APOBEC3 Mutation Analysis
Signatures of APOBEC3-editing are characteristic of MPXV evolution during sustained transmission in the human population (see O’Toole et al 2023). To run the maximum-likelihood phylogenetics and ancestral reconstruction pipeline that characterises the APOBEC3-like and non-APOBEC3 mutations:
The sequence file you provide must have the specified outgroups in it, with the IDs matching those you provide
This pipeline can accept one or more outgroup IDs
The outgroups help to root the phylogeny so that ancestral reconstruction can be performed in the appropriate direction
Squirrel will produce an annotated tree file that contains information on each branch about how many APOBEC3/non-APOBEC3 mutations are present on each branch, and also on each subtree beneath that branch.
Recommended Outgroups
NEW FEATURE: It is no longer necessary to manually specify an outgroup. By default the recommended outgroups below are selected and pruned off for each clade I or II specifically.
Clade I: KJ642617, KJ642615
Clade IIb: KJ642617, KJ642615
Example: To run phylogenetics mode on a Clade I dataset, include KJ642617, KJ642615 and KJ642616 in your test sequence file and run:
Squirrel has the optional --run-phylo and --run-apobec3-phylo modes that will take the newly generated alignment and build a maximum likelihood phylogeny using IQTREE2. With --run-apobec3-phylo mode, it also runs IQTREE ancestral state reconstruction (-asr), and parses the output state files, providing a branch-mapped summary of SNPs that have occurred across the phylogeny, and an output phylogeny figure with SNPs plotted along branches, coloured by whether SNPs are consistent with APOBEC3-editing or not (red for APOBEC3-like mutations and yellow for other mutations). An outgroup (or multiple outgroups) must be specified (although this is handled internally in --include-background mode) to ensure correct rooting for the ancestral state reconstruction. If --cns-qc mode is on in conjunction with phylogenetics, reversions to reference and convergent SNPs are also flagged using the reconstruction.
Automatic Outgroup Selection
squirrel will automatically pull out the background sequences from the relevant clade and select the appropriate outgroup from the background set. This outgroup will be pruned from the final output tree, so will not be seen in the output file.
Automatically selected outgroups by clade:
Clade
Outgroup
cladei
KJ642615|human|Nigeria||1978
cladeia
KJ642615|human|Nigeria||1978
cladeib
KJ642613|human|DRC|Equateur|1970-09-01
cladeii
KJ642613|human|DRC|Equateur|1970-09-01
cladeiia
KJ642613|human|DRC|Equateur|1970-09-01
cladeiib
KJ642615|human|Nigeria||1978
Background Mode
The squirrel software has a set of publically available MPXV genome sequences that include representatives of CladeIa, CladeIb, Clade IIa and CladeIIb. The sequences, the Genbank accession numbers and their clade annotations can be found in background_sample.csv and background.fasta.
If --include-background is run with the squirrel command, squirrel will automatically pull out the background sequences from the relevant clade and select the appropriate outgroup from the background set.
Squirrel includes a set of publicly available MPXV genome sequences that include representatives of Clade Ia, Clade Ib, Clade IIa and Clade IIb. The sequences, GenBank accession numbers and their clade annotations can be found in:
When --include-background is used, squirrel will automatically:
Pull out the background sequences from the relevant clade
Select the appropriate outgroup from the background set
Combine them with your input sequences
Prune the outgroup from the final output tree
Example: Running squirrel --clade cladei --include-background <sequences.fasta> will combine all Clade I background sequences with your input file and include a Clade II outgroup sequence to correctly root the tree.
Note: If outgroups are specified in the supplied input fboth the -qc and --run-phylo flags), additional checks will be performed after the reconstruction:
Convergent Mutations
Any mutations that occur multiple times across the phylogeny are flagged for investigation
May highlight an issue with the tree structure or underlying sequences
Reversions to Reference
There are several options for customizing the final APOBEC3-reconstruction tree figure output:
Re-render Tree Figure Only (-tfig/--tree-figure-only)
Rather than rerunning the entire analysis, you can re-render just the tree figure
Requires:
Tree file from previous squirrel run (-tf/--tree-file)
--fig-height - Specify custom height for the final tree visualization
--fig-width - Specify custom width for the final tree visualization
Point Style Options
--point-style - Choose how mutations are represented:
NEW FEATURE: Squirrel can now accept a precomputed treefile, alignment and state file generated by IQTREE2 to run the APOBEC3 reconstruction.
Note: If the tree file has multiple likelihood metrics, it cannot be parsed by baltic.
Required flags:
Flag
Description
-at, --asr-tree
Precomputed tree with ancestral states reconstructed
-as, --asr-state
Precomputed ancestral states file
-aln, --asr-alignment
Precomputed alignment file
Output Files
Core Output Files
sequences.aln.fasta - The alignment file, scaffolded against a clade-specific reference. By default one of the ITR regions and a curated set of problematic regions is masked as Ns.
sequences.aln.tree - The output maximum likelihood tree file from IQTREE2 with Node labels that correspond to the reconstruction Node labels. Can be viewed in tree viewers like FigTree.
sequences.aln.annotated.tree - The output maximum likelihood tree file from IQTREE2 with annotations on branches describing the APOBEC3-reconstruction results. Can be viewed in tree viewers like FigTree.
sequences.aln.report.html - Summary report of analysis run.
Reconstruction Files
sequences.aln.tree.state and sequences.aln.tree.state_differences.csv - The output ancestral state reconstruction file from IQTREE2 and the compiled list of unambiguously variable sites from squirrel.
sequences.aln.tree.branch_snps.reconstruction.csv - A report of individual site changes mapped to specific branches and their dinucleotide context.
sequences.aln.tree.amino_acid.reconstruction.csv - A report of each mutation that occurs across the phylogeny: location, dinucleotide context, APOBEC3 status, gene, codon position, amino acid change, and Grantham score.
Visualisation Files
sequences.aln.tree.png and sequences.aln.tree.svg - Visualization of reconstructed tree showing whether mutations are consistent with APOBEC3 editing or not (red for APOBEC3-like mutations, yellow for other mutations).
*tree.html - Interactive tree file (when using --interactive-tree flag).
Installation
Install from bioconda with conda or mamba, e.g. conda create -c bioconda -c conda-forge -n squirrel -y squirrel.
Or:
Clone this repository and cd squirrel
if mamba is not installed, run conda install -n base mamba
mamba env create -f environment.yml
conda activate squirrel
pip install .
Check the install worked
Type (in the squirrel environment):
squirrel -v
and you should see the versions of squirrel.
Full usage
usage: squirrel <input> [options]
squirrel: Some QUIck Rearranging to Resolve Evolutionary Links
options:
-h, --help show this help message and exit
Input-Output options:
input Input fasta file of sequences to analyse.
-o, --outdir OUTDIR Output directory. Default: current working directory
--outfile OUTFILE Optional output file name. Default: <input>.aln.fasta
--tempdir TEMPDIR Specify where you want the temp stuff to go. Default: $TMPDIR
--no-temp Output all intermediate files, for dev purposes.
--epi2me-outdir EPI2ME_OUTDIR
Output directory for report paths in epi2me.
Alignment options:
-qc, --seq-qc Flag potentially problematic SNPs and sequences. Default: don't run QC
--assembly-refs ASSEMBLY_REFS
References to check for `calls to reference` against.
--no-mask Skip masking of repetitive regions. Default: masks repeat regions
--no-itr-mask Skip masking of end ITR. Default: masks ITR
--additional-mask ADDITIONAL_MASK
Masking additional sites provided as a csv. Needs columns `Maximum` and
`Minimum` in 1-base.
--sequence-mask SEQUENCE_MASK
Mask sites in specific sequences in the alignment as a csv, rather than
the whole alignment column. Needs `sequence` and `site` (1-based)
column.
-ex, --exclude EXCLUDE
Supply a csv file listing sequences that should be excluded from the
analysis.
--extract-cds Extract coding sequences based on coordinates in the reference
--concatenate Concatenate coding sequences for each genome, separated by `NNN`.
Default: write out as separate records
--clade CLADE Specify whether the alignment is primarily for `cladei` or `cladeii`
(can also specify a or b, e.g. `cladeia`, `cladeiib`). `split` will
separate input sequences into separate cladei and cladeii alignments.
This will determine reference used for alignment, mask file and
background set used if `--include-background` flag used in conjunction
with the `--run-phylo` option. Default: `cladeii`
Phylo options:
-p, --run-phylo Run phylogenetics pipeline
-a, --run-apobec3-phylo
Run phylogenetics & APOBEC3-mutation reconstruction pipeline
--outgroups OUTGROUPS
Specify which MPXV outgroup(s) in the alignment to use in the phylogeny.
These will get pruned out from the final tree.
-bg, --include-background
Include a default background set of sequences for the phylogenetics
pipeline. The set will be determined by the `--clade` specified.
-bf, --background-file BACKGROUND_FILE
Include this additional FASTA file as background to the phylogenetics.
-bm, --binary-partition-mask
Calculate and write binary partition mask
--bm-separate-dimers Write partition mask with 0 for non-apo, 1 for GA and 2 for TC target
sites
-at, --asr-tree ASR_TREE
Precomputed tree with ancestral states reconstructed
-as, --asr-state ASR_STATE
Precomputed ancestral states file
-aln, --asr-alignment ASR_ALIGNMENT
Precomputed alignment file
Tree figure options:
-tfig, --tree-figure-only
Re-render tree figure custom height and width arguments. Requires: tree
file, branch reconstruction file, height, width.
-tf, --tree-file TREE_FILE
Tree for re-rendering the figure.
-brf, --branch-reconstruction-file BRANCH_RECONSTRUCTION_FILE
Reconstruction file for re-rendering the figure.
--fig-height FIG_HEIGHT
Overwrite tree figure default height.
--fig-width FIG_WIDTH
Overwrite tree figure default width.
--point-style POINT_STYLE
Shape of points for apobec3 reconstruction figure. Options: circle,
square. Default: circle
--point-justify POINT_JUSTIFY
Justification of points for apobec3 reconstruction figure. Options:
left, right. Default: left
--interactive-tree Create a separate interactive file built from R functions
Misc options:
-v, --version show program's version number and exit
--verbose Print lots of stuff to screen
-t, --threads THREADS
Number of threads
Output files
sequences.aln.fasta
The alignment file, with alignment scaffolded against a clade-specific reference. By default one of the ITR regions and a curated set of problematic regions is masked as Ns.
sequences.aln.tree
The output maximum likelihood tree file from IQTREE2 with Node labels that correspond to the reconstruction Node labels. This tree can be viewed in various tree viewers, for example FigTree.
sequences.aln.annotated.tree
The output maximum likelihood tree file from IQTREE2 with annotations on branches that describe the APOBEC3-reconstrucion results. This tree can be viewed in various tree viewers, for example FigTree.
sequences.aln.tree.state and sequences.aln.tree.state_differences.csv
The output ancestral state reconstruction file from IQTREE2 and the compiled list of unambiguously variable sites from squirrel.
sequences.aln.tree.branch_snps.reconstruction.csv
A report of individual site changes mapped to specific branches and their dinucleotide context.
sequences.aln.tree.amino_acid.reconstruction.csv
A report of each mutation that occurs across the phylogeny, their location, dinucleotide context, APOBEC3 status, which gene they’re present in, codon position, amino acid change and a prediction of how extreme that amino acid change is with Grantham score.
sequences.aln.tree.png and sequences.aln.tree.svg
Visualisation of reconstructed tree showing whether mutations are consistent with APOBEC3 editing or not.
sequences.aln.report.html
Summary report of analysis run.
What is a FASTA file
A FASTA-formatted file contains sequence records. A record minimally contains two pieces of information:
The sequence ID (e.g. sequence1)
The sequence itself (e.g. CGATCGAT...ACTGACT)
Format details:
The sequence ID is stored in the header line, denoted by a > symbol
The header line may also contain additional information (sequence description) after the first space
Important: The sequence ID must not contain whitespace (spaces or tabs)
The sequence is stored on the following line(s)
Sequences can be split across multiple lines for readability
The next record does not start until the next line that begins with >
Example FASTA with three sequence records:
Example FASTA with three sequence records:
squirrel
Some QUIck Reconstruction to Resolve Evolutionary Links
Data use
Squirrel used open data to train the clade assignment random forest model. This data was accessed through Pathoplexus and the Sequence Set used for training can be accessed at https://doi.org/10.62599/PP_SS_529.1.
How to cite:
If you use squirrel in your research, please cite the Squirrel paper, which is now published in Virus Evolution:
Áine O’Toole, Eddy Kinganda-Lusamaki, Rachel Colquhoun, Connor Chato, Emily Scher, Chris Kent, Sam Wilkinson, Joshua Quick, Nicholas Loman, Ana T Duggan, Placide Mbala, Andrew Rambaut, Human outbreak detection and best practice MPXV analysis and interpretation with squirrel, Virus Evolution, Volume 12, Issue 1, 2026, veaf095, https://doi.org/10.1093/ve/veaf095
Table of Contents
Quick Start
Basic Alignment
Align MPXV Clade II sequences:
Align MPXV Clade I sequences:
where
<your-sequences.fasta>is the name of your input FASTA sequence file. See FASTA format for file format details.Mixed Clade Sequences
NEW FEATURE: If your FASTA file contains a mixture of Clade I and Clade II sequences, squirrel can now assign clade to your sequences, split them into separate alignments for Clade I and II, and run all downstream analysis per clade. A single report is produced at the end.
Installation
Option 1: Install from bioconda (recommended)
Option 2: Install from source
cd squirrelmambais not installed, runconda install -n base mambamamba env create -f environment.ymlconda activate squirrelpip install .Verify installation:
You should see the version of squirrel.
How It Works
Alignment Process
squirrel provides a rapid way of producing reliable alignments for MPXV and enables maximum-likelihood phylogenetics pipeline tree estimation.
The reliability of tree estimation is determined by the quality of the input genome sequences. In QC mode, squirrel can flag potential issues in the MPXV sequences that have been provided for alignment (e.g. SNPs near tracts of N, clusters of unique SNPs, reversions to reference alleles and convergent mutations) and outputs these in a mask file for investigation. We suggest you use this information to examine the alignment and pay close attention to the regions flagged. Squirrel can then accept this file with suggested masks and apply it to the sequences before doing phylogenetics.
Enrichment of APOBEC3-mutations in the MPXV population are a signature of sustained human-to-human transmission. Identifying APOBEC3-like mutations in MPXV genomes from samples in a new outbreak can be a piece of evidence to support sustained human transmission of mpox. Squirrel can run an APOBEC3-reconstruction and map these mutations onto the phylogeny.
Squirrel produces a HTML report summarising some of the key outputs.
Generate a quick MPXV alignment
where
<your-sequences.fasta>is the name of your input FASTA sequence file. Click here see what a FASTA formatted file looks like.Note, an EPI2ME wrapper for squirrel with the same options is available here.
How it works - alignment
Squirrel maps each query genome in the input file against a reference genome specific to each clade using minimap2. Using gofasta, the mapping file is then converted into a multiple sequence alignment.
For Clade II, the reference used is
NC_063383and for Clade I, we useNC_003310. This means that all coordinates within an alignment will be relative to these references. A benefit of this is that within a clade, alignment files and be combined without having to recalculate the alignment. Note however that insertions relative to the reference sequence will not be included in the alignment.Squirrel by default creates a single alignment fasta file. Using the genbank coordinates for
NC_063383it also has the ability to extract the aligned coding sequences either as separate records or as a concatenated alignment. This can facilitate codon-aware phylogenetic or sequence analysis.If you input sequences that are entirely N or very short, squirrel may fail to align these. If all input sequences are 100% ambiguous, squirrel will exit with an error.
Some of the phylogenetic software within squirrel (iqtree, baltic) will modify or break with particular special characters in the sequence names. Squirrel will check for these special characters in the sequence names and report back if it finds any to help you curate your alignment. Special characters that we check for are:
' " { ( [ ] ) } ; , ` % # : \n \t, however accented letters etc may also disrupt the software. For best practice, please exclude special characters from sequence names.Phylogeny & Reconstruction
Squirrel has the optional
--run-phyloand--run-apobec3-phylomodes that will take the newly generated alignment and build a maximum likelihood phylogeny using IQTREE2.With
--run-apobec3-phylomode, it also runs IQTREE ancestral state reconstruction (-asr), and parses the output state files, providing a branch-mapped summary of SNPs that have occurred across the phylogeny, and an output phylogeny figure with SNPs plotted along branches, coloured by whether SNPs are consistent with APOBEC3-editing or not (red for APOBEC3-like mutations and yellow for other mutations).An outgroup (or multiple outgroups) must be specified (although this is handled internally in
--include-backgroundmode) to ensure correct rooting for the ancestral state reconstruction. If--cns-qcmode is on in conjunction with phylogenetics, reversions to reference and convergent SNPs are also flagged using the reconstruction.Squirrel can also run a pan-clade alignment option, which aligns against a pan-clade composite consensus reference generated from a pairwise alignment of PX667572 and NC_063383.
Masking
Default Masking Options
The final alignment from squirrel has additional processing steps that includes various masking options:
**Sequences with a high N content
Sequences that have many ambiguous bases in them are flagged that they may want to be excluded in further analysis. This may not always be appropriate, often genomes that have a lot of ambiguity can still be informative, however if there is something unusual about a sequence, having lots of ambiguities can be a flag for wider problems (like low read count during assembly).
Phylogenetics
Maximum-Likelihood Phylogeny
Low-complexity/Repetitive Regions (enabled by default)
N) on low-complexity or repetitive regions that have been characterised for Clade I and II--cladespecified--no-maskCustom Masking
Additional Column Masking (
--additional-mask)-qc), it will flag sites that may need to be masked and produce a CSV mask file summarising the sitesSequence-specific Masking (
--sequence-mask)sequencecolumn (must match an id in the supplied FASTA file) and asitecolumn (1-based position)Quality Control
Squirrel can run quality control (QC) on the alignment and flag certain sites to the user that may need to be masked. We recommend that the user looks at these sites in an alignment viewer to judge whether the sites should be masked or not. If
-qcmode is toggled on, squirrel with check within the alignment for:Build a phylogenetic tree with specified outgroups:
Build a tree with automatic background sequences (no outgroup needed):
More details on
--include-backgroundcan be found in the Background Mode section.APOBEC3 Mutation Analysis
Signatures of APOBEC3-editing are characteristic of MPXV evolution during sustained transmission in the human population (see O’Toole et al 2023). To run the maximum-likelihood phylogenetics and ancestral reconstruction pipeline that characterises the APOBEC3-like and non-APOBEC3 mutations:
Requirements:
Squirrel will produce an annotated tree file that contains information on each branch about how many APOBEC3/non-APOBEC3 mutations are present on each branch, and also on each subtree beneath that branch.
Recommended Outgroups
NEW FEATURE: It is no longer necessary to manually specify an outgroup. By default the recommended outgroups below are selected and pruned off for each clade I or II specifically.
Example: To run phylogenetics mode on a Clade I dataset, include KJ642617, KJ642615 and KJ642616 in your test sequence file and run:
How it works - phylogeny & reconstruction
Squirrel has the optional
--run-phyloand--run-apobec3-phylomodes that will take the newly generated alignment and build a maximum likelihood phylogeny using IQTREE2. With--run-apobec3-phylomode, it also runs IQTREE ancestral state reconstruction (-asr), and parses the output state files, providing a branch-mapped summary of SNPs that have occurred across the phylogeny, and an output phylogeny figure with SNPs plotted along branches, coloured by whether SNPs are consistent with APOBEC3-editing or not (red for APOBEC3-like mutations and yellow for other mutations). An outgroup (or multiple outgroups) must be specified (although this is handled internally in--include-backgroundmode) to ensure correct rooting for the ancestral state reconstruction. If--cns-qcmode is on in conjunction with phylogenetics, reversions to reference and convergent SNPs are also flagged using the reconstruction.Automatic Outgroup Selection
squirrel will automatically pull out the background sequences from the relevant clade and select the appropriate outgroup from the background set. This outgroup will be pruned from the final output tree, so will not be seen in the output file.
Automatically selected outgroups by clade:
Background Mode
The squirrel software has a set of publically available MPXV genome sequences that include representatives of CladeIa, CladeIb, Clade IIa and CladeIIb. The sequences, the Genbank accession numbers and their clade annotations can be found in background_sample.csv and background.fasta.
If
--include-backgroundis run with the squirrel command, squirrel will automatically pull out the background sequences from the relevant clade and select the appropriate outgroup from the background set. Squirrel includes a set of publicly available MPXV genome sequences that include representatives of Clade Ia, Clade Ib, Clade IIa and Clade IIb. The sequences, GenBank accession numbers and their clade annotations can be found in:When
--include-backgroundis used, squirrel will automatically:Example: Running
squirrel --clade cladei --include-background <sequences.fasta>will combine all Clade I background sequences with your input file and include a Clade II outgroup sequence to correctly root the tree.The outgroups automatically selected are the same as listed in the Automatic Outgroup Selection table above.
Convergent Mutations
Reversions to Reference There are several options for customizing the final APOBEC3-reconstruction tree figure output:
Re-render Tree Figure Only (
-tfig/--tree-figure-only)-tf/--tree-file)-brf/--branch-reconstruction-file)Size Customization
--fig-height- Specify custom height for the final tree visualization--fig-width- Specify custom width for the final tree visualizationPoint Style Options
--point-style- Choose how mutations are represented:NEW FEATURE: Squirrel can now accept a precomputed treefile, alignment and state file generated by IQTREE2 to run the APOBEC3 reconstruction.
Note: If the tree file has multiple likelihood metrics, it cannot be parsed by baltic.
Required flags:
-at, --asr-tree-as, --asr-state-aln, --asr-alignmentOutput Files
Core Output Files
sequences.aln.fasta- The alignment file, scaffolded against a clade-specific reference. By default one of the ITR regions and a curated set of problematic regions is masked as Ns.sequences.aln.tree- The output maximum likelihood tree file from IQTREE2 with Node labels that correspond to the reconstruction Node labels. Can be viewed in tree viewers like FigTree.sequences.aln.annotated.tree- The output maximum likelihood tree file from IQTREE2 with annotations on branches describing the APOBEC3-reconstruction results. Can be viewed in tree viewers like FigTree.sequences.aln.report.html- Summary report of analysis run.Reconstruction Files
sequences.aln.tree.stateandsequences.aln.tree.state_differences.csv- The output ancestral state reconstruction file from IQTREE2 and the compiled list of unambiguously variable sites from squirrel.sequences.aln.tree.branch_snps.reconstruction.csv- A report of individual site changes mapped to specific branches and their dinucleotide context.sequences.aln.tree.amino_acid.reconstruction.csv- A report of each mutation that occurs across the phylogeny: location, dinucleotide context, APOBEC3 status, gene, codon position, amino acid change, and Grantham score.Visualisation Files
sequences.aln.tree.pngandsequences.aln.tree.svg- Visualization of reconstructed tree showing whether mutations are consistent with APOBEC3 editing or not (red for APOBEC3-like mutations, yellow for other mutations).*tree.html- Interactive tree file (when using--interactive-treeflag).Installation
Install from bioconda with
condaormamba, e.g.conda create -c bioconda -c conda-forge -n squirrel -y squirrel.Or:
cd squirrelmambais not installed, runconda install -n base mambamamba env create -f environment.ymlconda activate squirrelpip install .Check the install worked
Type (in the squirrel environment):
and you should see the versions of squirrel.
Full usage
Output files
sequences.aln.fastaThe alignment file, with alignment scaffolded against a clade-specific reference. By default one of the ITR regions and a curated set of problematic regions is masked as Ns.
sequences.aln.treeThe output maximum likelihood tree file from IQTREE2 with Node labels that correspond to the reconstruction Node labels. This tree can be viewed in various tree viewers, for example FigTree.
sequences.aln.annotated.treeThe output maximum likelihood tree file from IQTREE2 with annotations on branches that describe the APOBEC3-reconstrucion results. This tree can be viewed in various tree viewers, for example FigTree.
sequences.aln.tree.stateandsequences.aln.tree.state_differences.csvThe output ancestral state reconstruction file from IQTREE2 and the compiled list of unambiguously variable sites from squirrel.
sequences.aln.tree.branch_snps.reconstruction.csvA report of individual site changes mapped to specific branches and their dinucleotide context.
sequences.aln.tree.amino_acid.reconstruction.csvA report of each mutation that occurs across the phylogeny, their location, dinucleotide context, APOBEC3 status, which gene they’re present in, codon position, amino acid change and a prediction of how extreme that amino acid change is with Grantham score.
sequences.aln.tree.pngandsequences.aln.tree.svgVisualisation of reconstructed tree showing whether mutations are consistent with APOBEC3 editing or not.
sequences.aln.report.htmlSummary report of analysis run.
What is a FASTA file
A FASTA-formatted file contains sequence records. A record minimally contains two pieces of information:
sequence1)CGATCGAT...ACTGACT)Format details:
>symbol>Example FASTA with three sequence records: Example FASTA with three sequence records: