SNAP is a general purpose gene finding program suitable for both eukaryotic and
prokaryotic genomes. SNAP is an acroynm for Semi-HMM-based Nucleic Acid Parser.
Reference
Korf I. Gene finding in novel Genomes. BMC Bioinformatics 2004, 5:59
Contact
I appreciate bug reports, comments, and suggestions. My current contact
information is:
DNA Contains some sample sequences
HMM Contains SNAP parameter files
LICENSE MIT license
Makefile For compiling
Makefile.include Automatically generated, should not be edited
Zoe A library containing lots of the base functions
fathom.c Utility for investigating sequences and annotation
forge.c Parameter estimation
hmm-assembler.pl Creates HMMs for SNAP
snap.c Gene prediction program
Your favorite genome…
If you wish to train SNAP for a new genome, please contact me. Parameter
estimation is not particularly difficult, but the procedure is not well
documented and I have only included the minimum applications here. I’ve
included the basic strategy at the end of this document.
INSTALLATION INSTRUCTIONS
The software is routinely compiled and tested on Mac and Linux on a variety of
architectures. It should compile cleanly on any Linux/Unix type operating
systems but new compilers sometimes complain, so please let me know if you have
problems.
Enviroment
The ZOE environment variable can be used by SNAP to find the HMM files. Set
this to the directory containing this file. For example, if you unpackaged the
tar-ball in /usr/local, set the ZOE environment variable to /usr/local/Zoe
Sequences must be in FASTA format. It’s a good idea if you don’t have genes
that are too related to each other.
Gene structures must be in ZFF format. What is ZFF? It is a non-standard format
(ie. nobody uses it but me) that bears resemblence to FASTA and GFF (both true
standards). There are two styles of ZFF, the short format and the long format.
In both cases, the sequence records are separated by a definition line, just
like FASTA. In the short format, there are 4 fields: Label, Begin, End, Group.
The 4th field is optional. Label is a controlled vocabulary (see zoeFeature.h
for a complete list). All exons of a gene (or more appropriately a
transcriptional unit) must share the same unique group name. The strand of the
feature is implied in the coordinates, so if Begin > End, the feature is on the
minus strand. Here’s and example of the short format with two sequences, each
containing a single gene on the plus strand:
The long format adds 5 fields between the coordinates and the group: Strand,
Score, 5’-overhang, 3’-overhang, and Frame. Strand is +/-. Score is any
floating point value. 5’- and 3’-overhang are the number of bp of an incomplete
codon at each end of an exon. Frame is the reading frame (0..2 and not 1..3).
Here’s an example of the long format:
In this tutorial, we will create SNAP HMM files for 3 different genomes. In the
DATA directory, you will find fasta and gff3 files corresponding to 1 percent
of the A. thaliana, C. elegans, and D. melanogaster genomes. Let’s start by
creating a directory for training A. thaliana in the main SNAP directory. We’ll
run gff3_to_zff.pl to convert the annotation to ZFF.
mkdir train_at
cd train_at
../gff3_to_zff.pl ../DATA/at.fa.gz ../DATA/at.gff3.gz > at.zff
The next step is to check for errors in the annotation. The training procedure
assumes that genes are canonical in various respects.
Coding sequences start with ATG and end in a stop codon
Splice sites follow the GT..AG rule
Genes are at least 150 bp
Exons are at least 6 bp
CDS are at least 150 bp
Introns are at least 30 bp
Running fathom -validate will tell you which genes look ok and which genes
look suspicious. Let’s try one.
This will produce a bunch of output to STDERR. You will see several WARNING
lines saying the DNA and annotation don’t have the same definition lines.
That’s okay. The ZFF contains only the sequence id from the FASTA file and not
the whole definition line present in the original FASTA. The last line gives
some overall stats.
463 genes, 463 OK, 40 warnings, 0 errors
If you examine the at.validate file, you will see warnings for some short
genes, short exons, non-canonical introns, etc. We won’t be using these to
train SNAP. To split genes into various categories use the -categorize
function of fathom and give it a value for how much intergenic sequence you
want on each side of a gene. For example fathom -categorize 1000 attempts to
put 1000 bp of genomic sequence on each side of a gene. However, if two genes
are close to each other, say only 400 bp apart, they split the intergenic
sequence and each get 200 bp of intergenic.
../fathom -categorize 100 at.zff ../DATA/at.fa.gz
This produces several new files:
alt.* contains genes with alternative splicing
err.* contains genes with errors
olp.* contains overlapping genes
uni.* contains unique genes
wrn.* contains genes with warnings
fathom doesn’t want to create training files with alternative splicing. It
could create a case of overtraining for those specific genes. If you have a lot
of alternative splicing, you may want to remove all of the isoforms except for
the main one. fathom also doesn’t know what to do with overlapping genes
because it requires genes to have intergenic sequence on either side. These
tend to be rare. Genes with unusual features or outright errors are separated
also.
The next step is to export all of the uni genes into their plus-stranded
versions.
../fathom -export 100 -plus uni.*
This creates 4 new files:
export.aa contains the amino acid sequences
export.ann contains the annotation in ZFF format
export.dna contains the sequences in FASTA format
export.tx contains the sequences of the coding sequences
Ideally, all of the proteins in export.aa start with M and end with *.
Similarly, the export.tx files should start with ATG and end in a stop
codon. All of the genes should validate without any reported warnings or
errors.
../fathom -validate export.ann export.dna
The next step is to run forge, which will create a large number of model
files.
../forge export.ann export.dna
Finally, run hmm-assembler.pl to glue the various models together to form an
hmm parameter file. There are several options, but we’ll just use the defaults.
./hmm-assembler.pl A.thaliana . > at.hmm
To verify this works, you can try it on the various fasta files we’ve used.
Next, we are going to try a slightly different training procedure that might be
better if you have a lot of genes that are getting stuffed into the wrn.*
files. Let’s rewind a bit.
The standard procedure will have us training from the 236 genes in the uni
category. To recover the genes in the wrn category, we’ll just glue them to
the uni and then export that for the training.
What about all of the genes in the alt category? Those genes are reported to
have multiple isoforms. Training from a gene with 10 isoforms would count that
gene 10 times, so these are generally skipped. However, as more and more
isoforms are found, this will become problematic. You will need some way to
figure out which isoform is canonical and delete the rest. I don’t have an
automated way to do that as each community has their own standards.
Try the training procedures for the C. elegans and D. melanogaster genomes
next. Note that these training sets represent just 1% of each chromosome and
are just for demonstration purposes.
SNAP Documentation
Introduction
SNAP is a general purpose gene finding program suitable for both eukaryotic and prokaryotic genomes. SNAP is an acroynm for Semi-HMM-based Nucleic Acid Parser.
Reference
Korf I. Gene finding in novel Genomes. BMC Bioinformatics 2004, 5:59
Contact
I appreciate bug reports, comments, and suggestions. My current contact information is:
License
This software is covered by the MIT License.
Files and Directories
Your favorite genome…
If you wish to train SNAP for a new genome, please contact me. Parameter estimation is not particularly difficult, but the procedure is not well documented and I have only included the minimum applications here. I’ve included the basic strategy at the end of this document.
INSTALLATION INSTRUCTIONS
The software is routinely compiled and tested on Mac and Linux on a variety of architectures. It should compile cleanly on any Linux/Unix type operating systems but new compilers sometimes complain, so please let me know if you have problems.
Enviroment
The ZOE environment variable can be used by SNAP to find the HMM files. Set this to the directory containing this file. For example, if you unpackaged the tar-ball in /usr/local, set the ZOE environment variable to /usr/local/Zoe
If you do not use the ZOE environment variable, you can still use SNAP but you must specify the explict path to the parameter file.
Compiling
Provided you have gcc and make, compiling should be as simple as:
Testing
PARAMETER ESTIMATION
Sequences must be in FASTA format. It’s a good idea if you don’t have genes that are too related to each other.
Gene structures must be in ZFF format. What is ZFF? It is a non-standard format (ie. nobody uses it but me) that bears resemblence to FASTA and GFF (both true standards). There are two styles of ZFF, the short format and the long format. In both cases, the sequence records are separated by a definition line, just like FASTA. In the short format, there are 4 fields: Label, Begin, End, Group. The 4th field is optional. Label is a controlled vocabulary (see zoeFeature.h for a complete list). All exons of a gene (or more appropriately a transcriptional unit) must share the same unique group name. The strand of the feature is implied in the coordinates, so if Begin > End, the feature is on the minus strand. Here’s and example of the short format with two sequences, each containing a single gene on the plus strand:
The long format adds 5 fields between the coordinates and the group: Strand, Score, 5’-overhang, 3’-overhang, and Frame. Strand is +/-. Score is any floating point value. 5’- and 3’-overhang are the number of bp of an incomplete codon at each end of an exon. Frame is the reading frame (0..2 and not 1..3). Here’s an example of the long format:
TUTORIAL
In this tutorial, we will create SNAP HMM files for 3 different genomes. In the
DATAdirectory, you will find fasta and gff3 files corresponding to 1 percent of the A. thaliana, C. elegans, and D. melanogaster genomes. Let’s start by creating a directory for training A. thaliana in the main SNAP directory. We’ll rungff3_to_zff.plto convert the annotation to ZFF.The next step is to check for errors in the annotation. The training procedure assumes that genes are canonical in various respects.
Running
fathom -validatewill tell you which genes look ok and which genes look suspicious. Let’s try one.This will produce a bunch of output to STDERR. You will see several WARNING lines saying the DNA and annotation don’t have the same definition lines. That’s okay. The ZFF contains only the sequence id from the FASTA file and not the whole definition line present in the original FASTA. The last line gives some overall stats.
If you examine the
at.validatefile, you will see warnings for some short genes, short exons, non-canonical introns, etc. We won’t be using these to train SNAP. To split genes into various categories use the-categorizefunction offathomand give it a value for how much intergenic sequence you want on each side of a gene. For examplefathom -categorize 1000attempts to put 1000 bp of genomic sequence on each side of a gene. However, if two genes are close to each other, say only 400 bp apart, they split the intergenic sequence and each get 200 bp of intergenic.This produces several new files:
alt.*contains genes with alternative splicingerr.*contains genes with errorsolp.*contains overlapping genesuni.*contains unique geneswrn.*contains genes with warningsfathomdoesn’t want to create training files with alternative splicing. It could create a case of overtraining for those specific genes. If you have a lot of alternative splicing, you may want to remove all of the isoforms except for the main one.fathomalso doesn’t know what to do with overlapping genes because it requires genes to have intergenic sequence on either side. These tend to be rare. Genes with unusual features or outright errors are separated also.The next step is to export all of the
unigenes into their plus-stranded versions.This creates 4 new files:
export.aacontains the amino acid sequencesexport.anncontains the annotation in ZFF formatexport.dnacontains the sequences in FASTA formatexport.txcontains the sequences of the coding sequencesIdeally, all of the proteins in
export.aastart withMand end with*. Similarly, theexport.txfiles should start withATGand end in a stop codon. All of the genes should validate without any reported warnings or errors.The next step is to run
forge, which will create a large number of model files.Finally, run
hmm-assembler.plto glue the various models together to form an hmm parameter file. There are several options, but we’ll just use the defaults.To verify this works, you can try it on the various fasta files we’ve used.
Next, we are going to try a slightly different training procedure that might be better if you have a lot of genes that are getting stuffed into the
wrn.*files. Let’s rewind a bit.Let’s see how many genes are in each category.
The standard procedure will have us training from the 236 genes in the
unicategory. To recover the genes in thewrncategory, we’ll just glue them to theuniand then export that for the training.What about all of the genes in the
altcategory? Those genes are reported to have multiple isoforms. Training from a gene with 10 isoforms would count that gene 10 times, so these are generally skipped. However, as more and more isoforms are found, this will become problematic. You will need some way to figure out which isoform is canonical and delete the rest. I don’t have an automated way to do that as each community has their own standards.Try the training procedures for the C. elegans and D. melanogaster genomes next. Note that these training sets represent just 1% of each chromosome and are just for demonstration purposes.