***
A sensitive aligner for 454, Illumina and PacBio data, employing a full Smith-Waterman alignment against a reference.
This java command line application aligns Next-Generation Sequencing (NGS) and third-generation reads
to a set of reference sequences, by a prior fast k-mer matching and removes indels, causing
frame shifts. In addition, only a specific region can be considered.
An iterative refinement of the alignment can be performed, by alignment against the consensus sequence with wobbles.
The output is in SAM format.
Features
Fully multithreaded
Performes a full Smith-Waterman alignment
Multiple sets of affine gap costs can be used to find optimal alignment for each read
Paired-end reads are properly paired with SAM Flags
Accepts multiple reference genomes with wobbles
Iterative alignment against a consensus with wobbles to increase alignment quality
With parameter -sensitive, multiple affine gap costs are tested for each read and the best alignment is kept.
Affine GAP costs
Gap costs for the used Smith-Waterman can be set with
-gop 3 (gap open)
-gex 1 (gap extend)
Predefined: 10 open & 3 extend. Tested with with PacBio, Illumina and 454 data on HIV, HCV and HBV data.
Iterative refinement
The alignment can be improved by aligning against the consensus sequence. The parameter -refine INT takes a positive number as input and activates the iterative refinement. Only works if the alignment is against one reference genome.
Remove conserved deletions:
During the iterative alignment, conserved deletions can be removed with -rmDel.
Remove frame-shift causing deletions
With parameter -fix, frame-shift causing deletions are replaced with the consensus sequence.
Line breaks
In the case that a single fastq entry is longer than four lines, which is caused by line breaks in the sequence and quality string, use -flat.
Extract region:
In addition, only a specific region can be extracted with -r begin-end, for example a certain gene:
java -jar InDelFixer.jar -i libCase102.sff -g referenceGenomes.fasta -r 342-944
FILTER
-l INT : Minimal read-length prior alignment (default 0)
-la INT : Minimal read-length after alignment (default 0)
-ins DOUBLE : The maximum percentage of insertions allowed [range 0.0 - 1.0] (default 1.0)
-del DOUBLE : The maximum percentage of deletions allowed [range 0.0 - 1.0] (default 1.0)
-sub DOUBLE : The maximum percentage of substitutions allowed [range 0.0 - 1.0] (default 1.0)
-maxDel INT : The maximum number of consecutive deletions allowed (default no filtering)
Help:
Further help can be shown by running without additional parameters:
java -jar InDelFixer.jar
BAM output:
In order to convert the reads.sam into the BAM format, please install samtools and run:
InDelFixer
Dr. Armin Töpfer, armintoepfer.com
*** A sensitive aligner for 454, Illumina and PacBio data, employing a full Smith-Waterman alignment against a reference.This java command line application aligns Next-Generation Sequencing (NGS) and third-generation reads to a set of reference sequences, by a prior fast k-mer matching and removes indels, causing frame shifts. In addition, only a specific region can be considered.
An iterative refinement of the alignment can be performed, by alignment against the consensus sequence with wobbles.
The output is in SAM format.
Features
PREREQUISITES TO RUN:
RUN:
454/Roche:
java -jar InDelFixer.jar -i libCase102.sff -g referenceGenomes.fastaBut I encourage to convert the sff to fastq with
sff2fastq input.sff -o input.fastqsff2fastq can be installed with:Fasta / PacBio ccs:
java -jar InDelFixer.jar -i libCase102.fasta -g referenceGenomes.fastaFor PacBio input, please use
-noHashingsince the PacBio error rate is too high for a reliable kmer-matching.Illumina paired end:
java -jar InDelFixer.jar -i libCase102_R1.fastq -ir libCase102_R2.fastq -g referenceGenomes.fastaHigh quality alignment
With parameter
-sensitive, multiple affine gap costs are tested for each read and the best alignment is kept.Affine GAP costs
Gap costs for the used Smith-Waterman can be set with
Predefined: 10 open & 3 extend. Tested with with PacBio, Illumina and 454 data on HIV, HCV and HBV data.
Iterative refinement
The alignment can be improved by aligning against the consensus sequence. The parameter
-refine INTtakes a positive number as input and activates the iterative refinement. Only works if the alignment is against one reference genome.Remove conserved deletions:
During the iterative alignment, conserved deletions can be removed with
-rmDel.Remove frame-shift causing deletions
With parameter
-fix, frame-shift causing deletions are replaced with the consensus sequence.Line breaks
In the case that a single fastq entry is longer than four lines, which is caused by line breaks in the sequence and quality string, use
-flat.Extract region:
In addition, only a specific region can be extracted with
-r begin-end, for example a certain gene:java -jar InDelFixer.jar -i libCase102.sff -g referenceGenomes.fasta -r 342-944FILTER
Help:
Further help can be shown by running without additional parameters:
java -jar InDelFixer.jarBAM output:
In order to convert the
reads.saminto the BAM format, please install samtools and run:COMPILE (only for dev):
Install Maven 3
CONTACT:
Contributions
Armin Töpfer
David Seifert
Alexander Artyomenko
LICENSE:
GNU GPLv3 http://www.gnu.org/licenses/gpl-3.0