LightAssembler is a lightweight-resource assembly algorithm for high-throughput sequencing reads. It uses a pair of cache-oblivious Bloom filters: one holds a uniform sample of g-spaced sequenced kmers, and the other holds kmers classified as likely correct by a simple statistical test. LightAssembler contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to other competing tools.
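To make the sampling idea concrete, here is a minimal Python sketch of inserting g-spaced kmers into a Bloom filter. It is not LightAssembler's implementation: the filter below is a plain (not cache-oblivious) Bloom filter, "g-spaced" is interpreted as taking kmers whose start positions are g bases apart, and the names `BloomFilter` and `sample_spaced_kmers` are hypothetical. The second filter and its statistical test are not shown.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (illustrative, not cache-oblivious)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by double hashing a SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True for every inserted item; may rarely be True for others.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def sample_spaced_kmers(read, k, g):
    """Yield the kmers starting at positions 0, g, 2g, ... of a read.

    Assumption: this is one reading of "g-spaced"; the tool may differ.
    """
    for start in range(0, len(read) - k + 1, g):
        yield read[start:start + k]


# Toy reads and toy filter size, chosen only for illustration.
sampled = BloomFilter(num_bits=1 << 16, num_hashes=3)
for read in ["ACGTACGTTGCA", "TTGCAACGTACG"]:
    for kmer in sample_spaced_kmers(read, k=5, g=3):
        sampled.add(kmer)
```

A Bloom filter never reports an inserted kmer as absent, so `"ACGTA" in sampled` holds here; membership of a kmer that was never inserted is usually, but not always, reported as false.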
Notes
If the gap size parameter (-g) is not given, LightAssembler invokes its parameter extrapolation module to compute the starting gap from the sequencing coverage and the error rate of the dataset.
The maximum read length for this version is 1024 bp.
The maximum number of read files supported by this version is 100.
Read files
LightAssembler assembles multiple input files of sequencing reads given in fasta/fastq format. It can also directly read input files compressed with gzip (fasta.gz/fastq.gz).
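For readers who want to inspect such input files before assembly, the following Python sketch reads sequences from fasta/fastq files, plain or gzip-compressed. It is only an illustration, not how LightAssembler parses its input. It assumes compression is indicated by a `.gz` suffix, that the format can be told from the first character of the file, and that fastq records are unwrapped (four lines each); `open_reads` and `iter_reads` are hypothetical names.

```python
import gzip


def open_reads(path):
    """Open a read file as text, using gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_reads(path):
    """Yield read sequences from a fasta or fastq file."""
    with open_reads(path) as handle:
        first = handle.readline()
        if not first:
            return  # empty file
        if first.startswith("@"):
            # fastq: header, sequence, '+' line, qualities (unwrapped records)
            while first.strip():
                seq = handle.readline().strip()
                handle.readline()  # '+' separator line
                handle.readline()  # quality line
                yield seq
                first = handle.readline()
        elif first.startswith(">"):
            # fasta: a sequence may span several lines
            chunks = []
            for line in handle:
                if line.startswith(">"):
                    yield "".join(chunks)
                    chunks = []
                else:
                    chunks.append(line.strip())
            yield "".join(chunks)
        else:
            raise ValueError("unrecognized read file format: " + path)
```

For example, `max(len(seq) for seq in iter_reads("ecoli_reads_1.fq"))` gives the longest read in a file, which can be compared against the read length limit of this version.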
Outputs
The output of LightAssembler is the set of assembled contigs in fasta format, in the file:
[output prefix].contigs.fasta
LightAssembler also reports the following on the screen:
Number of resulting contigs.
Maximum contig length.
Total assembly size.
Total genome coverage.
Total assembly time, as well as the time for each step.
With the --verbose option, LightAssembler also reports additional details for each step, such as the number of kmers, the false positive rate of the Bloom filters, the number of branching kmers in the dataset, the average read length and the average sequencing coverage.
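Some of the reported figures can be recomputed from the contigs file itself. The sketch below, which is an independent illustration rather than LightAssembler code, counts the contigs and computes the maximum contig length and total assembly size from a fasta file. The function name `contig_stats` is hypothetical, and genome coverage is left out because the source does not say how it is computed.

```python
def contig_stats(fasta_path):
    """Return contig count, maximum length and total size for a fasta file."""
    lengths = []
    current = 0
    seen_header = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous contig, if any.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return {
        "num_contigs": len(lengths),
        "max_contig_length": max(lengths, default=0),
        "total_assembly_size": sum(lengths),
    }
```

With the output prefix used in the examples, the call would be `contig_stats("ecoli_contigs.contigs.fasta")`.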
LightAssembler
More details about LightAssembler can be found in:
El-Metwally, S., Zakaria, M. and Hamza, T.; LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32 (21): 3215-3223. doi: 10.1093/bioinformatics/btw470.
Copyright (C) 2015-2016, and GNU GPL, by Sara El-Metwally, Magdi Zakaria and Taher Hamza.
System requirements
A 64-bit machine with the g++ compiler (or gcc in general) and the pthreads and zlib libraries.
Installation
git clone https://github.com/SaraEl-Metwally/LightAssembler.git
Run make in the repo directory for k <= 31, or make k=kmersize for k > 31, e.g. make k=49.
Quick usage guide
Example 1
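./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose
Example 2 (missing g)
./LightAssembler -k 31 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose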
Example 3 (without --verbose)
./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq
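When LightAssembler is called from a script, the command lines above can be assembled programmatically. The Python sketch below uses only the flags that appear in the examples. The parameter names, and the reading of -k as kmer size, -g as gap size, -e as error rate, -G as genome size, -o as output prefix and -t as thread count, are inferred from those examples and are assumptions; `build_command` and `run_lightassembler` are hypothetical helpers, not part of the tool.

```python
import subprocess


def build_command(reads, k, error_rate, genome_size, output_prefix,
                  threads=1, gap=None, verbose=False,
                  binary="./LightAssembler"):
    """Build an argument list in the same order as the examples."""
    cmd = [binary, "-k", str(k)]
    if gap is not None:
        # Omitting -g leaves the gap choice to LightAssembler itself.
        cmd += ["-g", str(gap)]
    cmd += ["-e", str(error_rate), "-G", str(genome_size),
            "-o", output_prefix, "-t", str(threads)]
    cmd += list(reads)
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_lightassembler(*args, **kwargs):
    """Run the assembler and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

Calling `build_command(["ecoli_reads_1.fq", "ecoli_reads_2.fq"], k=31, error_rate=0.01, genome_size=4686137, output_prefix="ecoli_contigs", threads=3, gap=15, verbose=True)` reproduces the argument list of Example 1. For k > 31 the binary must first be rebuilt with make k=kmersize, as described under Installation.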