NSCCN/LightAssembler: a lightweight assembly tool for high-throughput sequencing reads that uses Bloom filters and a graph traversal strategy for fast, memory-efficient assembly.

LightAssembler

A lightweight, low-resource assembly algorithm for high-throughput sequencing reads. It uses a pair of cache-oblivious Bloom filters: one holds a uniform sample of g-spaced sequenced kmers, and the other holds kmers classified as likely correct by a simple statistical test. LightAssembler includes a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to competing tools. More details about LightAssembler can be found in:

El-Metwally, S., Zakaria, M. and Hamza, T.; LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32 (21): 3215-3223. doi: 10.1093/bioinformatics/btw470.
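
As a rough illustration of the first filter, the sketch below samples kmers from a read at a fixed stride and stores them in a plain Bloom filter. It is not LightAssembler's implementation: the filter here is an ordinary one rather than cache-oblivious, and the names BloomFilter and sample_spaced_kmers, the hashing scheme, and the reading of "g-spaced" as "every g-th start position" are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (hypothetical helper, not the tool's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash function by salting BLAKE2b with the seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for others (false positives).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def sample_spaced_kmers(read, k, g):
    """Return the kmers starting at positions 0, g, 2g, ... (assumed meaning of g-spaced)."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, g)]


# Usage: sample one short read and insert the sampled kmers.
sample_filter = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in sample_spaced_kmers("ACGTACGTAC", k=4, g=3):
    sample_filter.add(kmer)
```

A larger gap stores fewer kmers per read, which is why the gap is tied to sequencing coverage in the options below.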

System requirements

A 64-bit machine with the g++ compiler (or gcc in general), plus the pthreads and zlib libraries.

Installation

Clone the GitHub repo, e.g. with git clone https://github.com/SaraEl-Metwally/LightAssembler.git
Run make in the repo directory for k <= 31, or make k=kmersize for k > 31, e.g. make k=49.

Quick usage guide

./LightAssembler -k [kmer size] -g [gap size] -e [error rate] -G [genome size] -t [threads] -o [output prefix] [input files] --verbose

* [-k] kmer size                [default: 31]
* [-g] gap size                 [default: 25X:3 35X:4 75X:8 140X:15 280X:25]
* [-e] error rate               [default: 0.01]
* [-G] genome size              [default: 0]
* [-t] number of threads        [default: 1]
* [-o] output prefix file name  [default: LightAssembler]
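
For scripted runs, the options above can be assembled into an argument list. The following is a minimal sketch that uses only the flags listed in this guide; the function name build_lightassembler_command and its keyword names are hypothetical, and the 100-file check mirrors the limit stated in the notes.

```python
def build_lightassembler_command(reads, k=31, gap=None, error_rate=0.01,
                                 genome_size=0, threads=1,
                                 prefix="LightAssembler", verbose=False,
                                 binary="./LightAssembler"):
    """Build an argv list from the documented options (hypothetical helper)."""
    reads = list(reads)
    if not reads:
        raise ValueError("at least one read file is required")
    if len(reads) > 100:
        raise ValueError("this version supports at most 100 read files")
    cmd = [binary, "-k", str(k)]
    # Leaving out -g lets the tool choose the starting gap itself.
    if gap is not None:
        cmd += ["-g", str(gap)]
    cmd += ["-e", str(error_rate), "-G", str(genome_size),
            "-t", str(threads), "-o", prefix]
    cmd += reads
    if verbose:
        cmd.append("--verbose")
    return cmd


# Usage: the list can be passed to subprocess.run(cmd, check=True).
cmd = build_lightassembler_command(
    ["ecoli_reads_1.fq", "ecoli_reads_2.fq"],
    gap=15, genome_size=4686137, threads=3,
    prefix="ecoli_contigs", verbose=True,
)
```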

Notes

If the gap size parameter is omitted, LightAssembler invokes its parameter extrapolation module to compute the starting gap from the sequencing coverage and the error rate of the dataset.
The maximum read length for this version is 1024 bp.
The maximum number of read files supported by this version is 100.
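
To make the link between coverage and gap size concrete, here is a small sketch. It assumes coverage is estimated in the usual way, as total sequenced bases divided by genome size, and it picks the documented default gap for the nearest listed coverage. The actual extrapolation module also takes the error rate into account, which is not modeled here; the function names and the nearest-coverage rule are assumptions for illustration only.

```python
# Default gap per coverage level, copied from the option list in this guide.
DEFAULT_GAPS = {25: 3, 35: 4, 75: 8, 140: 15, 280: 25}


def estimate_coverage(num_reads, avg_read_length, genome_size):
    """Standard coverage estimate: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return num_reads * avg_read_length / genome_size


def nearest_default_gap(coverage):
    """Return the documented gap for the closest listed coverage (assumed rule)."""
    nearest = min(DEFAULT_GAPS, key=lambda listed: abs(listed - coverage))
    return DEFAULT_GAPS[nearest]
```

Under this rule a 35X dataset maps to a gap of 4, the same value as the 35X entry in the default list.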

Read files

LightAssembler assembles multiple input files of sequencing reads given in fasta/fastq format. It can also directly read input files compressed with gzip (fasta.gz/fastq.gz).
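
To inspect such input files outside the assembler, a reader along the following lines can be used. This is a sketch, not LightAssembler's parser: the function names are hypothetical, the format is guessed from the first character of the file, and FASTQ records are assumed to span exactly four lines.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_sequences(path):
    """Yield the sequences of a FASTA or FASTQ file, plain or gzipped."""
    with open_text(path) as handle:
        header = handle.readline()
        if not header.strip():
            return  # empty file
        if header.startswith(">"):
            # FASTA: a sequence may span several lines until the next header.
            chunks = []
            for line in handle:
                if line.startswith(">"):
                    yield "".join(chunks)
                    chunks = []
                else:
                    chunks.append(line.strip())
            yield "".join(chunks)
        elif header.startswith("@"):
            # FASTQ: assumed to be header, sequence, '+' line, quality line.
            while header.strip():
                sequence = handle.readline().strip()
                handle.readline()  # '+' separator line
                handle.readline()  # quality line
                yield sequence
                header = handle.readline()
        else:
            raise ValueError("unrecognized read file format: " + path)
```

For example, max(len(s) for s in read_sequences(path)) gives a quick check against the 1024 bp read length limit.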

Outputs

The output of LightAssembler is the set of assembled contigs in fasta format, in the file:

[output prefix].contigs.fasta

LightAssembler also reports the following on the screen:

Number of resulting contigs.
Maximum contig length.
Total assembly size.
Total genome coverage.
Total assembly time, as well as the time for each step.
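
The size-related figures in this list can be recomputed from the contig lengths. The sketch below assumes that genome coverage is the assembly size divided by the genome size given with -G, expressed as a percentage. That reading is an inference from the example logs (4473869 / 4686137 is about 95.4703%), not a statement about the implementation, and the function name is hypothetical.

```python
def summarize_contigs(contig_lengths, genome_size):
    """Recompute the reported contig statistics from a list of contig lengths."""
    contig_lengths = list(contig_lengths)
    assembly_size = sum(contig_lengths)
    coverage = None
    if genome_size > 0:
        # Assumed definition: assembled bases as a percentage of the genome size.
        coverage = 100.0 * assembly_size / genome_size
    return {
        "number of contigs": len(contig_lengths),
        "maximum contig length": max(contig_lengths, default=0),
        "assembly size": assembly_size,
        "genome coverage (%)": coverage,
    }
```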

With the --verbose option, LightAssembler additionally reports details for each step, such as the number of kmers, the false positive rate of the Bloom filters, the number of branching kmers in the dataset, the average read length and the average sequencing coverage.
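
For context on the reported false positive rate, the textbook approximation for a standard Bloom filter is (1 - e^(-hn/m))^h for n inserted items, m bits and h hash functions. The sketch below evaluates it; how LightAssembler sizes its filters and how many hash functions it uses are not stated here, so the parameter names and any values passed in are hypothetical.

```python
import math


def bloom_false_positive_rate(num_items, num_bits, num_hashes):
    """Textbook approximation of a standard Bloom filter's false positive rate."""
    if num_bits <= 0 or num_hashes <= 0:
        raise ValueError("num_bits and num_hashes must be positive")
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Usage with hypothetical sizing: 10 bits per stored kmer and 7 hash functions.
rate = bloom_false_positive_rate(num_items=1_000_000, num_bits=10_000_000, num_hashes=7)
```

The rate grows as more kmers are inserted into a filter of fixed size, so a smaller gap, which samples more kmers, tends to raise it.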

Example 1

./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose

--- Uniform kmers sampling. 

--- h(0):m(0):s(5) elapsed time.
--- total number of kmers in BloomA = 7791111
--- BloomA false positive rate = 0.00193375
--- average read length = 101
--- average sequencing coverage = 35
--- probability of an incorrect kmer appears in the sample : 0.0249524

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(24) elapsed time.
--- total number of kmers in BloomB = 4548112
--- BloomB false positive rate = 7.7715e-05

--- Branching-kmers computation. 

--- h(0):m(0):s(5) elapsed time.
--- number of branching kmers = 54644

--- Graph traversal. 

--- h(0):m(0):s(16) elapsed time.
--- number of contigs     = 731
--- maximum contig length = 120924
--- assembly size         = 4473869
--- genome coverage       = 95.4703%

--- The assembly session is finished. 

--- h(0):m(0):s(31) elapsed time.

Example 2 (without -g)

./LightAssembler -k 31 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose

--- Parameters extrapolation. 

--- h(0):m(0):s(1) elapsed time.
--- start with gap size g = 4
--- average read length = 101
--- average sequencing coverage = 35

--- Uniform kmers sampling. 

--- h(0):m(0):s(8) elapsed time.
--- total number of kmers in BloomA = 27604568
--- BloomA false positive rate = 0.0375047
--- probability of an incorrect kmer appears in the sample : 0.118144

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(9) elapsed time.
--- total number of kmers in BloomB = 4655530
--- BloomB false positive rate = 9.1219e-05

--- Branching-kmers computation. 

--- h(0):m(0):s(2) elapsed time.
--- number of branching kmers = 57242

--- Graph traversal. 

--- h(0):m(0):s(22) elapsed time.
--- number of contigs     = 747
--- maximum contig length = 127975
--- assembly size         = 4474072
--- genome coverage       = 95.4746%

--- The assembly session is finished. 

--- h(0):m(0):s(42) elapsed time.

Example 3 (without --verbose)

./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq

--- Uniform kmers sampling. 

--- h(0):m(0):s(2) elapsed time.

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(11) elapsed time.

--- Branching-kmers computation. 

--- h(0):m(0):s(1) elapsed time.

--- Graph traversal. 

--- h(0):m(0):s(17) elapsed time.
--- number of contigs     = 731
--- maximum contig length = 120924
--- assembly size         = 4473869
--- genome coverage       = 95.4703%

--- The assembly session is finished. 

--- h(0):m(0):s(31) elapsed time.