This is especially useful for single-cell datasets. RNA-Bloom was tested on Smart-seq2 and SMARTer datasets. It is not supported for long-read data (-long) at this time.
file format for the -pool option:
This is a tabular file that describes the read file paths for all cells/samples to be used in pooled assembly.
Column header is on the first line, leading with #
Columns are separated by space/tab characters
Each sample can have more than one line; lines sharing the same name will be grouped together during assembly
| column | description |
| --- | --- |
| name | sample name |
| left | path to one left read file |
| right | path to one right read file |
| sef | path to one single-end forward read file |
| ser | path to one single-end reverse read file |
(i) paired-end reads only:
Only name, left, and right columns are specified for a total of 3 columns. The legacy header-less tri-column format is still supported.
#name left right
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq
(ii) paired and unpaired reads:
In addition to name, left, and right columns, either sef, ser or both are specified for a total of 4 or 5 columns.
#name left right sef ser
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq /path/to/cell1/sef.fastq /path/to/cell1/ser.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq /path/to/cell2/sef.fastq /path/to/cell2/ser.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq /path/to/cell3/sef.fastq /path/to/cell3/ser.fastq
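For many cells, the pool file can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical layout with one sub-directory per cell, each containing `left.fastq` and `right.fastq`; the function name `make_pool_file` and the directory layout are illustrative assumptions, not part of RNA-Bloom.

```shell
# Hypothetical helper: writes a paired-end pool file (format (i) above) to stdout.
# Assumed layout: READS_DIR/<cell name>/left.fastq and READS_DIR/<cell name>/right.fastq
make_pool_file() {
  reads_dir="$1"
  printf '#name left right\n'
  for cell_dir in "$reads_dir"/*/; do
    cell_dir="${cell_dir%/}"
    left="$cell_dir/left.fastq"
    right="$cell_dir/right.fastq"
    # Skip directories that do not hold a complete read pair.
    if [ -f "$left" ] && [ -f "$right" ]; then
      printf '%s %s %s\n' "$(basename "$cell_dir")" "$left" "$right"
    fi
  done
}
```

Running `make_pool_file /path/to/cells > pool_reads.txt` would produce a file that can be passed to the -pool option. Because columns are separated by whitespace, this sketch only works for paths and cell names without spaces.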
final output files per cell:
| file name | description |
| --- | --- |
| rnabloom.transcripts.fa | assembled transcripts longer than length threshold (default: 200) |
| rnabloom.transcripts.short.fa | assembled transcripts shorter than length threshold |
| rnabloom.transcripts.nr.fa | assembled transcripts with redundancy reduced |
(C) strand-specific assembly:
java -jar RNA-Bloom.jar -stranded ...
The -stranded option indicates that input reads are strand-specific.
Strand-specific reads are typically in the F2R1 orientation, where /2 denotes left reads in forward orientation and /1 denotes right reads in reverse orientation.
Configure the read file paths accordingly for bulk RNA-seq data and indicate read orientation:
-stranded -left /path/to/reads_2.fastq -right /path/to/reads_1.fastq -revcomp-right
(D) reference-guided assembly:
The -ref option specifies the reference transcriptome FASTA file for guiding short-read assembly. It is not supported for long-read data (-long) at this time.
Quick Start for Long Reads
It is strongly recommended to trim adapters in your reads before assembly. For example, see Porechop for more information.
Input reads must not have purely integer IDs (e.g. 1, 2, 3), which could be in conflict with RNA-Bloom’s sequence IDs. Please rename your read IDs (with seqtk rename) if necessary.
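To check whether a read file is affected, the read IDs can be scanned before assembly. This is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; the function name `count_integer_ids` is hypothetical.

```shell
# Hypothetical helper: counts reads whose ID consists only of digits.
# Assumes an uncompressed, 4-line-per-record FASTQ file.
count_integer_ids() {
  awk 'NR % 4 == 1 {
         id = substr($1, 2)            # drop the leading "@"
         if (id ~ /^[0-9]+$/) n++      # purely integer ID
       }
       END { print n + 0 }' "$1"
}
```

A non-zero count suggests the reads should be renamed (for example with seqtk rename, as noted above) before running RNA-Bloom.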
Note that -long, -sef, and -ser can each accept multiple file paths separated by whitespace.
(A) assemble long-read cDNA sequencing data:
Default presets for -long are intended for ONT data. Please add the -lrpb flag for PacBio data.
Ka Ming Nip, Saber Hafezqorani, Kristina K. Gagalova, Readman Chiu, Chen Yang, René L. Warren, and Inanc Birol. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nature Communications. 2023 May 22;14(1):2940. doi: 10.1038/s41467-023-38553-y
Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, and Inanc Birol. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Research. 2020 Aug;30(8):1191-1200. doi: 10.1101/gr.260174.119. Epub 2020 Aug 17.
RNA-Bloom is a fast and memory-efficient de novo transcript sequence assembler. It is designed for the following sequencing data types:
Written by Ka Ming Nip
Dependency 📌
Java SE Development Kit (JDK) 11 (JDK 17 is slightly faster)
External software used:
PATH!
Installation 🔧
RNA-Bloom can be installed in two ways:
(A) install with conda or mamba:
All dependent software (listed above) will be installed. RNA-Bloom can be run as rnabloom ...
(B) download from GitHub:
Download rnabloom_vX.X.X.tar.gz from the releases section. RNA-Bloom can be run as java -jar /path/to/RNA-Bloom.jar ...
Quick Start for Short Reads
-left, -right, -sef, and -ser can accept multiple file paths separated by the whitespace character.
(A) assemble bulk RNA-seq data:
paired-end reads only
If left reads are sense and right reads are antisense, use -revcomp-right to reverse-complement right reads
If left reads are antisense and right reads are sense, use -revcomp-left to reverse-complement left reads
-revcomp-right or -revcomp-left
single-end reads only
Use -sef for forward reads and -ser for reverse reads
paired-end and single-end reads
final output files:
rnabloom.transcripts.fa
rnabloom.transcripts.short.fa
rnabloom.transcripts.nr.fa
(B) assemble multi-sample RNA-seq data with pooled assembly mode:
This is especially useful for single-cell datasets. RNA-Bloom was tested on Smart-seq2 and SMARTer datasets. It is not supported for long-read data (-long) at this time.
file format for the -pool option:
This is a tabular file that describes the read file paths for all cells/samples to be used in pooled assembly.
Column header is on the first line, leading with #
Columns are separated by space/tab characters
Each sample can have more than one line; lines sharing the same name will be grouped together during assembly
columns: name, left, right, sef, ser
(i) paired-end reads only:
Only name, left, and right columns are specified for a total of 3 columns. The legacy header-less tri-column format is still supported.
(ii) paired and unpaired reads:
In addition to name, left, and right columns, either sef, ser or both are specified for a total of 4 or 5 columns.
final output files per cell:
rnabloom.transcripts.fa
rnabloom.transcripts.short.fa
rnabloom.transcripts.nr.fa
(C) strand-specific assembly:
The -stranded option indicates that input reads are strand-specific.
Strand-specific reads are typically in the F2R1 orientation, where /2 denotes left reads in forward orientation and /1 denotes right reads in reverse orientation.
Configure the read file paths accordingly for bulk RNA-seq data and indicate read orientation:
-stranded -left /path/to/reads_2.fastq -right /path/to/reads_1.fastq -revcomp-right
and for scRNA-seq data:
(D) reference-guided assembly:
The -ref option specifies the reference transcriptome FASTA file for guiding short-read assembly. It is not supported for long-read data (-long) at this time.
Quick Start for Long Reads
Input reads must not have purely integer IDs (e.g. 1, 2, 3), which could be in conflict with RNA-Bloom’s sequence IDs. Please rename your read IDs (with seqtk rename) if necessary.
-long, -sef, and -ser can accept multiple file paths separated by the whitespace character.
(A) assemble long-read cDNA sequencing data:
Default presets for -long are intended for ONT data. Please add the -lrpb flag for PacBio data.
Input reads are expected to be in a mix of both forward and reverse orientations.
Options -pool and -ref are not supported for long-read data at this time.
(B) assemble nanopore direct RNA sequencing data:
Input reads are expected to be only in the forward orientation.
By default, uracil (U) is written as T. Use the -uracil option to write U instead of T in the output assembly.
ntCard v1.2.1 supports uracil in reads.
(C) assemble long-read sequencing data with short-read polishing:
cDNA data:
direct RNA data:
final output files:
rnabloom.transcripts.fa
rnabloom.transcripts.short.fa
General Settings ⚙️
(A) set Bloom filter sizes automatically:
If ntcard is found in your PATH, then the -ntcard option is automatically turned on to count the number of unique k-mers in your reads.
This sets the size of Bloom filters automatically to accommodate a false positive rate (FPR) of ~1%.
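Whether ntcard is visible on your PATH can be checked from the shell before starting an assembly. This is a small sketch using the standard `command -v` builtin; the helper name `have_tool` is hypothetical.

```shell
# Hypothetical helper: succeeds if the named executable can be found on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

if have_tool ntcard; then
  echo "ntcard found on PATH"
else
  echo "ntcard not found on PATH"
fi
```

The same check applies to any other external program that has to be reachable from PATH.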
Alternatively, you can specify the exact number of unique k-mers:
This sets the size of Bloom filters automatically to accommodate 28,077,715 unique k-mers for a FPR of ~1%.
As a rule of thumb, a lower FPR may result in a better assembly but requires more memory for a larger Bloom filter.
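The trade-off between FPR and memory can be illustrated with the textbook formula for a single Bloom filter using an optimal number of hash functions, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. This is only a rough sketch: RNA-Bloom uses several Bloom filters, its exact sizing is not described here, and the function name `bloom_bits` is hypothetical.

```shell
# Hypothetical helper: textbook size (in bits) of one Bloom filter holding
# n items at false positive rate p, assuming an optimal number of hash functions.
# This is NOT RNA-Bloom's own sizing logic.
bloom_bits() {
  awk -v n="$1" -v p="$2" 'BEGIN {
    bits = -n * log(p) / (log(2) ^ 2)
    printf "%.0f\n", bits
  }'
}

bloom_bits 28077715 0.01    # FPR of 1%
bloom_bits 28077715 0.001   # a lower FPR needs a larger filter
```

For 28,077,715 unique k-mers at an FPR of 1%, the formula gives on the order of 270 million bits, i.e. roughly 34 MB for one such filter.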
(B) set the total size of Bloom filters:
This sets the total size to 10 GB. If none of -nk, -ntcard, or -mem is used, then the total size is configured based on the size of input read files.
(C) stop at an intermediate stage:
This is a very useful option if you only want to assemble fragments or correct long reads (i.e. with -stage 2)!
(D) list all available options in RNA-Bloom:
(E) limit the size of Java heap:
or if you installed with conda:
This limits the maximum Java heap to 2 GB with the -Xmx option. Note that java options have no effect on Bloom filter sizes.
See documentation for other JVM options.
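Because the Java heap limit and the Bloom filter size are set independently, it can help to see both on one command line. The sketch below only prints a command instead of running it; the helper `rnabloom_cmd` and the `RNABLOOM_JAR` variable are hypothetical, and `-mem 10` is assumed to be the form of the 10 GB total-size setting described above.

```shell
# Hypothetical helper: prints (does not run) an RNA-Bloom command line.
# RNABLOOM_JAR is an assumed variable used by this sketch only.
rnabloom_cmd() {
  heap="$1"   # JVM heap limit, e.g. 2g
  shift
  printf 'java -Xmx%s -jar %s' "$heap" "${RNABLOOM_JAR:-/path/to/RNA-Bloom.jar}"
  # Remaining arguments are passed through to RNA-Bloom unchanged.
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

rnabloom_cmd 2g -mem 10
```

Here `-Xmx2g` is a JVM option that caps the heap at 2 GB, while the arguments after the jar path are RNA-Bloom options.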
Implementation 📝
RNA-Bloom is written in Java with Apache NetBeans IDE. It uses the following libraries:
Citing RNA-Bloom
If you use RNA-Bloom in your work, please cite our manuscript(s).
Long-read RNA-seq assembly:
Short-read RNA-seq assembly: