To mark the reads with PCR duplicates (and assuming you want to use 8 threads), run
umitools mark_duplicates -f fmt.bam -p 8
And it will produce fmt.deumi.sorted.bam in which reads that are identified as PCR duplicates will have the flag 0x400. If your downstream analysis (e.g., Picard) can take into consideration this flag, then you are good to go! Otherwise, you can just eliminate PCR duplicates:
You can then feed the bam file without PCR duplicates to your downstream analysis.
How UMI locators are handled
For UMI RNA-seq, the UMI locator in each read is required to exactly match GGG, TCA, or ATC. You can customize the locator sequence by setting --umi-locator LOCATOR1,LOCATOR2,LOCATOR3,LOCATOR4 when you run umi_reformat_fastq.
For UMI small RNA-seq, the default setting requires that the 5' UMI locator in each read should match NNNCGANNNTACNNN or NNNATCNNNAGTNNN, AND 3' UMI locator should match NNNGTCNNNTAGNNN where N’s are not required to match and there is at most 1 error across all non-N positions. You can customized the locator sequence for small RNA-seq by setting --umi-pattern-5 and --umi-pattern-3. You can further tweak the number of errors allowed by changing N_MISMATCH_ALLOWED_IN_UMI_LOCATOR in the script.
Other utilities
umi_simulator
A simple in silico PCR simulator for UMI reads. Run it with -h to see options.
FAQ
Other ways to run umitools?
In addition to providing subcommands to umitools (e.g., umitools mark_duplicates), these commands can also be called individually.
umitools reformat_fastq is equivalent to umi_reformat_fastq.
umitools mark_duplicates is equivalent to umi_mark_duplicates.
umitools reformat_sra_fastq is equivalent to umi_reformat_sra_fastq.
How to remove 3’ end small RNA-seq adapter
There are many tools to remove adapters. This is just one example. To process a fastq (raw.fq.gz) file from your UMI small RNA-seq data, you can first remove the 3’ end small RNA-seq adapter. For example, you can use fastx_clipper from the FASTX-Toolkit and the adapter sequence is TGGAATTCTCGGGTGCCAAGG:
where -l 48 specified the minimum length of the reads after the adapter removal, since I want to make sure all reads are at least 18 nt (18 nt + 15 nt in the 5’ UMI + 15 nt in the 3’ UMI).
Not sure if your libraries have high-quality UMIs at proper positions?
Fu, Y., Wu, P.-H., Beane, T., Zamore, P.D., and Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531.
Description
A toolset for handling sequencing data with unique molecular identifiers (UMIs)
Installation
This tools set requires Python 3.
To install
umitools, runHow to process UMI small RNA-seq data
0. (Skip to the next step if you have data.) Download the test data
1. Identify UMIs:
How to process UMI RNA-seq data
0. (Skip to the next step if you have data.) Download the test data
1. To identify reads with proper UMIs and parse out their UMIs, you can run:
And it will output some stats for your UMI RNA-seq data.
2. Then you can use your favorite RNA-seq aligner (e.g. STAR) to map these reads to the genome and get a BAM/SAM file (e.g.,
fmt.bam).To download an example, run
To mark the reads with PCR duplicates (and assuming you want to use 8 threads), run
And it will produce
fmt.deumi.sorted.bamin which reads that are identified as PCR duplicates will have the flag0x400. If your downstream analysis (e.g., Picard) can take into consideration this flag, then you are good to go! Otherwise, you can just eliminate PCR duplicates:You can then feed the bam file without PCR duplicates to your downstream analysis.
How UMI locators are handled
For UMI RNA-seq, the UMI locator in each read is required to exactly match GGG, TCA, or ATC. You can customize the locator sequence by setting
--umi-locator LOCATOR1,LOCATOR2,LOCATOR3,LOCATOR4when you runumi_reformat_fastq.For UMI small RNA-seq, the default setting requires that the 5' UMI locator in each read should match
NNNCGANNNTACNNNorNNNATCNNNAGTNNN, AND 3' UMI locator should matchNNNGTCNNNTAGNNNwhere N’s are not required to match and there is at most 1 error across all non-N positions. You can customized the locator sequence for small RNA-seq by setting--umi-pattern-5and--umi-pattern-3. You can further tweak the number of errors allowed by changingN_MISMATCH_ALLOWED_IN_UMI_LOCATORin the script.Other utilities
umi_simulator
A simple in silico PCR simulator for UMI reads. Run it with
-hto see options.FAQ
Other ways to run umitools?
In addition to providing subcommands to
umitools(e.g.,umitools mark_duplicates), these commands can also be called individually.umitools reformat_fastqis equivalent toumi_reformat_fastq.umitools mark_duplicatesis equivalent toumi_mark_duplicates.umitools reformat_sra_fastqis equivalent toumi_reformat_sra_fastq.How to remove 3’ end small RNA-seq adapter
There are many tools to remove adapters. This is just one example. To process a fastq (
raw.fq.gz) file from your UMI small RNA-seq data, you can first remove the 3’ end small RNA-seq adapter. For example, you can usefastx_clipperfrom the FASTX-Toolkit and the adapter sequence isTGGAATTCTCGGGTGCCAAGG:where
-l 48specified the minimum length of the reads after the adapter removal, since I want to make sure all reads are at least 18 nt (18 nt + 15 nt in the 5’ UMI + 15 nt in the 3’ UMI).Not sure if your libraries have high-quality UMIs at proper positions?
To see which reads have improper UMIs, run
where
sra.umi.fqcontains all the non-duplicate reads andsra.dup.fqcontains all duplicates.Feeling adventurous? You can install the git version
Citation
Fu, Y., Wu, P.-H., Beane, T., Zamore, P.D., and Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531.
Contact us
Yu Fu (Yu.Fu {at} umassmed.edu)