NSCCN/umitools: a tool for processing and analyzing UMIs (Unique Molecular Identifiers) in single-cell RNA sequencing data

Description

A toolset for handling sequencing data with unique molecular identifiers (UMIs)

Installation

This toolset requires Python 3.

To install umitools, run

pip3 install umitools  # add --user if you want to install it to your own directory

How to process UMI small RNA-seq data

0. (Skip to the next step if you have data.) Download the test data

wget -O clipped.fq.gz "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.sRNA-seq.fq.gz"

1. Identify UMIs:

umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq

How to process UMI RNA-seq data

0. (Skip to the next step if you have data.) Download the test data

wget -O "r1.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r1.fq.gz"
wget -O "r2.fq.gz" "https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.r2.fq.gz"

1. To identify reads with proper UMIs and parse out their UMIs, you can run:

umitools reformat_fastq -l r1.fq.gz -r r2.fq.gz -L r1.fmt.fq.gz -R r2.fmt.fq.gz

It will also print some statistics for your UMI RNA-seq data.

2. Then you can use your favorite RNA-seq aligner (e.g. STAR) to map these reads to the genome and get a BAM/SAM file (e.g., `fmt.bam`).

To download an example, run

wget -O fmt.bam https://github.com/weng-lab/umitools/raw/master/umitools/testdata/umitools.test.RNA-seq.sorted.bam

To mark the reads that are PCR duplicates (assuming you want to use 8 threads), run

umitools mark_duplicates -f fmt.bam -p 8

This produces fmt.deumi.sorted.bam, in which reads identified as PCR duplicates carry the flag 0x400. If your downstream analysis (e.g., Picard) takes this flag into account, you are good to go. Otherwise, you can remove the PCR duplicates:

samtools view -b -h -F 0x400 fmt.deumi.sorted.bam > fmt.deumi.F400.sorted.bam

You can then feed the bam file without PCR duplicates to your downstream analysis.
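To make the 0x400 flag concrete, here is a minimal Python sketch of the bit test that `samtools view -F 0x400` applies to the FLAG field of each alignment. The function names and the example FLAG values are illustrative assumptions and are not part of umitools or samtools.

```python
# Illustrative only: these helpers are not part of umitools or samtools.
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate" (decimal 1024)


def is_marked_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(flags):
    """Keep only FLAG values without the duplicate bit, like `-F 0x400`."""
    return [f for f in flags if not is_marked_duplicate(f)]


if __name__ == "__main__":
    # 99 and 147 are ordinary paired-end FLAG values without the duplicate bit;
    # 1123 is 99 + 1024, i.e. the same kind of read marked as a duplicate.
    print(drop_duplicates([99, 1123, 147]))
```

Because the duplicate information is a single bit, marking duplicates leaves every read in the BAM file, and filtering remains a separate, optional step.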

How UMI locators are handled

For UMI RNA-seq, the UMI locator in each read is required to exactly match GGG, TCA, or ATC. You can customize the locator sequence by setting --umi-locator LOCATOR1,LOCATOR2,LOCATOR3,LOCATOR4 when you run umi_reformat_fastq.

For UMI small RNA-seq, the default setting requires that the 5' UMI locator in each read match NNNCGANNNTACNNN or NNNATCNNNAGTNNN, and that the 3' UMI locator match NNNGTCNNNTAGNNN, where N positions are not required to match and at most 1 error is allowed across all non-N positions. You can customize the locator sequences for small RNA-seq by setting --umi-pattern-5 and --umi-pattern-3. You can further adjust the number of errors allowed by changing N_MISMATCH_ALLOWED_IN_UMI_LOCATOR in the script.
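The small RNA-seq matching rule can be sketched in a few lines of Python. This is an illustrative re-implementation of the rule as described above, not the code that umitools runs. It assumes that the 5' and 3' UMIs occupy the first and last 15 nt of an adapter-clipped read, and that the 1-error budget applies to each locator separately. All function and constant names are hypothetical.

```python
# Illustrative only: a sketch of the matching rule, not umitools' own code.
PATTERNS_5 = ("NNNCGANNNTACNNN", "NNNATCNNNAGTNNN")  # default 5' locators from the text
PATTERN_3 = "NNNGTCNNNTAGNNN"                          # default 3' locator from the text
MAX_MISMATCHES = 1  # assumption: at most 1 error per locator across non-N positions


def count_mismatches(seq: str, pattern: str) -> int:
    """Count positions where seq differs from pattern, ignoring N in the pattern."""
    if len(seq) != len(pattern):
        raise ValueError("sequence and pattern must have the same length")
    return sum(1 for s, p in zip(seq, pattern) if p != "N" and s != p)


def matches_locator(seq: str, pattern: str, max_mismatches: int = MAX_MISMATCHES) -> bool:
    """Return True if seq matches pattern within the allowed number of errors."""
    return count_mismatches(seq, pattern) <= max_mismatches


def has_proper_umis(read: str) -> bool:
    """Check both ends of a clipped read (assumes 15-nt UMIs at each end)."""
    n = len(PATTERN_3)
    if len(read) < 2 * n:
        return False
    five, three = read[:n], read[-n:]
    return any(matches_locator(five, p) for p in PATTERNS_5) and matches_locator(three, PATTERN_3)
```

For example, `matches_locator("AAACGTTTTTACGGG", "NNNCGANNNTACNNN")` is true because only one fixed position differs, whereas a sequence with two differing fixed positions would be rejected.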

Other utilities

umi_simulator

A simple in silico PCR simulator for UMI reads. Run it with -h to see options.

FAQ

Other ways to run umitools?

In addition to being run as subcommands of umitools (e.g., umitools mark_duplicates), these commands can also be called individually.

umitools reformat_fastq is equivalent to umi_reformat_fastq.
umitools mark_duplicates is equivalent to umi_mark_duplicates.
umitools reformat_sra_fastq is equivalent to umi_reformat_sra_fastq.

How to remove 3’ end small RNA-seq adapter

There are many tools for removing adapters; this is just one example. To process a fastq file (raw.fq.gz) from your UMI small RNA-seq data, first remove the 3' end small RNA-seq adapter. For example, you can use fastx_clipper from the FASTX-Toolkit with the adapter sequence TGGAATTCTCGGGTGCCAAGG:

zcat raw.fq.gz | fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 48 -c -Q33 2> raw.clipped.log | gzip -c - > clipped.fq.gz

where -l 48 specifies the minimum read length after adapter removal. This ensures that every insert is at least 18 nt long (18 nt + 15 nt in the 5' UMI + 15 nt in the 3' UMI = 48 nt).
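The length threshold is simple arithmetic, sketched below in Python in case you want to adapt it to a different minimum insert length. The constant and function names are hypothetical; only the three lengths come from the text above.

```python
# Illustrative arithmetic only; the names below are hypothetical.
UMI_5_LEN = 15       # nt in the 5' UMI (from the text)
UMI_3_LEN = 15       # nt in the 3' UMI (from the text)
MIN_INSERT_LEN = 18  # shortest small RNA insert to keep (from the text)


def min_clipped_length(min_insert=MIN_INSERT_LEN, umi5=UMI_5_LEN, umi3=UMI_3_LEN):
    """Minimum read length after adapter removal, i.e. the value passed to -l."""
    return min_insert + umi5 + umi3


def insert_length(clipped_read_len, umi5=UMI_5_LEN, umi3=UMI_3_LEN):
    """Length of the insert that remains once both UMIs are removed."""
    return clipped_read_len - umi5 - umi3


if __name__ == "__main__":
    print(min_clipped_length())
```

With the default values, `min_clipped_length()` gives 48, which is the value passed to -l in the command above.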

Not sure if your libraries have high-quality UMIs at proper positions?

To see which reads have improper UMIs, run

umitools reformat_sra_fastq -i clipped.fq.gz -o sra.umi.fq -d sra.dup.fq --reads-with-improper-umi sra.improper_umi.fq

where sra.umi.fq contains all the non-duplicate reads, sra.dup.fq contains all the duplicates, and sra.improper_umi.fq receives the reads with improper UMIs.
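As a quick sanity check on these output files, you could count the records in each and estimate the fraction of duplicates. The Python sketch below is not part of umitools. It assumes standard four-line FASTQ records, and the helper names are hypothetical.

```python
import gzip


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming the standard 4 lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def duplicate_fraction(n_unique, n_dup):
    """Fraction of duplicates among reads that were kept or marked as duplicates."""
    total = n_unique + n_dup
    return n_dup / total if total else 0.0
```

For example, `duplicate_fraction(count_fastq_records("sra.umi.fq"), count_fastq_records("sra.dup.fq"))` gives the share of duplicates among the reads written to those two files. Reads with improper UMIs are not included in that total.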

Feeling adventurous? You can install the git version

Grab the version on GitHub:

git clone https://github.com/weng-lab/umitools.git

Install it in editable mode:

pip3 install -e /path/to/umitools

Citation

Fu, Y., Wu, P.-H., Beane, T., Zamore, P.D., and Weng, Z. (2018). Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531.

Contact us

Yu Fu (Yu.Fu {at} umassmed.edu)
