scTagger

scTagger matches the cellular barcodes of short- and long-reads from single-cell RNA-seq experiments, so that gene expression (from the short-reads) and RNA splicing (from the long-reads) can be related at the cell level.

Installation

Conda

scTagger is available as a Conda package:

conda create -n sctagger-env -c bioconda sctagger 
conda activate sctagger-env
scTagger.py -h

Running with Snakemake

We provide a simple Snakefile and an accompanying config.yaml file that run the three stages of scTagger as well as Cell Ranger (this assumes Cell Ranger is on your PATH).

Running manually

scTagger is a single Python script with several subcommands for matching long-read and short-read barcodes.

The pipeline consists of three steps, each of which can be run separately:

1) Extract long-reads segment

The first step of the scTagger pipeline is to extract, from each long-read, the segment where the barcode is most likely to be found. To run this step, use the following command:

./scTagger.py extract_lr_bc -r "path/to/long/read/fastq" -o "path/to/output/file" -p "path/to/output/plots"

Arguments

-r: Space-separated paths to reads in FASTQ format
-g: Space-separated list of ranges in which the SR adapter should be found on the LRs (Optional, Default: Detect from data)
-z: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
-t: Number of threads (Optional, Default: 1)
-sa: Short-read adapter (Optional, Default: CTACACGACGCTCTTCCGATCT)
--num-bp-after: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
-o: Path to output file
-p: Path to plot file (Optional, Default: No plotting)

Inputs

A list of FASTQ files of long-reads

Outputs

A TSV file:
- First column is the read id
- Second column is the best edit distance to the short-read adapter
- Third column is the position on the long-read where the match with the adapter starts
- Fourth column is the extracted long-read segment
A plot of optimal alignment locations of the short read adapter to the long-reads.
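
To make the column layout concrete, here is a minimal Python sketch that parses this TSV. It is an illustration only and not part of scTagger: the names LongReadSegment, read_segments and open_text are hypothetical, and the sketch assumes that the file has no header line, that the second and third columns always hold integers, and that a gzipped file can be recognised by a ".gz" suffix.

```python
import gzip
import io
from typing import Iterable, Iterator, NamedTuple


class LongReadSegment(NamedTuple):
    """One row of the extract_lr_bc output (hypothetical container)."""
    read_id: str
    adapter_edit_distance: int
    adapter_start: int
    segment: str


def read_segments(lines: Iterable[str]) -> Iterator[LongReadSegment]:
    """Yield one record per four-column, tab-separated line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        read_id, dist, start, segment = fields
        # Assumption: columns two and three are plain integers.
        yield LongReadSegment(read_id, int(dist), int(start), segment)


def open_text(path: str):
    """Open a plain or gzipped text file, chosen by file extension."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


# Toy input standing in for a real output file.
toy = io.StringIO("read_1\t1\t42\tACGTACGTACGTACGTAAAA\n")
for record in read_segments(toy):
    print(record.read_id, record.adapter_edit_distance, record.segment)
```

For a real file, pass the handle returned by open_text("path/to/output/file") to read_segments instead of the toy input.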

2) Extract short-reads barcodes

The second step is to extract the most frequent short-read barcodes, which together cover most of the reads.

./scTagger.py extract_sr_bc -i "path/to/bam/file" -o "path/to/output/file" -p "path/to/output/plot"

Arguments

-i: Input BAM file
-o: Path to output file.
-p: Path to plot file (Optional, Default: No plotting)
--thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
--step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
--max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)

Input

A BAM file of short-read data

Output

A TSV file
- First column is barcodes
- Second column is the number of appearances of the barcode
A cumulative plot of SR coverage with batches of 1,000 barcodes
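
The interplay of --thresh, --step-size and --max-barcode-cnt can be sketched in Python as follows. This is one reading of the argument descriptions above, not the actual scTagger code: the function name select_top_barcodes is hypothetical, and the sketch assumes that the threshold is compared against the fraction of all reads covered by each batch of barcodes.

```python
from typing import Dict, List, Tuple


def select_top_barcodes(
    counts: Dict[str, int],
    thresh: float = 0.005,
    step_size: int = 1000,
    max_barcode_cnt: int = 25000,
) -> List[Tuple[str, int]]:
    """Keep the most frequent barcodes, batch by batch (illustrative only)."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Most frequent barcodes first.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    selected: List[Tuple[str, int]] = []
    for start in range(0, min(len(ranked), max_barcode_cnt), step_size):
        batch = ranked[start:start + step_size]
        # Assumption: a batch is kept only if it covers at least `thresh`
        # of all reads; the first batch below that ends the selection.
        if sum(count for _, count in batch) / total < thresh:
            break
        selected.extend(batch)
    return selected[:max_barcode_cnt]


# Toy counts with a tiny step size so the stopping rule is visible.
toy_counts = {"AAAA": 50, "CCCC": 40, "GGGG": 7, "TTTT": 2, "ACGT": 1}
for barcode, count in select_top_barcodes(toy_counts, thresh=0.05, step_size=2):
    print(barcode, count, sep="\t")
```

In the toy run, the third batch covers only 1 of 100 reads, which is below the 0.05 threshold, so the selection stops after the first four barcodes.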

Alt. 2) Extract short-reads barcodes directly from long-reads

This is an alternative to the second step that avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes from the long-reads directly. This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments. The barcodes are sorted by frequency, and the most frequent barcodes are kept using the same strategy as the extract_sr_bc module.

./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"

Arguments

-i: Input TSV file containing the long-read segments generated by the extract_lr_bc step
-o: Path to output file.
-wl: Path to 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz and .txt files.
--thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
--step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
--max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)

Input

The output file of the extract_lr_bc step
10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)

Output

A TSV file
- First column is barcodes
- Second column is the number of appearances of the barcode
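
The exact-match counting described for this step can be sketched as below. This is a simplified illustration, not the scTagger implementation: the function name count_exact_whitelist_hits is hypothetical, each segment is assumed to contribute at most one count per distinct barcode, and read orientation (reverse complements) is ignored.

```python
from collections import Counter
from typing import Iterable, Set


def count_exact_whitelist_hits(
    segments: Iterable[str],
    whitelist: Set[str],
    barcode_len: int = 16,
) -> Counter:
    """Count segments that contain each whitelist barcode exactly."""
    counts: Counter = Counter()
    for segment in segments:
        hits = set()
        # Slide a barcode-sized window over the segment and look it up.
        for i in range(len(segment) - barcode_len + 1):
            window = segment[i:i + barcode_len]
            if window in whitelist:
                hits.add(window)
        # Assumption: one count per segment and barcode, even if repeated.
        counts.update(hits)
    return counts


# Toy data with 4-base "barcodes" to keep the example readable.
toy_whitelist = {"ACGT", "TTTT"}
toy_segments = ["GGACGTGG", "ACGTACGT", "CCCCCCCC", "ATTTTA"]
toy_hits = count_exact_whitelist_hits(toy_segments, toy_whitelist, barcode_len=4)
for barcode, count in toy_hits.most_common():
    print(barcode, count, sep="\t")
```

The resulting counts, sorted by frequency, would then go through the same batch-wise selection that extract_sr_bc applies to the short-read barcode counts.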

3) Match long-reads segment with short-reads barcodes

The last step is to match the long-read segments with the barcodes selected from the short-reads.

./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"

Arguments

-lr: Long-read segments TSV file
-sr: Short-read barcode list TSV file
-mr: Maximum number of errors allowed for barcode matching (Optional, Default: 2)
-m: Maximum number of GB of RAM to be used (Optional, Default: 16.0)
-bl: Length of barcodes (Optional, Default: 16)
-t: Number of threads to use for searching (Optional, Default: 16)
-p: Path of plot file
-o: Path to output file. Output file is gzipped

Inputs

The long-read segments TSV file from step 1 and the barcode list TSV file from step 2 (or its long-read alternative)

Outputs

A TSV file
- First column is the read id
- Second column is the minimum edit distance
- Third column is the number of short-read barcodes that match the long-read
- Fourth column is the long-read segment
- Fifth column is a list of all short-read barcodes with the minimum edit distance
A bar plot that shows the number of long-reads by the minimum edit distance of their matched barcode
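
To illustrate what the output columns mean, the sketch below matches one long-read segment against a barcode list by brute force. It is deliberately not the trie-based search that match_trie performs; it only reproduces the idea of reporting the minimum edit distance and every barcode that reaches it. The function names and the toy 8-base barcodes are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def min_edit_distance_in_segment(barcode: str, segment: str) -> int:
    """Smallest edit distance between barcode and any substring of segment."""
    # Row for the empty barcode: a match may start anywhere in the segment.
    prev = [0] * (len(segment) + 1)
    for i, b in enumerate(barcode, start=1):
        curr = [i] + [0] * len(segment)
        for j, s in enumerate(segment, start=1):
            curr[j] = min(
                prev[j - 1] + (b != s),  # match or substitution
                prev[j] + 1,             # barcode base missing from segment
                curr[j - 1] + 1,         # extra base in the segment
            )
        prev = curr
    # A match may also end anywhere in the segment.
    return min(prev)


def match_segment(
    segment: str, barcodes: Sequence[str], max_error: int = 2
) -> Tuple[Optional[int], List[str]]:
    """Return the minimum edit distance and all barcodes that achieve it."""
    best = max_error + 1
    matches: List[str] = []
    for barcode in barcodes:
        dist = min_edit_distance_in_segment(barcode, segment)
        if dist < best:
            best, matches = dist, [barcode]
        elif dist == best and dist <= max_error:
            matches.append(barcode)
    if best > max_error:
        return None, []  # no barcode within the allowed number of errors
    return best, matches


toy_barcodes = ["ACGTACGT", "TTTTCCCC"]
print(match_segment("GGACGTACGTGG", toy_barcodes))
print(match_segment("GGACGAACGTGG", toy_barcodes))
```

A brute-force scan like this compares every segment with every barcode, which is why a dedicated index such as a trie is preferable for real data sets.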

Citing scTagger

scTagger was first accepted to RECOMB-seq 2022 and is now published in iScience:

Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, and Faraz Hach. “Fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments.” iScience (2022). DOI:10.1016/j.isci.2022.104530

Please check the paper branch of this repository for the archived paper experiments and implementation.