scTagger
scTagger matches the cellular barcodes of short- and long-reads from single-cell RNA-seq experiments, so that gene expression (from the short-reads) and RNA splicing (from the long-reads) can be related at the cell level.
Installation
Conda
scTagger is available as a Conda package:
Running with Snakemake
We provide a simple Snakefile alongside a config.yaml file that runs the three stages of scTagger as well as Cell Ranger (this assumes Cell Ranger is in your PATH).
Running manually
scTagger consists of a single Python script containing different functions to match long-read and short-read barcodes.
The whole pipeline consists of three steps, each of which can be run separately:
1) Extract long-reads segment
The first step of the scTagger pipeline extracts, from each long-read, the segment in which the barcode is most likely to be found. This is the extract_lr_bc step; its arguments are listed below.
Arguments
-r: Space-separated paths to reads in FASTQ
-g: Space-separated list of the ranges where the SR adapter should be found on the LRs (Optional, Default: Detect from data)
-z: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
-t: Number of threads (Optional, Default: 1)
-sa: Short-read adapter (Optional, Default: CTACACGACGCTCTTCCGATCT)
--num-bp-afte: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
-o: Path to output file
-p: Path to plot file (Optional, Default: No plotting)
Inputs
Long-reads in FASTQ format, optionally gzipped
Outputs
A TSV file containing the long-read segments
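As a rough illustration of this step, the sketch below locates the short-read adapter in a long-read and returns the bases that follow it. It is a simplification and not scTagger's implementation: it uses a mismatch-only sliding window rather than a proper alignment, and it ignores the reverse strand and the -g ranges. The names extract_segment, hamming, num_bp_after and max_mismatches are hypothetical.

```python
# Hypothetical sketch: the names and the mismatch-only search are assumptions,
# not scTagger's actual code.
SR_ADAPTER = "CTACACGACGCTCTTCCGATCT"  # default short-read adapter listed above


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_segment(read, adapter=SR_ADAPTER, num_bp_after=20, max_mismatches=2):
    """Return the bases following the best adapter hit, or None if there is no hit.

    The read is scanned with a sliding window; the window with the fewest
    mismatches to the adapter wins (leftmost window on ties).
    """
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(read) - len(adapter) + 1):
        dist = hamming(read[pos:pos + len(adapter)], adapter)
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    if best_pos is None:
        return None
    start = best_pos + len(adapter)
    return read[start:start + num_bp_after]


barcode = "ACGTACGTACGTACGT"
read = "TTTT" + SR_ADAPTER + barcode + "GGGGCCCCAAAA"
print(extract_segment(read))  # ACGTACGTACGTACGTGGGG
```

The returned segment is longer than a 16-base barcode on purpose: keeping a few extra bases leaves room for insertions and deletions when the segment is matched against barcodes later.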
2) Extract short-reads barcodes
The second step is to extract the top short-reads barcodes that cover most of the reads.
Arguments
-i: Input file
-o: Path to output file
-p: Path to plot file (Optional, Default: No plotting)
--thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
--step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
--max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)
Input
A BAM file of short-read data
Output
A TSV file with two columns: the barcode, and the number of appearances of the barcode
A cumulative plot of SR coverage with batches of 1,000 barcodes
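The interplay of --thresh, --step-size and --max-barcode-cnt can be sketched as follows. This is one reading of the argument descriptions above and not scTagger's code: it assumes that the threshold is a fraction of all reads and that selection stops at the first batch that falls below it. The function name select_top_barcodes is hypothetical.

```python
from collections import Counter


def select_top_barcodes(counts, thresh=0.005, step_size=1000, max_barcode_cnt=25000):
    """Keep the most frequent barcodes, one batch of `step_size` at a time.

    Assumption: a batch is kept only if the reads it covers make up at least
    `thresh` of all reads. Selection also stops at `max_barcode_cnt` barcodes.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = [bc for bc, _ in counts.most_common()]  # most frequent first
    kept = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        batch_fraction = sum(counts[bc] for bc in batch) / total
        if batch_fraction < thresh:
            break  # this batch adds too little coverage
        kept.extend(batch)
        if len(kept) >= max_barcode_cnt:
            break
    return kept[:max_barcode_cnt]


# Toy counts with small parameters so the cutoff is easy to follow.
counts = Counter({"AAAA": 500, "CCCC": 450, "GGGG": 30, "TTTT": 20})
print(select_top_barcodes(counts, thresh=0.1, step_size=2))  # ['AAAA', 'CCCC']
```

In the toy run the first batch covers 95% of the reads and is kept, while the second covers only 5%, which is below the 10% threshold, so selection stops.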
Alt. 2) Extract short-reads barcodes directly from long-reads
This is an alternative to the second step that avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes directly from the long-reads. It does so by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments. The barcodes are sorted by frequency, and the most frequent barcodes are kept using the same strategy as the extract_sr_bc module.
Arguments
-i: Input TSV file containing the long-read segments generated by the extract_lr_bc step
-o: Path to output file
-wl: Path to the 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz and .txt files.
--thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
--step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
--max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)
Input
The output of the extract_lr_bc step
Output
A whitelist of cellular barcodes, written to the path given by -o
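The exact-match counting described in this section can be sketched as below. The sketch is illustrative only: the function name count_whitelist_hits is hypothetical, and counting each barcode at most once per read is an assumption. The toy example uses 4-base barcodes for readability, whereas real 10x barcodes are 16 bases long.

```python
from collections import Counter


def count_whitelist_hits(segments, whitelist, barcode_length=16):
    """Count how many segments contain each whitelisted barcode verbatim."""
    whitelist = set(whitelist)
    hits = Counter()
    for segment in segments:
        # Every substring of barcode_length bases is a candidate barcode.
        # A set is used so a barcode is counted once per segment (assumption).
        candidates = {
            segment[i:i + barcode_length]
            for i in range(len(segment) - barcode_length + 1)
        }
        hits.update(candidates & whitelist)
    return hits


segments = ["TTACGTGG", "ACGTAAAA", "CCCCCCCC"]
whitelist = ["ACGT", "AAAA", "GGGG"]
hits = count_whitelist_hits(segments, whitelist, barcode_length=4)
print(hits.most_common())  # [('ACGT', 2), ('AAAA', 1)]
```

The resulting counts play the same role as the short-read barcode counts, so the same batch-wise frequency cutoff can be applied to them afterwards.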
3) Match long-reads segment with short-reads barcodes
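The last step matches the long-read segments with the barcodes selected from the short-reads.
Arguments
-lr: Long-read segments TSV file
-sr: Short-read barcode list TSV file
-mr: Maximum number of errors allowed for barcode matching (Optional, Default: 2)
-m: Maximum number of GB of RAM to be used (Optional, Default: 16.0)
-bl: Length of barcodes (Optional, Default: 16)
-t: Number of threads to use for searching (Optional, Default: 16)
-p: Path of plot file
-o: Path to output file. Output file is gzipped
Inputs
The outputs of the long-read segment extraction step and of the top-barcode selection step
Outputs
A TSV file with five columns: the read id, the minimum edit distance, the number of short-read barcodes that match the long-read, the long-read segment, and a list of all short-read barcodes with the minimum edit distance
A bar plot that shows the number of long-reads by the minimum edit distance of their matched barcode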
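To make the output columns concrete, the sketch below builds one such row for a single long-read segment. It is a brute-force illustration under stated assumptions and not how scTagger performs the search: every barcode is compared against the segment with a textbook dynamic program, which would be far too slow for real data. The function names and the max_error parameter (standing in for -mr) are hypothetical.

```python
def min_edit_distance_in_segment(barcode, segment):
    """Smallest edit distance between `barcode` and any substring of `segment`."""
    # Row 0 is all zeros: the barcode may start anywhere in the segment for free.
    prev = [0] * (len(segment) + 1)
    for i, b in enumerate(barcode, start=1):
        curr = [i] + [0] * len(segment)
        for j, s in enumerate(segment, start=1):
            curr[j] = min(
                prev[j - 1] + (b != s),  # match or substitution
                prev[j] + 1,             # barcode base missing from the segment
                curr[j - 1] + 1,         # extra base in the segment
            )
        prev = curr
    # The barcode may also end anywhere, so take the best end position.
    return min(prev)


def match_read(read_id, segment, barcodes, max_error=2):
    """Build one row: read id, min distance, match count, segment, barcodes."""
    if not barcodes:
        return None
    dists = {bc: min_edit_distance_in_segment(bc, segment) for bc in barcodes}
    best = min(dists.values())
    if best > max_error:
        return None  # no barcode is close enough to this segment
    matched = sorted(bc for bc, d in dists.items() if d == best)
    return (read_id, best, len(matched), segment, matched)


row = match_read("read_1", "GGACGTTCGTAA", ["ACGTACGT", "TTTTCCCC"])
print(row)  # ('read_1', 1, 1, 'GGACGTTCGTAA', ['ACGTACGT'])
```

A read whose third column is 1 has a single closest barcode; larger values indicate ties between barcodes at the same minimum edit distance.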
Citing scTagger
scTagger was first accepted to RECOMB-seq 2022 and is now published by iScience:
Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, and Faraz Hach. “Fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments.” iScience (2022). DOI:10.1016/j.isci.2022.104530
Please check the paper branch of this repository for the archived paper experiments and implementation.