Fully Automated and Standardized iCLIP (FAST-iCLIP) is a fully automated tool to process iCLIP data. Please cite the following paper:
Zarnegar B, Flynn RA, Shen Y, Do BT, Chang HY, Khavari PA. Ultraefficient irCLIP pipeline for characterization for protein-RNA interactions. Nature Methods (2016)
This package contains two main sets of tools: an executable called fasticlip to run iCLIP on human and mouse data, and several (possibly deprecated) iPython notebooks to process iCLIP data from viral genomes.
The following README will focus mainly on fasticlip. The pdf in the repository contains further instructions for using the iPython notebooks.
Note that the current pipeline is compatible with only GRCh38 (human) and GRCm38 (mouse) assemblies. This is due to a tailored set of annotations used in the pipeline. We will release details of generating annotation files for other genomes shortly in future.
Required arguments
flag
description
-h, –help
show this help message and exit
-i INPUT(s)
At least one input FASTQ (or fastq.gz) files; separated by spaces
–GRCh38
required if your CLIP is from human
–GRCm38
required if your CLIP is from mouse
-n NAME
Name of output directory
-o OUTPUT
Name of directory where output directory will be made
Optional arguments
flag
description
–trimmed
flag if files are already trimmed
-f N
Number of bases to trim from 5’ end of each read. Default is 14. If using irCLIP RT primers, this value should be 18.
-a ADAPTER
3’ adapter to trim from the end of each read. Default is AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG.
-tr REPEAT_THRESHOLD_RULE
m,n: at least m samples must each have at least n RT stops mapped to repeat RNAs. Default is 1,4 (1 sample); 2,3 (2 samples); x,2 (x>2 samples)
-tv EXOVIRAL_THRESHOLD_RULE
m,n: at least m samples must each have at least n RT stops mapped to viral genome. Default is 1,4 (1 sample); 2,3 (2 samples); x,2 (x>2 samples)
-tn NONREPEAT_THRESHOLD_RULE
m,n: at least m samples must each have at least n RT stops mapped to nonrepeat RNAs. Default is 1,4 (1 sample); 2,3 (2 samples); x,2 (x>2 samples)
-bm BOWTIE_MAPQ
Minimum MAPQ (Bowtie alignment to repeat/tRNA/retroviral indexes) score allowed. Default is 42.
-q Q
Minimum average quality score allowed during read filtering. Default is 25.
-p P
Percentage of bases that must have quality > q during filtering. Default is 80.
-l L
Minimum length of read. Default is 15.
-c C
Number of cores used by bowtie2. Default is 8.
–verbose
Prints out lots of things :)
Installation instructions
Clone this repository by running one of the following:
git clone git@github.com:ChangLab/FAST-iCLIP.git if you use ssh authentication
Run ./configure. This will check for dependencies (below) and download necessary files (bowtie indices, gene lists and genomes, and example iCLIP data). Note that the configure will download a very large annotation file from Amazon that contains all necessary annotation files to run the pipeline. Please wait until all annotations are downloaded and extracted. No additional annotation file is needed. The annotations are compatible only with the tools specificed in the following.
Run sudo python setup.py install. If you do not have sudo privileges, run python setup.py install --user or python setup.py install --prefix=<desired directory>.
You should see three new folders inside FAST-iCLIP: docs, rawdata, and results.
Add the following lines to your ~/.bashrc and ~/.bash_profile:
export FASTICLIP_PATH=~/.local/bin/
export PATH=$FASTICLIP_PATH:$PATH
Save the file, then run source ~/.bash_profile.
Try running the following command:
fasticlip -i rawdata/example_MMhur_R1.fastq rawdata/example_MMhur_R2.fastq --GRCm38 -n MMhur -o results. It should run in ~1 hour. Look inside results/MMhur for output files.
Dependencies
The version numbers listed have been tested successfully. There can be difficulties if you choose to run updated versions of some of these dependencies.
iCLIPro analyzes the RT stops vs cDNA length. It has been observed that RBPs and the complexes they bind RNAs in can occupy small or large areas on target RNAs. If RNase digestion is too stringent the identified RT stops via cDNA truncation may be shifted.
For each strand, we merge RT stops between replicates.
This means that at RT stop position must be ‘conserved’ between replicates.
If conserved, we count the total number of instances of the RT position for both replicates.
If the total counts exceed a specified threshold, then we record these RT stops.
We then make bedGraph and BigWig files, allowing visualization in genome browsers.
Partition RT stops by gene type.
We intersect merged RT stops with Ensembl annotation of gene name by RNA type.
Some protein coding and lincRNA genes can have embedded snoRNAs or miRNAs in their introns, making this more challenging.
In turn, we re-generate the initial RT stop and intersect this with two different filters.
One filter is a snoRNA mask and the other filter is a miR mask: both derived from Ensembl annotations.
These masks allow us to remove all “protein coding” RT stops that fall within annotated snoRNA/miR regions.
Quantification of reads per gene.
For each gene type, we quantify the number of reads per gene.
For all but snoRNAs, this is computed using the bed files obtained above.
For snoRNAs, we intersect the initial pool of RT stops with a custom annotation file of snoRNAs built from Ensembl annotations of noncoding RNAs.
Collectively, this gives us reads per gene for each gene type.
Partition protein coding reads by functional mRNA elements.
We intersect snoRNA/miR filtered reads with Ensembl-derived UTR coordinates.
We perform this such output files generated are both “exclusive” and “inclusive”.
Exclusive = a gene will have RT stops only in the specific element, without any RT stops elsewhere in that gene.
Inclusive = a gene will have RT stops in the specific element, but can also have RT stops elsewhere in that gene.
Partition reads by ncRNA binding region.
For non-coding RNAs, we simply annotate reads with the start and stop position for each ncRNA.
This allows us to determine the position of each RT stop with respect to the full length of the gene.
Partition repeat-mapped RT stops by region.
The repeat RNA mapped RT stops are paritioned using the repeat custom index annotation.
As with the ncRNAs, this is later used for visualization. Both sense and antisense mapped RT stops are visualized.
FAST-iCLIP
Fully Automated and Standardized iCLIP (FAST-iCLIP) is a fully automated tool to process iCLIP data. Please cite the following paper:
Zarnegar B, Flynn RA, Shen Y, Do BT, Chang HY, Khavari PA. Ultraefficient irCLIP pipeline for characterization for protein-RNA interactions. Nature Methods (2016)This package contains two main sets of tools: an executable called
fasticlipto run iCLIP on human and mouse data, and several (possibly deprecated) iPython notebooks to process iCLIP data from viral genomes.The following README will focus mainly on
fasticlip. The pdf in the repository contains further instructions for using the iPython notebooks.Table of Contents
Usage [*** UPDATE ***]
fasticlip [-h] -i INPUT [INPUT ...] [--trimmed] [--GRCh38 | --GRCm38] -n NAME -o OUTPUT [-f N] [-a ADAPTER] [-tr REPEAT_THRESHOLD_RULE] [-tn NONREPEAT_THRESHOLD_RULE] [-tv EXOVIRUS_THRESHOLD_RULE] [-bm BOWTIE_MAPQ] [-q Q] [-p P] [-l L] [-c C] [--verbose]Example:
fasticlip -i rawdata/example_MMhur_R1.fastq rawdata/example_MMhur_R2.fastq --GRCm38 -n MMhur -o resultsExample:
fasticlip -i rawdata/example_Hmhur_R1.fastq rawdata/example_Hmhur_R2.fastq --GRCh38 -n Hmhur -o resultsNote that the current pipeline is compatible with only GRCh38 (human) and GRCm38 (mouse) assemblies. This is due to a tailored set of annotations used in the pipeline. We will release details of generating annotation files for other genomes shortly in future.
Required arguments
Optional arguments
Installation instructions
Clone this repository by running one of the following:
git clone git@github.com:ChangLab/FAST-iCLIP.gitif you use ssh authenticationgit clone https://github.com/ChangLab/FAST-iCLIP.gitotherwiseType
cd FAST-iCLIPto enter the folder.Run
./configure. This will check for dependencies (below) and download necessary files (bowtie indices, gene lists and genomes, and example iCLIP data). Note that the configure will download a very large annotation file from Amazon that contains all necessary annotation files to run the pipeline. Please wait until all annotations are downloaded and extracted. No additional annotation file is needed. The annotations are compatible only with the tools specificed in the following.Run
sudo python setup.py install. If you do not have sudo privileges, runpython setup.py install --userorpython setup.py install --prefix=<desired directory>.You should see three new folders inside
FAST-iCLIP:docs,rawdata, andresults.Add the following lines to your ~/.bashrc and ~/.bash_profile:
export FASTICLIP_PATH=~/.local/bin/export PATH=$FASTICLIP_PATH:$PATHSave the file, then run
source ~/.bash_profile.fasticlip -i rawdata/example_MMhur_R1.fastq rawdata/example_MMhur_R2.fastq --GRCm38 -n MMhur -o results. It should run in ~1 hour. Look insideresults/MMhurfor output files.Dependencies
The version numbers listed have been tested successfully. There can be difficulties if you choose to run updated versions of some of these dependencies.
Input
At least one FASTQ or compressed FASTQ (fastq.gz). Use the
--trimmedflag if trimming has already been done.Output
Three subdirectories inside the named directory within
results.figureshas 6 figures in pdf and png format.Figure 1 visualizes the some of the relevant summary data.
Figure 2 provides coverage histograms of binding across each repeat RNA element, both sense and antisense strands.
Figure 3 provides coverage histograms of binding across the rRNA, highlighting mature rRNA regions.
Figures 4a and 4b provide a summary of snoRNA binding data.
Figure 5 provides histograms of RT stop position within gene body for all remaining ncRNA types.
Figure 6 provides a pie chart composed of RT stops from the top 15 best bound endoVirus elements.
Figure 7 provides histograms of RT stop position across the genome for any exoViruses (DV, ZV, or HCV).
rawdatahas all the PlotData files used to make the figures, as well as intermediate files that can be useful in generating other plots.todeletehas files that are unnecessary to keep.How the pipeline works
Prepare 2 FASTQ or FASTQ.gz files corresponding to the two replicates for an iCLIP/irCLIP experiment.
Duplicate removal, quality filter, and trim adapter from the 3’ end from the reads