LoDEI - the local differential editing index - offers a collection of programs to detect and analyze differentially edited A-to-I regions in two sets of RNA-seq samples.
lodei -h # get a list of all commands
lodei subcommand -h # get info of a subcommand
The subcommand to detect differential A-to-I editing is lodei find.
Analyzing RNA-seq data typically requires mapping NGS reads in FASTQ format to a reference genome.
The primary inputs for lodei find are sorted BAM files as produced by NGS read mappers like STAR.
System Requirements
Linux operating system (our systems run on Ubuntu 22.04)
conda/mamba or Podman/Docker
Installation
Before installing, we recommend following the instructions in the Test Data section so that you can verify the installation afterwards.
Install LoDEI by using one of the following ways:
use the conda/mamba package manager to install LoDEI.
build a Podman/Docker image locally by using the provided Containerfile.
Publication
Torkler, P., Sauer, M., Schwartz, U. et al. LoDEI: a robust and sensitive tool to detect transcriptome-wide differential A-to-I editing in RNA-seq data. Nat Commun 15, 9121 (2024). https://doi.org/10.1038/s41467-024-53298-y
Test Data
We provide a small test dataset (~15 MB, https://zenodo.org/doi/10.5281/zenodo.10907019) that contains all input files required to run lodei find and demonstrates how to detect differentially edited A-to-I regions.
The test dataset contains sorted BAM files from the two conditions to be compared, genomic annotations, and the nucleotide sequences of three genes of the human genome.
Let’s create an example directory and download the test data:
cd ~ # change to your home directory
mkdir example # create a new directory to store example data
cd example # switch to the example directory
# download and unpack test data:
wget https://zenodo.org/records/10907020/files/test_data.tar.gz
tar -xzf test_data.tar.gz
After unpacking, the directory data_testrun should be in your example directory.
The subdirectory data_testrun/annotation contains genomic sequences in FASTA format and genomic annotations in GFF3 format.
The subdirectory data_testrun/bam contains BAM files for 10 samples: samples 01-05 belong to set 1 and samples 06-10 belong to set 2.
DO NOT CHANGE ANYTHING IN THE data_testrun DIRECTORY
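As a quick sanity check after unpacking, one could count the BAM files in the test data. This is only a sketch: count_bam_files is a hypothetical helper name, and it assumes that the BAM files carry the extension .bam.

```shell
# Count the files ending in .bam directly inside a directory.
# count_bam_files is a hypothetical helper, not part of LoDEI.
count_bam_files() {
    n=0
    for f in "$1"/*.bam; do
        # The glob stays unexpanded if nothing matches, so test for existence.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}
```

For the test dataset, `count_bam_files ~/example/data_testrun/bam` would be expected to report 10, one file per sample.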
Installation and Usage via conda
Generate a new environment and install LoDEI:
Run LoDEI Using Conda
To verify the installation, we run LoDEI on the provided test dataset.
Make sure to return to the example directory where you unpacked the test dataset.
Create a new output directory in ~/example where LoDEI can save the results, then move into the example data directory:
Run LoDEI on the test data:
Detailed explanation of parameters and arguments:
--group1 ...
provide the list of sorted BAM files (separated by spaces) of samples belonging to group 1. Note that for each input .bam file, a corresponding .bai file must be present in the same directory.
--group2 ...
provide the list of sorted BAM files (separated by space) of samples belonging to group 2. Note, for each input.bam file a corresponding .bai file is required to be present in the same directory.
-f annotation/genome.fa
provide the reference genome used to generate the provided BAM files.
annotation/test_anno.gff3
LoDEI calculates differential editing for all entries of the provided annotation file.
-o ../output_conda
define the output directory. LoDEI generates many automatically named files.
-c 3
number of CPU cores to use.
--library SR
provide the strandedness of your BAM files. Strandedness is defined as in salmon, see https://salmon.readthedocs.io/en/latest/library_type.html. Currently, the following types are supported:
SR = reverse stranded
SF = forward stranded
U = unstranded (also use U for IU)
ISR = paired-end, reverse stranded
ISF = paired-end, forward stranded
If you are unsure which library type (strandedness) your data has, have a look at the FAQ and https://github.com/rna-editing1/getlibtype
--min_coverage 5
only consider single positions that have a coverage >= min_coverage in all samples.
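Because every BAM file needs an index next to it, it can help to check for missing .bai files before starting a run. The following is a minimal sketch, not part of LoDEI: check_bai is a hypothetical helper name, and accepting both sample.bam.bai (the name samtools index writes by default) and sample.bai is an assumption about which naming is acceptable.

```shell
# Report BAM files that have no index file in the same directory.
# Accepts sample.bam.bai as well as sample.bai.
# check_bai is a hypothetical helper, not part of LoDEI.
check_bai() {
    missing=0
    for bam in "$@"; do
        if [ ! -f "${bam}.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
            echo "missing index for: ${bam}" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every BAM file has an index.
    [ "$missing" -eq 0 ]
}
```

A possible call from within data_testrun is `check_bai bam/*.bam && echo "all BAM files are indexed"`.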
Wait until LoDEI finishes the calculation (~1-2min) and have a look at the output.
Installation and Usage via Podman
Common Linux distributions are typically shipped with Podman. Podman is a tool to create, run and maintain containers.
For a detailed introduction of Podman we refer the reader to the primary documentation at https://podman.io.
Build the image via the Containerfile
Let’s build the image locally:
cd ~ # enter your home directory
git clone https://github.com/rna-editing1/lodei.git # get the repository
cd lodei
podman build -f Containerfile -t lodei
Usage via Podman
Verify that podman is able to start LoDEI by trying to run the new container:
podman run -it --rm localhost/lodei:latest lodei find -h
If your container runs successfully, you should see the help page of LoDEI.
Run LoDEI Using Podman
To verify a proper installation we run LoDEI on the provided test dataset.
Mount a volume/directory into the container
The LoDEI container needs access to the input files (annotations and BAM files) and to a directory where it can save results.
The -v option mounts directories of your host file system into the container.
The general syntax is
-v /path/on/host/system:/path/in/container:option
where option can be ro for read-only or rw for read and write.
Note, the directory in the container does not need to exist there. You can specify any directory.
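To make the syntax concrete, the following sketch assembles a -v argument from its three parts and rejects any mode other than ro or rw. mount_arg is a hypothetical helper introduced only for illustration; the paths are the ones used for the test dataset.

```shell
# Build a -v argument from a host path, a container path and a mode.
# mount_arg is a hypothetical helper; only the host:container:mode
# format is taken from the text above.
mount_arg() {
    case "$3" in
        ro|rw) ;;
        *) echo "mode must be ro or rw, got: $3" >&2; return 1 ;;
    esac
    printf '%s %s:%s:%s\n' -v "$1" "$2" "$3"
}

# Example: read-only mount of the test BAM files at /bam in the container.
mount_arg "$HOME/example/data_testrun/bam" /bam ro
```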
Run LoDEI
If you followed the steps in the Test Data section, the directory ~/example exists.
Switch to the ~/example directory and create a new directory in which LoDEI will save all output:
cd ~/example
mkdir output_test
Next, we run LoDEI on the test dataset:
Detailed explanation of parameters and arguments:
-v ~/example/data_testrun/annotation:/annotation:ro
mount the host directory ~/example/data_testrun/annotation to the directory /annotation in the container with read-only permission.
-v ~/example/data_testrun/bam:/bam:ro
mount the host directory ~/example/data_testrun/bam to the directory /bam in the container with read-only permission.
-v ~/example/output_test:/output:rw
mount the host directory ~/example/output_test to the directory /output in the container with read and write permissions.
localhost/lodei:latest lodei find
localhost/lodei:latest is the name of the image from which a new container is started (as tagged by the podman build command above), followed by the command-line call that starts lodei find.
--group1 ...
provide the list of sorted BAM files (separated by space) of samples belonging to group 1. Note, for each input.bam file a corresponding .bai file is required to be present in the same directory.
--group2 ...
provide the list of sorted BAM files (separated by space) of samples belonging to group 2. Note, for each input.bam file a corresponding .bai file is required to be present in the same directory.
-f /annotation/genome.fa
provide the reference genome used to generate the provided BAM files.
/annotation/test_anno.gff3
LoDEI calculates differential editing for all entries of the provided annotation file.
-o /output
define the output directory. LoDEI generates many automatically named files.
-c 3
number of CPU cores to use.
--library SR
provide the strandedness of your BAM files. Strandedness is defined as in salmon, see https://salmon.readthedocs.io/en/latest/library_type.html. Currently, the following types are supported:
SR = reverse stranded
SF = forward stranded
U = unstranded (also use U for IU)
ISR = paired-end, reverse stranded
ISF = paired-end, forward stranded
If you are unsure which library type (strandedness) your data has, have a look at the FAQ and https://github.com/rna-editing1/getlibtype
--min_coverage 5
only consider single positions that have a coverage >= min_coverage in all samples.
Wait until LoDEI finishes the calculation (~1-2min) and have a look at the output.
Output
The primary outputs are BED-format-like plaintext files containing the genomic coordinates, their differential editing signals and q-values of all windows. The first line is the header. Each subsequent line corresponds to a single window.
| Column name | Description |
| --- | --- |
| chrom | The name of the chromosome (e.g. chr2, 2) where the window was detected (string) |
| wstart | The start position of the window (int) |
| wend | The end position of the window (int) |
| name | The name of the gene in which the window was detected, or empty (string) |
| wEI | The calculated differential signal (see eq. 4 in the publication) (float) |
| strand | The strand on which the differential signal was detected, either "+" or "-" (char) |
| q_value | The calculated q value of the detected wEI signal (float) |
LoDEI computes differential signals for all possible mismatch pairs. Consequently, for each pair of mismatched nucleotides X and Y, an output file is generated following the scheme /windows/windows_XY.txt. Note that X and Y refer to the 5’-3’ orientation. If you are interested in analyzing A-to-I editing, you only need to look at the file /windows/windows_AG.txt. LoDEI internally handles the mismatch detection with respect to the sequencing library used and the strand orientation of the RNAs.
The results of all mismatches are located in the subdirectory /windows of the output directory.
If any windows reach a q value < 0.1, LoDEI additionally writes these windows to a separate file for each mismatch pair, named windows_qfiltered_XY.txt, where X and Y are the nucleotide mismatches.
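If a different q-value cutoff is of interest, the window files can be filtered by hand. The sketch below assumes that the files are tab-separated and that the columns appear in the order listed above, so that q_value is column 7; neither is confirmed here, so check the header of your own files first. filter_windows is a hypothetical helper name.

```shell
# Print the header line plus all windows whose q_value is below a cutoff.
# Assumes tab-separated columns with q_value in column 7.
# filter_windows is a hypothetical helper, not part of LoDEI.
filter_windows() {
    # $1 = windows file, $2 = q-value cutoff (defaults to 0.1)
    awk -F '\t' -v cutoff="${2:-0.1}" 'NR == 1 || ($7 + 0) < cutoff' "$1"
}
```

For example, `filter_windows output_test/windows/windows_AG.txt 0.05` would keep only A-to-G windows with a q value below 0.05.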
Getting Started
Since LoDEI requires sorted BAM files as input, a typical workflow runs the following programs:
fastqc / multiqc for quality control of the data
cutadapt for quality filtering of the reads. The -q parameter specifies the quality filtering. A value of at least 20 (-q 20) is recommended.
STAR for aligning RNA-seq data to the reference
samtools for sorting and indexing the BAM files obtained from STAR.
lodei for differential RNA editing analysis.
Remember to set the --library parameter of lodei correctly.
If you are unsure which library type (strandedness) your data has, have a look at the FAQ and https://github.com/rna-editing1/getlibtype
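For one single-end sample, the preprocessing steps above could look roughly like the following. The function only prints the commands and does not run them. The sample name, the star_index directory, the thread count, and all file names are placeholders, and the tool options shown are common choices rather than settings prescribed by LoDEI.

```shell
# Print (not run) one possible command sequence for a single-end sample.
# All file names, star_index and the thread count are placeholders.
# print_pipeline is a hypothetical helper, not part of LoDEI.
print_pipeline() {
    s="$1"
    cat <<EOF
fastqc ${s}.fastq.gz
cutadapt -q 20 -o ${s}.trimmed.fastq.gz ${s}.fastq.gz
STAR --runThreadN 4 --genomeDir star_index --readFilesIn ${s}.trimmed.fastq.gz --readFilesCommand zcat --outSAMtype BAM Unsorted --outFileNamePrefix ${s}_
samtools sort -o ${s}.sorted.bam ${s}_Aligned.out.bam
samtools index ${s}.sorted.bam
EOF
}

print_pipeline sample_01
```

The resulting sorted and indexed BAM files of all samples are then passed to lodei find via --group1 and --group2.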
FAQ
What kind of annotation file do I need to use?
The annotation file provided in GFF format that LoDEI takes as input should contain the genomic regions of interest for your analysis question.
LoDEI uses a sliding window approach.
For a given genomic region (that’s an entry/line in your GFF), LoDEI calculates the differential editing for all windows that fit into that given region.
Thus, your GFF file should fulfill the following requirements:
Make sure your GFF annotation file does not contain redundant entries.
The genomic locations in your file should be unique and not overlapping with each other.
Common genomic annotation files, such as the basic gene annotation from gencodegenes.org, should not be used without prior filtering because they typically contain many redundant genomic locations (genes, transcripts, and exons). A starting point for standard RNA-seq might be the set of protein-coding genes (but this depends on your experiment).
Make sure the annotation file you provide to LoDEI covers a large set of genomic locations so that LoDEI gets enough data to calculate q values.
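One way to reduce a gene annotation to protein-coding genes is sketched below. It assumes a GENCODE-style GFF3 file in which column 3 holds the feature type and column 9 contains an attribute gene_type=protein_coding; other annotation sources may use different attribute names. filter_protein_coding_genes is a hypothetical helper name.

```shell
# Keep comment/header lines and "gene" features marked as protein coding.
# Assumes GENCODE-style GFF3 (feature type in column 3, attributes in column 9).
# filter_protein_coding_genes is a hypothetical helper, not part of LoDEI.
filter_protein_coding_genes() {
    awk -F '\t' '/^#/ || ($3 == "gene" && $9 ~ /(^|;)gene_type=protein_coding(;|$)/)' "$1"
}
```

A possible call is `filter_protein_coding_genes annotation.gff3 > genes_protein_coding.gff3`, where both file names are placeholders. Genes can still overlap each other after this step, so the result may need further filtering to meet the requirements above.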
Should I use --rm_snps?
Short answer: if you are unsure, yes.
Long answer: If you compare datasets from the same cell line you typically don’t need that option. If the sets that you compare against each other contain sequencing data from different cells/samples/patients/etc. you should use this option.
How do I infer the library type of my data?
To run LoDEI, you must specify the library type of your sequencing data via the --library parameter.
We provide the small additional program getlibtype at https://github.com/rna-editing1/getlibtype to help you identify your library type.
LoDEI uses the same library type specification as Salmon (https://salmon.readthedocs.io/en/latest/library_type.html).
getlibtype is a small wrapper for salmon that utilizes salmon only for the library type detection.
Why is the library type so important?
Detecting RNA editing is based on scanning for mismatches between the sequencing data and the reference genome.
In case of A-to-I editing, publications typically refer to the analysis of A/G mismatches between the reads and the reference. This description is correct from the perspective of 5’-3’ transcript orientation and transcripts that originate from genes from the forward strand.
Unfortunately, transcripts can be located on the forward or reverse strand of the genome, and the sequencing chemistry used affects which mismatches are observed relative to the orientation of the transcripts.
In other words, the type of mismatch to look at depends on the location of the gene (forward or reverse strand) and the underlying sequencing chemistry.
To simplify the analysis and spare you this rabbit hole, LoDEI takes care of all of this internally.
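The strand effect alone can be illustrated with a few lines of shell. For a transcript from the reverse strand, an A-to-G change in 5’-3’ transcript orientation shows up as a T-to-C mismatch when compared with the forward-strand reference sequence. The sketch below only complements a mismatch code to show this; it ignores the library type and is not how LoDEI implements the handling internally. complement_mismatch is a hypothetical helper name.

```shell
# Complement a two-letter mismatch code, e.g. AG -> TC.
# complement_mismatch is a hypothetical helper, not part of LoDEI.
complement_mismatch() {
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

complement_mismatch AG   # prints TC
```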
LoDEI is free software and licensed under GPLv3.
If you use LoDEI, please cite https://doi.org/10.1038/s41467-024-53298-y