fix: improve GFF parsing robustness for non-standard formats

- Fix aggressive sed pattern that corrupted GFF attributes (ID→gene_id)
- Add word boundary check (\bID=) to prevent false matches
- Handle GFF files without ID attribute (common in EcoCyc/BioCyc)
- Normalize name/Name/gene_id to gene column automatically in R script
- Fix GFF header corruption (extra spaces in version directive)
- Improve compatibility with ParsedEcocyc and other curated databases
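
The effect of the word boundary can be illustrated with a minimal Python sketch. The exact sed command in main.sh is not shown here, so the attribute string and the `protein_ID` key below are made up for illustration; only the `\bID=` pattern and the ID→gene_id rewrite come from this commit.

```python
import re

# Hypothetical attribute column from a GFF line; the keys are illustrative.
attrs = "ID=gene0001;protein_ID=AAC73112.1;Name=thrL"

# Unanchored pattern: also rewrites the "ID=" inside "protein_ID=".
naive = re.sub(r"ID=", "gene_id=", attrs)

# Word-boundary pattern: only a standalone "ID=" key is rewritten,
# because "_" counts as a word character and blocks the boundary.
safe = re.sub(r"\bID=", "gene_id=", attrs)

print(naive)  # gene_id=gene0001;protein_gene_id=AAC73112.1;Name=thrL
print(safe)   # gene_id=gene0001;protein_ID=AAC73112.1;Name=thrL
```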

The R script now checks for a gene column and, if it is missing, creates it from:

- Name (standard GFF3)
- name (lowercase, non-standard)
- gene_id (from main.sh transformation)
- locus_tag (fallback, always present in prokaryotes)

This resolves parsing errors with GFF files that contain only locus_tag and name attributes and lack the standard ID field.
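
The fallback can be sketched in Python as follows. This is not the R code from the repository: the function names are hypothetical, and the priority order is an assumption read off the list above.

```python
def parse_attributes(column):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    pairs = (item.split("=", 1) for item in column.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


# Assumed priority order: the first key that is present wins.
GENE_KEYS = ("Name", "name", "gene_id", "locus_tag")


def gene_label(column):
    """Return the first available gene label, or None if no key is present."""
    attrs = parse_attributes(column)
    for key in GENE_KEYS:
        if attrs.get(key):
            return attrs[key]
    return None


print(gene_label("locus_tag=b0001;name=thrL"))  # thrL
print(gene_label("locus_tag=b0002"))            # b0002
```

With this ordering, a record that carries only locus_tag still receives a gene label, which is the case described for EcoCyc/BioCyc-style files.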

ExcludonFinder

An easy-to-use tool for identifying and analyzing excludons from RNA-seq data.

Outline

From the given RNA-seq data, reads are aligned against the reference genome (1) and per-nucleotide coverage is calculated (2). Convergent (-> <-) and divergent (<- ->) pairs of genes are extracted and the median coverage is calculated for each of them (3). The transcriptional unit (TU) of each gene is annotated (4) based on gene coverage: a threshold for the decrease in coverage is set, and the transcription start and termination sites (TSS and TTS) are placed where gene expression decays below this threshold. If the TUs of a convergent or divergent pair overlap, the pair is annotated as an excludon (5).
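
Steps (3) to (5) can be sketched in Python. This is a simplified illustration under stated assumptions, not the tool's implementation: the cutoff is assumed to be the threshold multiplied by the gene's median coverage, coverage is assumed to be available per strand, and all names and the toy data are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def pair_orientation(left, right):
    """Classify two neighbouring genes; `left` has the smaller coordinates."""
    if left.strand == "+" and right.strand == "-":
        return "convergent"  # -> <-
    if left.strand == "-" and right.strand == "+":
        return "divergent"   # <- ->
    return "tandem"


def extend_tu(gene, coverage, threshold=0.5):
    """Grow the gene's boundaries while coverage stays >= threshold * median.

    `coverage` holds per-base depth for the gene's strand; index 0 is position 1.
    """
    cutoff = threshold * median(coverage[gene.start - 1:gene.end])
    start, end = gene.start, gene.end
    if cutoff <= 0:
        return start, end  # unexpressed gene: keep the annotated boundaries
    while start > 1 and coverage[start - 2] >= cutoff:
        start -= 1
    while end < len(coverage) and coverage[end] >= cutoff:
        end += 1
    return start, end


def overlaps(tu_a, tu_b):
    """True if two closed intervals (start, end) share at least one base."""
    return tu_a[0] <= tu_b[1] and tu_b[0] <= tu_a[1]


# Toy data (made up for illustration): two genes on a 16 bp sequence.
gene_a = Gene("a", 3, 6, "+")
gene_b = Gene("b", 11, 14, "-")
fwd = [0, 0, 10, 10, 10, 10, 8, 7, 6, 6, 6, 0, 0, 0, 0, 0]
rev = [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 20, 20, 20, 20, 0, 0]

tu_a = extend_tu(gene_a, fwd)
tu_b = extend_tu(gene_b, rev)
is_excludon = pair_orientation(gene_a, gene_b) != "tandem" and overlaps(tu_a, tu_b)
print(tu_a, tu_b, is_excludon)  # (3, 11) (11, 14) True
```

In the toy data, the forward-strand transcript of gene a reads through into the first base of gene b, so the two TUs share position 11 and the convergent pair is flagged.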

Features

Fast parallel processing for large datasets
Support for both short and long-read data
Support for paired-end and single-end RNA-seq data
Built-in quality checks and mapping statistics

Installation

Using Conda (Recommended)

conda install -c bioconda excludonfinder

From source

git clone https://github.com/Alvarosmb/ExcludonFinder.git
cd ExcludonFinder
conda env create -f environment.yml
conda activate ExcludonFinder

Usage

If installed with conda:

ExcludonFinder -f <reference.fasta> -1 <reads_R1.fastq> -2 <reads_R2.fastq> -g <annotation.gff>

If installed from source:

./scripts/ExcludonFinder -f <reference.fasta> -1 <reads_R1.fastq> -2 <reads_R2.fastq> -g <annotation.gff>

Options

- `-f`: Reference genome in FASTA format
- `-1`: Input FASTQ file for Read 1
- `-2`: Input FASTQ file for Read 2
- `-g`: Annotation file in GFF format
- `-t`: Coverage threshold (default: 0.5)
- `-j`: Number of threads (default: 8)
- `-l`: Long-read data
- `-o`: Custom output dir (default: `./output`)
- `-k`: Keep intermediate files (default: remove)
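
For scripted runs, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of ExcludonFinder; it uses only the flags listed above and leaves out `-l` and `-k`, since their exact argument form is not specified here.

```python
def build_command(fasta, reads_1, reads_2, gff, threshold=0.5, threads=8,
                  out_dir=None, executable="ExcludonFinder"):
    """Return the argument list for one paired-end run."""
    cmd = [executable,
           "-f", fasta,
           "-1", reads_1,
           "-2", reads_2,
           "-g", gff,
           "-t", str(threshold),
           "-j", str(threads)]
    if out_dir is not None:
        cmd += ["-o", out_dir]  # otherwise the tool's default (./output) applies
    return cmd


cmd = build_command("ref.fasta", "reads_R1.fastq", "reads_R2.fastq",
                    "annotation.gff", threads=4)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the standard library, which avoids shell quoting issues with file paths.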

Example

./scripts/ExcludonFinder \
 -f data/example/E.coli_K12_MG1655.fasta \
 -1 data/example/test_R1.fastq \
 -2 data/example/test_R2.fastq \
 -g data/example/E.coli_K12_MG1655.gff \
 -t 0.5 \
 -j 4

Examples

The data/examples directory contains test RNA-seq data from E. coli K12 MG1655. For faster testing and analysis, the dataset is reduced to reads mapping only to the first 50 genes. Expected results can be found in data/examples/output/.

Citation

If you found this tool useful, please cite:

Alvaro Sanmartin, Pablo Iturbe, Jeronimo Rodriguez-Beltran, Iñigo Lasa. ExcludonFinder: Mapping Transcriptional Overlaps Between Neighboring Genes