eFISHent

Design RNA smFISH oligonucleotide probes from the command line. One command to install, one command to design probes.

Key features:

Pre-built genome indices for 7 organisms — no index building needed
Automatic gene sequence download from NCBI
Multi-layer off-target detection (genome alignment, transcriptome BLAST, repeat masking, expression weighting, and more)
Adaptive probe length to normalize Tm across the probe set
Protocol presets (smfish, merfish, dna-fish, etc.)
Automated probe validation with PASS/FLAG/FAIL recommendations

Installation

Tested on macOS and Linux with Python 3.10+. Works on HPC/cluster servers via SSH — no sudo, Docker, or conda needed. For Windows, use WSL.

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash

Restart your shell, then verify:

efishent --check

Installation options

With BLAST+ and transcriptome tools (for transcriptome-level off-target filtering):

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --with-blast

Custom install path:

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --prefix /path/to/install

Update:

efishent --update

Development install:

git clone https://github.com/BBQuercus/eFISHent.git
cd eFISHent/
./install.sh --deps-only
uv venv && source .venv/bin/activate
uv pip install -e .

Uninstall:

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --uninstall

Or simply: rm -rf ~/.local/efishent

Quick Start

The fastest way to design probes — genome indices are downloaded automatically:

efishent --genome hg38 --gene-name "TP53" --organism-name "homo sapiens" --preset smfish

That’s it. This downloads the pre-built human genome index on first use and designs smFISH probes for TP53.

Available Genomes

Organism	Aliases
Human	`hg38`, `GRCh38`, `human`
Mouse	`mm39`, `GRCm39`, `mouse`
Zebrafish	`danRer11`, `GRCz11`, `zebrafish`
Rat	`rn7`, `GRCr8`, `rat`
Drosophila	`dm6`, `BDGP6`, `fly`
C. elegans	`ce11`, `WBcel235`, `worm`, `elegans`
Yeast	`sacCer3`, `R64`, `yeast`

efishent --list-genomes          # List all available genomes
efishent --download-genome hg38  # Pre-download for offline use

Indices are cached in ~/.local/efishent/indices/ by default. Override with --index-cache-dir /path/to/dir or the EFISHENT_INDEX_DIR environment variable.

Specifying Your Target Gene

Three ways to provide the target sequence:

Method	Example
Gene name + organism	`--gene-name "TP53" --organism-name "homo sapiens"`
Ensembl ID	`--ensembl-id ENSG00000141510 --organism-name "homo sapiens"`
FASTA file	`--sequence-file ./my_gene.fasta`

Using Your Own Genome

For organisms without a pre-built index, provide your own reference genome:

# Build indices once (can take 30-60 min for large genomes)
efishent --reference-genome <genome.fa> --build-indices True

# Design probes
efishent --reference-genome <genome.fa> --gene-name <gene> --organism-name <organism>

Downloading genomes and annotations

For any organism, download the genome FASTA and GTF annotation from Ensembl or UCSC. Prefer primary_assembly if available, otherwise toplevel. Unzip with gunzip.

Example for human (GRCh38):

# Reference genome
wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# GTF annotation (for intergenic filtering, rRNA filtering, expression weighting)
wget https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.115.gtf.gz
gunzip Homo_sapiens.GRCh38.115.gtf.gz

Ensembl GTFs use gene_biotype while GENCODE uses gene_type — eFISHent supports both.

Reference transcriptome (optional, for BLAST cross-validation):

gffread Homo_sapiens.GRCh38.115.gtf -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -w transcriptome.fa

# Append rRNA sequences (18S/28S/5.8S are NOT in standard GTFs)
efetch -db nucleotide -id NR_003286.4 -format fasta >> transcriptome.fa  # 45S pre-rRNA
efetch -db nucleotide -id NR_023363.1 -format fasta >> transcriptome.fa  # 5S rRNA

The major rRNA genes exist in ~300 tandem copies in unassembled regions, so they’re absent from standard GTFs. Including them in the transcriptome FASTA ensures the BLAST filter catches probes binding these abundant sequences.

Count table (optional, for expression-weighted filtering):

Download a normalized RNA-seq dataset (FPKM/TPM) for your cell line from GEO or Expression Atlas. The file needs Ensembl gene IDs in column 1 and normalized counts in column 2.

Presets

Use --preset to apply optimized parameters for common FISH protocols:

Preset	Description
`smfish`	Standard smFISH (20-24nt probes, adaptive length, 10% formamide)
`merfish`	MERFISH encoding probes (tight Tm, 30% formamide)
`dna-fish`	DNA FISH (longer probes, relaxed specificity)
`strict`	Maximum specificity (low k-mer tolerance, low-complexity filter)
`relaxed`	Maximum probe yield (permissive thresholds + rescue filters)
`exogenous`	Exogenous genes — GFP, Renilla, reporters (no k-mer filter, strict BLAST)

Use --preset list to see details. Explicit arguments override preset values.

Workflow

flowchart TD
    A["Gene Sequence
FASTA file or NCBI download"] --> B["Generate Candidate Probes
Sliding window (adaptive or fixed length)"]

    B --> C["Basic Filtering
TM, GC, homopolymers, low-complexity"]

    C --> D["Genome Alignment
Bowtie2 (default) or Bowtie
+ repeat masking, intergenic, Tm scoring"]

    D --> E{"Transcriptome
provided?"}
    E -- Yes --> F["Transcriptome BLAST
Off-target detection (TrueProbes params)"]
    E -- No --> G["K-mer Filtering
Jellyfish frequency count"]
    F --> G

    G --> H["Secondary Structure
deltaG prediction (RNAstructure)"]

    H --> H2["Accessibility Scoring
RNA folding (optional)"]

    H2 --> I["Quality-Weighted Optimization
Greedy or optimal (MILP) + gap filling + Tm refinement"]

    I --> K["Validation Report
Quality scores, off-target genes, recommendations"]

    K --> J["Final Probe Set"]

    style A fill:#e1f5fe
    style J fill:#e8f5e9

Candidate probes are generated from the input sequence using a sliding window. When --adaptive-length is enabled, probe lengths are adjusted based on local GC content to normalize Tm.
Basic filtering removes probes failing sequence criteria: melting temperature, GC content, homopolymer runs, optionally low-complexity regions, and optionally G-quadruplex motifs (--filter-g-quadruplex).
Probes are aligned to the reference genome using Bowtie2 (sensitive local alignment with OligoMiner/Tigerfish parameters). Optional filters refine off-target counting: repeat masking, intergenic filtering, thermodynamic scoring, and expression weighting.
If a reference transcriptome is provided, probes are BLASTed against expressed transcripts to catch off-targets that genome alignment alone may miss (e.g., splice junctions).
Short k-mers are counted using Jellyfish — probes with frequently occurring k-mers are discarded.
Secondary structure is predicted using a nearest-neighbor thermodynamic model — probes with too-stable structures are filtered.
If --accessibility-scoring is enabled, target RNA accessibility is scored using RNA folding predictions.
Quality-weighted optimization selects non-overlapping probes maximizing coverage. A gap-filling pass covers remaining regions and Tm uniformity refinement swaps outlier probes.
The output includes per-probe quality scores, off-target gene names, expression risk, and PASS/FLAG/FAIL recommendations.

Output

eFISHent produces three files per run:

File	Description
`GENE_HASH.fasta`	Final probes in FASTA format
`GENE_HASH.csv`	Detailed probe table (see columns below)
`GENE_HASH.txt`	Run parameters and command for reproducibility

The HASH is a unique identifier based on the parameters used — rerunning with the same parameters reuses cached results.

Output CSV columns

Column	Description
`name`	Probe identifier
`sequence`	Probe nucleotide sequence
`start`, `end`	Position along the target gene
`length`	Probe length in nucleotides
`GC`	GC content (%)
`TM`	Predicted melting temperature (deg C)
`deltaG`	Secondary structure free energy (kcal/mol)
`kmers`	Maximum k-mer count in reference genome
`count`	Genome off-target hit count
`txome_off_targets`	Transcriptome off-target count (when `--reference-transcriptome` is used)
`off_target_genes`	Off-target gene names with hit counts, e.g., `ACTG1(3), MYH9(1)`
`worst_match`	Best off-target match quality, e.g., `95%/20bp/0mm`
`expression_risk`	Expression risk for off-target genes, e.g., `ACTG1:HIGH(850)`
`quality`	Composite quality score (0-100)
`recommendation`	`PASS`, `FLAG(reason)`, or `FAIL`

Probe Set Analysis

Analyze an existing probe set with comprehensive metrics and a PDF report:

efishent \
    --reference-genome <genome.fa> \
    --sequence-file <gene.fa> \
    --analyze-probeset <probes.fasta>

Analysis report contents

Plot	Description
Lengths	Distribution of probe lengths
Melting temperatures	Boxplot of calculated Tm values
GC Content	Boxplot of GC percentages
G quadruplet	Count of G-quadruplet motifs per probe
K-mer count	Maximum k-mer frequency in genome
Free energy	Predicted secondary structure stability (deltaG)
Off target count	Number of off-target binding sites per probe
Binding affinity	Probe-to-probe similarity matrix (potential cross-hybridization)
Gene coverage	Visual map of probe positions along the target sequence

Parameters

Core Parameters

Parameter	Description
`--reference-genome`	Path to reference genome FASTA
`--genome`	Use a pre-built genome index (e.g., `hg38`, `mm39`, `zebrafish`)
`--gene-name`	Gene name for automatic sequence download from NCBI
`--organism-name`	Organism name (used with `--gene-name` or `--ensembl-id`)
`--sequence-file`	Path to target gene FASTA file
`--preset`	Parameter preset: `smfish`, `merfish`, `dna-fish`, `strict`, `relaxed`, `exogenous`
`--threads`	Number of threads for parallel processing
`--is-plus-strand`	Strand orientation of the gene of interest
`--is-endogenous`	Whether the gene is endogenous to the organism

Probe Design Parameters

Parameter	Description
`--min-length`, `--max-length`	Probe length range in nucleotides
`--spacing`	Minimum distance between probes
`--min-tm`, `--max-tm`	Melting temperature range
`--min-gc`, `--max-gc`	GC content range (%)
`--formamide-concentration`	Formamide concentration (%)
`--na-concentration`	Sodium ion concentration (mM)
`--adaptive-length`	Adjust probe length by local GC to normalize Tm
`--max-homopolymer-length`	Max homopolymer run (default: 5, 0 to disable)
`--filter-low-complexity`	Filter dinucleotide repeats and low entropy regions
`--filter-g-quadruplex`	Filter G-quadruplex motifs in target
`--max-deltag`	Secondary structure free energy threshold
`--target-regions`	Target region: `exon` (default), `intron`, `both`, `cds-only`, `utr-only`
`--accessibility-scoring`	Score target RNA accessibility via RNA folding
`--optimization-method`	`greedy` (default, fast) or `optimal` (MILP, max coverage)
`--optimization-time-limit`	Time limit in seconds for optimal solver
`--sequence-similarity`	Max allowed inter-probe similarity (%) to avoid cross-hybridization

Off-target filtering parameters

Genome alignment (default):

Parameter	Description
`--max-off-targets`	Maximum genome hits per probe (default: 0)
`--aligner`	`bowtie2` (default) or `bowtie` (legacy)
`--mask-repeats`	Ignore off-targets in repetitive regions (uses dustmasker)
`--intergenic-off-targets`	Ignore off-targets outside annotated genes (requires `--reference-annotation`)
`--off-target-min-tm`	Min Tm (deg C) for an off-target to count. Set to hybridization temp to rescue thermodynamically unstable hits (default: 0)
`--filter-rrna`	Remove probes hitting rRNA genes (requires `--reference-annotation`)

Transcriptome BLAST (optional):

Parameter	Description
`--reference-transcriptome`	Transcriptome FASTA for BLAST cross-validation
`--max-transcriptome-off-targets`	Max transcriptome hits per probe (default: 0)
`--blast-identity-threshold`	Min % identity for BLAST hit (default: 75)
`--min-blast-match-length`	Min effective alignment length (default: max(18, 0.8 * min_probe_length))

Expression weighting (optional):

Parameter	Description
`--reference-annotation`	GTF annotation file
`--encode-count-table`	Normalized RNA-seq count table (FPKM/TPM)
`--max-expression-percentage`	Top expression percentile to exclude
`--max-probes-per-off-target`	Cap on probes hitting same off-target gene (default: 0 = disabled, recommended: 5)

Index and cache parameters

Parameter	Description
`--build-indices`	Build genome indices (bowtie2, jellyfish, BLAST)
`--download-genome`	Pre-download a genome index for offline use
`--list-genomes`	List available pre-built genomes
`--index-cache-dir`	Override index cache directory (default: `~/.local/efishent/indices/`). Also settable via `EFISHENT_INDEX_DIR`
`--kmer-length`	K-mer length for Jellyfish filtering
`--max-kmers`	Max k-mer occurrences in genome before discarding probe
`--save-intermediates`	Keep all intermediate files for debugging

Examples

Full examples

smFISH with pre-built index (simplest):

efishent --genome hg38 --gene-name "TP53" --organism-name "homo sapiens" --preset smfish --threads 8

smFISH with full off-target filtering:

efishent \
    --reference-genome ./hg-38.fa \
    --reference-annotation ./hg-38.gtf \
    --reference-transcriptome ./transcriptome.fa \
    --gene-name "GAPDH" \
    --organism-name "homo sapiens" \
    --preset smfish \
    --mask-repeats True \
    --intergenic-off-targets True \
    --filter-rrna True \
    --max-probes-per-off-target 5 \
    --threads 8

Long probes (45-50nt) with optimal solver:

efishent \
    --reference-genome ./hg-38.fa \
    --gene-name "norad" \
    --organism-name "homo sapiens" \
    --is-plus-strand True \
    --optimization-method optimal \
    --min-length 45 \
    --max-length 50 \
    --formamide-concentration 45 \
    --threads 8

Exogenous gene (GFP, Renilla, etc.):

efishent \
    --reference-genome ./hg38.fa \
    --reference-transcriptome ./transcriptome.fa \
    --reference-annotation ./hg38.gtf \
    --sequence-file "./renilla.fasta" \
    --preset exogenous \
    --threads 8

Expression-weighted off-target filtering:

efishent \
    --reference-genome ./hg-38.fa \
    --reference-annotation ./hg-38.gtf \
    --ensembl-id ENSG00000128272 \
    --organism-name "homo sapiens" \
    --is-plus-strand False \
    --max-off-targets 5 \
    --encode-count-table ./count_table.tsv \
    --max-expression-percentage 20 \
    --threads 8

Rescue probes with thermodynamic and repeat masking filters:

efishent \
    --reference-genome ./hg-38.fa \
    --reference-annotation ./hg-38.gtf \
    --sequence-file ./my_gene.fasta \
    --mask-repeats True \
    --intergenic-off-targets True \
    --off-target-min-tm 37 \
    --threads 8

FAQ

Have questions? Open an issue on GitHub.