CATS-rf is the reference-free module of the CATS (Comprehensive Assessment of Transcript Sequences) framework. It evaluates the quality of transcriptomes assembled de novo from short reads, relying solely on the RNA-seq reads used to construct the assembly. The pipeline maps reads back to the assembled transcripts and examines mapping evidence that suggests misassembly. Quality is evaluated at the transcript level by integrating four score components, each targeting specific assembly errors: coverage, accuracy, local fidelity, and integrity.
Transcript quality score St is calculated as the product of the described score components, equally weighting detected assembly errors. Assembly score S is computed as the mean of individual transcript scores. All components and scores are normalized to a range between 0 and 1, where higher values indicate better quality.
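The scoring scheme can be written compactly as follows. The component symbols are chosen here for illustration and are not the authors' notation: the subscripts cov, acc, lf, and int stand for the coverage, accuracy, local fidelity, and integrity score components, and N is the number of transcripts in the assembly.

```latex
S_t = S_{\mathrm{cov},t} \cdot S_{\mathrm{acc},t} \cdot S_{\mathrm{lf},t} \cdot S_{\mathrm{int},t},
\qquad
S = \frac{1}{N} \sum_{t=1}^{N} S_t
```

Because the transcript score is a product, a single low component pulls it down sharply: components of 0.9, 0.9, 0.9, and 0.5 give a transcript score of 0.3645.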
In addition to transcript scores, CATS-rf provides a comprehensive set of assembly metrics, including transcript length and composition statistics, read mapping rates, positional coverage and accuracy profiles, and pair mapping consistency metrics.
CATS-rf consistently displays stronger performance than currently existing reference-free transcriptome assembly evaluation tools. For detailed benchmarks and methodology, please refer to the CATS preprint (see the Citation section).
Installation
Running CATS-rf via Docker
Starting from version 1.0.2, CATS-rf can be run using Docker:
Pull the Docker image with:
docker pull bodulic/cats-rf
Run the container with:
docker run --rm -v "$PWD":/data -w /data bodulic/cats-rf CATS_rf
docker run --rm -v "$PWD":/data -w /data bodulic/cats-rf CATS_rf_compare
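Running CATS-rf via Singularity
Once the Singularity image (cats_rf.sif) has been built, run the container with:
singularity run cats_rf.sif CATS_rf
singularity run cats_rf.sif CATS_rf_compare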
Installing CATS-rf via conda
CATS-rf and its dependencies can be directly installed via Bioconda:
conda install -c bioconda cats-rf
In case of dependency conflicts, please see the Troubleshooting section.
Installing CATS-rf from source
CATS-rf consists of Bash and R scripts located in the scripts directory of this repository. After cloning the repository, all CATS-rf scripts must be included in the PATH environment variable.
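The following dependencies are required (each is listed with its conda installation command):
R: conda install conda-forge::r-base
data.table (R package): conda install conda-forge::r-data.table
Bowtie2: conda install -c bioconda bowtie2
Samtools: conda install -c bioconda samtools
kallisto: conda install -c bioconda kallisto
GNU Parallel: conda install conda-forge::parallel
bedtools (bamToBed): conda install -c bioconda bedtools
pysamstats: conda install -c bioconda pysamstats
R, Bowtie2, Samtools, kallisto, GNU Parallel, bedtools (bamToBed), and pysamstats executables must be included in PATH. The R package data.table can be installed via conda or directly in R with install.packages("data.table").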
MacOS
If you are using MacOS, Bash (version >= 4.0) and GNU versions of core utilities are required. In this case, PATH should be adjusted so that CATS-rf uses the GNU versions of core utilities; the exact paths depend on your architecture (Apple Silicon or Intel-based).
The stated changes can be made permanent by modifying the appropriate .rc file.
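A minimal sketch of such a PATH adjustment, assuming the GNU tools were installed with Homebrew (an assumption; this README does not name a package manager). Homebrew exposes unprefixed GNU binaries under <prefix>/opt/<formula>/libexec/gnubin, where the prefix is typically /opt/homebrew on Apple Silicon and /usr/local on Intel-based machines. The formula list is illustrative and the helper name prepend_gnubin is hypothetical.

```shell
# Prepend Homebrew "gnubin" directories to PATH so GNU tools shadow the BSD ones.
# Usage: prepend_gnubin /opt/homebrew   (Apple Silicon)
#        prepend_gnubin /usr/local      (Intel-based)
prepend_gnubin() {
  local prefix="$1" formula dir
  # Illustrative formula list (assumption); adjust to the GNU tools actually installed.
  for formula in coreutils gnu-sed grep gawk findutils; do
    dir="$prefix/opt/$formula/libexec/gnubin"
    if [ -d "$dir" ]; then
      PATH="$dir:$PATH"
    fi
  done
  export PATH
}
```

Directories that do not exist are skipped, so the same function can be called with either prefix. The active Bash version can be checked with bash --version.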
Test data
CATS-rf installation can be tested using the instructions and files located in the test_data directory.
Using typical hardware and default settings, the total runtime of CATS-rf on the provided test data is approximately 1 minute.
For real-world datasets, runtime scales with sequencing depth. On average-depth human transcriptomes (~20 million reads), CATS-rf typically completes in approximately 1 hour.
Example usage
CATS-rf requires a transcriptome assembly in FASTA format, along with short RNA-seq reads used during assembly in either FASTQ or FASTA format. Compressed (.gz) read files are supported.
CATS-rf supports both paired-end and single-end library configurations.
Example paired-end mode usage:
CATS_rf [OPTIONS] TRANSCRIPTOME READS1 READS2
Example single-end mode usage:
CATS_rf -C se -m MEAN_INS_SIZE -s SD_INS_SIZE [OTHER_OPTIONS] TRANSCRIPTOME READS1
Single-end mode requires three options to be specified: C for library configuration, m for mean fragment size, and s for standard deviation of fragment size. Note that single-end runs will output only general assembly statistics, read mapping metrics, and positional coverage and accuracy analysis.
Detailed options
CATS-rf offers a comprehensive list of options which allow users to control the analysis parameters.
Library type options
-C: Paired- vs. single-end library configuration: pe = paired-end, se = single-end, default: pe
-S: Library strandness, fr = forward-reverse, rf = reverse-forward, u = unstranded, a = automatic detection, default: u
CATS-rf can leverage strandness information when quantifying transcripts and calculating the local fidelity score component. When the automatic detection option is enabled, strandness is estimated using the first 100 000 read mappings.
While CATS-rf was primarily tested on Illumina data, the analysis can be run on assemblies generated from other short-read platforms. In such scenarios, S should be adjusted accordingly. If the strandness of the data is unknown, use either unstranded mode or automatic detection. Note that in unstranded mode, read pairs are expected to map to opposite strands, which is consistent with the behavior of virtually every short-read sequencing technology.
-Q: Phred quality encoding of FASTQ files, 33 = phred33, 64 = phred64, default: 33
Read mapping, transcript quantification, and read assignment options
-R: Random seed for read mapping, transcript quantification, and read assignment, default: 12345
Random seed is defined to ensure reproducible CATS-rf runs.
-N: Maximum number of distinct mappings per read, default: 10
The value of N should be increased for complex transcriptome assemblies that contain a large number of isoforms, and decreased for simpler assemblies with fewer isoforms to maximize performance and accuracy. Note that Bowtie2 mapping parameters are optimized to detect transcript errors, while minimizing the number of false-positive mappings. Furthermore, secondary mappings of each read are filtered based on edit distance.
-m: Estimated mean of fragment length needed for transcript quantification (single-end mode only)
-s: Estimated standard deviation of fragment length needed for transcript quantification (single-end mode only)
Fragment length distribution parameters m and s are required in single-end mode for transcript quantification by kallisto.
Coverage analysis options
-i: Per-base coverage distribution breakpoints (specified with x,y,z…), default: "0,5,10,20,40,60,80,100"
Per-base coverage is split into intervals defined by i (e.g. [0-5>, [5-10>…). This category variable is used for plotting by the CATS_rf_compare script.
All category variable breaks (i, p, r, u, I, P, U, y, and F) should be supplied as strings separated with commas and enclosed in quotes (e.g. "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1").
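To illustrate how a breakpoint string maps values to categories, here is a minimal awk sketch; it is not CATS-rf's own code. It assumes left-closed, right-open intervals, as the [0-5> notation suggests. How CATS-rf treats values outside the supplied breakpoints is not stated here, so the sketch labels them with an open-ended category. The function name assign_category is hypothetical.

```shell
assign_category() {
  # $1 = comma-separated breakpoint string, $2 = value to classify
  awk -v breaks="$1" -v value="$2" 'BEGIN {
    n = split(breaks, b, ",")
    v = value + 0
    for (i = 1; i < n; i++) {
      if (v >= b[i] + 0 && v < b[i + 1] + 0) {
        printf "[%s-%s>\n", b[i], b[i + 1]
        exit
      }
    }
    # Assumption: values outside every interval get an open-ended label
    if (v < b[1] + 0) printf "<%s\n", b[1]
    else printf ">=%s\n", b[n]
  }'
}

assign_category "0,5,10,20,40,60,80,100" 7     # prints [5-10>
assign_category "0,5,10,20,40,60,80,100" 250   # prints >=100
```

Note that the breakpoint string is passed as a single quoted argument, which is the same reason the options above require quoting.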
-p: Per-transcript proportion of covered bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Per-transcript proportion of covered bases is split into intervals defined by p (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-r: Mean transcript coverage distribution breakpoints (specified with x,y,z…), default: "0,5,10,20,40,60,80,100"
Mean transcript coverage is split into intervals defined by r (e.g. [0-5>, [5-10>…). This category variable is used for plotting by the CATS_rf_compare script.
-l: Proportion of transcript length for positional relative coverage distribution analysis, default: 0.01
Transcripts are split into fractional segments of size l for positional relative coverage distribution analysis. Coverage is expressed relative to the base with the highest coverage within the same transcript. Relative coverage for each segment is calculated as mean relative coverage within the segment. Positional analysis output contains assembly-level median relative coverage for each transcript segment.
-n: Proportion of transcript length for transcript end definition when calculating mean transcript end coverage, default: 0.02
-k: Rolling window size for local coverage calculation (in bp) when defining low-coverage regions (LCR), default: 10
-z: Local coverage threshold for LCR characterization, default: 3
LCRs are defined as rolling windows of size k with mean coverage lower than or equal to z.
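The rolling-window definition can be sketched as follows. This is an illustration under stated assumptions, not CATS-rf's implementation: the input is a hypothetical list of per-base coverage values for one transcript, one value per line, and the function (the name flag_low_cov_windows is made up here) reports the 1-based start position of every window of size k whose mean coverage is at most z. How CATS-rf merges overlapping windows into regions and treats transcript ends is not described in this section and is not modelled.

```shell
flag_low_cov_windows() {
  # $1 = window size k (bp), $2 = coverage threshold z; per-base coverage on stdin
  awk -v k="$1" -v z="$2" '
    {
      cov[NR] = $1
      sum += $1
      if (NR > k) sum -= cov[NR - k]   # drop the base that left the window
      if (NR >= k && sum / k <= z) print NR - k + 1
    }'
}

# Twelve bases: high coverage, a dip, then recovery (k = 5 for brevity; the default is 10)
printf '%s\n' 30 28 25 4 3 2 1 0 2 3 20 25 | flag_low_cov_windows 5 3
```

With these inputs the windows starting at positions 4, 5, and 6 are flagged, since their mean coverage is 2, 1.6, and 1.6.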
-u: Per-transcript proportion of LCR bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Per-transcript proportion of LCR bases is split into intervals defined by u (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-w: Base coverage weight, default: 1.5
-e: LCR extension penalty, default: 0.5
Coverage penalties assigned to LCRs are controlled by w and e. Lower values of w and higher values of e increase the relative impact of LCR length on coverage penalty.
Accuracy analysis options
-I: Per-base accuracy distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,0.99,1"
Accuracy is defined as the proportion of aligned read bases matching the transcript base. Per-base accuracy is split into intervals defined by I (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-A: Minimum accuracy for a base to be considered accurate, default: 0.95
-P: Per-transcript proportion of accurate bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,0.99,1"
Per-transcript proportion of accurate bases (bases with accuracy higher than or equal to A) is split into intervals defined by P (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
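As a small illustration of per-base accuracy and the per-transcript proportion of accurate bases, the awk sketch below assumes a hypothetical two-column, tab-separated input giving, for each transcript position, the number of aligned read bases that match the transcript base and the total number of aligned read bases. The input layout and the function name accurate_base_prop are assumptions. Positions without aligned bases are skipped, which may differ from how CATS-rf handles them.

```shell
accurate_base_prop() {
  # $1 = minimum accuracy A; stdin: "matches<TAB>aligned" per transcript position
  awk -F '\t' -v A="$1" '
    $2 > 0 {
      n++
      if ($1 / $2 >= A) acc++          # per-base accuracy = matches / aligned
    }
    END { if (n > 0) printf "%.2f\n", acc / n }'
}

# Four positions with accuracies 1.00, 0.95, 0.90, and 1.00
printf '%s\t%s\n' 20 20 19 20 18 20 50 50 | accurate_base_prop 0.95   # prints 0.75
```

Three of the four positions reach the 0.95 threshold, so the proportion of accurate bases is 0.75.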
-L: Proportion of transcript length for positional accuracy distribution analysis, default: 0.01
Transcripts are split into fractional segments of size L for positional accuracy distribution analysis. Accuracy for each segment is calculated as mean accuracy within the segment. Positional analysis output contains assembly-level median accuracy for each transcript segment.
-K: Rolling window size for local accuracy calculation (in bp) when defining low-accuracy regions (LAR), default: 10
-Z: Local accuracy threshold for LAR characterization, default: 0.98
LARs are defined as rolling windows of size K with mean accuracy lower than or equal to Z.
-U: Per-transcript proportion of LAR bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,0.99,1"
Per-transcript proportion of LAR bases is split into intervals defined by U (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-E: LAR extension penalty, default: 0.1
Accuracy penalties assigned to LARs are controlled with E. Higher values of E increase the relative impact of LAR length on accuracy penalty.
Paired-end read analysis options
These options should only be supplied in paired-end mode.
-d: Maximum distance from transcript ends for reads with unmapped pair to be considered evidence of transcript end incompleteness or fragmentation (in bp), default: 40
Reads with unmapped pair that map to transcript ends are considered evidence of transcript end incompleteness or fragmentation. The size of the transcript end regions (in bp) used to identify such reads is controlled by d.
-c: Correction factor for distance outlier threshold calculation, default: 5
Read pair distance penalty calculation is controlled by x, X, and c. In the formulas below, d denotes the mapping distance between the two reads of a pair (not the d option described above). Read pairs are classified as mapping too far apart if their distance exceeds the lower distance threshold, defined as D1 = Q3(d) + x * (IQR(d) + c). Reads in such pairs are assigned a distance penalty Pd = d / D2, where D2 = Q3(d) + X * (IQR(d) + c), with the penalty capped at 1. Higher values of x increase the threshold for classifying read pairs as too distant, while X controls the scaling of the distance penalty. Higher values of c increase penalty robustness in libraries with a high proportion of overlapping read pairs.
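A worked numeric sketch of these formulas, using made-up inputs. The values of Q3(d), IQR(d), x, and X below are purely illustrative (this section does not state defaults for x and X); only c = 5 is the documented default. The function name distance_penalty is hypothetical, and reporting a penalty of 0 for pairs at or below D1 is an assumption of this sketch.

```shell
distance_penalty() {
  # Args: d Q3 IQR x X c  ->  prints D1, D2, and the penalty Pd
  awk -v d="$1" -v q3="$2" -v iqr="$3" -v x="$4" -v X="$5" -v c="$6" 'BEGIN {
    D1 = q3 + x * (iqr + c)      # threshold for "mapped too far apart"
    D2 = q3 + X * (iqr + c)      # scaling threshold for the penalty
    Pd = 0                       # assumption: no penalty when d <= D1
    if (d > D1) {
      Pd = d / D2
      if (Pd > 1) Pd = 1         # penalty is capped at 1
    }
    printf "D1=%g D2=%g Pd=%.3f\n", D1, D2, Pd
  }'
}

distance_penalty 700 300 45 4 10 5   # prints D1=500 D2=800 Pd=0.875
```

Here IQR(d) + c = 50, so D1 = 300 + 4 * 50 = 500 and D2 = 300 + 10 * 50 = 800; a pair mapped 700 bp apart exceeds D1 and receives a penalty of 700 / 800 = 0.875.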
-y: Per-transcript proportion of improperly paired reads within a transcript distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Improperly paired reads include reads with pair not mapped to the assembly, reads with pair mapped in an unexpected orientation, and reads with pair mapped too far apart. Per-transcript proportion of improperly paired reads within a transcript is split into intervals defined by y (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-f: Minimum number of bridging events for transcripts to be considered fragmented, default: 3
A transcript is considered fragmented if more than f reads representing bridging events map to transcript end regions.
-F: Per-transcript proportion of reads with pair mapped to another transcript distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Per-transcript proportion of reads with pair mapped to another transcript is split into intervals defined by F (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-a: Alpha compression factor for sigmoid transformation applied to bridge index during integrity score component calculation, default: 7
-b: Beta compression factor for sigmoid transformation applied to bridge index during integrity score component calculation, default: 0.5
Bridge index measures the proportion of reads with pair mapped to a different transcript and considers the mapping distance of such reads from the ends of their respective transcript. This definition gives more weight to bridging events near transcript ends. Integrity score component is calculated using a sigmoid transformation of bridge index. Compression factors a and b control the shape of the transformation: higher values of a increase sensitivity to fragmentation, while higher values of b reduce the likelihood of false-positive fragmentation penalties in transcripts with minimal bridging evidence.
General options
-t: Number of CPU threads, default: 10
Several steps of CATS-rf pipeline are parallelized. This includes read mapping, transcript quantification, read assignment, SAM/BAM file processing, positional coverage and accuracy calculation and analysis, as well as positional paired-end analysis. Recommended number of threads: 10-20.
-G: Percentage of available RAM used by GNU sort, default: 50
CATS-rf utilizes GNU sort in several steps of the pipeline. Higher values of G will ensure faster sorting, but may exhaust available RAM. In such scenarios, CATS-rf will resort to sorting with minimal RAM usage.
-M: Memory block size for GNU Parallel, default: 512M
Block size used by GNU Parallel when splitting the mapping table for read assignment is controlled by M. If sufficient RAM is available, increasing the value of M is recommended to minimize artifacts introduced by file splitting.
-T: Number of splits performed on positional and read pair mapping tables, default: 3
Positional and read pair mapping tables are split before analysis to reduce RAM usage. Increase the value of T when working with limited memory to further reduce RAM demands.
-O: Overwrite the CATS-rf output directory, default: off
-h: Show usage information
Output explanation
Summary table
Summary files represent the main output of CATS-rf. In paired-end mode, four summary files are produced:
assembly_score_summary.tsv contains descriptive statistics of transcript score components and the overall assembly score. The content of this file is printed when CATS-rf finishes running in paired-end mode.
coverage_and_accuracy_analysis_summary.tsv contains summarized results of coverage and accuracy analysis. The content of this file is printed when CATS-rf finishes running in single-end mode.
paired_end_read_analysis_summary.tsv contains summarized results of paired-end read analysis, including local fidelity and integrity analysis.
CATS-rf also produces several .tsv files containing detailed per-transcript metrics:
Transcript score components
transcript_scores.tsv contains CATS-rf score components and transcript score for each transcript.
Coverage analysis
coverage_stats.tsv contains coverage analysis results for each transcript:
Column: Description
transcript: Transcript name
covered_base_N: Number of covered bases
covered_base_prop: Proportion of covered bases
covered_base_prop_category: Proportion of covered bases category
coverage_mean: Mean transcript coverage
coverage_mean_category: Mean transcript coverage category
uncov_region_length_max: Maximum uncovered region length
transcript_end_coverage_mean: Mean transcript end coverage
lcr_base_N: Number of bases in low-coverage regions
lcr_base_prop: Proportion of bases in low-coverage regions
lcr_base_prop_category: Proportion of bases in low-coverage regions category
coverage_score_component: Coverage score component
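The per-transcript table lends itself to quick command-line filtering. The sketch below assumes coverage_stats.tsv is tab-separated with a header row carrying the column names listed above. It looks columns up by name rather than by position, so it does not depend on column order. The function name filter_by_column and the 0.5 cut-off are illustrative.

```shell
filter_by_column() {
  # $1 = column name, $2 = threshold; prints transcript and value where value > threshold
  awk -F '\t' -v col="$1" -v thr="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }   # map header names to positions
    ($(idx[col]) + 0) > (thr + 0) { print $(idx["transcript"]) "\t" $(idx[col]) }'
}

# Tiny stand-in for coverage_stats.tsv with three of its columns
printf 'transcript\tcoverage_mean\tlcr_base_prop\ntr1\t35.2\t0.04\ntr2\t2.1\t0.81\n' |
  filter_by_column lcr_base_prop 0.5
```

On real output, the same call would read the file directly, for example filter_by_column lcr_base_prop 0.5 < coverage_stats.tsv.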
per_base_coverage_distribution.tsv contains distribution of assembly-level per-base coverage.
relative_coverage_median_by_transcript_position.tsv contains median values of mean relative coverage per transcript fraction.
lcr_list.tsv contains low-coverage region coordinates.
Accuracy analysis
accuracy_stats.tsv contains accuracy analysis results for each transcript:
Column: Description
transcript: Transcript name
acc_base_N: Number of accurate bases
acc_base_prop: Proportion of accurate bases
acc_base_prop_category: Proportion of accurate bases category
lar_base_N: Number of bases in low-accuracy regions
lar_base_prop: Proportion of bases in low-accuracy regions
lar_base_prop_category: Proportion of bases in low-accuracy regions category
accuracy_score_component: Accuracy score component
per_base_accuracy_distribution.tsv contains distribution of assembly-level per-base accuracy.
accuracy_median_by_transcript_position.tsv contains median values of mean accuracy per transcript fraction.
lar_list.tsv contains low-accuracy region coordinates.
Local fidelity analysis
local_fidelity_stats.tsv contains local fidelity analysis results for each transcript:
Column: Description
transcript: Transcript name
unmapped_pair_read_N: Number of reads with pair not mapped to the assembly
unmapped_pair_read_prop: Proportion of reads with pair not mapped to the assembly
unmapped_pair_tr_end_read_N: Number of reads with pair not mapped to the assembly on transcript ends
unmapped_pair_tr_end_read_prop: Proportion of reads with pair not mapped to the assembly on transcript ends
improp_pair_orientation_read_N: Number of reads with pair mapped in an unexpected orientation
improp_pair_orientation_read_prop: Proportion of reads with pair mapped in an unexpected orientation
improp_pair_distance_read_N: Number of reads with pair mapped too far apart
transcript_distance_penalty: Transcript distance penalty
improp_pair_within_tr_read_N: Number of improperly paired reads within a transcript
improp_pair_within_tr_read_prop: Proportion of improperly paired reads within a transcript
improp_pair_within_tr_read_prop_category: Proportion of improperly paired reads within a transcript category
local_fidelity_score_component: Local fidelity score component
read_pairs_mapping_in_unexpected_orientation.tsv contains coordinates of read pairs mapping in an unexpected orientation.
read_pairs_mapping_too_far_apart.tsv contains coordinates of read pairs mapping too far apart.
Integrity analysis
integrity_stats.tsv contains integrity analysis results for each transcript:
Column: Description
transcript: Transcript name
pair_mapped_to_other_tr_N: Number of reads with pair mapped to another transcript
pair_mapped_to_other_tr_prop: Proportion of reads with pair mapped to another transcript
pair_mapped_to_other_tr_prop_category: Proportion of reads with pair mapped to another transcript category
bridge_N: Number of reads representing bridging events
bridge_prop: Proportion of reads representing bridging events on transcript ends
integrity_score_component: Integrity score component
read_pairs_mapping_to_different_transcripts.tsv contains coordinates of read pairs mapping to different transcripts.
Assembly comparison with CATS_rf_compare
CATS-rf also supports direct comparison of multiple analyzed assemblies. The CATS_rf_compare script generates summary tables and visualizations that compare the most significant CATS-rf results of each assembly. To use it, run CATS-rf on each assembly individually, then supply the resulting CATS-rf output directories as input to CATS_rf_compare.
CATS_rf_compare dependencies
CATS_rf_compare requires the following dependencies:
R (Rscript) executable must be included in PATH. Tools denoted with (R) correspond to R packages and can be installed via conda or directly in R with the supplied commands.
CATS_rf_compare example usage
CATS_rf_compare requires one or more CATS-rf output directories as input.
While CATS_rf_compare is primarily designed to compare multiple transcriptome assemblies, it can also be used with a single assembly to visualize its CATS-rf results.
Example CATS_rf_compare usage:
CATS_rf_compare [OPTIONS] CATS_RF_DIR ...
Detailed CATS_rf_compare options
CATS_rf_compare offers a comprehensive list of options which allow users to control the graphical and general comparison parameters.
Graphical options
-x: Figure extension, default: png
-d: Figure DPI, default: 600
Extension (device) and DPI of each plotted figure are controlled with x and d, respectively.
-r: Raincloud plot colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted Set1 palette from RColorBrewer package
Raincloud plot densities are normalized for each transcriptome assembly. Boxplots within raincloud plots mark the distribution median, Q1, and Q3, with whiskers extending from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR of the distribution.
All color sets (r, l, H, and b) should be supplied as R color names or hexadecimal codes separated with commas and enclosed in quotes (e.g. "#FDAF4A,#DC151D"). R color cheatsheet is available here.
-l: Lineplot colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted Set1 palette from RColorBrewer package
-H: Histogram colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted Set1 palette from RColorBrewer package
-b: Barplot colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted YlOrRd palette from RColorBrewer package
-q: Maximum right-tail distribution quantile for histograms, default: 0.98
Histograms show relative density per transcriptome assembly and omit right-tail extreme values for visualization purposes. The x-axis in all histograms is square-root scaled.
General options
-t: Number of CPU threads, default: 10
Several steps of CATS_rf_compare are parallelized. This mainly includes operations performed by the data.table package. Recommended number of threads: 8-12.
-O: Overwrite the comparison output directory, default: off
-h: Show usage information
CATS_rf_compare output explanation
The analysis is summarized in the CATS_rf_comparison.html HTML file.
An example of the HTML output is provided here.
Note on transcriptome assembly order and names: Assemblies will appear in the order they were provided on the command line when running the tool. For visualization purposes, assembly names are limited to a maximum of 20 characters; names exceeding this limit will be truncated. If multiple assemblies share the same name, a numeric suffix (e.g., .1, .2, etc.) will be appended to distinguish these assemblies.
Summary tables
CATS_rf_compare aggregates individual summary tables into comprehensive joint tables encompassing all analyzed transcriptome assemblies:
CATS_rf_general_statistics.tsv contains aggregated CATS-rf general statistics table.
CATS_rf_compare produces several figures, providing a detailed visualization of CATS-rf quality metrics.
transcript_score visualizes the distribution of transcript scores.
base_coverage and base_accuracy visualize the distribution of per-base coverage/accuracy.
proportion_of_covered_bases visualizes the distribution of the proportion of covered bases per transcript.
mean_transcript_coverage visualizes the distribution of mean transcript coverage.
positional_relative_coverage_median and positional_accuracy_median visualize the positional relative coverage/accuracy distribution.
maximum_uncovered_region_length visualizes the distribution of maximum uncovered region length per transcript.
mean_transcript_end_coverage visualizes the distribution of mean transcript end coverage.
proportion_of_bases_in_lcrs and proportion_of_bases_in_lars visualize the distribution of the proportion of bases in LCRs/LARs per transcript.
lcr_length and lar_length visualize the distribution of LCR/LAR length.
coverage_score_component visualizes the distribution of coverage score component per transcript.
proportion_of_accurate_bases visualizes the distribution of the proportion of accurate bases per transcript.
accuracy_score_component visualizes the distribution of accuracy score component per transcript.
proportion_of_improperly_paired_reads visualizes the per-transcript distribution of the proportion of improperly paired reads within a transcript.
local_fidelity_score_component visualizes the distribution of local fidelity score component per transcript.
prop_reads_with_pair_mapped_to_another_tr visualizes the per-transcript distribution of the proportion of reads with pair mapped to another transcript.
integrity_score_component visualizes the distribution of integrity score component per transcript.
Citation
CATS is an academic software distributed under the MIT license.
Bodulić, K. and Vlahoviček, K. (2025). Comprehensive Transcriptome Quality Assessment Using CATS: Reference‑free and Reference‑based Approaches. bioRxiv. https://doi.org/10.1101/2025.07.22.666112
Troubleshooting
Please report all potential bugs in the Issues tracker.
Singularity
If you run into Singularity errors involving the default TMPDIR environment variable, set TMPDIR to your current working directory so the container uses a writable location for temporary files:
export TMPDIR="$(pwd)"
Conda installation
In case of dependency conflicts, try setting the following channel priority:
CATS-rf
Multipage documentation: https://bodulic.github.io/CATS-rf/
Table of Contents
CATS_rf_compareIntroduction
CATS-rf is the reference-free module of the CATS (Comprehensive Assessment of Transcript Sequences) framework. It evaluates the quality of transcriptomes assembled de novo from short reads, relying solely on RNA-seq reads used in the assembly construction. The pipeline maps reads back to the assembled transcripts and examines mapping evidence suggesting misassembly. Quality evaluation is performed at the transcript level, integrating four score components each targeting specific assembly errors:
Transcript quality score St is calculated as the product of the described score components, equally weighting detected assembly errors. Assembly score S is computed as the mean of individual transcript scores. All components and scores are normalized to a range between 0 and 1, where higher values indicate better quality.
In addition to transcript scores, CATS-rf provides a comprehensive set of assembly metrics, including transcript length and composition statistics, read mapping rates, positional coverage and accuracy profiles, and pair mapping consistency metrics.
CATS-rf consistently displays stronger performance than currently existing reference-free transcriptome assembly evaluation tools. For detailed benchmarks and methodology, please refer to the CATS preprint
Installation
Running CATS-rf via Docker
Starting from version 1.0.2, CATS-rf can be run using Docker:
Build the Docker image with:
Run the container with:
Running CATS-rf via Singularity
Build the Singularity image with:
Run the container with:
Installing CATS-rf via conda
CATS-rf and its dependencies can be directly installed via Bioconda:
In case of dependency conflicts, please see the Troubleshooting section.
Installing CATS-rf from source
CATS-rf consists of Bash and R scripts located in the
scriptsdirectory of this repository. After cloning the repository, all CATS-rf scripts must be included in thePATHenvironment variable. The following dependencies are required:conda install conda-forge::r-baseconda install conda-forge::r-data.tableconda install -c bioconda bowtie2conda install -c bioconda samtoolsconda install -c bioconda kallistoconda install conda-forge::parallelconda install -c bioconda bedtoolsconda install -c bioconda bedtoolsconda install -c bioconda pysamstatsR, Bowtie2, Samtools, kallisto, GNU Parallel, bedtools (bamToBed), and pysamstats executables must be included in
PATH. R package data.table can be installed via conda or directly in R withinstall.packages("data.table")MacOS
If you are using MacOS, Bash (version >= 4.0) and GNU versions of core utilities are required. In this case,
PATHshould be adjusted so that CATS-rf uses GNU versions of core utilities:PATH(adjust path depending on your architecture):For Apple Silicon:
For Intel-based configurations:
The stated changes can be made permanent by modifying the appropriate .rc file.
Test data
CATS-rf installation can be tested using instructions and files located in
test_datadirectory.Using typical hardware and default settings, the total runtime of CATS-rf on the provided test data is approximately 1 minute.
For real-world datasets, runtime scales with sequencing depth. On average-depth human transcriptomes (~20 million reads), CATS-rf typically completes in approximately 1 hour.
Example usage
CATS-rf requires a transcriptome assembly in FASTA format, along with short RNA-seq reads used during assembly in either FASTQ or FASTA format. Compressed (.gz) read files are supported.
CATS-rf supports both paired-end and single-end library configurations.
Example paired-end mode usage:
Example single-end mode usage:
Single-end mode requires three options to be specified:
Cfor library configuration,mfor mean fragment size, andsfor standard deviation of fragment size. Note that single-end runs will output only general assembly statistics, read mapping metrics, and positional coverage and accuracy analysis.Detailed options
CATS-rf offers a comprehensive list of options which allow users to control the analysis parameters.
Library type options
-C: Paired- vs. single-end library configuration: pe = paired-end, se = single-end, default: pe-S: Library strandness, fr = forward-reverse, rf = reverse-forward, u = unstranded, a = automatic detection, default: uCATS-rf can leverage strandness information when quantifying transcripts and calculating the local fidelity score component. When the automatic detection option is enabled, strandness is estimated using the first 100 000 read mappings.
While CATS-rf was primarily tested on Illumina data, the analysis can be run on assemblies generated from other short-read platforms. In such scenario,
Sshould be adjusted accordingly. If the strandness of the data is unknown, it is recommended to use either unstranded mode or automatic detection. Note that in unstranded mode, read pairs are expected to map to opposite strands. This is consistent with the behavior of virtually every short-read sequencing technology.-Q: Phred quality encoding of FASTQ files, 33 = phred33, 64 = phred64, default: 33Read mapping, transcript quantification, and read assignment options
-R: Random seed for read mapping, transcript quantification, and read assignment, default: 12345Random seed is defined to ensure reproducible CATS-rf runs.
-N: Maximum number of distinct mappings per read, default: 10The value of
Nshould be increased for complex transcriptome assemblies that contain a large number of isoforms, and decreased for simpler assemblies with fewer isoforms to maximize performance and accuracy. Note that Bowtie2 mapping parameters are optimized to detect transcript errors, while minimizing the number of false-positive mappings. Furthermore, secondary mappings of each read are filtered based on edit distance.-m: Estimated mean of fragment length needed for transcript quantification (single-end mode only)-s: Estimated standard deviation of fragment length needed for transcript quantification (single-end mode only)Fragment length distribution parameters
mandnare required in single-end mode for transcript quantification by kallisto.Coverage analysis options
-i: Per-base coverage distribution breakpoints (specified with x,y,z…), default: "0,5,10,20,40,60,80,100"
Per-base coverage is split into intervals defined by -i (e.g. [0-5>, [5-10>…). This category variable is used for plotting by the CATS_rf_compare script.
All category variable breaks (-i, -p, -r, -u, -I, -P, -U, -y, and -F) should be supplied as strings separated with commas and enclosed in quotes (e.g. "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1").
-p: Per-transcript proportion of covered bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Per-transcript proportion of covered bases is split into intervals defined by -p (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-r: Mean transcript coverage distribution breakpoints (specified with x,y,z…), default: "0,5,10,20,40,60,80,100"
Mean transcript coverage is split into intervals defined by -r (e.g. [0-5>, [5-10>…). This category variable is used for plotting by the CATS_rf_compare script.
-l: Proportion of transcript length for positional relative coverage distribution analysis, default: 0.01
Transcripts are split into fractional segments of the size set by -l for positional relative coverage distribution analysis. Coverage is expressed relative to the base with the highest coverage within the same transcript. Relative coverage for each segment is calculated as the mean relative coverage within the segment. Positional analysis output contains the assembly-level median relative coverage for each transcript segment.
-n: Proportion of transcript length for transcript end definition when calculating mean transcript end coverage, default: 0.02
-k: Rolling window size for local coverage calculation (in bp) when defining low-coverage regions (LCR), default: 10
-z: Local coverage threshold for LCR characterization, default: 3
LCRs are defined as rolling windows of the size set by -k with mean coverage lower than or equal to the -z threshold.
-u: Per-transcript proportion of LCR bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Per-transcript proportion of LCR bases is split into intervals defined by -u (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-w: Base coverage weight, default: 1.5
-e: LCR extension penalty, default: 0.5
Coverage penalties assigned to LCRs are controlled by -w and -e. Lower values of -w and higher values of -e increase the relative impact of LCR length on coverage penalty.
Accuracy analysis options
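Low-accuracy regions (LARs), configured with -K and -Z in this section, are rolling windows whose mean accuracy is at or below a threshold. Low-coverage regions (LCRs, -k and -z) follow the same pattern with coverage in place of accuracy. The Python sketch below illustrates the idea under stated assumptions: the function name low_value_regions is hypothetical, and merging overlapping flagged windows into a single region is an assumption about how region coordinates might be reported, not a description of the CATS-rf code.

```python
def low_value_regions(values, window, threshold):
    """Return 0-based, end-exclusive regions covered by flagged windows.

    A window is flagged when the mean of its `window` consecutive values
    is lower than or equal to `threshold`.
    """
    regions = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        if mean <= threshold:
            end = start + window
            # Assumption: overlapping or adjacent flagged windows are merged.
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end
            else:
                regions.append([start, end])
    return [tuple(region) for region in regions]


# Per-base accuracy with a short dip in the middle of a 12 bp transcript.
accuracy = [1.0] * 5 + [0.5] * 2 + [1.0] * 5
print(low_value_regions(accuracy, window=4, threshold=0.9))
```

Because every window that overlaps the dip is flagged, the reported region extends beyond the two low-accuracy bases themselves.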
-I: Per-base accuracy distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,0.99,1"
Accuracy is defined as the proportion of aligned read bases matching the transcript base. Per-base accuracy is split into intervals defined by -I (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-A: Minimum accuracy for a base to be considered accurate, default: 0.95
-P: Per-transcript proportion of accurate bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,0.99,1"
Per-transcript proportion of accurate bases (bases with accuracy higher than or equal to the -A threshold) is split into intervals defined by -P (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-L: Proportion of transcript length for positional accuracy distribution analysis, default: 0.01
Transcripts are split into fractional segments of the size set by -L for positional accuracy distribution analysis. Accuracy for each segment is calculated as the mean accuracy within the segment. Positional analysis output contains the assembly-level median accuracy for each transcript segment.
-K: Rolling window size for local accuracy calculation (in bp) when defining low-accuracy regions (LAR), default: 10
-Z: Local accuracy threshold for LAR characterization, default: 0.98
LARs are defined as rolling windows of the size set by -K with mean accuracy lower than or equal to the -Z threshold.
-U: Per-transcript proportion of LAR bases distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,0.99,1"
Per-transcript proportion of LAR bases is split into intervals defined by -U (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-E: LAR extension penalty, default: 0.1
Accuracy penalties assigned to LARs are controlled with -E. Higher values of -E increase the relative impact of LAR length on accuracy penalty.
Paired-end read analysis options
These options should only be supplied in paired-end mode.
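The read pair distance thresholds used in this section, D1 = Q3(d) + x * (IQR(d) + c) and D2 = Q3(d) + X * (IQR(d) + c), can be worked through with a short Python sketch. This is an illustration only: the function name distance_penalties is hypothetical, the quartiles are computed with Python's "inclusive" method (the method used by CATS-rf is not stated here), and computing them over a single list of distances is an assumption.

```python
from statistics import quantiles


def distance_penalties(distances, x=8, X=10, c=5):
    """Return (D1, D2, [(distance, penalty), ...]) for pairs mapped too far apart."""
    # Assumption: quartiles use linear interpolation ("inclusive" method).
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    iqr = q3 - q1
    d1 = q3 + x * (iqr + c)  # lower threshold: pairs above it are outliers
    d2 = q3 + X * (iqr + c)  # higher threshold: scales the penalty
    penalties = [(d, min(d / d2, 1.0)) for d in distances if d > d1]
    return d1, d2, penalties


# Toy read pair distances (bp); only the last pair exceeds D1.
d1, d2, penalties = distance_penalties([180, 200, 210, 220, 450])
print(d1, d2, penalties)
```

With these toy values Q3 = 220 and IQR = 20, so D1 = 220 + 8 * 25 = 420 and D2 = 220 + 10 * 25 = 470. The pair at 450 bp is flagged and receives a penalty of 450 / 470, while a pair at 900 bp would be capped at 1.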
-d: Maximum distance from transcript ends for reads with unmapped pair to be considered evidence of transcript end incompleteness or fragmentation (in bp), default: 40
Reads with an unmapped pair that map to transcript ends are considered evidence of transcript end incompleteness or fragmentation. The size of the transcript end regions used to identify such reads is controlled by -d.
-x: Multiplicative factor for lower distance outlier threshold calculation, default: 8
-X: Multiplicative factor for higher distance outlier threshold calculation, default: 10
-c: Correction factor for distance outlier threshold calculation, default: 5
Read pair distance penalty calculation is controlled by -x, -X, and -c. Read pairs are classified as mapping too far apart if their mapping distance d exceeds the lower distance threshold, defined as D1 = Q3(d) + x * (IQR(d) + c). These reads are assigned a distance penalty Pd = d / D2, where D2 = Q3(d) + X * (IQR(d) + c), with the penalty capped at 1. Higher values of -x increase the threshold for classifying read pairs as too distant, while -X controls the scaling of the distance penalty. Higher values of -c increase penalty robustness in libraries with a high proportion of overlapping read pairs.
-y: Per-transcript proportion of improperly paired reads within a transcript distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Improperly paired reads include reads with pair not mapped to the assembly, reads with pair mapped in an unexpected orientation, and reads with pair mapped too far apart. Per-transcript proportion of improperly paired reads within a transcript is split into intervals defined by -y (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-f: Minimum number of bridging events for transcripts to be considered fragmented, default: 3
A transcript is considered fragmented if more than the number of reads set by -f represent bridging events and map to transcript end regions.
-F: Per-transcript proportion of reads with pair mapped to another transcript distribution breakpoints (specified with x,y,z…), default: "0,0.2,0.4,0.6,0.8,0.85,0.9,0.95,1"
Per-transcript proportion of reads with pair mapped to another transcript is split into intervals defined by -F (e.g. [0-0.2>, [0.2-0.4>…). This category variable is used for plotting by the CATS_rf_compare script.
-a: Alpha compression factor for sigmoid transformation applied to bridge index during integrity score component calculation, default: 7
-b: Beta compression factor for sigmoid transformation applied to bridge index during integrity score component calculation, default: 0.5
Bridge index measures the proportion of reads with pair mapped to a different transcript and considers the mapping distance of such reads from the ends of their respective transcript. This definition gives more weight to bridging events near transcript ends. The integrity score component is calculated using a sigmoid transformation of the bridge index. Compression factors -a and -b control the shape of the transformation: higher values of -a increase sensitivity to fragmentation, while higher values of -b reduce the likelihood of false-positive fragmentation penalties in transcripts with minimal bridging evidence.
General options
-t: Number of CPU threads, default: 10
Several steps of the CATS-rf pipeline are parallelized, including read mapping, transcript quantification, read assignment, SAM/BAM file processing, positional coverage and accuracy calculation and analysis, and positional paired-end analysis. Recommended number of threads: 10-20.
-G: Percentage of available RAM used by GNU sort, default: 50
CATS-rf utilizes GNU sort in several steps of the pipeline. Higher values of -G will ensure faster sorting, but may exhaust available RAM. In such scenarios, CATS-rf will resort to sorting with minimal RAM usage.
-M: Memory block size for GNU Parallel, default: 512M
Block size used by GNU Parallel when splitting the mapping table for read assignment is controlled by -M. If sufficient RAM is available, increasing the value of -M is recommended to minimize artifacts introduced by file splitting.
-T: Number of splits performed on positional and read pair mapping tables, default: 3
Positional and read pair mapping tables are split before analysis to reduce RAM usage. Increase the value of -T when working with limited memory to further reduce RAM demands.
-D: CATS-rf output directory name, default: TRANSCRIPTOME_CATS_rf_dir
-o: CATS-rf output file prefix, default: TRANSCRIPTOME
-O: Overwrite the CATS-rf output directory, default: off
-h: Show usage information
Output explanation
Summary table
Summary files represent the main output of CATS-rf. In paired-end mode, four summary files are produced:
assembly_score_summary.tsv contains descriptive statistics of transcript score components and the overall assembly score. The content of this file is printed when CATS-rf finishes running in paired-end mode.
general_statistics_table.tsv contains descriptive statistics of transcript length (mean, median, interquartile range, range, N50, L50, N90, L90), GC content, and read mapping rate.
coverage_and_accuracy_analysis_summary.tsv contains summarized results of coverage and accuracy analysis. The content of this file is printed when CATS-rf finishes running in single-end mode.
paired_end_read_analysis_summary.tsv contains summarized results of paired-end read analysis, including local fidelity and integrity analysis.
CATS-rf also produces several .tsv files containing detailed per-transcript metrics:
Transcript score components
transcript_scores.tsv contains CATS-rf score components and transcript score for each transcript.
Coverage analysis
coverage_stats.tsv contains coverage analysis results for each transcript, with the following columns:
transcript, covered_base_N, covered_base_prop, covered_base_prop_category, coverage_mean, coverage_mean_category, uncov_region_length_max, transcript_end_coverage_mean, lcr_base_N, lcr_base_prop, lcr_base_prop_category, coverage_score_component
per_base_coverage_distribution.tsv contains distribution of assembly-level per-base coverage.
relative_coverage_median_by_transcript_position.tsv contains median values of mean relative coverage per transcript fraction.
lcr_list.tsv contains low-coverage region coordinates.
Accuracy analysis
accuracy_stats.tsv contains accuracy analysis results for each transcript, with the following columns:
transcript, acc_base_N, acc_base_prop, acc_base_prop_category, lar_base_N, lar_base_prop, lar_base_prop_category, accuracy_score_component
per_base_accuracy_distribution.tsv contains distribution of assembly-level per-base accuracy.
accuracy_median_by_transcript_position.tsv contains median values of mean accuracy per transcript fraction.
lar_list.tsv contains low-accuracy region coordinates.
Local fidelity analysis
local_fidelity_stats.tsv contains local fidelity analysis results for each transcript, with the following columns:
transcript, unmapped_pair_read_N, unmapped_pair_read_prop, unmapped_pair_tr_end_read_N, unmapped_pair_tr_end_read_prop, improp_pair_orientation_read_N, improp_pair_orientation_read_prop, improp_pair_distance_read_N, transcript_distance_penalty, improp_pair_within_tr_read_N, improp_pair_within_tr_read_prop, improp_pair_within_tr_read_prop_category, local_fidelity_score_component
read_pairs_mapping_in_unexpected_orientation.tsv contains coordinates of read pairs mapping in an unexpected orientation.
read_pairs_mapping_too_far_apart.tsv contains coordinates of read pairs mapping too far apart.
Integrity analysis
integrity_stats.tsv contains integrity analysis results for each transcript, with the following columns:
transcript, pair_mapped_to_other_tr_N, pair_mapped_to_other_tr_prop, pair_mapped_to_other_tr_prop_category, bridge_N, bridge_prop, integrity_score_component
read_pairs_mapping_to_different_transcripts.tsv contains coordinates of read pairs mapping to different transcripts.
Assembly comparison with CATS_rf_compare
CATS-rf also supports direct comparison of multiple analyzed assemblies. The CATS_rf_compare script generates summary tables and visualizations that compare the most significant CATS-rf results of each assembly. As such, CATS-rf should be run on each individual assembly, and the resulting CATS-rf output directories should then act as input to CATS_rf_compare.
CATS_rf_compare dependencies
CATS_rf_compare requires the following dependencies:
R: conda install conda-forge::r-base
pandoc: conda install conda-forge::pandoc
rmarkdown (R): conda install conda-forge::r-rmarkdown or install.packages("rmarkdown")
data.table (R): conda install conda-forge::r-data.table or install.packages("data.table")
ggplot2 (R): conda install conda-forge::r-ggplot2 or install.packages("ggplot2")
ggdist (R): conda install conda-forge::r-ggdist or install.packages("ggdist")
R (Rscript) executable must be included in PATH. Tools denoted with (R) correspond to R packages and can be installed via conda or directly in R with the supplied commands.
CATS_rf_compare example usage
CATS_rf_compare requires one or more CATS-rf output directories as input.
While CATS_rf_compare is primarily designed to compare multiple transcriptome assemblies, it can also be used with a single assembly to visualize its CATS-rf results.
Example CATS_rf_compare usage:
Detailed CATS_rf_compare options
CATS_rf_compare offers a comprehensive list of options which allow users to control the graphical and general comparison parameters.
Graphical options
-x: Figure extension, default: png
-d: Figure DPI, default: 600
Extension (device) and DPI of each plotted figure are controlled with -x and -d, respectively.
-r: Raincloud plot colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted Set1 palette from RColorBrewer package
Raincloud plot densities are normalized for each transcriptome assembly. Boxplots within raincloud plots mark the distribution median, Q1, and Q3, with whiskers extending from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR of the distribution.
All color sets (-r, -l, -H, and -b) should be supplied as R color names or hexadecimal codes separated with commas and enclosed in quotes (e.g. "#FDAF4A,#DC151D"). R color cheatsheet is available here.
-l: Lineplot colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted Set1 palette from RColorBrewer package
-H: Histogram colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted Set1 palette from RColorBrewer package
-b: Barplot colors (quoted hexadecimal codes or R color names, specified with x,y,z…), default: adjusted YlOrRd palette from RColorBrewer package
-q: Maximum right-tail distribution quantile for histograms, default: 0.98
Histograms show relative density per transcriptome assembly and omit right-tail extreme values for visualization purposes. The x-axis in all histograms is square-root scaled.
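To make the effect of -q concrete, the following Python sketch drops values above a chosen right-tail quantile before a histogram would be drawn. It only illustrates the idea: the function name trim_right_tail is hypothetical, and both the linear-interpolation quantile and the choice to keep values equal to the cutoff are assumptions rather than documented CATS_rf_compare behavior.

```python
import math


def trim_right_tail(values, q=0.98):
    """Keep values at or below the q-th quantile of `values`."""
    ordered = sorted(values)
    # Assumption: quantile by linear interpolation between order statistics.
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    cutoff = ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    return [v for v in values if v <= cutoff]


# 99 short low-coverage regions and one extreme outlier.
lcr_lengths = [10] * 99 + [5000]
print(len(trim_right_tail(lcr_lengths)))
```

Trimming in this way keeps a single extreme value from stretching the x-axis and flattening the rest of the distribution.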
General options
-t: Number of CPU threads, default: 10
Several steps of CATS_rf_compare are parallelized. This mainly includes operations performed by the data.table package. Recommended number of threads: 8-12.
-D: Comparison output directory name, default: CATS_rf_comparison
-O: Overwrite the comparison output directory, default: off
-h: Show usage information
CATS_rf_compare output explanation
The analysis is summarized in the CATS_rf_comparison.html HTML file. An example of the HTML output is provided here.
Note on transcriptome assembly order and names: Assemblies will appear in the order they were provided on the command line when running the tool. For visualization purposes, assembly names are limited to a maximum of 20 characters; names exceeding this limit will be truncated. If multiple assemblies share the same name, a numeric suffix (e.g., .1, .2, etc.) will be appended to distinguish these assemblies.
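The naming rules in the note above can be sketched in Python as follows. This is a hedged illustration, not the CATS_rf_compare code: the function name display_names is hypothetical, and two details are assumptions, namely that truncation happens before duplicates are resolved and that the first occurrence of a name is left without a suffix.

```python
def display_names(names, max_len=20):
    """Truncate assembly names and append .1, .2, ... to repeated names."""
    seen = {}
    result = []
    for name in names:
        short = name[:max_len]
        count = seen.get(short, 0)
        seen[short] = count + 1
        # Assumption: the first occurrence keeps the plain (truncated) name.
        result.append(short if count == 0 else f"{short}.{count}")
    return result


print(display_names(["trinity", "trinity", "spades_long_assembly_name_v2"]))
```

Input order is preserved, matching the statement that assemblies appear in the order given on the command line.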
Summary tables
CATS_rf_compare aggregates individual summary tables into comprehensive joint tables encompassing all analyzed transcriptome assemblies:
CATS_rf_general_statistics.tsv contains aggregated CATS-rf general statistics table.
CATS_rf_assembly_scores.tsv contains aggregated CATS-rf score component statistics and overall assembly score table.
CATS_rf_coverage_accuracy_statistics.tsv contains aggregated CATS-rf coverage and accuracy analysis table.
CATS_rf_local_fidelity_integrity_statistics.tsv contains aggregated CATS-rf paired-end read analysis table.
Figures
CATS_rf_compare produces several figures, providing a detailed visualization of CATS-rf quality metrics.
transcript_score visualizes the distribution of transcript scores.
base_coverage and base_accuracy visualize the distribution of per-base coverage/accuracy.
proportion_of_covered_bases visualizes the distribution of the proportion of covered bases per transcript.
mean_transcript_coverage visualizes the distribution of mean transcript coverage.
positional_relative_coverage_median and positional_accuracy_median visualize the positional relative coverage/accuracy distribution.
maximum_uncovered_region_length visualizes the distribution of maximum uncovered region length per transcript.
mean_transcript_end_coverage visualizes the distribution of mean transcript end coverage.
proportion_of_bases_in_lcrs and proportion_of_bases_in_lars visualize the distribution of the proportion of bases in LCRs/LARs per transcript.
lcr_length and lar_length visualize the distribution of LCR/LAR length.
coverage_score_component visualizes the distribution of coverage score component per transcript.
proportion_of_accurate_bases visualizes the distribution of the proportion of accurate bases per transcript.
accuracy_score_component visualizes the distribution of accuracy score component per transcript.
proportion_of_improperly_paired_reads visualizes the per-transcript distribution of the proportion of improperly paired reads within a transcript.
local_fidelity_score_component visualizes the distribution of local fidelity score component per transcript.
prop_reads_with_pair_mapped_to_another_tr visualizes the per-transcript distribution of the proportion of reads with pair mapped to another transcript.
integrity_score_component visualizes the distribution of integrity score component per transcript.
Citation
CATS is an academic software distributed under the MIT license.
Copyright © 2025 Kristian Bodulić
If you use CATS, please cite the CATS preprint:
Bodulić, K. and Vlahoviček, K. (2025). Comprehensive Transcriptome Quality Assessment Using CATS: Reference‑free and Reference‑based Approaches. bioRxiv. https://doi.org/10.1101/2025.07.22.666112
Troubleshooting
Please report all potential bugs in the Issues tracker.
Singularity
If you run into Singularity errors involving the default TMPDIR environment variable, set TMPDIR to your current working directory so the container uses a writable location for temporary files:
Conda installation
In case of dependency conflicts, try setting the following channel priority:
If this does not work, try setting channel priority to flexible:
Changelog
Version 1.0.4: Fixed N50/N90 calculation, September 24, 2025.
Version 1.0.3: Fixed automatic library strandness detection, September 5, 2025.
Version 1.0.2: Added gawk to dependency list. Code polishing, August 22, 2025.
Version 1.0.1: Moved executable and R package testing after the getopts call, July 24, 2025.
Version 1.0.0: Initial commit, July 10, 2025.