Aletsch implements an efficient algorithm to assemble multiple RNA-seq samples (or multiple cells
for single-cell RNA-seq data).
The datasets and scripts used to compare the performance of Aletsch with other assemblers are available at
aletsch-test.
Installation
Aletsch can be installed through conda
or by compiling source (see INSTALLATION).
Note: The directories <profile> (given with -p) and <gtf> (given with -d) must exist before execution.
Format of Input and Output
Each line of input-bam-list describes a single sample, with 3 fields separated by spaces.
The 3 fields are: alignment-file (in .bam format), index-alignment-file (in .bai format), and protocol.
The index-alignment-file can be generated using samtools (e.g., samtools index ...).
The protocol is chosen from the following 5 options: single_end (for Illumina single-end RNA-seq protocol),
paired_end (for Illumina paired-end RNA-seq protocol),
pacbio_ccs (for PacBio Iso-Seq CCS reads),
pacbio_sub (for PacBio Iso-Seq sub-reads),
ont (for Oxford Nanopore RNA-seq).
Aletsch uses different parameters and algorithms to process different data types.
Aletsch requires that each input alignment file is sorted; if a file is not sorted, sort it with samtools (samtools sort input.bam > input.sort.bam).
The assembled transcripts from all these samples will be written to output.gtf, in standard .gtf format.
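The input format described above can be checked before an assembly run. The following is a minimal sketch, not part of Aletsch: the names `Sample`, `parse_bam_list_line`, and `parse_bam_list` are hypothetical, and skipping blank lines is an assumption made for convenience. Because fields are split on whitespace, paths containing spaces are not supported by this sketch.

```python
from typing import List, NamedTuple

# The five protocol names listed in the text above.
PROTOCOLS = {"single_end", "paired_end", "pacbio_ccs", "pacbio_sub", "ont"}


class Sample(NamedTuple):
    """One line of input-bam-list (hypothetical helper type)."""
    bam: str
    bai: str
    protocol: str


def parse_bam_list_line(line: str) -> Sample:
    """Split one line into its 3 fields and validate the protocol name."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    bam, bai, protocol = fields
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol {protocol!r}")
    return Sample(bam, bai, protocol)


def parse_bam_list(text: str) -> List[Sample]:
    """Parse a whole list; blank lines are skipped (an assumption)."""
    return [parse_bam_list_line(l) for l in text.splitlines() if l.strip()]
```

Calling `parse_bam_list(open("input-bam-list").read())` would raise a `ValueError` on the first malformed line, which is cheaper than discovering the problem partway through an assembly.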
Options
Aletsch provides several options for transcript assembly, supporting both its unique parameters and those required by the core algorithm of Scallop. For a detailed list, execute aletsch without arguments.
| Parameters | Type | Default Value | Description |
| --- | --- | --- | --- |
| --help | | | Displays Aletsch usage information and exits. |
| --version | | | Shows Aletsch version information and exits. |
| --profile | | | Profiles individual samples and exits. Writes to files if -p is specified. |
| -l | string | | Specifies chromosomes to assemble. |
| -L | string | | Specifies a file containing a list of chromosomes to assemble. |
| -d | string | | Output directory for individual sample transcripts. Directory must exist prior to execution. |
| -p | string | | Directory for reading/saving individual sample profiles. Directory must exist prior to execution. |
| -t | integer | 10 | Number of threads. |
| -c | integer | 200 | Maximum number of splice graphs in a cluster, recommended as twice the number of samples. |
| -s | float | 0.2 | Minimum similarity for combining two splice graphs. |
If -l string or -L file option is provided, Aletsch assembles only the specified chromosomes; otherwise, it assembles all chromosomes.
Directories specified by -d and -p must exist before running Aletsch; the tool does not create directories.
With --profile, Aletsch infers profiles of individual samples, using the XS tag from input BAM files.
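Two of the points above lend themselves to a small pre-flight check: the directories for -d and -p must already exist, and -c is recommended to be twice the number of samples. The sketch below is an illustration under those stated rules only; the function names `recommended_cluster_size` and `missing_directories` are hypothetical and not part of Aletsch.

```python
import os
from typing import List


def recommended_cluster_size(num_samples: int) -> int:
    """Value for -c following the 'twice the number of samples' recommendation."""
    return 2 * num_samples


def missing_directories(*dirs: str) -> List[str]:
    """Return those of the given directories that do not exist yet."""
    return [d for d in dirs if not os.path.isdir(d)]
```

For example, `for d in missing_directories("profile", "gtf"): os.makedirs(d)` would create whichever of the two directories is absent before Aletsch is started; the directory names here are placeholders.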
Scoring Transcripts with Pre-trained Model
Aletsch employs a random forest model for scoring transcripts, available for download from Zenodo. Use the provided Python script score.py with this model.
| Parameters | Type | Default Value | Description |
| --- | --- | --- | --- |
| -i | String | | Directory containing Aletsch's feature files (x.trstFeature.csv). This is the same directory where Aletsch outputs individual GTF files, as designated by the -d option in Aletsch's assembly process. |
| -m | String | | Path to the pre-trained model file for scoring. |
| -c | Integer | | Number of samples/cells. |
| -p | Float | 0.2 | Minimum probability score threshold (range: 0 to 1). |
| -o | String | | Output directory of the scored .csv file. |
Assuming a collection of n samples, the directory <individual_gtf_dir> contains a total of n+1 feature files, enumerated from 0.trstFeature.csv through to n.trstFeature.csv. Files 0.trstFeature.csv to (n-1).trstFeature.csv correspond to feature files for individual samples, sequentially from the first to the last sample. The file n.trstFeature.csv is derived from the combined graph.
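The numbering scheme above can be made explicit with a short helper. This is a sketch only: `label_feature_files` is a hypothetical name, and the 1-based "sample k" labels are an illustrative choice, not something Aletsch or score.py produces.

```python
import re
from typing import Dict, Iterable

# Matches names such as "0.trstFeature.csv" and captures the index.
FEATURE_RE = re.compile(r"^(\d+)\.trstFeature\.csv$")


def label_feature_files(filenames: Iterable[str], num_samples: int) -> Dict[str, str]:
    """Map each feature file to the sample it belongs to, or to the combined graph."""
    labels = {}
    for name in filenames:
        match = FEATURE_RE.match(name)
        if match is None:
            continue  # not a feature file (e.g. a .gtf in the same directory)
        index = int(match.group(1))
        if index < num_samples:
            # Files 0 .. n-1 follow the order of samples in the input list.
            labels[name] = f"sample {index + 1}"
        elif index == num_samples:
            # File n is derived from the combined graph.
            labels[name] = "combined graph"
        # Indices above n are not expected and are ignored here.
    return labels
```

With `num_samples=2`, the file `2.trstFeature.csv` is labelled as the combined graph, and `0.trstFeature.csv` and `1.trstFeature.csv` as the first and second sample.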
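Usage
Aletsch takes a list of alignment files (input-bam-list) and writes the assembled transcripts to output.gtf. We highly recommend generating profiles for individual samples first, by running aletsch with --profile and -p <profile>; the profiles are saved to the directory given by -p, and the subsequent assembly run reads them from that directory. Use -d <gtf> in the assembly run to also write transcripts for individual samples.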
Dependencies
The scoring script score.py requires the Python libraries NumPy, pandas, scikit-learn, and joblib. They can be installed with pip or with conda (recommended).
Usage
Score transcripts by running score.py with the options -i, -m, -c, -p, and -o.