NSCCN/oncofuse: Prediction and functional annotation of fusion genes in cancer genomics

Oncofuse is a framework designed to estimate the oncogenic potential of de novo discovered gene fusions. It uses several hallmark features and employs a Bayesian classifier to provide the probability of a given gene fusion being a driver mutation.

Please cite the following paper if you are using Oncofuse:

Mikhail Shugay, Inigo Ortiz de Mendibil, Jose L. Vizmanos and Francisco J. Novo. Oncofuse: a computational framework for the prediction of the oncogenic potential of gene fusions. Bioinformatics. 16 Aug 2013. doi:10.1093/bioinformatics/btt445.

This repository contains the Oncofuse source code; the latest binary version of the framework is available in the releases section. Please see this page for details on running the pipeline. You may also be interested in README.txt, the examples folder, and the output of java -jar Oncofuse.jar -h.

Notes

- Disclaimer: Oncofuse solely performs fusion annotation and oncogenic potential prediction under the assumption that a given fusion exists (i.e. is verifiable by PCR). Filtering out fusions that do not physically exist is the job of the experimental setup and the fusion detection software.
- Oncofuse reports the fusion frame but does not account for it when calculating P-values. This is intentional, as fusion information is often incomplete, e.g. random nucleotides added at the fusion junction can restore the frame. It is therefore left to the user to decide whether to ignore out-of-frame fusions.
- Since v1.0.9 there is an -a hgXX option specifying the genome assembly. It defaults to hg19, yet certain tools, for example FusionCatcher v0.99.3e, provide output in hg38 coordinates.
- Support for FusionCatcher versions earlier than 0.99.3 has been deprecated.
- Please use the issue tracker to report bugs and suggest new features.

Compiling Oncofuse

To obtain and compile the latest version of the legacy Oncofuse package, execute:

git clone https://github.com/mikessh/oncofuse --branch legacy
cd oncofuse
mvn clean install

Then run it as:

cd target/
java -jar oncofuse-v1.X.X.jar [args]

If you need to copy Oncofuse to another folder, make sure the libs and common folders are placed in the same directory, or just use a symlink.

Documentation

Oncofuse is described in the following paper: Mikhail Shugay, Inigo Ortiz de Mendibil, Jose L. Vizmanos and Francisco J. Novo. Oncofuse: a computational framework for the prediction of the oncogenic potential of gene fusions. Bioinformatics. 16 Aug 2013. doi:10.1093/bioinformatics/btt445.

See http://www.unav.es/genetica/oncofuse.html for additional details.

Options

-p option specifies the number of threads Oncofuse will use

-a option specifies genome assembly version. Allowed values: hg18, hg19 and hg38. Default value: hg19

Input

This tool is designed to predict the oncogenic potential of fusion genes found by next-generation sequencing in cancer cells. It also provides information on hallmarks of driver gene fusions, such as the expression gain of the resulting fusion gene, retained protein interaction interfaces, and the functional profile of the resulting protein domains.

Prerequisites: Java(TM) SE Runtime Environment (build 1.7.0 or higher)

Running:

$ java -Xmx1G -jar Oncofuse.jar input_file input_type tissue_type output_file

Supported tissue types (tissue of origin for gene fusion): EPI (epithelial), HEM (hematopoietic), MES (mesenchymal), AVG (averaged, when tissue of origin is unknown)
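As an illustration of how the positional arguments and the -p and -a options fit together, here is a minimal Python sketch that assembles the command line and checks the tissue type and assembly against the values listed in this document. The helper name build_oncofuse_command and the file names are hypothetical, and placing the options before the positional arguments is an assumption; check java -jar Oncofuse.jar -h for the actual order.

```python
import shlex

# Values taken from this README; "-" is only meaningful for the "coord" input type.
TISSUE_TYPES = {"EPI", "HEM", "MES", "AVG", "-"}
ASSEMBLIES = {"hg18", "hg19", "hg38"}


def build_oncofuse_command(input_file, input_type, tissue_type, output_file,
                           jar="Oncofuse.jar", threads=None, assembly=None):
    """Return the argument list for one Oncofuse run (hypothetical helper)."""
    if tissue_type not in TISSUE_TYPES:
        raise ValueError(f"unsupported tissue type: {tissue_type}")
    if assembly is not None and assembly not in ASSEMBLIES:
        raise ValueError(f"unsupported assembly: {assembly}")

    cmd = ["java", "-Xmx1G", "-jar", jar]
    # Assumption: options are given before the positional arguments.
    if threads is not None:
        cmd += ["-p", str(threads)]
    if assembly is not None:
        cmd += ["-a", assembly]
    cmd += [input_file, input_type, tissue_type, output_file]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_oncofuse_command("fusions.txt", "fcatcher", "EPI", "out.tsv",
                             threads=4, assembly="hg38")
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.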

Supported input types:

input_type = “coord”. Default format accepted by Oncofuse: a tab-delimited file with lines containing the 5’ and 3’ breakpoint positions (first nucleotide lost upon fusion) and the tissue of origin:

5’ chrom	5’ coord	3’ chrom	3’ coord	tissue_type

For this format, the tissue of origin is set individually for each entry in the input file, and the tissue_type argument should be set to “-”. Note that there are optional additional columns:

5’ fusion partner gene (FPG) strand
3’ FPG strand
Number of spanning reads (reads that include junction bases)
Number of encompassing reads (reads that encompass junction, but the junction itself is in the insert region)
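The five mandatory columns of the “coord” format can be produced with a few lines of Python. This is a minimal sketch: the function name write_coord_rows and the dictionary keys are hypothetical, the coordinates are placeholders rather than real breakpoints, and the "chr"-prefixed chromosome naming is an assumption (compare with the files in the examples folder).

```python
import csv
import io


def write_coord_rows(handle, fusions):
    """Write one tab-delimited line per fusion with the five mandatory columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for f in fusions:
        writer.writerow([f["chrom5"], f["coord5"],
                         f["chrom3"], f["coord3"], f["tissue"]])


# Placeholder values for illustration only; these are not real breakpoints.
example = [
    {"chrom5": "chr1", "coord5": 1000, "chrom3": "chr2", "coord3": 2000, "tissue": "HEM"},
    {"chrom5": "chr3", "coord5": 3000, "chrom3": "chr4", "coord3": 4000, "tissue": "AVG"},
]

buf = io.StringIO()
write_coord_rows(buf, example)
print(buf.getvalue(), end="")
```

Because each line carries its own tissue type, a file written this way would be passed to Oncofuse with tissue_type set to “-”.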

input_type = “tophat”. Default output file of Tophat-fusion and Tophat2 (usually the fusions.out file in the results folder). Data is pre-filtered based on the number of spanning reads N>=1 and the total number of supporting reads M>=2. These parameters can be changed with the extended input type argument “tophat-N-M”. The tissue type has to be set using the tissue_type argument. Tophat-fusion-post is also supported via the extended input type argument “tophat-post”.

input_type = “fcatcher”. Default output file of the FusionCatcher software. The tissue type has to be set using the tissue_type argument.

input_type = “rnastar”. Default output file of the RNASTAR software. Data is pre-filtered based on the number of spanning reads N>=1 and the total number of supporting reads M>=2. These parameters can be changed with the extended input type argument “rnastar-N-M”. The tissue type has to be set using the tissue_type argument.
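The extended input type strings for the two read-filtered formats follow a simple pattern. The small Python sketch below builds them; the helper name extended_input_type is hypothetical. That "tophat-1-2" behaves the same as plain "tophat" is inferred from the stated defaults, not confirmed here.

```python
def extended_input_type(base, min_spanning=1, min_supporting=2):
    """Build an input_type string such as "tophat-2-5" (hypothetical helper).

    min_spanning   -> N, minimum number of spanning reads
    min_supporting -> M, minimum total number of supporting reads
    """
    if base not in ("tophat", "rnastar"):
        raise ValueError("read thresholds are only described for tophat and rnastar")
    if min_spanning < 0 or min_supporting < 0:
        raise ValueError("thresholds must be non-negative")
    return f"{base}-{int(min_spanning)}-{int(min_supporting)}"


print(extended_input_type("tophat", 2, 5))
print(extended_input_type("rnastar"))
```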

Output

A tab-delimited table with the following columns

column name	description
SAMPLE_ID	The ID of sample for tophat-post, input file name otherwise
FUSION_ID	The original line number in input file
TISSUE	As specified by the tissue_type argument or in the ‘coord’ input file
GENOMIC	Chromosomal coordinates for both breakpoints (as in input file)
SPANNING_READS	Number of reads that cover fusion junction
ENCOMPASSING_READS	Number of reads that map discordantly with one mate mapping to 5’FPG (fusion partner gene) and other mapping to 3’FPG
5_FPG_GENE_NAME	The HGNC symbol of 5’ fusion partner gene
5_IN_CDS?	Indicates whether breakpoint is within the CDS of this gene
5_SEGMENT_TYPE	Indicates whether breakpoint is located within either exon or intron
5_SEGMENT_ID	Indicates number of exon or intron where breakpoint is located
5_COORD_IN_SEGMENT	Indicates coordinates for breakpoint within that exon/intron
5_FULL_AA	Length of translated 5’ fusion partner gene (FPG) in full amino acids
5_FRAME	Frame of translated 5’ FPG
(Same as 7 lines above for the 3’ fusion partner gene)	…
FPG_FRAME_DIFFERENCE	The resulting frame of fusion gene, if equals to 0 then the fusion is in-frame
P_VAL_CORR	The Bayesian probability of fusion being a passenger (class 0), given as Bonferroni-corrected P-value
DRIVER_PROB	The Bayesian probability of fusion being a driver (class 1)
EXPRESSION_GAIN	Expression gain of fusion calculated as [(expression of 5’ gene)/(expression of 3’ gene)]-1
5_DOMAINS_RETAINED	List of protein domains retained in 5’ fusion partner gene
3_DOMAINS_RETAINED	List of protein domains retained in 3’ fusion partner gene
5_DOMAINS_BROKEN	List of protein domains that overlap breakpoint in 5’ fusion partner gene
3_DOMAINS_BROKEN	List of protein domains that overlap breakpoint in 3’ fusion partner gene
5_PII_RETAINED	List of protein interaction interfaces retained in 5’ fusion partner gene
3_PII_RETAINED	List of protein interaction interfaces retained in 3’ fusion partner gene
CTF, G, H, K, P and TF	Corresponding functional family association scores (FFAS, see paper for details). Values are log-transformed and scaled to the largest score obtained from classifier training set.
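Since Oncofuse does not account for the fusion frame when scoring, users who want to drop out-of-frame fusions have to filter the output themselves. The Python sketch below shows one way to do so, assuming the output file starts with a header row that uses the column names listed above and that an in-frame fusion is written as 0 in FPG_FRAME_DIFFERENCE. The function name select_candidates, the 0.5 threshold, and the sample rows are illustrative only, not recommended values or real results.

```python
import csv
import io


def select_candidates(handle, min_driver_prob=0.5, in_frame_only=True):
    """Return output rows that pass a DRIVER_PROB cut-off and, optionally, are in-frame."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        if float(row["DRIVER_PROB"]) < min_driver_prob:
            continue
        # Assumption: in-frame fusions are reported as the literal value 0.
        if in_frame_only and row["FPG_FRAME_DIFFERENCE"].strip() != "0":
            continue
        kept.append(row)
    return kept


# Synthetic rows with a subset of columns, for illustration only.
sample = (
    "FUSION_ID\tFPG_FRAME_DIFFERENCE\tDRIVER_PROB\n"
    "1\t0\t0.98\n"
    "2\t1\t0.91\n"
    "3\t0\t0.12\n"
)

in_frame = select_candidates(io.StringIO(sample))
any_frame = select_candidates(io.StringIO(sample), in_frame_only=False)
print([r["FUSION_ID"] for r in in_frame])
print([r["FUSION_ID"] for r in any_frame])
```

With a real result file, open it with open(path, newline="") and pass the handle in place of the io.StringIO object.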