CHORD is a random forest model that uses the relative counts of somatic
mutation contexts to predict homologous recombination deficiency (HRD).
The primary contexts used by CHORD are deletions with flanking
microhomology and 1-100kb structural duplications. Additionally, 1-100kb
structural duplications are used to distinguish BRCA1-type HRD from
BRCA2-type HRD.
CHORD is itself an R package. Before using CHORD, some other R packages
will also need to be installed, with the main one being mutSigExtractor,
which is required for extracting the mutation contexts that CHORD uses.
The below code can be used to install CHORD and mutSigExtractor, as well
as the dependencies for both packages.
## Bioconductor packages required by mutSigExtractor
install.packages('BiocManager')
BiocManager::install('BSgenome') ## Install genome parser
BiocManager::install('BSgenome.Hsapiens.UCSC.hg19') ## Install the default genome
## randomForest is required by CHORD
install.packages('randomForest')
## Install CHORD and mutSigExtractor directly from github using devtools
install.packages("devtools")
devtools::install_github('https://github.com/UMCUGenetics/mutSigExtractor/')
devtools::install_github('https://github.com/UMCUGenetics/CHORD/')
Quick start
Vcfs as input
CHORD can be supplied with vcf files containing somatic (no
germline) SNVs, indels, and SVs per sample. The first step is to
extract the mutation contexts required by CHORD.
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs/',
vcf.indel = '/path/to/vcf/with/indels/',
vcf.sv = '/path/to/vcf/with/svs/',
sv.caller = 'gridss' ## Can be 'gridss' (default) or 'manta'
)
Note that there is no real standard for SV vcfs. However, parsing vcfs
produced by Manta or GRIDSS has been built into CHORD.
Dataframes as input
Alternatively, dataframes can be used as input (handy if you want to
parse the vcfs yourself and supply this preprocessed data to CHORD).
For SNVs and indels, a dataframe with the columns: chrom, pos, ref, alt.
## CHROM POS REF ALT
## 1 1 16145827 A C
## 2 1 16492085 G C
## 3 1 17890303 C G
## 4 1 18877885 G A
## 5 1 18919776 T C
For SVs, a dataframe with the columns: sv_type, sv_len. For sv_type,
values must be DEL, DUP, INV, TRA (deletions, duplications, inversions,
translocations). For translocations, sv_len information is discarded.
## sv_type sv_len
## 1 TRA NA
## 2 DEL 1696
## 3 DEL 22644
## 4 DUP 1703
## 5 DEL 1789
## 6 DEL 49256
The column names themselves do not matter, as long as the columns are in
the aforementioned order.
Inputs are flexible
The vcf.* and df.* arguments can be mixed. This is especially useful
in the case of SV vcfs, since it is possible that you will need to parse
these yourself (in this case the sv.caller argument can be left out).
Make sure to only use PASS variants! If your vcfs contain unfiltered
variants, you can provide the desired FILTER values to keep to
vcf.filters.
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs/',
vcf.indel = '/path/to/vcf/with/indels/',
df.sv = dataframe_with_svtype_and_svlen,
## This is optional (for when your vcfs are not filtered for PASS variants)
vcf.filters=list(snv="PASS",indel="PASS",sv="PASS")
)
Often, SNVs and indels are reported in the same vcfs. In such cases, the
vcf path can be specified to the vcf.snv argument (vcf.indel can be
left out). Alternatively, SNVs and indels can be provided as a
dataframe, which can be specified to df.snv (with df.indel left
out).
A different reference genome than the default
(BSgenome.Hsapiens.UCSC.hg19) can be used. Genomes should be BSgenomes.
The variable name (i.e. no quotes) of the BSgenome object is
specified to ref.genome.
## Make sure to install and load the desired ref genome first
install.packages('BiocManager')
BiocManager::install('BSgenome.Hsapiens.UCSC.hg38')
## Non-default genomes need to be explicitly loaded. The default (BSgenome.Hsapiens.UCSC.hg19)
## is automatically loaded.
library(BSgenome.Hsapiens.UCSC.hg38)
## Specify the name of the BSgenome object to the ref.genome argument
contexts <- extractSigsChord(
vcf.snv = '/path/to/vcf/with/snvs_and_indels/',
df.sv = dataframe_with_svtype_and_svlen
ref.genome=BSgenome.Hsapiens.UCSC.hg38
)
Predicting HRD
Once we have the mutation contexts, a prediction can be made.
chordPredict(contexts)
Tutorial
This tutorial will demonstrate how HRD prediction can be performed
locally. For large vcfs and/or many samples, it is advised to run CHORD
on a high-performance cluster (HPC).
Somatic vcfs from a few primary tumors from the 560 breast cancer cohort
(BRCA-EU, ICGC) will be used as example input. These are located at
example/vcf/ in the CHORD git repository. These vcfs have been
generated from tsv files which are publically available at
https://dcc.icgc.org/repositories.
In part 1, we will use the vcfs directly as input for CHORD.
Part 2 of this tutorial will provide a demonstration of how we can parse
the vcfs into dataframes that CHORD accepts. This is particularly useful
if you have non-standard vcfs, as is often the case with SV vcfs. Of
course, it is possible to produce the required input dataframes from any
data source, and not just vcfs; it only requires a bit of data wrangling
:).
To run the example, clone/download the git repository to your own
computer (e.g. to /Users/lnguyen/Desktop/), and set the working
directory to the example/ directory in the CHORD package.
setwd('/Users/lnguyen/Desktop/CHORD/example/')
1. Running CHORD on vcfs
Loading vcfs into R
We will first create a dataframe to store the vcf paths as well as
assign them sample names.
library(CHORD)
options(stringsAsFactors=F)
## Get file paths from the vcf/ subdirectory
vcf_files <- data.frame(
snv_indel=list.files('vcf', pattern='*snv_indel.vcf.gz', full.names=T),
sv=list.files('vcf', pattern='*sv.vcf.gz', full.names=T)
)
## Assign sample names to files
vcf_files$sample <- sapply(strsplit(basename(vcf_files$snv_indel),'_'),`[`,1)
vcf_files
Vcfs can either be in plain text format (.vcf) or compressed (.gz).
SNVs and indels are often stored in the same vcf (as is the case with
the example vcfs). CHORD therefore accepts such vcfs as input; SNVs and
indels are automatically split up. However, it is also possible to
specify the SNV and indel vcfs separately (by specifying the paths
separately to vcf.snv and vcf.indel to extractSigsChord(); see
below).
Extracting mutation contexts
Create a directory to write all subsequent output. Note: the output
from this tutorial exists in the output/ directory by default. Delete
this directory to create new output files.
dir.create('output/')
Extract the features that are used as input for CHORD. With the example
data, it should take about 30s to run.
## Make dir to output contexts
contexts_dir <- 'output/contexts/'
dir.create(contexts_dir, recursive=T)
## Extract contexts for all samples
for(i in 1:nrow(vcf_files)){
params <- as.list(vcf_files[i,])
out_path <- paste0(contexts_dir,'/',params$sample,'_contexts.txt')
if(!file.exists(out_path)){
extractSigsChord(
vcf.snv=params$snv_indel,
vcf.sv=params$sv, sv.caller='manta',
sample.name=params$sample,
output.path=out_path, verbose=F
)
}
}
The sv.caller parameter has been included in extractSigsChord()
since different SV callers report SVs in different ways.
extractSigsChord() includes parsing for vcfs produced by Manta and
GRIDSS (sv.caller='manta'or sv.caller='gridss'). If you have vcfs
from other SV callers, a solution is to provide the relevant data as a
dataframe (see section 2 of the tutorial).
We can then merge the contexts into a matrix.
## Get the paths of the contexts txt files
context_files <- list.files(contexts_dir, full.names=T)
## Read the txt files into R
l_contexts <- lapply(context_files, function(i){
read.delim(i, check.names=F)
})
## Merged the contexts into a matrix
merged_contexts <- do.call(rbind, l_contexts)
## Write to output directory
write.table(merged_contexts, 'output/merged_contexts.txt', sep='\t', quote=F)
Below we can see what a part of the context matrix looks like.
hr_status tells us if a sample is HR deficient (p_hrd >= 0.5) or
proficient. If a sample is HRD, then hrd_type will tell us if the
sample has BRCA1-type HRD or BRCA2-type HRD (=max(p_BRCA1,p_BRCA2)).
Pre-requisites for accurate HRD prediction
CHORD requires >=50 indels to accurately determine whether a sample
is HRD. If this criterion is not met, hr_status will be
cannot_be_determined and remarks_hr_status will be <50 indels.
CHORD cannot be applied to MSI samples. If an MSI sample is
detected, hr_status will be cannot_be_determined and
remarks_hr_status will be Has MSI (>14000 indel.rep)
If a sample is HRD, CHORD requires >=30 SVs to accurately determine
HRD subtype. If this criterion is not met, hrd_type will be
cannot_be_determined, and remarks_hr_status will be <30 SVs.
The user may of course ignore these remarks and proceed with using the
raw probabilities outputted by CHORD (p_hrd and/or
p_BRCA1/pBRCA2).
To assess the stability of prediction for each sample, bootstrapping is
performed by resampling the feature vector 20 times and calculating HRD
probabilities for each iteration. The probabilities at the 5%, 50% 95%
quantiles are then calculated. In other words, the bootstraped
predictions provide a lower (5% quantile) and upper (95% quantile) error
from the median probability (50% quantile) for each prediction class.
The user may choose to use the median probabilities in place of the
single probabilities (as in chord_output$pred). This can give more
accurate HRD probabilities, especially for samples with low mutational
load.
Note that performing the bootstrapping is computationally expensive.
Bootstrapping can be turned off by specifying do.bootstrap=F to
chordPredict().
2. Running CHORD from dataframes
To run CHORD on non-standard vcfs or from other sources, we can create
dataframes that extractSigsChord() accepts. The output can then be
passed to chordPredict().
In this part of the tutorial, we will create the required dataframes
from the example vcfs. This will thus also serve as a short
demonstration on how one could parse vcfs in R. Furthermore, we will
focus on one sample for simplicity’s sake.
SNV and indel dataframes
For SNVs and indels, a dataframe containing CHROM, POS, REF and ALT
information (i.e. a bed file like format) is required. The column names
of this dataframe can be anything; as long as the columns are in the
aforementioned order.
We can use readVcfFields() from the mutSigExtractor package to read
a vcf into R as a dataframe.
The dataframe shown below is in the correct format as required by CHORD.
This dataframe contains both SNVs and indels. This is not a problem as
we can specify the argumentdf.snv (and leaving out df.indel) in
extractSigsChord() as is done later in the tutorial.
df_snv_indel[25:30,]
## CHROM POS REF ALT
## 25 1 16145827 A C
## 26 1 16333830 GTGTGAAGGTGTC G
## 27 1 16492085 G C
## 28 1 17890303 C G
## 29 1 18877885 G A
## 30 1 18919776 T C
Extracting SV contexts
For SVs, a dataframe is required with SV type (values must be one of the
following: DEL, DUP, INV, TRA) and SV length information. Again, the
column names of this dataframe can be anything; as long as the columns
are in the aforementioned order.
Again, readVcfFields() can be used to read the vcf. The example vcfs
have been produced by Manta, which reports SVTYPE and SVLEN in the INFO
field, so only this field needs to be extracted from the vcf.
Extract the SVTYPE and SVLEN data, and put it in the correct dataframe
format.
sv_type <- sapply(vcf_sv_info,`[`,2) ## Select the 2nd object from each INFO entry
sv_type <- gsub('SVTYPE=','',sv_type) ## Remove the 'SVTYPE=' prefix
sv_len <- sapply(vcf_sv_info,`[`,1) ## Select the 1st object from each INFO entry
sv_len <- gsub('SVLEN=','',sv_len) ## Remove the 'SVLEN=' prefix
sv_len <- as.integer(sv_len) ## Convert the character vector to an integer vector
df_sv <- data.frame(sv_type, sv_len)
head(df_sv)
## sv_type sv_len
## 1 TRA -13737754
## 2 DEL 1696
## 3 DEL 22644
## 4 DUP 1703
## 5 DEL 1789
## 6 DEL 49256
You may notice that translocations (TRA) have an SV length, which
doesn’t really make sense. This is no problem since SV length info is
discarded for translocations.
Predicting HRD
Extract the relevant mutation contexts from the variant data.
CHORD: Classifier of HOmologous Recombination Deficiency
CHORD is a random forest model that uses the relative counts of somatic mutation contexts to predict homologous recombination deficiency (HRD). The primary contexts used by CHORD are deletions with flanking microhomology and 1-100kb structural duplications. Additionally, 1-100kb structural duplications are used to distinguish BRCA1-type HRD from BRCA2-type HRD.
Reference
Pan-cancer landscape of homologous recombination deficiency
Luan Nguyen, John Martens, Arne Van Hoeck, Edwin Cuppen. Nat Commun 11, 5584 (2020). https://www.nature.com/articles/s41467-020-19406-4
Installation
CHORD is itself an R package. Before using CHORD, some other R packages will also need to be installed, with the main one being mutSigExtractor, which is required for extracting the mutation contexts that CHORD uses. The below code can be used to install CHORD and mutSigExtractor, as well as the dependencies for both packages.
Quick start
Vcfs as input
CHORD can be supplied with vcf files containing somatic (no germline) SNVs, indels, and SVs per sample. The first step is to extract the mutation contexts required by CHORD.
Note that there is no real standard for SV vcfs. However, parsing vcfs produced by Manta or GRIDSS has been built into CHORD.
Dataframes as input
Alternatively, dataframes can be used as input (handy if you want to parse the vcfs yourself and supply this preprocessed data to CHORD).
For SNVs and indels, a dataframe with the columns: chrom, pos, ref, alt.
For SVs, a dataframe with the columns: sv_type, sv_len. For sv_type, values must be DEL, DUP, INV, TRA (deletions, duplications, inversions, translocations). For translocations, sv_len information is discarded.
The column names themselves do not matter, as long as the columns are in the aforementioned order.
Inputs are flexible
The vcf.* and df.* arguments can be mixed. This is especially useful in the case of SV vcfs, since it is possible that you will need to parse these yourself (in this case the
sv.callerargument can be left out). Make sure to only use PASS variants! If your vcfs contain unfiltered variants, you can provide the desired FILTER values to keep tovcf.filters.Often, SNVs and indels are reported in the same vcfs. In such cases, the vcf path can be specified to the
vcf.snvargument (vcf.indelcan be left out). Alternatively, SNVs and indels can be provided as a dataframe, which can be specified todf.snv(withdf.indelleft out).A different reference genome than the default (BSgenome.Hsapiens.UCSC.hg19) can be used. Genomes should be BSgenomes. The variable name (i.e. no quotes) of the BSgenome object is specified to
ref.genome.Predicting HRD
Once we have the mutation contexts, a prediction can be made.
Tutorial
This tutorial will demonstrate how HRD prediction can be performed locally. For large vcfs and/or many samples, it is advised to run CHORD on a high-performance cluster (HPC).
Somatic vcfs from a few primary tumors from the 560 breast cancer cohort (BRCA-EU, ICGC) will be used as example input. These are located at
example/vcf/in the CHORD git repository. These vcfs have been generated from tsv files which are publically available at https://dcc.icgc.org/repositories.In part 1, we will use the vcfs directly as input for CHORD.
Part 2 of this tutorial will provide a demonstration of how we can parse the vcfs into dataframes that CHORD accepts. This is particularly useful if you have non-standard vcfs, as is often the case with SV vcfs. Of course, it is possible to produce the required input dataframes from any data source, and not just vcfs; it only requires a bit of data wrangling :).
To run the example, clone/download the git repository to your own computer (e.g. to /Users/lnguyen/Desktop/), and set the working directory to the
example/directory in the CHORD package.1. Running CHORD on vcfs
Loading vcfs into R
We will first create a dataframe to store the vcf paths as well as assign them sample names.
Vcfs can either be in plain text format (.vcf) or compressed (.gz).
SNVs and indels are often stored in the same vcf (as is the case with the example vcfs). CHORD therefore accepts such vcfs as input; SNVs and indels are automatically split up. However, it is also possible to specify the SNV and indel vcfs separately (by specifying the paths separately to
vcf.snvandvcf.indeltoextractSigsChord(); see below).Extracting mutation contexts
Create a directory to write all subsequent output. Note: the output from this tutorial exists in the
output/directory by default. Delete this directory to create new output files.Extract the features that are used as input for CHORD. With the example data, it should take about 30s to run.
The
sv.callerparameter has been included inextractSigsChord()since different SV callers report SVs in different ways.extractSigsChord()includes parsing for vcfs produced by Manta and GRIDSS (sv.caller='manta'orsv.caller='gridss'). If you have vcfs from other SV callers, a solution is to provide the relevant data as a dataframe (see section 2 of the tutorial).We can then merge the contexts into a matrix.
Below we can see what a part of the context matrix looks like.
Predicting HRD and interpreting CHORD’s output
Once we have the context matrix ready, we can use it for predicting HRD.
Main output
To extract the HRD probabilities:
CHORD outputs the probability of:
p_hrdp_BRCA1p_BRCA2hr_statustells us if a sample is HR deficient (p_hrd>= 0.5) or proficient. If a sample is HRD, thenhrd_typewill tell us if the sample has BRCA1-type HRD or BRCA2-type HRD (=max(p_BRCA1,p_BRCA2)).Pre-requisites for accurate HRD prediction
CHORD requires >=50 indels to accurately determine whether a sample is HRD. If this criterion is not met,
hr_statuswill becannot_be_determinedandremarks_hr_statuswill be<50 indels.CHORD cannot be applied to MSI samples. If an MSI sample is detected,
hr_statuswill becannot_be_determinedandremarks_hr_statuswill beHas MSI (>14000 indel.rep)If a sample is HRD, CHORD requires >=30 SVs to accurately determine HRD subtype. If this criterion is not met,
hrd_typewill becannot_be_determined, andremarks_hr_statuswill be<30 SVs.The user may of course ignore these remarks and proceed with using the raw probabilities outputted by CHORD (
p_hrdand/orp_BRCA1/pBRCA2).Stability of predictions
To extract the bootstrap predictions:
To assess the stability of prediction for each sample, bootstrapping is performed by resampling the feature vector 20 times and calculating HRD probabilities for each iteration. The probabilities at the 5%, 50% 95% quantiles are then calculated. In other words, the bootstraped predictions provide a lower (5% quantile) and upper (95% quantile) error from the median probability (50% quantile) for each prediction class.
The user may choose to use the median probabilities in place of the single probabilities (as in
chord_output$pred). This can give more accurate HRD probabilities, especially for samples with low mutational load.Note that performing the bootstrapping is computationally expensive. Bootstrapping can be turned off by specifying
do.bootstrap=FtochordPredict().2. Running CHORD from dataframes
To run CHORD on non-standard vcfs or from other sources, we can create dataframes that
extractSigsChord()accepts. The output can then be passed tochordPredict().In this part of the tutorial, we will create the required dataframes from the example vcfs. This will thus also serve as a short demonstration on how one could parse vcfs in R. Furthermore, we will focus on one sample for simplicity’s sake.
SNV and indel dataframes
For SNVs and indels, a dataframe containing CHROM, POS, REF and ALT information (i.e. a bed file like format) is required. The column names of this dataframe can be anything; as long as the columns are in the aforementioned order.
We can use
readVcfFields()from themutSigExtractorpackage to read a vcf into R as a dataframe.Alternatively, we can specify the vcf columns we want by index.
The dataframe shown below is in the correct format as required by CHORD. This dataframe contains both SNVs and indels. This is not a problem as we can specify the argument
df.snv(and leaving outdf.indel) inextractSigsChord()as is done later in the tutorial.Extracting SV contexts
For SVs, a dataframe is required with SV type (values must be one of the following: DEL, DUP, INV, TRA) and SV length information. Again, the column names of this dataframe can be anything; as long as the columns are in the aforementioned order.
Again,
readVcfFields()can be used to read the vcf. The example vcfs have been produced by Manta, which reports SVTYPE and SVLEN in the INFO field, so only this field needs to be extracted from the vcf.Then, split the data each INFO entry, which are separated by
;.Extract the SVTYPE and SVLEN data, and put it in the correct dataframe format.
You may notice that translocations (TRA) have an SV length, which doesn’t really make sense. This is no problem since SV length info is discarded for translocations.
Predicting HRD
Extract the relevant mutation contexts from the variant data.
Below we can see what a part of the context matrix looks like.
Lastly, make the HRD prediction.
Please refer back to section 1 of the tutorial for interpreting CHORD’s output.