SomaticSeq is an ensemble somatic SNV/indel caller that has the ability to use
machine learning to filter out false positives from other callers. It also comes
with a suite of genomic utilities. The
detailed documentation is located in
docs/Manual.pdf.
Use high-confidence call set as the “ground truth” to investigate how
different sample preparations, sequencing library kits, and bioinformatic
algorithms affect the accuracy of the somatic mutation pipelines, and develop
best practices, e.g.,
Xiao W. et al. Nat Biotechnol 2021.
Python 3, plus pysam, numpy, scipy, pandas, and xgboost libraries.
BEDTools: required when parallel
processing is invoked, and/or when any bed files are used as input files.
Optional: dbSNP VCF file (if you want to use dbSNP membership as a feature).
Optional: R and ada are required for
AdaBoost, whereas XGBoost (default) is implemented in python.
To install SomaticSeq, clone this repo, cd somaticseq, and then run
pip install . (To install extra packages for development:
pip install '.[dev]'). A number of commands prefixed with somaticseq_ will
be placed into the PATH.
To install using pip
Make sure to install bedtools separately.
pip install somaticseq
To install the bioconda version
SomaticSeq can also be found on
,
which has
so far. To
,
which also automatically installs a bunch of 3rd-party somatic mutation callers:
If installed successfully, you will be able to run somaticseq --help in the
terminal. Also make sure bedtools is executable. There are some toy data sets
and test scripts in example that should finish in <1 minute
if installed properly.
Run SomaticSeq with an example command
At minimum, given the results of the individual mutation caller(s), SomaticSeq
will extract sequencing features for the combined call set. Required inputs
for command somaticseq are:
--output-directory and --genome-reference, then
Either paired or single to invoke paired or single sample mode,
if paired: --tumor-bam-file, and --normal-bam-file are both
required.
if single: --bam-file is required.
Everything else is optional (though without a single VCF file from at least
one caller, SomaticSeq does nothing).
The following four files will be created into the output directory:
Consensus.sSNV.vcf, Consensus.sINDEL.vcf, Ensemble.sSNV.tsv, and
Ensemble.sINDEL.tsv.
If you’re searching for pipelines to run those individual somatic mutation
callers, feel free to take advantage of our
Dockerized Somatic Mutation Workflow
as a start.
Important note: multi-argument options (e.g., --extra-hyperparameters or
--features-excluded) cannot be placed immediately before paired or
single, because those options would try to “grab” paired or single
as an additional argument.
For all of those input VCF files, both .vcf and .vcf.gz are acceptable.
SomaticSeq also accepts .cram, but some callers may only take .bam.
--arbitrary-snvs and --arbitrary-indels are added since v3.7.0. It allows
users to input any arbitrary VCF file(s) from caller(s) that we did not
explicitly incorporate. SNVs and indels have to be separated.
If your caller puts SNVs and indels in the same output VCF file, you may
split it using a SomaticSeq utility script, e.g.,
somaticseq_split_vcf -infile small_variants.vcf -snv snvs.vcf -indel indels.vcf.
As usual, input can be either .vcf or .vcf.gz, but output will be
.vcf.
For those VCF file(s), any calls not labeled REJECT or LowQual will be
considered a bona fide somatic mutation call. REJECT calls will be
skipped. LowQual calls will be considered, but will not have a value of
1 in if_Caller machine learning feature.
--inclusion-region or --exclusion-region will require bedtools in your
path.
--algorithm defaults to xgboost as v3.6.0, but can also be ada (AdaBoost
in R). XGBoost supports multi-threading and can be orders of magnitude faster
than AdaBoost, and seems to be about the same in terms of accuracy, so we
changed the default from ada to xgboost as v3.6.0 and that’s what we
recommend now.
To split the job into multiple threads, place --threads X before the
paired option to indicate X threads. It simply creates multiple BED file
(each consisting of 1/X of total base pairs) for SomaticSeq to run on each of
those sub-BED files in parallel. It then merges the results. This requires
bedtools in your path.
Additional parameters to be specified beforepaired option to invoke
training mode. In addition to the four files specified above, two classifiers
(SNV and indel) will be created..
--somaticseq-train: FLAG to invoke training mode with no argument, which
also requires ground truth VCF files.
--extra-hyperparameters: add hyperparameters for xgboost, e.g.,
--extra-hyperparameters scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12.
--truth-snv: if you have a ground truth VCF file for SNV
--truth-indel: if you have a ground truth VCF file for INDEL
Additional input files to be specified beforepaired option invoke
prediction mode (to use classifiers to score variants). Four additional files
will be created, i.e., SSeq.Classified.sSNV.vcf, SSeq.Classified.sSNV.tsv,
SSeq.Classified.sINDEL.vcf, and SSeq.Classified.sINDEL.tsv.
--classifier-snv: classifier previously built for SNV
--classifier-indel: classifier previously built for INDEL
Without those paramters above to invoking training or prediction mode,
SomaticSeq will default to majority-vote consensus mode.
To train for SomaticSeq classifiers with multiple data sets combined
Run somaticseq_xgboost train --help to see the options. It is recommended that
SNV and INDEL models be trained separately, but it is up to you to experiment,
e.g.,
Most SomaticSeq modules can be run on their own. They may be useful in debugging
context, or be run for your own purposes. See this page for your
options.
Dockerized workflows and pipelines
To run somatic mutation callers and then SomaticSeq
We have created a module (i.e., somaticseq_make_somatic_scripts) that can run
all the dockerized somatic mutation callers and then SomaticSeq, described at
somaticseq/utilities/dockered_pipelines.
There is also an alignment workflow described there. You need
docker to run these workflows. Singularity is also
supported, but is not optimized. Let me know if you find bugs.
To create training data to create SomaticSeq classifiers
Before well characterized real data was available, we have dockerized
pipelines for in silico mutation spike in at
somaticseq/utilities/dockered_pipelines/bamSimulator.
These pipelines are based on
BAMSurgeon. We have used it to
create training set to build SomaticSeq classifiers, though it has not been
updated for a while.
Combine both BAMSurgeon in silico spike in and the real SEQC2 training data
may give you better model than using either, which was shown in
Sahraeian S.M.E. et al. 2022.
The reason may be that the real data’s high-confidence call sets do not have
the most challenging genomic regions, whereas in silico data do not have the
most realistic data characteristics. Combining both allows them to cover each
other’s shortcomings.
Dockerized alignment pipeline based on GATK’s best practices
We have some generally useful scripts in utilities. Some
of the more useful tools, e.g.,
somaticseq_loci_counter finds overlapping regions among multiple bed files.
somaticseq_run_workflows is a rudimentary workflow manager that executes
multiple scripts at once.
somaticseq_split_bed_into_equal_regions splits one bed file into a number of
output bed files, where each output bed file will have the same total length.
somaticseq_linguistic_sequence_complexity calculates sequence complexity
given a nucleotide sequence (e.g., GCCAGAC) based on
Troyanskaya OG et al. Bioinformatics 2002.
SomaticSeq
SomaticSeq is an ensemble somatic SNV/indel caller that has the ability to use machine learning to filter out false positives from other callers. It also comes with a suite of genomic utilities. The detailed documentation is located in docs/Manual.pdf.
Training data for benchmarking and/or model building
In 2021, the FDA-led MAQC-IV/SEQC2 Consortium has produced multi-center multi-platform whole-genome and whole-exome sequencing data sets for a pair of tumor-normal reference samples (HCC1395 and HCC1395BL), along with the high-confidence somatic mutation call set. This work was published in Fang, L.T., Zhu, B., Zhao, Y. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol 39, 1151-1160 (2021) / PMID:34504347 / Free Read-Only Link. The following are some of the use cases for these resources:
Click for more details of the SEQC2’s somatic mutation project.
Recommendation of how to use SEQC2 data to create SomaticSeq classifiers.
Installation
Dependencies
This dockerfile reveals the dependencies
cd somaticseq, and then runpip install .(To install extra packages for development:pip install '.[dev]'). A number of commands prefixed withsomaticseq_will be placed into the PATH.To install using pip
Make sure to install
bedtoolsseparately.To install the bioconda version
SomaticSeq can also be found on
,
which has
so far. To
,
which also automatically installs a bunch of 3rd-party somatic mutation callers:
To install from github source with conda
Test your installation
If installed successfully, you will be able to run
somaticseq --helpin the terminal. Also make surebedtoolsis executable. There are some toy data sets and test scripts in example that should finish in <1 minute if installed properly.Run SomaticSeq with an example command
At minimum, given the results of the individual mutation caller(s), SomaticSeq will extract sequencing features for the combined call set. Required inputs for command
somaticseqare:--output-directoryand--genome-reference, thenpairedorsingleto invoke paired or single sample mode,paired:--tumor-bam-file, and--normal-bam-fileare both required.single:--bam-fileis required.Everything else is optional (though without a single VCF file from at least one caller, SomaticSeq does nothing).
The following four files will be created into the output directory:
Consensus.sSNV.vcf,Consensus.sINDEL.vcf,Ensemble.sSNV.tsv, andEnsemble.sINDEL.tsv.If you’re searching for pipelines to run those individual somatic mutation callers, feel free to take advantage of our Dockerized Somatic Mutation Workflow as a start.
--extra-hyperparametersor--features-excluded) cannot be placed immediately beforepairedorsingle, because those options would try to “grab”pairedorsingleas an additional argument.For all of those input VCF files, both
.vcfand.vcf.gzare acceptable. SomaticSeq also accepts.cram, but some callers may only take.bam.--arbitrary-snvsand--arbitrary-indelsare added since v3.7.0. It allows users to input any arbitrary VCF file(s) from caller(s) that we did not explicitly incorporate. SNVs and indels have to be separated.somaticseq_split_vcf -infile small_variants.vcf -snv snvs.vcf -indel indels.vcf. As usual, input can be either.vcfor.vcf.gz, but output will be.vcf.1inif_Callermachine learning feature.--inclusion-regionor--exclusion-regionwill requirebedtoolsin your path.--algorithmdefaults toxgboostas v3.6.0, but can also beada(AdaBoost in R). XGBoost supports multi-threading and can be orders of magnitude faster than AdaBoost, and seems to be about the same in terms of accuracy, so we changed the default fromadatoxgboostas v3.6.0 and that’s what we recommend now.To split the job into multiple threads, place
--threads Xbefore thepairedoption to indicate X threads. It simply creates multiple BED file (each consisting of 1/X of total base pairs) for SomaticSeq to run on each of those sub-BED files in parallel. It then merges the results. This requiresbedtoolsin your path.Additional parameters to be specified before
pairedoption to invoke training mode. In addition to the four files specified above, two classifiers (SNV and indel) will be created..--somaticseq-train: FLAG to invoke training mode with no argument, which also requires ground truth VCF files.--extra-hyperparameters: add hyperparameters for xgboost, e.g.,--extra-hyperparameters scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12.--truth-snv: if you have a ground truth VCF file for SNV--truth-indel: if you have a ground truth VCF file for INDELAdditional input files to be specified before
pairedoption invoke prediction mode (to use classifiers to score variants). Four additional files will be created, i.e.,SSeq.Classified.sSNV.vcf,SSeq.Classified.sSNV.tsv,SSeq.Classified.sINDEL.vcf, andSSeq.Classified.sINDEL.tsv.--classifier-snv: classifier previously built for SNV--classifier-indel: classifier previously built for INDELWithout those paramters above to invoking training or prediction mode, SomaticSeq will default to majority-vote consensus mode.
To train for SomaticSeq classifiers with multiple data sets combined
Run
somaticseq_xgboost train --helpto see the options. It is recommended that SNV and INDEL models be trained separately, but it is up to you to experiment, e.g.,Run SomaticSeq modules seperately
Most SomaticSeq modules can be run on their own. They may be useful in debugging context, or be run for your own purposes. See this page for your options.
Dockerized workflows and pipelines
To run somatic mutation callers and then SomaticSeq
We have created a module (i.e.,
somaticseq_make_somatic_scripts) that can run all the dockerized somatic mutation callers and then SomaticSeq, described at somaticseq/utilities/dockered_pipelines. There is also an alignment workflow described there. You need docker to run these workflows. Singularity is also supported, but is not optimized. Let me know if you find bugs.To create training data to create SomaticSeq classifiers
I recommend SEQC2 Somatic Mutation Working Group‘s reference sequencing data and high-confidence somatic mutation call sets.
Before well characterized real data was available, we have dockerized pipelines for in silico mutation spike in at somaticseq/utilities/dockered_pipelines/bamSimulator. These pipelines are based on BAMSurgeon. We have used it to create training set to build SomaticSeq classifiers, though it has not been updated for a while.
Combine both BAMSurgeon in silico spike in and the real SEQC2 training data may give you better model than using either, which was shown in Sahraeian S.M.E. et al. 2022. The reason may be that the real data’s high-confidence call sets do not have the most challenging genomic regions, whereas in silico data do not have the most realistic data characteristics. Combining both allows them to cover each other’s shortcomings.
Dockerized alignment pipeline based on GATK’s best practices
Described at somaticseq/utilities/dockered_pipelines. The module is
somaticseq_make_alignment_scripts.Utilities
We have some generally useful scripts in utilities. Some of the more useful tools, e.g.,
somaticseq_loci_counterfinds overlapping regions among multiple bed files.somaticseq_run_workflowsis a rudimentary workflow manager that executes multiple scripts at once.somaticseq_split_bed_into_equal_regionssplits one bed file into a number of output bed files, where each output bed file will have the same total length.somaticseq_linguistic_sequence_complexitycalculates sequence complexity given a nucleotide sequence (e.g., GCCAGAC) based on Troyanskaya OG et al. Bioinformatics 2002.