[!NOTE]
This repository contains code that runs inside a VarFish Server.
If you are looking to run your own VarFish Server, look here at bihealth/varfish-server.
This repository contains the worker used by VarFish Server to execute certain background task.
They are written in the Rust programming language to speed up the execution of certain tasks.
At the moment, the following sub commands exist:
db – subcommands to build binary (protobuf) database files
seqvars – subcommands for processing sequence (aka small/SNV/indel) variants
seqvars ingest – convert single VCF file into internal format for use with seqvars query
seqvars query – perform sequence variant filtration and on-the-fly annotation
seqvars prefilter – limit the result of seqvars prefilter by population frequency and/or distance to exon
seqvars aggregate – read through multiple VCF files written by seqvars ingest and computes a carrier counts table.
strucvars – subcommands for processing structural (aka large variants, CNVs, etc.) variants
strucvars ingest – convert one or more structural variant files for use with strucvars query
strucvars aggregate – compile per-case structural variant into an in-house database, to be converted to .bin with strucvars txt-to-bin.
strucvars txt-to-bin – convert text files downloaded by varfish-db-downloader to binary for fast use in strucvars query commands
strucvars query – perform structural variant filtration and on-the-fly annotation
Overall Design
For running queries, the worker tool is installed into the VarFish Server image and are run as executables.
Internally, VarFish server works on VCF files stored in an S3 storage.
For import, the user gives the server access to the VCF files to import.
The server will then use the worker executable to ingest the data into the internal format using {seqvars,strucvars} ingest.
These files are then stored in the internal S3 storage.
For queries, the server will create a query JSON file and then pass this query JSON file together with the internal file to the worker executable.
The worker will create a result file that can be directly imported by the server to be displayed to the user.
Future versions may provide persistently running HTTP/REST servers that provide functionality without startup cost.
The seqvars ingest Command
This command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying.
The command interprets the following fields which are written out by the commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.
FORMAT/GT – genotype
the following GT values are written out as 0/0, 0/1, 1/0, 1/1, 0|0, 0|1, 1|0, 1|1, ./., .|., .
no combination of no-call (.) and called allele is written out
FORMAT/GQ – genotype quality
FORMAT/DP – total read coverage
FORMAT/AD – allelic depth, one value per allele (including reference0)
FORMAT/PS – physical phasing information as written out by GATK HaplotypeCaller in GVCF workflow and Dragen variant caller
FORMAT/SQ – “somatic quality” for each alternate allele, as written out by Illumina Dragen variant caller
this field will be written as FORMAT/GQ
The seqvars ingest command will annotate the variants with the following information:
The command will emit one output line for each variant allele from the input and each affected gene.
That is, if two variant alleles affect two genes, four records will be written to the output file.
The annotation will be written out for one highest impact.
Overall, the command will emit the following header rows in addition to the ##contig=<ID=.,length=.> lines.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=gnomad_exomes_an,Number=1,Type=Integer,Description="Number of alleles in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_genomes_an,Number=1,Type=Integer,Description="Number of alleles in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD genomes">
##INFO=<ID=helix_an,Number=1,Type=Integer,Description="Number of alleles in HelixMtDb">
##INFO=<ID=helix_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in HelixMtDb">
##INFO=<ID=helix_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in HelixMtDb">
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=<ID=varfish-server-worker,Version=x.y.z>
##x-varfish-version=<ID=orig-caller,Name=Dragen,Version=SW: 07.021.624.3.10.9, HW: 07.021.624>
##x-varfish-version=<ID=orig-caller,Name=GatkHaplotypeCaller,Version=4.4.0.0>
[!NOTE]
The gnomad-mtDNA information is written to the INFO/gnomdad_genome_* fields.
[!NOTE]
Future versions of the worker will annotate the worst effect on a MANE select or MANE Clinical transcript.
The seqvars prefilter Command
This file takes as the input a file created by seqvars ingest and filters the variants by population frequency and/or distance to exon.
You can pass the prefilter criteria as JSON on the command line corresponding to the following Rust structs:
struct PrefilterParams {
/// Path to output file.
pub path_out: String,
/// Maximal allele population frequency.
pub max_freq: f64,
/// Maximal distance to exon.
pub max_dist: i32,
}
You can either specify the parameters on the command line directly or pass a path to a JSONL file starting with @.
You can mix both ways.
$ varfish-server-worker strucvars prefilter \
--path-input INPUT.vcf \
--params '{"path_out": "out.vcf", "max_freq": 0.01, "max_dist": 100}' \
[--params ...] \
# OR
$ varfish-server-worker strucvars prefilter \
--path-input INPUT.vcf \
--params @path/to/params.json \
[--params ...] \
## The `seqvars aggregate` Command
This command reads through multiple files written by `seqvars ingest` and computes a in-house carrier counts table.
You can specify the VCF files individually on the command line or pass in files that have paths to the VCF files line by line.
The resulting table is a folder to a RocksDB database.
```shell session
varfish-server-worker seqvars aggregate \
--genome-build {grch37,grch38} \
--path-out-rocksdb rocksdb/folder \
--path-in-vcf path/to/vcf.gz \
--path-in-vcf @path/to/file/list.txt
The seqvars query Command
This command perform the querying of sequence variants and further annotation using annonars databases.
The strucvars ingest Command
This command takes as the input one or more VCF files from structural variant callers and converts it into a file for further querying.
The command supports the following variant callers and can guess the caller from the VCF header and first record.
Delly2
Dragen-SV (equivalent to Manta)
Dragen-CNV
GATK gCNV
Manta
MELT
PopDel
Sniffles2
One record will be written out for each variant, each with a single alternate allele.
The following symbolic ALT alleles are used:
<DEL>
<DUP>
<INS>
<INV>
VCF break-end syntax, e.g., T[chr1:5[
The following INFO fields are written:
IMPRECISE – flag that specifies that this is an imprecise variant
END – end position of the variants
SVTYPE – type of the variant, one of <DEL>, <DUP>, <INS>, <INV>, BND
SVLEN – absolute length of the SV for linear variants, . for non-linear variants
SVCLAIM – specificaton of D (change in abundance), J (novel junction), or DJ (both change in abundance and novel junction)
callers – (non-standard field), list of callers that called the variant
chr2 – (non-standard field), second chromosome for BND variants
annsv – (non-standard field), annotation of the variant effect on each affected gene
The annsv field is a pipe-character (|) separated list of the following fields:
symbolic alternate alele, e.g., <DEL>
effects on the gene’s transcript, separated by &
transcript_variant – variant affects the whole transcript
exon_variant – variant affects exon
splice_region_variant – variant affects splice region
intron_variant – variant affects only intron
upstream_variant – variant upsream of gene
downstream_variant – variant downstream of gene
intergenic_variant – default for “no gene affected”, but never written
HGNC gene symbol, e.g., BRCA1
HGNC gene ID, e.g., HGNC:1100
The following FORMAT fields are written:
GT – (standard field) genotype, if applicable
GQ – (standard field) genotype quality, if applicable
pec – total coverage with paired-end reads
pev – paired-end reads supporting the variant
src – total coverage with split reads
srv – split reads supporting the variant
amq – average mapping quality over the variant
cn – copy number of the variant in the sample
anc – average normalized coverage over the variant in the sample
pc – point count (windows/targets/probes)
Overall, the command will emit the following header rows in addition to the ##contig=<ID=.,length=.> lines.
##fileformat=VCFv4.4
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively">
##INFO=<ID=callers,Number=.,Type=String,Description="Callers that called the variant">
##INFO=<ID=chr2,Number=1,Type=String,Description="Second chromosome, if not equal to CHROM">
##INFO=<ID=annsv,Number=1,Type=String,Description="Effect annotations: 'Allele | Annotation | Gene_Name | Gene_ID'">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=pec,Number=1,Type=Integer,Description="Total coverage with paired-end reads">
##FORMAT=<ID=pev,Number=1,Type=Integer,Description="Paired-end reads supporting the variant">
##FORMAT=<ID=src,Number=1,Type=Integer,Description="Total coverage with split reads">
##FORMAT=<ID=srv,Number=1,Type=Integer,Description="Split reads supporting the variant">
##FORMAT=<ID=amq,Number=1,Type=Float,Description="Average mapping quality over the variant">
##FORMAT=<ID=cn,Number=1,Type=Integer,Description="Copy number of the variant in the sample">
##FORMAT=<ID=anc,Number=1,Type=Float,Description="Average normalized coverage over the variant in the sample">
##FORMAT=<ID=pc,Number=1,Type=Integer,Description="Point count (windows/targets/probes)">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=CNV,Description="Copy Number Variation">
##ALT=<ID=INV,Description="Inversion">
##fileDate=20230421
##x-varfish-genome-build=GRCh37
##SAMPLE=<ID=index,Sex="Male",Disease="Affected">
##SAMPLE=<ID=father,Sex="Male",Disease="Unaffected">
##SAMPLE=<ID=mother,Sex="Female",Disease="Unaffected">
##PEDIGREE=<ID=index,Father="father",Mother="mother">
##PEDIGREE=<ID=father>
##PEDIGREE=<ID=mother>
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=<ID=varfish-server-worker,Version="x.y.z">
##x-varfish-version=<ID=Delly,Name="Delly",Version="1.1.3">
[!NOTE]
The strucvars ingest step does not perform any annotation.
It only merges the input VCF files from multiple callers (all files must have the same samples) and converts them into the internal format.
The INFO/annsv field is filled by strucvars query.
The strucvars aggregate Command
Import multiple files created by strucvars ingest into a database that can be convered to .bin with strucvars txt-to-bin and then used by strucvars query.
You can specify the files individually.
Paths starting with an at (@) character are interpreted as files with lists of paths.
You can mix paths with @ and without.
[env]
ROCKSDB_LIB_DIR = "/usr/lib/" # in case of the system package manager, adjust the path accordingly for conda
SNAPPY_LIB_DIR = "/usr/lib/" # same as above
to .cargo/config.toml or set the environment variables ROCKSDB_LIB_DIR and SNAPPY_LIB_DIR to the appropriate paths:
By default, the environment variables are defined in the .cargo/config.toml as described above, i.e. may need adjustments if not using the system package manager.
VarFish Server Worker
This repository contains the worker used by VarFish Server to execute certain background task. They are written in the Rust programming language to speed up the execution of certain tasks. At the moment, the following sub commands exist:
db– subcommands to build binary (protobuf) database filesseqvars– subcommands for processing sequence (aka small/SNV/indel) variantsseqvars ingest– convert single VCF file into internal format for use withseqvars queryseqvars query– perform sequence variant filtration and on-the-fly annotationseqvars prefilter– limit the result ofseqvars prefilterby population frequency and/or distance to exonseqvars aggregate– read through multiple VCF files written byseqvars ingestand computes a carrier counts table.strucvars– subcommands for processing structural (aka large variants, CNVs, etc.) variantsstrucvars ingest– convert one or more structural variant files for use withstrucvars querystrucvars aggregate– compile per-case structural variant into an in-house database, to be converted to.binwithstrucvars txt-to-bin.strucvars txt-to-bin– convert text files downloaded by varfish-db-downloader to binary for fast use instrucvars querycommandsstrucvars query– perform structural variant filtration and on-the-fly annotationOverall Design
For running queries, the worker tool is installed into the VarFish Server image and are run as executables. Internally, VarFish server works on VCF files stored in an S3 storage.
For import, the user gives the server access to the VCF files to import. The server will then use the worker executable to ingest the data into the internal format using
{seqvars,strucvars} ingest. These files are then stored in the internal S3 storage.For queries, the server will create a query JSON file and then pass this query JSON file together with the internal file to the worker executable. The worker will create a result file that can be directly imported by the server to be displayed to the user.
Future versions may provide persistently running HTTP/REST servers that provide functionality without startup cost.
The
seqvars ingestCommandThis command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying. The command interprets the following fields which are written out by the commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.
FORMAT/GT– genotypeGTvalues are written out as0/0,0/1,1/0,1/1,0|0,0|1,1|0,1|1,./.,.|.,..) and called allele is written outFORMAT/GQ– genotype qualityFORMAT/DP– total read coverageFORMAT/AD– allelic depth, one value per allele (including reference0)FORMAT/PS– physical phasing information as written out by GATK HaplotypeCaller in GVCF workflow and Dragen variant callerFORMAT/SQ– “somatic quality” for each alternate allele, as written out by Illumina Dragen variant callerFORMAT/GQThe
seqvars ingestcommand will annotate the variants with the following information:Gene_Nameis writen as HGNC symbolGene_IDis written as HGNC IDThe command will emit one output line for each variant allele from the input and each affected gene. That is, if two variant alleles affect two genes, four records will be written to the output file. The annotation will be written out for one highest impact.
Overall, the command will emit the following header rows in addition to the
##contig=<ID=.,length=.>lines.The
seqvars prefilterCommandThis file takes as the input a file created by
seqvars ingestand filters the variants by population frequency and/or distance to exon. You can pass the prefilter criteria as JSON on the command line corresponding to the following Rust structs:You can either specify the parameters on the command line directly or pass a path to a JSONL file starting with
@. You can mix both ways.The
seqvars queryCommandThis command perform the querying of sequence variants and further annotation using annonars databases.
The
strucvars ingestCommandThis command takes as the input one or more VCF files from structural variant callers and converts it into a file for further querying. The command supports the following variant callers and can guess the caller from the VCF header and first record.
One record will be written out for each variant, each with a single alternate allele.
The following symbolic
ALTalleles are used:<DEL><DUP><INS><INV>T[chr1:5[The following
INFOfields are written:IMPRECISE– flag that specifies that this is an imprecise variantEND– end position of the variantsSVTYPE– type of the variant, one of<DEL>,<DUP>,<INS>,<INV>,BNDSVLEN– absolute length of the SV for linear variants,.for non-linear variantsSVCLAIM– specificaton ofD(change in abundance),J(novel junction), orDJ(both change in abundance and novel junction)callers– (non-standard field), list of callers that called the variantchr2– (non-standard field), second chromosome for BND variantsannsv– (non-standard field), annotation of the variant effect on each affected geneThe
annsvfield is a pipe-character (|) separated list of the following fields:<DEL>&transcript_variant– variant affects the whole transcriptexon_variant– variant affects exonsplice_region_variant– variant affects splice regionintron_variant– variant affects only intronupstream_variant– variant upsream of genedownstream_variant– variant downstream of geneintergenic_variant– default for “no gene affected”, but never writtenBRCA1HGNC:1100The following
FORMATfields are written:GT– (standard field) genotype, if applicableGQ– (standard field) genotype quality, if applicablepec– total coverage with paired-end readspev– paired-end reads supporting the variantsrc– total coverage with split readssrv– split reads supporting the variantamq– average mapping quality over the variantcn– copy number of the variant in the sampleanc– average normalized coverage over the variant in the samplepc– point count (windows/targets/probes)Overall, the command will emit the following header rows in addition to the
##contig=<ID=.,length=.>lines.The
strucvars aggregateCommandImport multiple files created by
strucvars ingestinto a database that can be convered to.binwithstrucvars txt-to-binand then used bystrucvars query. You can specify the files individually. Paths starting with an at (@) character are interpreted as files with lists of paths. You can mix paths with@and without.The
strucvars txt-to-binCommandConvert output of varfish-db-downloader to a directory with databases to be used by query commands such as
strucvars query.The
strucvars queryCommandRun a query on a VCF file with structural variants as created by
strucvars ingestusing a varfish worker database.The worker database has the following structure. Note that also mehari transcripts are read, thus the
mehari/directory is included.Developer Information
This section is only relevant for developers of
varfish-server-worker.Development Setup
You will also need to have git LFS installed to get the test databases.
You will need a recent version of protocolbuffers, e.g.:
For running protolint, install it as python package
protolint-bin:Building from scratch
To reduce compile times, we recommend using a pre-built version of
rocksdb, either from the system package manager or e.g. viaconda:In either case, either add
to
.cargo/config.tomlor set the environment variablesROCKSDB_LIB_DIRandSNAPPY_LIB_DIRto the appropriate paths:By default, the environment variables are defined in the
.cargo/config.tomlas described above, i.e. may need adjustments if not using the system package manager.To build the project, run:
To install the project locally, run: