VarFish Server Worker

[!NOTE] This repository contains code that runs inside a VarFish Server. If you are looking to run your own VarFish Server, look here at bihealth/varfish-server.

This repository contains the worker used by VarFish Server to execute certain background task. They are written in the Rust programming language to speed up the execution of certain tasks. At the moment, the following sub commands exist:

db – subcommands to build binary (protobuf) database files
seqvars – subcommands for processing sequence (aka small/SNV/indel) variants
- seqvars ingest – convert single VCF file into internal format for use with seqvars query
- seqvars query – perform sequence variant filtration and on-the-fly annotation
- seqvars prefilter – limit the result of seqvars prefilter by population frequency and/or distance to exon
- seqvars aggregate – read through multiple VCF files written by seqvars ingest and computes a carrier counts table.
strucvars – subcommands for processing structural (aka large variants, CNVs, etc.) variants
- strucvars ingest – convert one or more structural variant files for use with strucvars query
- strucvars aggregate – compile per-case structural variant into an in-house database, to be converted to .bin with strucvars txt-to-bin.
- strucvars txt-to-bin – convert text files downloaded by varfish-db-downloader to binary for fast use in strucvars query commands
- strucvars query – perform structural variant filtration and on-the-fly annotation

Overall Design

For running queries, the worker tool is installed into the VarFish Server image and are run as executables. Internally, VarFish server works on VCF files stored in an S3 storage.

For import, the user gives the server access to the VCF files to import. The server will then use the worker executable to ingest the data into the internal format using {seqvars,strucvars} ingest. These files are then stored in the internal S3 storage.

For queries, the server will create a query JSON file and then pass this query JSON file together with the internal file to the worker executable. The worker will create a result file that can be directly imported by the server to be displayed to the user.

Future versions may provide persistently running HTTP/REST servers that provide functionality without startup cost.

The `seqvars ingest` Command

This command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying. The command interprets the following fields which are written out by the commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.

FORMAT/GT – genotype
- the following GT values are written out as 0/0, 0/1, 1/0, 1/1, 0|0, 0|1, 1|0, 1|1, ./., .|., .
- no combination of no-call (.) and called allele is written out
FORMAT/GQ – genotype quality
FORMAT/DP – total read coverage
FORMAT/AD – allelic depth, one value per allele (including reference0)
FORMAT/PS – physical phasing information as written out by GATK HaplotypeCaller in GVCF workflow and Dragen variant caller
FORMAT/SQ – “somatic quality” for each alternate allele, as written out by Illumina Dragen variant caller
- this field will be written as FORMAT/GQ

The seqvars ingest command will annotate the variants with the following information:

gnomAD genomes and exomes allele frequencies
gnomAD-mtDNA and HelixMtDb allele frequencies
functional annotation following the VCF ANN field standard
- Gene_Name is writen as HGNC symbol
- Gene_ID is written as HGNC ID

The command will emit one output line for each variant allele from the input and each affected gene. That is, if two variant alleles affect two genes, four records will be written to the output file. The annotation will be written out for one highest impact.

Overall, the command will emit the following header rows in addition to the ##contig=<ID=.,length=.> lines.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=gnomad_exomes_an,Number=1,Type=Integer,Description="Number of alleles in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_genomes_an,Number=1,Type=Integer,Description="Number of alleles in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD genomes">
##INFO=<ID=helix_an,Number=1,Type=Integer,Description="Number of alleles in HelixMtDb">
##INFO=<ID=helix_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in HelixMtDb">
##INFO=<ID=helix_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in HelixMtDb">
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=<ID=varfish-server-worker,Version=x.y.z>
##x-varfish-version=<ID=orig-caller,Name=Dragen,Version=SW: 07.021.624.3.10.9, HW: 07.021.624>
##x-varfish-version=<ID=orig-caller,Name=GatkHaplotypeCaller,Version=4.4.0.0>

[!NOTE] The gnomad-mtDNA information is written to the INFO/gnomdad_genome_* fields.

[!NOTE] Future versions of the worker will annotate the worst effect on a MANE select or MANE Clinical transcript.

The `seqvars prefilter` Command

This file takes as the input a file created by seqvars ingest and filters the variants by population frequency and/or distance to exon. You can pass the prefilter criteria as JSON on the command line corresponding to the following Rust structs:

struct PrefilterParams {
    /// Path to output file.
    pub path_out: String,
    /// Maximal allele population frequency.
    pub max_freq: f64,
    /// Maximal distance to exon.
    pub max_dist: i32,
}

You can either specify the parameters on the command line directly or pass a path to a JSONL file starting with @. You can mix both ways.

$ varfish-server-worker strucvars prefilter \
    --path-input INPUT.vcf \
    --params '{"path_out": "out.vcf", "max_freq": 0.01, "max_dist": 100}' \
    [--params ...] \

# OR

$ varfish-server-worker strucvars prefilter \
    --path-input INPUT.vcf \
    --params @path/to/params.json \
    [--params ...] \

## The `seqvars aggregate` Command

This command reads through multiple files written by `seqvars ingest` and computes a in-house carrier counts table.
You can specify the VCF files individually on the command line or pass in files that have paths to the VCF files line by line.
The resulting table is a folder to a RocksDB database.

```shell session
varfish-server-worker seqvars aggregate \
    --genome-build {grch37,grch38} \
    --path-out-rocksdb rocksdb/folder \
    --path-in-vcf path/to/vcf.gz \
    --path-in-vcf @path/to/file/list.txt

The `seqvars query` Command

This command perform the querying of sequence variants and further annotation using annonars databases.

The `strucvars ingest` Command

This command takes as the input one or more VCF files from structural variant callers and converts it into a file for further querying. The command supports the following variant callers and can guess the caller from the VCF header and first record.

Delly2
Dragen-SV (equivalent to Manta)
Dragen-CNV
GATK gCNV
Manta
MELT
PopDel
Sniffles2

One record will be written out for each variant, each with a single alternate allele.

The following symbolic ALT alleles are used:

<DEL>
<DUP>
<INS>
<INV>
VCF break-end syntax, e.g., T[chr1:5[

The following INFO fields are written:

IMPRECISE – flag that specifies that this is an imprecise variant
END – end position of the variants
SVTYPE – type of the variant, one of <DEL>, <DUP>, <INS>, <INV>, BND
SVLEN – absolute length of the SV for linear variants, . for non-linear variants
SVCLAIM – specificaton of D (change in abundance), J (novel junction), or DJ (both change in abundance and novel junction)
callers – (non-standard field), list of callers that called the variant
chr2 – (non-standard field), second chromosome for BND variants
annsv – (non-standard field), annotation of the variant effect on each affected gene

The annsv field is a pipe-character (|) separated list of the following fields:

symbolic alternate alele, e.g., <DEL>
effects on the gene’s transcript, separated by &
- transcript_variant – variant affects the whole transcript
- exon_variant – variant affects exon
- splice_region_variant – variant affects splice region
- intron_variant – variant affects only intron
- upstream_variant – variant upsream of gene
- downstream_variant – variant downstream of gene
- intergenic_variant – default for “no gene affected”, but never written
HGNC gene symbol, e.g., BRCA1
HGNC gene ID, e.g., HGNC:1100

The following FORMAT fields are written:

GT – (standard field) genotype, if applicable
GQ – (standard field) genotype quality, if applicable
pec – total coverage with paired-end reads
pev – paired-end reads supporting the variant
src – total coverage with split reads
srv – split reads supporting the variant
amq – average mapping quality over the variant
cn – copy number of the variant in the sample
anc – average normalized coverage over the variant in the sample
pc – point count (windows/targets/probes)

Overall, the command will emit the following header rows in addition to the ##contig=<ID=.,length=.> lines.

##fileformat=VCFv4.4
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the longest variant described in this record">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively">
##INFO=<ID=callers,Number=.,Type=String,Description="Callers that called the variant">
##INFO=<ID=chr2,Number=1,Type=String,Description="Second chromosome, if not equal to CHROM">
##INFO=<ID=annsv,Number=1,Type=String,Description="Effect annotations: 'Allele | Annotation | Gene_Name | Gene_ID'">
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=pec,Number=1,Type=Integer,Description="Total coverage with paired-end reads">
##FORMAT=<ID=pev,Number=1,Type=Integer,Description="Paired-end reads supporting the variant">
##FORMAT=<ID=src,Number=1,Type=Integer,Description="Total coverage with split reads">
##FORMAT=<ID=srv,Number=1,Type=Integer,Description="Split reads supporting the variant">
##FORMAT=<ID=amq,Number=1,Type=Float,Description="Average mapping quality over the variant">
##FORMAT=<ID=cn,Number=1,Type=Integer,Description="Copy number of the variant in the sample">
##FORMAT=<ID=anc,Number=1,Type=Float,Description="Average normalized coverage over the variant in the sample">
##FORMAT=<ID=pc,Number=1,Type=Integer,Description="Point count (windows/targets/probes)">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=CNV,Description="Copy Number Variation">
##ALT=<ID=INV,Description="Inversion">
##fileDate=20230421
##x-varfish-genome-build=GRCh37
##SAMPLE=<ID=index,Sex="Male",Disease="Affected">
##SAMPLE=<ID=father,Sex="Male",Disease="Unaffected">
##SAMPLE=<ID=mother,Sex="Female",Disease="Unaffected">
##PEDIGREE=<ID=index,Father="father",Mother="mother">
##PEDIGREE=<ID=father>
##PEDIGREE=<ID=mother>
##x-varfish-case-uuid=d2bad2ec-a75d-44b9-bd0a-83a3f1331b7c
##x-varfish-version=<ID=varfish-server-worker,Version="x.y.z">
##x-varfish-version=<ID=Delly,Name="Delly",Version="1.1.3">

[!NOTE] The strucvars ingest step does not perform any annotation. It only merges the input VCF files from multiple callers (all files must have the same samples) and converts them into the internal format. The INFO/annsv field is filled by strucvars query.

The `strucvars aggregate` Command

Import multiple files created by strucvars ingest into a database that can be convered to .bin with strucvars txt-to-bin and then used by strucvars query. You can specify the files individually. Paths starting with an at (@) character are interpreted as files with lists of paths. You can mix paths with @ and without.

$ varfish-server-worker strucvars aggregate \
    --genome-release {Grch37,Grch38} \
    --path-output OUT.tsv \
    --path-input IN/file1.vcf.gz \
    [--path-input IN/file1.vcf.gz] \

# OR:

$ varfish-server-worker db mk-inhouse \
    --genome-release {Grch37,Grch38} \
    --path-output OUT.tsv \
    --path-input @IN/path-list.txt \
    [--path-input @IN/path-list2.txt]

The `strucvars txt-to-bin` Command

Convert output of varfish-db-downloader to a directory with databases to be used by query commands such as strucvars query.

$ varfish-server-worker strucvars txt-to-bin \
    --input-type {ClinvarSv,StrucvarInhouse,...} \
    --path-input IN.txt \
    --path-output DST.bin

The `strucvars query` Command

Run a query on a VCF file with structural variants as created by strucvars ingest using a varfish worker database.

$ varfish-server-worker strucvars query \
    --genome-release grch37 \
    --path-db path/to/worker-db \
    --path-input IN.vcf.gz \
    --path-output OUT.jsonl

The worker database has the following structure. Note that also mehari transcripts are read, thus the mehari/ directory is included.

mehari/
    {genome_release}/
        txs.bin.zst
worker/
    noref/
        genes/
            acmg.tsv                -- ACMG SF list genes
            mim2gene.tsv            -- OMIM to NCBI mapping from clingen
            xlink.bin               -- gene crosslinks
    {genome_release}/               -- one per genome release
        features/                   -- features important for annotation
            masked_repeat.bin       -- masked repeats
            masked_seqdup.bin       -- masked segmental duplications
        strucvars/                  -- structural variant specific
            bgdbs/                  -- background databases
                dbvar.bin           -- dbVar
                dgv.bin             -- DGV
                dgv_gs.bin          -- DGV gold standard
                g1k.bin             -- 1000 genomes CNVs
                gnomad_exomes.bin   -- gnomAD-exomes/ExAC SVs
                gnomad_genomes.bin  -- gnomAD-genomes SVs
            clinvar.bin             -- ClinVar SVs
            inhouse.bin             -- inhouse SV database
            patho_mms.bed           -- well-known pathogenic DELs/DUPs
        tads/
            hesc.bed                -- hESC TAD definitions

Developer Information

This section is only relevant for developers of varfish-server-worker.

Development Setup

You will also need to have git LFS installed to get the test databases.

You will need a recent version of protocolbuffers, e.g.:

# bash utils/install-protoc.sh
# export PATH=$PATH:$HOME/.local/share/protoc/bin

For running protolint, install it as python package protolint-bin:

# virtualenv /tmp/varfish-server-worker
# source /tmp/varfish-server-worker/bin/activate
# pip install protolint-bin

Building from scratch

To reduce compile times, we recommend using a pre-built version of rocksdb, either from the system package manager or e.g. via conda:

# Ubuntu
sudo apt-get install librocksdb-dev

# Conda
conda install -c conda-forge rocksdb

In either case, either add

[env]
ROCKSDB_LIB_DIR = "/usr/lib/" # in case of the system package manager, adjust the path accordingly for conda
SNAPPY_LIB_DIR = "/usr/lib/"  # same as above

to .cargo/config.toml or set the environment variables ROCKSDB_LIB_DIR and SNAPPY_LIB_DIR to the appropriate paths:

export ROCKSDB_LIB_DIR=/usr/lib/
export SNAPPY_LIB_DIR=/usr/lib/

By default, the environment variables are defined in the .cargo/config.toml as described above, i.e. may need adjustments if not using the system package manager.

To build the project, run:

cargo build --release

To install the project locally, run:

cargo install --path .

VarFish Server Worker

Overall Design

The seqvars ingest Command

The seqvars prefilter Command

The seqvars query Command

The strucvars ingest Command

The strucvars aggregate Command

The strucvars txt-to-bin Command

The strucvars query Command