SVJedi : Genotyping structural variations with long read data
Note [June 2023]: SVJedi has been replaced by SVJedi-graph, a newer version that is faster and improves the genotyping quality especially for close and overlapping SVs.
SVJedi is a structural variation (SV) genotyper for long read data.
Using a sequence representation of the different alleles, it estimates the genotype of each variant in a given individual sample from allele-specific alignment counts.
SVJedi takes as input a variant file (VCF), a reference genome (fasta) and a long read file (fasta/fastq) and
outputs the initial variant file with an additional column containing genotyping information (VCF).
SVJedi processes deletions, insertions, inversions and translocations.
SVJedi is organized in three main steps:
1. Generate representative allele sequences for the set of SVs given in a VCF file
2. Map reads to the previously generated allele sequences using Minimap2
3. Genotype the SVs and output a VCF file
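The genotyping step can be illustrated with a minimal sketch. It assumes a simple binomial model with a fixed per-read error rate; the function name `genotype_from_counts`, the `err` value and the model itself are illustrative assumptions, not necessarily what SVJedi implements.

```python
from math import comb

def genotype_from_counts(ref_count, alt_count, err=0.05):
    """Return the most likely genotype ('0/0', '0/1' or '1/1').

    Hypothetical helper: assumes a read supports the alternative allele
    with probability err (0/0), 0.5 (0/1) or 1 - err (1/1).
    """
    n = ref_count + alt_count
    if n == 0:
        return "./."  # no informative read: genotype left undetermined
    alt_prob = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}
    likelihoods = {
        gt: comb(n, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for gt, p in alt_prob.items()
    }
    # Pick the genotype with the highest binomial likelihood
    return max(likelihoods, key=likelihoods.get)
```

With balanced counts such as `genotype_from_counts(5, 5)`, the heterozygous genotype has the highest likelihood under this model.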
Jedi comes from the Breton verb jediñ [‘ʒeːdɪ], which means "to calculate".
Note: Chromosome names in reference.fasta and in set_of_sv.vcf must be the same.
Also, the SVTYPE tag must be present in the VCF (SVTYPE=DEL or SVTYPE=INS or SVTYPE=INV or SVTYPE=BND).
More details are given in SV representation in VCF.
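The two requirements in this note can be checked before running SVJedi. The sketch below is not part of SVJedi; `fasta_names` and `check_vcf` are hypothetical helper names, and the code assumes a tab-separated VCF with the standard eight fixed columns.

```python
def fasta_names(fasta_lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }

def check_vcf(vcf_lines, ref_names):
    """Return a list of problems found in the VCF records."""
    problems = []
    allowed = {"DEL", "INS", "INV", "BND"}
    for n, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            problems.append(f"line {n}: fewer than 8 columns")
            continue
        chrom, info = fields[0], fields[7]
        if chrom not in ref_names:
            problems.append(f"line {n}: chromosome {chrom!r} not in reference")
        # INFO is a ';'-separated list of KEY=VALUE tags (flags are ignored)
        tags = dict(t.split("=", 1) for t in info.split(";") if "=" in t)
        if tags.get("SVTYPE") not in allowed:
            problems.append(f"line {n}: missing or unsupported SVTYPE")
    return problems
```

Both functions take any iterable of lines, so they can be called on open file handles for reference.fasta and set_of_sv.vcf.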
The folder Data/HG002_son includes an example of 20 SVs (10 insertions and 10 deletions) to genotype on a subsample of a real human dataset of the Ashkenazim son HG002.
Example command line:
python3 svjedi.py -v Data/HG002_son/HG002_20SVs_Tier1_v0.6_PASS.vcf -a Data/HG002_son/reference_at_breakpoints.fasta -i Data/HG002_son/PacBio_reads_set.fastq.gz -o Data/HG002_son/genotype_results.vcf
Note: Genotyping results in Data/HG002_son/expected_genotype_results.vcf were obtained using Minimap2 version 2.17-r941.
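To compare a run against the expected results, a small script along these lines may help. It is a sketch that assumes the genotype is stored under the `GT` key of the FORMAT and first sample columns, as in standard VCF; `read_genotypes` and `compare_genotypes` are hypothetical names, not SVJedi functions.

```python
def read_genotypes(vcf_lines):
    """Map (CHROM, POS, ID) to the GT value of the first sample column."""
    genotypes = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        f = line.rstrip("\n").split("\t")
        if len(f) < 10:
            continue  # no FORMAT/sample columns on this record
        keys = f[8].split(":")
        values = f[9].split(":")
        if "GT" in keys:
            genotypes[(f[0], f[1], f[2])] = values[keys.index("GT")]
    return genotypes

def compare_genotypes(obtained, expected):
    """Return the sorted keys whose genotype differs or is missing."""
    return sorted(k for k in expected if obtained.get(k) != expected[k])
```

Calling `read_genotypes` on the opened Data/HG002_son/genotype_results.vcf and Data/HG002_son/expected_genotype_results.vcf files and passing both dictionaries to `compare_genotypes` lists the variants that disagree. Differences may also come from using another Minimap2 version.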
The folder Data/C_elegans includes an example of 12 SVs (deletions, insertions, inversions and translocations) to genotype with a small synthetic read dataset on a subset of the Caenorhabditis elegans genome.
Soft-clipping length allowed to consider a semi-global alignment (default 100 bp)
-ladj
Length of sequences adjacent to each end of breakpoints (default 5,000 bp)
-d/--data
Type of sequencing data, either ont or pb (default pb)
-t/--threads
Number of threads for mapping
-h/--help
Show help
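To give an idea of what the -ladj parameter controls, here is a minimal sketch of allele sequences built for a deletion with adjacent sequences of length `ladj` on each side. This is an illustration under stated assumptions (0-based, end-exclusive coordinates, a hypothetical `deletion_alleles` helper); the sequences SVJedi actually generates may be organized differently.

```python
def deletion_alleles(chrom_seq, start, end, ladj=5000):
    """Build illustrative allele sequences for a deletion of chrom_seq[start:end].

    Hypothetical helper using 0-based, end-exclusive coordinates.
    """
    left = chrom_seq[max(0, start - ladj):start]   # up to ladj bases before the deletion
    right = chrom_seq[end:end + ladj]              # up to ladj bases after the deletion
    deleted = chrom_seq[start:end]
    ref_allele = left + deleted + right   # deleted segment still present
    alt_allele = left + right             # flanks joined at the breakpoint
    return ref_allele, alt_allele
```

A larger `ladj` gives reads more sequence to align to around each breakpoint, at the cost of longer allele sequences.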
SV representation in VCF
The following information is required for SVJedi to genotype each SV type. All variants must have the CHROM and POS fields defined, and the chromosome names must be the same in reference.fasta and in set_of_sv.vcf. Additional information is then required depending on the SV type:
Deletion
Either ALT field is <DEL> or INFO field must contain SVTYPE=DEL
INFO field must contain either END=pos (with pos being the end position of the deleted segment) or SVLEN=len (with len being the size of the deletion) tags
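As an illustration of how the two alternatives relate, the sketch below derives the end position of a deletion from either tag. It assumes the usual VCF convention that END equals POS plus the deletion length; `deletion_interval` is a hypothetical helper, not an SVJedi function.

```python
def deletion_interval(pos, info):
    """Return (pos, end) of a deletion from POS and the INFO string."""
    # INFO is a ';'-separated list of KEY=VALUE tags (flags are ignored)
    tags = dict(t.split("=", 1) for t in info.split(";") if "=" in t)
    if "END" in tags:
        end = int(tags["END"])
    elif "SVLEN" in tags:
        # SVLEN is often written as a negative number for deletions
        end = pos + abs(int(tags["SVLEN"]))
    else:
        raise ValueError("a deletion needs END= or SVLEN= in the INFO field")
    return pos, end
```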
Insertion
INFO field must contain SVTYPE=INS
ALT field must contain the sequence of the insertion
Inversion
Either ALT field is <INV> or INFO field must contain SVTYPE=INV
INFO field must contain END=pos tag, with pos being the second breakpoint position
Translocation
INFO field must contain SVTYPE=BND and CHR2= and END= tags
CHR2 name and sequence must be in the reference genome fasta file
ALT field must be formatted as t[chr:pos[, t]chr:pos], ]chr:pos]t or [chr:pos[t, with chr and pos indicating the second breakpoint position and the bracket directions indicating which parts of the two chromosomes should be joined together
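The four ALT forms listed for translocations can be recognized with a regular expression. The sketch below is an illustration only: `parse_bnd_alt` and the returned dictionary keys are hypothetical, and the pattern assumes the `t` part is made of nucleotide letters.

```python
import re

BND_ALT = re.compile(
    r"^(?P<prefix>[ACGTNacgtn]*)"          # bases before the bracket (t[...[ and t]...] forms)
    r"(?P<bracket>[\[\]])"                 # opening bracket, '[' or ']'
    r"(?P<chrom>[^\[\]]+):(?P<pos>\d+)"    # second breakpoint: chr:pos
    r"(?P=bracket)"                        # same bracket character again
    r"(?P<suffix>[ACGTNacgtn]*)$"          # bases after the bracket (]...]t and [...[t forms)
)

def parse_bnd_alt(alt):
    """Split a breakend ALT string into its components."""
    m = BND_ALT.match(alt)
    # Exactly one of prefix/suffix must carry the bases
    if m is None or bool(m["prefix"]) == bool(m["suffix"]):
        raise ValueError(f"not a valid breakend ALT: {alt!r}")
    return {
        "chrom": m["chrom"],
        "pos": int(m["pos"]),
        "bracket": m["bracket"],
        "bases_first": bool(m["prefix"]),
    }
```

The `bracket` and `bases_first` values together identify which of the four forms was used, and therefore which parts of the two chromosomes are joined.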
SVJedi-graph is available at https://github.com/SandraLouise/SVJedi-graph
Requirements
Installation
SVJedi is also distributed as a Bioconda package:
Parameters
SVJedi has two different usages: from non-aligned reads or from aligned reads (PAF format).
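For readers unfamiliar with PAF, the sketch below parses the 12 mandatory tab-separated columns of a PAF record as produced by Minimap2. The `PafRecord` type and `parse_paf_line` function are illustrative names, not part of SVJedi, and optional tag columns are ignored.

```python
from typing import NamedTuple

class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int    # 0-based
    query_end: int
    strand: str         # '+' or '-'
    target_name: str
    target_length: int
    target_start: int   # 0-based
    target_end: int
    matches: int        # number of matching bases
    block_length: int   # alignment block length
    mapq: int           # mapping quality

def parse_paf_line(line):
    """Parse the 12 mandatory columns of one PAF line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF record needs at least 12 columns")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )
```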
Reference
SVJedi: Genotyping structural variations with long reads. Lecompte L, Peterlongo P, Lavenier D, Lemaitre C. Bioinformatics 2020, 36(17):4568–4575 doi:10.1093/bioinformatics/btaa527 (bioRxiv preprint)
Contact
SVJedi is a Genscale tool developed by Lolita Lecompte. For any bug report or feedback, please use the Issues form on GitHub.