SVJedi-graph: long-read SV genotyper with a variation graph
SVJedi-graph is a structural variation (SV) genotyper for long-read data. It takes as input a variant file (VCF), a reference genome (FASTA) and a long-read file (FASTA/FASTQ), and outputs the initial variant file with an additional column containing genotyping information (VCF).
SVJedi-graph is based on a representation of the genome and the different SV alleles in a variation graph. After building this variation graph from the reference genome sequence and the input variant file, it maps the long reads to the graph using minigraph[^1]. It then estimates the genotype of each variant in a given individual sample from allele-specific alignment counts.
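The last step, going from allele-specific alignment counts to a genotype, can be illustrated with a small likelihood computation. This is a minimal sketch and not SVJedi-graph's actual code: it assumes a simple binomial model, the error rate `err` is a hypothetical value, and the function name `estimate_genotype` is made up for illustration. The `min_support` argument loosely mirrors the `--minsupport` option.

```python
import math

def estimate_genotype(ref_count, alt_count, err=0.01, min_support=3):
    """Return the most likely genotype ('0/0', '0/1' or '1/1'), or './.'.

    ref_count / alt_count: alignments supporting the reference / alternative allele.
    err: assumed probability that an alignment supports the wrong allele
         (hypothetical value, not taken from SVJedi-graph).
    min_support: minimum total number of alignments needed to genotype.
    """
    if ref_count + alt_count < min_support:
        return "./."  # not enough alignments to call a genotype

    # Probability that a single alignment supports the alternative allele,
    # for each candidate genotype.
    p_alt = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}

    def log_likelihood(p):
        # The binomial coefficient is the same for all genotypes, so it is omitted.
        return alt_count * math.log(p) + ref_count * math.log(1.0 - p)

    return max(p_alt, key=lambda gt: log_likelihood(p_alt[gt]))
```

With these assumptions, a variant supported only by reference-allele alignments is called `0/0`, and a balanced mix of both alleles is called `0/1`.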
Currently, SVJedi-graph can genotype five types of SVs: deletions, insertions, duplications, inversions and translocations (intra- and inter-chromosomal).
[^1]: Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol 21, 265 (2020). https://doi.org/10.1186/s13059-020-02168-z
Input VCF requirements
For all variants, the SVTYPE tag must be present in the INFO field (SVTYPE=DEL, SVTYPE=INS, SVTYPE=INV or SVTYPE=BND). Insertions must be sequence-resolved, with the full inserted sequence reported in the ALT field of the VCF file. As duplications are a special case of insertions, SVJedi-graph also supports duplications, as long as the duplicated sequence is reported in the same way as for insertions. More details are given in SV representation in VCF.
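Before running the genotyper, it can help to check an input VCF against these two requirements. The following is a minimal sketch under stated assumptions: `check_record` is a hypothetical helper (it is not part of SVJedi-graph), it only looks at the SVTYPE tag and at the ALT field of insertions, and it treats any ALT made only of A, C, G, T or N as sequence-resolved.

```python
import re

SUPPORTED_SVTYPES = {"DEL", "INS", "INV", "BND"}

def check_record(vcf_line):
    """Return a list of problems found in one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return ["fewer than 8 tab-separated columns"]
    alt, info = fields[4], fields[7]

    # Turn "KEY=VALUE;FLAG" into a dict; flags without a value map to "".
    tags = dict(item.partition("=")[::2] for item in info.split(";") if item)

    problems = []
    svtype = tags.get("SVTYPE")
    if svtype not in SUPPORTED_SVTYPES:
        problems.append(f"missing or unsupported SVTYPE: {svtype!r}")
    elif svtype == "INS" and not re.fullmatch(r"[ACGTNacgtn]+", alt):
        # Symbolic alleles such as <INS> carry no inserted sequence.
        problems.append("insertion ALT is not a resolved nucleotide sequence")
    return problems
```

An empty list means the record passed both checks; a symbolic `<INS>` allele, for example, is reported as a problem.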
Test with a small dataset
To check that SVJedi-graph behaves as expected on your device, you can run:
cd test-dir/
./run_test.sh
To explore the output files on a small dataset, run:
mkdir outputfiles
cd outputfiles
./../svjedi-graph.py -v ../test-dir/test.vcf -r ../test-dir/reference_genome.fasta -q ../test-dir/simulated_reads.fastq.gz -p test
Parameters
-v, --vcf: VCF file containing the set of SVs to genotype.
-r, --ref: FASTA file containing the reference genome (on which the SVs have been identified).
-q, --reads: FASTQ file containing the long reads used to genotype. If you have multiple FASTQ files for one individual, separate the filenames with a comma (,).
-p, --prefix: Prefix of the output files.
-t, --threads: Number of threads to use for the mapping step.
-ms, --minsupport: Minimum number of alignments required to genotype an SV (default: 3).
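When launching SVJedi-graph from a pipeline, the options above can be assembled programmatically. The sketch below only builds the argument list and does not run anything; `build_command` and the file names are hypothetical, while the script name and flags are those listed in this README.

```python
import shlex

def build_command(vcf, ref, reads, prefix, threads=None, min_support=None,
                  script="svjedi-graph.py"):
    """Assemble the svjedi-graph.py argument list from the options listed above."""
    cmd = [script, "-v", vcf, "-r", ref,
           "-q", ",".join(reads),  # several FASTQ files are joined with commas
           "-p", prefix]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if min_support is not None:
        cmd += ["-ms", str(min_support)]
    return cmd

# Hypothetical input files for one individual sequenced in two runs.
cmd = build_command("svs.vcf", "ref.fasta",
                    ["run1.fastq.gz", "run2.fastq.gz"], "sample1", threads=8)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the script path points to your local copy.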
Output files
Main output file:
<prefix>_genotype.vcf: Genotyped SV set in VCF format.
Intermediate output files:
<prefix>.gfa: Variation graph in GFA format.
<prefix>.gaf: Mapping results from minigraph in GAF format.
<prefix>_informative_aln.json: JSON dictionary of read supports for each allele of each input SV.
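A quick way to inspect the main output is to tally the genotypes it contains. This is a minimal sketch that assumes the genotyped VCF follows the standard layout, with a FORMAT column containing a `GT` key and a single sample column; `count_genotypes` is a hypothetical helper, not part of SVJedi-graph.

```python
from collections import Counter

def count_genotypes(vcf_lines):
    """Count the GT values of the first sample column in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        keys = fields[8].split(":")
        values = fields[9].split(":")
        if "GT" in keys:
            counts[values[keys.index("GT")]] += 1
    return counts
```

After the small test run with `-p test`, it could be called as `count_genotypes(open("test_genotype.vcf"))`.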
SV representation in VCF
SVJedi-graph needs the following information to genotype each SV type. All variants must have the CHROM and POS fields defined, and chromosome names must be identical in the reference genome file and the variant file. The SVTYPE tag must be present in the INFO field (SVTYPE=DEL, SVTYPE=INS, SVTYPE=INV or SVTYPE=BND). Additional information is then required depending on the SV type:
Deletion
INFO field must contain SVTYPE=DEL
INFO field must contain END=pos (with pos being the end position of the deleted segment)
Insertion
INFO field must contain SVTYPE=INS
ALT field must contain the sequence of the insertion
Duplication
Must be defined as an insertion event, with CHROM and POS corresponding to the position where the novel copy is inserted
INFO field must contain SVTYPE=INS
ALT field must contain the sequence of the duplication
Inversion
INFO field must contain SVTYPE=INV
INFO field must contain END=pos (with pos being the second breakpoint position)
Intra-chromosomal translocation
INFO field must contain SVTYPE=BND
ALT field must be formatted as t[pos[, t]pos], ]pos]t or [pos[t, with pos indicating the second breakpoint position and the bracket directions indicating which parts of the two chromosomes should be joined together
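The four breakend ALT forms listed for translocations can be told apart with a small parser. This is a minimal sketch, not SVJedi-graph's own parsing code: `parse_bnd_alt` is a hypothetical helper, and it keeps `pos` as an opaque string (in the VCF specification's breakend notation this is written as `chrom:position`).

```python
import re

# t[pos[ and t]pos] : the bases come before the brackets.
# ]pos]t and [pos[t : the bases come after the brackets.
_BND_ALT = re.compile(
    r"^(?P<lead>[ACGTNacgtn.]*)"
    r"(?P<br>[\[\]])(?P<pos>[^\[\]]+)(?P=br)"
    r"(?P<trail>[ACGTNacgtn.]*)$"
)

def parse_bnd_alt(alt):
    """Split a breakend ALT string into mate position, bracket and base placement."""
    m = _BND_ALT.match(alt)
    # Exactly one of the two sides must carry bases.
    if m is None or bool(m.group("lead")) == bool(m.group("trail")):
        raise ValueError(f"not a valid breakend ALT: {alt!r}")
    return {
        "mate": m.group("pos"),                 # second breakpoint position
        "bracket": m.group("br"),               # '[' or ']'
        "bases_first": bool(m.group("lead")),   # True for t[pos[ and t]pos]
    }
```

For instance, `parse_bnd_alt("A[chr2:5000[")` reports the mate `chr2:5000`, the bracket `[` and bases placed before the brackets, while a plain sequence such as `ACGT` raises a `ValueError`.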
Citation
Sandra Romain, Claire Lemaitre, SVJedi-graph: improving the genotyping of close and overlapping structural variants with long reads using a variation graph, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i270–i278, https://doi.org/10.1093/bioinformatics/btad237
Contact
SVJedi-graph is a Genscale tool developed by Sandra Romain and Claire Lemaitre. For any bug report or feedback, please use the GitHub Issues form.