vcfbub

install with bioconda

popping bubbles in vg deconstruct VCFs

overview

The VCF produced by a command like vg deconstruct -e -a -H '#' ... includes information about the nesting of variants. With -a (--all-snarls), we obtain not just the top-level bubbles but all nested ones. This exposed snarl tree information can be used to filter the VCF down to a set of non-overlapping sites. (N.b. a “snarl” is a generic model of graph bubbles, including tips and loops.)
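To make the nesting information concrete, here is a minimal Python sketch that pulls the level and parent of a site out of one VCF data line. It assumes the nesting is carried in the INFO tags LV (level) and PS (parent snarl), and that top-level sites have no PS tag. The record shown is made up for illustration and is not real vg output, and `nesting_tags` is a hypothetical helper name.

```python
# Illustrative only: a made-up VCF data line, not real vg deconstruct output.
record = "chr1\t1000\t>1>6\tA\tAT,G\t60\t.\tAC=1,1;LV=1;PS=>1>9\tGT\t0|1"


def nesting_tags(vcf_line):
    """Return (site_id, level, parent_id) for one VCF data line.

    Assumes INFO carries LV (level) and, for nested sites, PS (parent snarl).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    site_id, info = fields[2], fields[7]

    # INFO is a ';'-separated list of KEY=VALUE entries (flags have no '=').
    tags = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        tags[key] = value

    level = int(tags["LV"])   # raises KeyError if the VCF has no nesting info
    parent = tags.get("PS")   # None for sites without a parent
    return site_id, level, parent


print(nesting_tags(record))
```

A site with LV=0 and no PS tag would come back with a parent of `None`, which is how the later sketches in this document recognise top-level sites.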

vcfbub lets us do two common operations on these VCFs:

  1. We can filter sites by maximum level in the snarl tree. For instance, --max-level 0 would keep only sites with LV=0. In practice, vg’s snarl finder ensures that these are sites rooted on the main linear axis of the pangenome graph. Sites at higher levels are nested within larger variants.
  2. We can filter sites by maximum allele size, either for the reference allele or for any allele. For example, --max-ref-length 10000 would keep only sites where the reference allele is less than 10 kb long. Setting --max-ref-length or --max-allele-length additionally ensures that the output contains the bubbles nested inside any popped bubble, even if they are at a level greater than --max-level.
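The size test in the second operation can be sketched as follows. This is not vcfbub's code: `exceeds_limits` and its parameters are hypothetical names, alleles are assumed to be explicit sequences (no symbolic alleles such as `<DEL>`), and treating a length exactly equal to the limit as acceptable is an assumption.

```python
def exceeds_limits(ref, alts, max_ref_length=None, max_allele_length=None):
    """Return True if a site is too large under the given limits.

    ref  -- reference allele sequence
    alts -- list of alternate allele sequences
    A limit of None means that limit is not applied.
    """
    # Reference-only limit: look at the REF allele alone.
    if max_ref_length is not None and len(ref) > max_ref_length:
        return True

    # Any-allele limit: look at the longest of REF and all ALTs.
    if max_allele_length is not None:
        longest = max(len(allele) for allele in [ref] + list(alts))
        if longest > max_allele_length:
            return True

    return False


# A long insertion passes a reference-length limit but not an any-allele limit.
print(exceeds_limits("A", ["A" * 12], max_ref_length=10))      # False
print(exceeds_limits("A", ["A" * 12], max_allele_length=10))   # True
```

The two limits differ only in which alleles they inspect, which is why a large insertion relative to the reference is caught by the any-allele limit alone.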

vcfbub accomplishes a simple task: it keeps the sites that are children of those we “pop” because of their size. Such oversized bubbles occur around complex large SVs, such as multi-Mbp inversions and segmental duplications. We often need to remove them, as they provide little information for many downstream applications, such as haplotype panels or other imputation references.
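The popping behaviour described above can be sketched on a toy snarl tree. This is one reading of the description, not vcfbub's actual implementation: the rule used here is that a site is kept if it is not popped itself and is either within the level limit or a direct child of a popped site. The function name, the tuple layout, and the site ids are all hypothetical.

```python
def pop_bubbles(sites, max_level, popped_ids):
    """Return the ids of the sites to keep.

    sites      -- iterable of (site_id, level, parent_id) tuples
    max_level  -- highest snarl tree level kept unconditionally
    popped_ids -- set of ids of sites judged too large
    """
    kept = []
    for site_id, level, parent in sites:
        if site_id in popped_ids:
            continue  # the oversized bubble itself is removed
        # Keep sites within the level limit, plus children of popped bubbles.
        if level <= max_level or parent in popped_ids:
            kept.append(site_id)
    return kept


# Toy snarl tree with made-up ids.
sites = [
    ("big", 0, None),          # large top-level bubble, popped
    ("big.a", 1, "big"),       # child of the popped bubble, kept
    ("big.a.x", 2, "big.a"),   # nested in a kept site, dropped
    ("small", 0, None),        # ordinary top-level site, kept
    ("small.a", 1, "small"),   # nested in a kept site, dropped
]

print(pop_bubbles(sites, max_level=0, popped_ids={"big"}))
# ['big.a', 'small']
```

Under this rule no kept site is nested inside another kept site, which matches the goal of a non-overlapping site set.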

usage

This removes all non-top-level variant sites (-l 0) unless they are inside variants with a reference length > 10 kb (-r 10000):

vcfbub -l 0 -r 10000 var.vcf >filt.vcf
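
about

A post-processing tool for graph-genome variants: it filters the nested variant records produced by vg deconstruct to obtain a non-overlapping set of VCF sites.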