目录

piawka

Conda Downloads

Calculate SNP-based population statistics over groups of samples in VCF files with:

  • indexable BED output
  • correct handling of missing data
  • support for polyploid variant calls
  • higher data yield due to per-group ALT-agnostic SNP retrieval
  • a broad selection of statistics, extensible with modules
  • convenient helper tools for making genomic windows, filtering and summarizing the results
  • the power of GNU AWK: no installation, competitive speed, low memory footprint, and multiprocessing

[!WARNING] piawka is under development. At this stage, breaking changes are not unthinkable of. If something does not seem to work well, check newer versions and do not hesitate to file an issue!

Installation

conda install -c bioconda piawka

Alternatively, have the following programs available in the command line and clone the repo:

  • gawk>=v5.2.0
  • tabix
  • bgzip
git clone https://github.com/novikovalab/piawka.git
export PATH="$( realpath ./piawka ):${PATH}"

Usage

Docs are available at https://novikovalab.github.io/piawka.

Input and output

Mandatory (for piawka calc):

  • VCF file – bgzipped and tabixed

Optional:

  • groups file – 2-column TSV with sample ID and group ID (may include relevant samples only)
  • regions/targets file – BED file to restrict/split output by regions

Output is a BED file:

$ cd piawka/examples
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz -b genes.bed -g groups.tsv -s pi,dxy
#chr        start     end       locus      pop1              pop2              stat    value        numerator  denominator
scaffold_1  10035093  10035276  AL5G20950  CESiberia_2n      LE_2n             dxy     0.0071137    460        64664
scaffold_1  10035093  10035276  AL5G20950  PUWS_4n           .                 pi      0.00588993   640        108660
scaffold_1  10035093  10035276  AL5G20950  LE_2n             PUWS_4n           dxy     0.00881262   1102       125048
scaffold_1  10035093  10035276  AL5G20950  LE_2n             .                 pi      0.00772461   1078       139554
...

Subcommands

  • piawka calc: calculate various population statistics from a VCF file
  • piawka dist: convert calc output to PHYLIP or NEXUS distance matrix
  • piawka filt: filter piawka output using AWK expressions
  • piawka list: show all statistics available for calculation
  • piawka sum: summarize stats from calc output across regions
  • piawka win: prepare genomic windows from various sources

Statistics

Within groups:

  • lines: number of lines used in calculation
  • miss: share of missing genotype calls
  • pi: expected heterozygosity = nucleotide diversity
  • maf: minor allele frequency
  • daf: alternative (“derived”) allele frequency
  • tajima: Tajima’s D
  • tajimalike: Tajima’s D interpolated for missing genotypes (experimental)
  • theta_w: Watterson’s theta
  • theta_low: Theta estimator based on sites with 0<allele_freq<0.33
  • theta_mid: Theta estimator based on sites with 0.33<=allele_freq<0.66
  • theta_high: Theta estimator based on sites with 0.33<=allele_freq<0.66

Between groups (pairwise):

  • afd: average allele frequency difference
  • dxy: absolute nucleotide divergence
  • fst: fixation index, Hudson’s estimator
  • fstwc: fixation index, Weir & Cockerham’s estimator
  • rho: Ronfort’s rho
  • nei: Nei’s D standard genetic distance

Citation

First mention of piawka as well as the test data are coming from https://doi.org/10.1093/molbev/msaf153.

关于

用于在多倍体 VCF 文件中计算 π、Dxy、Fst 及其他群体遗传统计量,并支持混合倍性群体分析的 awk 脚本工具。

83.5 MB
邀请码
    Gitlink(确实开源)
  • 加入我们
  • 官网邮箱:gitlink@ccf.org.cn
  • QQ群
  • QQ群
  • 公众号
  • 公众号

版权所有:中国计算机学会技术支持:开源发展技术委员会
京ICP备13000930号-9 京公网安备 11010802032778号