Row index and query index are incrementally so the name of the sample and the precise haplotype can be calculated using the output of bcftools.
The command:
bcftools query -l <input vcf/bcf> > samples.txt
store in samples.txt all the samples name, in order. So, for example, row indices 0 and 1 corresponds to the two haplotypes of the first sample, row indices 2 and 3 to the second one etc…
Eventually you can use script/mem_sample.py:
> python mem_sample.py -h
usage: mem_sample.py [-h] [-i INPUT] [-p PANEL] [-q QUERIES] [-o OUTPUT]
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
SMEM file in Durbin's format
-p PANEL, --panel PANEL
panel as VCF/BCF (optional)
-q QUERIES, --queries QUERIES
queries as VCF/BCF (optional)
-o OUTPUT, --output OUTPUT
output file
μ-PBWT results are currently available on Bioinformatics.
Bibtex:
@article{10.1093/bioinformatics/btad552,
author = {Cozzi, Davide and Rossi, Massimiliano and Rubinacci, Simone and Gagie, Travis and Köppl, Dominik and Boucher, Christina and Bonizzoni, Paola},
title = "{μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data}",
journal = {Bioinformatics},
volume = {39},
number = {9},
pages = {btad552},
year = {2023},
month = {09},
abstract = "{The Positional Burrows–Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20\% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel.Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.}",
issn = {1367-4811},
doi = {10.1093/bioinformatics/btad552},
url = {https://doi.org/10.1093/bioinformatics/btad552},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/9/btad552/51556136/btad552.pdf},
}
μ-PBWT
A PBWT-based light index for UK Biobank scale genotype data.
Conda install
μ-PBWT is available for Gnu/Linux on conda (bioconda channel):
Build from source
Prepare the cmake for building the current project in ‘.’ into the ‘build’ folder
Build μ-PBWT:
Install from source (optional)
Install μ-PBWT (default in
/usr/local/bin/,sudorequired):Use
--prefix <path>for custom path.Usage
File format supported:
Usage: ./mupbwt [options]
Options: -i, –input_file
Query the index:
Query without save the index:
Query and save the index:
Using examples in
sample_data:Load the index and print details to stdout:
An output example is:
Input
Only bialleic case is supported. In case of vcf/bcf bcftools can be used to filter the input:
Output
Output file follow the standard proposed in Durbin’s PBWT. Each row contain a SMEM:
For example:
Row index and query index are incrementally so the name of the sample and the precise haplotype can be calculated using the output of bcftools.
The command:
store in
samples.txtall the samples name, in order. So, for example, row indices 0 and 1 corresponds to the two haplotypes of the first sample, row indices 2 and 3 to the second one etc…Eventually you can use
script/mem_sample.py:Esample:
Only one between
PANELandQUERIEScan be specified. New SMEM file will contain in each row:For example, assuming both
PANELandQUERIES:Results
Results on high-coverage whole genome sequencing data from UK Biobank (chromosome 20):
Results on 1000 Genome Project phase 3 data including the average number of runs per site:
Note that total building times are not printed due to the fact that all the computations have been done in parallel.
The pipeline for 1000 Genome Project phase 3 data is available at dlcgold/muPBWT-1KGP-workflow.
Reference
μ-PBWT results are currently available on Bioinformatics.
Bibtex: