Data management of whole-genome sequence variant calls with hundreds of thousands of individuals: genotypic data (e.g., SNVs, indels and structural variation calls) and annotations in SeqArray GDS files are stored in an array-oriented and compressed manner, with efficient data access using the R programming language.
The SeqArray package is built on top of the Genomic Data Structure (GDS) data format and defines the required data structure for a SeqArray file. GDS is a flexible and portable data container with a hierarchical structure for storing multiple scalable array-oriented data sets. It is suited to large-scale datasets, especially data that are much larger than the available random-access memory. It also offers efficient operations specifically designed for integers of less than 8 bits, since a diploid genotype usually occupies fewer bits than a byte. Data compression and decompression are available with relatively efficient random access. A high-level R interface to GDS files is available in the package `gdsfmt`.
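To make the sub-byte storage concrete, here is a minimal sketch that uses the `gdsfmt` R interface directly rather than SeqArray. The node name `toy_geno` and the toy matrix are hypothetical and chosen only for illustration; the values are restricted to 0..3 so that each fits in 2 bits.

```R
library(gdsfmt)

# temporary file so the sketch leaves nothing behind
fn <- tempfile(fileext = ".gds")
f <- createfn.gds(fn)

# hypothetical toy data: a 2 x 4 integer matrix with values in 0..3
geno <- matrix(c(0L, 1L, 2L, 3L, 2L, 1L, 0L, 0L), nrow = 2)

# storage = "bit2" packs each value into 2 bits instead of a full byte
node <- add.gdsn(f, "toy_geno", val = geno, storage = "bit2")

# read back only the third column (start and count are given per dimension)
col3 <- read.gdsn(node, start = c(1, 3), count = c(2, 1))

closefn.gds(f)
unlink(fn)
```

Reading with `start` and `count` pulls only the requested slice from the file, which is what makes arrays larger than memory workable.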
## Citation

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.
## Installation (requires R ≥ 3.5.0)

Bioconductor repository:

```R
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SeqArray")
```
Development version from GitHub (for developers/testers only):

The `install_github()` approach requires that you build from source, i.e. `make` and compilers must be installed on your system (see the R FAQ for your operating system); you may also need to install dependencies manually.
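The command this note refers to is not shown here. As a hedged sketch only: it assumes `install_github()` is provided by the `devtools` package and that the development repositories are `zhengxwen/gdsfmt` and `zhengxwen/SeqArray` on GitHub. Both are assumptions for illustration, not a statement of the maintainers' recommended procedure.

```R
# assumption: devtools supplies install_github(); install it if missing
if (!requireNamespace("devtools", quietly=TRUE))
    install.packages("devtools")

# assumption: repository names; gdsfmt goes first because SeqArray depends on it
devtools::install_github("zhengxwen/gdsfmt")
devtools::install_github("zhengxwen/SeqArray")
```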
## Examples
```R
library(SeqArray)
gds.fn <- seqExampleFileName("gds")
# open a GDS file
f <- seqOpen(gds.fn)
# display the contents of the GDS file
f
# close the file
seqClose(f)
```
SeqArray: Data management of large-scale whole-genome sequence variant calls using GDS files
Bioconductor:
Release Version: v1.50.1
http://www.bioconductor.org/packages/SeqArray
To install gdsfmt and SeqArray from the source code:

```sh
wget --no-check-certificate https://github.com/zhengxwen/gdsfmt/tarball/master -O gdsfmt_latest.tar.gz
wget --no-check-certificate https://github.com/zhengxwen/SeqArray/tarball/master -O SeqArray_latest.tar.gz
R CMD INSTALL gdsfmt_latest.tar.gz
R CMD INSTALL SeqArray_latest.tar.gz
```

Or

```sh
curl -L https://github.com/zhengxwen/gdsfmt/tarball/master/ -o gdsfmt_latest.tar.gz
curl -L https://github.com/zhengxwen/SeqArray/tarball/master/ -o SeqArray_latest.tar.gz
R CMD INSTALL gdsfmt_latest.tar.gz
R CMD INSTALL SeqArray_latest.tar.gz
```
## Key Functions in the SeqArray Package
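As an illustration of how a few of the package's functions fit together, the sketch below opens the bundled example file, applies a filter, and reads data for the selected subset. The choice of functions and the subset sizes (5 variants, 3 samples) are arbitrary and made only for this example; it is not a complete list of the package's functions.

```R
library(SeqArray)

f <- seqOpen(seqExampleFileName("gds"))

# sample and variant identifiers stored in the file
sample.id  <- seqGetData(f, "sample.id")
variant.id <- seqGetData(f, "variant.id")

# restrict subsequent reads to the first 5 variants and the first 3 samples
seqSetFilter(f, variant.id = variant.id[1:5], sample.id = sample.id[1:3],
    verbose = FALSE)

# genotypes are returned as an array: allele x sample x variant
geno <- seqGetData(f, "genotype")

# reference allele frequency for each selected variant
af <- seqAlleleFreq(f)

# clear the filter and close the file
seqResetFilter(f, verbose = FALSE)
seqClose(f)
```

Setting the filter once and then calling `seqGetData()` repeatedly keeps reads limited to the selected samples and variants.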
## File Format Conversion
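A minimal sketch of one conversion, from VCF to a SeqArray GDS file, using the example VCF shipped with the package. The output path is a temporary file, and `storage.option = "ZIP_RA"` is chosen here only as an assumption for portability; other conversion functions and options are not shown.

```R
library(SeqArray)

# example VCF bundled with the package; output goes to a temporary file
vcf.fn <- seqExampleFileName("vcf")
gds.fn <- tempfile(fileext = ".gds")

# convert VCF to SeqArray GDS
seqVCF2GDS(vcf.fn, gds.fn, storage.option = "ZIP_RA", verbose = FALSE)

# open the result and count the imported variants
f <- seqOpen(gds.fn)
n.variant <- length(seqGetData(f, "variant.id"))
seqClose(f)

unlink(gds.fn)
```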
## SeqArray GDS File Downloads

## See Also