MungeSumstats: Standardise the format of GWAS summary statistics
Authors: Alan Murphy, Brian Schilder and Nathan Skene
Updated: Aug-07-2024
Introduction
The MungeSumstats package is designed to facilitate the
standardisation of GWAS summary statistics.
Overview
The package is designed to handle the lack of standardisation of output
files by the GWAS community. The MRC IEU Open
GWAS team have provided full summary
statistics for >10k GWAS, which are API-accessible via the
ieugwasr and
gwasvcf packages. However, these GWAS
are standardised only in the sense that they are in VCF format; they can be
fully standardised with MungeSumstats.
MungeSumstats provides a framework to standardise the format for any
GWAS summary statistics, including those in VCF format, enabling
downstream integration and analysis. It addresses the most common
discrepancies across summary statistic files, and offers a range of
adjustable Quality Control (QC) steps.
Citation
If you use MungeSumstats, please cite the original authors of the GWAS
as well as:
Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) MungeSumstats:
A Bioconductor package for the standardisation and quality control of
many GWAS summary statistics. Bioinformatics, btab665,
https://doi.org/10.1093/bioinformatics/btab665
Installing MungeSumstats
MungeSumstats is available on
Bioconductor. To
install MungeSumstats from Bioconductor, run:
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install("MungeSumstats")
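Once installation has finished, the package can be loaded in the usual way. This is a minimal sketch that assumes the installation above completed without errors:

```r
# Attach MungeSumstats so its functions and bundled data are available
library(MungeSumstats)
```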
Note that a number of the checks employed by MungeSumstats use a
reference genome. If your GWAS summary statistics file of
interest relates to GRCh38, you will need to install
SNPlocs.Hsapiens.dbSNP155.GRCh38 and BSgenome.Hsapiens.NCBI.GRCh38
from Bioconductor as follows:
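Assuming BiocManager is already installed as shown earlier, the two GRCh38 packages named here could be installed with calls along these lines:

```r
# Reference data for GRCh38 (large downloads; may take some time)
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
```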
If your GWAS summary statistics file of interest relates to GRCh37,
you will need to install SNPlocs.Hsapiens.dbSNP155.GRCh37 and
BSgenome.Hsapiens.1000genomes.hs37d5 from Bioconductor as follows:
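Likewise, assuming BiocManager is already installed, the two GRCh37 packages named here could be installed with:

```r
# Reference data for GRCh37 (large downloads; may take some time)
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```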
These may take some time to install and are not included in the package
as some users may only need one of GRCh37/GRCh38. If you are unsure
of the genome build, MungeSumstats can also infer this information from
your data.
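As a rough sketch of build inference, the example below calls get_genome_builds() on the example file bundled with the package. It assumes that the eduAttainOkbay.txt example file is present under extdata and that the reference packages for both builds are installed, since the inference is assumed to compare the data against each of them:

```r
library(MungeSumstats)

# Example summary statistics file shipped with MungeSumstats
# (assumed location: extdata/eduAttainOkbay.txt)
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
                                 package = "MungeSumstats")

# get_genome_builds() takes a named list of summary statistics
# (file paths here) and returns the inferred build for each entry
sumstats_list <- list(ss1 = eduAttainOkbayPth)
ref_genomes <- get_genome_builds(sumstats_list = sumstats_list)
print(ref_genomes)

# Alternatively, leaving ref_genome = NULL in format_sumstats() is assumed
# to trigger the same inference as part of formatting:
# reformatted <- format_sumstats(path = eduAttainOkbayPth, ref_genome = NULL)
```

If the build is already known, passing it explicitly avoids the extra inference step.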
See the OpenGWAS vignette
website
for information on how to use MungeSumstats to access, standardise and
perform quality control on GWAS Summary Statistics from the MRC IEU
Open GWAS Project.
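A minimal sketch of that workflow is shown below, using find_sumstats() and import_sumstats(). The trait keywords, sample-size threshold and the study ID "ieu-a-298" are illustrative choices only, and the example assumes an internet connection and whatever OpenGWAS authentication the current API requires:

```r
library(MungeSumstats)

# Search the OpenGWAS metadata for studies matching trait keywords
# (keywords and threshold are illustrative)
metagwas <- find_sumstats(traits = c("parkinson", "alzheimer"),
                          min_sample_size = 1000)
head(metagwas)

# Download and standardise one study by its OpenGWAS ID
# ("ieu-a-298" is used purely as an example ID)
datasets <- import_sumstats(ids = "ieu-a-298", ref_genome = "GRCh37")
```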
If you have any problems, please file an
issue on
GitHub.
Future Enhancements
The MungeSumstats package aims to handle the most common
summary statistic file formats, including VCF. If your file cannot be
formatted by MungeSumstats, feel free to report an
Issue on GitHub
along with your summary statistics file header.
We also encourage people to edit the code to resolve their particular
issues and are happy to incorporate such changes through pull requests on
GitHub. If your summary statistic file headers are not recognised by
MungeSumstats but correspond to one of
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON,
NSTUDY, INFO or FRQ,
feel free to update data("sumstatsColHeaders") following the
approach in the data.R file and add your mapping. Then open a Pull
Request on GitHub and
we will incorporate this change into the package.
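Before opening a pull request, a new mapping can be tried out locally. The sketch below assumes that sumstatsColHeaders is a two-column table with columns named Uncorrected and Corrected, and that format_sumstats() accepts a custom table through its mapping_file argument. The header MY_PVALUE and the file name are hypothetical:

```r
library(MungeSumstats)

# Load the built-in header mapping and inspect its structure
data("sumstatsColHeaders", package = "MungeSumstats")
head(sumstatsColHeaders)

# Append a hypothetical unrecognised header and map it to the standard P column
custom_mapping <- rbind(
  sumstatsColHeaders,
  data.frame(Uncorrected = "MY_PVALUE", Corrected = "P")
)

# Use the extended mapping when formatting (file name is a placeholder):
# reformatted <- format_sumstats(path = "my_sumstats.tsv",
#                                mapping_file = custom_mapping)
```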
Contributors
We would like to acknowledge all those who have contributed to
MungeSumstats development:
Getting started
See the Getting started vignette website for up-to-date instructions on usage.
Note that there is also a docker image for MungeSumstats.