vcf2pandas is a Python package that converts VCF files to pandas dataframes.
Install
pip install vcf2pandas
Dependencies
pandas (2.1.0)
pysam (0.22.1)
Usage
Selecting all columns (default behaviour)
from vcf2pandas import vcf2pandas
import pandas
df = vcf2pandas("path_to_vcf.vcf")
Remove all empty columns
Sometimes there will be INFO or FORMAT fields declared in the header that none of the variants or samples actually have. You can choose to remove all of these columns from the pandas dataframe.
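To make the idea concrete, here is a minimal pandas sketch of what removing empty columns amounts to. It does not call vcf2pandas: the dataframe is hand-built, the helper name drop_empty_columns is hypothetical, and it assumes that an unused field shows up as the string . in every row.

```python
import pandas as pd


def drop_empty_columns(df: pd.DataFrame, missing: str = ".") -> pd.DataFrame:
    """Return a copy of df without columns whose every cell equals `missing`."""
    # A column counts as "empty" when all of its values are the placeholder.
    is_empty = df.astype(str).eq(missing).all(axis=0)
    return df.loc[:, ~is_empty].copy()


# Hand-built stand-in for a converted VCF (hypothetical values).
df = pd.DataFrame({
    "CHROM": ["chr1", "chr1"],
    "POS": [100, 200],
    "INFO:DP": ["35", "."],      # populated for at least one variant: kept
    "INFO:GENE": [".", "."],     # declared but never populated: dropped
})

cleaned = drop_empty_columns(df)
print(list(cleaned.columns))
```

Columns that are only partly missing, such as INFO:DP above, are kept; only columns that are missing for every variant are dropped.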
[!NOTE]
You do not need to make everything a list or everything a dictionary; you can mix and match defaults, lists and dictionaries for info_fields, sample_list and format_fields.
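The three accepted forms (default, list, dictionary) can be thought of as different ways of writing the same "original name to displayed name" mapping. The sketch below illustrates that idea only; normalise_selection is a hypothetical helper, not part of vcf2pandas, and the display name "Genotype" is made up.

```python
from typing import Dict, List, Optional, Union


def normalise_selection(
    selection: Optional[Union[List[str], Dict[str, str]]],
    available: List[str],
) -> Dict[str, str]:
    """Map each selected name to the name it should be displayed under."""
    if selection is None:
        # Default: keep every available name, unchanged.
        return {name: name for name in available}
    if isinstance(selection, dict):
        # Dictionary: keys select, values rename.
        return dict(selection)
    # List: select in the given order, without renaming.
    return {name: name for name in selection}


# Mixing the three forms in one call is fine: each argument is handled on its own.
info = normalise_selection(["DP", "QA"], available=["DP", "MQM", "QA"])
samples = normalise_selection(None, available=["HG002"])
fmt = normalise_selection({"GT": "Genotype"}, available=["GT", "DP"])
print(info, samples, fmt)
```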
Custom column ordering
vcf2pandas can select custom/specific:
INFO fields
samples
FORMAT fields
The selected columns are then ordered to match the input list.
For example, the following list:
info_fields = ["DP", "MQM", "QA"]
produces these columns, in that order:
INFO:DP INFO:MQM INFO:QA
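The relationship between the input list and the resulting headings can be sketched in a few lines of plain Python. The function info_columns is a hypothetical illustration of the naming and ordering rule described above, not code from the package.

```python
from typing import List


def info_columns(info_fields: List[str]) -> List[str]:
    """Build INFO column headings in the same order as the input list."""
    return [f"INFO:{field}" for field in info_fields]


info_fields = ["DP", "MQM", "QA"]
columns = info_columns(info_fields)
print(" ".join(columns))  # INFO:DP INFO:MQM INFO:QA

# Changing the order of the list changes the order of the columns.
reordered = info_columns(["QA", "DP"])
print(" ".join(reordered))  # INFO:QA INFO:DP
```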
Output
INFO and FORMAT headings
INFO:INFO_FIELD e.g. INFO:DP
FORMAT:SAMPLE_NAME:FORMAT_FIELD e.g. FORMAT:HG002:GT
The info field, format field and sample names can also be mapped to custom values by using a dictionary. See Renaming custom columns and samples.
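Because the headings follow the INFO:INFO_FIELD and FORMAT:SAMPLE_NAME:FORMAT_FIELD patterns, they can be split back into their parts when post-processing a dataframe. The sketch below is an illustration only: Heading and parse_heading are hypothetical names, and it assumes that field and sample names do not themselves contain a colon.

```python
from typing import NamedTuple, Optional


class Heading(NamedTuple):
    kind: str               # "INFO" or "FORMAT"
    sample: Optional[str]   # sample name, only set for FORMAT headings
    field: str              # INFO or FORMAT field name


def parse_heading(heading: str) -> Optional[Heading]:
    """Split an INFO/FORMAT heading into parts; return None for other columns."""
    parts = heading.split(":")
    if parts[0] == "INFO" and len(parts) == 2:
        return Heading("INFO", None, parts[1])
    if parts[0] == "FORMAT" and len(parts) == 3:
        return Heading("FORMAT", parts[1], parts[2])
    return None


print(parse_heading("INFO:DP"))
print(parse_heading("FORMAT:HG002:GT"))
print(parse_heading("CHROM"))
```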
INFO or FORMAT fields not present for some variants
When an INFO or FORMAT field is not present for a variant, vcf2pandas inserts a . in that cell instead. For example, in vcf3_all.txt the INFO:GENE column has . for the first 7 variants.
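When working with such columns downstream, the . placeholder usually needs handling before any numeric analysis. The following sketch uses a hand-built dataframe with made-up values, and assumes the cells are stored as strings; pandas.to_numeric with errors="coerce" turns anything non-numeric, including ., into NaN.

```python
import pandas as pd

# Hand-built stand-in for a converted VCF (hypothetical values).
df = pd.DataFrame({
    "INFO:DP": ["35", ".", "12"],
    "INFO:GENE": [".", ".", "BRCA1"],
})

# Convert depth to numbers; the "." placeholder becomes NaN.
depth = pd.to_numeric(df["INFO:DP"], errors="coerce")

# Flag the variants where the gene annotation is missing.
missing_gene = df["INFO:GENE"].eq(".")

print(depth.mean())          # NaN values are skipped by default
print(missing_gene.sum())    # number of variants without INFO:GENE
```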
Examples
Example vcf and output files (dataframes as a .txt file) are available in examples/
Selecting custom columns and samples
Renaming custom columns and samples
From v0.2.0, renaming column and sample names is supported. Simply pass a dictionary instead of a list with your name mapping.
Example Usage
To print to a text file:
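One way to do this is sketched below, assuming df is the dataframe returned by vcf2pandas. So that the snippet runs without a VCF file, df is replaced here by a small hand-built stand-in, and the output path is a hypothetical location in the system temporary directory. DataFrame.to_string renders the whole table as text without truncating rows or columns.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the dataframe returned by vcf2pandas (hypothetical values).
df = pd.DataFrame({"CHROM": ["chr1"], "POS": [100], "INFO:DP": ["35"]})

# Hypothetical output location; replace with your own path.
output_path = os.path.join(tempfile.gettempdir(), "vcf2pandas_output.txt")

with open(output_path, "w", encoding="utf-8") as handle:
    # to_string() renders every row and column of the dataframe.
    handle.write(df.to_string())
```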
For more examples, see tests/run_examples.py.
To recreate the examples in the examples/ folder, run:
Changelog
v0.1.0
v0.1.1
v0.1.2
0.22.1.
v0.2.0
. if not all samples/variants had all the info/format values.
Issues
Please open an issue if you encounter any problems! Thanks!