vcfsim is a command-line tool for generating simulated VCF’s (variant call format files for encoding genetic data). It combines a coalescent simulation backend (msprime) with clean and efficient postprocessing to produce a wide variety of biologically realistic VCFs, with parameterized levels of missing data. Realistic VCF’s can now be easily simulated with just a few command line arguments!
–samples_file [SAMPLES_FILE] File containing one whitespace separated line of custom sample names
Optional
–chromosome [CHROMOSOME] Chromosome name/label
–replicates [REPLICATES] Number of replicate VCFs to produce (with varying seeds)
–sequence_length [SEQUENCE_LENGTH] Length of the chromosome to be simulated, in basepairs
–ploidy [PLOIDY] Ploidy for your VCF
–Ne [NE] Effective population size of the simulated population(s)
–mu [MU] Mutation rate in the simulated population(s)
–output_file [OUTPUT_FILE] Filename of outputed vcf, will automatically be followed by seed
–chromosome_file [CHROMOSOME_FILE] Specified file for multiple chromosome inputs (see below for details)
–population_mode [1|2] Number of populations simulate. 1 = single population (default), 2 = two populations with a shared history (C splits into A & B).
–div_time [TIME] Split divergence time for population mode 2 (e.g. generations before present). Required if –population_mode 2 is specified.
Instead of specifying –percent_missing_sites [PERCENT_MISSING_SITES], which produces uniform deterministic missingness across sites, we also provide a more advanced option that uses a Hidden Markov Model (HMM). By leaving –percent_missing_sites blank, users may instead provide the following parameters to allow site-level missingness to be spatially clustered (see bwlow for details). In the case that this option is used rather than the –percent_missing_sites parameter, all four HMM parameters must be specified:
–hmm_baseline [HMM_BASELINE] Baseline probability of site missingness when in a low-missing (good) state
–hmm_multiplier [HMM_MULTIPLIER] Multiplier applied to the baseline missingness probability when in a high-missing (bad) state
–hmm_p_low_to_high [HMM_P_LOW_TO_HIGH] Probability of switching from a low-missing state to a high-missing state
–hmm_p_high_to_low [HMM_P_HIGH_TO_LOW] Probability of switching from a high-missing state to a low-missing state
This will create a VCV with the name “myvcf1234.vcf”, i.e. “myvcf” followed by the seed given for the input. If input for replicates were requested higher number than 1, 2 for example, then vcfsim will create two output files by the name of myvcf1234 and myvcf1235, adding one to the seed after every run.
NOTE: An output file doesn’t needed to be specified. If no output file is specified, then the vcf will be redirected to STDOUT.
Screenshot of output file:
Using Hidden Markov Model Parameters
Instead of specifying --percent_missing_sites, which produces uniform deterministic missingness across sites, you can instead specify the following Hidden Markov Model (HMM) parameters to introduce more spatially clustered missing data. Note that when missingness is introduced in this manner, the exact proportion of missing sites is stochastic and will vary between runs, but this missingness can still be approximately determined based on the chosen parameter values.
All four HMM parameters must be specified together when using this option.
To do this, sites transition between low-missing (good) and high-missing (bad) states according to a two-state Markov process. Sites in bad regions have a higher probability of being missing.
Using custom sample names
Instead of --sample_size, you can provide explicit sample names:
Another way vcfsim can be used is by providing a file for multiple chromosome inputs.
Your input file should be in the form of a text file, and should be formatted as such:
The columns are in the order of: chromosone, ploidy, sequence length, population size, mutation rate. Each row will represent a seperate run of vcfsim, all these runs will be concatenated to the same file in the end.
The following command should be used when running vcfsim in this way:
vcfsim
vcfsim is a command-line tool for generating simulated VCF’s (variant call format files for encoding genetic data). It combines a coalescent simulation backend (msprime) with clean and efficient postprocessing to produce a wide variety of biologically realistic VCFs, with parameterized levels of missing data. Realistic VCF’s can now be easily simulated with just a few command line arguments!
Authors
Paimon Goulart (UC Riverside), Kieran Samuk (UC Riverside)
Installation
First create and activate a conda environment for vcfsim:
vcfsim is currently available on bioconda, and can be installed by using the following command:
For more detailed installation instructions, please visit:
https://bioconda.github.io/recipes/vcfsim/README.html?highlight=vcfsi#package-package%20'vcfsim'
Arguments
Here is the list of required/optional arguments to run vcfsim
Required
–seed [SEED] Random seed for vcfsim to use
–percent_missing_sites [PERCENT_MISSING_SITES] Percent of rows missing from your VCF
–percent_missing_genotypes [PERCENT_MISSING_GENOTYPES] Percent of samples missing from your VCF
One of the following three options must also be provided to set the samples:
Optional
–chromosome [CHROMOSOME] Chromosome name/label
–replicates [REPLICATES] Number of replicate VCFs to produce (with varying seeds)
–sequence_length [SEQUENCE_LENGTH] Length of the chromosome to be simulated, in basepairs
–ploidy [PLOIDY] Ploidy for your VCF
–Ne [NE] Effective population size of the simulated population(s)
–mu [MU] Mutation rate in the simulated population(s)
–output_file [OUTPUT_FILE] Filename of outputed vcf, will automatically be followed by seed
–chromosome_file [CHROMOSOME_FILE] Specified file for multiple chromosome inputs (see below for details)
–population_mode [1|2] Number of populations simulate. 1 = single population (default), 2 = two populations with a shared history (C splits into A & B).
–div_time [TIME] Split divergence time for population mode 2 (e.g. generations before present). Required if –population_mode 2 is specified.
Instead of specifying –percent_missing_sites [PERCENT_MISSING_SITES], which produces uniform deterministic missingness across sites, we also provide a more advanced option that uses a Hidden Markov Model (HMM). By leaving –percent_missing_sites blank, users may instead provide the following parameters to allow site-level missingness to be spatially clustered (see bwlow for details). In the case that this option is used rather than the –percent_missing_sites parameter, all four HMM parameters must be specified:
–hmm_baseline [HMM_BASELINE] Baseline probability of site missingness when in a low-missing (good) state
–hmm_multiplier [HMM_MULTIPLIER] Multiplier applied to the baseline missingness probability when in a high-missing (bad) state
–hmm_p_low_to_high [HMM_P_LOW_TO_HIGH] Probability of switching from a low-missing state to a high-missing state
–hmm_p_high_to_low [HMM_P_HIGH_TO_LOW] Probability of switching from a high-missing state to a low-missing state
Usage
Typical usage for vcfsim is as follows:
This will create a VCV with the name “myvcf1234.vcf”, i.e. “myvcf” followed by the seed given for the input.
If input for replicates were requested higher number than 1, 2 for example, then vcfsim will create two output files by the name of myvcf1234 and myvcf1235, adding one to the seed after every run.
NOTE: An output file doesn’t needed to be specified. If no output file is specified, then the vcf will be redirected to STDOUT.
Screenshot of output file:
Using Hidden Markov Model Parameters
Instead of specifying
--percent_missing_sites, which produces uniform deterministic missingness across sites, you can instead specify the following Hidden Markov Model (HMM) parameters to introduce more spatially clustered missing data. Note that when missingness is introduced in this manner, the exact proportion of missing sites is stochastic and will vary between runs, but this missingness can still be approximately determined based on the chosen parameter values.All four HMM parameters must be specified together when using this option.
To do this, sites transition between low-missing (good) and high-missing (bad) states according to a two-state Markov process. Sites in bad regions have a higher probability of being missing.
Using custom sample names
Instead of
--sample_size, you can provide explicit sample names:This will automatically set the sample size to 4 and label the VCF columns
A1 B1 C1 D1.You can also read the names from a file containing a single whitespace separated line:
Where
names.txtmight contain:Otherwise, sample identifiers will default to tsk_0,…,tsk_n
Simulating a structured population split (population_mode = 2)
To simulate a demographic split between populations A and B from an ancestral population C:
Multiple chromosome inputs
Another way vcfsim can be used is by providing a file for multiple chromosome inputs.
Your input file should be in the form of a text file, and should be formatted as such:

The columns are in the order of: chromosone, ploidy, sequence length, population size, mutation rate.
Each row will represent a seperate run of vcfsim, all these runs will be concatenated to the same file in the end.
The following command should be used when running vcfsim in this way:
You can also combine a param file with custom names:
When done this way, the output should look like such:

With the concatenated vcf looking like:
