Option 1: Local compilation

Run `sh make.sh` to compile. The executable `VCF2Dis` is placed in `bin/VCF2Dis`.

For Linux/Unix and macOS:

    tar -zxvf VCF2DisXXX.tar.gz   # if linking fails, try re-installing the zlib library
    cd VCF2DisXXX                 # and copying it into the library directory
    sh make.sh                    # VCF2Dis-xx/src/include/zlib
    ./bin/VCF2Dis

Note: If linking fails, try re-installing the zlib library.
Note: R with the packages ape, dplyr and ggtree is recommended.
Option 2: Docker container
You can use Docker to install and run VCF2Dis. Follow the steps below:
Install Docker: Ensure Docker is installed on your system. If not, install it by following the official Docker documentation.

Pull the Docker image: Use the following command to pull the VCF2Dis Docker image from the Alibaba Cloud Container Registry:

    docker pull registry.cn-shenzhen.aliyuncs.com/knight134/vcf2dis:v1.53e

Run the container: After pulling the image, run VCF2Dis in a container, referring to the image by its full registry name:

    docker run -it --rm registry.cn-shenzhen.aliyuncs.com/knight134/vcf2dis:v1.53e VCF2Dis

Option 3: Singularity container
Install Singularity: Ensure Singularity is installed on your system. If not, you can install it by following the Singularity Official Documentation.
Build the SIF File: Use the following command to build a Singularity image file (SIF) from the Docker image:

    singularity build vcf2dis_1.53e.sif docker://registry.cn-shenzhen.aliyuncs.com/knight134/vcf2dis:v1.53e

Run VCF2Dis: Once the SIF file exists, run VCF2Dis with Singularity:

    singularity exec vcf2dis_1.53e.sif VCF2Dis

Download the SIF file: Alternatively, you can download the prebuilt SIF file vcf2dis_1.53e.sif directly instead of building it. Once downloaded, run it with Singularity as above.

Main parameter description:
```sh
Usage: VCF2Dis -InPut <in.vcf> -OutPut <p_dis.mat>

    -InPut     Input one or muti GATK VCF genotype File
    -OutPut    OutPut Sample p-Distance matrix

    -InList    Input GATK muti-chr VCF Path List
    -SubPop    SubGroup SampleList of VCF File [ALLsample]
    -Rand      Probability (0-1] for each site to join Calculation [1]

    -help      Show more help [hewm2008 v1.55s]
```
For more details, please use `-help` and see the [example](https://github.com/hewm2008/VCF2Dis/blob/main/example). Further options listed by `-help` include:
```sh
    -InFormat       <str>   Input File is [VCF/FA/PHY] Format,defaut: [VCF]
    -InSampleGroup  <str>   InFile of sample Group info,format(sample groupA)
    -TreeMethod     <int>   Construct Tree Method,1:NJ-tree 2:UPGMA-tree [1]
    -KeepMF                 Keep the Middle File diff & Use matrix
```
2) Example
Three examples are provided in the directories example/Example*.

1) An example of an NJ tree without bootstrap

To create the p-distance matrix and construct an NJ tree in Newick format:

    # 1.1) To create the p-distance matrix and Newick tree for all samples in a VCF file, run VCF2Dis directly
    ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis.mat
    # ./bin/VCF2Dis -InPut in.fa.gz -OutPut p_dis.mat -InFormat FA

    # 1.2) To create the p-distance matrix and Newick tree for a subgroup of samples, put their sample names into the file sample.list
    ./bin/VCF2Dis -InPut chr1.vcf.gz chr2.vcf.gz -OutPut p_dis.mat -SubPop sample.list

    # 1.3) To create a group-level p-distance matrix and Newick tree, provide the file pop.info in the format [sample group] (v1.55)
    ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis.mat -InSampleGroup pop.info

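The sample.list and pop.info files used above can be started from the VCF header. The following is only a sketch under stated assumptions: that sample.list holds one sample name per line (the text above only says to put the sample names into the file), and that the VCF follows the standard layout in which sample names begin at column 10 of the `#CHROM` line. The function name `vcf_samples` and the group label `groupA` are hypothetical.

```sh
#!/bin/bash
# List the sample names of a VCF file (plain or gzip-compressed), one per line.
# In a standard VCF, the "#CHROM" header line carries the sample names
# from column 10 onward.
vcf_samples() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Hypothetical usage: keep the first 50 samples as a subgroup.
# vcf_samples in.vcf.gz | head -n 50 > sample.list

# Hypothetical usage: start a pop.info file ("sample group" per line) by
# assigning every listed sample to a placeholder group, then edit by hand.
# awk '{ print $1 "\t" "groupA" }' sample.list > pop.info
```

The function only reads the header, so it returns quickly even for large files.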
Simple tree visualization: after running VCF2Dis you will obtain the tree file p_dis.nwk and a neighbor-joining tree in PDF format, p_dis.pdf. For advanced tree display and annotation, please refer to iTOL, Evolview or MEGA.
2) An example of an NJ tree with bootstrap

Run VCF2Dis multiple times, using a sampling-with-replacement approach.
Users can randomly select a subset of the sites with -Rand, construct a new NJ tree as above, and repeat this NN times (NN=100 is recommended), for X = 1, 2, ..., NN:

    #!/bin/bash
    # Number of replicates; defaults to 100 and can be overridden by the first argument
    NN=100
    if [ "$#" -eq 1 ]; then
        NN=$1
    fi
    for X in $(seq 1 "$NN")
    do
        # Each site joins the calculation with probability 0.25
        ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis_${X}.mat -Rand 0.25
        # PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis_${X}.mat -outfile tree.out1_${X}.txt -matrixtype s -treetype n -outtreefile tree.out2_${X}.tre
    done

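Before merging, it may be worth confirming that every replicate actually produced a tree. The sketch below assumes that `-OutPut p_dis_${X}.mat` writes `p_dis_${X}.nwk` next to the matrix, by analogy with p_dis.mat and p_dis.nwk in the first example. The helper name `missing_replicates` is hypothetical.

```sh
#!/bin/bash
# Print the replicate indices (1..NN) whose tree file p_dis_<X>.nwk is
# missing or empty. Assumption: each run with -OutPut p_dis_<X>.mat
# leaves p_dis_<X>.nwk in the same directory.
# The body runs in a subshell so its variables do not leak into the caller.
missing_replicates() (
    nn=$1
    dir=${2:-.}
    for x in $(seq 1 "$nn"); do
        [ -s "$dir/p_dis_${x}.nwk" ] || echo "$x"
    done
)

# Hypothetical usage: list the replicates that need to be re-run.
# missing_replicates 100
```

An empty result means all replicate trees are present.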
Merge all the NJ trees, then construct and display a bootstrap NJ tree. (For advanced tree display and annotation, please refer to iTOL, Evolview and MEGA.)

    #!/bin/bash
    # Number of replicates; must match the number used when generating the trees
    NN=100
    if [ "$#" -eq 1 ]; then
        NN=$1
    fi
    cat p_*.nwk > alltree_merge.tre    # or: cat tree*.tre > alltree_merge.tre
    PHYLIPNEW-3.69.650/bin/fconsense -intreefile alltree_merge.tre -outfile out -treeprint Y
    perl ./bin/percentageboostrapTree.pl alltree_merge.treefile $NN Final_boostrap.tre    # NN is the number of input trees

For instructions on installing PHYLIPNEW, see the installation guide (also available in Chinese).
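The NN passed to percentageboostrapTree.pl presumably needs to match the number of trees that were actually merged. One thing to watch for is that a leftover p_dis.nwk from the first example also matches the pattern `p_*.nwk`. The sketch below counts the trees in the merged file, assuming each Newick tree ends with exactly one `;` and that no label contains a semicolon. The helper name `count_trees` is hypothetical.

```sh
#!/bin/bash
# Count the trees in a merged Newick file by counting ';' terminators.
count_trees() {
    tr -cd ';' < "$1" | wc -c | tr -d ' '
}

# Hypothetical usage after the merge step:
# [ "$(count_trees alltree_merge.tre)" -eq "$NN" ] || echo "tree count differs from NN" >&2
```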
4) Introduction
The formula for calculating the p-distance between individuals from a VCF SNP dataset is:

$$D_{ij} = \frac{1}{L} \sum_{l} d_{ij}^{(l)}$$

where $L$ is the length of the regions in which SNPs can be identified. Given that the alleles at position $l$ are A/C in sample $i$ and sample $j$:

$d_{ij}^{(l)} = 0.0$ if the genotypes of the two individuals are AA and AA;
$d_{ij}^{(l)} = 0.5$ if the genotypes of the two individuals are AA and AC;
$d_{ij}^{(l)} = 0.0$ if the genotypes of the two individuals are AC and AC;
$d_{ij}^{(l)} = 1.0$ if the genotypes of the two individuals are AA and CC;
$d_{ij}^{(l)} = 0.0$ if the genotypes of the two individuals are CC and CC.
To learn more about how the p-distance matrix is derived from a VCF file, please refer to this website.
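The scoring rule can be tried out on a handful of sites. This is an illustrative sketch, not the VCF2Dis implementation. It assumes biallelic, fully called diploid genotypes written as `0/0`, `0/1` or `1/1`, one pair of genotypes per line, and it takes L to be the number of input lines. The AC versus CC case is not listed above; scoring it 0.5 by symmetry is an assumption. The function name `pdist` and the two-column input layout are hypothetical.

```sh
#!/bin/bash
# Compute D_ij for two samples from lines of the form "<genotype_i> <genotype_j>".
pdist() {
    awk '
    # Number of ALT alleles (0, 1 or 2) in a genotype such as 0/1 or 1|1.
    function alt_count(gt,   a) {
        split(gt, a, "[/|]")
        return (a[1] == "1") + (a[2] == "1")
    }
    {
        diff = alt_count($1) - alt_count($2)
        if (diff < 0) diff = -diff
        sum += diff / 2      # allele-count difference 0, 1, 2 gives 0.0, 0.5, 1.0
        L++                  # assumption: every input line counts toward L
    }
    END { if (L > 0) printf "%.4f\n", sum / L }
    '
}

# The five cases listed above, one site each:
printf '%s\n' '0/0 0/0' '0/0 0/1' '0/1 0/1' '0/0 1/1' '1/1 1/1' | pdist
# D_ij = (0 + 0.5 + 0 + 1 + 0) / 5, so this prints 0.3000
```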
VCF2Dis
VCF2Dis: an ultra-fast and efficient tool to calculate pairwise genetic distance and construct population phylogeny from VCF files
The VCF2Dis article has been published in GigaScience; please cite it if possible.
PMID: 40184433, PMCID: PMC11970368, DOI: 10.1093/gigascience/giaf032
1) Installation and Parameters
New versions are released and maintained at hewm2008/VCF2Dis; please use the link below to download the latest version:
hewm2008/VCF2Dis
Note: If you cannot obtain the p_dis.nwk tree file but do have p_dis.mat, three methods are available for getting the tree file from the matrix.
5) Results
According to Google Scholar, VCF2Dis has been cited more than 170 times.
Below are some NJ tree images that I drew for earlier papers.
Tree displayed with MEGA after running VCF2Dis on the test data: VCF2Dis -i ALL.chr*.genotypes.vcf.gz -SubPop subsample203.list -InSampleGroup pop.info
6) Discussion
################swimming in the sky and flying in the sea ###########################