EMVC-2

An efficient SNV variant caller based on the expectation maximization algorithm. EMVC-2 is implemented in C and uses a Python wrapper.

Supported platforms: Linux, macOS

Authors: Guillermo Dufort y Álvarez, Martí Xargay, Idoia Ochoa, and Alba Pages-Zamora

Contact: gdufort@fing.edu.uy

Install with Conda

To install directly from source, follow the instructions in the next section.

EMVC-2 is available on conda via the bioconda channel. See this page for installation instructions for conda. Once conda is installed, we recommend creating an environment with python=3.8.1:

conda create --name emvcEnv python=3.8.1
conda activate emvcEnv

Then run the following command to install emvc-2.

conda install -c bioconda emvc-2

Note that if emvc-2 is installed this way, it should be invoked with the command emvc-2 rather than ./emvc-2. The bioconda help page shows the commands if you wish to install emvc-2 in an environment.

Install from source code

Download repository

git clone https://github.com/guilledufort/EMVC-2.git

Requirements

Software requirements

  1. python ( == 3.8.1 )
  2. samtools ( == 1.9 )

Compiler requirement

  1. gcc ( Linux: >= 4.8.1, Mac: Apple clang version >= 14.0.0 )

Python libraries requirement

  1. cython ( >=0.29.17 )
  2. numpy ( >=1.16.6,<=1.20.3 )
  3. argparse ( >=1.1 )
  4. scipy ( >=1.1.0,<1.5.4 )
  5. tqdm ( >=4.46.0 )
  6. scikit-learn ( >=0.22.2,<=0.24.2 )
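
To confirm that an existing environment satisfies these ranges before building, a small check along the following lines may help. This is a minimal sketch and not part of EMVC-2: the helper names (parse_version, in_range, check_environment) are hypothetical, the version comparison is deliberately simple (numeric components only), and argparse is left out because it ships with the Python 3.8 standard library.

```python
from importlib import metadata  # in the standard library from Python 3.8


def parse_version(text):
    """Turn '1.20.3' into (1, 20, 3); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so that '1.1' compares equal to '1.1.0'
        parts.append(0)
    return tuple(parts)


# (minimum, maximum, maximum_is_inclusive), copied from the list above.
REQUIREMENTS = {
    "Cython": ("0.29.17", None, True),
    "numpy": ("1.16.6", "1.20.3", True),
    "scipy": ("1.1.0", "1.5.4", False),
    "tqdm": ("4.46.0", None, True),
    "scikit-learn": ("0.22.2", "0.24.2", True),
}


def in_range(installed, minimum, maximum=None, inclusive_max=True):
    """True if the installed version lies inside the required range."""
    version = parse_version(installed)
    if version < parse_version(minimum):
        return False
    if maximum is None:
        return True
    upper = parse_version(maximum)
    return version <= upper if inclusive_max else version < upper


def check_environment():
    """Return {package: (installed_version_or_None, ok)} for every requirement."""
    report = {}
    for name, (low, high, inclusive) in REQUIREMENTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, in_range(installed, low, high, inclusive))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "missing or out of range"
        print(f"{name}: {installed or 'not installed'} -> {status}")
```

Running the script inside the target environment prints one line per package, which makes an out-of-range numpy or scipy easy to spot before compilation.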

Compiling the candidate_variants_finder and installing python dependencies

The following instructions will create the candidate_variants_finder executable in the root directory, which is needed to run EMVC-2, and install the required python dependencies. To compile candidate_variants_finder you need to have the gcc compiler.

On Linux, gcc usually comes installed by default. If it is not, run the following on Ubuntu (on CentOS, use sudo yum install gcc instead):

sudo apt update
sudo apt-get install gcc

On macOS, install the GCC compiler:

  • Install HomeBrew (https://brew.sh/)
  • Install GCC (this step will be faster if Xcode command line tools are already installed using xcode-select --install):
    brew update
    brew install gcc

To check if the gcc compiler is properly installed in your system run:

On Linux

gcc --version

The output should describe the installed compiler, including its version.
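
The version check can also be scripted against the thresholds listed under "Compiler requirement". The sketch below is an illustration only: first_version and compiler_ok are hypothetical helpers, and the code assumes that the first X.Y.Z number in the gcc --version banner is the compiler version and that Apple clang identifies itself with the word "clang".

```python
import re
import shutil
import subprocess


def first_version(text):
    """Return the first X.Y.Z number found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(group) for group in match.groups()) if match else None


def compiler_ok(version_output):
    """Apply the README thresholds to the text printed by `gcc --version`."""
    version = first_version(version_output)
    if version is None:
        return False
    # Assumption: on macOS, `gcc` is usually Apple clang and says so in its banner.
    if "clang" in version_output.lower():
        minimum = (14, 0, 0)
    else:
        minimum = (4, 8, 1)
    return version >= minimum


if __name__ == "__main__":
    if shutil.which("gcc") is None:
        print("gcc not found on PATH")
    else:
        result = subprocess.run(["gcc", "--version"], capture_output=True, text=True)
        if compiler_ok(result.stdout):
            print("compiler ok")
        else:
            print("compiler too old or not recognized")
```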

To compile candidate_variants_finder and install the required python dependencies run:

cd EMVC-2/
python setup.py install

Install samtools

To install samtools, you can use conda:

conda install -c bioconda samtools==1.9

or follow the instructions in the samtools GitHub repository.

Usage


emvc-2 [-h] -i BAM_FILE -r REF_FILE [-p THREADS] [-t ITERATIONS] [-m LEARNERS] [-v VERBOSE] -o OUT_FILE

optional arguments:
  -h, --help            show this help message and exit
  -i BAM_FILE, --bam_file BAM_FILE
                        The bam file
  -r REF_FILE, --ref_file REF_FILE
                        The reference fasta file
  -p THREADS, --threads THREADS
                        The number of parallel threads (default 8)
  -t ITERATIONS, --iterations ITERATIONS
                        The number of EM iterations (default 5)
  -m LEARNERS, --learners LEARNERS
                        The number of learners (default 7)
  -v VERBOSE, --verbose VERBOSE
                        Make output verbose (default 0)
  -o OUT_FILE, --out_file OUT_FILE
                        The output file name
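
When emvc-2 is called from a Python script (for example, to process several BAM files), the options above can be assembled into an argument list. The following is a minimal sketch: build_emvc2_command and run_emvc2 are hypothetical helper names, the defaults are the ones shown in the help text, and treating a non-zero exit status as a failure is an assumption about how emvc-2 reports errors.

```python
import subprocess


def build_emvc2_command(bam_file, ref_file, out_file, threads=8, iterations=5,
                        learners=7, verbose=0, executable="emvc-2"):
    """Assemble the argument list for one emvc-2 run from the documented flags."""
    return [
        executable,
        "-i", str(bam_file),
        "-r", str(ref_file),
        "-p", str(threads),
        "-t", str(iterations),
        "-m", str(learners),
        "-v", str(verbose),
        "-o", str(out_file),
    ]


def run_emvc2(*args, **kwargs):
    """Run emvc-2; raises CalledProcessError on a non-zero exit status."""
    command = build_emvc2_command(*args, **kwargs)
    subprocess.run(command, check=True)
    return command
```

Passing executable="./emvc-2" matches a source checkout, while the default "emvc-2" matches a conda installation.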

Usage example

The repository includes an example folder with a test file for running a simple example of the tool. For the example to work, the hs37d5 reference file must first be downloaded as described in the section "Downloading the datasets and the reference genome".

To run the variant caller with 8 threads on the example file example.bam:

cd EMVC-2
./emvc-2 -i example/example.bam -r reference/hs37d5/hs37d5.fa.gz -p 8 -o example/example.vcf
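
As a quick sanity check of the result, the calls in the output file can be tallied per chromosome. This sketch assumes that the output follows the usual VCF text layout (tab-separated records, header lines starting with "#"); records_per_chromosome is a hypothetical helper and not part of EMVC-2.

```python
import os
from collections import Counter


def records_per_chromosome(lines):
    """Count VCF data records per chromosome, skipping headers and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]  # first column of a VCF record is CHROM
        counts[chrom] += 1
    return counts


if __name__ == "__main__":
    vcf_path = "example/example.vcf"  # output path used in the command above
    if os.path.exists(vcf_path):
        with open(vcf_path) as handle:
            for chrom, n in records_per_chromosome(handle).items():
                print(chrom, n)
    else:
        print(f"{vcf_path} not found; run emvc-2 first")
```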

Original paper datasets information

To test the performance of the EMVC-2 SNV variant caller we ran experiments on the following datasets.

Dataset         Reference  Size (GB)  Coverage  Sequencing Method      Download link
ERR262997       HG001      104        30        Illumina HiSeq 2000    link
NovaSeq         HG001      49         25        Illumina NovaSeq 6000  not available for download
Ashkenazim son  HG002      48         25        Illumina HiSeq 2500    link
pangenomics2    HG002      61         30        Illumina HiSeq 2500    link
pangenomics3    HG003      66         30        Illumina HiSeq 2500    link
pangenomics4    HG004      61         30        Illumina HiSeq 2500    link
Chinese Son     HG005      34         15        Illumina HiSeq 2500    link

Downloading the datasets and the reference genome

To download a dataset, run the download_files.sh script with the dataset name as a parameter. For example, to download ERR262997 run:

cd EMVC-2/datasets
./download_files.sh ERR262997
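
Because the datasets range from 34 GB to 104 GB, it can be worth checking free disk space before starting a download. The sketch below uses the sizes from the dataset table and makes several assumptions: 1 GB is taken as 10^9 bytes, the 10% safety margin is arbitrary, and the dictionary keys are the names as printed in the table, which are not necessarily the names download_files.sh accepts (only ERR262997 is shown above). fits_on_disk is a hypothetical helper.

```python
import shutil

# Sizes in GB, copied from the dataset table. NovaSeq is omitted because
# it is listed as not available for download.
DATASET_SIZES_GB = {
    "ERR262997": 104,
    "Ashkenazim son": 48,
    "pangenomics2": 61,
    "pangenomics3": 66,
    "pangenomics4": 61,
    "Chinese Son": 34,
}


def fits_on_disk(dataset, free_bytes, margin=1.1):
    """True if free_bytes covers the dataset size plus a safety margin."""
    needed = DATASET_SIZES_GB[dataset] * 10**9 * margin
    return free_bytes >= needed


if __name__ == "__main__":
    free = shutil.disk_usage(".").free  # free space on the current filesystem
    for name in DATASET_SIZES_GB:
        print(name, "fits" if fits_on_disk(name, free) else "does not fit")
```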

To download the human reference genome version hs37d5 run:

cd EMVC-2/reference
./download_files.sh hs37d5

The scripts use the command curl to perform the download. To install curl on macOS run:

brew install curl

To install curl on Ubuntu run the following (on CentOS, use sudo yum install curl instead):

sudo apt-get install curl

Alignment information

To obtain alignment information in BAM format for each pair of FASTQ files we recommend using the tool BWA.

To install bwa with conda run:

conda install bwa

To align a pair of FASTQ files against a reference genome using BWA run:

bwa mem -t [THREADS] [REF] [FASTQ_R1] [FASTQ_R2] \
   | samtools sort -@ [THREADS] -o [BAM_FILE] -
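
The same pipeline can be driven from Python, which is convenient when aligning several samples in a loop. This is a minimal sketch under stated assumptions: build_alignment_commands and align are hypothetical helper names, bwa and samtools are expected on PATH, and the reference is expected to have been indexed once beforehand with bwa index, which bwa mem requires.

```python
import subprocess


def build_alignment_commands(ref, fastq_r1, fastq_r2, bam_file, threads=8):
    """Return the (bwa mem, samtools sort) argument lists for one paired-end sample."""
    bwa_cmd = ["bwa", "mem", "-t", str(threads), ref, fastq_r1, fastq_r2]
    # The trailing "-" tells samtools sort to read from standard input.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", bam_file, "-"]
    return bwa_cmd, sort_cmd


def align(ref, fastq_r1, fastq_r2, bam_file, threads=8):
    """Pipe bwa mem into samtools sort, mirroring the shell pipeline above."""
    bwa_cmd, sort_cmd = build_alignment_commands(ref, fastq_r1, fastq_r2, bam_file, threads)
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    sort = subprocess.Popen(sort_cmd, stdin=bwa.stdout)
    bwa.stdout.close()  # lets bwa receive SIGPIPE if samtools exits early
    sort_status = sort.wait()
    bwa_status = bwa.wait()
    if bwa_status != 0 or sort_status != 0:
        raise RuntimeError(
            f"alignment failed (bwa={bwa_status}, samtools={sort_status})"
        )
```

Checking both exit statuses matters here because a plain shell pipeline reports only the status of its last command unless pipefail is set.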