BioExcel_SeqQC
Python package to run a Sequence Quality Control pipeline, based on workflows
defined by IGMM.
Requirements:
FastQC
Cutadapt
Python 3.x
PyYAML
We recommend using the conda package manager together with virtual
environments. This tool is also available in the bioconda channel, so
installing it through conda automatically installs all prerequisites.
Installation
There are two main ways to install the package.
Conda package installation
Set up a new conda environment (optional):
$ conda create -n my_env -c bioconda python=3
This creates a clean Python 3 environment in which to install and run the tool.
If you already have a conda environment you wish to use, make sure you add the
bioconda channel to that environment, or to your conda configuration as a whole.
Install BioExcel_SeqQC
$ conda install bioexcel_seqqc
This one line will install BioExcel_SeqQC and all of its dependencies.
Manual installation
If you wish to install manually, follow the steps below. We still recommend
using some kind of virtual environment. Before running the workflow, install
the prerequisite tools and ensure they are on your $PATH.
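As a quick sanity check after a manual installation, the following minimal sketch looks for the external tools on the current PATH. It assumes the executables are named `fastqc` and `cutadapt`; the helper `missing_tools` is hypothetical and not part of BioExcel_SeqQC.

```python
import shutil

# Executable names assumed for the two external tools.
REQUIRED_TOOLS = ("fastqc", "cutadapt")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it inside the same environment in which you intend to run the workflow, since PATH usually differs between environments.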
Usage
Once installed, there are several ways to use the tool. The easiest is to call
the executable script, which runs the whole workflow according to options and
arguments the user can modify. Find these using
An example of basic usage of the pipeline is:
Editing configuration for checkFastQC stage
The tool runs an automated set of checks based on output from FastQC. The
default decisions are based on our partner's preferences, but they can be
changed. First, output an example configuration file (which contains the
default values):
$ bxcl_seqqc --printconfig
The file lists the summary outputs from FastQC and, for each one, the decision
to make: whether the files should be trimmed or rechecked, taking into account
whether they have already been trimmed automatically.
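To illustrate the kind of decision this stage automates, here is a minimal sketch that reads FastQC's tab-separated `summary.txt` format (status, module name, file name) and applies a small decision table. The table `TRIM_ON` and both helper functions are hypothetical placeholders; they are not the tool's actual configuration keys or its checkfastqc logic, and the real defaults are whatever `bxcl_seqqc --printconfig` prints.

```python
# Hypothetical decision table: which FastQC results should trigger trimming.
# Placeholder values only, not the defaults shipped with the tool.
TRIM_ON = {
    "Per base sequence quality": {"FAIL"},
    "Adapter Content": {"WARN", "FAIL"},
}


def parse_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t", 2)
        results[module] = status
    return results


def needs_trimming(results, trim_on=TRIM_ON):
    """Return True if any module status matches the decision table."""
    return any(results.get(module) in statuses
               for module, statuses in trim_on.items())


# Example input in the layout of a FastQC summary.txt file.
example = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "FAIL\tPer base sequence quality\tsample.fastq\n"
    "PASS\tAdapter Content\tsample.fastq\n"
)
print(needs_trimming(parse_summary(example)))  # prints True
```

Keeping the decisions in a table rather than in code mirrors the idea of an editable configuration file: changing a preference means editing data, not logic.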
Python Module
In addition to the executable version, the tool is installed as a Python
package, so each stage can be imported as a module into other scripts if the
user wishes to build more specialised, complicated or extended workflows. Each
function creates and returns a Python subprocess.
import bioexcel_seqqc
import bioexcel_seqqc.runfastqc as rfq
import bioexcel_seqqc.runtrim as rt
# infiles, fqcdir, tmpdir, trimdir and threads must be defined by the caller
# Do things before running FastQC
fqc_process = rfq.run_fqc(infiles, fqcdir, tmpdir, threads)
# Block until FastQC has finished
fqc_process.wait()
# Do things after FastQC, and before trimming low quality reads
trim_process = rt.trimQC(infiles, trimdir, threads)
# Block until trimming has finished
trim_process.wait()
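The pattern of a function that starts a process and hands back its handle can be sketched with the standard library alone. This is an illustration of the general idea, not the package's implementation: `launch_stage` is a hypothetical name, and a Python one-liner stands in for FastQC or Cutadapt so the example runs anywhere.

```python
import subprocess
import sys


def launch_stage(args):
    """Start a command without blocking and return the Popen handle."""
    return subprocess.Popen(args, stdout=subprocess.PIPE, text=True)


# Stand-in command: a Python one-liner instead of FastQC or Cutadapt.
process = launch_stage([sys.executable, "-c", "print('stage finished')"])

# Other work can happen here while the child process runs.

output, _ = process.communicate()  # waits for the process to exit
if process.returncode != 0:
    raise RuntimeError(f"stage failed with exit code {process.returncode}")
print(output.strip())  # prints "stage finished"
```

Because the handle is returned immediately, the caller decides when to block. This sketch uses `communicate()` rather than `wait()` only because it captures the child's output through a pipe; checking `returncode` afterwards is a cheap way to detect a failed stage before moving on.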
Stages
Our pipeline consists of three main stages: runfastqc, checkfastqc and runtrim.
Each stage exists as a Python module, as shown above, and each module contains
specific functions that execute the tools listed. The diagram below shows
each of these stages, with colour coding to show which tools are used in each
module, as well as useful output files. For this work, the module checkfastqc
was developed specifically to remove the human intervention otherwise required
to check output from FastQC before continuing with trimming or further analysis.
Each module can also be executed independently of the main executable workflow.
For example, if Cutadapt fails, the runtrim stage can be executed on its own
from the command line as