DFAST_QC conducts taxonomy and completeness checks of an assembled genome.
Taxonomy check
DFAST_QC evaluates the taxonomic identity of the genome by querying it against more than 28,000 reference genomes from type strains. To shorten the runtime, it first runs MASH on the query against the reference nucleotide databases and, based on the number of shared hashes, narrows down the genomes used in the downstream process. It then runs Skani on the query against the selected reference genomes to calculate the ANI values.
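The two-step search can be illustrated with a toy sketch. This is not DFAST_QC's implementation: sketches are modeled as plain Python sets of integers, the ANI values come from a lookup table instead of Skani, and the function names `prefilter_by_shared_hashes` and `two_step_search` are hypothetical.

```python
def prefilter_by_shared_hashes(query_sketch, ref_sketches, num_hits=10):
    """Step 1: rank references by the number of hashes shared with the query."""
    shared = {name: len(query_sketch & sketch) for name, sketch in ref_sketches.items()}
    # Highest shared-hash count first; ties are broken by name for determinism.
    ranked = sorted(shared, key=lambda name: (-shared[name], name))
    return ranked[:num_hits]


def two_step_search(query_sketch, ref_sketches, ani_func, num_hits=10):
    """Step 2: compute ANI only for the references kept by step 1."""
    candidates = prefilter_by_shared_hashes(query_sketch, ref_sketches, num_hits)
    results = [(name, ani_func(name)) for name in candidates]
    return sorted(results, key=lambda item: -item[1])


# Toy data: three references, of which only the top two reach the ANI step.
refs = {"ref_A": {1, 2, 3, 4}, "ref_B": {3, 4, 5, 6}, "ref_C": {7, 8, 9}}
query = {1, 2, 3, 10}
toy_ani = {"ref_A": 98.7, "ref_B": 91.2, "ref_C": 80.0}  # stand-in for Skani output

hits = two_step_search(query, refs, toy_ani.get, num_hits=2)
print(hits)
```

The point of the design is that the expensive ANI calculation runs on at most `num_hits` genomes rather than on the whole reference collection.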
Completeness check
DFAST_QC employs CheckM to calculate completeness and contamination values of the query genome. DFAST_QC automatically determines the reference marker set for CheckM based on the result of taxonomy check. Users can also specify the marker set to be used. The genome size is also checked to ensure it falls within the expected range.
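As a rough illustration of the genome size check, the sketch below sums the sequence lengths in a FASTA file (raw or gzipped) and compares the total with a range supplied by the caller. The function names and the bounds in the usage comment are hypothetical; DFAST_QC takes its expected ranges from its own reference data.

```python
import gzip


def genome_size(fasta_path):
    """Return the total number of sequence characters in a FASTA file."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    total = 0
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


def size_within_range(size, expected_min, expected_max):
    """True if the genome size falls inside the inclusive expected range."""
    return expected_min <= size <= expected_max


# Usage (placeholder path and bounds, for illustration only):
# size_within_range(genome_size("genome.fa"), 4_000_000, 6_000_000)
```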
GTDB search
As of ver. 0.5.0, DFAST_QC can calculate ANI against GTDB representative genomes, thereby enabling species-level identification in the GTDB Taxonomy. This employs the same two-step search as the taxonomy check.
ShigaPass
When the taxonomy check identifies the query genome as Escherichia coli/Shigella (with an “indistinguishable” status), DFAST_QC automatically runs ShigaPass to predict the Shigella serotype. ShigaPass can be disabled with --disable_shigapass.
System requirements and software dependencies
DFAST_QC runs on Linux with Python version 3.11 or earlier and requires approximately 2 GB of memory. While newer Python versions may be installed, some dependencies (e.g., CheckM) can produce errors, so Python 3.11 or lower is recommended. Alternatively, users are encouraged to run DFAST_QC within the provided environment to ensure compatibility.
The following third-party software and packages are required:
Skani
Mash
CheckM
HMMer (required for CheckM)
Prodigal (required for CheckM)
BLAST+ (required for ShigaPass)
Python packages: peewee, more-itertools, ete3
[For macOS] DFAST_QC is not officially supported on macOS and has not been thoroughly tested. On Macs with ARM CPUs (Apple Silicon), some dependencies are not supported. We recommend creating a conda environment with the platform explicitly specified:
conda create --platform osx-64 -n myenv
ShigaPass does not work due to differences in how the paste command works between macOS and Linux. Use --disable_shigapass when running DFAST_QC on macOS.
Installation from source code
Install dependencies
We recommend using conda to install dependencies.
cd dfast_qc
conda env create -f environment.yml
This will create a conda environment named “dfast_qc” and install the above-mentioned dependencies in it.
Alternatively, after installing the required software yourself, you can install the Python packages with the pip command.
pip install -r requirements.txt
Reference data is not included in the conda package. Please install it following the steps below.
Install ShigaPass
ShigaPass is required for Shigella serotype prediction. Clone the ShigaPass repository and copy the script and databases into the DFAST_QC source tree:
BLAST+ must also be installed and available on your $PATH (e.g. sudo apt-get install ncbi-blast+ or conda install -c bioconda blast).
The ShigaPass databases will be automatically initialized (via makeblastdb) on the first run.
If you do not need ShigaPass, you can skip this step and use --disable_shigapass when running DFAST_QC.
Quick set up (recommended)
Since the full data set of DFAST_QC’s reference data (DQC_REFERENCE_FULL) is huge (>100GB, including GTDB representative genomes), we have made the pre-built reference data (DQC_REFERENCE_COMPACT, <1.5GB) available for download using the dqc_ref_manager.py script.
This script attempts to retrieve data from the DFAST web service hosted on the NIG Supercomputer. If the web service is unavailable, downloading will not be successful. Please refer to https://www.ddbj.nig.ac.jp/.
dqc_ref_manager.py download
As DQC_REFERENCE_COMPACT does not contain reference genomes for ANI calculation, dfast_qc will attempt to download the required genomes on the fly during the run (an internet connection is required), which takes extra time (~1 min). We update DQC_REFERENCE_COMPACT periodically; please refresh your copy by running dqc_ref_manager.py again.
The dqc_ref_manager.py script downloads the reference data from our web service (https://dfast.ddbj.nig.ac.jp). If file downloads fail due to server maintenance or other issues, please manually obtain the reference data from this site.
If you want to prepare DQC_REFERENCE_FULL, please follow the procedure below.
usage: dfast_qc [-h] [--version] [-i PATH] [-o PATH] [-hits INT] [-a INT]
[-t INT] [-r PATH] [-n INT] [--enable_gtdb] [--disable_tc]
[--disable_cc] [--disable_shigapass] [--disable_auto_download]
[--force] [--debug] [-p STR] [--show_taxon]
DFAST_QC: Taxonomy and completeness check
options:
-h, --help show this help message and exit
--version Show program version
-i PATH, --input_fasta PATH
Input FASTA file (raw or gzipped) [required]
-o PATH, --out_dir PATH
Output directory (default: OUT)
-hits INT, --num_hits INT
Number of top hits by MASH (default: 10)
-a INT, --ani INT ANI threshold (default: 95%)
-t INT, --taxid INT NCBI taxid for completeness check. Use '--show_taxon' for available taxids. (Default: Automatically inferred from taxonomy check)
-r PATH, --ref_dir PATH
DQC reference directory (default: DQC_REFERENCE_DIR)
-n INT, --num_threads INT
Number of threads for parallel processing (default: 1)
--enable_gtdb Enable GTDB search
--disable_tc Disable taxonomy check using ANI
--disable_cc Disable completeness check using CheckM
--disable_shigapass Disable ShigaPass analysis even when Shigella/E. coli is detected
--disable_auto_download
Disable auto-download for missing reference genomes
--force Force overwriting result
--debug Debug mode
-p STR, --prefix STR Prefix for output (for debugging use, default: None)
--show_taxon Show available taxa for completeness check
Example
Test data can be found in example. To test the software, run this after preparing the reference data.
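As a rough sketch, a DFAST_QC invocation can be assembled from the options documented in the help text. The helper name `build_dfast_qc_command` is hypothetical, and `genome.fa` is a placeholder input path, not the bundled test file.

```python
import shlex


def build_dfast_qc_command(input_fasta, out_dir="OUT", num_threads=1,
                           enable_gtdb=False, disable_shigapass=False):
    """Build an argument list using only options listed in the dfast_qc help."""
    cmd = ["dfast_qc", "-i", str(input_fasta), "-o", str(out_dir),
           "-n", str(num_threads)]
    if enable_gtdb:
        cmd.append("--enable_gtdb")
    if disable_shigapass:
        cmd.append("--disable_shigapass")
    return cmd


cmd = build_dfast_qc_command("genome.fa", out_dir="OUT", num_threads=4,
                             enable_gtdb=True)
print(shlex.join(cmd))
# To actually run it (requires dfast_qc on $PATH and prepared reference data):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.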
Batch execution for multiple genomes
A wrapper script is available for batch execution of multiple genomes in a given directory. Please make sure the dfast_qc executable is on your $PATH.
dqc_multi -t 3 examples/
This will invoke 3 DFAST_QC processes in parallel against the FASTA files in the examples/ directory and generate a report file, dqc_report.tsv. By default, FASTA files with the extensions fa(.gz), fna(.gz), and fasta(.gz) are processed. See dqc_multi -h for more details.
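To preview which files such a batch run would pick up, the default extension filter can be approximated as follows. This is a hypothetical helper, not the code used by dqc_multi, and the actual matching rules of the wrapper may differ.

```python
from pathlib import Path

# Default extension list as shown in the dqc_multi help text.
DEFAULT_EXTENSIONS = "fa,fasta,fna,fa.gz,fasta.gz,fna.gz"


def find_fasta_files(input_dir, extensions=DEFAULT_EXTENSIONS):
    """Return the files in input_dir whose names end with an accepted extension."""
    suffixes = tuple("." + ext for ext in extensions.split(","))
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.is_file() and path.name.endswith(suffixes)
    )


# Usage (placeholder directory):
# for path in find_fasta_files("examples"):
#     print(path.name)
```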
Help
usage: dqc_multi [-h] [--fasta FASTA] [--out_dir OUT_DIR] [--output OUTPUT] [--taxid TAXID] [--disable_tc] [--disable_cc] [--enable_gtdb] [--thread THREAD] input_dir
Run DFAST_QC in parallel for batch execution of multiple genomes
positional arguments:
input_dir The directory containing the FASTA files
options:
-h, --help show this help message and exit
--fasta FASTA Acceptable file extension for the fasta files. Default: fa,fasta,fna,fa.gz,fasta.gz,fna.gz
--out_dir OUT_DIR, -O OUT_DIR
Name of output directory. Intermediate files will be saved here.
--output OUTPUT, -o OUTPUT
Output file name
--taxid TAXID taxid for taxonomy check (-1: auto, 0:prokaryote)
--disable_tc Disable taxonomy check using ANI
--disable_cc Disable completeness check using CheckM
--enable_gtdb Enable GTDB search
--thread THREAD, -t THREAD
Number of threads to use
List of statuses in the taxonomy check result
conclusive: Effective ANI hit (>=95%) against only 1 species, hence the species name is conclusively determined.
indistinguishable: The genome belongs to one of the species that are difficult to distinguish using ANI (e.g. E. coli and Shigella spp.).
inconclusive: ANI hits against 2 or more different species. This may result from a comparison between very closely related species or from contamination between 2 different species.
below_threshold: The ANI hit is below the threshold (95%).
Note that DFAST_QC cannot identify clades below species level.
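The status definitions can be read as a small decision procedure. The sketch below is one possible interpretation, not DFAST_QC's actual logic: the function name, the order of the checks, the status strings as spelled here, and the way indistinguishable groups are passed in are all assumptions.

```python
def classify_status(hits, ani_threshold=95.0, indistinguishable_groups=()):
    """Classify a list of (species, ani) hits into one of the four statuses."""
    # Species with at least one hit at or above the threshold.
    effective = {species for species, ani in hits if ani >= ani_threshold}
    if not effective:
        return "below_threshold"
    # Assumption: membership in an indistinguishable group takes precedence.
    for group in indistinguishable_groups:
        if effective & set(group):
            return "indistinguishable"
    if len(effective) == 1:
        return "conclusive"
    return "inconclusive"


groups = [("Escherichia coli", "Shigella flexneri")]
print(classify_status([("Species A", 97.0), ("Species A", 96.1)]))
print(classify_status([("Species A", 97.0), ("Species B", 96.1)]))
print(classify_status([("Species A", 90.0)]))
print(classify_status([("Escherichia coli", 97.5)], indistinguishable_groups=groups))
```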
Run in Docker
A Docker image is available on Docker Hub. The example below shows how to invoke DFAST_QC with an input FASTA file (genome.fa) in the current directory.
For power users
Prepare reference data
Reference data of DFAST_QC is stored in a directory called DQC_REFERENCE. By default, it is located in the directory where DFAST_QC is installed (PATH/TO/dfast_qc/dqc_reference), or in /dqc_reference when the Docker version is used. In general, you do not need to change this, but you can specify another location in the config file or with the -r option.
To prepare reference data, run the following command.
sh dqc_initial_setup.sh [-n int]
-n denotes the number of threads for parallel processing (default: 1). As data preparation may take time, specifying a value of 4 to 8 (or more) for -n is recommended.
Once the reference data has been prepared, it can be updated by running the following command.
dqc_admin_tools.py update_all
To generate a list of the reference genomes (reference_genomes.tsv), run the following command
dqc_admin_tools.py dump_sqlite_db
Instead of running dqc_initial_setup.sh, you can prepare reference data by manually executing the following commands. Run dqc_admin_tools.py -h or dqc_admin_tools.py subcommand -h to show help.
1. Download master files
dqc_admin_tools.py download_master_files --targets asm ani tsr igp sst egs
This will download “Assembly report”, “ANI report”, “Type strain report”, and “indistinguishable_groups_prokaryotes.txt” from the NCBI FTP server, and the HMMer profile for TIGR.
2. Download/Update NCBI taxdump data
dqc_admin_tools.py update_taxdump
3. Download reference genomes
dqc_admin_tools.py download_genomes
This will download reference genomic FASTA files from the NCBI Assembly database. As it attempts to download a large number of genomes, enabling the parallel downloading option is recommended (e.g. --num_threads 4).
4. Sketch reference genomes using MASH
dqc_admin_tools.py mash_ref_sketch
5. Prepare SQLite database file
dqc_admin_tools.py prepare_sqlite_db
This will generate a reference file DQC_REFERENCE/references.db, which contains metadata for reference genomes.
6. Prepare CheckM data
dqc_admin_tools.py prepare_checkm
CheckM reference data will be downloaded and configured.
7. Update database for CheckM
dqc_admin_tools.py update_checkm_db
This will insert auxiliary data for CheckM into DQC_REFERENCE/references.db.
8. Prepare reference data for genome size check
dqc_admin_tools.py prepare_genome_size_data
9. Install ShigaPass and its reference data
dqc_admin_tools.py setup_shigapass
10. Add timestamp to the reference data
dqc_admin_tools.py add_ref_info
Mash sketching (step 4) may fail when running with multiple threads. To avoid errors, please specify --num_threads 1.
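The manual procedure above can be scripted so that the steps always run in the documented order and stop at the first failure. The following is a minimal sketch: the `SETUP_STEPS` list and the `run_steps` helper are hypothetical, the subcommands are copied from the steps above, and the default is a dry run that only prints each command.

```python
import shlex
import subprocess

# Commands in the order given above. The sketching step is passed no thread
# option here, in line with the note about multi-threaded sketching.
SETUP_STEPS = [
    "dqc_admin_tools.py download_master_files --targets asm ani tsr igp sst egs",
    "dqc_admin_tools.py update_taxdump",
    "dqc_admin_tools.py download_genomes --num_threads 4",
    "dqc_admin_tools.py mash_ref_sketch",
    "dqc_admin_tools.py prepare_sqlite_db",
    "dqc_admin_tools.py prepare_checkm",
    "dqc_admin_tools.py update_checkm_db",
    "dqc_admin_tools.py prepare_genome_size_data",
    "dqc_admin_tools.py setup_shigapass",
    "dqc_admin_tools.py add_ref_info",
]


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it in order."""
    for number, step in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {step}")
        if not dry_run:
            # check=True raises CalledProcessError, halting at the failing step.
            subprocess.run(shlex.split(step), check=True)


run_steps(SETUP_STEPS, dry_run=True)
```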
Preparation for the GTDB reference data.
Download the representative genomes from GTDB and unarchive them.
curl -LO https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/gtdb_genomes_reps.tar.gz
tar xfz gtdb_genomes_reps.tar.gz
If the download from the above link is slow, try the mirror site:
curl -LO https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/genomic_files_reps/gtdb_genomes_reps_r226.tar.gz
tar xfz gtdb_genomes_reps_r226.tar.gz
Create a link under DQC_REFERENCE
ln -s gtdb_genomes_reps_r226 gtdb_genomes_reps
Alternatively, place the unarchived folder under DQC_REFERENCE, and modify the value GTDB_GENOME_DIR specified in config.py.
DFAST_QC: DFAST Quality Control
Installation from Bioconda
DFAST_QC is also available from Bioconda.
If this does not work, please try installation from source code.
Usage
--num_threads value larger than 1.
Example of Result
tc_result.tsv: Taxonomy check result
cc_result.tsv: Completeness check result
dqc_result.json: DFAST_QC result in JSON format
Preparation for the GTDB reference data.
Download the species list from GTDB.
The above command will download this file from GTDB.
Place the file in the DQC_REFERENCE directory.
Sketch representative genomes from GTDB using MASH
Prepare the SQLite DB file for GTDB
When a newer version of the GTDB representative genomes becomes available, repeat these steps.
Citation
If you use DFAST_QC, please cite:
Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura & Yasuhiro Tanizawa
DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic genomes.
BMC Bioinformatics 26:3, 2025. https://doi.org/10.1186/s12859-024-06030-y