CoreProfiler: a robust and integrable cgMLST software
Introduction
Core genome Multilocus Sequence Typing (cgMLST) is a genotyping method used in microbiology to compare the genetic distances of bacterial strains. It achieves this by comparing the alleles of the core genome in genomes to a reference database of official alleles. The output is a cgMLST profile. CoreProfiler is a software that calls official alleles (“autotag”) and detects new ones (“scannew”).
The identification of alleles on the genome, often referred to as ‘allele calling,’ requires an allele database (or ‘scheme’).
Scheme downloading
A scheme contains the list of official alleles curated by its reference platform (such as BIGSdb for Klebsiella pneumoniae or Listeria, Enterobase for E. coli, PubMLST for Staphylococcus aureus, …) and cgMLST.org for several other schemes. Each locus has its FASTA file containing the list of alleles, with sequence id’s >[locus]_[allele-number].
[!WARNING] The schemes should be downloaded from the reference platforms and should not be manually modified.
Token Management (OAuth1)
BIGSdb and PubMLST platforms require OAuth1 authentication to access and download the most up-to-date schemes. While authentication is not strictly mandatory, skipping it may result in downloading outdated schemes.
This workflow involves two types of tokens:
Consumer tokens: permanent tokens used to initiate the authentication flow.
Access tokens: tokens required to download or update a scheme.
1. Getting request tokens
The first step is to create an account on pubMLST or BigsDB Pasteur and request a consumer key and consumer secret (see the respective documentation for instructions on how to obtain them).
The second step is to request an access key and access secret using your consumer key and consumer secret provided by the platform for a specific schema.
coreprofiler db \
get_request_tokens \
--scheme <SCHEME_NAME> \
--consumer_key <YOUR_CONSUMER_KEY> \
--consumer_secret <YOUR_CONSUMER_SECRET>
#### Reference platforms supported in CoreProfiler
- pubMLST
- BigsDB
- Enterobase
- cgMLST.org
#### Run scheme download
A schema can be downloaded with the following commands.
```bash
$ coreprofiler db download -s [name_of_schema] -o [schema_folder_output_path] -k [consumer_token] -ks [consumer_secret] -a [access_token] -as [access_secret]
-k / -ks → OAuth1 consumer key and secret.
-a / -as → OAuth1 access token and secret.
Authentication (used for PubMLST and BIGSdb) is optional but recommended to ensure the latest version of the schema is downloaded
The list of available scheme can be obtained with the following command.
$ coreprofiler db -av
You will have a list with the name of the scheme, a short description, the number of loci in the scheme (locus_count) and the name of the platform from which the scheme originates
$ coreprofiler db update -s [name_of_schema] -d [local_scheme_folder_path] -o [update_log_file_path] -k [consumer_token] -ks [consumer_secret] -a [access_token] -as [access_secret]
-k / -ks → OAuth1 consumer key and secret.
-a / -as → OAuth1 access token and secret.
Authentication (used for PubMLST and BIGSdb) is optional but recommended to ensure the latest version of the schema is downloaded
This code reads the last local update of the species scheme from data/scheme_url_new.json, and replaces all locus files if the last remote locus update is newer than the last local database update.
Available species schemes and their API requests URL are stored in data/scheme_url_new.json.
Allele identification
The main command coreprofiler allele_calling consist of 2 steps:
Autotag (exact matches on reference database)
It runs BLASTn with 100% identity, no gaps, and parameters optimized for speed (-task megablast and -dust no; genome vs. BLAST db). The output is parsed, only results with 100% coverage are kept.
Scannew (detection of new alleles)
It selects the loci for which no exact match was found in step 1. Creates a BLAST database with those loci files, and runs a looser BLASTn (-task blastn, genome vs. BLAST db). The best hit is kept for each locus and classified as either a new allele, an incomplete allele or a missing allele according to the matching default values:
Allele status
Identity %
Coverage %
Representation
New
> 90
> 90
[locus]_[sequence_hash]
Incomplete
> 90
70 - 90
X
Missing
< 90
< 90
-
CoreProfiler takes as input:
a genome assembly in fasta format
the scheme directory
[!NOTE]
Omitting --detailed parameter would characterize incomplete alleles as “-“ (downstream analyses programs may only accept integers and “-“ as allele id’s).
Doing so will run makeblastdb on the scheme and create a BLAST database on which to run the genome against. This process may take a bit longer and does not need to be re-run unless the scheme has been updated. The path to a BLAST database can also be provided with the -db argument.
Example: (once the command above has been run and a BLAST database has been created already)
unlike other cgMLST softwares, CoreProfiler takes as input the scheme files and can not be only provided with its BLAST database.
-db argument must be given which is the path of the main fasta files of the BLAST database.
Detection of new alleles
Core-genome MLST brings out at almost every genome alleles that do not belong to the reference scheme. The aim of CoreProfiler is not to describe new “official” alleles, as that may enter in conflict with the platforms providing official schemes. It is however able to describe locally (temporary) new alleles. A genome is considered to carry a new allele if, for a given locus, the following conditions are met:
no exact match (100% identity & 100% coverage) with the locus alleles
one allele of this locus matched on the genome with:
The new allele sequences is the query subject that matched best with one allele (position trimming).
Parameters
general
-q, --query: [required] Path(s) to the genome assemblie(s) (fasta files).
-sf, --scheme_dir: [required] Path to the directory containing the scheme files.
-out: [required] Path to the output file (tsv file).
-db, --blast_db_path: Path to the BLAST database.
-n, --num_threads: Number of threads to run BLAST functions on. Default: 4.
-V, --version: CoreProfiler version.
-v, --verbose: Verbose mode.
autotag
--autotag_word_size: Word size for the autotag BLASTn. Default: 31.
scannew
--outfa: Path to output fasta file with new alleles sequences if detected.
--min_id_new_allele: Minimum identity percentage to consider new alleles. Default: 90.
--min_cov_new_allele: Minimum coverage percentage to consider new alleles. Default: 90.
--min_cov_incomplete: Minimum coverage percentage to consider an allele incomplete (if --detailed option). Default 70.
-d, --detailed: Return detailed information on incomplete alleles: “incomplete, …”. Else, only new alleles or missing.
--profiles_w_tmp_alleles: A JSON file containing info about files with temporary alleles.
--num_alleles_per_locus: TSV file containing the number of alleles per locus of a given scheme.
-cds: Extract new alleles within CDS.
Platform integration
CoreProfiler has 2 modes: allele_calling (see above) and db (to download & updates databases).
Scheme databases can be downloaded with coreprofiler db download and updated with coreprofiler db update [species]. When a database is updated, update stored profiles with coreprofiler platform profiles_update [args], and re-run coreprofiler db get_num_alleles [args].
To limit redondant processes between two database updates, consider running coreprofiler db makeblastdb [args] and coreprofiler db get_num_alleles [args], and providing their output file when needed.
Handling temporary/new alleles
When a new allele is detected, the sequence is extracted and the allele identifier is a MD5-hash of the sequence. This way, one new allele described in parallelized runs is given the same id in output profiles (and from one batch run to another). On a platform that aims to store query genomes and update cgMLST profiles, the --profiles_w_tmp_alleles parameter should be provided to write the profiles containing temporary new alleles in a json file.
The nucleic sequences of alleles described in the json file generated by the --profiles_w_tmp_alleles parameter can be obtained with the --outfa parameter.
The BLAST database can be provided to avoid repeating expensive processes (--blast_db_path). CoreProfiler runs BLAST functions on the query genomes vs. a BLAST database of the species scheme. When the user does not provide a BLAST database, one is created. Building such BLAST database takes some time, and does not need to be repeated unless the scheme has been updated. So, between two updates of a species scheme, consider providing a BLAST database as input. Create one running coreprofiler db makeblastdb -s [scheme_path] -n [blastdb_name] -p [blastdb_path]. 'blastdb_name' is used to name output files, so provide a name compatible with file naming. To provide it as input, add the argument -db [blastdb_path/blastdb_name.fasta] to coreprofiler allele_calling.
Testing
Unittary testing
Unittary tests are written with pytest and can be run with the simple pytest command. To test the API and HTML requests, run pytest -m requests.
Accuracy verification
CoreProfiler is being thoroughly tested against reference platforms.
Autotag: tested by downloading public genomes and their profiles from the platform, running CoreProfiler cgMLST on the genomes and comparing the alleles called. [add details]
Scannew: tested by manually removing alleles from the scheme database and verifying that these are described by CoreProfler “scannew” function.
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
CoreProfiler: a robust and integrable cgMLST software
Introduction
Core genome Multilocus Sequence Typing (cgMLST) is a genotyping method used in microbiology to compare the genetic distances of bacterial strains. It achieves this by comparing the alleles of the core genome in genomes to a reference database of official alleles. The output is a cgMLST profile. CoreProfiler is a software that calls official alleles (“autotag”) and detects new ones (“scannew”).
Installation
Requirement
Conda installation
CoreProfiler is available on conda via bioconda.
Installation from sources
Clone the GitLab repository
Move inside the created folder
Install dependencies
Usage
Check the installation
Basic usage
The identification of alleles on the genome, often referred to as ‘allele calling,’ requires an allele database (or ‘scheme’).
Scheme downloading
A scheme contains the list of official alleles curated by its reference platform (such as BIGSdb for Klebsiella pneumoniae or Listeria, Enterobase for E. coli, PubMLST for Staphylococcus aureus, …) and cgMLST.org for several other schemes. Each locus has its FASTA file containing the list of alleles, with sequence id’s
>[locus]_[allele-number].Token Management (OAuth1)
BIGSdb and PubMLST platforms require OAuth1 authentication to access and download the most up-to-date schemes.
While authentication is not strictly mandatory, skipping it may result in downloading outdated schemes.
This workflow involves two types of tokens:
1. Getting request tokens
The first step is to create an account on pubMLST or BigsDB Pasteur and request a consumer key and consumer secret
(see the respective documentation for instructions on how to obtain them).
The second step is to request an access key and access secret using your consumer key and consumer secret provided by the platform for a specific schema.
-k / -ks → OAuth1 consumer key and secret. -a / -as → OAuth1 access token and secret.
Authentication (used for PubMLST and BIGSdb) is optional but recommended to ensure the latest version of the schema is downloaded
The list of available scheme can be obtained with the following command.
You will have a list with the
nameof the scheme, a shortdescription, the number of loci in the scheme (locus_count) and the name of theplatformfrom which the scheme originatesScheme updating
To update a scheme, run
-k / -ks → OAuth1 consumer key and secret. -a / -as → OAuth1 access token and secret.
Authentication (used for PubMLST and BIGSdb) is optional but recommended to ensure the latest version of the schema is downloaded
This code reads the last local update of the species scheme from
data/scheme_url_new.json, and replaces all locus files if the last remote locus update is newer than the last local database update.Available species schemes and their API requests URL are stored in
data/scheme_url_new.json.Allele identification
The main command
coreprofiler allele_callingconsist of 2 steps:Autotag (exact matches on reference database)
It runs
BLASTnwith 100% identity, no gaps, and parameters optimized for speed (-task megablastand-dust no; genome vs. BLAST db). The output is parsed, only results with 100% coverage are kept.Scannew (detection of new alleles)
It selects the loci for which no exact match was found in step 1. Creates a BLAST database with those loci files, and runs a looser
BLASTn(-task blastn, genome vs. BLAST db). The best hit is kept for each locus and classified as either anew allele, anincomplete alleleor amissing alleleaccording to the matching default values:[locus]_[sequence_hash]X-CoreProfiler takes as input:
Running allele calling
You can test the tool by running:
Doing so will run
makeblastdbon the scheme and create a BLAST database on which to run the genome against. This process may take a bit longer and does not need to be re-run unless the scheme has been updated. The path to a BLAST database can also be provided with the-dbargument.Example: (once the command above has been run and a BLAST database has been created already)
Detection of new alleles
Core-genome MLST brings out at almost every genome alleles that do not belong to the reference scheme. The aim of CoreProfiler is not to describe new “official” alleles, as that may enter in conflict with the platforms providing official schemes. It is however able to describe locally (temporary) new alleles. A genome is considered to carry a new allele if, for a given locus, the following conditions are met:
no exact match (100% identity & 100% coverage) with the locus alleles
one allele of this locus matched on the genome with:
The new allele sequences is the query subject that matched best with one allele (position trimming).
Parameters
general
-q, --query: [required] Path(s) to the genome assemblie(s) (fasta files).-sf, --scheme_dir: [required] Path to the directory containing the scheme files.-out: [required] Path to the output file (tsv file).-db, --blast_db_path: Path to the BLAST database.-n, --num_threads: Number of threads to run BLAST functions on. Default: 4.-V, --version: CoreProfiler version.-v, --verbose: Verbose mode.autotag
--autotag_word_size: Word size for the autotag BLASTn. Default: 31.scannew
--outfa: Path to output fasta file with new alleles sequences if detected.--min_id_new_allele: Minimum identity percentage to consider new alleles. Default: 90.--min_cov_new_allele: Minimum coverage percentage to consider new alleles. Default: 90.--min_cov_incomplete: Minimum coverage percentage to consider an allele incomplete (if--detailedoption). Default 70.-d, --detailed: Return detailed information on incomplete alleles: “incomplete, …”. Else, only new alleles or missing.--profiles_w_tmp_alleles: A JSON file containing info about files with temporary alleles.--num_alleles_per_locus: TSV file containing the number of alleles per locus of a given scheme.-cds: Extract new alleles within CDS.Platform integration
CoreProfiler has 2 modes:
allele_calling(see above) anddb(to download & updates databases).Scheme databases can be downloaded with
coreprofiler db downloadand updated withcoreprofiler db update [species]. When a database is updated, update stored profiles withcoreprofiler platform profiles_update [args], and re-runcoreprofiler db get_num_alleles [args]. To limit redondant processes between two database updates, consider runningcoreprofiler db makeblastdb [args]andcoreprofiler db get_num_alleles [args], and providing their output file when needed.Handling temporary/new alleles
When a new allele is detected, the sequence is extracted and the allele identifier is a MD5-hash of the sequence. This way, one new allele described in parallelized runs is given the same id in output profiles (and from one batch run to another). On a platform that aims to store query genomes and update cgMLST profiles, the
--profiles_w_tmp_allelesparameter should be provided to write the profiles containing temporary new alleles in a json file.The nucleic sequences of alleles described in the json file generated by the
--profiles_w_tmp_allelesparameter can be obtained with the--outfaparameter.For example, the following command:
Will generate the following output files:
test_kp.tsv:Output file
test_kp_tmp_alleles.json:Output file
test_kp_tmp_alleles.fasta:Optimizing performance with precomputed files
The BLAST database can be provided to avoid repeating expensive processes (
--blast_db_path).CoreProfiler runs BLAST functions on the query genomes vs. a BLAST database of the species scheme. When the user does not provide a BLAST database, one is created. Building such BLAST database takes some time, and does not need to be repeated unless the scheme has been updated. So, between two updates of a species scheme, consider providing a BLAST database as input. Create one running
coreprofiler db makeblastdb -s [scheme_path] -n [blastdb_name] -p [blastdb_path].'blastdb_name'is used to name output files, so provide a name compatible with file naming. To provide it as input, add the argument-db [blastdb_path/blastdb_name.fasta]tocoreprofiler allele_calling.Testing
Unittary testing
Unittary tests are written with pytest and can be run with the simple
pytestcommand. To test the API and HTML requests, runpytest -m requests.Accuracy verification
CoreProfiler is being thoroughly tested against reference platforms.
References
Authors
Contributors