CoreProfiler: a robust and integrable cgMLST software

Introduction

Core genome Multilocus Sequence Typing (cgMLST) is a genotyping method used in microbiology to compare the genetic distances of bacterial strains. It achieves this by comparing the alleles of the core genome in genomes to a reference database of official alleles. The output is a cgMLST profile. CoreProfiler is a software that calls official alleles (“autotag”) and detects new ones (“scannew”).

Installation

Requirement

Python > 3.11
pandas
biopython
blast

Conda installation

CoreProfiler is available on conda via bioconda.

$ conda create --name coreprofiler --channel bioconda coreprofiler

Installation from sources

Clone the GitLab repository

$ git clone https://gitlab.com/ifb-elixirfr/abromics/coreprofiler.git

Move inside the created folder
```
$ cd coreprofiler
```
Install dependencies
```
$ pip install .
```

Usage

Check the installation

```bash
$ coreprofiler --version
```

Basic usage

The identification of alleles on the genome, often referred to as ‘allele calling,’ requires an allele database (or ‘scheme’).

Scheme downloading

A scheme contains the list of official alleles curated by its reference platform (such as BIGSdb for Klebsiella pneumoniae or Listeria, Enterobase for E. coli, PubMLST for Staphylococcus aureus, …) and cgMLST.org for several other schemes. Each locus has its FASTA file containing the list of alleles, with sequence id’s >[locus]_[allele-number].

[!WARNING]
The schemes should be downloaded from the reference platforms and should not be manually modified.

Token Management (OAuth1)

BIGSdb and PubMLST platforms require OAuth1 authentication to access and download the most up-to-date schemes.
While authentication is not strictly mandatory, skipping it may result in downloading outdated schemes.

This workflow involves two types of tokens:

Consumer tokens: permanent tokens used to initiate the authentication flow.
Access tokens: tokens required to download or update a scheme.

1. Getting request tokens

The first step is to create an account on pubMLST or BigsDB Pasteur and request a consumer key and consumer secret
(see the respective documentation for instructions on how to obtain them).

The second step is to request an access key and access secret using your consumer key and consumer secret provided by the platform for a specific schema.

coreprofiler db \
  get_request_tokens \
  --scheme <SCHEME_NAME> \
  --consumer_key <YOUR_CONSUMER_KEY> \
  --consumer_secret <YOUR_CONSUMER_SECRET>

#### Reference platforms supported in CoreProfiler

- pubMLST
- BigsDB
- Enterobase
- cgMLST.org

#### Run scheme download

A schema can be downloaded with the following commands.


```bash
$ coreprofiler db download -s [name_of_schema] -o [schema_folder_output_path] -k [consumer_token] -ks [consumer_secret] -a [access_token] -as [access_secret]

-k / -ks → OAuth1 consumer key and secret. -a / -as → OAuth1 access token and secret.

Authentication (used for PubMLST and BIGSdb) is optional but recommended to ensure the latest version of the schema is downloaded

The list of available scheme can be obtained with the following command.

$ coreprofiler db -av

You will have a list with the name of the scheme, a short description, the number of loci in the scheme (locus_count) and the name of the platform from which the scheme originates

name: description [locus count] (platform)
abaumannii: Acinetobacter baumannii [2390 loci] (cgmlstorg)
abaumannii_3: cgMLST v1 [2133 loci] (pubmlst)
banthracis: Bacillus anthracis [3803 loci] (cgmlstorg)
bcc_2: cgMLST [1925 loci] (pubmlst)
bcereus_2: B. anthracis cgMLST [3803 loci] (pubmlst)
bcereus_5: B. cereus cgMLST [1568 loci] (pubmlst)
bmallei_1: cgMLST [3311 loci] (pubmlst)
bmallei_fli: Burkholderia mallei (FLI) [2838 loci] (cgmlstorg)
bmallei_rki: Burkholderia mallei (RKI) [3328 loci] (cgmlstorg)
bmelitensis: Brucella melitensis [2704 loci] (cgmlstorg)
bordetella_1: cgMLST_genus [1415 loci] (BIGSdb)
bordetella_4: cgMLST_pertussis [2038 loci] (BIGSdb)
borrelia_3: cgMLST [639 loci] (pubmlst)
bpertussis: Bordetella pertussis [2983 loci] (cgmlstorg)
bpseudomallei: Burkholderia pseudomallei [4221 loci] (cgmlstorg)
bpseudomallei_2: cgMLST [4090 loci] (pubmlst)
brucella_3: cgMLST [1764 loci] (pubmlst)
campylobacter_4: C. jejuni / C. coli cgMLST v1 [1343 loci] (pubmlst)
campylobacter_8: C. jejuni / C. coli cgMLST v2 [1142 loci] (pubmlst)
cchauvoei_1: cgMLST [2223 loci] (pubmlst)
cdifficile: Clostridioides difficile [2147 loci] (cgmlstorg)
cdiphtheriae: Corynebacterium diphtheriae [1553 loci] (cgmlstorg)
cfreundii: Citrobacter freundii [3250 loci] (cgmlstorg)
cfreundii2: Citrobacter freundii/portucalensis/braakii/europaeus [2307 loci] (cgmlstorg)
chlamydiales_42: C. trachomatis cgMLST v1.0 [817 loci] (pubmlst)
chlamydiales_44: C. abortus cgMLST v1.0 [959 loci] (pubmlst)
cjejuni: Campylobacter jejuni/coli [637 loci] (cgmlstorg)
cperfringens_2: cgMLST [1431 loci] (pubmlst)
cpseudotuberculosis: Corynebacterium pseudotuberculosis [1577 loci] (cgmlstorg)
csakazakii: Cronobacter sakazakii/malonaticus [2831 loci] (cgmlstorg)
diphtheria_1: cgMLST [1305 loci] (BIGSdb)
diphtheria_5: cgMLST_ulcerans [1628 loci] (BIGSdb)
dnodosus_3: cgMLST [714 loci] (pubmlst)
efaecalis: Enterococcus faecalis [1972 loci] (cgmlstorg)
efaecium: Enterococcus faecium [1423 loci] (cgmlstorg)
ehormaechei: Enterobacter hormaechei [2178 loci] (cgmlstorg)
escherichia_v1: cgMLSTv1 [2513 loci] (enterobase)
ftularensis: Francisella tularensis [1147 loci] (cgmlstorg)
hinfluenzae_56: cgMLST v1 [1037 loci] (pubmlst)
hinfluenzae_59: H. parainfluenzae cgMLST v1 [1062 loci] (pubmlst)
klebsiella_10: cgMLST_ST258_ST512_ST1199 [1371 loci] (BIGSdb)
klebsiella_15: cgMLST_KpI [2537 loci] (BIGSdb)
klebsiella_18: scgMLST629_S [629 loci] (BIGSdb)
klebsiella_19: Kpn_cgMLST [2752 loci] (BIGSdb)
klebsiella_26: SL147_cgMLST [852 loci] (BIGSdb)
klebsiella_27: cgMLST1748_v2 [947 loci] (BIGSdb)
klebsiella_3: scgMLST634 [632 loci] (BIGSdb)
koxytoca: Klebsiella oxytoca/grimontii/michiganensis/pasteurii [2536 loci] (cgmlstorg)
kpneumoniae: Klebsiella pneumoniae/variicola/quasipneumoniae [2358 loci] (cgmlstorg)
leptospira_1: cgMLST [545 loci] (BIGSdb)
leptospira_3: capture_cgMLST [545 loci] (BIGSdb)
leptospira_4: cgMLST [1565 loci] (pubmlst)
listeria_15: cgMLST [1748 loci] (BIGSdb)
listeria_3: cgMLST1748 [1748 loci] (BIGSdb)
lmonocytogenes: Listeria monocytogenes [1701 loci] (cgmlstorg)
lpneumophila: Legionella pneumophila [1521 loci] (cgmlstorg)
mabscessus_2: cgMLST [2904 loci] (pubmlst)
mgallisepticum: Mycoplasma gallisepticum [425 loci] (cgmlstorg)
mmorganii: Morganella morganii [2462 loci] (cgmlstorg)
mtuberculosis: Mycobacterium tuberculosis/bovis/africanum/canettii [2891 loci] (cgmlstorg)
mycobacteria_3: MTBC cgMLST [1561 loci] (pubmlst)
neisseria_45: L3 cgMLST [1742 loci] (pubmlst)
neisseria_47: N. meningitidis cgMLST v1 [1605 loci] (pubmlst)
neisseria_62: N. gonorrhoeae cgMLST v1.0 [1649 loci] (pubmlst)
neisseria_68: L44 cgMLST [1699 loci] (pubmlst)
neisseria_72: Human-restricted Neisseria cgMLST v1.0 [1441 loci] (pubmlst)
neisseria_85: N. meningitidis cgMLST v2 [1422 loci] (pubmlst)
neisseria_88: N. meningitidis cgMLST v3 [1329 loci] (pubmlst)
neisseria_89: N. gonorrhoeae cgMLST v2 [1430 loci] (pubmlst)
paeruginosa: Pseudomonas aeruginosa [3867 loci] (cgmlstorg)
plarvae: Paenibacillus larvae [2419 loci] (cgmlstorg)
pmirabilis: Proteus mirabilis [2574 loci] (cgmlstorg)
pmultocida_3: cgMLST (draft) [1233 loci] (pubmlst)
pstuartii: Providencia stuartii [3079 loci] (cgmlstorg)
sagalactiae_38: h_S.agalactiae cgMLST v1.0 [1405 loci] (pubmlst)
salmonella_3: SalmcgMLST v1.0 [2750 loci] (pubmlst)
salmonella_v2: cgMLSTv2 [3002 loci] (enterobase)
sargenteus: Staphylococcus argenteus [2168 loci] (cgmlstorg)
saureus: Staphylococcus aureus [1861 loci] (cgmlstorg)
saureus_20: cgMLST [1716 loci] (pubmlst)
scapitis: Staphylococcus capitis [1492 loci] (cgmlstorg)
serratia_2: cgMLST [2692 loci] (pubmlst)
smarcescens: Serratia marcescens [2692 loci] (cgmlstorg)
spneumoniae_2: cgMLST [1222 loci] (pubmlst)
spyogenes: Streptococcus pyogenes [1095 loci] (cgmlstorg)
spyogenes_11: cgMLST [1198 loci] (pubmlst)
suberis_8: cgMLST [1447 loci] (pubmlst)
vcholerae_3: cgMLST [2443 loci] (pubmlst)
vparahaemolyticus_3: cgMLST [2254 loci] (pubmlst)
xcitri_1: cgMLST [1618 loci] (pubmlst)
yenterocolitica: Yersinia enterocolitica [2582 loci] (cgmlstorg)
yersinia_1: Yersinia cgMLST [500 loci] (BIGSdb)
yersinia_2: Y.enterocolitica cgMLST [1727 loci] (BIGSdb)
yersinia_3: Y.pseudotuberculosis cgMLST [1921 loci] (BIGSdb)

Scheme updating

To update a scheme, run

$ coreprofiler db update -s [name_of_schema] -d [local_scheme_folder_path] -o [update_log_file_path] -k [consumer_token] -ks [consumer_secret] -a [access_token] -as [access_secret]

-k / -ks → OAuth1 consumer key and secret. -a / -as → OAuth1 access token and secret.

Authentication (used for PubMLST and BIGSdb) is optional but recommended to ensure the latest version of the schema is downloaded

This code reads the last local update of the species scheme from data/scheme_url_new.json, and replaces all locus files if the last remote locus update is newer than the last local database update.

Available species schemes and their API requests URL are stored in data/scheme_url_new.json.

Allele identification

The main command coreprofiler allele_calling consist of 2 steps:

Autotag (exact matches on reference database)

It runs BLASTn with 100% identity, no gaps, and parameters optimized for speed (-task megablast and -dust no; genome vs. BLAST db). The output is parsed, only results with 100% coverage are kept.
Scannew (detection of new alleles)

It selects the loci for which no exact match was found in step 1. Creates a BLAST database with those loci files, and runs a looser BLASTn (-task blastn, genome vs. BLAST db). The best hit is kept for each locus and classified as either a new allele, an incomplete allele or a missing allele according to the matching default values:

Allele status	Identity %	Coverage %	Representation
New	> 90	> 90	`[locus]_[sequence_hash]`
Incomplete	> 90	70 - 90	`X`
Missing	< 90	< 90	`-`

CoreProfiler takes as input:

a genome assembly in fasta format
the scheme directory

[!NOTE] Omitting --detailed parameter would characterize incomplete alleles as “-“ (downstream analyses programs may only accept integers and “-“ as allele id’s).

Running allele calling

You can test the tool by running:

$ coreprofiler allele_calling \
    -q tests/data/kp/genomes/10_04A025_hennart2022.fas \
    -sf tests/data/kp/scheme_sample \
    -out test_kp.tsv

Doing so will run makeblastdb on the scheme and create a BLAST database on which to run the genome against. This process may take a bit longer and does not need to be re-run unless the scheme has been updated. The path to a BLAST database can also be provided with the -db argument.

Example: (once the command above has been run and a BLAST database has been created already)

$ coreprofiler allele_calling \
    -q tests/data/kp/genomes/10_04A025_hennart2022.fas \
    -sf tests/data/kp/scheme_sample \
    -db tmp_blast_db/tmp_blast_db.fasta \
    -out test_kp.tsv

[!IMPORTANT]

unlike other cgMLST softwares, CoreProfiler takes as input the scheme files and can not be only provided with its BLAST database.

-db argument must be given which is the path of the main fasta files of the BLAST database.

Detection of new alleles

Core-genome MLST brings out at almost every genome alleles that do not belong to the reference scheme. The aim of CoreProfiler is not to describe new “official” alleles, as that may enter in conflict with the platforms providing official schemes. It is however able to describe locally (temporary) new alleles. A genome is considered to carry a new allele if, for a given locus, the following conditions are met:

no exact match (100% identity & 100% coverage) with the locus alleles
one allele of this locus matched on the genome with:
- sequence identity > “min_id_new_allele” (default: 90)
- sequence coverage > “min_cov_new_allele” (default: 90)

The new allele sequences is the query subject that matched best with one allele (position trimming).

Parameters

general

-q, --query: [required] Path(s) to the genome assemblie(s) (fasta files).
-sf, --scheme_dir: [required] Path to the directory containing the scheme files.
-out: [required] Path to the output file (tsv file).
-db, --blast_db_path: Path to the BLAST database.
-n, --num_threads: Number of threads to run BLAST functions on. Default: 4.
-V, --version: CoreProfiler version.
-v, --verbose: Verbose mode.

autotag

--autotag_word_size: Word size for the autotag BLASTn. Default: 31.

scannew

--outfa: Path to output fasta file with new alleles sequences if detected.
--min_id_new_allele: Minimum identity percentage to consider new alleles. Default: 90.
--min_cov_new_allele: Minimum coverage percentage to consider new alleles. Default: 90.
--min_cov_incomplete: Minimum coverage percentage to consider an allele incomplete (if --detailed option). Default 70.
-d, --detailed: Return detailed information on incomplete alleles: “incomplete, …”. Else, only new alleles or missing.
--profiles_w_tmp_alleles: A JSON file containing info about files with temporary alleles.
--num_alleles_per_locus: TSV file containing the number of alleles per locus of a given scheme.
-cds: Extract new alleles within CDS.

Platform integration

CoreProfiler has 2 modes: allele_calling (see above) and db (to download & updates databases).

Scheme databases can be downloaded with coreprofiler db download and updated with coreprofiler db update [species]. When a database is updated, update stored profiles with coreprofiler platform profiles_update [args], and re-run coreprofiler db get_num_alleles [args]. To limit redondant processes between two database updates, consider running coreprofiler db makeblastdb [args] and coreprofiler db get_num_alleles [args], and providing their output file when needed.

Handling temporary/new alleles

When a new allele is detected, the sequence is extracted and the allele identifier is a MD5-hash of the sequence. This way, one new allele described in parallelized runs is given the same id in output profiles (and from one batch run to another). On a platform that aims to store query genomes and update cgMLST profiles, the --profiles_w_tmp_alleles parameter should be provided to write the profiles containing temporary new alleles in a json file.

The nucleic sequences of alleles described in the json file generated by the --profiles_w_tmp_alleles parameter can be obtained with the --outfa parameter.

For example, the following command:

$ coreprofiler allele_calling \
    -q tests/data/kp/genomes/10_04A025_hennart2022.fas \
    -sf tests/data/kp/scheme_sample \
    -db tmp_blast_db/tmp_blast_db.fasta \
    -out test_kp.tsv \
    --profiles_w_tmp_alleles test_kp_tmp_alleles.json \
    --outfa test_kp_tmp_alleles.fasta

Will generate the following output files:

Output file test_kp.tsv:

seq_id	accB_S	accC_S	accD_S
tests/data/kp/genomes/10_04A025_hennart2022.fas	1	accC_S_6db3bfc0871923e1aedb4998b3c93b37	accD_S_2e603343825da09506bdc8ea287f5584

Output file test_kp_tmp_alleles.json:

{
  "tests/data/kp/genomes/10_04A025_hennart2022.fas":
      {
          "genome": "tests/data/kp/genomes/10_04A025_hennart2022.fas",
          "profile": "test_kp.tsv",
          "tmp_loci":
              ["accC_S",
               "accD_S"]
      }
}

Output file test_kp_tmp_alleles.fasta:

>accC_S_6db3bfc0871923e1aedb4998b3c93b37
[...]
>accD_S_2e603343825da09506bdc8ea287f5584
[...]

Optimizing performance with precomputed files

The BLAST database can be provided to avoid repeating expensive processes (--blast_db_path).
CoreProfiler runs BLAST functions on the query genomes vs. a BLAST database of the species scheme. When the user does not provide a BLAST database, one is created. Building such BLAST database takes some time, and does not need to be repeated unless the scheme has been updated. So, between two updates of a species scheme, consider providing a BLAST database as input. Create one running coreprofiler db makeblastdb -s [scheme_path] -n [blastdb_name] -p [blastdb_path]. 'blastdb_name' is used to name output files, so provide a name compatible with file naming. To provide it as input, add the argument -db [blastdb_path/blastdb_name.fasta] to coreprofiler allele_calling.

Testing

Unittary testing

Unittary tests are written with pytest and can be run with the simple pytest command. To test the API and HTML requests, run pytest -m requests.

Accuracy verification

CoreProfiler is being thoroughly tested against reference platforms.

Autotag: tested by downloading public genomes and their profiles from the platform, running CoreProfiler cgMLST on the genomes and comparing the alleles called. [add details]
Scannew: tested by manually removing alleles from the scheme database and verifying that these are described by CoreProfler “scannew” function.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.

Authors

Sylvain Brisse
Alexis Criscuolo
Alix de Thoisy

Contributors

Alix de Thoisy
Fabien Mareuil
Clea Siguret
Julie Lao
Bérénice Batut
Hugo Lefeuvre
Pierre Marin