WhatsGNU

What’s Gene Novelty Unit: A Tool For Identifying Proteomic Novelty.

Introduction

WhatsGNU uses the natural variation in public databases to rank protein sequences by the number of exact protein matches observed (the GNU score) across all known genomes of a given species, and it can quickly create a report for every protein in a query.
WhatsGNU compresses a protein database by exact match into a much smaller set of proteins that differ from one another by at least one amino acid. It saves the compressed database as database.txt and, with -a, also as database.pickle for faster subsequent use.
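
The exact-match compression described above can be pictured with a short Python sketch. It is an illustration only, not WhatsGNU's actual implementation: the helper names `compress` and `gnu_score`, the toy records and the in-memory dictionary layout are all assumptions made for this example.

```python
from collections import defaultdict


def compress(records):
    """Collapse (protein_id, sequence) pairs so each distinct sequence is stored once.

    Returns a dict mapping each sequence to the list of ids that carry it.
    """
    db = defaultdict(list)
    for protein_id, sequence in records:
        db[sequence].append(protein_id)
    return dict(db)


def gnu_score(db, sequence):
    """Number of exact matches for `sequence` in the compressed database (0 if absent)."""
    return len(db.get(sequence, []))


# Hypothetical toy data: three genomes carrying variants of one protein.
records = [
    ("strainA_p1", "MVM"),
    ("strainB_p1", "MVM"),
    ("strainC_p1", "MVV"),
]
db = compress(records)
print(len(db))               # distinct protein variants kept
print(gnu_score(db, "MVM"))  # exact matches for MVM
print(gnu_score(db, "MMM"))  # a sequence never seen gets a score of zero
```

Keying the dictionary on the full sequence is what makes a lookup an exact-match test: a single amino acid difference produces a different key.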

Installation

Dependencies

Python3.x
Blastp (optional for WhatsGNU_main.py and required for WhatsGNU_plotter.py)
NumPy (required for WhatsGNU_plotter.py)
SciPy (required for WhatsGNU_plotter.py)
Matplotlib (required for WhatsGNU_plotter.py)
Clone the Github repository
WhatsGNU is a command-line application written in Python 3. Download it, install the dependencies listed above, and add it to your PATH:
```
git clone https://github.com/ahmedmagds/WhatsGNU
cd WhatsGNU/bin
chmod +x *.py
pwd
#pwd will give you a path/to/folder/having/WhatsGNU which you will use in next command
export PATH=$PATH:/path/to/folder/having/WhatsGNU/bin
```
If you need it permanently, you can add this last line to your .bashrc or .bash_profile.
Bioconda
If you use Conda you can use the Bioconda channel to install it in the conda base:
```
conda install -c bioconda whatsgnu
```
OR
Make a new environment and install WhatsGNU in it (recommended)
```
conda create -n WhatsGNU -c bioconda whatsgnu
conda activate WhatsGNU
```
The ‘conda activate’ command is needed to activate the WhatsGNU environment each time you want to use the tool.

If you do not have Miniconda or Anaconda installed already, you can install one of them from:

Miniconda
Anaconda
Windows
Follow instructions for installing Windows Subsystem for Linux (WSL) on https://docs.microsoft.com/en-us/windows/wsl/install-win10
Briefly:

Open PowerShell as Administrator and run:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

Install Linux distribution app from Microsoft Store (tested on Ubuntu 18.04 LTS).
Set up username and password.

Update the system and install dependencies:

sudo apt update && sudo apt upgrade
sudo apt install python3-pip
pip3 install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
sudo apt install unzip
sudo apt install ncbi-blast+
git clone https://github.com/ahmedmagds/WhatsGNU.git
export PATH=$PATH:/home/user_name/WhatsGNU/bin

Note: Your Windows C:\Users\ gets mapped to /mnt/c/Users/ in WSL. You can copy between the two directories using a command like:

cp /mnt/c/Users/Windows_username/Desktop/file.fasta /home/Ubuntu_user_name/
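
The path mapping in the note above can be expressed as a small helper. This is a sketch under the assumption that drives are mounted at /mnt/<lowercase drive letter>, as the note describes; the function name `windows_to_wsl` is hypothetical and not part of WhatsGNU.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path):
    """Translate a drive-letter Windows path to its /mnt/<drive>/... WSL equivalent."""
    win = PureWindowsPath(path)
    drive = win.drive.rstrip(":").lower()
    if len(drive) != 1:
        raise ValueError("expected an absolute path with a drive letter")
    # win.parts[0] is the anchor (e.g. 'C:\\'); the rest are path components.
    return str(PurePosixPath("/mnt", drive, *win.parts[1:]))


print(windows_to_wsl(r"C:\Users\Windows_username\Desktop\file.fasta"))
```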

Test

Type WhatsGNU_main.py -h and it should print the help screen.
Type WhatsGNU_main.py -v and you should see an output like WhatsGNU_main.py 1.0.

Available Databases

There are three different types of databases available to use: basic, ortholog, or hashed basic databases. At this time, hashed ortholog databases are not available for use, but will be in the future. For more information on the uses and limitations of hashed databases, skip to the WhatsGNU_main_hashes.py section under WhatsGNU toolbox.

The following databases are available to download and use:

Ortholog Mode:

Klebsiella pneumoniae Version: 04/17/2020 (compressed 46,072,343 proteins in 8752 genomes to 1,466,934 protein variants). Updated April 2023.
Pseudomonas aeruginosa Version: 07/06/2019 (compressed 14,475,742 proteins in 4712 genomes to 1,288,892 protein variants)
Mycobacterium tuberculosis Version: 07/09/2019 (compressed 26,794,006 proteins in 6563 genomes to 434,725 protein variants).
Staphylococcus aureus Version: April 2024, Size: 14GB (compressed 188,965,356 proteins in 68,299 genomes to 2,702,458 protein variants)
C. difficile Version: July 2024, Size: 3.8GB (compressed 55,048,119 proteins in 14,186 genomes to 617,095 protein variants)

Basic Mode:

Salmonella enterica Enterobase Version: 08/29/2019 (compressed 975,262,506 proteins in 216,642 genomes to 5,056,335 protein variants)
Pseudomonas aeruginosa Version: June 2024, Size: 19GB (compressed 198,278,793 proteins in 31,832 genomes to 3,537,663 protein variants)
Klebsiella pneumoniae Version: June 2024, Size: 37GB (compressed 405,201,811 proteins in 75,246 genomes to 4,425,185 protein variants)
Escherichia coli Version: March 2024, Size: 90 GB (compressed 1,044,408,936 proteins in 211,942 genomes to 15,220,801 protein variants)
Staphylococcus epidermidis Version: January 2025, Size: 779 MB (compressed 11,569,977 proteins in 4,981 genomes to 376,025 protein variants)

Hashed Databases:

Note: Metadata (i.e. number of genomes, protein variants, etc) is the same as above for each of the following species.

Escherichia coli Size: 7.5GB
Pseudomonas aeruginosa Size: 1.4GB
Klebsiella pneumoniae Size: 2.6GB
RefSeq Version: July 2023, Size: 27 GB (compressed 1,166,846,405 proteins in 306,326 genomes to 229,663,320 protein variants)

The databases can be downloaded by visiting the link or by using the wget command. Examples of the wget command follow:

S. aureus Ortholog

Mycobacterium tuberculosis Ortholog

wget -O TB.zip https://www.dropbox.com/sh/8nqowtd4fcf7dgs/AAAdXiqcxTsEqfIAyNE9TWwRa?dl=0
unzip TB.zip -d WhatsGNU_TB_Ortholog

Pseudomonas aeruginosa Ortholog

wget -O Pa.zip https://www.dropbox.com/sh/r0wvoig3alsz7xg/AABPoNu6FdN7zG2PP9BFezQYa?dl=0
unzip Pa.zip -d WhatsGNU_Pa_Ortholog

S. enterica Enterobase

wget -O Senterica_Enterobase_basic_216642.pickle https://www.dropbox.com/s/gbjengikpynxo12/Senterica_Enterobase_basic_216642.pickle?dl=0

Klebsiella pneumoniae hashed

wget https://zenodo.org/records/13384718/files/Kp_basic.tar.gz
tar xvfz Kp_basic.tar.gz

WhatsGNU toolbox

WhatsGNU_get_GenBank_genomes.py
This script downloads genomic fna files or protein faa files from GenBank.
WhatsGNU_database_customizer.py
This script customizes the protein faa files from GenBank, RefSeq, Prokka and RAST by adding a strain name to the start of each protein. It can also customize the strain names in gff files to be used in Roary for pangenome analysis, if the Ortholog mode is going to be used in WhatsGNU.
WhatsGNU_db_download.py
This script will download databases for WhatsGNU. You can check all databases available for WhatsGNU in the file databases_available.csv.
WhatsGNU_main.py
In basic mode, this script ranks protein sequences by the number of exact protein matches observed (the GNU score) across all known genomes of a particular species. It generates a report for all the proteins in your query in seconds using an exact-match compression technique. In ortholog mode, the script additionally links the different alleles of an ortholog group using the clustered proteins output file from Roary or a similar pangenome analysis tool. In this mode, WhatsGNU calculates the Ortholog Variant Rarity Index (OVRI) (scale 0-1). For a given allele, the OVRI is the sum of the GNU scores of all alleles in the orthologous group whose GNU score is less than or equal to the GNU score of that allele, divided by the sum of all GNU scores in the orthologous group. The index represents how unusual a given GNU score is within an ortholog group by measuring how much of the group is made up of alleles with that GNU score or lower. For instance, in an ortholog group with 6 alleles and GNU scores [300,20,15,8,2,1], the allele with GNU=8 gets an OVRI of (8+2+1)/346 = 0.03, while the allele with GNU=300 gets an OVRI of (300+20+15+8+2+1)/346 = 1. An allele with an OVRI of 1 is relatively common regardless of the magnitude of its GNU score, while an allele with an OVRI of 0.03 is relatively rare. This index helps distinguish between ortholog groups with high levels of diversity and ortholog groups that are highly conserved.
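
The OVRI arithmetic in the worked example can be reproduced with a few lines of Python. This is a minimal sketch of the formula as described in this section, not code taken from WhatsGNU_main.py; the function name `ovri` is an assumption.

```python
def ovri(gnu, group_scores):
    """Ortholog Variant Rarity Index for an allele with GNU score `gnu`.

    `group_scores` holds the GNU score of every allele in the ortholog group.
    """
    total = sum(group_scores)
    if total == 0:
        raise ValueError("ortholog group has no sequences")
    # Sum the scores of all alleles that are as rare as, or rarer than, this one.
    at_or_below = sum(score for score in group_scores if score <= gnu)
    return at_or_below / total


scores = [300, 20, 15, 8, 2, 1]
print(round(ovri(8, scores), 2))    # 0.03
print(round(ovri(300, scores), 2))  # 1.0
```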
WhatsGNU_plotter.py
This script plots:

Heatmap of GNU scores of orthologous genes in different isolates.
Metadata distribution bar plot of proteins.
Histogram of the GNU scores of all proteins in a genome.
Volcano plot showing proteins with a lower average GNU score in one group (case) compared to the other (control). The x-axis is the delta average GNU score (Average_GNU_score_case – Average_GNU_score_control) in the ortholog group. Lower average GNU score in cases will have a negative value on the x-axis (red dots) while lower average GNU score in the control group will have positive value on the x-axis (green dots). The y-axis could be drawn as a -log10(P value) from Mann–Whitney-Wilcoxon test. In this case, lower average GNU score in one group (upper left for case or upper right for control) would be of interest as shown by a significant P value (-log10( P value) > 1.3). The y-axis can also be the average OVRI in the case group for negative values on the x-axis or average OVRI in the control group for positive values on the x-axis.
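
The coordinates of a single point on the P value volcano plot can be sketched as follows. The helper `volcano_point`, the example scores and the P value are hypothetical; the P value is assumed to have been computed elsewhere (for instance by a Mann-Whitney test), and this sketch does not reproduce the plotting code itself.

```python
import math
from statistics import mean


def volcano_point(case_scores, control_scores, p_value):
    """Return (x, y) for one ortholog group on the P value volcano plot.

    x is the delta average GNU score (case minus control);
    y is -log10(p_value), where p_value comes from a test run elsewhere.
    """
    x = mean(case_scores) - mean(control_scores)
    y = -math.log10(p_value)
    return x, y


# Hypothetical GNU scores for one ortholog group and a made-up P value.
x, y = volcano_point([2, 4, 6], [100, 120, 140], p_value=0.01)
print(x, y)
print(x < 0 and y > 1.3)  # upper-left region: lower GNU in cases, P < 0.05
```

The 1.3 threshold corresponds to P = 0.05, since -log10(0.05) is approximately 1.301.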

WhatsGNU_main_hashes.py
This script is compatible only with the hashed versions of the databases. Each hashed database comes with a CSV file that is required to run this script. The corresponding CSV for each hashed database can be found in the respective gzipped tarball. Functions available in this version of the script include generating a basic WhatsGNU report (see below for formatting of report), creating a file of each protein with all associated ids from the database (-i) and creating a file with the top genomes (-t/-tn). With this script, you cannot currently run blastp on the proteins with a GNU score of zero (i.e. the -b, --blastp option is not available).

Usage for WhatsGNU_db_download.py

Input

database name (e.g. Sau, Kp, TB, Pa, Staphopia, S.enterica or all)
```
WhatsGNU_db_download.py Sau
```

Usage for WhatsGNU_main.py

Input

database (precompressed (.pickle or .txt) or raw (.faa)).
Query protein FASTA file (.faa) or folder of query files.

Optional for S. aureus: The CSV file of Metadata (CC/ST) frequencies for the S. aureus database.

Use precompressed databases

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog query.faa
or
WhatsGNU_main.py -d Senterica_Enterobase_basic_216642.pickle -dm basic query.faa

You can also use a folder of multiple .faa query files as input (e.g. folder_faa/ has all .faa files to be processed)

WhatsGNU_main.py -d TB_Ortholog_6563.pickle -dm ortholog folder_faa/

Use precompressed databases with more features

You can assign output folder name using -o instead of default (WhatsGNU_results_timestamp)

WhatsGNU_main.py -d Sau_Staphopia_basic_43914.pickle -dm basic -o output_results_folder query.faa

Create a file of each protein with all associated ids from the database (Note: this is a large file, ~1 GB for 3000 proteins)

WhatsGNU_main.py -d Pa_Ortholog_4713.pickle -dm ortholog -i -o output_results_folder query.faa

Create a file of top 10 genomes with hits

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -t query.faa
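
Conceptually, a top-genomes list ranks database genomes by how many query proteins they match exactly. The sketch below shows that idea with a counter; the `hits` mapping, the genome names and the `top_genomes` helper are hypothetical, and the real script may count or break ties differently.

```python
from collections import Counter


def top_genomes(hits, n=10):
    """Rank genomes by how many query proteins they match exactly.

    `hits` maps each query protein to the genomes containing an identical protein.
    """
    counts = Counter()
    for genomes in hits.values():
        counts.update(set(genomes))  # count each genome once per query protein
    return counts.most_common(n)


hits = {
    "query_1": ["genomeA", "genomeB"],
    "query_2": ["genomeA"],
    "query_3": ["genomeA", "genomeC", "genomeB"],
}
print(top_genomes(hits, n=2))
```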

Check how many hits you get from a particular genome in the database (It has to be used with -t). The names of the different strains in the databases and their corresponding Genbank strain name and GCA number are available from List of Genomes included

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -t -s FDAARGOS_31_GCA_001019015.2_CC8_ query.faa

Get Metadata (CC/ST) composition of your hits in the report (Only for S. aureus and you will need to use the metadata_frequencies.csv file (available to download with the database) with -e)

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -e metadata_frequencies.csv query.faa

Get a fasta (.faa) file of all proteins with GNU score of zero.

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -f query.faa

The following options work with -dm ortholog

Run blastp on the proteins with GNU score of zero and modify the report with ortholog information.

WhatsGNU_main.py -d Pa_Ortholog_4713.pickle -dm ortholog -b query.faa

Note: If -b is used, WhatsGNU will search for compressed_db_orthologs.faa and compressed_db_orthologs_info.txt in the same path for the compressed database as they are needed for the blastp run.

Get the output report of blastp run (works with -b).

WhatsGNU_main.py -d Pa_Ortholog_4713.pickle -dm ortholog -b -op query.faa

Select blastp percent identity and percent coverage cutoff values [Default 80 for each], range(0,100).

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -b -w 90 -c 50 query.faa

Select an OVRI cutoff value [Default 0.045], range (0-1).

WhatsGNU_main.py -d TB_Ortholog_6563.pickle -dm ortholog -ri 0.09 query.faa

Use all features together

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -o output_results_folder -i -t -s strain_name -e metadata_frequencies.csv -f -b -op -w 95 -c 40 -ri 0.09 query.faa

Command line options

WhatsGNU_main.py -h
usage: WhatsGNU_main.py [-h] [-m MKDATABASE | -d DATABASE] [-a] [-j]
                        [-r [ROARY_CLUSTERED_PROTEINS]] [-dm {ortholog,basic}]
                        [-ri [RARITY_INDEX]] [-o OUTPUT_FOLDER] [--force]
                        [-p PREFIX] [-t] [-s STRAINHITS] [-e METADATA] [-i]
                        [-f] [-b] [-op] [-w [PERCENT_IDENTITY]]
                        [-c [PERCENT_COVERAGE]] [-q] [-v]
                        query_faa

WhatsGNU v1.0 utilizes the natural variation in public databases to rank
protein sequences based on the number of observed exact protein matches
(the GNU score) in all known genomes of a particular species. It generates a
report for all the proteins in your query in seconds.

positional arguments:
  query_faa             Query protein FASTA file/s to analyze (.faa)

optional arguments:
  -h, --help            show this help message and exit
  -m MKDATABASE, --mkdatabase MKDATABASE
                        you have to provide path to faa file or a folder of
                        multiple faa files for compression
  -d DATABASE, --database DATABASE
                        you have to provide path to your compressed database
  -a, --pickle          Save database in pickle format [Default only txt file]
  -j, --sql             Save database in SQL format for large Databases
                        [Default only txt file]
  -r [ROARY_CLUSTERED_PROTEINS], --roary_clustered_proteins [ROARY_CLUSTERED_PROTEINS]
                        clustered_proteins output file from roary to be used
                        with -m
  -dm {ortholog,basic}, --database_mode {ortholog,basic}
                        select a mode from 'ortholog' or 'basic' to be used
                        with -d
  -ri [RARITY_INDEX], --rarity_index [RARITY_INDEX]
                        select an ortholog variant rarity index (OVRI) cutoff
                        value in range (0-1)[0.045] for ortholog mode
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Database output prefix to be created for results
                        (default: timestamped WhatsGNU_results in the current
                        directory)
  --force               Force overwriting existing results folder assigned
                        with -o (default: off)
  -p PREFIX, --prefix PREFIX
                        Prefix for output compressed database (default:
                        WhatsGNU_compressed_database)
  -t, --topgenomes      create a file of top 10 genomes with hits
  -s STRAINHITS, --strainhits STRAINHITS
                        check how many hits you get from a particular
                        strain,it has to be used with -t
  -e METADATA, --metadata METADATA
                        get the metadata composition of your hits, use the
                        metadata_frequency.csv file produced by the WhatsGNU
                        customizer script
  -i, --ids_hits        create a file of each protein with locus_tags (ids) of
                        all hits from the database, large file (~ 1 Gb for
                        3000 pts)
  -f, --faa_GNU_0       get a fasta (.faa) file of all proteins with GNU score
                        of zero
  -b, --blastp          run blastp on the proteins with GNU score of zero and
                        modify the report with ortholog_info, blastp has to be
                        installed
  -op, --output_blastp  get the output report of blastp run, it has to be used
                        with -b
  -w [PERCENT_IDENTITY], --percent_identity [PERCENT_IDENTITY]
                        select a blastp percent identity cutoff value [80],
                        range(0,100)
  -c [PERCENT_COVERAGE], --percent_coverage [PERCENT_COVERAGE]
                        select a blastp percent coverage cutoff value [80],
                        range(0,100)
  -q, --quiet           No screen output [default OFF]
  -v, --version         print version and exit

Output

Always with -m or -d

query_WhatsGNU_report_v1.txt (tab-separated output file)

Basic Mode

protein	GNU score	length	function	sequence
strain_x_protein_1	2	3	argG	MVM

Ortholog Mode (in addition to the previous five columns)

ortholog_group	ortho_gp_total_sequences_number	ortho_gp_total_variants_number	minimum_GNU	maximum_GNU	average_GNU	OVRI	OVRI interpretation
argG	100	5	2	50	38	0.02	rare

Explanation for the columns in the report: For instance, if strain_x_protein_1 (sequence: MVM) belongs to argG orthologous group which has 5 protein variants (MMMM,MVVM, MVM, MVV and VVM) with GNU scores [50,35,10,3,2]:

Column 1: protein name
Column 2: GNU score (number of exact matches in the database)
Column 3: protein sequence length
Column 4: function from the database
Column 5: protein sequence
Column 6: name of the orthologous group
Column 7: total number of sequences (sum of GNU scores) in the orthologous group
Column 8: Number of protein variants (alleles) in the orthologous group
Column 9: minimum GNU score in the orthologous group
Column 10: maximum GNU score in the orthologous group
Column 11: average GNU score in the orthologous group
Column 12: Ortholog Variant Rarity Index (OVRI) (scale is 0-1) which is
(GNU score of the allele + lower GNU scores)/(Sum of GNU scores in the ortholog group)
For example, for the variant that has GNU=2 it will be 2/100 = 0.02,
while for the variant that has GNU=10 it will be (10+3+2)/100 = 0.15,
and finally for the variant that has GNU=50 it will be (50+35+10+3+2)/100 = 1
Column 13: A rare/frequent tag for the protein based on its OVRI. The default cutoff, which is arbitrary, is 0.045: anything below this value is tagged rare and anything above it is tagged frequent.

Note: If -e option is used for S. aureus, CC/ST percentages’ columns will be added to the report.
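
Because the report is tab-separated, it can be post-processed with the Python standard library. The sketch below pulls out the proteins tagged as rare. It assumes the report begins with a header row carrying the column names shown above; the helper `rare_proteins` is hypothetical, and the in-memory example stands in for a real query_WhatsGNU_report_v1.txt file.

```python
import csv
import io


def rare_proteins(report_file):
    """Yield (protein, GNU score, OVRI) for rows tagged 'rare' in an ortholog-mode report."""
    reader = csv.DictReader(report_file, delimiter="\t")
    for row in reader:
        if row["OVRI interpretation"] == "rare":
            yield row["protein"], int(row["GNU score"]), float(row["OVRI"])


# In-memory stand-in for a report, built from the example rows above.
example = io.StringIO(
    "protein\tGNU score\tlength\tfunction\tsequence\tortholog_group\t"
    "ortho_gp_total_sequences_number\tortho_gp_total_variants_number\t"
    "minimum_GNU\tmaximum_GNU\taverage_GNU\tOVRI\tOVRI interpretation\n"
    "strain_x_protein_1\t2\t3\targG\tMVM\targG\t100\t5\t2\t50\t38\t0.02\trare\n"
)
print(list(rare_proteins(example)))
```

For a report on disk, the same function can be given a file opened with `open(path, newline="")`.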

WhatsGNU_date_time.log (Log file, e.g. WhatsGNU_v1_20190209_183406.log)

Always with -m

compressed_db.txt (if -a, compressed_db.pickle will be created)
compressed_db_orthologs.faa (if “-r clustered_proteins” is used with -m)
compressed_db_orthologs_info.txt (if “-r clustered_proteins” is used with -m)

Optional

Option	File	Description
-i	query_WhatsGNU_hits.txt	each protein with all hits_ids from the database,large file (~ 1 Gb for S. aureus)
-t	query_WhatsGNU_topgenomes.txt	top 10 genomes with hits to your query
-f	query_WhatsGNU_zeros.faa	file of all proteins with GNU score of zero
-op	query_WhatsGNU_zeros_blast_report.txt	output report of blastp run

Usage for WhatsGNU_plotter.py

Input

A folder of query_WhatsGNU_report.txt files.

Heatmap

Plot a heatmap of GNU scores for the proteins in proteins.faa, with strains ordered as listed in strains_order.txt. Assign a title using -t. Font size and figure size (w,h) are given by -f and -fs, respectively. Annotate the heatmap cells with the OVRI rare tag using the -r option.

WhatsGNU_plotter.py -hp ortholog -q proteins.faa -d strains_order.txt -t title -r -f 14 -fs 14 10 prefix_name WhatsGNU_reports_folder/

Metadata percentage distribution

Plot a metadata percentage bar plot for the GNU scores of the proteins in proteins.faa for each WhatsGNU report.

WhatsGNU_plotter.py -mb basic -q proteins.faa prefix_name WhatsGNU_reports_folder/

Histogram

Plot a blue histogram of the GNU scores for each WhatsGNU report using 100 bins and get a text file showing novel and conserved proteins with -p option to assign cutoffs.

WhatsGNU_plotter.py -x -e blue -b 100 -p 50 5000 prefix_name WhatsGNU_reports_folder/
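
The two cutoffs passed to -p split proteins into novel and conserved sets by GNU score. The sketch below illustrates that split. Whether the real script treats the limits as inclusive is not stated here, so the inclusive bounds are an assumption, as are the function name `classify` and the 'intermediate' label for scores between the two limits.

```python
def classify(gnu, novel_upper=10, conserved_lower=100):
    """Label a GNU score as 'novel', 'conserved', or 'intermediate'.

    Inclusive bounds are an assumption of this sketch, not documented behavior.
    """
    if gnu <= novel_upper:
        return "novel"
    if gnu >= conserved_lower:
        return "conserved"
    return "intermediate"


# Cutoffs matching the example command (-p 50 5000).
for score in (0, 50, 51, 4999, 5000):
    print(score, classify(score, novel_upper=50, conserved_lower=5000))
```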

Volcano plot

Plot two scatterplots that show either statistical significance (P value) or average OVRI versus magnitude of change (Delta_average_GNU_Score). The case/control tag is provided in isolates_case_control_tag.csv. The option -c 100 sets the percentage of isolates a protein must be present in to be included. A summary statistics file is also created.

WhatsGNU_plotter.py -st isolates_case_control_tag.csv -c 100 prefix_name WhatsGNU_reports_folder/

All features together

WhatsGNU_plotter.py -hp ortholog -q proteins.faa -d strains_order.txt -t title -r -f 16 -fs 14 10 -mb ortholog -x -e blue -b 100 -st isolates_case_control_tag.csv -c 100 prefix_name WhatsGNU_reports_folder/

Command line options

WhatsGNU_plotter.py -h
usage: WhatsGNU_plotter.py [-h] [-hp {ortholog,basic}] [-l LIST_GENES]
                           [-q FASTA] [-op] [-d STRAINS_ORDER] [-r]
                           [-rc RARITY_COLOR] [-fs FIGURE_SIZE FIGURE_SIZE]
                           [-hc HEATMAP_COLOR] [-mc MASKED_COLOR]
                           [-f FONT_SIZE] [-t TITLE] [-mb {ortholog,basic}]
                           [-w] [-s SELECT_METADATA] [-x] [-e HISTOGRAM_COLOR]
                           [-b HISTOGRAM_BINS]
                           [-p NOVEL_CONSERVED NOVEL_CONSERVED]
                           [-st STRAINS_TAG_VOLCANO] [-c CUTOFF_VOLCANO]
                           [-cc CASE_CONTROL_NAME CASE_CONTROL_NAME]
                           prefix_name directory_path

WhatsGNU_plotter script for WhatsGNU v1.0.

positional arguments:
  prefix_name           prefix name for the the output folder and
                        heatmap/volcano output files
  directory_path        path to directory of WhatsGNU reports

optional arguments:
  -h, --help            show this help message and exit
  -hp {ortholog,basic}, --heatmap {ortholog,basic}
                        heatmap of GNU scores for orthologous genes in
                        multiple isolates
  -l LIST_GENES, --list_genes LIST_GENES
                        a txt file of ortholog group names from one of the
                        WhatsGNU reports for heatmap
  -q FASTA, --fasta FASTA
                        a FASTA file of sequences for the proteins of interest
                        for heatmap or metadata barplot
  -op, --output_blastp  get the output report of blastp run, it has to be used
                        with -q
  -d STRAINS_ORDER, --strains_order STRAINS_ORDER
                        list of strains order for heatmap
  -r, --rarity          Annotate heatmap cells with OVRI(default: off)
  -rc RARITY_COLOR, --rarity_color RARITY_COLOR
                        OVRI data text color in the heatmap
  -fs FIGURE_SIZE FIGURE_SIZE, --figure_size FIGURE_SIZE FIGURE_SIZE
                        heatmap width and height in inches w,h, respectively
  -hc HEATMAP_COLOR, --heatmap_color HEATMAP_COLOR
                        heatmap color
  -mc MASKED_COLOR, --masked_color MASKED_COLOR
                        missing data color in heatmap
  -f FONT_SIZE, --font_size FONT_SIZE
                        heatmap font size
  -t TITLE, --title TITLE
                        title for the heatmap [Default:WhatsGNU heatmap]
  -mb {ortholog,basic}, --metadata_barplot {ortholog,basic}
                        Metadata percentage distribution for proteins in a
                        FASTA file
  -w, --all_metadata    all metadata
  -s SELECT_METADATA, --select_metadata SELECT_METADATA
                        select some metadata
  -x, --histogram       histogram of GNU scores
  -e HISTOGRAM_COLOR, --histogram_color HISTOGRAM_COLOR
                        histogram color
  -b HISTOGRAM_BINS, --histogram_bins HISTOGRAM_BINS
                        number of bins for the histograms [10]
  -p NOVEL_CONSERVED NOVEL_CONSERVED, --novel_conserved NOVEL_CONSERVED NOVEL_CONSERVED
                        upper and lower GNU score limits for novel and
                        conserved proteins novel_GNU_upper_limit,
                        conserved_GNU_lower_limit, respectively [Default 10,
                        100]
  -st STRAINS_TAG_VOLCANO, --strains_tag_volcano STRAINS_TAG_VOLCANO
                        a csv file of the strains of the two groups to be
                        compared with (case/control) tag
  -c CUTOFF_VOLCANO, --cutoff_volcano CUTOFF_VOLCANO
                        a percentage of isolates a protein must be in [Default:
                        100]
  -cc CASE_CONTROL_NAME CASE_CONTROL_NAME, --case_control_name CASE_CONTROL_NAME CASE_CONTROL_NAME
                        case and control groups' names [Default: case control]

Output

A heatmap, metadata percentage distribution bar plot, histogram and two volcano plots and summary statistics files.

Instructions for creating a database

Simple (GenBank)

Download proteomes of a species (.faa) in a Directory from GenBank
```
WhatsGNU_get_GenBank_genomes.py -f GCAs.txt Species_faa
```

Modify the faa files to have the strains’ names

WhatsGNU_database_customizer.py -c -g Species_modified Species_faa/

Run WhatsGNU_main.py in basic mode
```
WhatsGNU_main.py -m Species_modified_concatenated.faa query.faa
```
Simple (Prokka-annotated faa files)
Annotate your genomes with Prokka and put all faa files in one folder

Modify the faa files to have the strains’ names

WhatsGNU_database_customizer.py -c -p Species_modified Species_faa/

Run WhatsGNU_main.py in basic mode

WhatsGNU_main.py -m Species_modified_concatenated.faa query.faa

query.faa can be any faa file; it does not matter at this step.

Advanced (e.g. S. aureus)

Download genomes of a species (.fna) in a Directory from GenBank

WhatsGNU_get_GenBank_genomes.py -c GCAs.txt Sau_fna
gunzip Sau_fna/*

Annotate the genomes using Prokka

An example command for S. aureus is given, change it or use any other options from Prokka

for i in `cat file_names.list`;do prokka --kingdom Bacteria --outdir prokka_$i --gcode 11 --genus Staphylococcus --species aureus --strain $i --prefix $i --locustag $i Sau_fna/$i*.fna; done
find ./ -name '*.faa' -exec cp -prv '{}' '/Sau_faa/' ';'
find ./ -name '*.gff' -exec cp -prv '{}' '/Sau_gff/' ';'

Modify the faa and gff files to have the strains’ names
```
WhatsGNU_database_customizer.py -c -p -l strain_name_list.csv Sau_modified_faa Sau_faa/
WhatsGNU_database_customizer.py -i -s -l strain_name_list.csv -g Sau_modified_gff Sau_gff/
```
The strain_name_list.csv is a comma-separated list of 3+ columns: file_name, old locustag, new locustag and optionally metadata. If metadata are provided, the script will concatenate the new locustag with the metadata using ‘_’ as a separator. The new locustag in this case will be: new_locustag_metadata. In case of GenBank, RefSeq and RAST, use NA for the old locustag column in the list.csv file.
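
A list file with this column layout can be generated with the csv module. The sketch below is illustrative only: the file names, locus tags and metadata values are hypothetical, the helper `write_strain_list` is not part of WhatsGNU, and the absence of a header row is an assumption.

```python
import csv
import io


def write_strain_list(rows, handle):
    """Write rows of (file_name, old_locustag, new_locustag[, metadata]) as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    for row in rows:
        if len(row) < 3:
            raise ValueError("each row needs file_name, old locustag, new locustag")
        writer.writerow(row)


# Hypothetical entries: a Prokka file with an old locus tag and metadata,
# and a GenBank file, which uses NA for the old locus tag.
rows = [
    ("strain1.faa", "ABC", "strain1", "CC8"),
    ("GCA_000000000.1.faa", "NA", "strain2"),
]
buffer = io.StringIO()
write_strain_list(rows, buffer)
print(buffer.getvalue())
```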
Run Roary for pangenome analysis

An example command for Roary is given, change it or use any other options from Roary

roary Sau_modified_gff/*.gff

Run WhatsGNU_main.py in ortholog mode using the clustered_proteins output file from Roary

WhatsGNU_main.py -m Sau_modified_concatenated.faa -r clustered_proteins query.faa

Command line options for WhatsGNU_get_GenBank_genomes.py

WhatsGNU_get_GenBank_genomes.py -h
usage: WhatsGNU_get_GenBank_assemblies.py [-h] [-f] [-c] [-r]
                                          list output_folder

Get GenBank assemblies (faa or/and fna) for WhatsGNU v1.0

positional arguments:
  list           a list.txt file of GenBank accession numbers (GCA#.#)
  output_folder  give name for output folder to be created

optional arguments:
  -h, --help     show this help message and exit
  -f, --faa      protein faa file from GenBank
  -c, --contigs  genomic fna file from GenBank
  -r, --remove   remove assembly_summary_genbank.txt after done

Command line options for WhatsGNU_database_customizer.py

WhatsGNU_database_customizer.py -h
usage: WhatsGNU_database_customizer.py [-h] [-g | -p | -r | -s] [-z]
                                       [-l LIST_CSV] [-i] [-c]
                                       prefix_name directory_path

Database_customizer script for WhatsGNU v1.0.

positional arguments:
  prefix_name           prefix name for the output folder and the one
                        concatenated modified file
  directory_path        path to directory of faa, RAST txt or gff files

optional arguments:
  -h, --help            show this help message and exit
  -g, --GenBank_RefSeq  faa files from GenBank or RefSeq
  -p, --prokka          faa files from Prokka
  -r, --RAST            spreadsheet tab-separated text files from RAST
  -s, --gff_file        gff file from prokka, needed if planning to run Roary
  -z, --gzipped         compressed file (.gz)
  -l LIST_CSV, --list_csv LIST_CSV
                        a file.csv of 3+ columns: file_name, old locustag, new
                        locustag and optionally metadata
  -i, --individual_files
                        individual modified files
  -c, --concatenated_file
                        one concatenated modified file of all input files

Example usage for WhatsGNU_main_hashes.py

Using the hashed database to generate basic WhatsGNU reports

WhatsGNU_main_hashes.py -d Kp_basic_db_hashed_str.pickle -csv Kp_basic_db_hashed.csv -o WhatsGNU_Kp_op faa/

Finding the 10 genomes closest to your genomes of interest

WhatsGNU_main_hashes.py -d PA_basic_db_hashed_str.pickle -csv PA_basic_db_hashed.csv -t -o WhatsGNU_PA_op faa/

By default, when using -i/--ids_hits the output file reports the hashed values of the hits. To get the accession numbers instead, use the --accession_names option

WhatsGNU_main_hashes.py -d basic_Ecoli_db_hashed_str.pickle -csv basic_Ecoli_db_hashed.csv -i --accession_names -o WhatsGNU_Ecoli_op faa/

Command line options for WhatsGNU_main_hashes.py

usage: WhatsGNU_main.py [-h] [-d DATABASE] [-o OUTPUT_FOLDER] [--force] [-p PREFIX] [-t] [-csv CSV] [-tn TOPGENOMES_COUNT] [-s STRAINHITS] [-i] [--accession_names] [--hash_values] [-q]
                        [-v]
                        query_faa

WhatsGNU v1.4 utilizes the natural variation in public databases to rank protein sequences based on the number of observed exact protein matches (the GNU score) in all known genomes of a
particular species. It generates a report for all the proteins in your query in seconds.

positional arguments:
  query_faa             Query protein FASTA file/s to analyze (.faa)

options:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        you have to provide path to your compressed database
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Database output prefix to be created for results (default: timestamped WhatsGNU_results in the current directory)
  --force               Force overwriting existing results folder assigned with -o (default: off)
  -p PREFIX, --prefix PREFIX
                        Prefix for output compressed database (default: WhatsGNU_compressed_database)
  -t, --topgenomes      create a file of top N genomes with most number of exact matches to query [Default top 10 genomes]
  -csv CSV              csv file of hashed inputs
  -tn TOPGENOMES_COUNT, --topgenomes_count TOPGENOMES_COUNT
                        select number of closest top genomes to show [Default top 10 genomes]
  -s STRAINHITS, --strainhits STRAINHITS
                        check how many hits you get from a particular strain,it has to be used with -t
  -i, --ids_hits        create a file of each protein with locus_tags (ids) of all hits from the database, large file (~ 1 Gb for 3000 pts)
  --accession_names     to be used with --ids_hits. If this option is selected, writes the id_hits file with the accession names.
  --hash_values         to be used with --ids_hits. Default option. This options writes the id_hits file with the hashed values.
  -q, --quiet           No screen output [default OFF]
  -v, --version         print version and exit

Requests for creating a database

Requests to process a database for a specific species are welcome and will be considered.

Bugs

Please submit via the GitHub issues page: https://github.com/ahmedmagds/WhatsGNU/issues

Software Licence

GPLv3: https://github.com/ahmedmagds/WhatsGNU/blob/master/LICENSE

Citations

WhatsGNU

WhatsGNU: a tool for identifying proteomic novelty
Moustafa AM and Planet PJ 2020, Genome Biology;21:58

Other tools

Please cite Prokka ‘Seemann 2014, Bioinformatics;30(14):2068-9’ if you use WhatsGNU.
Please also cite Roary ‘Page et al. 2015, Bioinformatics;31(22):3691-3693’ if you use WhatsGNU.
Please also cite BLAST+ ‘Camacho et al. 2009, BMC Bioinformatics;10:421’ if you use WhatsGNU.
Please cite Staphopia ‘Petit RA III and Read TD 2018, PeerJ;6:e5261’ if you use Staphopia S. aureus Database.
Please cite Enterobase ‘Alikhan NF et al. 2018, PLoS Genetics;14(4):e1007261’ if you use Enterobase S. enterica Database.
Author
Ahmed M. Moustafa: ahmedmagds
Twitter: Ahmed_Microbes
