DISCLAIMER: This repositroy is deprecated and no further development is happening here. All the code and functionality was moved to https://github.com/rki-mf1/president.
PRESIDENT: PaiRwisE Sequence IDENtiTy
Calculate pairwise nucleotide identity with respect to a reference sequence.
Given a reference and a query sequence, calculate pairwise nucleotide identity with respect to the reference sequence relative to the entire length of the reference. In the main metric, only informative nucleotides (A, T, G, C) are considered identical to each other. The tool also provides some further metrics (e.g. regarding ambiguous ‘N’s) and splits the input FASTA into valid and failed FASTA files for further processing.
Installation:
To install president with conda, run the commands below:
conda create -y -n president -c bioconda -c conda-forge president
conda activate president
Note that pblat is a dependency and only runs on Linux. Alternatively, president can be installed with pip in an environment where pblat is in the PATH:
pip install pypresident
Another possibility is to use a docker image with all dependencies installed:
docker pull rkibioinf/president:latest
Usage:
The pairwise alignment can be run with the following console call:
To run an example, download a query FASTA and
a reference FASTA from GitLab.
Run the alignment with the following command and identity of ACGT bases of 90% and maximal 5% Ns in the query. Note that multiple fasta sequences are allowed to be present in the query but not in the reference FASTA.
a tab-separated file with the below listed columns
a FASTA file with valid sequences
a FASTA file with invalid sequences
The separation between the valid and invalid bin is mainly based on the defined identity/n thresholds (-x, default: 0.9; -n, default: 0.05) and further sanity checks (non-IUPAC characters).
Note that for multiple query inputs the valid and invalid sequence headers are NOT associate with their filename. This meta information must be retrieved from the report.
An example from the call described above can be found online.
The transposed version of the output is shown below.
variable
value
query_name
NC_045512.2 Severe acute […]
reference_name
NC_045512.2 Severe acute […]
file_in_query
NC_045512.2.20mis.fasta
file_in_ref
NC_045512.2.fasta
ACGT Nucleotide identity
0.9987
ACGT Nucleotide identity (ignoring Ns)
0.9994
ACGT Nucleotide identity (ignoring non-ACGTNs)
0.9994
qc_all_valid
True
qc_is_empty_query
False
qc_post_align_pass_threshold
True
qc_post_aligned
True
qc_post_aligned_all_valid
True
qc_valid_length
True
qc_valid_nucleotides
True
qc_valid_pass_nthreshold
True
acgt_bases
29883
iupac_bases
20
non_iupac_bases
0
N_bases
20
length_query
29903
length_reference
29903
LongestNGap
20
Matches
29864
Mismatches
19
query_description
query_index
0
Date
2021-01-20
Definitions:
query_name - FASTA ID of the query sequence + file origin, format :
reference_name - FASTA ID of the reference sequence
file_in_query - the input query file name
file_in_ref - the input reference file name (note that for multiple input queries a tempororay file name is reported)
ACGT Nucleotide identity - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths))
ACGT Nucleotide identity (ignoring Ns) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #Ns in query)
ACGT Nucleotide identity (ignoring non-ACGTNs) - Percentage of ACGT matches to the reference (# matches / max(sequence_lengths) - #non-ACGTNs in query)
qc_all_valid - True if all checks below are True
qc_is_empty_query - True if input query file is not empty
qc_post_align_pass_threshold - True if ‘ACGT Nucleotide identity’ is greater than the -x parameter
qc_post_aligned - Set to True if the sequence was aligned (can be False either because of initial checks or failed alignments)
qc_post_aligned_all_valid - True if all checks before alignment are True, alignment is only performed if this value is True
qc_valid_length - True if the query is actually able to achieve the -x score even if the query is shorter/ longer
qc_valid_nucleotides - True if only valid IUPAC characters are in the query
qc_valid_pass_nthreshold - True if -n percentage of Ns in the query is not exceeded
acgt_bases - number of ACGT bases
iupac_bases - number of IUPAC bases
non_iupac_bases - Number of non-IUPAC nucleotides in the query
N_bases - the number of N bases
length_query - the length of the query
length_reference - the length of the reference
LongestNGap - Lenght of the longest N gap
Matches - the number of matches in the alignment
Mismatches - the number of mismatches in the alignment
query_description - query description, if available
query_index - the position of the sequence in the query fasta input file (for multiple files the counter is not resetted)
Date - the yyyy-mm-dd of the execution of the script
Note: max(sequence_lengths) is equal to max(length_query, length_reference).
Notes:
nextstrain uses a quality threshold of < 3000 non-canonical nucleotides
DISCLAIMER: This repositroy is deprecated and no further development is happening here. All the code and functionality was moved to https://github.com/rki-mf1/president.
PRESIDENT: PaiRwisE Sequence IDENtiTy
Calculate pairwise nucleotide identity with respect to a reference sequence.
Given a reference and a query sequence, calculate pairwise nucleotide identity with respect to the reference sequence relative to the entire length of the reference. In the main metric, only informative nucleotides (A, T, G, C) are considered identical to each other. The tool also provides some further metrics (e.g. regarding ambiguous ‘N’s) and splits the input FASTA into valid and failed FASTA files for further processing.
Installation:
To install president with conda, run the commands below:
Note that
pblatis a dependency and only runs on Linux. Alternatively, president can be installed with pip in an environment where pblat is in the PATH:Another possibility is to use a docker image with all dependencies installed:
Usage:
The pairwise alignment can be run with the following console call:
To run an example, download a query FASTA and a reference FASTA from GitLab.
Run the alignment with the following command and identity of ACGT bases of 90% and maximal 5% Ns in the query. Note that multiple fasta sequences are allowed to be present in the query but not in the reference FASTA.
Output:
The script provides:
The separation between the valid and invalid bin is mainly based on the defined identity/n thresholds (
-x, default: 0.9;-n, default: 0.05) and further sanity checks (non-IUPAC characters). Note that for multiple query inputs the valid and invalid sequence headers are NOT associate with their filename. This meta information must be retrieved from the report. An example from the call described above can be found online.The transposed version of the output is shown below.
Definitions:
-xparameter-xscore even if the query is shorter/ longer-npercentage of Ns in the query is not exceededNote: max(sequence_lengths) is equal to max(length_query, length_reference).
Notes:
ANI definition: