KITSUNE: K-mer-length Iterative Selection for UNbiased Ecophylogenomics
KITSUNE is a toolkit for evaluating the length of k-mer in a given genome dataset for alignment-free phylogenomic analysis.
K-mer based approaches are simple and fast, and they are widely used in many applications, including biological sequence comparison. However, the selection of a k-mer length that captures enough information content for comparison is often overlooked. An optimal k-mer length is a prerequisite for obtaining biologically meaningful genomic distances when assessing phylogenetic relationships. We therefore developed KITSUNE to support k-mer length selection in a systematic way, based on the three-step approach described in "Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer".
KITSUNE calculates three metrics across the considered k-mer range:
Cumulative Relative Entropy (CRE)
Average number of Common Features (ACF)
Observed Common Features (OCF)
Moreover, KITSUNE also provides various genomic distance calculations from the k-mer frequency vectors that can be used for species identification or phylogenomic tree construction.
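To make the idea of a k-mer frequency vector concrete, here is a minimal pure-Python sketch. It is not KITSUNE's implementation (KITSUNE delegates counting to Jellyfish); the function names `kmer_counts` and `frequency_vector` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence with a sliding window."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequency_vector(sequence, k, alphabet="ACGT"):
    """Return relative k-mer frequencies in a fixed lexicographic order.

    Assumption: k-mers containing symbols outside `alphabet` are ignored.
    """
    counts = kmer_counts(sequence, k)
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


vec = frequency_vector("ACGTACGTAC", 2)
print(len(vec))  # one entry per possible 2-mer
```

Because every genome is mapped onto the same ordered list of possible k-mers, two such vectors can be compared position by position with any vector distance.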
# Clone the GitHub repository
git clone https://github.com/natapol/kitsune
# Move to the kitsune folder
cd kitsune/
# Install
python setup.py install
Usage
Overview of kitsune
command for listing help
$ kitsune --help
usage: kitsune <command> [<args>]
Available commands:
acf Compute average number of common features between signatures
cre Compute cumulative relative entropy
dmatrix Compute distance matrix
kopt Compute recommended choice (optimal) of kmer within a given kmer interval for a set of genomes using the cre, acf and ofc
ofc Compute observed feature frequencies
Use --help in conjunction with one of the commands above for a list of available options (e.g. kitsune acf --help)
Calculate CRE, ACF, and OFC values for a specific k-mer
Kitsune provides three commands for finding an appropriate k-mer length using CRE, ACF, and OCF:
Calculate CRE
$ kitsune cre -h
usage: kitsune (cre) [-h] --filename FILENAME [--fast] [--canonical] -ke KEND
[-kf KFROM] [-t THREAD] [-o OUTPUT]
Calculate k-mer from cumulative relative entropy of all genomes
optional arguments:
-h, --help show this help message and exit
--filename FILENAME A genome file in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-ke KEND, --kend KEND
Last k-mer (default: None)
-kf KFROM, --kfrom KFROM
Calculate from k-mer (default: 4)
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
Output filename (default: None)
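The `--canonical` flag refers to canonical k-mer counting in Jellyfish, where a k-mer and its reverse complement are counted as one feature, represented by the lexicographically smaller of the two. The sketch below illustrates that convention in plain Python; the helper names are hypothetical and the code is not taken from KITSUNE or Jellyfish.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_counts(sequence, k):
    """Count k-mers after collapsing each one to its canonical form."""
    return Counter(canonical(sequence[i:i + k])
                   for i in range(len(sequence) - k + 1))


print(canonical("TTGA"))
print(canonical_counts("AATT", 3))
```

Canonical counting makes the result independent of which DNA strand was sequenced, which is usually what is wanted for assembled genomes of unknown orientation.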
Calculate ACF
$ kitsune acf -h
usage: kitsune (acf) [-h] --filenames FILENAMES [FILENAMES ...] [--fast]
[--canonical] -k KMERS [KMERS ...] [-t THREAD]
[-o OUTPUT]
Calculate an average number of common features pairwise between one genome
against others
optional arguments:
-h, --help show this help message and exit
--filenames FILENAMES [FILENAMES ...]
Genome files in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
Have to state before (default: None)
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
Output filename (default: None)
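As a rough illustration of what "average number of common features between one genome and the others" can mean, the following sketch counts the distinct k-mers a reference sequence shares with each other sequence and averages those counts. This is an assumption about the metric, not KITSUNE's code: whether the tool counts distinct k-mers or weights them by occurrence is not stated here, and the function names are hypothetical.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def average_common_features(reference, others, k):
    """Average number of distinct k-mers shared between `reference` and each other sequence."""
    if not others:
        raise ValueError("need at least one genome to compare against")
    ref = kmer_set(reference, k)
    shared = [len(ref & kmer_set(other, k)) for other in others]
    return sum(shared) / len(shared)


acf = average_common_features("ACGTAC", ["ACGTTT", "TTTTTT"], 3)
print(acf)
```

As k grows, fewer k-mers are shared by chance, so a value like this tends to drop; that is why the metric is useful for bounding k from above.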
Calculate OFC
$ kitsune ofc -h
usage: kitsune (ofc) [-h] --filenames FILENAMES [FILENAMES ...] [--fast]
[--canonical] -k KMERS [KMERS ...] [-t THREAD]
[-o OUTPUT]
Calculate an observe feature frequency
optional arguments:
-h, --help show this help message and exit
--filenames FILENAMES [FILENAMES ...]
Genome files in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-k KMERS [KMERS ...], --kmers KMERS [KMERS ...]
-t THREAD, --thread THREAD
-o OUTPUT, --output OUTPUT
Output filename (default: None)
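The help text does not define the observed-feature metric precisely. One plausible reading, sketched below under that assumption, is the number of distinct k-mers actually observed in a set of genomes, optionally expressed as a fraction of the 4^k k-mers that are possible over the DNA alphabet. The function name `observed_features` is hypothetical.

```python
def observed_features(sequences, k):
    """Return (number of distinct k-mers seen, fraction of the 4**k possible k-mers).

    Assumption: sequences contain only the bases A, C, G, and T.
    """
    observed = set()
    for seq in sequences:
        observed.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return len(observed), len(observed) / 4 ** k


count, fraction = observed_features(["ACGT", "CGTA"], 2)
print(count, fraction)
```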
Calculate genomic distance at a specific k-mer length from the k-mer frequency vectors of two genomes
Kitsune provides a command to calculate genomic distance using different distance estimation methods. Users can assess the impact of a selected k-mer length on the genomic distance of their choice from the options below.
| distance option | name |
| --- | --- |
| braycurtis | Bray-Curtis distance |
| canberra | Canberra distance |
| chebyshev | Chebyshev distance |
| cityblock | City Block (Manhattan) distance |
| correlation | Correlation distance |
| cosine | Cosine distance |
| euclidean | Euclidean distance |
| jensenshannon | Jensen-Shannon distance |
| sqeuclidean | Squared Euclidean distance |
| dice | Dice dissimilarity |
| hamming | Hamming distance |
| jaccard | Jaccard-Needham dissimilarity |
| kulsinski | Kulsinski dissimilarity |
| rogerstanimoto | Rogers-Tanimoto dissimilarity |
| russellrao | Russell-Rao dissimilarity |
| sokalmichener | Sokal-Michener dissimilarity |
| sokalsneath | Sokal-Sneath dissimilarity |
| yule | Yule dissimilarity |
| mash | MASH distance |
| jsmash | MASH Jensen-Shannon distance |
| jaccarddistp | Jaccard-Needham dissimilarity Probability |
| euclidean_of_frequency | Euclidean distance of Frequency |
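To show how a set-based option such as the Mash distance relates to k-mer content, here is a small sketch using exact k-mer sets and the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index. This is only an illustration: KITSUNE's `mash` option may be implemented differently, real Mash estimates j from MinHash sketches rather than full sets, and capping the result at 1 is a choice made here.

```python
import math


def jaccard_index(a, b):
    """Jaccard index of two k-mer sets (1.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(a, b, k):
    """Mash distance from the exact Jaccard index of two k-mer sets."""
    j = jaccard_index(a, b)
    if j == 0:
        return 1.0  # no shared k-mers: maximal distance by convention
    return min(1.0, -math.log(2 * j / (1 + j)) / k)


a = {"AC", "CG", "GT"}
b = {"CG", "GT", "TA"}
print(mash_distance(a, b, 2))
```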
Kitsune also provides an optional distance transformation proposed by Fan et al.
Calculate a distance matrix
$ kitsune dmatrix -h
usage: kitsune (dmatrix) [-h] [--filenames [FILENAMES [FILENAMES ...]]]
[--fast] [--canonical] -k KMER [-i INPUT] [-o OUTPUT]
[-t THREAD] [--transformed]
[-d {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}]
[-f FORMAT]
Calculate a distance matrix
optional arguments:
-h, --help show this help message and exit
--filenames [FILENAMES [FILENAMES ...]]
Genome files in fasta format (default: None)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--canonical Jellyfish count only canonical mer (default: False)
-k KMER, --kmer KMER
-i INPUT, --input INPUT
List of genome files in txt (default: None)
-o OUTPUT, --output OUTPUT
Output filename (default: None)
-t THREAD, --thread THREAD
--transformed
-d {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}, --distance {braycurtis,canberra,jsmash,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,kulsinsk,matching,rogerstanimoto,russellrao,sokalmichener,sokalsneath,sqeuclidean,yule,mash,jaccarddistp}
-f FORMAT, --format FORMAT
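Conceptually, a distance matrix is obtained by applying one distance function to every pair of k-mer vectors. The sketch below shows this with a hand-written Bray-Curtis distance on count vectors; it is not KITSUNE's code, the function names are hypothetical, and the input vectors are made-up numbers.

```python
def bray_curtis(u, v):
    """Bray-Curtis distance: sum|u_i - v_i| / sum|u_i + v_i|."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_matrix(vectors, metric):
    """Build a symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(vectors[i], vectors[j])
    return matrix


# Made-up k-mer count vectors for three genomes.
vectors = [[1, 2, 3], [1, 2, 3], [3, 2, 1]]
for row in distance_matrix(vectors, bray_curtis):
    print(row)
```

Only the upper triangle is computed because the distances used here are symmetric, which halves the work for large genome sets.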
Kitsune provides a wrapper command to find the optimum k-mer length for a given set of genomes within a given k-mer interval.
$ kitsune kopt -h
usage: kitsune (kopt) [-h] [--acf-cutoff ACF_CUTOFF] [--canonical]
[--closely-related] [--cre-cutoff CRE_CUTOFF] [--fast]
--filenames FILENAMES [--hashsize HASHSIZE]
[--in-memory] [--k-min K_MIN] --k-max K_MAX
[--lower LOWER] [--nproc NPROC] [--output OUTPUT]
[--threads THREADS]
Optimal kmer size selection for a set of genomes using Average number of
Common Features (ACF), Cumulative Relative Entropy (CRE), and Observed Common
Features (OCF). Example: kitsune kopt --filenames genomeList.txt --k-min 4
--k-max 12 --canonical --fast
optional arguments:
-h, --help show this help message and exit
--acf-cutoff ACF_CUTOFF
Cutoff to use in selecting kmers whose ACFs are >=
(cutoff * max(ACF)) (default: 0.1)
--canonical Jellyfish count only canonical kmers (default: False)
--closely-related Use in case of closely related genomes (default:
False)
--cre-cutoff CRE_CUTOFF
Cutoff to use in selecting kmers whose CREs are <=
(cutoff * max(CRE)) (default: 0.1)
--fast Jellyfish one-pass calculation (faster) (default:
False)
--filenames FILENAMES
Path to the file with the list of genome files paths.
There should be at list 2 input genomes (default:
None)
--hashsize HASHSIZE Jellyfish initial hash size (default: 100M)
--in-memory Keep Jellyfish counts in memory (default: False)
--k-min K_MIN Minimum kmer size (default: 4)
--k-max K_MAX Maximum kmer size (default: None)
--lower LOWER Do not let Jellyfish output kmers with count < --lower
(default: 1)
--nproc NPROC Maximum number of CPUs to make it parallel (default:
1)
--output OUTPUT Path to the output file (default: None)
--threads THREADS Maximum number of threads for Jellyfish (default: 1)
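The two cutoff options can be read as simple filters over the per-k metric values: keep k-mer lengths whose CRE is at most `cre-cutoff * max(CRE)` and whose ACF is at least `acf-cutoff * max(ACF)`. The sketch below applies exactly those two rules to made-up numbers. How `kopt` combines the filters with OCF and picks the single recommended k is not described in the help text, so this sketch only returns the surviving candidates; the function name is hypothetical.

```python
def candidate_kmers(cre, acf, cre_cutoff=0.1, acf_cutoff=0.1):
    """Return the k values that pass both cutoff rules from the help text.

    `cre` and `acf` map a k-mer length to the metric value at that length.
    """
    cre_limit = cre_cutoff * max(cre.values())
    acf_limit = acf_cutoff * max(acf.values())
    return sorted(
        k for k in cre
        if k in acf and cre[k] <= cre_limit and acf[k] >= acf_limit
    )


# Made-up metric values for k = 4..8, for illustration only.
cre = {4: 10.0, 5: 4.0, 6: 0.9, 7: 0.2, 8: 0.05}
acf = {4: 256.0, 5: 900.0, 6: 500.0, 7: 100.0, 8: 20.0}
print(candidate_kmers(cre, acf))
```

With these numbers the CRE rule removes short k-mers and the ACF rule removes long ones, leaving a window of candidate lengths in between.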
Installation
Kitsune is developed in a Python 3 environment. We recommend Python >= 3.5.
Required packages: scipy >= 0.18.1, numpy >= 1.1.0, tqdm >= 4.32
Kitsune also requires Jellyfish for k-mer counting as an external software dependency. Thus, you need to install it before running the tool: https://github.com/gmarcais/Jellyfish
Install with pip
Install from source
General Example
Example of choosing distance option:
Find optimum k-mer from a given set of genomes
Example dataset
First download the example files. Download