Download the KOfam database from ftp://ftp.genome.jp/pub/db/kofam/ and decompress it. This gives you the profile HMMs in the profiles/ directory and the ko_list file.
Create config.yml in the same directory as the exec_annotation script. See below for details.
Execute exec_annotation.
$ ./exec_annotation -o result.txt query.fasta
Query file
A query file is a FASTA file with one or more amino acid sequences. You cannot use nucleotide sequences.
Each sequence must have a unique name. The name of a sequence is the string between the header symbol (“>”) and the first blank character (whitespace, tab, line break, etc.). Do not put whitespace right after “>”.
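The naming rules above can be checked before running the annotation. The following is a minimal sketch, not part of KofamScan: the sample sequences and the variable names are made up for illustration, and it only assumes a POSIX shell with awk and grep.

```shell
#!/bin/sh
# Hypothetical example input; replace it with your own query file.
workdir=$(mktemp -d)
query="$workdir/query.fasta"
cat > "$query" <<'EOF'
>gene1 first protein
MKTAYIAKQR
>gene2
MSLLTEVETY
>gene1 duplicate name
MKKLLPTAAA
EOF

# The name is the text between ">" and the first blank character.
# Print every name that occurs more than once (printed once per name).
dup_names=$(awk '/^>/ { name = substr($1, 2); if (++seen[name] == 2) print name }' "$query")

# Count headers where ">" is immediately followed by a space or tab.
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad_headers=$(grep -c '^>[[:space:]]' "$query" || true)

echo "duplicate names: ${dup_names:-none}"
echo "headers with whitespace after '>': $bad_headers"
```

For this sample the check reports gene1 as a duplicate, because the descriptions after the first blank are not part of the name.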
Profiles
Specify the path of the profile database you downloaded by giving the --profile option to the command or by writing it to config.yml. The path can be a directory, a .hmm file, or a .hal file.
If it is a directory, the .hmm files in the directory will be used.
If it is a .hmm file, only that file will be used.
If it is a .hal file, the files listed in the .hal file will be used. File paths in a .hal file are either absolute or relative to the directory of the .hal file. Lines starting with # are ignored.
KOfam has prokaryote.hal and eukaryote.hal in the profiles directory. They are lists of profiles excluding eukaryote- and prokaryote-specific KOs, respectively.
If you are interested in only a few KOs, you can create your own .hal file and use it as a database. This reduces computation time.
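A custom .hal file is just a text list of profile files. The sketch below writes one and resolves its entries the way the rules above describe (absolute paths as-is, relative paths against the directory of the .hal file). The file name my_kos.hal, the K numbers, and the one-file-per-KO naming (K00001.hmm) are assumptions for illustration; check the names in your own profiles directory.

```shell
#!/bin/sh
# Hypothetical layout: a profiles/ directory with one <K number>.hmm file per KO.
workdir=$(mktemp -d)
mkdir -p "$workdir/profiles"
# Empty placeholder files stand in for real profile HMMs in this sketch.
touch "$workdir/profiles/K00001.hmm" "$workdir/profiles/K00002.hmm"

# Relative paths are resolved against the directory that holds the .hal file.
hal="$workdir/profiles/my_kos.hal"
cat > "$hal" <<'EOF'
# KOs of interest (lines starting with # are ignored)
K00001.hmm
K00002.hmm
EOF

# Walk the list, skip comment lines, and confirm that each listed file exists.
hal_dir=$(dirname "$hal")
n_profiles=0
missing=0
while IFS= read -r line; do
  case "$line" in
    '#'*) continue ;;
  esac
  n_profiles=$((n_profiles + 1))
  case "$line" in
    /*) path="$line" ;;            # absolute path
    *)  path="$hal_dir/$line" ;;   # relative to the .hal file
  esac
  [ -f "$path" ] || missing=$((missing + 1))
done < "$hal"

echo "profiles listed: $n_profiles, missing: $missing"
```

The resulting file can then be passed as the database, for example with --profile=profiles/my_kos.hal.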
Options
-o FILE
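The results are written to FILE. It defaults to stdout.
-p, --profile=PROFILE
Use PROFILE as a profile database. See Profiles.
-k, --ko-list=FILE
Use FILE as a KO list.
--cpu=N
Set the number of hmmsearch processes started simultaneously to N. It defaults to 1 unless it is set in config.yml.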
-c FILE
Use FILE as a config file instead of config.yml in the same directory as exec_annotation.
--tmp-dir=DIR
Use DIR as the temporary directory where hmmsearch results are stored. It will be created if it does not exist. It defaults to ./tmp.
-E, --e-value=VALUE
Require the E-value to be smaller than or equal to VALUE. A hit that does not satisfy this is not marked with an asterisk in detail format and is not reported in other formats.
-T, --threshold-scale=VALUE
The score thresholds are multiplied by VALUE. For example, with the -T2 option, the thresholds become twice as strict.
-f, --format=FORMAT
Set the format of the output to FORMAT. The four formats below are available.
detail
Default format. Gene name, assigned K number, threshold of the KO, hmmsearch score and E-value, and the definition of KO are shown. In addition, an asterisk ‘*’ is added to the head of the line if the score is higher than the threshold.
detail-tsv
Tab-separated values version of the detail format.
mapper
Format that can be used as KEGG Mapper input. It includes a gene name and an assigned K number separated by a tab. Here, an assigned K number represents a hit with a score above the predefined threshold. Note that for some KOs, predefined score thresholds are not available because they are represented by very few sequences in KEGG GENES.
mapper-oneline
Similar to mapper, but when more than one KO is assigned to a gene, all assigned KOs are shown on one line separated by tabs.
--[no-]report-unannotated
With the --report-unannotated option, gene names are shown even when no KO is assigned (default when --format=mapper(-oneline)). With --no-report-unannotated, such genes are not shown at all (default when --format=detail).
--create-alignment
The normal hmmsearch output for each profile is stored in the temporary directory. In addition, the domain information and alignments in the outputs are rearranged per query.
Not compatible with --reannotation
-r, --reannotation
Skip hmmsearch and assume that the hmmsearch outputs are already in the temporary directory. This is useful for producing output in a different format or redoing the annotation with different thresholds.
Not compatible with --create-alignment
-h, --help
Show a brief help message.
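How -T and -E interact can be illustrated with a small calculation. This is only a sketch of the rule as worded above (score higher than the scaled threshold, E-value at most the cutoff), not KofamScan's actual code. The hits, thresholds, and the cutoff of 10 are invented numbers, and mark is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical hits: gene, K number, predefined threshold, score, E-value.
hits=$(printf '%s\n' \
  'geneA K00001 100.0 250.0 1e-70' \
  'geneB K00002 100.0 150.0 1e-40' \
  'geneC K00003 100.0 80.0 1e-20')

# mark SCALE MAX_EVALUE: print "*" for hits that pass both checks, "-" otherwise.
mark() {
  printf '%s\n' "$hits" | awk -v scale="$1" -v max_e="$2" '{
    threshold = $3 * scale                       # -T multiplies the threshold
    pass = ($4 + 0 > threshold) && ($5 + 0 <= max_e + 0)   # -E caps the E-value
    printf "%s %s %s\n", (pass ? "*" : "-"), $1, $2
  }'
}

default_marks=$(mark 1 10)   # thresholds unchanged
strict_marks=$(mark 2 10)    # as with -T2: every threshold is doubled

echo "scale 1:"; echo "$default_marks"
echo "scale 2:"; echo "$strict_marks"
```

With the unscaled threshold of 100, geneA and geneB pass. With a scale of 2 the threshold becomes 200 and only geneA passes.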
config.yml
The following variables can be set in config.yml.
profile
Path to KOfam profiles.
--profile option takes precedence.
ko_list
Path to the KO list of KOfam.
--ko-list option takes precedence.
cpu
Number of hmmsearch processes started simultaneously.
--cpu option takes precedence.
hmmsearch
Path to the hmmsearch executable. If not given, it is searched for in PATH.
parallel
Path to the parallel executable. If not given, it is searched for in PATH.
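A config.yml using the variables above might look like the file written by this sketch. All paths and the cpu value are placeholders, not defaults shipped with the tool, and the file is written to a temporary directory here only so that the example is safe to run.

```shell
#!/bin/sh
# All paths and the CPU count below are placeholders; adjust them to your setup.
workdir=$(mktemp -d)
config="$workdir/config.yml"
cat > "$config" <<'EOF'
# Path to the KOfam profiles (directory, .hmm file, or .hal file)
profile: /path/to/kofam/profiles
# Path to the KO list
ko_list: /path/to/kofam/ko_list
# Number of hmmsearch processes started simultaneously
cpu: 4
# Paths to the executables; omit these lines to have them searched for in PATH
hmmsearch: /usr/local/bin/hmmsearch
parallel: /usr/local/bin/parallel
EOF

# Count the variables that were set (lines of the form "name: value").
n_keys=$(grep -c '^[a-z_]*:' "$config")
echo "wrote $config with $n_keys variables"
```

Place the file next to exec_annotation as config.yml, or point to it with -c FILE. Values given on the command line (--profile, --ko-list, --cpu) take precedence over the file.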
KofamScan
KofamScan is a gene function annotation tool based on KEGG Orthology and hidden Markov models. You need the KOfam database to use this tool. An online version is available at https://www.genome.jp/tools/kofamkoala/ .
Requirements
Usage
License
This software is released under the MIT License.