HSDFinder - an integrated tool to predict highly similar duplicates (HSDs) in eukaryotic genomes.
HSDFinder aims to become a useful platform for identifying and analyzing HSDs in eukaryotic genomes, deepening our insight into the gene duplication mechanisms that drive genome adaptation.
What’s new
May. 9th, 2021: The peer-reviewed article “Protocol for HSDFinder: identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes” was accepted for publication.
Jan. 16th, 2021: HSDFinder and HSDatabase were cited in the Cell Press journal iScience, in the article “Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241” DOI:https://doi.org/10.1016/j.isci.2021.102084
Aug. 5th, 2020: Updated to version 1.5.
The predicted HSDs are displayed in a spreadsheet, which offers an alternative way to browse the results in graphical and tabular form. The software presented here provides a primary selection of HSDs; manual curation should follow to filter out partial genes and pseudogenes.
Aug. 1st, 2020: Updated to version 1.0.
The web server is able to analyze the unannotated genome sequences by integrating the results from InterProScan (e.g., Pfam) and KEGG.
1. INSTALLATION
Conda install
To install this package, run:
conda install bioconda::hsdfinder
Then test the installation by running:
hsdfinder -h
OR
Download the package and run
tar -xzvf HSDFinder_v1.0.tar.gz
Make sure the three Python scripts (HSDFinder.py, operation.py, pfam.py) are in the same directory.
HSDFinder is developed to run on Linux. No versions are planned for Windows or Apple (macOS) operating systems. The minimum requirement is a machine with 2 cores and 4 GB of RAM, which allows a small number of sequences to be analysed at a time. With more resources, the analysis runs faster and more sequences can be analysed at a time.
Software requirements:
64-bit Linux
Python 3
2. INPUT
Input file 1 is the all-against-all BLAST result generated from the protein sequences in FASTA format.
Note: If the parameter -max_target_seqs was used in the blastp command to limit the number of BLAST hits, HSDFinder may fail with an error caused by missing gene length information. In that case, follow the FAQ entry below: How to deal with Error: require length of gene?
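Before running HSDFinder, it can help to check whether every gene in the BLAST file has a hit against itself, since the missing gene length error arises when that self-hit is absent. The following is a minimal sketch, not part of HSDFinder: the function name and the example rows are hypothetical, and it only assumes the standard 12-column BLAST tabular layout with the query in column 1 and the subject in column 2.

```python
from typing import Iterable, Set


def queries_missing_self_hit(blast_lines: Iterable[str]) -> Set[str]:
    """Return query IDs that never occur as their own subject."""
    queries: Set[str] = set()
    self_hits: Set[str] = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:  # skip blank or truncated lines
            continue
        query, subject = fields[0], fields[1]
        queries.add(query)
        if query == subject:
            self_hits.add(query)
    return queries - self_hits


# Synthetic rows in BLAST -outfmt 6 layout (gene names and values are made up).
example = [
    "geneA\tgeneA\t100.0\t744\t0\t0\t1\t744\t1\t744\t0.0\t1500",
    "geneA\tgeneB\t95.2\t740\t35\t0\t1\t740\t1\t740\t0.0\t1400",
    "geneB\tgeneA\t95.2\t740\t35\t0\t1\t740\t1\t740\t0.0\t1400",
]
print(sorted(queries_missing_self_hit(example)))  # ['geneB']
```

To check a real file, pass an open file handle instead of the list. A non-empty result suggests the BLAST file needs the fix described in the FAQ.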
See the argument details by running python HSDFinder.py -h or python3 HSDFinder.py -h
Options:
-i or --input_file the BLAST output file
-p or --percentage_identity identity percent e.g. For 90%, input 90.0
-l or --length length e.g. 10
-f or --file the InterProScan output file
-t or --type type e.g. Pfam
-o or --output_file output file name
Run examples:
python3 HSDFinder.py -i '/.../.../##.BLAST.tabular' -p 90.0 -l 10 -f '/.../.../##.INTERPROSCAN.tsv' -t Pfam -o ##.species.txt
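To make the expected value types of the options explicit, here is an illustrative sketch that mirrors them with argparse. It is not HSDFinder's own argument handling; the defaults of 90.0 and 10 are taken from the default thresholds described in the FAQ, and the file names in the example call are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative mirror of the HSDFinder options (not the tool's own code)."""
    parser = argparse.ArgumentParser(description="Sketch of the HSDFinder options")
    parser.add_argument("-i", "--input_file", required=True,
                        help="BLAST tabular output file")
    parser.add_argument("-p", "--percentage_identity", type=float, default=90.0,
                        help="minimum pairwise identity, e.g. 90.0")
    parser.add_argument("-l", "--length", type=int, default=10,
                        help="maximum length difference in amino acids, e.g. 10")
    parser.add_argument("-f", "--file", required=True,
                        help="InterProScan output file")
    parser.add_argument("-t", "--type", default="Pfam",
                        help="protein signature database, e.g. Pfam")
    parser.add_argument("-o", "--output_file", required=True,
                        help="output file name")
    return parser


# Hypothetical file names, used only to show how the values are interpreted.
args = build_parser().parse_args([
    "-i", "species.BLAST.tabular", "-p", "90.0", "-l", "10",
    "-f", "species.INTERPROSCAN.tsv", "-t", "Pfam", "-o", "species.txt",
])
print(args.percentage_identity, args.length, args.type)  # 90.0 10 Pfam
```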
4. OUTPUT
HSDFinder generates one output file: an 8-column spreadsheet integrating the HSD identifier, the gene copy number, and the Pfam domain information.
Example of the 8-column spreadsheet:
g735.t1 g735.t1; g741.t1; g8053.t1 744; 744; 747 Pfam PF11999; PF11999; PF11999 Protein of unknown function (DUF3494); Protein of unknown function (DUF3494); Protein of unknown function (DUF3494) IPR021884; IPR021884; IPR021884 Ice-binding protein-like ; Ice-binding protein-like ; Ice-binding protein-like
Column explanation:
Highly Similar Duplicates (HSDs) identifiers: By default, the first gene model of the duplicate gene copies is used as the HSD identifier (e.g. g735.t1).
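For downstream scripting, one row of the 8-column output can be split into its parts as sketched below. The column names are assumptions inferred from the example row, the file is assumed to be tab-delimited with "; " between the gene copies, and descriptions that themselves contain a semicolon would need extra care.

```python
from typing import Dict, List

# Column names are assumptions inferred from the example row, not official names.
COLUMNS = [
    "hsd_id", "gene_copies", "lengths", "database",
    "domain_ids", "domain_descriptions", "interpro_ids", "interpro_descriptions",
]


def parse_hsd_line(line: str) -> Dict[str, List[str]]:
    """Split one tab-delimited output row into per-column lists of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return {
        name: [item.strip() for item in field.split(";")]
        for name, field in zip(COLUMNS, fields)
    }


# The example row from this section, with tabs between the eight columns.
example = "\t".join([
    "g735.t1",
    "g735.t1; g741.t1; g8053.t1",
    "744; 744; 747",
    "Pfam",
    "PF11999; PF11999; PF11999",
    "Protein of unknown function (DUF3494); Protein of unknown function (DUF3494); "
    "Protein of unknown function (DUF3494)",
    "IPR021884; IPR021884; IPR021884",
    "Ice-binding protein-like ; Ice-binding protein-like ; Ice-binding protein-like",
])
record = parse_hsd_line(example)
print(record["hsd_id"][0], len(record["gene_copies"]), record["lengths"])
```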
6. Common questions (FAQ):
How to prepare the input files?
Before running HSDFinder, two tab-delimited text files need to be prepared as inputs (Figure S1A). A protein BLAST search of the genes against themselves (suggested parameters: E-value cut-off ≤ 10^-5, BLASTP -outfmt 6) yields the first input file. The BLAST result of the amino acid sequences must be a 12-column tab-delimited text file that includes the key information for each hit, from the query name to the percentage identity and so on (see more details in the HSDFinder tutorial on GitHub). The second tab-delimited text file comes from InterProScan, which scans the genes against different protein signature databases, such as Pfam. By default, the InterProScan output is a tab-delimited text file.
How to run HSDFinder?
The two tab-delimited text files can then be uploaded to HSDFinder with some personalized options. By default, HSDFinder keeps highly similar duplicates (HSDs) with near-identical protein lengths (within 10 amino acids of each other) and ≥ 90% pairwise amino acid identity. Such a relatively strict cut-off might exclude other genuine duplicates. In our experience with green algae genomes, however, these thresholds capture the majority of detected highly similar duplicates. Because duplicates vary among eukaryotic organisms, users can always relax the thresholds for their datasets (e.g., pairwise amino acid identity anywhere from 30% to 100% and length variance anywhere from 0 to 100 amino acids), although relaxing them risks increasing the number of false positives. The HSDFinder output is an 8-column tab-delimited text file containing information such as the HSD identifier, gene copy number, and Pfam domain.
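The two thresholds can be illustrated with a short sketch that keeps only the gene pairs passing both filters. This is an approximation under stated assumptions, not HSDFinder's implementation: gene lengths are taken from the alignment length of each gene's self-hit, and the function names and example rows are hypothetical.

```python
from typing import Iterable, List, Tuple


def candidate_pairs(blast_lines: Iterable[str],
                    min_identity: float = 90.0,
                    max_length_diff: int = 10) -> List[Tuple[str, str]]:
    """Return (query, subject) pairs that pass the identity and length filters."""
    rows = [line.rstrip("\n").split("\t") for line in blast_lines if line.strip()]
    # Assumption: a gene's length is the alignment length (column 4) of its self-hit.
    lengths = {r[0]: int(r[3]) for r in rows if r[0] == r[1]}
    pairs = []
    for r in rows:
        query, subject, identity = r[0], r[1], float(r[2])
        if query == subject or query not in lengths or subject not in lengths:
            continue
        if (identity >= min_identity
                and abs(lengths[query] - lengths[subject]) <= max_length_diff):
            pairs.append((query, subject))
    return pairs


def row(query: str, subject: str, identity: float, aln_len: int) -> str:
    """Build a synthetic 12-column row; columns 5 to 12 are placeholders."""
    return "\t".join([query, subject, str(identity), str(aln_len),
                      "0", "0", "1", str(aln_len), "1", str(aln_len), "0.0", "0"])


# Made-up genes: A, B and C have similar lengths, D is much longer.
example = [
    row("geneA", "geneA", 100.0, 744), row("geneB", "geneB", 100.0, 744),
    row("geneC", "geneC", 100.0, 747), row("geneD", "geneD", 100.0, 900),
    row("geneA", "geneB", 98.0, 744),  # passes both filters
    row("geneA", "geneC", 91.5, 740),  # passes both filters
    row("geneA", "geneD", 93.0, 700),  # fails: lengths differ by 156
    row("geneB", "geneC", 85.0, 740),  # fails: identity below 90
]
print(candidate_pairs(example))  # [('geneA', 'geneB'), ('geneA', 'geneC')]
```

Relaxing the thresholds, for example `candidate_pairs(example, min_identity=80.0)`, admits the geneB and geneC pair as well, which shows how looser cut-offs enlarge the candidate list.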
How to visualize the HSDs across species?
For comparative analyses of HSDs across species, we developed an online heatmap plotting option that visualizes the HSD results in different KEGG pathway categories. To use it, first generate the HSD results for each species of interest by following the previous steps. Plotting the heatmap requires at least two species and at least two input files. Examples are provided to illustrate the appropriate input files (see the hands-on protocol on creating a heatmap with example data). The first input file is the HSDFinder output for the species of interest; the second is retrieved from the KEGG database and documents the correspondence between KEGG Orthology (KO) accessions and gene model identifiers (the detailed steps are given in the HSDFinder tutorial on GitHub). Once the input files have been submitted for each species, the HSDs are displayed in a heatmap (the color of each cell in the matrix reflects the number of HSDs across species) and in a tab-delimited text file, organized under different KEGG functional categories, such as carbohydrate metabolism, energy metabolism, and translation.
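The matrix behind such a heatmap can be pictured as HSD counts per KEGG category and species. The sketch below is only an illustration of that idea: how the online tool actually counts is not documented here, the KO-to-category lookup is a hypothetical stand-in for the KEGG hierarchy, and all identifiers are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List


def hsd_counts_by_category(hsds: Iterable[List[str]],
                           gene_to_ko: Dict[str, str],
                           ko_to_category: Dict[str, str]) -> Counter:
    """Count each HSD once for every category reached by any of its gene copies."""
    counts: Counter = Counter()
    for gene_copies in hsds:
        categories = set()
        for gene in gene_copies:
            ko = gene_to_ko.get(gene)
            if ko in ko_to_category:
                categories.add(ko_to_category[ko])
        counts.update(categories)
    return counts


# Hypothetical identifiers; real KO accessions and categories come from KEGG.
gene_to_ko = {"a1": "KO_1", "a2": "KO_1", "a3": "KO_2", "b1": "KO_2"}
ko_to_category = {"KO_1": "Carbohydrate metabolism", "KO_2": "Translation"}
species_hsds = {
    "speciesA": [["a1", "a2"], ["a3", "a4"]],
    "speciesB": [["b1", "b2"]],
}
matrix = {
    species: hsd_counts_by_category(hsds, gene_to_ko, ko_to_category)
    for species, hsds in species_hsds.items()
}
for category in sorted(set(ko_to_category.values())):
    # One heatmap row: the HSD count of this category in each species.
    print(category, [matrix[species][category] for species in species_hsds])
```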
How to deal with Error: SyntaxError: Non-ASCII character ‘\xe2’ in file HSDFinder.py?
SyntaxError: Non-ASCII character ‘\xe2’ in file HSDFinder.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
This can be solved by using python3 to run HSDFinder.py.
How to deal with Error: require length of gene?
A common error message looks like this:
Traceback (most recent call last):
  File "HSDFinder.py", line 72, in <module>
    main(sys.argv[1:])
  File "HSDFinder.py", line 67, in main
    result = operation.pfam_file_fun(input_file, percentage, length, pfam, p_type, output_file)
  File "/home/.../operation.py", line 23, in pfam_file_fun
    output = pfam.step(lines, p_filter, s_length)
  File "/home/.../pfam.py", line 39, in step
    lengtha = int(genes[items[0]])
KeyError: 'XP_015611539.1'
In some situations, a running error occurs because the gene length information is missing. This is usually because the BLAST search limits the maximum number of targets by default, whereas some species are rich in gene duplicates. In such cases, HSDFinder may not find the hit of a gene against itself (100% identity, with an aligned length equal to the gene length).
It can be solved easily by 1) running a UNIX command on your original amino acid sequences to create a gene length file; 2) pasting the gene length file into the BLAST result file; and 3) rerunning HSDFinder with the new merged BLAST tabular file (“new.merged.BLAST.tabular.file”) and the InterProScan result file.
For a genome with amino acid sequences (‘/…/…/protein.fa’), create the amino acid length file from those sequences, and make sure the gene identifiers are consistent with the ones used in the input files.
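The three steps can also be sketched in Python rather than with a UNIX command. This is a sketch under assumptions, not the original command from the project: it assumes the gene identifier is the first whitespace-delimited token of each FASTA header, and it writes each gene length as a synthetic self-hit row in which only the first four columns carry information (columns 5 to 12 are placeholders, and whether HSDFinder reads them is not confirmed here). All function names are hypothetical.

```python
from typing import Dict, Iterable, List


def fasta_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence ID (first token of the header) to its length."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else None
            if current is not None:
                lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def self_hit_rows(lengths: Dict[str, int]) -> List[str]:
    """One synthetic self-hit row per gene: ID, ID, 100.0, length, placeholders."""
    return [
        "\t".join([gene, gene, "100.0", str(n),
                   "0", "0", "1", str(n), "1", str(n), "0.0", "0"])
        for gene, n in lengths.items()
    ]


def merge_with_blast(lengths: Dict[str, int], blast_lines: Iterable[str]) -> List[str]:
    """Put the self-hit rows in front of the original BLAST rows."""
    return self_hit_rows(lengths) + [l.rstrip("\n") for l in blast_lines if l.strip()]


# Made-up sequences; with real data, read the lines from protein.fa instead.
fasta = [">geneA some description", "MKTLLA", "GG", ">geneB", "MA"]
blast = ["geneA\tgeneB\t95.0\t2\t0\t0\t1\t2\t1\t2\t0.001\t10\n"]
merged = merge_with_blast(fasta_lengths(fasta), blast)
print("\n".join(merged))
```

Writing `merged` to “new.merged.BLAST.tabular.file”, one row per line, gives the file to pass to HSDFinder with -i in step 3.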
HSDFinder Manual
2. INPUT
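Input file 1 is the all-against-all BLAST result generated from the protein sequences in FASTA format.
The 12 columns of input file 1 follow the standard BLAST tabular layout (-outfmt 6):
query ID, subject ID, percentage identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score
Input file 2 is the InterProScan result generated from the same protein sequences in FASTA format.
The 13 columns of input file 2 follow the InterProScan tab-delimited layout:
protein accession, sequence MD5 digest, sequence length, analysis (e.g. Pfam), signature accession, signature description, start location, stop location, score, status, date, InterPro accession, InterPro description
Note: If a value is missing in a column, for example because the match has no InterPro annotation, a ‘-’ is displayed.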
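As a sketch of how input file 2 can be read, the function below collects the signatures of one analysis (Pfam by default) per gene and turns the ‘-’ placeholder into a missing value. It assumes the standard InterProScan column order, pads rows that stop before the InterPro columns, and uses made-up rows; the function name is hypothetical and this is not HSDFinder's own parser.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (signature accession, signature description, InterPro accession or None)
Annotation = Tuple[str, str, Optional[str]]


def annotations_by_gene(lines: Iterable[str],
                        analysis: str = "Pfam") -> Dict[str, List[Annotation]]:
    """Collect the signatures of one analysis for every protein accession."""
    result: Dict[str, List[Annotation]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:  # skip blank or truncated lines
            continue
        # Some rows may end before the InterPro columns; pad them with '-'.
        fields += ["-"] * (13 - len(fields))
        if fields[3] != analysis:
            continue
        interpro = fields[11] if fields[11] != "-" else None
        result.setdefault(fields[0], []).append((fields[4], fields[5], interpro))
    return result


# Synthetic rows: every identifier and value here is a placeholder.
example = [
    "geneA\tmd5_placeholder\t744\tPfam\tPF_EXAMPLE\texample domain\t10\t200"
    "\t1.0E-20\tT\t01-01-2021\tIPR_EXAMPLE\texample entry",
    "geneA\tmd5_placeholder\t744\tGene3D\tG3D_EXAMPLE\tother signature\t10\t200"
    "\t1.0E-20\tT\t01-01-2021\t-\t-",
    "geneB\tmd5_placeholder\t120\tPfam\tPF_OTHER\tother domain\t5\t90"
    "\t1.0E-5\tT\t01-01-2021\t-\t-",
]
print(annotations_by_gene(example))
```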
3. Running HSDFinder
Run HSDFinder with python3 HSDFinder.py (or with python HSDFinder.py if python points to Python 3 in your environment).
5. Creating Heatmap
1) INPUT
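The heatmap needs two inputs for each species: the HSDFinder output, and a 2-column tab-delimited file that pairs each gene model identifier with its KEGG Orthology (KO) accession.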
2) RUNNING
3) OUTPUT (.tsv and .eps)
The 8-column tab-delimited file (.tsv) lists the HSDs of the different species categorized under different KEGG functional categories.
The heatmap file (.eps) visualizes the HSDs across species; the example shows seven green algae.
The color of each cell in the matrix reflects the number of HSDs across species, and the left-hand side lists the KEGG functional categories, such as carbohydrate metabolism, energy metabolism, and translation.
Help
The distribution version of HSDFinder is also available for download. Current version: v1 (5 August 2020).
Reference
Xi Zhang*, Yining Hu, David Roy Smith*. (2021). HSDFinder: a BLAST-based strategy to search for highly similar duplicated genes in eukaryotic genomes. Front. Bioinform. DOI:https://doi.org/10.3389/fbinf.2021.803176
Xi Zhang, Yining Hu, David Roy Smith. (2021). Protocol for HSDFinder: Identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes. DOI:https://doi.org/10.1016/j.xpro.2021.100619
X. Zhang, et al., D. Smith. (2021). Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241. iScience. DOI:https://doi.org/10.1016/j.isci.2021.102084