HYMET (Hybrid Metagenomic Tool)
Installation and Configuration
Follow the steps below to install and configure HYMET.
1. Installation with Conda (Recommended)
The easiest way to install HYMET is through Bioconda:
After installation, you will need to download the reference databases as described in the Reference Sketched Databases section.
2. Clone the Repository
Alternatively, you can clone the repository to your local environment:
3. Installation with Docker
If you prefer using Docker, follow these steps:
Build the Docker Image:
Run the Container:
Inside the Container:
4. Installation with Conda Environment File
If you cloned the repository, you can create a Conda environment from the included file:
Create the Conda Environment:
Activate the Environment:
Input Requirements
Input File Format
The tool expects input files in FASTA format (.fna or .fasta). Each file should contain metagenomic sequences with headers in the following format:
Input Directory
Place your input files in the directory specified by the $input_dir variable in the main.pl script.
For example, if your input directory contains the following files:
Each file (sample1.fna, sample2.fna, etc.) should follow the FASTA format described above.
Execution
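Before running the tool, it can help to confirm that every file in the input directory parses as FASTA. The sketch below is a generic check and is not part of HYMET: it only verifies that each .fna or .fasta file starts with a ">" header and that every record has at least one sequence line. It does not check any HYMET-specific header layout, and the function names check_fasta and check_input_dir are hypothetical.

```python
from pathlib import Path

FASTA_SUFFIXES = {".fna", ".fasta"}  # extensions named in this README


def check_fasta(path):
    """Return a list of problems found in one FASTA file (empty list if none)."""
    problems = []
    header = None          # header of the record currently being read
    has_sequence = False   # whether that record has any sequence lines yet
    records = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line:
                continue  # blank lines are ignored
            if line.startswith(">"):
                # A new header closes the previous record; flag it if empty.
                if header is not None and not has_sequence:
                    problems.append(f"record '{header}' has no sequence")
                header = line[1:].strip()
                has_sequence = False
                records += 1
                if not header:
                    problems.append(f"line {lineno}: empty header")
            elif header is None:
                problems.append(
                    f"line {lineno}: sequence data before first '>' header"
                )
                break
            else:
                has_sequence = True
    if header is not None and not has_sequence:
        problems.append(f"record '{header}' has no sequence")
    if records == 0 and not problems:
        problems.append("no FASTA records found")
    return problems


def check_input_dir(input_dir):
    """Map each .fna/.fasta file name in input_dir to its list of problems."""
    report = {}
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            report[path.name] = check_fasta(path)
    return report
```

Calling check_input_dir on the directory that $input_dir points to returns a dictionary keyed by file name; an empty list for a file means no problems were detected by this check.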
Ensure all scripts have execute permissions, for example with chmod +x.
After installation and configuration, first run the configuration script to download and prepare the taxonomy files and to define the main paths:
./config.pl
Then run the main tool to perform taxonomic identification:
./main.pl
If HYMET was installed via Conda, use the following commands instead:
hymet-config
hymet
Reference Sketched Databases
The databases required to run the tool are available for download on Google Drive:
Steps to Use the Databases
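Download the Files: download the .gz files.
Place the Files in the data/ Directory: move the downloaded files into the data/ directory of the project.
Unzip the Files: decompress the .gz files to obtain sketch1.msh, sketch2.msh, and sketch3.msh in the data/ directory.
Verify the Files: confirm that the data/ directory contains sketch1.msh, sketch2.msh, and sketch3.msh.
Project Structure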
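As a quick check of the layout this README relies on, the sketch below decompresses any .gz files in data/ and then reports which expected paths are missing under a project root. The path list is assembled only from names mentioned in this README (main.pl and the three sketch files) and is an assumption, not the full project tree; placing main.pl at the project root is likewise assumed. The function names are hypothetical.

```python
import gzip
import shutil
from pathlib import Path

# Paths taken from names mentioned in this README; an assumption, not the
# complete project tree.
EXPECTED_PATHS = [
    "main.pl",
    "data/sketch1.msh",
    "data/sketch2.msh",
    "data/sketch3.msh",
]


def decompress_gz_files(data_dir):
    """Decompress every *.gz file in data_dir next to the original.

    Returns the list of files written. The .gz files are left in place.
    """
    written = []
    for gz_path in sorted(Path(data_dir).glob("*.gz")):
        target = gz_path.with_suffix("")  # sketch1.msh.gz -> sketch1.msh
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written


def missing_paths(project_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Run from the project root, decompress_gz_files("data") followed by missing_paths(".") should return an empty list once all three sketches are in place.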
Example Output
The tool generates a classified_sequences.tsv file in the output/ directory with the following columns:
Test Dataset
This folder includes scripts to install and prepare all necessary data to replicate the work using our dataset.
Prerequisites:
Before running the scripts in this folder, users need to download the assembly files (assembly_files.txt) for each domain from the NCBI FTP site.
Scripts:
create_database.py: Downloads 10% of the content from each downloaded assembly file and organizes the datasets by domain.
extractNC.py: Maps each Genome Collection File (GCF) to its sequence identifiers. It generates a CSV file containing this mapping, with one column for the GCF and another for the sequence identifiers (such as NC, NZ, etc.) present in that GCF.
extractTaxonomy.py: Creates a CSV file containing each GCF and its taxonomy, among other information.
Additional scripts modify the data format and organization, including:
Implementing mutations
Converting formats (e.g., FASTA to FASTQ)
Formatting into paired-end reads
GCFtocombinedfasta.py: Combines all GCFs from each domain into a single FASTA file, separating sequences by identifier. The resulting file is used as input for most of the tools.
Support
For questions or issues, please open an issue in the repository.