EbolaSeq is a command-line tool for analyzing Ebola virus sequences. It automates the complete workflow, from downloading sequences to building phylogenetic trees: it retrieves Ebola virus sequences from NCBI GenBank, processes them according to user specifications, performs multiple sequence alignment, and generates phylogenetic trees.
Non-Interactive Mode - for HPC submissions or automated runs
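For scripted or HPC use, a non-interactive command line can be assembled programmatically. The sketch below is only an illustration: the executable name `ebolaseq` and the helper `build_command` are assumptions, not part of the documented interface. It uses only flags that appear in the Options list and prints the command instead of running it.

```python
import shlex

# Hypothetical executable name; check what is actually installed on your system.
EXECUTABLE = "ebolaseq"


def build_command(output_dir, virus, genome, host, metadata, completeness=None):
    """Assemble a non-interactive command line from the documented options."""
    args = [
        EXECUTABLE,
        "--output-dir", str(output_dir),
        "--virus", str(virus),
        "--genome", str(genome),
        "--host", str(host),
        "--metadata", str(metadata),
    ]
    # --completeness is only needed for partial genomes (--genome 2).
    if completeness is not None:
        args += ["--completeness", str(completeness)]
    return args


# Zaire + Sudan, complete genomes, human hosts, no metadata filter.
cmd = build_command("results", virus="1,2", genome=1, host=1, metadata=4)
print(shlex.join(cmd))
```

In a job script, the resulting list could be passed to `subprocess.run(cmd, check=True)`.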
Options
-o, --output-dir — Output directory for results
--virus — Virus / species
1 = Zaire ebolavirus
2 = Sudan ebolavirus
3 = Bundibugyo ebolavirus
4 = Tai Forest ebolavirus
5 = Reston ebolavirus
6 = Pan-Ebola: all 5 species
Comma-separated = multiple species (e.g. 1,2 for Zaire+Sudan; 1,2,3 for Zaire+Sudan+Bundibugyo)
--genome — Genome completeness
1 = Complete genomes only
2 = Partial genomes (requires --completeness)
3 = All genomes
--completeness — Required when --genome=2
Value from 1 to 100 (percentage)
--host — Host filter
1 = Human only
2 = Non-human only
3 = All hosts
--metadata — Metadata filter
1 = Location only
2 = Date only
3 = Both location and date
4 = None
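The numeric codes above are easy to mistype in batch scripts, so it can help to check them before submitting a job. The following is a minimal sketch of such a check, not EbolaSeq's own code. The function names are hypothetical, and it assumes that code 6 is used on its own and not mixed with other species codes.

```python
SPECIES = {
    1: "Zaire ebolavirus",
    2: "Sudan ebolavirus",
    3: "Bundibugyo ebolavirus",
    4: "Tai Forest ebolavirus",
    5: "Reston ebolavirus",
}


def parse_virus(value):
    """Turn a --virus value such as "1,2" or "6" into a list of species names."""
    codes = [int(part) for part in value.split(",")]
    if codes == [6]:
        # 6 = Pan-Ebola: all five species (assumed to be used on its own).
        return list(SPECIES.values())
    unknown = [code for code in codes if code not in SPECIES]
    if unknown:
        raise ValueError(f"unknown --virus code(s): {unknown}")
    # Preserve the order given by the user and drop duplicates.
    return [SPECIES[code] for code in dict.fromkeys(codes)]


def check_completeness(genome, completeness=None):
    """Enforce the rule that --genome 2 needs --completeness between 1 and 100."""
    if genome not in (1, 2, 3):
        raise ValueError("--genome must be 1, 2, or 3")
    if genome == 2 and (completeness is None or not 1 <= completeness <= 100):
        raise ValueError("--genome 2 requires --completeness between 1 and 100")
```

For example, `parse_virus("1,2")` yields the Zaire and Sudan species names, and `check_completeness(2)` raises a `ValueError` because no percentage was given.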
Optional
--beast — Required when --metadata is 2 or 3
1 = No
2 = Yes
Consensus FASTA per species — Path to a FASTA file
--c_z = Zaire
--c_s = Sudan
--c_r = Reston
--c_b = Bundibugyo
--c_t = Tai Forest
--alignment, -a — Alignment type
1 = Whole-genome alignment
2 = Protein (CDS) alignment
3 = No alignment
--proteins, -pr — For alignment 2 only; comma-separated
1 = L (RNA-dependent RNA polymerase)
2 = NP (nucleoprotein)
3 = VP35 (polymerase cofactor)
4 = VP40 (matrix protein)
5 = GP (spike glycoprotein)
6 = VP30 (minor nucleoprotein)
7 = VP24 (membrane-associated protein)
Or use names: L, NP, VP35, VP40, GP, VP30, VP24
--phylogeny, -p — Create phylogenetic tree from alignment
-m, --min-cds-fraction — For alignment 2: minimum fraction of reference CDS length to keep a sequence (default 0.5). E.g. 0.2 keeps more partial sequences, 0.8 is stricter.
-t, --threads — Threads for minimap2 and MAFFT (default 1). E.g. -t 64 on a 64-core node. 0 = use all CPUs.
--remove — Path to file listing sequence IDs/headers to exclude
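The optional settings above follow a few simple rules that can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's implementation: the helper names are hypothetical, protein names are matched case-insensitively, and "minimum fraction" is read as "greater than or equal to".

```python
import os

# Codes 1..7 map to these names, in the order listed for --proteins.
PROTEINS = ["L", "NP", "VP35", "VP40", "GP", "VP30", "VP24"]


def parse_proteins(value):
    """Accept numbers ("1,5") or names ("L,GP") and return protein names."""
    selected = []
    for token in value.split(","):
        token = token.strip()
        if token.isdigit() and 1 <= int(token) <= len(PROTEINS):
            name = PROTEINS[int(token) - 1]
        elif token.upper() in PROTEINS:
            name = token.upper()
        else:
            raise ValueError(f"unknown protein: {token!r}")
        if name not in selected:
            selected.append(name)
    return selected


def resolve_threads(threads):
    """Map --threads 0 to the number of available CPUs; pass other values through."""
    if threads < 0:
        raise ValueError("--threads must be 0 or a positive integer")
    return (os.cpu_count() or 1) if threads == 0 else threads


def keeps_sequence(cds_length, reference_length, min_fraction=0.5):
    """Apply the --min-cds-fraction rule to one sequence."""
    return cds_length >= min_fraction * reference_length


def beast_required(metadata):
    """--beast must be given when --metadata is 2 (date) or 3 (location and date)."""
    return metadata in (2, 3)
```

With a made-up reference CDS of 1000 nt, a 250 nt sequence passes at `-m 0.2` but is dropped at the default 0.5 and at 0.8.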
Output
Alignment/ — For whole-genome alignment: FASTA/, MAFFT/, Trimmed/. For protein alignment: pan/ (or a species name) containing one folder per protein (e.g. L/, NP/), each with cds_aligned.fasta.
Phylogeny/ — IQTree2 results (whole-genome: one tree; protein: one folder per protein).
summary_*.txt — Run summary and location counts.
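Downstream scripts often need to find the per-protein alignments. The sketch below builds the expected paths from the layout described above. The function name is hypothetical, and the exact folder used for a single-species run is an assumption, so check it against a real run.

```python
from pathlib import Path


def cds_alignment_paths(output_dir, group, proteins):
    """Return the expected cds_aligned.fasta path for each protein.

    `group` is "pan" or a species folder name, as described for Alignment/.
    """
    base = Path(output_dir) / "Alignment" / group
    return {protein: base / protein / "cds_aligned.fasta" for protein in proteins}


paths = cds_alignment_paths("results", "pan", ["L", "NP"])
for protein, path in paths.items():
    # Only reports whether the file exists; nothing is read or modified.
    print(protein, path.as_posix(), path.exists())
```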
Notes
Use --remove with a list of IDs to exclude cell-culture, lab-adapted, or other non-natural sequences.
For large Zaire trees, consider rooting with 1976 Yambuku outbreak sequences.
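To illustrate the first note, here is a minimal sketch of how an exclusion list could be applied to FASTA records. It is not EbolaSeq's own logic. It assumes the list holds one ID or full header per line, and that a record matches when either its full header or its first whitespace-delimited token is listed. The sequence names are placeholders.

```python
def read_id_list(text):
    """Parse an exclusion list: one ID or header per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def filter_fasta(fasta_text, excluded):
    """Drop FASTA records whose header or leading ID is in `excluded`."""
    kept_lines = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            seq_id = header.split()[0] if header else ""
            keep = header not in excluded and seq_id not in excluded
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")


# Placeholder records, not real accessions.
fasta = ">SEQ_A lab-adapted strain\nACGT\nACGT\n>SEQ_B field isolate\nTTGA\n"
excluded = read_id_list("SEQ_A\n\n")
print(filter_fasta(fasta, excluded))
```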
Dependencies
Python ≥ 3.9
Biopython ≥ 1.81
MAFFT, TrimAl, IQTree2
For protein alignment: minimap2, pal2nal
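Before a long run, it can be worth confirming that the external programs are on the `PATH`. The sketch below does this with the standard library. The executable names (`mafft`, `trimal`, `iqtree2`, `minimap2`, `pal2nal.pl`) are assumptions based on common packaging and may differ on your system.

```python
import shutil
import sys

# Assumed executable names; adjust them to match your installation.
CORE_TOOLS = ["mafft", "trimal", "iqtree2"]
PROTEIN_TOOLS = ["minimap2", "pal2nal.pl"]  # only needed for protein alignment


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info):
    """Check the documented minimum of Python 3.9."""
    return tuple(version_info[:2]) >= (3, 9)


print("Python >= 3.9:", python_ok())
print("Missing core tools:", missing_tools(CORE_TOOLS))
print("Missing protein-alignment tools:", missing_tools(PROTEIN_TOOLS))
```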
Citation
If you use EbolaSeq in your research, please cite:
Jansen, D., & Vercauteren, K. (2025). EbolaSeq: A Command-Line Tool for Downloading, Processing, and Analyzing Ebola Virus Sequences for Phylogenetic Analysis (v0.1.8). Zenodo. https://doi.org/10.5281/zenodo.14851686
License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0) - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
If you encounter any problems or have questions, please open an issue on GitHub.
EbolaSeq
Installation
Prerequisites
First, install conda if you haven’t already:
Then, ensure you have the required channels:
Option 1: Using Conda (Recommended)
Install EbolaSeq via Conda:
Option 2: From source
Usage
EbolaSeq can be run in two modes: interactive, and non-interactive (for HPC submissions or automated runs).