GFF3-to-DDBJ
The Japanese version is available here.
Overview
GFF3-to-DDBJ converts GFF3 and FASTA files into the DDBJ annotation format required for submission. It is the DDBJ-specific equivalent of tools like table2asn (NCBI) or EMBLmyGFF3 (ENA).
View the tests/golden directory for example output (.ann files).
Accuracy and Validation
Since “perfect” GFF3-to-DDBJ conversion is not formally defined, this tool uses RefSeq GFF3-GenBank correspondence as a gold standard. We validate output by:
- Comparing gff3-to-ddbj results against GenBank-sourced annotations via an internal genbank-to-ddbj tool.
Installation
Via Bioconda
Via PyPI
Via GitHub (Nightly)
Usage
Argument Details
- --locus_tag_prefix: The prefix assigned by BioSample.
- --transl_table: Genetic code index (e.g., 11 for Bacteria). See DDBJ Genetic Codes.
Under the Hood
GFF3-to-DDBJ processes your data through the following pipeline:
1. Data Preparation
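One part of this step, turning runs of N in the sequence into assembly_gap features, can be illustrated with a short Python sketch. This is not the tool's own code: the function name `find_gap_intervals` and the `min_length` parameter are hypothetical, and the real gap-length threshold is not stated here.

```python
import re


def find_gap_intervals(sequence, min_length=1):
    """Return 1-based inclusive (start, end) intervals for each run of N/n.

    min_length is a hypothetical knob; the actual threshold used by the
    tool is not documented in this README.
    """
    intervals = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            # re uses 0-based half-open spans; annotation coordinates are
            # 1-based and inclusive, so only the start needs shifting.
            intervals.append((match.start() + 1, match.end()))
    return intervals


print(find_gap_intervals("ACGTNNNNACGTnnAC"))  # [(5, 8), (13, 14)]
```

Each returned interval would correspond to the location of one assembly_gap feature.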
- The FASTA file is compressed with bgzip (e.g., creating myfile_bgzip.fa.gz). This enables indexing and reduces memory usage; the resulting file remains compatible with standard gzip tools.
- The tool detects N runs and automatically generates assembly_gap features.
- If Is_circular=true is present, the tool inserts a TOPOLOGY feature and manages origin-spanning features.
2. Feature & Qualifier Mapping
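The kind of mapping performed in this step (a GFF3 type becoming a DDBJ feature, an ID attribute becoming a /note, and /transl_table being attached to each CDS) can be sketched as follows. The names `TYPE_MAP` and `map_feature` are hypothetical, and passing other attributes through unchanged is an assumption made only to keep the example short.

```python
# Hypothetical type map; only the transcript -> misc_RNA rule comes from the README.
TYPE_MAP = {"transcript": "misc_RNA"}


def map_feature(gff3_type, attributes, transl_table=1):
    """Map one GFF3 type and its attributes to a (feature, qualifiers) pair."""
    feature_key = TYPE_MAP.get(gff3_type, gff3_type)
    qualifiers = {}
    for key, values in attributes.items():
        if key == "ID":
            # ID=foobar becomes /note="ID:foobar"
            qualifiers.setdefault("note", []).extend(f"ID:{value}" for value in values)
        else:
            # Assumption: other attributes pass through unchanged in this sketch.
            qualifiers.setdefault(key, []).extend(values)
    if feature_key == "CDS":
        # Every CDS receives /transl_table from the user-provided index.
        qualifiers["transl_table"] = [str(transl_table)]
    return feature_key, qualifiers


print(map_feature("transcript", {"ID": ["foobar"]}))
print(map_feature("CDS", {"ID": ["cds1"]}, transl_table=11))
```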
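- GFF3 types are translated to DDBJ features. For example, transcript (SO:0000673) is translated to a misc_RNA feature.
- GFF3 attributes are translated to DDBJ qualifiers. For example, ID=foobar becomes /note="ID:foobar".
- The tool adds a /transl_table qualifier to every CDS feature based on the user-provided index (default: 1).
3. Coordinate Processing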
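As a rough illustration of the location syntax produced in this step, the sketch below builds an INSDC-style location string from a list of segments. The function `format_location` and its parameters are hypothetical; `partial_low` and `partial_high` refer to the lowest and highest coordinate, not to the biological 5' and 3' ends.

```python
def format_location(segments, strand="+", partial_low=False, partial_high=False):
    """Build a location string such as join(1..100,200..300).

    segments: list of 1-based inclusive (start, end) tuples.
    Single-base segments are not special-cased in this sketch.
    """
    ordered = sorted(segments)
    last = len(ordered) - 1
    pieces = []
    for index, (start, end) in enumerate(ordered):
        # "<" marks an open low end, ">" an open high end.
        left = f"<{start}" if partial_low and index == 0 else str(start)
        right = f">{end}" if partial_high and index == last else str(end)
        pieces.append(f"{left}..{right}")
    location = pieces[0] if len(pieces) == 1 else "join(" + ",".join(pieces) + ")"
    # Minus-strand features wrap the whole location in complement().
    return f"complement({location})" if strand == "-" else location


print(format_location([(200, 300), (1, 100)]))  # join(1..100,200..300)
print(format_location([(1, 100)], strand="-"))  # complement(1..100)
```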
- Discontinuous features are written in join() notation. This applies to CDS, exon, mat_peptide, V_segment, C_region, D-loop, and misc_feature.
- The joined location of exons is assigned to the parent RNA’s location, and individual exon entries are discarded.
- Exons are not joined if their direct parent is a gene.
- Partial markers (< or >) are added to CDS locations if start or stop codons are missing. (See: Offset of the frame at translation initiation by codon_start).
4. DDBJ Compliance Logic (Product & Gene)
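The /gene handling in this step (one value stays in /gene, the rest move to /gene_synonym, and both are copied to child features) can be sketched as below. The function names are hypothetical. Keeping the first value as /gene and not overwriting qualifiers a child already has are assumptions of this sketch.

```python
def split_gene_values(values):
    """Keep one /gene value and move the remainder to /gene_synonym."""
    qualifiers = {}
    if values:
        # Assumption: the first listed value is the one retained as /gene.
        qualifiers["gene"] = [values[0]]
    if len(values) > 1:
        qualifiers["gene_synonym"] = list(values[1:])
    return qualifiers


def propagate_to_children(gene_qualifiers, children):
    """Copy /gene and /gene_synonym from a parent gene to child qualifier dicts."""
    for child in children:
        for key in ("gene", "gene_synonym"):
            if key in gene_qualifiers:
                # Assumption: a value already present on the child is kept.
                child.setdefault(key, list(gene_qualifiers[key]))
    return children


parent = split_gene_values(["abc1", "abcA", "xyz"])
print(parent)
print(propagate_to_children(parent, [{"product": ["hypothetical protein"]}]))
```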
- Each CDS is restricted to a single /product.
- Each /gene qualifier has a single value; additional values move to /gene_synonym. (Reference: Definition of Qualifier key: /gene).
- The /gene and /gene_synonym qualifiers are copied from parent gene features to all children (e.g., mRNA, CDS).
5. Metadata & Filtering
- The tool inserts source information and global qualifiers from the metadata file. See “Metadata Configuration” in “Customization” below.
- Features and qualifiers are then filtered; the gene feature is discarded by default in this process.
6. Final Formatting
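The ordering applied in this step (start position, then feature priority with source and TOPOLOGY first, then end position) maps naturally onto a tuple sort key. The sketch below assumes features are plain dictionaries. The `PRIORITY` values and the dictionary layout are hypothetical, not taken from the tool.

```python
# Hypothetical priorities: lower sorts first. Only the fact that source and
# TOPOLOGY come before other features is taken from the README.
PRIORITY = {"source": 0, "TOPOLOGY": 0}
DEFAULT_PRIORITY = 1


def sort_key(feature):
    """Order by start position, then feature priority, then end position."""
    return (
        feature["start"],
        PRIORITY.get(feature["type"], DEFAULT_PRIORITY),
        feature["end"],
    )


features = [
    {"type": "tRNA", "start": 50, "end": 120},
    {"type": "mRNA", "start": 1, "end": 100},
    {"type": "CDS", "start": 1, "end": 90},
    {"type": "source", "start": 1, "end": 1000},
]
ordered = sorted(features, key=sort_key)
print([f["type"] for f in ordered])  # ['source', 'CDS', 'mRNA', 'tRNA']
```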
- Sorting: Lines are ordered by start position, feature priority (placing source and TOPOLOGY at the top), and end position.
- Validation Logs: Displays all discarded items via stderr.
Customization
Metadata Configuration
Use a TOML file (e.g., metadata.toml) to provide information absent from GFF3/FASTA files, such as submitter details and common qualifiers. If --metadata is omitted, the tool uses this default configuration.
Key Sections
- COMMON Entry: Define SUBMITTER, REFERENCE, and COMMENT blocks.
- Global Qualifiers (DDBJ-side injection): Use the [COMMON.feature] syntax to instruct the DDBJ system to insert qualifiers into every occurrence of a feature.
- Local Injection (Tool-side injection): Use the [feature] syntax (without the COMMON prefix) to have gff3-to-ddbj explicitly insert these qualifiers into the generated .ann file.
Note: Currently, only [source] and [assembly_gap] are supported for local injection.
[Advanced] Feature and Qualifier Renaming
GFF3 and DDBJ formats do not share a 1:1 nomenclature. GFF3 “types” (column 3) map to DDBJ “Features,” while GFF3 “attributes” (column 9) map to DDBJ “Qualifiers.”
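To make the two columns concrete, here is a minimal sketch that pulls the type (column 3) and the attributes (column 9) out of one GFF3 line, following the GFF3 conventions of `;`-separated `tag=value` pairs, comma-separated multiple values, and URL-style escaping. The function name `parse_gff3_line` and the returned dictionary layout are hypothetical and do not reflect how gff3-to-ddbj parses its input.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, got {len(columns)}")
    seqid, _source, feature_type, start, end, _score, strand, _phase, attr_field = columns
    attributes = {}
    if attr_field != ".":
        for pair in attr_field.split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # Split on commas first, then decode escapes such as %2C.
            attributes[unquote(key)] = [unquote(item) for item in value.split(",")]
    return {
        "seqid": seqid,
        "type": feature_type,      # column 3 -> DDBJ feature
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,  # column 9 -> DDBJ qualifiers
    }


record = parse_gff3_line("chr1\tRefSeq\ttranscript\t100\t200\t.\t+\t.\tID=foobar;Name=a%2Cb,c")
print(record["type"], record["attributes"])
```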
gff3-to-ddbj uses a default translation table to handle these conversions. You can override these rules using --config_rename <FILE>.
Customization Examples:
- Renaming Types: Map a GFF3 type to a specific DDBJ feature key.
- Renaming Attributes: Map GFF3 attributes to DDBJ qualifiers. Use __ANY__ to apply a rule across all feature types.
- Complex Translations: Map a GFF3 type to a DDBJ feature/qualifier pair (e.g., snRNA to ncRNA with a class).
- Attribute-to-Feature Mapping: Convert specific attribute values into distinct DDBJ features (e.g., an RNA type with a biotype=misc_RNA attribute becomes a misc_RNA feature).
[Advanced] Feature and Qualifier Filtering
To comply with the DDBJ usage matrix, output is filtered by a default configuration. Only features and qualifiers explicitly allowed in this TOML file will appear in the final output.
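The allowlist behaviour described above can be sketched in a few lines of Python. The in-memory `allowed` mapping (feature key to a set of permitted qualifier keys) is a hypothetical stand-in, and the layout of the real TOML file may differ.

```python
def filter_features(features, allowed):
    """Keep only features and qualifiers that the allowlist names explicitly.

    features: list of (feature_key, qualifiers_dict) pairs.
    allowed:  hypothetical mapping of feature_key -> set of qualifier keys.
    """
    kept = []
    for feature_key, qualifiers in features:
        if feature_key not in allowed:
            continue  # feature not listed: dropped entirely
        permitted = allowed[feature_key]
        kept.append(
            (feature_key, {k: v for k, v in qualifiers.items() if k in permitted})
        )
    return kept


allowed = {"CDS": {"product", "transl_table"}}
features = [
    ("gene", {"gene": ["abc1"]}),
    ("CDS", {"product": ["kinase"], "transl_table": ["11"], "Dbxref": ["X:1"]}),
]
print(filter_features(features, allowed))
```

With this allowlist the gene feature and the Dbxref qualifier are both removed, while the CDS keeps its two permitted qualifiers.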
To use a custom filter, provide a TOML file via --config_filter <FILE>.
Troubleshooting
Validate GFF3
It is good practice to validate your GFF3 files before conversion. The GFF3 online validator is useful, although it limits the file size to 50 MB.
Split FASTA from GFF3 (if needed)
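A GFF3 file may embed its sequences after a ##FASTA directive. As a rough illustration of separating the two parts, the sketch below splits the text at that directive. It is not the implementation of the bundled tool, and the function name `split_gff3_fasta` is hypothetical.

```python
def split_gff3_fasta(text):
    """Split GFF3 text at the ##FASTA directive into (gff3_part, fasta_part)."""
    gff_lines, fasta_lines = [], []
    in_fasta = False
    for line in text.splitlines():
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True  # everything after the directive is FASTA
            continue
        (fasta_lines if in_fasta else gff_lines).append(line)
    return "\n".join(gff_lines), "\n".join(fasta_lines)


example = (
    "##gff-version 3\n"
    "chr1\t.\tmRNA\t1\t4\t.\t+\t.\tID=m1\n"
    "##FASTA\n"
    ">chr1\n"
    "ACGT\n"
)
gff_part, fasta_part = split_gff3_fasta(example)
print(fasta_part)
```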
GFF3-to-DDBJ does not work when the GFF3 file embeds FASTA sequences after a ##FASTA directive. The bundled split-fasta tool reads such a GFF3 file and saves the GFF3 (without the FASTA section) and the FASTA separately.
Running it creates two files, myfile_splitted.gff3 and myfile_splitted.fa.
Normalize entry names (if needed)
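Renaming an entry so that it contains only permitted characters can be illustrated with a small regular-expression sketch. The only mapping confirmed by this README is `|` becoming `:`. Replacing every other disallowed character with a colon as well is an assumption of this example, and `normalize_entry_name` is a hypothetical name, not the bundled program.

```python
import re

# Characters named in this README as disallowed in the Entry column:
# = | > " space [ ]
INVALID_CHARS = re.compile(r'[=|>" \[\]]')


def normalize_entry_name(name, replacement=":"):
    """Replace each disallowed character with the replacement string.

    Assumption: all disallowed characters map to ":"; the README only
    shows this for "|".
    """
    return INVALID_CHARS.sub(replacement, name)


print(normalize_entry_name("ERS324955|SC|contig000013"))  # ERS324955:SC:contig000013
```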
Characters such as =|>" [] are not allowed in the first column (the “Entry”) of the DDBJ annotation. The bundled normalize-entry-names program renames such entries; for example, it converts an ID like ERS324955|SC|contig000013 into ERS324955:SC:contig000013.
The command creates the file myannotation_output_renamed.txt if invalid characters are found. Otherwise, it produces no output.
Known Issues
Biological & Sequence Logic
- join() syntax for features containing the /trans_splicing qualifier.
- Support for /transl_except at start or stop codons is not yet implemented.
- The /translation qualifier is not handled when an /exception qualifier is present, which may lead to DDBJ validation errors.
- Locations between two bases (e.g., 123^124) are not currently supported and may be processed incorrectly.
Performance
Acknowledgments
The design of GFF3-to-DDBJ is inspired by EMBLmyGFF3, a versatile tool used for converting GFF3 data into the EMBL annotation format.