GNparser splits scientific names into their semantic elements and attaches
meta information to them. Parsing is indispensable for matching names
from different data sources, because it can normalize different lexical
variants of names to the same canonical form.
This parser, written in Go, is the third iteration of the project. The
first, biodiversity, was written in Ruby; the second, also called
gnparser, was written in Scala. This project now replaces the other two:
the Scala project is archived, and biodiversity now uses the Go code for
parsing. All three projects were developed as part of the Global Names
Architecture Project.
To use GNparser as a command line tool under Windows, Mac or Linux,
download the latest release, uncompress it, and copy the gnparser
binary somewhere in your PATH. On a Mac you might also need to open
System Preferences and, in the security panel, select "Allow from other
developers". Then, after running gnparser, click ‘Yes’ in the dialog box
that asks whether to allow a program from an “unregistered developer”.
Introduction
Global Names Parser, or GNparser, is a program written in Go for breaking up
scientific names into their elements. It uses peg, a Parsing
Expression Grammar (PEG) tool.
Many other parsing algorithms for scientific names use regular expressions.
This approach works well for extracting canonical forms in simple cases.
However, for complex scientific names and to parse scientific names into
all semantic elements, regular expressions often fail, unable to overcome
the recursive nature of data embedded in names. By contrast, GNparser
is able to deal with the most complex scientific name-strings.
GNparser takes a name-string like Drosophila (Sophophora) melanogaster
Meigen, 1830 and returns parsed components in CSV, TSV or JSON format.
Parsing scientific names can become surprisingly complex, and the
GNparser's test file is a good source of information about the parser’s
capabilities, its input and output.
Speed
Number of names parsed per second on an AMD Ryzen 7 5800H CPU
(8 cores, 16 threads), GNparser v1.3.0:
gnparser 1_000_000_names.txt -j 200 > /dev/null

| Threads | names/sec |
| ------- | --------- |
| 1       | 9,000     |
| 2       | 19,000    |
| 4       | 35,000    |
| 8       | 56,000    |
| 16      | 82,000    |
| 100     | 107,000   |
| 200     | 111,000   |

For the simplest output, the Go GNparser is roughly 2 times faster than the
Scala GNparser and about 100 times faster than the pure Ruby implementation.
For JSON formats the parser is approximately 8 times faster than the Scala
one, due to more efficient JSON conversion.
Features
- Fastest parser ever.
- Very easy to install: placing the executable somewhere in the PATH is
  sufficient.
- Parsing can be adjusted to the rules of a specific nomenclatural code
  (Botanical, Botanical Cultivar, Zoological, Viral).
- Extracts all elements from a name, not only canonical forms.
- Works with very complex scientific names, including hybrid formulas.
- Includes a RESTful service and an interactive web interface.
- Can run as a command line application.
- Can be used as a library in Go projects.
- Can be scaled to many CPUs and computers (if 250 million names an
  hour is not enough).
- Calculates a stable UUID version 5 ID from the content of a string.
- Provides C-binding to incorporate the parser into other languages.
Use Cases
Getting the simplest possible canonical form
Canonical forms of a scientific name are the latinized components without
annotations, authors or dates. They are great for matching lexical variants
of names. Three versions of canonical forms are included:

| Canonical | Example                       | Use                                                    |
| --------- | ----------------------------- | ------------------------------------------------------ |
| -         | Spiraea alba var. alba Du Roi | Best for disambiguation, but has many lexical variants |
| Full      | Spiraea alba var. alba        | Presentation, infraspecies disambiguation              |
| Simple    | Spiraea alba alba             | Name matching, presentation                            |
| Stem      | Spiraea alb alb               | Best for matching fem./masc. inconsistencies           |

The canonicalName -> full field is good for presentation, as it keeps more
details.
The canonicalName -> simple field is good for matching names from different
sources, because dataset curators sometimes omit the hybrid sign in named
hybrids, or remove ranks for infraspecific epithets.
The canonicalName -> stem field normalizes the simple canonical form even
further. It allows matching names with inconsistent gender suffixes in
specific epithets (for example alba vs. albus). The normalization is done
according to the stemming rules for the Latin language described in Schinke R
et al (1996). For example, letters j are converted to i, letters v are
converted to u, and suffixes are removed from the specific and infraspecific
epithets.
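The letter substitutions mentioned above can be illustrated with a small Go sketch. It is not GNparser's stemmer: `normalizeLetters` is a hypothetical helper that only performs the j -> i and v -> u conversions, it assumes that the first word (the genus) is left untouched, and it does not implement the suffix removal described by Schinke et al.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeLetters is a hypothetical helper, not part of GNparser.
// It applies only the letter substitutions (j -> i, v -> u) to every word
// after the first one. Leaving the genus unchanged is an assumption of this
// sketch, and suffix removal is intentionally not implemented.
func normalizeLetters(simple string) string {
	words := strings.Fields(simple)
	r := strings.NewReplacer("j", "i", "v", "u")
	for i := 1; i < len(words); i++ {
		words[i] = r.Replace(words[i])
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(normalizeLetters("Parus major"))
}
```

A full stemmer would additionally strip the Latin suffixes, which is what turns alba and albus into the same stem.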
If you mostly care about the canonical form of a name, you can use the
default --format csv flag with the command line tool.
Quickly partition names by the type
Usually scientific names can be broken into groups according to the number of
elements:
Uninomial
Binomial
Trinomial
Quadrinomial
The output of GNparser contains a Cardinality field that tells, when
possible, how many elements are detected in the name.

| Cardinality | Name Type    |
| ----------- | ------------ |
| 0           | Undetermined |
| 1           | Uninomial    |
| 2           | Binomial     |
| 3           | Trinomial    |
| 4           | Quadrinomial |

For hybrid formulas, “approximate” names (with “sp.”, “spp.” etc.), unparsed
names, and names from the BOLD project, cardinality is 0 (Undetermined).
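As a rough illustration of partitioning by cardinality, the Go sketch below maps the values from the table to their labels and groups name-strings accordingly. The function names and the sample cardinalities are made up for the example; in practice the values would come from the Cardinality field of GNparser output.

```go
package main

import "fmt"

// nameType returns the label for a Cardinality value, following the table
// above. Values outside 0..4 are not described in the text, so this sketch
// reports them as "Unknown" (an assumption).
func nameType(cardinality int) string {
	switch cardinality {
	case 0:
		return "Undetermined"
	case 1:
		return "Uninomial"
	case 2:
		return "Binomial"
	case 3:
		return "Trinomial"
	case 4:
		return "Quadrinomial"
	default:
		return "Unknown"
	}
}

// partition groups name-strings by the label of their cardinality.
func partition(cardinalities map[string]int) map[string][]string {
	groups := make(map[string][]string)
	for name, c := range cardinalities {
		label := nameType(c)
		groups[label] = append(groups[label], name)
	}
	return groups
}

func main() {
	// Cardinality values are assigned by hand for illustration only.
	sample := map[string]int{
		"Spiraea":                1,
		"Spiraea alba":           2,
		"Spiraea alba var. alba": 3,
		"Spiraea sp.":            0,
	}
	for label, names := range partition(sample) {
		fmt.Println(label, len(names))
	}
}
```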
Normalizing name-strings
There are many inconsistencies in how scientific names may be written.
Use the normalized field to bring them all to a common form (spelling,
spacing, ranks).
Removing authorship from the middle of the name
Often data administrators split name-strings into a “name part” and an
“authorship part”. This practice loses some information when dealing with
names like “Prosthechea cochleata (L.) W.E.Higgins var. grandiflora
(Mutel) Christenson”. However, if this is the use case, a combination of
canonicalName -> full with the authorship from the lowest taxon will do
the job. You can also use the default --format csv flag of the gnparser
command line tool.
Figuring out if names are well-formed
If there are problems with parsing a name, the parser generates
qualityWarnings messages and lowers the parsing quality of the name. Quality
values mean the following:
- "quality": 1 - No problems were detected.
- "quality": 2 - There were small problems, normalized result
  should still be good.
- "quality": 3 - There are some significant problems with parsing.
- "quality": 4 - There were serious problems with the name, and the
  final result is rather doubtful.
- "quality": 0 - A string could not be recognized as a scientific
  name and parsing failed.
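The quality values can be used to pick out names that need a manual check. The Go sketch below reads JSON output with one object per line and returns the names whose quality is 0 or above a chosen threshold. Only the "quality" field is taken from the text above; the "verbatim" field name and the sample records are assumptions made for illustration, not real GNparser output.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// parsedName holds the two fields this sketch needs.
type parsedName struct {
	Verbatim string `json:"verbatim"` // assumed field name
	Quality  int    `json:"quality"`
}

// needsReview returns the verbatim strings of records whose quality is 0
// (parsing failed) or greater than maxQuality.
func needsReview(jsonLines string, maxQuality int) ([]string, error) {
	var res []string
	sc := bufio.NewScanner(strings.NewReader(jsonLines))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var p parsedName
		if err := json.Unmarshal([]byte(line), &p); err != nil {
			return nil, err
		}
		if p.Quality == 0 || p.Quality > maxQuality {
			res = append(res, p.Verbatim)
		}
	}
	return res, sc.Err()
}

func main() {
	// Hand-written sample records, one JSON object per line.
	input := `{"verbatim":"Parus major Linnaeus, 1788","quality":1}
{"verbatim":"Parus major Linnaeus, 1788 and something odd","quality":4}
{"verbatim":"not a name at all","quality":0}`

	names, err := needsReview(input, 2)
	if err != nil {
		panic(err)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```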
Creating stable GUIDs for name-strings
GNparser uses UUID version 5 to generate its id field.
There is an algorithmic 1:1 relationship between the name-string and the
UUID. Moreover, the same algorithm can be used in any popular language to
generate the same UUID. Such IDs can be used to globally connect information
about name-strings or information associated with name-strings.
More information about UUID version 5 can be found in the Global Names
blog.
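To show why such IDs are reproducible in any language, here is a sketch of the UUID version 5 algorithm (RFC 4122) using only the Go standard library. The algorithm itself is standard; the namespace is an assumption: this document does not state which namespace GNparser uses, and the sketch supposes it is the UUID v5 of "globalnames.org" in the DNS namespace. Check the Global Names blog before relying on the resulting values.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// uuid is a 16-byte UUID value.
type uuid [16]byte

// String formats the UUID in the usual 8-4-4-4-12 hexadecimal form.
func (u uuid) String() string {
	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

// namespaceDNS is the DNS namespace UUID defined in RFC 4122,
// 6ba7b810-9dad-11d1-80b4-00c04fd430c8.
var namespaceDNS = uuid{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// newV5 computes a version 5 UUID: the SHA-1 hash of the namespace bytes
// followed by the name, truncated to 16 bytes, with the version and
// variant bits set.
func newV5(ns uuid, name string) uuid {
	h := sha1.New()
	h.Write(ns[:])
	h.Write([]byte(name))
	var u uuid
	copy(u[:], h.Sum(nil)) // keeps the first 16 of the 20 hash bytes
	u[6] = u[6]&0x0f | 0x50 // version 5
	u[8] = u[8]&0x3f | 0x80 // RFC 4122 variant
	return u
}

func main() {
	// Assumption: the Global Names namespace is derived from this domain.
	gnNamespace := newV5(namespaceDNS, "globalnames.org")
	fmt.Println(newV5(gnNamespace, "Parus major Linnaeus, 1788"))
}
```

Because the ID depends only on the namespace and the bytes of the name-string, two systems that never exchanged data still arrive at the same ID for the same string.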
Assembling canonical forms etc. from original spelling
GNparser tries to correct problems with spelling, but sometimes it is
important to keep the original spelling of the canonical forms or authorship.
The words field attaches semantic meaning to every word in the
original name-string and allows users to create canonical forms or other
combinations using the original verbatim spelling of the words. Each element
in words contains 4 parts:
- verbatim value of a word
- semantic meaning of the word
- start position of the word
- end position of the word
The words section belongs to the additional details. To use it, enable the
--details flag for the command line application.
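The idea of rebuilding a canonical form from the original spelling can be sketched in Go as follows. The struct mirrors the four parts listed above, but its field names, the type labels ("GENUS", "SPECIES", "INFRASPECIES") and the sample data are hypothetical, and the sketch assumes that positions are character offsets with an exclusive end. Consult real --details output for the actual names and conventions.

```go
package main

import (
	"fmt"
	"strings"
)

// word mirrors the four parts of an element in the words field.
// Field names and type labels are hypothetical.
type word struct {
	Verbatim string
	Type     string
	Start    int // assumed: character offset of the first character
	End      int // assumed: character offset just past the last character
}

// verbatimCanonical joins the original spelling of the name words,
// cutting them out of the input string by their positions.
func verbatimCanonical(name string, words []word) string {
	runes := []rune(name)
	var parts []string
	for _, w := range words {
		switch w.Type {
		case "GENUS", "SPECIES", "INFRASPECIES":
			if w.Start >= 0 && w.Start < w.End && w.End <= len(runes) {
				parts = append(parts, string(runes[w.Start:w.End]))
			}
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	name := "Pomatomus saltator (Linnaeus, 1766)"
	// Hand-written sample, not real GNparser output.
	words := []word{
		{Verbatim: "Pomatomus", Type: "GENUS", Start: 0, End: 9},
		{Verbatim: "saltator", Type: "SPECIES", Start: 10, End: 18},
		{Verbatim: "Linnaeus", Type: "AUTHOR_WORD", Start: 20, End: 28},
		{Verbatim: "1766", Type: "YEAR", Start: 30, End: 34},
	}
	fmt.Println(verbatimCanonical(name, words))
}
```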
Compiled programs in Go are self-sufficient and small (GNparser is only a
few megabytes). As a result the binary file of gnparser is all you need to
make it work. You can install it by downloading the latest version of the
binary for your operating system and CPU architecture, and
placing it in your PATH.
Install with Homebrew (Mac OS X, Linux)
Homebrew is a packaging system originally made for Mac OS X. You can now use
it on Mac, Linux, or Windows via WSL (Windows Subsystem for Linux).
It is also possible to install Windows Subsystem for Linux on Windows
(v10 or v11), and use gnparser as a Linux executable.
Install with Go
If you have Go installed on your computer, use
go install github.com/gnames/gnparser/gnparser@latest
For development, install just and use the following:
git clone https://github.com/gnames/gnparser.git
cd gnparser
just tools
just install
Your PATH needs to include $HOME/go/bin.
Usage
Command Line
gnparser -f pretty "Quadrella steyermarkii (Standl.) Iltis & Cornejo"
Relevant flags:
--help -h
: Displays help information about the available flags.
--batch_size -b
: Sets the maximum number of names processed in a batch. This is ignored
in streaming mode (-s).
--cultivar -C
: Deprecated. Use --nomenclatural-code instead.
--capitalize -c
: Capitalizes the first letter of input name-strings.
--details -d
: Provides more detailed output for each parsed name. Ignored for
CSV/TSV formats.
--diaereses -D
: Preserves diaereses, e.g. Leptochloöpsis virgata. The stemmed
canonical name does not include diaereses.
--flatten-output -F
: Converts nested JSON output into a flattened structure. Only applies to JSON
formats (CSV/TSV formats are always flattened). Instead of nested objects like
canonical and authorship, all fields are flattened to the top level, making
the output easier to process in some applications. Some detailed information
would be lost in the flattened format.
--compact-authors -a
: Removes spaces between authors’ initials. For example, for
Schoenoplectus tabernaemontani (C. C. Gmel.) Palla the normalized
authorship is generated without spaces between the initials:
Schoenoplectus tabernaemontani (C.C.Gmel.) Palla.
--format -f
: Specifies the output format: csv, tsv, compact, or pretty.
Defaults to csv. CSV and TSV formats include a header row.
--jobs -j
: Sets the number of jobs to run concurrently.
--ignore_tags -i
: Increases performance by skipping HTML entity and tag processing.
Only use if your input is known to be free of HTML.
--nomenclatural-code -n
: Specifies the nomenclatural code (e.g., botanical, zoological) to use
for parsing in ambiguous cases. For example in Aus (Bus) cus: according
to zoological code Aus is genus, Bus is subgenus, while according
to botanical code Bus is the author of Aus. For modern binomial viral
code this setting is a hard constraint, while for other codes it sets
a priority in ambiguous situations.
--port -p
: Sets the port for the web-interface and RESTful API.
--species-group-cut
: Modifies the stemmed canonical form for autonyms and species-group names
by removing the infraspecific epithet. Useful for matching names like
Aus bus and Aus bus bus.
--stream -s
: Enables streaming mode, where names are processed one at a time.
Useful for integrating gnparser with languages other than Go.
--unordered -u
: Disables output ordering. The output order may not match the input order.
--version -V
: Displays the version number of GNparser.
To parse one name:
# CSV output (default)
gnparser "Parus major Linnaeus, 1788"
# or
gnparser -f csv "Parus major Linnaeus, 1788"
# TSV output
gnparser -f tsv "Parus major Linnaeus, 1788"
# JSON compact format
gnparser "Parus major Linnaeus, 1788" -f compact
# pretty format
gnparser -f pretty "Parus major Linnaeus, 1788"
# JSON with flattened output structure (no nested objects)
gnparser -f compact -F "Parus major Linnaeus, 1788"
# to parse a name from the standard input
echo "Parus major Linnaeus, 1788" | gnparser
# to parse a botanical cultivar name
gnparser "Anthurium 'Ace of Spades'" --nomenclatural-code cultivar
gnparser "Phyllostachys vivax cv aureocaulis" -n cult
# to parse a name that is all in lowercase
gnparser "parus major" --capitalize
gnparser "parus major" -c
To parse a file:
There is no flag for parsing a file. If the parser finds the given file path
on your computer, it will parse the content of the file, assuming that every
line is a new scientific name. If the file path is not found, GNparser will
try to parse the “path” as a scientific name.
Parsed results stream to STDOUT, while the progress of the parsing
is directed to STDERR.
# to parse with 200 parallel processes
gnparser -j 200 names.txt > names_parsed.csv
# to parse file with more detailed output
gnparser names.txt -d -f compact > names_parsed.txt
# to parse files using pipes
cat names.txt | gnparser -f csv -j 200 > names_parsed.csv
# to parse using `stream` method instead of `batch` method.
cat names.txt | gnparser -s > names_parsed.csv
# to not remove html tags and entities during parsing. You gain a bit of
# performance with this option if your data does not contain HTML tags or
# entities.
gnparser "<i>Pomatomus</i> <i>saltator</i>"
gnparser -i "<i>Pomatomus</i> <i>saltator</i>"
gnparser -i "Pomatomus saltator"
If the jobs number is set to more than 1, parsing uses several concurrent
processes. This approach increases the speed of parsing on multi-CPU
computers. The results are returned in arbitrary order and are reassembled
into the order of input transparently for the user.
The input file might contain millions of names, so creating one properly
formatted JSON document for the whole output might be prohibitively
expensive. Therefore the parser creates one JSON line per name (when the
compact format is used).
You can use up to 20 times more “threads” than the number of your CPU cores
to reach the maximum speed of parsing (--jobs 200 flag). This is practical
because additional “threads” are very cheap in Go and they try to fill
every idle gap in the CPU usage.
Pipes
Almost any language can use the pipes of the underlying operating
system. From inside your program you can make the GNparser CLI executable
listen on a STDIN pipe and write its output to a STDOUT pipe. Here
is an example in Ruby:
require 'open3'

# Start one gnparser process in streaming mode per output format and
# return a hash with the STDIN, STDOUT and STDERR pipes of each process.
def self.start_gnparser
  io = {}
  ['compact', 'csv'].each do |format|
    stdin, stdout, stderr = Open3.popen3("./gnparser -s --format #{format}")
    io[format.to_sym] = { stdin: stdin, stdout: stdout, stderr: stderr }
  end
  io
end
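The same pattern can be sketched in Go with the standard os/exec package. This is an illustration under assumptions rather than a recommended integration (Go programs can use GNparser as a library): the `stream` type and its methods are hypothetical, gnparser is expected to be in the PATH, and the sketch assumes that in streaming mode with the compact format the parser answers every input line with exactly one output line.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// stream wraps a long-running child process and its pipes.
type stream struct {
	cmd *exec.Cmd
	in  io.WriteCloser
	out *bufio.Reader
}

// startStream starts the command with pipes attached to STDIN and STDOUT.
func startStream(name string, args ...string) (*stream, error) {
	cmd := exec.Command(name, args...)
	in, err := cmd.StdinPipe()
	if err != nil {
		return nil, err
	}
	out, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return &stream{cmd: cmd, in: in, out: bufio.NewReader(out)}, nil
}

// send writes one line to the process and reads one line of reply.
func (s *stream) send(line string) (string, error) {
	if _, err := io.WriteString(s.in, line+"\n"); err != nil {
		return "", err
	}
	reply, err := s.out.ReadString('\n')
	return strings.TrimRight(reply, "\r\n"), err
}

// close ends the input, drains remaining output and waits for the process.
func (s *stream) close() error {
	s.in.Close()
	io.Copy(io.Discard, s.out)
	return s.cmd.Wait()
}

func main() {
	if _, err := exec.LookPath("gnparser"); err != nil {
		fmt.Println("gnparser was not found in PATH")
		return
	}
	s, err := startStream("gnparser", "-s", "-f", "compact")
	if err != nil {
		fmt.Println("cannot start gnparser:", err)
		return
	}
	defer s.close()

	reply, err := s.send("Parus major Linnaeus, 1788")
	if err != nil {
		fmt.Println("cannot read reply:", err)
		return
	}
	fmt.Println(reply)
}
```

Keeping one process alive and exchanging lines with it avoids paying the start-up cost of the executable for every name.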
# run as a website and a RESTful service
docker run -p 0.0.0.0:80:8080 gnames/gognparser -p 8080
# just parse something
docker run gnames/gognparser "Amaurorhinus bewichianus (Wollaston,1860) (s.str.)"
It is possible to bind GNparser functionality with languages that can use
the C Application Binary Interface. Such languages include, for example,
Python, Ruby, Rust, C, C++, Java (via JNI).
To compile the GNparser shared library for your platform/operating system of
choice you need just and the GNU gcc compiler installed:
just clib
cd binding
cp libgnparser* /path/to/some/project
Parsing ambiguities
Some name-strings cannot be parsed unambiguously without some additional data.
Names with filius (ICN code)
For names like Aus bus Linn. f. cus the f. is ambiguous. It might mean
that the species was described by a son of (filius) Linn., or it might mean
that cus is a forma of bus. We provide a warning
“Ambiguous f. (filius or forma)” for such cases.
Names with subgenus (ICZN code) and genus author (ICN code)
For names like Aus (Bus) L. or Aus (Bus) cus L. the (Bus) token would
mean the name of subgenus for ICZN names, but for ICN names it would be an
author of genus Aus. We created a list of ICN generic authors using data from
IRMNG to distinguish such names from each other. For detected ICN names we
provide a warning “Ambiguity: ICN author or subgenus”.
Virus names according to modern ICVCN binomial nomenclature
The ICVCN code adopted binomial nomenclature in 2021, and converted most names
to the new rules by 2026. However, the rules for viral names differ
significantly from those of other nomenclatural codes (e.g., names like
Batravirus ranidallo3 or Pradovirus XAJ24 are legal ICVCN names), so we had to
create a specialized parser for them. This creates parsing challenges for
names like Calviria and Euvira (ICZN genera) that by chance match the ICVCN
rules for Realm and Subrealm. We try to detect such names and place them in an
exception list.
Contributors
If you want to submit a bug or add a feature, read the
CONTRIBUTING file.
Artificial Intelligence Policy
We use artificial intelligence to help find algorithms, decide on
implementation approaches, and generate code. We carefully review all
automatically generated code, fixing inconsistencies, removing superfluous
implementations, and improving optimization. No code that we do not understand
or approve makes it into published versions of GNparser. We primarily use
Claude Code, with limited use of Gemini CLI.
References
Mozzherin, D.Y., Myltsev, A.A. & Patterson, D.J. “gnparser”: a powerful parser
for scientific names based on Parsing Expression Grammar. BMC Bioinformatics
18, 279 (2017). https://doi.org/10.1186/s12859-017-1663-3
Rees, T. (compiler) (2019). The Interim Register of Marine and Nonmarine
Genera. Available from http://www.irmng.org at VLIZ.
Accessed 2019-04-10.
Global Names Parser: GNparser written in Go
Try GNparser online. Try GNparser with OpenRefine.
Citing
Zenodo DOI can be used to cite GNparser
GNparser reached a stable v1. Differences between v1 and v0
Tutorials
Installation
Install with Homebrew (Mac OS X, Linux)
Install Homebrew according to their instructions.
Install gnparser with:
Linux or Mac OS X
Move the gnparser executable somewhere in your PATH (for example /usr/local/bin).
If you’re using Mac OS, you might encounter a security warning that prevents gnparser from running. Here’s how to fix it:
1. In the warning dialog click the Done button (not the Move to Trash button).
2. Locate the Security Settings: Go to System Settings -> Privacy & Security and scroll down to the Security section.
3. Allow gnparser: You should see a message saying "gnparser" was blocked.... Click the Allow Anyway button next to it.
4. Run gnparser again: Try running gnparser from your terminal. This time, a dialog box will pop up with an Open Anyway button.
5. Open and Unblock: Click Open Anyway and enter your administrator password when prompted. This will unblock the gnparser binary.
After these steps, you should be able to use gnparser without any issues. You can also copy, move, or rename it freely.
Windows
One possible way would be to create a default folder for executables and place gnparser there.
Use the Windows+R key combination and type “cmd”. In the terminal window that appears, type:
Add the C:\bin directory to your PATH user and/or system environment variables.
Usage
Supported values for the --nomenclatural-code flag: bact, bacterial, ICNP, bot, vir, viral, ICVCN, botanical, ICN, cult, cultivar, ICNCP, zoo, zoological, ICZN.
Pipes
@marcobrt kindly provided an example in PHP.
Note that you have to use the --stream (-s) flag for this approach to work.
R language package
For the R language it is possible to use the rgnparser package. It implements the pipes method mentioned above. It does require the gnparser app to be installed.
Ruby Gem
Ruby developers can use GNparser functionality via the biodiversity gem. It uses C-binding and does not require an installed gnparser app.
Node.js
@tobymarsden created a wrapper for node.js. It uses C-binding and does not require an installed gnparser app.
Usage as a REST API Interface or Web-based User Graphical Interface
The web-based user interface and the API are invoked by the --port or -p flag. To start the web server on http://0.0.0.0:9000, run gnparser with --port 9000.
Opening a browser with this address will now show an interactive interface to the parser. API calls will be accessible on http://0.0.0.0:9000/api/v1/.
The API and schema are described fully using the OpenAPI specification.
Make sure to CGI-escape name-strings for GET requests. An ‘&’ character needs to be converted to ‘%26’.
GET /api?q=Aus+bus|Aus+bus+D.+%26+M.,+1870
POST /api with a request body containing a JSON array of strings
Use as a Docker image
You need to have the docker runtime installed on your computer for these examples to work.
Use as a library in Go
Use as a shared C library
As an example of how to use the shared library, check this StackOverflow question and the biodiversity Ruby gem.
License
Released under MIT license