taxopy is a Python package that provides an interface for assessing NCBI-formatted taxonomic databases. It enables various operations on taxonomic data, such as obtaining complete lineages, determining the lowest common ancestors (LCAs), retrieving taxa names from taxonomic identifiers, and more.
Alternatively, you can install the rapidfuzz library alongside taxopy:
# Using pip
pip install taxopy rapidfuzz
# Using conda
conda install -c conda-forge -c bioconda taxopy rapidfuzz
Usage
For a detailed guide on how to use taxopy, please refer to the documentation.
import taxopy
First you need to download taxonomic information from NCBI’s servers and put this data into a TaxDb object:
taxdb = taxopy.TaxDb()
# You can also use your own set of taxonomy files:
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp")
# If you want to support legacy taxonomic identifiers (identifiers that were merged into another identifier), you also need to provide a `merged.dmp` file. This is not necessary if the data is downloaded from NCBI.
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp", merged_dmp="taxdb/merged.dmp")
The TaxDb object stores the name, rank and parent-child relationships of each taxonomic identifier:
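To make these relationships concrete, here is a minimal, self-contained sketch that builds the same kind of mapping from a few hand-written rows in the NCBI dump layout (fields separated by `\t|\t`, rows ending in `\t|`). The helper `parse_dmp_line` and the three dictionaries are illustrative names chosen for this sketch, not necessarily how taxopy stores the data internally, and the rows are a tiny hand-typed sample rather than a real download.

```python
# Hand-written sample rows in the NCBI taxdump layout (illustrative only).
NODES = [
    "9605\t|\t207598\t|\tgenus\t|",
    "9606\t|\t9605\t|\tspecies\t|",
]
NAMES = [
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|",
]


def parse_dmp_line(line):
    """Split one dump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


taxid2parent, taxid2rank, taxid2name = {}, {}, {}

for row in NODES:
    taxid, parent, rank = parse_dmp_line(row)[:3]
    taxid2parent[int(taxid)] = int(parent)
    taxid2rank[int(taxid)] = rank

for row in NAMES:
    taxid, name, _unique_name, name_class = parse_dmp_line(row)[:4]
    # Keep only the scientific name; common names and synonyms are skipped.
    if name_class == "scientific name":
        taxid2name[int(taxid)] = name

print(taxid2name[9606], taxid2rank[9606], taxid2parent[9606])
# Homo sapiens species 9605
```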
If you want to retrieve the new taxonomic identifier of a legacy identifier you can use the oldtaxid2newtaxid attribute:
print(taxdb.oldtaxid2newtaxid[260])
143224
To get information about a given taxon, you can create a Taxon object using its taxonomic identifier:
saccharomyces = taxopy.Taxon(4930, taxdb)
human = taxopy.Taxon(9606, taxdb)
gorilla = taxopy.Taxon(9593, taxdb)
lagomorpha = taxopy.Taxon(9975, taxdb)
Each Taxon object stores a variety of information, such as the rank, identifier and name of the input taxon, and the identifiers and names of all the parent taxa:
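The parent taxa of a taxon can be pictured as the result of following parent links until the root is reached. The sketch below illustrates that idea with a toy parent table; `TAXID2PARENT` and `lineage` are hypothetical names for this example, intermediate ranks are left out for brevity, and this is not taxopy's own code.

```python
# Toy parent table (child -> parent); intermediate ranks are omitted.
# As in NCBI dumps, the root (1) is listed as its own parent.
TAXID2PARENT = {
    1: 1,
    40674: 1,      # Mammalia
    9443: 40674,   # Primates
    9604: 9443,    # Hominidae
    9605: 9604,    # Homo
    9606: 9605,    # Homo sapiens
}


def lineage(taxid, parents):
    """Return the identifiers from `taxid` up to the root, inclusive."""
    path = [taxid]
    while parents[taxid] != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return path


human_lineage = lineage(9606, TAXID2PARENT)
print(human_lineage)     # [9606, 9605, 9604, 9443, 40674, 1]
print(human_lineage[1])  # 9605, the direct parent of 9606
```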
The find_majority_vote function allows you to control its stringency via the fraction parameter. For instance, if you set fraction to 0.75, the resulting taxon will be shared by more than 75% of the input lineages. By default, fraction is 0.5.
To check the level of agreement between the taxa that were aggregated using find_majority_vote and the output taxon, you can check the agreement attribute.
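To illustrate how the threshold and the agreement value interact, here is a rough, self-contained sketch of a majority vote over root-first lineages with optional per-lineage weights. It is an approximation written for this README, not taxopy's implementation: the function name `majority_vote`, the toy lineages, and the definition of agreement used here (the weighted share of inputs that contain the returned taxon) are all assumptions.

```python
from collections import defaultdict

# Toy root-first lineages; intermediate ranks are omitted.
HUMAN = ["root", "Mammalia", "Primates", "Hominidae", "Homo", "Homo sapiens"]
GORILLA = ["root", "Mammalia", "Primates", "Hominidae", "Gorilla", "Gorilla gorilla"]
LAGOMORPHA = ["root", "Mammalia", "Lagomorpha"]


def majority_vote(lineages, weights=None, fraction=0.5):
    """Return (taxon, agreement) for the deepest taxon whose weighted
    support is strictly greater than `fraction` of the total weight."""
    if weights is None:
        weights = [1.0] * len(lineages)
    total = sum(weights)
    candidates = list(zip(lineages, weights))
    best, agreement, depth = None, 0.0, 0
    while True:
        # Sum the weight behind each taxon found at the current depth.
        support = defaultdict(float)
        for lin, w in candidates:
            if depth < len(lin):
                support[lin[depth]] += w
        if not support:
            break
        taxon, w = max(support.items(), key=lambda item: item[1])
        if w / total <= fraction:
            break
        best, agreement = taxon, w / total
        # Only lineages that pass through the winner stay in the running.
        candidates = [
            (lin, wt) for lin, wt in candidates
            if depth < len(lin) and lin[depth] == taxon
        ]
        depth += 1
    return best, agreement


taxa = [HUMAN, GORILLA, LAGOMORPHA]
print(majority_vote(taxa))                     # Hominidae, agreement about 0.67
print(majority_vote(taxa, fraction=0.75))      # Mammalia, agreement 1.0
print(majority_vote(taxa, weights=[1, 1, 3]))  # Lagomorpha, agreement 0.6
```

Raising the threshold pushes the result towards the root, while a heavier weight on one lineage can pull it towards that lineage.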
The taxid_from_name function returns a list of all taxonomic identifiers associated with the input name. In the case of homonyms, the list will contain multiple taxonomic identifiers:
When querying a TaxDb using a taxon name, you can enable fuzzy search by setting the fuzzy parameter of taxid_from_name to True. This allows the function to find taxa with names similar, but not identical, to the query string(s).
For a practical use case of this feature, consider the GTDB taxonomy. In GTDB some taxa have suffixes appended to their names because they are either not monophyletic in the GTDB reference tree or have unstable placements between different releases. By using fuzzy searches, you can find all the taxonomic identifiers representing a given taxon, such as Myxococcota, without needing to know in advance if any suffixes are appended to the name.
# The `taxdump_url` parameter of the `TaxDb` class can be used to retrieve a custom taxdump from a URL. In this case, we will use a GTDB taxdump provided by Wei Shen (https://github.com/shenwei356/gtdb-taxdump)
gtdb_taxdb = taxopy.TaxDb(taxdump_url="https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.5.0/gtdb-taxdump-R220.tar.gz")
for t in taxopy.taxid_from_name("Myxococcota", gtdb_taxdb, fuzzy=True):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
Myxococcota_A
Myxococcota
You can adjust the minimum similarity threshold between the query string(s) and the matches in the database using the score_cutoff parameter, which determines how closely a name must match a query string to be considered a valid result. The default value is 0.9, but you can lower this threshold to find matches that are less similar to the queries.
for t in taxopy.taxid_from_name(
    "Myxococcota", gtdb_taxdb, fuzzy=True, score_cutoff=0.7
):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
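To get a feel for how a similarity cutoff changes the result set without downloading a database, the sketch below scores a handful of names with Python's standard-library `difflib`. This is only a stand-in: taxopy relies on rapidfuzz, whose scores will differ, and the helper `fuzzy_matches` and the candidate list are made up for this illustration.

```python
from difflib import SequenceMatcher

# A few candidate names; a toy list, not a real database query.
CANDIDATES = ["Myxococcota", "Myxococcota_A", "Myxococcia", "Bacteroidota"]


def fuzzy_matches(query, names, score_cutoff=0.9):
    """Keep names whose similarity to `query` (0 to 1) reaches the cutoff."""
    return [
        name for name in names
        if SequenceMatcher(None, query, name).ratio() >= score_cutoff
    ]


print(fuzzy_matches("Myxococcota", CANDIDATES))
# ['Myxococcota', 'Myxococcota_A']
print(fuzzy_matches("Myxococcota", CANDIDATES, score_cutoff=0.7))
# ['Myxococcota', 'Myxococcota_A', 'Myxococcia']
```

With the stricter cutoff only the suffixed variant is accepted; lowering it also admits a name that merely shares a long prefix, which is why a lower cutoff tends to need manual review of the matches.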
taxopy
Installation
There are two ways to install taxopy:
Usage
You can use the parent method to get a Taxon object of the parent node of a given taxon:
LCA and majority vote
You can get the lowest common ancestor of a list of taxa using the find_lca function:
You may also use the find_majority_vote function to discover the most specific taxon that is shared by more than half of the lineages of a list of taxa:
You can also assign weights to each input lineage:
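Both operations in this section compare lineages position by position. As a rough picture of what a lowest common ancestor is, the sketch below walks several root-first lineages in parallel and keeps the last taxon on which they all agree. The function `lowest_common_ancestor` and the toy lineages are assumptions made for this example; it is not the code behind find_lca.

```python
# Toy root-first lineages; intermediate ranks are omitted.
HUMAN = ["root", "Mammalia", "Primates", "Hominidae", "Homo", "Homo sapiens"]
GORILLA = ["root", "Mammalia", "Primates", "Hominidae", "Gorilla", "Gorilla gorilla"]
LAGOMORPHA = ["root", "Mammalia", "Lagomorpha"]


def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by every lineage, or None."""
    lca = None
    # zip() stops at the shortest lineage, so no index can run out of range.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


print(lowest_common_ancestor([HUMAN, GORILLA]))              # Hominidae
print(lowest_common_ancestor([HUMAN, GORILLA, LAGOMORPHA]))  # Mammalia
```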
Taxid from name
If you only have the name of a taxon, you can get its corresponding taxid using the taxid_from_name function:
In case a list of names is provided as input, the function will return a list of lists.
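The return shapes described here (a list for one name, a list of lists for several names, and more than one identifier for homonyms) can be illustrated with a small inverted index. Everything in this sketch is hypothetical: the helper `taxids_from_names`, the table contents, and in particular the two identifiers given to the homonym "Aotus" are placeholders, not real NCBI identifiers.

```python
from collections import defaultdict

# Toy identifier -> name table. The two "Aotus" entries stand for a homonym
# shared by unrelated taxa; their identifiers (101, 202) are placeholders.
TAXID2NAME = {
    9606: "Homo sapiens",
    101: "Aotus",
    202: "Aotus",
}

# Invert the table so each name maps to every identifier that carries it.
NAME2TAXIDS = defaultdict(list)
for taxid, name in TAXID2NAME.items():
    NAME2TAXIDS[name].append(taxid)


def taxids_from_names(names):
    """Return a list for a single name, or a list of lists for several."""
    if isinstance(names, str):
        return list(NAME2TAXIDS.get(names, []))
    return [list(NAME2TAXIDS.get(name, [])) for name in names]


print(taxids_from_names("Homo sapiens"))             # [9606]
print(taxids_from_names("Aotus"))                    # [101, 202]
print(taxids_from_names(["Homo sapiens", "Aotus"]))  # [[9606], [101, 202]]
```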
Acknowledgements
Some of the code used in taxopy was taken from the CAT/BAT tool for taxonomic classification of contigs and metagenome-assembled genomes.