genomepy is designed to provide a simple and straightforward way to download and use genomic data.
This includes (1) searching available data,
(2) showing the available metadata,
(3) automatically downloading, preprocessing and matching data and
(4) generating optional aligner indexes.
All with sensible, yet controllable defaults.
Currently, genomepy supports Ensembl, UCSC, NCBI and GENCODE.
Pssst, hey there! Is genomepy not doing what you want? Does it fail? Is it clunky?
Is the documentation unclear? Have any other ideas on how to improve it?
Don’t be shy and let us know!
name    provider  accession        tax_id  annotation  species      other_info
GRCz11  Ensembl   GCA_000002035.4  7955    ✓           Danio rerio  2017-08-Ensembl/2018-04
 ^
 Use name for genomepy install
The default genomes directory: ~/.local/share/genomes/
Command line interface
All commands come with a short explanation when appended with -h/--help.
$ genomepy --help
Usage: genomepy [OPTIONS] COMMAND [ARGS]...

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  annotation  show 1st lines of each annotation
  clean       remove provider data
  config      manage configuration
  genomes     list available genomes
  install     install a genome & run active plugins
  plugin      manage plugins
  providers   list available providers
  search      search for genomes
Search genomes & gene annotations
Let’s say we want to download a Xenopus tropicalis genome & gene annotation.
First, let’s find out what’s out there!
You can search by name, taxonomy ID or assembly accession ID.
Additionally, you can limit the search result to one provider with -p/--provider.
Furthermore, you can get the absolute --size of each genome (this option slows down the search).
$ genomepy search xenopus tro
name                     provider  accession        tax_id  annotation  species             other_info
                                                            n r e k
Xenopus_tropicalis_v9.1  Ensembl   GCA_000004195.3  8364    ✓           Xenopus tropicalis  2019-04-Ensembl/2019-12
xenTro1                  UCSC      na               8364    ✗ ✗ ✗ ✗     Xenopus tropicalis  Oct. 2004 (JGI 3.0/xenTro1)
xenTro2                  UCSC      na               8364    ✗ ✓ ✓ ✗     Xenopus tropicalis  Aug. 2005 (JGI 4.1/xenTro2)
xenTro3                  UCSC      GCA_000004195.1  8364    ✗ ✓ ✓ ✗     Xenopus tropicalis  Nov. 2009 (JGI 4.2/xenTro3)
xenTro7                  UCSC      GCA_000004195.2  8364    ✓ ✓ ✗ ✗     Xenopus tropicalis  Sep. 2012 (JGI 7.0/xenTro7)
xenTro9                  UCSC      GCA_000004195.3  8364    ✓ ✓ ✓ ✗     Xenopus tropicalis  Jul. 2016 (Xenopus_tropicalis_v9.1/xenTro9)
Xtropicalis_v7           NCBI      GCF_000004195.2  8364    ✓           Xenopus tropicalis  DOE Joint Genome Institute
Xenopus_tropicalis_v9.1  NCBI      GCF_000004195.3  8364    ✓           Xenopus tropicalis  DOE Joint Genome Institute
UCB_Xtro_10.0            NCBI      GCF_000004195.4  8364    ✓           Xenopus tropicalis  University of California, Berkeley
ASM1336827v1             NCBI      GCA_013368275.1  8364    ✗           Xenopus tropicalis  Southern University of Science and Technology
 ^
 Use name for genomepy install
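To make the three search modes concrete, here is a minimal Python sketch of a lookup by name, taxonomy ID or accession, with an optional provider filter. It is not genomepy's implementation: the `GenomeRecord` class, the `search` function and the case-insensitive substring matching are assumptions made for illustration, and the records are a few rows copied from the console outputs in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    """One row of search output (hypothetical container, not a genomepy class)."""
    name: str
    provider: str
    accession: str
    tax_id: int
    species: str


# A few rows copied from the console outputs in this README.
RECORDS = [
    GenomeRecord("Xenopus_tropicalis_v9.1", "Ensembl", "GCA_000004195.3", 8364, "Xenopus tropicalis"),
    GenomeRecord("xenTro9", "UCSC", "GCA_000004195.3", 8364, "Xenopus tropicalis"),
    GenomeRecord("UCB_Xtro_10.0", "NCBI", "GCF_000004195.4", 8364, "Xenopus tropicalis"),
    GenomeRecord("GRCz11", "Ensembl", "GCA_000002035.4", 7955, "Danio rerio"),
]


def search(term, provider=None, records=RECORDS):
    """Return records matching a name, taxonomy ID or accession.

    Assumption: purely numeric terms are treated as taxonomy IDs; anything
    else is a case-insensitive substring of the name, species or accession.
    """
    term = str(term).strip().lower()
    hits = []
    for rec in records:
        # Optional provider filter, mirroring the idea of -p/--provider.
        if provider is not None and rec.provider.lower() != provider.lower():
            continue
        if term.isdigit():
            matched = rec.tax_id == int(term)
        else:
            fields = (rec.name.lower(), rec.species.lower(), rec.accession.lower())
            matched = any(term in field for field in fields)
        if matched:
            hits.append(rec)
    return hits
```

For example, `search("xenopus tro", provider="UCSC")` returns only the `xenTro9` record from this small sample.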
Inspect gene annotations
Let’s say we want to download the Xenopus tropicalis genome & gene annotation from UCSC.
Since we are interested in the gene annotation as well, we should check which gene annotation suits our needs.
As you can see in the search results, UCSC has several gene annotations for us to choose from.
In the search results, n r e k denotes which UCSC annotations are available.
These stand for ncbiRefSeq, refGene, ensGene and knownGene, respectively.
We can quickly inspect these with the genomepy annotation command:
Here we can see that the refGene annotation has actual HGNC gene names, so let’s go with this annotation.
This differs between assemblies, so be sure to check!
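The n r e k column can also be decoded programmatically. The sketch below maps the four marks to the annotation names listed above. The helper name `available_annotations` is hypothetical, and it assumes the marks are whitespace-separated ✓/✗ characters as printed in the search output.

```python
# Order of the UCSC annotation flags, as given in the text: n, r, e, k.
UCSC_ANNOTATIONS = ("ncbiRefSeq", "refGene", "ensGene", "knownGene")


def available_annotations(flags: str) -> list[str]:
    """Translate a flag string such as "✓ ✓ ✓ ✗" into annotation names."""
    marks = flags.split()
    if len(marks) != len(UCSC_ANNOTATIONS):
        raise ValueError(f"expected {len(UCSC_ANNOTATIONS)} marks, got {len(marks)}")
    # Keep only the annotations whose mark is a check.
    return [name for name, mark in zip(UCSC_ANNOTATIONS, marks) if mark == "✓"]
```

With the xenTro9 row from the search results, `available_annotations("✓ ✓ ✓ ✗")` yields ncbiRefSeq, refGene and ensGene.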
Install a genome & gene annotation
Copy the name returned by the search function to install.
$ genomepy install xenTro9
You can choose to download gene annotation files with the -a/--annotation option.
$ genomepy install xenTro9 --annotation
For UCSC we can also select the annotation type.
See genomepy install --help for all provider specific options.
Since we did not specify the provider here, genomepy will use the first provider with xenTro9.
You can specify a provider by name with -p/--provider:
$ genomepy install xenTro9 -p UCSC
Downloading genome from http://hgdownload.soe.ucsc.edu/goldenPath/xenTro9/bigZips/xenTro9.fa.gz...
Genome download successful, starting post processing...
name: xenTro9
local name: xenTro9
fasta: ~/.local/share/genomes/xenTro9/xenTro9.fa
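The "first provider with this name" rule can be pictured as a simple ordered lookup. The following Python sketch is an illustration only: the provider order (taken from the order in which providers are listed in this README), the `CATALOG` contents (names from the console outputs above) and the function `resolve_provider` are assumptions, not genomepy internals.

```python
# Assumed search order; genomepy's real order may differ.
PROVIDER_ORDER = ("Ensembl", "UCSC", "NCBI", "GENCODE")

# Hypothetical catalog built from genome names shown in this README.
CATALOG = {
    "Ensembl": {"Xenopus_tropicalis_v9.1", "GRCz11"},
    "UCSC": {"xenTro1", "xenTro2", "xenTro3", "xenTro7", "xenTro9", "sacCer3"},
    "NCBI": {"Xtropicalis_v7", "Xenopus_tropicalis_v9.1", "UCB_Xtro_10.0", "ASM1336827v1"},
    "GENCODE": set(),
}


def resolve_provider(name, provider=None):
    """Return the provider to use for a genome name.

    With an explicit provider, only that provider is checked; otherwise the
    first provider in PROVIDER_ORDER that offers the name wins.
    """
    if provider is not None:
        candidates = [p for p in PROVIDER_ORDER if p.lower() == provider.lower()]
        if not candidates:
            raise ValueError(f"unknown provider: {provider}")
    else:
        candidates = list(PROVIDER_ORDER)
    for candidate in candidates:
        if name in CATALOG[candidate]:
            return candidate
    raise LookupError(f"{name} not found")
```

Note that a name such as Xenopus_tropicalis_v9.1 exists at more than one provider, which is exactly the case where passing -p/--provider explicitly matters.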
Next, the genome is downloaded to the directory specified in the config file (by default ~/.local/share/genomes).
To choose a different directory, use the -g/--genomes_dir option:
$ genomepy install sacCer3 -p UCSC -g /path/to/my/genomes
Downloading genome from http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz...
Genome download successful, starting post processing...
name: sacCer3
local name: sacCer3
fasta: /path/to/my/genomes/sacCer3/sacCer3.fa
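Both console outputs above place the FASTA at `<genomes_dir>/<name>/<name>.fa`. A small Python helper can compute that expected location, which is handy in downstream scripts. This is a sketch based only on the paths printed above; `expected_fasta` is a hypothetical helper, and it does not read genomepy's config file.

```python
from pathlib import Path

# Default genomes directory as stated in this README.
DEFAULT_GENOMES_DIR = "~/.local/share/genomes"


def expected_fasta(name, genomes_dir=None):
    """Return the path where the FASTA for `name` is expected after install.

    Assumes the layout <genomes_dir>/<name>/<name>.fa seen in the outputs above.
    """
    base = Path(genomes_dir if genomes_dir is not None else DEFAULT_GENOMES_DIR)
    # Expand a leading "~" to the current user's home directory.
    return base.expanduser() / name / f"{name}.fa"
```

For instance, `expected_fasta("sacCer3", "/path/to/my/genomes")` gives `/path/to/my/genomes/sacCer3/sacCer3.fa`, matching the output shown above.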
Regex, masking & compression
You can use a regular expression to filter for matching sequences
(or non-matching sequences by using the -n/--no-match option).
For instance, the following command downloads hg38 and saves only the major chromosomes:
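The filtering idea itself can be illustrated in plain Python, independent of genomepy. The sketch below keeps sequence names that match a regular expression, or the non-matching ones when `invert=True` (the idea behind -n/--no-match). The pattern `^chr[0-9XY]+$` and the function `filter_sequences` are assumptions chosen for this example; how genomepy applies the pattern internally is not shown here.

```python
import re


def filter_sequences(names, pattern, invert=False):
    """Keep names matching `pattern`; with invert=True keep the others."""
    regex = re.compile(pattern)
    # bool(match) != invert keeps matches normally and non-matches when inverted.
    return [name for name in names if bool(regex.search(name)) != invert]


# A handful of hg38-style sequence names.
names = ["chr1", "chr2", "chrX", "chrY", "chrM", "chr1_KI270706v1_random", "chrUn_GL000195v1"]

# Assumed pattern: autosomes plus X and Y, nothing after the chromosome label.
major = filter_sequences(names, r"^chr[0-9XY]+$")
other = filter_sequences(names, r"^chr[0-9XY]+$", invert=True)
```

Here `major` holds chr1, chr2, chrX and chrY, while `other` holds chrM and the two unplaced scaffolds. Anchoring the pattern with `$` is what excludes names like chr1_KI270706v1_random.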
genomepy: genes and genomes at your fingertips
Installation
genomepy requires Python 3.9+
You can install genomepy via bioconda, pip or git.
Bioconda
Pip
With the Pip installation, you will have to install additional dependencies, and make them available in your PATH.
To read/write bgzipped genomes you will have to install pysam.
If you want to use gene annotation features, you will have to install the following utilities:
genePredToBed
genePredToGtf
bedToGenePred
gtfToGenePred
gff3ToGenePred
You can find the binaries here.
Git
Quick usage
$ genomepy search zebrafish
$ genomepy install --annotation GRCz11 --provider ensembl
The default genomes directory:
~/.local/share/genomes/