CuratedAtlasQuery is a query interface that allows the programmatic
exploration and retrieval of the harmonised, curated and reannotated
CELLxGENE single-cell human cell atlas. Data can be retrieved at cell,
sample, or dataset levels based on filtering criteria.
Harmonised data is stored in the ARDC Nectar Research Cloud, and most
CuratedAtlasQuery functions interact with Nectar via web requests, so
a network connection is required for most functionality.
Usage
The API has delivered more than 15Tb of data to the community in the first year. Thanks!
This converts the H5 SingleCellExperiment to Seurat, so it might take a
long time and occupy a lot of memory depending on how many cells you are
requesting.
single_cell_counts_seurat <-
  metadata |>
  dplyr::filter(
    ethnicity == "African" &
      stringr::str_like(assay, "%10x%") &
      tissue == "lung parenchyma" &
      stringr::str_like(cell_type, "%CD4%")
  ) |>
  get_seurat()
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
single_cell_counts_seurat
#> An object of class Seurat
#> 36229 features across 1571 samples within 1 assay
#> Active assay: originalexp (36229 features, 0 variable features)
Save your SingleCellExperiment
The returned SingleCellExperiment can be saved with two modalities, as
.rds or as HDF5.
Saving as RDS (fast saving, slow reading)
Saving as .rds has the advantage of being fast, and the .rds file
occupies very little disk space as it only stores the links to the files
in your cache.
However, it has the disadvantage that for big SingleCellExperiment
objects, which merge a lot of HDF5 files from your
get_single_cell_experiment call, display and manipulation are going to
be slow. In addition, an .rds saved in this way is not portable: you
will not be able to share it with other users.
Saving as HDF5 (slow saving, fast reading)
Saving as .hdf5 executes any computation on the SingleCellExperiment
and writes it to disk as a monolithic HDF5. Once this is done,
operations on the SingleCellExperiment will be comparatively very
fast. The resulting .hdf5 file will also be totally portable and
sharable.
However, this .hdf5 has the disadvantage of being larger than the
corresponding .rds as it includes a copy of the count information, and
the saving process is going to be slow for large objects.
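The two saving modalities can be sketched as follows. This is a minimal illustration, not the package's prescribed workflow: it assumes the SingleCellExperiment and HDF5Array packages are installed, and it uses a small in-memory toy object (`toy_sce`, a hypothetical stand-in for the result of get_single_cell_experiment), so it shows the calls involved rather than the difference in file size or speed described above.

```r
library(SingleCellExperiment)
library(HDF5Array)

# Toy object standing in for a queried SingleCellExperiment (values are random)
toy_counts <- matrix(
  rpois(20, lambda = 5),
  nrow = 4,
  dimnames = list(paste0("gene_", 1:4), paste0("cell_", 1:5))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))

# Modality 1: serialise the R object itself to a single .rds file
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_sce, rds_path)
from_rds <- readRDS(rds_path)

# Modality 2: write the assay data to a self-contained HDF5-backed directory
h5_dir <- tempfile("toy_sce_h5_")
saveHDF5SummarizedExperiment(toy_sce, dir = h5_dir, replace = TRUE)
from_h5 <- loadHDF5SummarizedExperiment(h5_dir)
```

With a real query result, the HDF5 directory contains its own copy of the counts, which is what makes it shareable; the temporary paths used here are placeholders.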
library(ggplot2)
metadata |>
  # Filter and subset
  dplyr::filter(cell_type_harmonised == "nk") |>
  # Get counts per million for the HLA-A gene
  get_single_cell_experiment(assays = "cpm", features = "HLA-A") |>
  # Plot (styling code has been omitted)
  tidySingleCellExperiment::join_features("HLA-A", shape = "wide") |>
  ggplot(aes(tissue_harmonised, `HLA.A`, color = file_id)) +
  geom_jitter(shape = ".")
#> ℹ Realising metadata.
#> ℹ Synchronising files
#> ℹ Downloading 0 files, totalling 0 GB
#> ℹ Reading files.
#> ℹ Compiling Single Cell Experiment.
Obtain Unharmonised Metadata
Various metadata fields are not common between datasets, so it does
not make sense for these to live in the main metadata table. However, we
can obtain them using the get_unharmonised_metadata() function. This
function returns a data frame with one row per dataset, including the
unharmonised column, which contains unharmonised metadata as a nested
data frame.
cell_annotation_blueprint_singler: SingleR cell annotation using
Blueprint reference
cell_annotation_blueprint_monaco: SingleR cell annotation using
Monaco reference
sample_id_db: Sample subdivision for internal use
file_id_db: File subdivision for internal use
sample_: Sample ID
.sample_name: How samples were defined
RNA abundance
The raw assay includes RNA abundance in the positive real scale (not
transformed with non-linear functions, e.g. log, sqrt). The original
CELLxGENE data include a mix of scales and transformations, specified in
the x_normalization column.
CuratedAtlasQueryR
Query interface
Installation
Load the package
Load and explore the metadata
Load the metadata
The metadata variable can then be re-used for all subsequent queries.
Explore the tissue
Download single-cell RNA sequencing counts
Query raw counts
Query counts scaled per million
This is helpful if just a few genes are of interest, as they can be compared across samples.
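As a worked illustration of why scaling helps, the sketch below computes counts per million on a toy matrix in base R. It assumes the standard definition (each count divided by its cell's total count, times one million); the package's own cpm assay may be computed differently, and the gene and cell values here are invented.

```r
# Toy raw counts: rows are genes, columns are cells (values are made up)
toy_counts <- matrix(
  c(10, 0, 90,   # cell_1, total 100
     5, 5, 40),  # cell_2, total 50
  nrow = 3,
  dimnames = list(c("HLA-A", "PUM1", "ACTB"), c("cell_1", "cell_2"))
)

# Library size: total counts observed in each cell
library_size <- colSums(toy_counts)

# Counts per million: divide each column by its library size, scale by 1e6
toy_cpm <- sweep(toy_counts, 2, library_size, "/") * 1e6

toy_cpm["HLA-A", ]
```

The raw HLA-A counts differ between the two cells (10 versus 5), but after scaling both equal 100000, because each cell devotes the same fraction of its counts to that gene.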
Extract only a subset of genes
Extract the counts as a Seurat object
This converts the H5 SingleCellExperiment to Seurat, so it might take a long time and occupy a lot of memory depending on how many cells you are requesting.
Save your SingleCellExperiment
The returned SingleCellExperiment can be saved with two modalities, as .rds or as HDF5.
Saving as RDS (fast saving, slow reading)
Saving as .rds has the advantage of being fast, and the .rds file occupies very little disk space as it only stores the links to the files in your cache.
However, it has the disadvantage that for big SingleCellExperiment objects, which merge a lot of HDF5 files from your get_single_cell_experiment call, display and manipulation are going to be slow. In addition, an .rds saved in this way is not portable: you will not be able to share it with other users.
Saving as HDF5 (slow saving, fast reading)
Saving as .hdf5 executes any computation on the SingleCellExperiment and writes it to disk as a monolithic HDF5. Once this is done, operations on the SingleCellExperiment will be comparatively very fast. The resulting .hdf5 file will also be totally portable and sharable.
However, this .hdf5 has the disadvantage of being larger than the corresponding .rds as it includes a copy of the count information, and the saving process is going to be slow for large objects.
Visualise gene transcription
We can gather all CD14 monocyte cells and plot the distribution of HLA-A across all tissues.
Obtain Unharmonised Metadata
Various metadata fields are not common between datasets, so it does not make sense for these to live in the main metadata table. However, we can obtain them using the get_unharmonised_metadata() function. This function returns a data frame with one row per dataset, including the unharmonised column, which contains unharmonised metadata as a nested data frame.
Notice that the columns differ between each dataset’s data frame:
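To make the nested structure concrete, here is a small base R sketch that does not call the package. The `datasets` object is a hypothetical stand-in for the returned data frame, and every file ID, column name and value in it is invented for illustration.

```r
# Toy stand-in for the result of get_unharmonised_metadata(): one row per dataset
datasets <- data.frame(file_id = c("dataset_a", "dataset_b"))

# Each row holds its own data frame of dataset-specific (unharmonised) fields
datasets$unharmonised <- list(
  data.frame(cell_ = c("cell_1", "cell_2"), donor_bmi = c(22.1, 27.4)),
  data.frame(cell_ = "cell_3", tumour_stage = "II")
)

# List the column names found in each dataset's nested data frame
columns_per_dataset <- lapply(datasets$unharmonised, names)
names(columns_per_dataset) <- datasets$file_id

# Columns present in every dataset; everything else is dataset-specific
shared_columns <- Reduce(intersect, columns_per_dataset)
```

In this toy case only `cell_` is shared, which is why such fields are kept in a per-dataset nested data frame instead of the main metadata table.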
Cell metadata
Dataset-specific columns (definitions available at cellxgene.cziscience.com)
cell_count, collection_id, created_at.x, created_at.y, dataset_deployments, dataset_id, file_id, filename, filetype, is_primary_data.y, is_valid, linked_genesets, mean_genes_per_cell, name, published, published_at, revised_at, revision, s3_uri, schema_version, tombstone, updated_at.x, updated_at.y, user_submitted, x_normalization
Sample-specific columns (definitions available at cellxgene.cziscience.com)
sample_, sample_name, age_days, assay, assay_ontology_term_id, development_stage, development_stage_ontology_term_id, ethnicity, ethnicity_ontology_term_id, experiment___, organism, organism_ontology_term_id, sample_placeholder, sex, sex_ontology_term_id, tissue, tissue_harmonised, tissue_ontology_term_id, disease, disease_ontology_term_id, is_primary_data.x
Cell-specific columns (definitions available at cellxgene.cziscience.com)
cell_, cell_type, cell_type_ontology_term_idm, cell_type_harmonised, confidence_class, cell_annotation_azimuth_l2, cell_annotation_blueprint_singler
Through harmonisation and curation we introduced custom columns, not present in the original CELLxGENE metadata:
tissue_harmonised: a coarser tissue name for better filtering
age_days: the number of days corresponding to the age
cell_type_harmonised: the consensus call identity (for immune cells) using the original and three novel annotations using Seurat Azimuth and SingleR
confidence_class: an ordinal class of how confident cell_type_harmonised is. 1 is complete consensus, 2 is three out of four, and so on.
cell_annotation_azimuth_l2: Azimuth cell annotation
cell_annotation_blueprint_singler: SingleR cell annotation using Blueprint reference
cell_annotation_blueprint_monaco: SingleR cell annotation using Monaco reference
sample_id_db: Sample subdivision for internal use
file_id_db: File subdivision for internal use
sample_: Sample ID
.sample_name: How samples were defined
RNA abundance
The raw assay includes RNA abundance in the positive real scale (not transformed with non-linear functions, e.g. log, sqrt). The original CELLxGENE data include a mix of scales and transformations, specified in the x_normalization column.
The cpm assay includes counts per million.
Session Info