TENxIO allows users to import 10X pipeline files into known
Bioconductor classes. The package is not comprehensive; some file
types are not supported. For Visium datasets, we direct users to
the VisiumIO package on Bioconductor. TENxIO consolidates
functionality from DropletUtils. If you would like a file format to be
supported, open an issue at https://github.com/waldronlab/TENxIO.
Supported Formats
| Extension | Class | Imported as |
|---|---|---|
| .h5 | TENxH5 | SingleCellExperiment w/ TENxMatrix |
| .mtx / .mtx.gz | TENxMTX | SummarizedExperiment w/ dgCMatrix |
| .tar.gz | TENxFileList | SingleCellExperiment w/ dgCMatrix |
| peak_annotation.tsv | TENxPeaks | GRanges |
| fragments.tsv.gz | TENxFragments | RaggedExperiment |
| .tsv / .tsv.gz | TENxTSV | tibble |
| spatial.tar.gz | TENxSpatialList | inter. DataFrame list |
Tested 10X Products
We have tested these functions with some datasets from 10x
Genomics, including those from:
Single Cell Gene Expression
Single Cell ATAC
Single Cell Multiome ATAC + Gene Expression
Spatial Gene Expression
Note. Extensive testing has not been performed, and the codebase may
require some adaptation to ensure compatibility with all pipeline
outputs.
Bioconductor implementations
We are aware of existing functionality in both DropletUtils and
SpatialExperiment. We are working with the authors of those packages
to cover the use cases in both packages and possibly port I/O
functionality into TENxIO. We are using long tests and the
DropletTestFiles package to cover example datasets on ExperimentHub.
If you would like to know more, see the longtests directory on GitHub.
Installation
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("waldronlab/TENxIO")
Load the package
library(TENxIO)
Description
TENxIO offers a set of classes that allow users to easily work with
files typically obtained from the 10X Genomics website. Generally, these
are outputs from the Cell Ranger pipeline.
Procedure
Loading the data into a Bioconductor class is a two-step process. First,
the file must be identified by either the user or the TENxFile
function. The appropriate function will be invoked to provide a TENxIO
class representation, e.g., TENxH5 for HDF5 files with an .h5
extension. Second, the import method for that particular file class
will render a common Bioconductor class representation for the user. The
main representations used by the package are SingleCellExperiment,
SummarizedExperiment, GRanges, and RaggedExperiment.
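The two steps can be sketched as follows. This is a minimal example, not a definitive recipe: it assumes that a small example file named pbmc_granulocyte_ff_bc_ex.h5 is installed in the package's extdata directory (the file name is an assumption; substitute the path to your own .h5 file if it is not found).

```r
library(TENxIO)

# Assumption: this example file ships with TENxIO. mustWork = TRUE makes
# the call fail loudly if the assumption is wrong.
h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)

# Step 1: identify the file. TENxFile() is expected to pick a file class
# based on the extension (here, an HDF5 file class).
con <- TENxFile(h5f)
class(con)

# Step 2: import the file class into a common Bioconductor representation.
sce <- import(con)
sce
```

Calling the specific constructor (for example, TENxH5) directly is the alternative when the user already knows the file type.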
Dataset versioning
The versioning schema in the package mostly applies to HDF5 resources
and is loosely based on versions of 10X datasets. Version 3 datasets
usually contain ranged information at specific locations in the data
file. Version 2 datasets usually contain a genes.tsv file, rather than
features.tsv as in version 3. If the file version is unknown, the
software will attempt to derive the version from the data where
possible.
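To illustrate the genes.tsv versus features.tsv distinction, here is a small base R sketch. The helper guess_version is hypothetical and is not part of TENxIO; the package's own version detection may work differently.

```r
# Hypothetical helper (not part of TENxIO): guess a dataset version from
# the file names that accompany a count matrix.
guess_version <- function(files) {
    files <- basename(files)
    if (any(grepl("^features\\.tsv(\\.gz)?$", files))) {
        "3"
    } else if (any(grepl("^genes\\.tsv(\\.gz)?$", files))) {
        "2"
    } else {
        NA_character_  # version cannot be derived from file names alone
    }
}

v3 <- guess_version(c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"))
v2 <- guess_version(c("barcodes.tsv", "genes.tsv", "matrix.mtx"))
unknown <- guess_version("matrix.mtx")
```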
File classes
TENxFile
The TENxFile class is the catch-all superclass that allows
transition to subclasses pertinent to specific files. It inherits from
the BiocFile class and allows for easy dispatching of import methods.
ExperimentHub resources
TENxFile can handle resources from ExperimentHub with careful
inputs. For example, one can import a TENxBrainData dataset via the
appropriate ExperimentHub identifier (EH1039):
hub <- ExperimentHub::ExperimentHub()
#> snapshotDate(): 2025-04-21
hub["EH1039"]
#> ExperimentHub with 1 record
#> # snapshotDate(): 2025-04-21
#> # names(): EH1039
#> # package(): TENxBrainData
#> # $dataprovider: 10X Genomics
#> # $species: Mus musculus
#> # $rdataclass: character
#> # $rdatadateadded: 2017-10-26
#> # $title: Brain scRNA-seq data, 'HDF5-based 10X Genomics' format
#> # $description: Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally pro...
#> # $taxonomyid: 10090
#> # $genome: mm10
#> # $sourcetype: HDF5
#> # $sourceurl: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
#> # $sourcesize: NA
#> # $tags: c("SequencingData", "RNASeqData", "ExpressionData", "SingleCell")
#> # retrieve record with 'object[["EH1039"]]'
Currently, ExperimentHub resources do not have an extension, so it is
best to provide one to the TENxFile constructor function.
fname <- hub[["EH1039"]]
TENxFile(fname, extension = "h5", group = "mm10", version = "2")
Note. EH1039 is a large (~4 GB) file. Files without an extension, such
as those obtained from ExperimentHub, will emit a warning so that the
user is aware that the import operation may fail, especially if the
internal structure of the file is modified.
TENxH5
TENxIO mainly supports version 3 and version 2 H5 files. These are
files with specific groups and names as seen in h5.version.map, an
internal data.frame map that guides the import operations.
TENxIO:::h5.version.map
#> Version ID Symbol Type Ranges
#> 1 3 /features/id /features/name /features/feature_type /features/interval
#> 2 2 /genes /gene_names <NA> <NA>
If a file lacks genomic coordinate information, the constructor
function can take an NA_character_ input for the ranges argument.
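A sketch of that call is shown here. The example file name is an assumption about what ships in the package's extdata directory, and that file does carry coordinates, so passing NA_character_ only demonstrates the argument.

```r
library(TENxIO)

# Assumption: this example file is installed with TENxIO.
h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)

# Tell the constructor not to look for genomic coordinates in the file
con_noranges <- TENxH5(h5f, ranges = NA_character_)
con_noranges
```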
The TENxH5 constructor function can be used on either version of these
H5 files. In this example, we use a subset of the PBMC granulocyte H5
file obtained from the 10X
website.
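The steps for this example might look like the sketch below. The file name is an assumption about the package's extdata contents, and h5ls comes from the rhdf5 package, which must be installed.

```r
library(TENxIO)

# Assumption: a subset of the PBMC granulocyte file ships under this name.
h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)

# Inspect the HDF5 groups and datasets; compare them with h5.version.map
rhdf5::h5ls(h5f)

# Build the file representation; printing it summarizes what was detected
con <- TENxH5(h5f)
con

# Convert the file representation to a Bioconductor class
sce <- import(con)
sce
```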
Note. Although the main representation in the package is
SingleCellExperiment, there could be a need for alternative data class
representations of the data. The projection field in the TENxH5 show
method is an initial attempt to allow alternative representations.
TENxMTX
Matrix Market formats are also supported (.mtx extension). These are
typically imported as SummarizedExperiment as they usually contain count
data.
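A minimal Matrix Market import could look like this, assuming an example file named pbmc_3k_ff_bc_ex.mtx is installed with the package (the name is an assumption; any .mtx or .mtx.gz path can be used instead).

```r
library(TENxIO)

# Assumption: this example Matrix Market file ships with TENxIO.
mtxf <- system.file(
    "extdata", "pbmc_3k_ff_bc_ex.mtx",
    package = "TENxIO", mustWork = TRUE
)

# File representation for the .mtx resource
con <- TENxMTX(mtxf)
con

# Import the counts; a SummarizedExperiment is the expected result
se <- import(con)
se
```

Because an .mtx file holds only the matrix, feature and barcode labels have to come from companion files, which is what the tarball import addresses.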
TENxFileList
Generally, the 10X website will provide tarballs (with a .tar.gz
extension) which can be imported with the TENxFileList class. The
tarball can contain components of a gene expression experiment including
the matrix data, row data (aka ‘features’) expressed as Ensembl
identifiers, gene symbols, etc. and barcode information for the columns.
The TENxFileList class allows importing multiple files within a
tar.gz archive. The untar function with the list = TRUE argument
shows all the file names in the tarball.
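The listing step uses only base R. The sketch below builds a toy tarball with placeholder files (the file names and contents are illustrative, not real 10X output) and lists it without extracting anything.

```r
# Build a toy tarball in a temporary directory (contents are placeholders)
demo_dir <- tempfile("tenx_demo_")
dir.create(demo_dir)
demo_files <- c("barcodes.tsv", "features.tsv", "matrix.mtx")
for (f in demo_files) writeLines("placeholder", file.path(demo_dir, f))

tarball <- tempfile(fileext = ".tar.gz")
old_wd <- setwd(demo_dir)  # archive relative paths only
tar(tarball, files = demo_files, compression = "gzip")
setwd(old_wd)

# List the archive contents without extracting
contents <- untar(tarball, list = TRUE)
contents
```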
We then use the import method across all file types to obtain an
integrated Bioconductor representation that is ready for analysis. Files
in TENxFileList can be represented as a SingleCellExperiment with
row names and column names.
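For a real tarball, the import might be sketched as below. The archive name is an assumption about the example data installed with the package; replace it with the path to a downloaded .tar.gz file as needed.

```r
library(TENxIO)

# Assumption: this example tarball ships with TENxIO.
tarf <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ff_bc_ex_matrix.tar.gz",
    package = "TENxIO", mustWork = TRUE
)

# Represent the archive and its member files
con <- TENxFileList(tarf)
con

# Combine matrix, features, and barcodes into one object
sce <- import(con)
sce
```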
TENxPeaks
Peak files can be handled with the TENxPeaks class. These files are
usually named *peak_annotation files with a .tsv extension. Peak
files are represented as GRanges.
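As an illustration, a peak annotation file could be read like this. The file name below is assumed to be among the package's example data; it is not confirmed by the text above.

```r
library(TENxIO)

# Assumption: this example peak annotation file ships with TENxIO.
peakf <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ex_atac_peak_annotation.tsv",
    package = "TENxIO", mustWork = TRUE
)

# File representation, then import to genomic ranges
con <- TENxPeaks(peakf)
peaks <- import(con)
peaks
```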
TENxFragments
Fragment files are quite large and we make use of the Rsamtools
package to import them with the yieldSize parameter. By default, we
use a yieldSize of 200.
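Setting yieldSize explicitly might look like the following sketch. It assumes an indexed example file named pbmc_3k_atac_ex_fragments.tsv.gz (with its .tbi index) is installed with the package; the name is an assumption.

```r
library(TENxIO)

# Assumption: this indexed example fragments file ships with TENxIO.
fragf <- system.file(
    "extdata", "pbmc_3k_atac_ex_fragments.tsv.gz",
    package = "TENxIO", mustWork = TRUE
)

# Limit the number of records read with an explicit yieldSize
con <- TENxFragments(fragf, yieldSize = 200)
con

# Import the fragments; a RaggedExperiment is the expected result
fragments <- import(con)
fragments
```

A small yieldSize keeps memory use low while exploring a file; larger values read more records per import.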
Note. For the PBMC granulocyte H5 file, the h5ls function gives an overview of the structure of the file. It matches version 3 in our version map.
The show method gives an overview of the data components in the file.
import TENxH5 method
We can simply use the import method to convert the file representation to a Bioconductor class representation, typically a SingleCellExperiment.
import MTX method
The import method yields a SummarizedExperiment without colnames or rownames.
For fragment files, we internally use the TabixFile constructor function to work with indexed tsv.gz files.
Note. A warning is emitted whenever a yieldSize parameter is not set.
Because there may be a variable number of fragments per barcode, we use a RaggedExperiment representation for this file type.
Operations similar to those used with SummarizedExperiment are supported. For example, the genomic ranges can be displayed via rowRanges.