Tutorial: Integrating QIIME2 and R for data visualization and analysis using qiime2R (v0.99.6)
Background
The qiime artifact is a method for storing the input and outputs for QIIME2 along with associated metadata and provenance information about how the object was formed. This method of storing objects has a number of obvious advantages; however, on the surface it does not lend itself to easy import to R for the R-minded data scientist. In reality, the .qza file is a compressed directory with an intuitive structure.
While it is possible to export data on a one-by-one basis from the qiime artifacts using qiime’s built in suite of export features this is problematic and runs antithetical to the purpose of the artifact format for the following reasons:
Export of data from the artifact using QIIME2 requires an installation which may not be available on the user’s computer and may not be trivial to install for a novice user
Export of the data will loose the associated provenance information. Now the origin of the data can’t be traced and the parameters that led to its generation have been lost.
Export of the data on a one-by-one basis is tedious and creates multiple copies of intermediate files
R has many options for advanced data analysis and/or visualization that may not natively supported in QIIME or python environments
This package is trying to simplify the process of getting the artifact into R without discarding any of the associated data through a simple read_qza function. The artifact is unpacked in to a temporary directory and the raw data and associated metadata are read into a named list (see below). Data are typically returned as either a data.frame, phylo object (trees), or DNAStringSets (nucleic acid sequences).
Functions
read_qza() - Function for reading artifacts (.qza).
qza_to_phyloseq() - Imports multiple artifacts to produce a phyloseq object.
write_q2manifest() - Writes a read manifest file to import data into qiime2
theme_q2r() - A ggplot2 theme for for clean figures.
print_provenance() - A function to display provenance information.
is_q2metadata() - A function to check if a file is a qiime2 metadata file.
parse_taxonomy() - A function to parse taxonomy strings and return a table where each column is a taxonomic class.
parse_ordination() - A function to parse the internal ordination format.
read_q2biom() - A function for reading QIIME2 biom files in format v2.1
make_clr() - Transform feature table using centered log2 ratio.
make_proportion() - Transform feature table to proportion (sum to 1).
make_percent() - Transform feature to percent (sum to 100).
interactive_table() - Create an interactive table in Rstudio viewer or rmarkdown html.
summarize_taxa()- Create a list of tables with abundances sumed to each taxonomic level.
taxa_barplot() - Create a stacked barplot using ggplot2.
taxa_heatmap() - Create a heatmap of taxonomic abundances using gplot2.
corner() - Show top corner of a large table-like obejct.
min_nonzero() - Find the smallest non-zero, non-NA in a numeric vector.
mean_sd() - Return mean and standard deviation for plotting.
subsample_table() - Subsample a table with or without replacement.
filter_features() - Remove low abundance features by number of counts and number of samples they appear in.
Installing qiime2R
qiime2R is currently available via github which can easily be installed in R via the following command:
if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools")}
devtools::install_github("jbisanz/qiime2R")
Reading Artifacts (.qza)
This example is using data derived from the moving pictures tutorial. The main function we will need is read_qza():
?read_qza
read qiime2 artifacts (.qza)
Description
extracts embedded data and object metadata into an R session
Usage
read_qza(file, tmp, rm)
Arguments
file - path to the input file, ex: file="~/data/moving_pictures/table.qza"
tmp - a temporary directory that the object will be decompressed to (default="tempdir()")
rm - should the decompressed object be removed at completion of function (T/F default=TRUE)
Value
a named list of the following objects:
artifact$data - the raw data ex OTU table as matrix or tree in phylo format
artifact$uuid - the unique identifer of the artifact
artifact$type - the semantic type of the object (ex FeatureData[Sequence])
artifact$format - the format of the qiime artifact
artifact$provenance - information tracking how the object was created
artifact$contents - a table of all the files contained within the artifact and their file size
artifact$version - the reported version for the artifact, a warning error may be thrown if a new version is seen
We will start be reading in a table of sequence variants (SVs):
SVs<-read_qza("table.qza")
When the artifact is imported, there are a number of pieces of information included. To see them we can use the names command:
In the above file each row denotes a sequence variant where in the actual text is the hash of the complete sequence. See here for an example tool for generating hashes.
We can also look at the unique identifier for this object:
We can also print the providence; however, it is probably easier to use the q2view tool for a graphical aid in its interpretation.
print_provenance(SVs)
artifact$provenance = list 3 (118072 bytes)
. 706b6bce-8f19-4ae9-b8f5-21b14a814a1b/provenance/artifacts/4de0fc23-6462-43d3-8497-f55fc49f5db6/action/action.yaml = list 4
. . execution = list 2
. . . uuid = character 1= 8614aa95-3638-4e49-88f
. . . runtime = list 3
. . . . start = character 1= 2020-02-28T10:19:02.86
. . . . end = character 1= 2020-02-28T10:19:24.46
. . . . duration = character 1= 21 seconds, and 605755
...
Reading Metadata
If you are using a qiime2 metadata file (outlined here), you can use the supplied function read_q2metadata() as below. If using a standard tsv or csv file, you can use read.table(), readr::read_tsv(), or readr::csv().
metadata<-read_q2metadata("sample-metadata.tsv")
head(metadata) # show top lines of metadata
# SampleID barcode-sequence body-site year month day subject reported-antibiotic-usage days-since-experiment-start
#2 L1S8 AGCTGACTAGTC gut 2008 10 28 subject-1 Yes 0
#3 L1S57 ACACACTATGGC gut 2009 1 20 subject-1 No 84
#4 L1S76 ACTACGTGTGGT gut 2009 2 17 subject-1 No 112
#5 L1S105 AGTGCGATGCGT gut 2009 3 17 subject-1 No 140
#6 L2S155 ACGATGCGACCA left palm 2009 1 20 subject-1 No 84
#7 L2S175 AGCTATCCACGA left palm 2009 2 17 subject-1 No 112
Reading Taxonomy
Note, when taxonomy is imported, a single string is returned along with a confidence score. For many analysis we will want to break up this string and for that purpose the parse_taxonomy() function is provided:
A wrapper function called qza_to_phyloseq() is provided which links multiple read_qza() calls together to create a phyloseq object for subsequent analysis as per the phyloseq tutorials. An example usage is shown below:
A wrapper function called qza_to_tse() is provided which links multiple read_qza() calls together to create a TSE object for subsequent analysis. You may find some useful tutorials on:
OMA tutorials. An example usage is shown below:
In this next section, methods for publication-ready figures will be demonstrated through extensive use of the tidyverse, and particularly ggplot2 and its many extensions. If starting with R, learning these packages early on will have a bit of a learning curve, but pay off massively in the long term. Each section contains a self-contained code snippets capable of replicating the figures and links to the data used. A lot can be learned from modifying parameters of the graphing commands to customize and play around with the visualizations. Note, the pipe operator (%>%) is being used frequently here and passes the output of one line to the next. Learn more about piping in R here.
All the data we will use is from the “Moving Pictures” tutorial and/or related content which can be found here.
library(tidyverse)
library(qiime2R)
metadata<-read_q2metadata("sample-metadata.tsv")
shannon<-read_qza("shannon_vector.qza")
shannon<-shannon$data %>% rownames_to_column("SampleID") # this moves the sample names to a new column that matches the metadata and allows them to be merged
This dataset provides a good example of why you should always look at your data. There are an extra 3 samples in the metadata which do not have an assigned Shannon diversity value as seen below:
If this was your dataset, you should look into why this was and try to see why these samples have been lost. We will continue after sorting them out:
metadata<-
metadata %>%
left_join(shannon)
head(metadata)
# SampleID barcode-sequence body-site year month day subject reported-antibiotic-usage days-since-experiment-start shannon
#1 L1S8 AGCTGACTAGTC gut 2008 10 28 subject-1 Yes 0 3.188316
#2 L1S57 ACACACTATGGC gut 2009 1 20 subject-1 No 84 3.985702
#3 L1S76 ACTACGTGTGGT gut 2009 2 17 subject-1 No 112 3.461625
#4 L1S105 AGTGCGATGCGT gut 2009 3 17 subject-1 No 140 3.972339
#5 L2S155 ACGATGCGACCA left palm 2009 1 20 subject-1 No 84 5.064577
#6 L2S175 AGCTATCCACGA left palm 2009 2 17 subject-1 No 112 4.966707
Note how we have now added a column for the shannon diversity to our table which will make everything easy to plot. Given this dataset is longitudinal we can plot it as such. Note: the code I am using here would handle replicate data and plot error bars; however, this case does not have replicates. Also, a quick note is that R does generally not like -s in column names. To use these names we can surround them with ticks.
metadata %>%
filter(!is.na(shannon)) %>%
ggplot(aes(x=`days-since-experiment-start`, y=shannon, color=`body-site`)) +
stat_summary(geom="errorbar", fun.data=mean_se, width=0) +
stat_summary(geom="line", fun.data=mean_se) +
stat_summary(geom="point", fun.data=mean_se) +
xlab("Days") +
ylab("Shannon Diversity") +
theme_q2r() + # try other themes like theme_bw() or theme_classic()
scale_color_viridis_d(name="Body Site") # use different color scale which is color blind friendly
ggsave("Shannon_by_time.pdf", height=3, width=4, device="pdf") # save a PDF 3 inches by 4 inches
Imagine we wanted to instead ask if there was an effect of antibiotics on diversity. We can change the type of plot:
metadata %>%
filter(!is.na(shannon)) %>%
ggplot(aes(x=`reported-antibiotic-usage`, y=shannon, fill=`reported-antibiotic-usage`)) +
stat_summary(geom="bar", fun.data=mean_se, color="black") + #here black is the outline for the bars
geom_jitter(shape=21, width=0.2, height=0) +
coord_cartesian(ylim=c(2,7)) + # adjust y-axis
facet_grid(~`body-site`) + # create a panel for each body site
xlab("Antibiotic Usage") +
ylab("Shannon Diversity") +
theme_q2r() +
scale_fill_manual(values=c("cornflowerblue","indianred")) + #specify custom colors
theme(legend.position="none") #remove the legend as it isn't needed
ggsave("../../../images/Shannon_by_abx.pdf", height=3, width=4, device="pdf") # save a PDF 3 inches by 4 inches
Maybe instead the question is if the two individuals have different diversity?
metadata %>%
filter(!is.na(shannon)) %>%
ggplot(aes(x=subject, y=shannon, fill=`subject`)) +
stat_summary(geom="bar", fun.data=mean_se, color="black") + #here black is the outline for the bars
geom_jitter(shape=21, width=0.2, height=0) +
coord_cartesian(ylim=c(2,7)) + # adjust y-axis
facet_grid(~`body-site`) + # create a panel for each body site
xlab("Antibiotic Usage") +
ylab("Shannon Diversity") +
theme_q2r() +
scale_fill_manual(values=c("cornflowerblue","indianred")) + #specify custom colors
theme(legend.position="none") #remove the legend as it isn't needed
ggsave("Shannon_by_person.pdf", height=3, width=4, device="pdf") # save a PDF 3 inches by 4 inches
Note: It is important to remember that these a graphical tools, but to make a claim, you should also generate some form of statistical support taking the nature of the data into consideration (ie that it uses repeated sampling over time).
Plotting PCoA
Here I am going to demo a PCoA with extra metadata mapped to it. I indicate the body site, antibiotic usage, and the shannon diversity by mapping to different aesthetic parameters.
Heat maps provide a very good overview of the composition of samples but tend to not perform well for complex samples or visualizing small differences. qiime2R includes a function for making heatmaps called taxa_heatmap().
library(tidyverse)
library(qiime2R)
metadata<-read_q2metadata("sample-metadata.tsv")
SVs<-read_qza("table.qza")$data
taxonomy<-read_qza("taxonomy.qza")$data %>% parse_taxonomy()
taxasums<-summarize_taxa(SVs, taxonomy)$Genus
taxa_heatmap(taxasums, metadata, "body-site")
ggsave("heatmap.pdf", height=4, width=8, device="pdf") # save a PDF 4 inches by 8 inches
Heat maps provide a very good overview of the composition of samples but tend to not perform well for complex samples or visualizing small differences. qiime2R includes a function for making heatmaps called taxa_heatmap().
library(tidyverse)
library(qiime2R)
metadata<-read_q2metadata("sample-metadata.tsv")
SVs<-read_qza("table.qza")$data
taxonomy<-read_qza("taxonomy.qza")$data %>% parse_taxonomy()
taxasums<-summarize_taxa(SVs, taxonomy)$Genus
taxa_barplot(taxasums, metadata, "body-site")
ggsave("barplot.pdf", height=4, width=8, device="pdf") # save a PDF 4 inches by 8 inches
Differential Abundance Analysis
There are many ways to find features that are deferentially abundant across conditions. In this case we are going to use the output of q2-aldex as described here.
library(tidyverse)
library(qiime2R)
library(ggrepel) # for offset labels
library(ggtree) # for visualizing phylogenetic trees
library(ape) # for manipulating phylogenetic trees
metadata<-read_q2metadata("sample-metadata.tsv")
SVs<-read_qza("table.qza")$data
results<-read_qza("differentials.qza")$data
taxonomy<-read_qza("taxonomy.qza")$data
tree<-read_qza("rooted-tree.qza")$data
If we only wanted to plot one significantly different feature, say 4b5eeb300368260019c1fbc7a3c718fc, we can plot only its abundances. The abundances are not exported by ALDEx2 so instead we will recalculate the CLR-abundances using an approximation of ALDEx; however, in R this could be easily calculated using the package itself.
Finally, we can visualize the results on a phylogenetic tree with the help of ggtree.
results<-results %>% mutate(Significant=if_else(we.eBH<0.1,"*", ""))
tree<-drop.tip(tree, tree$tip.label[!tree$tip.label %in% results$Feature.ID]) # remove all the features from the tree we do not have data for
ggtree(tree, layout="circular") %<+% results +
geom_tippoint(aes(fill=diff.btw), shape=21, color="grey50") +
geom_tiplab2(aes(label=Significant), size=10) +
scale_fill_gradient2(low="darkblue",high="darkred", midpoint = 0, mid="white", name="log2(fold-change") +
theme(legend.position="right")
ggsave("tree.pdf", height=10, width=10, device="pdf", useDingbats=F)
Tutorial: Integrating QIIME2 and R for data visualization and analysis using qiime2R (v0.99.6)
Background
The qiime artifact is a method for storing the input and outputs for QIIME2 along with associated metadata and provenance information about how the object was formed. This method of storing objects has a number of obvious advantages; however, on the surface it does not lend itself to easy import to R for the R-minded data scientist. In reality, the .qza file is a compressed directory with an intuitive structure.
While it is possible to export data on a one-by-one basis from the qiime artifacts using qiime’s built in suite of export features this is problematic and runs antithetical to the purpose of the artifact format for the following reasons:
This package is trying to simplify the process of getting the artifact into R without discarding any of the associated data through a simple
read_qzafunction. The artifact is unpacked in to a temporary directory and the raw data and associated metadata are read into a named list (see below). Data are typically returned as either a data.frame, phylo object (trees), or DNAStringSets (nucleic acid sequences).Functions
read_qza()- Function for reading artifacts (.qza).qza_to_phyloseq()- Imports multiple artifacts to produce a phyloseq object.read_q2metadata()- Reads qiime2 metadata file (containing q2-types definition line)write_q2manifest()- Writes a read manifest file to import data into qiime2theme_q2r()- A ggplot2 theme for for clean figures.print_provenance()- A function to display provenance information.is_q2metadata()- A function to check if a file is a qiime2 metadata file.parse_taxonomy()- A function to parse taxonomy strings and return a table where each column is a taxonomic class.parse_ordination()- A function to parse the internal ordination format.read_q2biom()- A function for reading QIIME2 biom files in format v2.1make_clr()- Transform feature table using centered log2 ratio.make_proportion()- Transform feature table to proportion (sum to 1).make_percent()- Transform feature to percent (sum to 100).interactive_table()- Create an interactive table in Rstudio viewer or rmarkdown html.summarize_taxa()- Create a list of tables with abundances sumed to each taxonomic level.taxa_barplot()- Create a stacked barplot using ggplot2.taxa_heatmap()- Create a heatmap of taxonomic abundances using gplot2.corner()- Show top corner of a large table-like obejct.min_nonzero()- Find the smallest non-zero, non-NA in a numeric vector.mean_sd()- Return mean and standard deviation for plotting.subsample_table()- Subsample a table with or without replacement.filter_features()- Remove low abundance features by number of counts and number of samples they appear in.Installing qiime2R
qiime2R is currently available via github which can easily be installed in R via the following command:
Reading Artifacts (.qza)
This example is using data derived from the moving pictures tutorial. The main function we will need is
read_qza():We will start be reading in a table of sequence variants (SVs):
When the artifact is imported, there are a number of pieces of information included. To see them we can use the names command:
To access the actual data stored within the object, access the data as below:
In the above file each row denotes a sequence variant where in the actual text is the hash of the complete sequence. See here for an example tool for generating hashes.
We can also look at the unique identifier for this object:
We can see the type of artifact:
We can also get a complete list of the files within the artifact and their sizes:
We can also print the providence; however, it is probably easier to use the q2view tool for a graphical aid in its interpretation.
Reading Metadata
If you are using a qiime2 metadata file (outlined here), you can use the supplied function
read_q2metadata()as below. If using a standard tsv or csv file, you can useread.table(),readr::read_tsv(), orreadr::csv().Reading Taxonomy
Note, when taxonomy is imported, a single string is returned along with a confidence score. For many analysis we will want to break up this string and for that purpose the
parse_taxonomy()function is provided:Creating a Phyloseq Object
A wrapper function called
qza_to_phyloseq()is provided which links multipleread_qza()calls together to create a phyloseq object for subsequent analysis as per the phyloseq tutorials. An example usage is shown below:Creating a TreeSummarizedExperiment (TSE) Object
A wrapper function called
qza_to_tse()is provided which links multipleread_qza()calls together to create a TSE object for subsequent analysis. You may find some useful tutorials on: OMA tutorials. An example usage is shown below:Example Visualizations
In this next section, methods for publication-ready figures will be demonstrated through extensive use of the tidyverse, and particularly ggplot2 and its many extensions. If starting with R, learning these packages early on will have a bit of a learning curve, but pay off massively in the long term. Each section contains a self-contained code snippets capable of replicating the figures and links to the data used. A lot can be learned from modifying parameters of the graphing commands to customize and play around with the visualizations. Note, the pipe operator (%>%) is being used frequently here and passes the output of one line to the next. Learn more about piping in R here.
All the data we will use is from the “Moving Pictures” tutorial and/or related content which can be found here.
Alpha Diversity Over Time
Data:
This dataset provides a good example of why you should always look at your data. There are an extra 3 samples in the metadata which do not have an assigned Shannon diversity value as seen below:
If this was your dataset, you should look into why this was and try to see why these samples have been lost. We will continue after sorting them out:
Note how we have now added a column for the shannon diversity to our table which will make everything easy to plot. Given this dataset is longitudinal we can plot it as such. Note: the code I am using here would handle replicate data and plot error bars; however, this case does not have replicates. Also, a quick note is that R does generally not like
-s in column names. To use these names we can surround them with ticks.Imagine we wanted to instead ask if there was an effect of antibiotics on diversity. We can change the type of plot:
Maybe instead the question is if the two individuals have different diversity?
Note: It is important to remember that these a graphical tools, but to make a claim, you should also generate some form of statistical support taking the nature of the data into consideration (ie that it uses repeated sampling over time).
Plotting PCoA
Here I am going to demo a PCoA with extra metadata mapped to it. I indicate the body site, antibiotic usage, and the shannon diversity by mapping to different aesthetic parameters.
Data:
Plotting a Heatmap
Data:
Heat maps provide a very good overview of the composition of samples but tend to not perform well for complex samples or visualizing small differences. qiime2R includes a function for making heatmaps called
taxa_heatmap().Making a taxonomic barplot
Data:
Heat maps provide a very good overview of the composition of samples but tend to not perform well for complex samples or visualizing small differences. qiime2R includes a function for making heatmaps called
taxa_heatmap().Differential Abundance Analysis
There are many ways to find features that are deferentially abundant across conditions. In this case we are going to use the output of q2-aldex as described here.
Volcano Plot
Plotting Per-Feature Abundances
If we only wanted to plot one significantly different feature, say 4b5eeb300368260019c1fbc7a3c718fc, we can plot only its abundances. The abundances are not exported by ALDEx2 so instead we will recalculate the CLR-abundances using an approximation of ALDEx; however, in R this could be easily calculated using the package itself.
Plotting a Phylogenetic Tree
Finally, we can visualize the results on a phylogenetic tree with the help of ggtree.
Note: please report problems or suggestions on the qiime2 forum or the github issue tracker.