In the following, I will work with a tidy version of the movies dataset
from ggplot2. It contains a list of all movies in IMDB, their release
date, and other general information about each movie. It also includes a
list column that records which genres a movie belongs to
(Action, Drama, Romance, etc.).
tidy_movies
#> # A tibble: 50,000 × 10
#>    title       year length budget rating votes mpaa  Genres stars percent_rating
#>    <chr>      <int>  <int>  <int>  <dbl> <int> <chr> <list> <dbl>          <dbl>
#>  1 Ei ist ei…  1993     90     NA    8.4    15 ""    <chr>      1            4.5
#>  2 Hamos sto…  1985    109     NA    5.5    14 ""    <chr>      1            4.5
#>  3 Mind Bend…  1963     99     NA    6.4    54 ""    <chr>      1              0
#>  4 Trop (peu…  1998    119     NA    4.5    20 ""    <chr>      1           24.5
#>  5 Crystania…  1995     85     NA    6.1    25 ""    <chr>      1              0
#>  6 Totale!, …  1991    102     NA    6.3   210 ""    <chr>      1            4.5
#>  7 Visibleme…  1995    100     NA    4.6     7 ""    <chr>      1           24.5
#>  8 Pang shen…  1976     85     NA    7.4     8 ""    <chr>      1              0
#>  9 Not as a …  1955    135    2e6    6.6   223 ""    <chr>      1            4.5
#> 10 Autobiogr…  1994     87     NA    7.4     5 ""    <chr>      1              0
#> # ℹ 49,990 more rows
ggupset makes it easy to get an immediate impression of how many movies
are in each genre and in each combination of genres. For example, there
are slightly more than 1200 Dramas in the set, more than 1000 movies that
don’t belong to any genre, and ~170 that are both Comedy and Drama.
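A minimal sketch of such a plot, assuming the tidy_movies tibble that ships with ggupset and the Genres list column shown above. The value n_intersections = 20 is only an illustrative choice, and the distinct() call assumes that a movie can occupy several rows (one per stars value):

```r
library(dplyr)
library(ggplot2)
library(ggupset)

# Keep a single row per movie so that each movie is counted once
movies <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE)

# Mapping the list column to x and adding scale_x_upset() turns the
# bar chart into an UpSet plot with a combination matrix as x-axis
genre_plot <- ggplot(movies, aes(x = Genres)) +
  geom_bar() +
  scale_x_upset(n_intersections = 20)  # show only the 20 largest combinations

genre_plot
```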
The best feature of ggupset is that it plays well with existing
tricks from ggplot2. For example, you can easily add the size of the
counts on top of the bars with this trick from
Stack Overflow.
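The linked answer is not reproduced here. The sketch below shows one common way to label bars with their counts, using geom_text() with stat = "count"; the vjust value and the extra headroom on the y-axis are illustrative assumptions:

```r
library(dplyr)
library(ggplot2)
library(ggupset)

count_plot <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE) %>%
  ggplot(aes(x = Genres)) +
  geom_bar() +
  # Reuse the counts computed for the bars as text labels above each bar
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
  scale_x_upset(n_intersections = 20) +
  # Leave 10% headroom so the topmost label is not clipped
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)), name = "")

count_plot
```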
Often enough, the raw data you start with is not in such a neat, tidy
shape. But that shape is a prerequisite for making such ggupset plots, so
how can you get from a wide dataset to a useful one? And how do you
actually create a list-column, anyway?
Imagine we measured, for a set of genes, whether each is a member of a
certain pathway. A gene can be a member of multiple pathways, and we want
to see which pathways have a large overlap. Unfortunately, we didn’t
record the data in a tidy format but as a simple matrix.
A fictional dataset of this type is provided as the
gene_pathway_membership variable.
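To illustrate the reshaping step without depending on the exact layout of gene_pathway_membership, the following sketch uses a small hypothetical logical matrix (rows are pathways, columns are genes; this layout is an assumption) and turns it into a tidy tibble with one row per gene/pathway pair:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy matrix: TRUE means the gene is a member of the pathway
membership <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  TRUE),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Pathway A", "Pathway B"), c("Gene1", "Gene2", "Gene3"))
)

tidy_pathway_member <- membership %>%
  as_tibble(rownames = "Pathway") %>%                       # row names become a column
  pivot_longer(-Pathway, names_to = "Gene", values_to = "Member") %>%
  filter(Member) %>%                                        # keep only actual memberships
  select(-Member)

tidy_pathway_member
```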
tidy_pathway_member is already a very good starting point for plotting
with ggplot. But we care about the genes that are members of multiple
pathways, so we will aggregate the data by Gene and create a
list-column with the Pathway information.
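The aggregation can be sketched as follows. The small tibble is a hypothetical stand-in for tidy_pathway_member, and the names gene_pathways and Pathways are chosen only for this example:

```r
library(dplyr)
library(tibble)
library(ggplot2)
library(ggupset)

# Hypothetical stand-in: one row per gene/pathway pair
tidy_pathway_member <- tribble(
  ~Pathway,    ~Gene,
  "Pathway A", "Gene1",
  "Pathway A", "Gene2",
  "Pathway B", "Gene2",
  "Pathway B", "Gene3"
)

# Collapse to one row per gene; wrapping in list() creates the list-column
gene_pathways <- tidy_pathway_member %>%
  group_by(Gene) %>%
  summarize(Pathways = list(Pathway))

# The list-column can be mapped to x directly
pathway_plot <- ggplot(gene_pathways, aes(x = Pathways)) +
  geom_bar() +
  scale_x_upset()

pathway_plot
```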
Because the process of collapsing list columns into delimited strings is
fairly generic, I provide a new scale that does this automatically
(scale_x_mergelist()).
However, those labels can be difficult to read.
Instead, I provide a third function that replaces the axis labels with a
combination matrix (axis_combmatrix()).
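The two steps can be sketched side by side, again assuming the tidy_movies data. The "-" separator and the rotated axis text are illustrative choices; the separator passed to axis_combmatrix() is assumed to have to match the one used for merging:

```r
library(dplyr)
library(ggplot2)
library(ggupset)

movies <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE)

# Step 1: collapse each list element into a "-"-delimited axis label
merged_plot <- ggplot(movies, aes(x = Genres)) +
  geom_bar() +
  scale_x_mergelist(sep = "-") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Step 2: replace the text labels with a combination matrix
combmatrix_plot <- ggplot(movies, aes(x = Genres)) +
  geom_bar() +
  scale_x_mergelist(sep = "-") +
  axis_combmatrix(sep = "-")

combmatrix_plot
```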
To make publication-ready plots, you often want complete control over
how each part of a plot looks. This is why I provide an easy way to
style the combination matrix. Simply add a theme_combmatrix() to the
plot.
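A short sketch of such styling. The specific colour and line size are arbitrary, and combmatrix.panel.line.size is used here on the assumption that it controls the connecting lines between points:

```r
library(dplyr)
library(ggplot2)
library(ggupset)

styled_plot <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE) %>%
  ggplot(aes(x = Genres)) +
  geom_bar() +
  scale_x_upset(order_by = "degree") +
  theme_combmatrix(
    combmatrix.panel.point.color.fill = "green",  # colour of the filled points
    combmatrix.panel.line.size = 0                # hide the connecting lines
  )

styled_plot
```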
Sometimes the limited styling options, such as
combmatrix.panel.point.color.fill, are not enough. To fully customize
the combination matrix plot, axis_combmatrix has an
override_plotting_function parameter that allows us to plot anything
in place of the combination matrix.
Let us first reproduce the standard combination plot, but use the
override_plotting_function parameter to see how it works:
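The following is a simplified sketch rather than the package's built-in drawing code. It assumes that the function receives a data frame with the columns labels, single_label, at and observed, and that it should return a ggplot object; the helper name plot_combmatrix is hypothetical:

```r
library(dplyr)
library(ggplot2)
library(ggupset)

# Hypothetical helper: draws one point per set/combination pair and
# connects the observed sets of each combination with a line
plot_combmatrix <- function(df) {
  print(class(df))  # inspect what is passed in
  print(df)
  ggplot(df, aes(x = at, y = single_label)) +
    geom_point(aes(color = observed), size = 3) +
    geom_line(data = function(dat) dat[dat$observed, , drop = FALSE],
              aes(group = labels), linewidth = 1.2) +
    scale_x_continuous(limits = c(0, 1), expand = c(0, 0)) +
    scale_color_manual(values = c(`TRUE` = "black", `FALSE` = "#E0E0E0")) +
    guides(color = "none") +
    theme(panel.background = element_blank(),
          axis.text.x = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_blank())
}

custom_plot <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE) %>%
  ggplot(aes(x = Genres)) +
  geom_bar() +
  scale_x_mergelist(sep = "-") +
  axis_combmatrix(sep = "-", override_plotting_function = plot_combmatrix)
```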
#> [1] "tbl_df" "tbl" "data.frame"
#> # A tibble: 336 × 7
#>    labels                  single_label    id labels_split     at observed index
#>    <ord>                   <ord>        <int> <list>        <dbl> <lgl>    <dbl>
#>  1 ""                      Short            1 <chr [0]>    0.0124 FALSE        1
#>  2 "Action"                Short            2 <chr [1]>    0.0332 FALSE        1
#>  3 "Action-Animation"      Short            3 <chr [2]>    0.0539 FALSE        1
#>  4 "Action-Animation-Roma… Short            4 <chr [3]>    0.0747 FALSE        1
#>  5 "Action-Animation-Shor… Short            5 <chr [3]>    0.0954 TRUE         1
#>  6 "Action-Comedy"         Short            6 <chr [2]>     0.116 FALSE        1
#>  7 "Action-Comedy-Drama"   Short            7 <chr [3]>     0.137 FALSE        1
#>  8 "Action-Comedy-Romance" Short            8 <chr [3]>     0.158 FALSE        1
#>  9 "Action-Comedy-Short"   Short            9 <chr [3]>     0.178 TRUE         1
#> 10 "Action-Documentary"    Short           10 <chr [2]>     0.199 FALSE        1
#> # ℹ 326 more rows
The override_plotting_function is incredibly powerful, but also an
advanced feature that comes with pitfalls. Use at your own risk.
Alternative Packages
There is already a package called UpSetR (GitHub, CRAN) that provides
very similar functionality and that heavily inspired me to write this
package. It produces a similar plot with an additional view that shows
the overall size of each genre.
The UpSetR package provides a lot of convenient helpers around this kind
of plot; the main advantage of my package is that it can be combined
with any kind of ggplot that uses a categorical x-axis. This additional
flexibility can be useful if you want to create non-standard plots. The
following plot, for example, shows when movies of a certain genre were
published.
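One way such a plot could look is sketched below: a violin plot of the release year per genre combination. The choice of geom_violin() and of the 12 most frequent combinations is an assumption made for illustration:

```r
library(dplyr)
library(ggplot2)
library(ggupset)

year_plot <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE) %>%
  ggplot(aes(x = Genres, y = year)) +
  geom_violin() +
  # Order combinations by frequency and keep the 12 most common ones
  scale_x_upset(order_by = "freq", n_intersections = 12)

year_plot
```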
dplyr currently does not support list columns as grouping variables.
In that case, it makes sense to collapse the list column manually and
use the axis_combmatrix() function to get a good-looking plot.
# Percentage of votes for n stars for top 12 genres
avg_rating <- tidy_movies %>%
  mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse = "-"))) %>%
  mutate(Genres_collapsed = fct_lump(fct_infreq(as.factor(Genres_collapsed)), n = 12)) %>%
  group_by(stars, Genres_collapsed) %>%
  summarize(percent_rating = sum(votes * percent_rating)) %>%
  group_by(Genres_collapsed) %>%
  mutate(percent_rating = percent_rating / sum(percent_rating)) %>%
  arrange(Genres_collapsed)
#> `summarise()` has grouped output by 'stars'. You can override using the
#> `.groups` argument.
avg_rating
#> # A tibble: 130 × 3
#> # Groups: Genres_collapsed [13]
#>    stars Genres_collapsed percent_rating
#>    <dbl> <fct>                     <dbl>
#>  1     1 Drama                    0.0437
#>  2     2 Drama                    0.0411
#>  3     3 Drama                    0.0414
#>  4     4 Drama                    0.0433
#>  5     5 Drama                    0.0506
#>  6     6 Drama                    0.0717
#>  7     7 Drama                     0.129
#>  8     8 Drama                     0.175
#>  9     9 Drama                     0.170
#> 10    10 Drama                     0.235
#> # ℹ 120 more rows
# Plot using the combination matrix axis;
# the red lines indicate the average rating per genre
ggplot(avg_rating, aes(x = Genres_collapsed, y = stars)) +
  geom_tile(aes(fill = percent_rating)) +
  stat_summary_bin(aes(y = percent_rating * stars), fun = sum, geom = "point",
                   shape = "—", color = "red", size = 6) +
  axis_combmatrix(sep = "-", levels = c("Drama", "Comedy", "Short",
                  "Documentary", "Action", "Romance", "Animation", "Other")) +
  scale_fill_viridis_c()
Saving Plots
There is an important pitfall when trying to save a plot with a
combination matrix. When you use ggsave(), ggplot2 automatically saves
the last plot that was created. However, here last_plot() refers to
only the combination matrix. To store the full plot, you need to
explicitly assign it to a variable and save that.
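A minimal sketch of that workaround. The variable name pl, the temporary output file and the plot dimensions are placeholders chosen for this example:

```r
library(dplyr)
library(ggplot2)
library(ggupset)

# Assign the complete plot to a variable ...
pl <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE) %>%
  ggplot(aes(x = Genres)) +
  geom_bar() +
  scale_x_upset(n_intersections = 20)

# ... and pass it explicitly instead of relying on last_plot()
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = pl, width = 8, height = 5)
```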
ggupset
Plot a combination matrix instead of the standard x-axis and create UpSet plots with ggplot2.
Installation
You can install the released version of ggupset from CRAN with install.packages("ggupset").
Example
This is a basic example which shows you how to solve a common problem:
In the following, I will work with a tidy version of the movies dataset from ggplot2. It contains a list of all movies in IMDB, their release date, and other general information about each movie. It also includes a list column that records which genres a movie belongs to (Action, Drama, Romance, etc.).
ggupset makes it easy to get an immediate impression of how many movies are in each genre and in each combination of genres. For example, there are slightly more than 1200 Dramas in the set, more than 1000 movies that don’t belong to any genre, and ~170 that are both Comedy and Drama.
Adding Numbers on top
The best feature of ggupset is that it plays well with existing tricks from ggplot2. For example, you can easily add the size of the counts on top of the bars with this trick from Stack Overflow.
Reshaping quadratic data
Often enough, the raw data you start with is not in such a neat, tidy shape. But that shape is a prerequisite for making such ggupset plots, so how can you get from a wide dataset to a useful one? And how do you actually create a list-column, anyway?
Imagine we measured, for a set of genes, whether each is a member of a certain pathway. A gene can be a member of multiple pathways, and we want to see which pathways have a large overlap. Unfortunately, we didn’t record the data in a tidy format but as a simple matrix.
A fictional dataset of this type is provided as the gene_pathway_membership variable.
We will now first turn this matrix into a tidy tibble and then plot it.
tidy_pathway_member is already a very good starting point for plotting with ggplot. But we care about the genes that are members of multiple pathways, so we will aggregate the data by Gene and create a list-column with the Pathway information.
What if I need more flexibility?
The first important idea is to realize that a list column is just as good as a character vector with the list elements collapsed.
We can easily make a plot using the strings as categorical axis labels.
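As a sketch of this idea, the list column can be collapsed by hand into a plain character column and plotted like any other categorical variable. The column name Genres_collapsed and the "-" separator are illustrative choices:

```r
library(dplyr)
library(ggplot2)
library(ggupset)  # only needed here for the tidy_movies data

collapsed <- tidy_movies %>%
  distinct(title, year, length, .keep_all = TRUE) %>%
  # Sort within each element so that c("Drama", "Action") and
  # c("Action", "Drama") end up with the same label
  mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse = "-")))

collapsed_plot <- ggplot(collapsed, aes(x = Genres_collapsed)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

collapsed_plot
```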
Because the process of collapsing list columns into delimited strings is fairly generic, I provide a new scale that does this automatically (scale_x_mergelist()).
However, those labels can be difficult to read. Instead, I provide a third function that replaces the axis labels with a combination matrix (axis_combmatrix()).
One thing that is only possible with the scale_x_upset() function is to automatically order the categories and genres by freq or by degree.
Styling
To make publication-ready plots, you often want complete control over how each part of a plot looks. This is why I provide an easy way to style the combination matrix. Simply add a theme_combmatrix() to the plot.
Maximum Flexibility
Sometimes the limited styling options, such as combmatrix.panel.point.color.fill, are not enough. To fully customize the combination matrix plot, axis_combmatrix has an override_plotting_function parameter that allows us to plot anything in place of the combination matrix.
Let us first reproduce the standard combination plot, but use the override_plotting_function parameter to see how it works:
We can use the above template to specifically highlight, for example, all sets that include the Action category.
The override_plotting_function is incredibly powerful, but also an advanced feature that comes with pitfalls. Use at your own risk.
Alternative Packages
There is already a package called UpSetR (GitHub, CRAN) that provides very similar functionality and that heavily inspired me to write this package. It produces a similar plot with an additional view that shows the overall size of each genre.
The UpSetR package provides a lot of convenient helpers around this kind of plot; the main advantage of my package is that it can be combined with any kind of ggplot that uses a categorical x-axis. This additional flexibility can be useful if you want to create non-standard plots. The following plot, for example, shows when movies of a certain genre were published.
Advanced examples
1. Complex experimental design
The combination matrix axis can be used to show complex experimental designs, where each sample got a combination of different treatments.
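To make this concrete, here is a sketch with a hypothetical, simulated design (knockout status, drug treatment and time point). The data frame, its column names and the simulated response are all invented for this illustration:

```r
library(ggplot2)
library(ggupset)

set.seed(1)

# Hypothetical design: every combination of three treatments, 5 replicates each
design <- expand.grid(KO = c(TRUE, FALSE), Drug = c(TRUE, FALSE),
                      Timepoint = c(8, 24), replicate = 1:5)
design$response <- rnorm(nrow(design),
                         mean = 10 + 2 * design$Drug - 1.5 * design$KO)

# One list element per sample, naming the treatments that sample received
design$Label <- Map(
  function(ko, drug, tp) c(if (ko) "KO" else "WT", if (drug) "Drug", paste0(tp, "h")),
  design$KO, design$Drug, design$Timepoint
)

design_plot <- ggplot(design, aes(x = Label, y = response)) +
  geom_boxplot() +
  # sets fixes the row order of the combination matrix
  scale_x_upset(order_by = "degree",
                sets = c("KO", "WT", "Drug", "8h", "24h"))

design_plot
```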
2. Aggregation of information
dplyr currently does not support list columns as grouping variables. In that case, it makes sense to collapse the list column manually and use the axis_combmatrix() function to get a good-looking plot.
Saving Plots
There is an important pitfall when trying to save a plot with a combination matrix. When you use ggsave(), ggplot2 automatically saves the last plot that was created. However, here last_plot() refers to only the combination matrix. To store the full plot, you need to explicitly assign it to a variable and save that.
Session Info