The cachem R package provides objects for creating and managing caches.
These cache objects are key-value stores, but unlike other basic
key-value stores, they have built-in support for memory and age limits
so that they won’t have unbounded growth.
The cache objects in cachem differ from some other key-value stores
in the following ways:

- The cache objects provide automatic pruning so that they remain within
  memory limits.
- Fetching a non-existent object returns a sentinel value. An
  alternative is to simply return NULL. This is what R lists and
  environments do, but it is ambiguous whether the value really is
  NULL, or if it is not present. Another alternative is to throw an
  exception when fetching a non-existent object. However, this results
  in more complicated code, as every get() needs to be wrapped in a
  tryCatch().
Installation
To install the CRAN version:
install.packages("cachem")
You can install the development version from GitHub with:
if (!require("remotes")) install.packages("remotes")
remotes::install_github("r-lib/cachem")
Usage
To create a memory-based cache, call cache_mem().
library(cachem)
m <- cache_mem()
Add arbitrary R objects to the cache using $set(key, value):
The key must be a string consisting of lowercase letters, numbers, and
the underscore (_) and hyphen (-) characters. (Upper-case characters
are not allowed because some storage backends do not distinguish between
lowercase and uppercase letters.) The value can be any R object.
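For example, values can be stored with $set() and retrieved with
$get(). A minimal sketch, assuming cachem is installed and m is a
fresh cache_mem():

```r
library(cachem)

m <- cache_mem()

# Store an arbitrary R object under a key
m$set("abc123", c(1, 2, 3))

# Retrieve it later with the same key
m$get("abc123")
#> [1] 1 2 3
```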
If you call get() on a key that doesn’t exist, it will return a
key_missing() sentinel value:
m$get("dog")
#> <Key Missing>
A common usage pattern is to call get(), and then check if the result
is a key_missing object:
value <- m$get(key)
if (is.key_missing(value)) {
# Cache miss - do something
} else {
# Cache hit - do another thing
}
The reason for doing this (instead of calling $exists(key) and then
$get(key)) is that for some storage backends, there is a potential
race condition: the object could be removed from the cache between the
exists() and get() calls. For example:

- If multiple R processes have cache_disks that share the same
  directory, one process could remove an object from the cache in
  between the exists() and get() calls in another process, resulting
  in an error.
- If you use a cache_mem with a max_age, it’s possible for an object
  to be present when you call exists(), but for its age to exceed
  max_age by the time get() is called. In that case, the get()
  will return a key_missing() object.
# Avoid this pattern, due to a potential race condition!
if (m$exists(key)) {
value <- m$get(key)
}
Cache types
cachem comes with two kinds of cache objects: a memory cache, and a
disk cache.
cache_mem()
The memory cache stores objects in memory, by simply keeping a
reference to each object. To create a memory cache:
m <- cache_mem()
The default size of the cache is 200MB, but this can be customized with
max_size:
m <- cache_mem(max_size = 10 * 1024^2)
It may also be useful to set a maximum age of objects. For example, if
you only want objects to stay for a maximum of one hour:
m <- cache_mem(max_size = 10 * 1024^2, max_age = 3600)
For more about how objects are evicted from the cache, see section
Pruning below.
An advantage that the memory cache has over the disk cache (and any
other type of cache that stores the objects outside of the R process’s
memory) is that it does not need to serialize objects. Instead, it
merely stores references to the objects. This means that it can store
objects that other caches cannot, and with more efficient use of memory
– if two objects in the cache share some of their contents (such that
they refer to the same sub-object in memory), then cache_mem will not
create duplicate copies of the contents, as cache_disk would, since it
serializes the objects with the serialize() function.
The size calculation, however, is not as intelligent as the actual
memory usage: if there are two objects that share contents, their sizes
are computed separately, even if they have items that share the exact
same representation in memory. This is because sizes are computed with
the object.size() function, which does not account for multiple
references to the same object in memory.
In short, a memory cache, if anything, over-counts the amount of memory
actually consumed. In practice, this means that if you set a 200MB limit
on the size of the cache, and the cache thinks it has 200MB of contents,
the actual amount of memory consumed could be less than 200MB.
Demonstration of memory over-counting from `object.size()`
# Create a and b which both contain the same numeric vector.
x <- list(rnorm(1e5))
a <- list(1, x)
b <- list(2, x)
# Add to cache
m$set("a", a)
m$set("b", b)
# Each object is about 800kB in memory, so the cache_mem() will consider the
# total memory used to be 1600kB.
object.size(m$get("a"))
#> 800224 bytes
object.size(m$get("b"))
#> 800224 bytes
For reference, lobstr::obj_size can detect shared objects, and knows
that these objects share most of their memory.
However, if obj_size() were used to find the incremental memory used
when an object is added to the cache, it would have to walk all objects
in the cache every time a single object is added. For this reason,
cache_mem uses object.size() to compute the object sizes.
cache_disk()
Disk caches are stored in a directory on disk. A disk cache is slower
than a memory cache, but can generally be larger. To create one:
d <- cache_disk()
By default, it creates a subdirectory of the R process’s temp directory,
and it will persist until the R process exits.
Like a cache_mem, the max_size, max_n, and max_age can be customized;
see section Pruning below for more information.
Each object in the cache is stored as an RDS file on disk, using the
serialize() function. Since objects in a disk cache are serialized,
they are subject to the limitations of the serialize() function. For
more information, see section Limitations of serialized objects below.
The storage directory can be specified with dir; it will be created if
necessary.
cache_disk(dir = "cachedir")
Sharing a disk cache among processes
Multiple R processes can use cache_disk objects that share the same
cache directory. To do this, simply point each cache_disk to the same
directory.
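For example, two cache_disk objects (which could live in different R
processes) can share one directory. A sketch, using a temporary
directory as the shared location:

```r
library(cachem)

shared_dir <- file.path(tempdir(), "myapp-cache")

# These could each be created in a different R process
cache_a <- cache_disk(dir = shared_dir)
cache_b <- cache_disk(dir = shared_dir)

# An object written through one cache is visible through the other
cache_a$set("x", 100)
cache_b$get("x")
#> [1] 100
```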
cache_disk pruning
For a cache_disk, pruning does not happen on every access, because
finding the size of files in the cache directory can take a nontrivial
amount of time. By default, pruning happens once every 20 times that
$set() is called, or if at least five seconds have elapsed since the
last pruning. The prune_rate controls how many times $set() must be
called before a pruning occurs. It defaults to 20; smaller values result
in more frequent pruning and larger values result in less frequent
pruning (but keep in mind pruning always occurs if it has been at least
five seconds since the last pruning).
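For instance, to prune less often than the default of every 20 $set()
calls (still subject to the five-second rule above), pass a larger
prune_rate when creating the cache:

```r
library(cachem)

# Prune only once every 5000 set() operations (or after five seconds)
d <- cache_disk(prune_rate = 5000)
```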
Cleaning up the cache directory
The cache directory can be deleted by calling $destroy(). After it is
destroyed, the cache object can no longer be used.
d$destroy()
d$set("a", 1) # Error
To create a cache_disk that will automatically delete its storage
directory when garbage collected, use destroy_on_finalize=TRUE:
d <- cache_disk(destroy_on_finalize = TRUE)
d$set("a", 1)
cachedir <- d$info()$dir
dir(cachedir)
#> [1] "a.rds"
# Remove reference to d and trigger a garbage collection
rm(d)
gc()
dir.exists(cachedir)
Using custom serialization functions
It is possible to use custom serialization functions rather than the
defaults of saveRDS() and readRDS(), with the write_fn, read_fn,
and extension arguments respectively. This could be used for
alternative serialization formats like qs, or specialized object
formats like fst or parquet.
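For instance, a disk cache could use the qs package in place of RDS
serialization (a sketch, assuming qs is installed):

```r
library(cachem)

# Swap in qs's qread/qsave for the default RDS functions
d <- cache_disk(
  read_fn   = qs::qread,
  write_fn  = qs::qsave,
  extension = ".qs"
)
```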
Cache API
cache_mem() and cache_disk() support all of the methods listed
below. If you want to create a compatible caching object, it must have
at least the get() and set() methods:
- get(key, missing = missing_): Get the object associated with key.
  The missing parameter allows customized behavior if the key is not
  present: it is actually an expression which is evaluated when there is
  a cache miss, and it can return a value or throw an error.
- set(key, value): Set a key to a value.
- exists(key): Check whether a particular key exists in the cache.
- remove(key): Remove a key-value pair from the cache.
Some optional methods:
- reset(): Clear all objects from the cache.
- keys(): Return a character vector of all keys in the cache.
- prune(): Prune the cache. (Some types of caches may not prune on
  every access, and may temporarily grow past their limits, until the
  next pruning is triggered automatically, or manually with this
  function.)
- size(): Return the number of objects in the cache.
For these methods:
- key: can be any string with lowercase letters, numbers, underscores
  (_), and hyphens (-). Some storage backends may not handle very
  long keys well. For example, with a cache_disk(), the key is used as
  a filename, and on some filesystems, very long filenames may exceed
  limits on path lengths.
- value: can be any R object, with some exceptions noted below.
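As an illustration of this API (and not part of cachem itself), here is
a minimal sketch of a compatible, non-pruning cache backed by an
environment; cache_env is a hypothetical name:

```r
library(cachem)  # for key_missing() and is.key_missing()

cache_env <- function() {
  e <- new.env(parent = emptyenv())
  list(
    get = function(key, missing = key_missing()) {
      if (exists(key, envir = e, inherits = FALSE)) {
        base::get(key, envir = e, inherits = FALSE)
      } else {
        # missing is lazily evaluated, so it may return a value
        # or throw an error, matching the API described above
        missing
      }
    },
    set    = function(key, value) assign(key, value, envir = e),
    exists = function(key) exists(key, envir = e, inherits = FALSE),
    remove = function(key) rm(list = key, envir = e)
  )
}

ce <- cache_env()
ce$set("a", 1)
ce$get("a")
#> [1] 1
is.key_missing(ce$get("b"))
#> [1] TRUE
```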
Limitations of serialized objects
For any cache that serializes the object for storage outside of the R
process – in other words, any cache other than a cache_mem() – some
types of objects will not save and restore as well. Notably, reference
objects may consume more memory when restored, since R may not know to
deduplicate shared objects. External pointers cannot be
serialized, since they point to memory in the R process. See
?serialize for more information.
Read-only caches
It is possible to create a read-only cache by making the set(),
remove(), reset(), and prune() methods into no-ops. This can be
useful if sharing a cache with another R process which can write to the
cache. For example, one (or more) processes can write to the cache, and
other processes can read from it.
Such a wrapper presents the same API as the underlying cache but
ignores writes. Note, however, that code that uses such a cache must
not require that $set() actually sets a value in the cache. This is
good practice anyway, because with these cache objects, items can be
pruned from them at any time.
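A sketch of such a wrapper (cache_readonly is a hypothetical helper,
not part of cachem): reads pass through, and writes become no-ops.

```r
library(cachem)

cache_readonly <- function(cache) {
  list(
    get    = cache$get,
    exists = cache$exists,
    keys   = cache$keys,
    size   = cache$size,
    # Write operations become no-ops
    set    = function(key, value) invisible(NULL),
    remove = function(key) invisible(NULL),
    reset  = function() invisible(NULL),
    prune  = function() invisible(NULL)
  )
}

m <- cache_mem()
m$set("a", 1)

ro <- cache_readonly(m)
ro$set("b", 2)   # silently ignored
ro$get("a")
#> [1] 1
ro$keys()
#> [1] "a"
```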
Pruning
The cache objects provided by cachem have automatic pruning. (Note that
pruning is not required by the API, so one could implement an
API-compatible cache without pruning.)
This section describes how pruning works for cache_mem() and
cache_disk().
When the cache object is created, the maximum size (in bytes) is
specified by max_size. When the size of objects in the cache exceeds
max_size, objects will be pruned from the cache.
When objects are pruned from the cache, which ones are removed is
determined by the eviction policy, evict:

- lru: The least-recently-used objects will be removed from the
  cache, until it fits within the limit. This is the default and is
  appropriate for most cases.
- fifo: The oldest objects will be removed first.
It is also possible to set the maximum number of items that can be in
the cache, with max_n. By default this is set to Inf, or no limit.
The max_age parameter is somewhat different from max_size and
max_n. The latter two set limits on the cache store as a whole,
whereas max_age sets limits for each individual item; for each item,
if its age exceeds max_age, then it will be removed from the cache.
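For example, a cache_mem limited to two items should evict the
least-recently-used item when a third is added. A sketch (assuming the
memory cache prunes when $set() pushes it past its limits):

```r
library(cachem)

m <- cache_mem(max_n = 2)
m$set("a", 1)
m$set("b", 2)
m$get("a")      # touch "a", making "b" the least recently used
m$set("c", 3)   # exceeds max_n = 2; with "lru", "b" should be evicted
m$keys()
```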
Layered caches
Multiple caches can be composed into a single cache, using
cache_layered(). This can be used to create a multi-level cache. (Note
that cache_layered() is currently experimental.) For example, we can
create a layered cache with a very fast 100MB memory cache and a larger
but slower 2GB disk cache:
m <- cache_mem(max_size = 100 * 1024^2)
d <- cache_disk(max_size = 2 * 1024^3)
cl <- cache_layered(m, d)
The layered cache will have the same API, with $get(), $set(), and
so on, so it can be used interchangeably with other caching objects.
For this example, we’ll recreate the cache_layered with logging
enabled, so that it will show cache hits and misses.
cl <- cache_layered(m, d, logfile = stderr())
# Each of the objects generated by rnorm() is about 40 MB
cl$set("a", rnorm(5e6))
cl$set("b", rnorm(5e6))
cl$set("c", rnorm(5e6))
# View the objects in each of the component caches
m$keys()
#> [1] "c" "b"
d$keys()
#> [1] "a" "b" "c"
# The layered cache reports having all keys
cl$keys()
#> [1] "c" "b" "a"
When $get() is called, it searches the first cache, and if it’s
missing there, it searches the next cache, and so on. If not found in
any caches, it returns key_missing().
# Get object that exists in the memory cache
x <- cl$get("c")
#> [2020-10-23 13:11:09.985] cache_layered Get: c
#> [2020-10-23 13:11:09.985] cache_layered Get from cache_mem... hit
# Get object that doesn't exist in the memory cache
x <- cl$get("a")
#> [2020-10-23 13:13:10.968] cache_layered Get: a
#> [2020-10-23 13:13:10.969] cache_layered Get from cache_mem... miss
#> [2020-10-23 13:13:11.329] cache_layered Get from cache_disk... hit
# Object is not present in any component caches
cl$get("d")
#> [2020-10-23 13:13:40.197] cache_layered Get: d
#> [2020-10-23 13:13:40.197] cache_layered Get from cache_mem... miss
#> [2020-10-23 13:13:40.198] cache_layered Get from cache_disk... miss
#> <Key Missing>
Multiple cache objects can be layered this way. You could even add a
cache which uses a remote store, such as a network file system or even
AWS S3.