Create balanced partitions and cross-validation folds.
Perform time series windowing and general grouping and
splitting of data.
Balance existing groups with up- and downsampling.
Collapse existing groups to fewer, balanced groups.
Finds values, or indices of values, that differ from the previous
value by some threshold(s).
Check if two grouping factors have the same groups, memberwise.
Main functions
Function
Description
group_factor()
Divides data into groups by a wide range of methods.
group()
Creates grouping factor and adds to the given data frame.
splt()
Creates grouping factor and splits the data by these groups.
partition()
Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition.
fold()
Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold.
collapse_groups()
Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing.
balance()
Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. See wrappers: downsample(), upsample().
Other tools
Function
Description
all_groups_identical()
Checks whether two grouping factors contain the same groups, memberwise.
differs_from_previous()
Finds values, or indices of values, that differ from the previous value by some threshold(s).
find_starts()
Finds values or indices of values that are not the same as the previous value.
find_missing_starts()
Finds missing starts for the l_starts method.
summarize_group_cols()
Calculates summary statistics about group columns (i.e. factors).
summarize_balances()
Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns.
ranked_balances()
Extracts the standard deviations from the Summary data frame from the output of summarize_balances()
Creates a grouping factor and adds it to the given data frame. The data
frame is grouped by the grouping factor for easy use in magrittr
(%>%) pipelines.
# Use group()
group(data = df_small, n = 5, method = 'n_dist') %>%
kable()
x
species
age
.groups
1
cat
68
1
2
pig
39
1
3
human
1
2
4
cat
34
2
5
pig
87
3
6
human
43
3
7
cat
14
3
8
pig
82
4
9
human
59
4
10
cat
51
5
11
pig
85
5
12
human
21
5
# Use group() in a pipeline
# Get average age per group
df_small %>%
group(n = 5, method = 'n_dist') %>%
dplyr::summarise(mean_age = mean(age)) %>%
kable()
.groups
mean_age
1
53.5
2
17.5
3
48.0
4
70.5
5
52.3
# Using group() with 'l_starts' method
# Starts group at the first 'cat',
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df_small %>%
group(n = list("cat", c("pig", 2), "cat"),
method = 'l_starts',
starts_col = "species") %>%
kable()
x
species
age
.groups
1
cat
68
1
2
pig
39
1
3
human
1
1
4
cat
34
1
5
pig
87
2
6
human
43
2
7
cat
14
3
8
pig
82
3
9
human
59
3
10
cat
51
3
11
pig
85
3
12
human
21
3
splt()
Creates the specified groups with group_factor() and splits the given
data by the grouping factor with base::split. Returns the splits in a
list.
Creates (optionally) balanced partitions (e.g. training/test sets).
Balance partitions on categorical variable(s) and/or a numerical
variable. Make sure that all datapoints sharing an ID is in the same
partition.
# First set seed to ensure reproducibility
set.seed(1)
# Use partition() with categorical and numerical balancing,
# while ensuring all rows per ID are in the same partition
df_partitioned <- partition(
data = df_medium,
p = 0.7,
cat_col = 'diagnosis',
num_col = "age",
id_col = 'participant'
)
df_partitioned %>%
kable()
participant
age
diagnosis
diagnosis2
score
session
1
20
a
z
10
1
1
20
a
y
24
2
1
20
a
x
45
3
2
33
b
z
24
1
2
33
b
x
40
2
2
33
b
x
67
3
3
27
a
z
15
1
3
27
a
x
30
2
3
27
a
z
40
3
4
21
b
z
35
1
4
21
b
x
50
2
4
21
b
z
78
3
participant
age
diagnosis
diagnosis2
score
session
5
32
b
y
24
1
5
32
b
x
54
2
5
32
b
z
62
3
6
25
a
x
14
1
6
25
a
z
25
2
6
25
a
x
30
3
fold()
Creates (optionally) balanced folds for use in cross-validation. Balance
folds on categorical variable(s) and/or a numerical variable. Ensure
that all datapoints sharing an ID is in the same fold. Create multiple
unique fold columns at once, e.g. for repeated cross-validation.
# First set seed to ensure reproducibility
set.seed(1)
# Use fold() with categorical and numerical balancing,
# while ensuring all rows per ID are in the same fold
df_folded <- fold(
data = df_medium,
k = 3,
cat_col = 'diagnosis',
num_col = "age",
id_col = 'participant'
)
# Show df_folded ordered by folds
df_folded %>%
arrange(.folds) %>%
kable()
participant
age
diagnosis
diagnosis2
score
session
.folds
1
20
a
z
10
1
1
1
20
a
y
24
2
1
1
20
a
x
45
3
1
5
32
b
y
24
1
1
5
32
b
x
54
2
1
5
32
b
z
62
3
1
4
21
b
z
35
1
2
4
21
b
x
50
2
2
4
21
b
z
78
3
2
6
25
a
x
14
1
2
6
25
a
z
25
2
2
6
25
a
x
30
3
2
2
33
b
z
24
1
3
2
33
b
x
40
2
3
2
33
b
x
67
3
3
3
27
a
z
15
1
3
3
27
a
x
30
2
3
3
27
a
z
40
3
3
# Show distribution of diagnoses and participants
df_folded %>%
group_by(.folds) %>%
count(diagnosis, participant) %>%
kable()
.folds
diagnosis
participant
n
1
a
1
3
1
b
5
3
2
a
6
3
2
b
4
3
3
a
3
3
3
b
2
3
# Show age representation in folds
# Notice that we would get a more even distribution if we had more data.
# As age is fixed per ID, we only have 3 ages per category to balance with.
df_folded %>%
group_by(.folds) %>%
summarize(mean_age = mean(age),
sd_age = sd(age)) %>%
kable()
.folds
mean_age
sd_age
1
26
6.57
2
23
2.19
3
30
3.29
Notice, that the we now have the opportunity to include the
session variable and/or use participant as a random effect in our
model when doing cross-validation, as any participant will only appear
in one fold.
We also have a balance in the representation of each diagnosis, which
could give us better, more consistent results.
collapse_groups()
Collapses a set of groups into a smaller set of groups while attempting
to balance the new groups by specified numerical columns, categorical
columns, level counts in ID columns, and/or the number of rows.
# We consider each participant a group
# and collapse them into 3 new groups
# We balance the number of levels in diagnosis2 column,
# as this diagnosis is not constant within the participants
df_collapsed <- collapse_groups(
data = df_medium,
n = 3,
group_cols = 'participant',
cat_cols = 'diagnosis2',
num_cols = "score"
)
# Show df_collapsed ordered by new collapsed groups
df_collapsed %>%
arrange(.coll_groups) %>%
kable()
participant
age
diagnosis
diagnosis2
score
session
.coll_groups
1
20
a
z
10
1
1
1
20
a
y
24
2
1
1
20
a
x
45
3
1
2
33
b
z
24
1
1
2
33
b
x
40
2
1
2
33
b
x
67
3
1
3
27
a
z
15
1
2
3
27
a
x
30
2
2
3
27
a
z
40
3
2
4
21
b
z
35
1
2
4
21
b
x
50
2
2
4
21
b
z
78
3
2
5
32
b
y
24
1
3
5
32
b
x
54
2
3
5
32
b
z
62
3
3
6
25
a
x
14
1
3
6
25
a
z
25
2
3
6
25
a
x
30
3
3
# Summarize the balances of the new groups
coll_summ <- df_collapsed %>%
summarize_balances(group_cols = '.coll_groups',
cat_cols = "diagnosis2",
num_cols = "score")
coll_summ$Groups %>%
kable()
.group_col
.group
# rows
mean(score)
sum(score)
# diag_x
# diag_y
# diag_z
.coll_groups
1
6
35.0
210
3
1
2
.coll_groups
2
6
41.3
248
2
0
4
.coll_groups
3
6
34.8
209
3
1
2
coll_summ$Summary %>%
kable()
.group_col
measure
# rows
mean(score)
sum(score)
# diag_x
# diag_y
# diag_z
.coll_groups
mean
6
37.06
222.3
2.667
0.667
2.67
.coll_groups
median
6
35.00
210.0
3.000
1.000
2.00
.coll_groups
SD
0
3.71
22.2
0.577
0.577
1.16
.coll_groups
IQR
0
3.25
19.5
0.500
0.500
1.00
.coll_groups
min
6
34.83
209.0
2.000
0.000
2.00
.coll_groups
max
6
41.33
248.0
3.000
1.000
4.00
# Check the across-groups standard deviations
# This is a measure of how balanced the groups are (lower == more balanced)
# and is especially useful when comparing multiple group columns
coll_summ %>%
ranked_balances() %>%
kable()
.group_col
measure
# rows
mean(score)
sum(score)
# diag_x
# diag_y
# diag_z
.coll_groups
SD
0
3.71
22.2
0.577
0.577
1.16
Recommended: By enabling the auto_tune setting, we often get a
much better balance.
balance()
Uses up- and/or downsampling to fix the group sizes to the min, max,
mean, or median group size or to a specific number of rows. Balancing
can also happen on the ID level, e.g. to ensure the same number of IDs
in each category.
# Lets first unbalance the dataset by removing some rows
df_b <- df_medium %>%
arrange(diagnosis) %>%
filter(!row_number() %in% c(5,7,8,13,14,16,17,18))
# Show distribution of diagnoses and participants
df_b %>%
count(diagnosis, participant) %>%
kable()
diagnosis
participant
n
a
1
3
a
3
2
a
6
1
b
2
3
b
4
1
# First set seed to ensure reproducibility
set.seed(1)
# Downsampling by diagnosis
balance(
data = df_b,
size = "min",
cat_col = "diagnosis"
) %>%
count(diagnosis, participant) %>%
kable()
There are currently 10 methods available. They can be divided into 6
categories.
Examples of group sizes are based on a vector with 57 elements.
Specify group size
Method: greedy
Divides up the data greedily given a specified group size.
E.g. group sizes: 10, 10, 10, 10, 10, 7
Specify number of groups
Method: n_dist (Default)
Divides the data into a specified number of groups and distributes
excess data points across groups.
E.g. group sizes: 11, 11, 12, 11, 12
Method: n_fill
Divides the data into a specified number of groups and fills up groups
with excess data points from the beginning.
E.g. group sizes: 12, 12, 11, 11, 11
Method: n_last
Divides the data into a specified number of groups. The algorithm finds
the most equal group sizes possible, using all data points. Only the
last group is able to differ in size.
E.g. group sizes: 11, 11, 11, 11, 13
Method: n_rand
Divides the data into a specified number of groups. Excess data points
are placed randomly in groups (only 1 per group).
E.g. group sizes: 12, 11, 11, 11, 12
Specify list
Method: l_sizes
Uses a list / vector of group sizes to divide up the data. Excess data points are placed in an extra group.
E.g. n = c(11, 11) returns group sizes: 11, 11, 35
Method: l_starts
Uses a list of starting positions to divide up the data. Starting positions are values in a vector (e.g. column in data frame).
Skip to a specific nth appearance of a value by using
c(value, skip_to).
E.g. n = c(11, 15, 27, 43) returns group sizes: 10, 4, 12, 16, 15
Identical to n = list(11, 15, c(27, 1), 43 where 1 specifies that we
want the first appearance of 27 after the previous value 15.
If passing n = "auto" starting positions are automatically found with
find_starts().
Specify distance between members
Method: every
Every nth data point is combined to a group.
E.g. group sizes: 12, 12, 11, 11, 11
Specify step size
Method: staircase
Uses step_size to divide up the data. Group size increases with 1 step
for every group, until there is no more data.
E.g. group sizes: 5, 10, 15, 20, 7
Specify start at
Method: primes
Creates groups with sizes corresponding to prime numbers. Starts at n (prime number). Increases to the the next prime number
until there is no more data.
E.g. group sizes: 5, 7, 11, 13, 17, 4
Balancing ID Methods
There are currently 4 methods for balancing (up-/downsampling) on ID
level in balance().
ID method: n_ids
Balances on ID level only. It makes sure there are the same number of
IDs in each category. This might lead to a different number of rows
between categories.
ID method: n_rows_c
Attempts to level the number of rows per category, while only
removing/adding entire IDs. This is done with repetition and by
iteratively picking the ID with the number of rows closest to the
lacking/excessive number of rows in the category.
ID method: distributed
Distributes the lacking/excess rows equally between the IDs. If the
number to distribute cannot be equally divided, some IDs will have 1 row
more/less than the others.
ID method: nested
Balances the IDs within their categories, meaning that all IDs in a
category will have the same number of rows.
groupdata2
Author: Ludvig R. Olsen ( r-pkgs@ludvigolsen.dk )
License: MIT
Started: October 2016
Overview
R package for dividing data into groups.
Main functions
group_factor()group()splt()partition()fold()collapse_groups()balance()downsample(),upsample().Other tools
all_groups_identical()differs_from_previous()find_starts()find_missing_starts()l_startsmethod.summarize_group_cols()factors).summarize_balances()ranked_balances()Summarydata frame from the output ofsummarize_balances()%primes%primesmethod.%staircase%staircasemethod.Table of Contents
Installation
CRAN version:
Development version:
Vignettes
groupdata2contains a number of vignettes with relevant use cases and descriptions:Data for examples
Functions
group_factor()
Returns a factor with group numbers, e.g.
factor(c(1,1,1,2,2,2,3,3,3)).This can be used to subset, aggregate, group_by, etc.
Create equally sized groups by setting
force_equal = TRUERandomize grouping factor by setting
randomize = TRUEgroup()
Creates a grouping factor and adds it to the given data frame. The data frame is grouped by the grouping factor for easy use in
magrittr(%>%) pipelines.splt()
Creates the specified groups with
group_factor()and splits the given data by the grouping factor withbase::split. Returns the splits in a list.partition()
Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on categorical variable(s) and/or a numerical variable. Make sure that all datapoints sharing an ID is in the same partition.
fold()
Creates (optionally) balanced folds for use in cross-validation. Balance folds on categorical variable(s) and/or a numerical variable. Ensure that all datapoints sharing an ID is in the same fold. Create multiple unique fold columns at once, e.g. for repeated cross-validation.
Notice, that the we now have the opportunity to include the session variable and/or use participant as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.
We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.
collapse_groups()
Collapses a set of groups into a smaller set of groups while attempting to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows.
Recommended: By enabling the
auto_tunesetting, we often get a much better balance.balance()
Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows. Balancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.
Grouping Methods
There are currently 10 methods available. They can be divided into 6 categories.
Examples of group sizes are based on a vector with 57 elements.
Specify group size
Method: greedy
Divides up the data greedily given a specified group size.
E.g. group sizes: 10, 10, 10, 10, 10, 7
Specify number of groups
Method: n_dist (Default)
Divides the data into a specified number of groups and distributes excess data points across groups.
E.g. group sizes: 11, 11, 12, 11, 12
Method: n_fill
Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.
E.g. group sizes: 12, 12, 11, 11, 11
Method: n_last
Divides the data into a specified number of groups. The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.
E.g. group sizes: 11, 11, 11, 11, 13
Method: n_rand
Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group).
E.g. group sizes: 12, 11, 11, 11, 12
Specify list
Method: l_sizes
Uses a list / vector of group sizes to divide up the data.
Excess data points are placed in an extra group.
E.g.
n = c(11, 11)returns group sizes: 11, 11, 35Method: l_starts
Uses a list of starting positions to divide up the data.
Starting positions are values in a vector (e.g. column in data frame). Skip to a specific nth appearance of a value by using
c(value, skip_to).E.g.
n = c(11, 15, 27, 43)returns group sizes: 10, 4, 12, 16, 15Identical to
n = list(11, 15, c(27, 1), 43where1specifies that we want the first appearance of 27 after the previous value 15.If passing
n = "auto"starting positions are automatically found withfind_starts().Specify distance between members
Method: every
Every
nth data point is combined to a group.E.g. group sizes: 12, 12, 11, 11, 11
Specify step size
Method: staircase
Uses step_size to divide up the data. Group size increases with 1 step for every group, until there is no more data.
E.g. group sizes: 5, 10, 15, 20, 7
Specify start at
Method: primes
Creates groups with sizes corresponding to prime numbers.
Starts at
n(prime number). Increases to the the next prime number until there is no more data.E.g. group sizes: 5, 7, 11, 13, 17, 4
Balancing ID Methods
There are currently 4 methods for balancing (up-/downsampling) on ID level in
balance().ID method: n_ids
Balances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.
ID method: n_rows_c
Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done with repetition and by iteratively picking the ID with the number of rows closest to the lacking/excessive number of rows in the category.
ID method: distributed
Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
ID method: nested
Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.