Introduction
Many times someone may want to see a summary or cumulative statistic for a given set of data, or even for several simulations of data. I went over bootstrap plotting earlier this month, and today’s topic is a form of that, although slightly more restrictive.
I have decided to make today my weekly r-tip because tomorrow is Thanksgiving here in the US and I am taking an extended holiday, so I won’t be back until Monday.
Today’s function and weekly tip is on tidy_stat_tbl(). It is meant to be used with a tidy_ distribution function. Let’s take a look.
Function
Here is the function call:
tidy_stat_tbl(
  .data,
  .x = y,
  .fns,
  .return_type = "vector",
  .use_data_table = FALSE,
  ...
)
Here are the arguments to the function:
.data - The input data coming from a tidy_ distribution function.
.x - The default is y but can be one of the other columns from the input data.
.fns - The default is IQR, but this can be any stat function like quantile or median etc.
.return_type - The default is “vector” which returns an sapply object.
.use_data_table - The default is FALSE, TRUE will use data.table under the hood and still return a tibble. If this argument is set to TRUE then the .return_type parameter will be ignored.
... - Additional function arguments to be supplied to the parameters of .fns
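To make the roles of .fns, .return_type and ... a bit more concrete, here is a minimal base R sketch of the general idea: split the values by simulation, then apply the statistic with sapply or lapply while forwarding any extra arguments. The helper stat_by_sim() and the data frame df are hypothetical stand-ins made up for illustration; this is not the TidyDensity implementation.

```r
# Hypothetical stand-in for tidy_ output: 3 simulations of 50 draws each
set.seed(123)
df <- data.frame(
  sim_number = factor(rep(1:3, each = 50)),
  y = rnorm(150)
)

# Hypothetical helper (not part of TidyDensity): split y by simulation,
# then forward ... to the statistic function
stat_by_sim <- function(data, fns, return_type = "vector", ...) {
  pieces <- split(data$y, data$sim_number)
  names(pieces) <- paste0("sim_number_", names(pieces))
  if (return_type == "list") {
    lapply(pieces, fns, ...)
  } else {
    sapply(pieces, fns, ...)
  }
}

# probs and na.rm travel through ... and end up as arguments to quantile()
vec <- stat_by_sim(df, quantile, "vector", probs = c(0.025, 0.975), na.rm = TRUE)
lst <- stat_by_sim(df, quantile, "list", probs = c(0.025, 0.975), na.rm = TRUE)
```

Here vec is a matrix with one column per simulation and lst is a list with one element per simulation, which mirrors the shapes of the "vector" and "list" outputs in this post.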
Examples
Single Simulation
Let’s go over some examples. First, we will go over all the different .return_type options for a single simulation of tidy_normal() using the quantile function.
Vector Output BE CAREFUL IT USES SAPPLY
library(TidyDensity)

set.seed(123)
tn <- tidy_normal()

tidy_stat_tbl(
  .data = tn,
  .x = y,
  .return_type = "vector",
  .fns = quantile,
  na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)
sim_number_1
2.5% -1.59190149
50% -0.07264039
97.5% 1.77074730
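The warning in the heading presumably refers to a well-known property of sapply: the shape of what it returns depends on what the function returns. The short base R sketch below uses two made-up vectors (sims is hypothetical data, not package output) to show the difference, plus vapply as one way to pin the shape down.

```r
# Two hypothetical simulations, named the way the outputs in this post are named
sims <- list(
  sim_number_1 = c(1, 2, 3, 4),
  sim_number_2 = c(10, 20, 30, 40)
)

# A statistic that returns one number gives a named vector
one_value <- sapply(sims, IQR)

# A statistic that returns several numbers gives a matrix instead
many_values <- sapply(sims, quantile, probs = c(0.25, 0.75))

# vapply() declares the expected shape up front and errors if it is not met
checked <- vapply(sims, quantile, FUN.VALUE = numeric(2), probs = c(0.25, 0.75))

str(one_value)
str(many_values)
```

If downstream code expects one shape and gets the other, it can fail in confusing ways, so it is worth checking the result with str() or is.matrix() before using it.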
List Output with lapply
tidy_stat_tbl(
  tn, y, quantile, "list", na.rm = TRUE
)
$sim_number_1
0% 25% 50% 75% 100%
-1.96661716 -0.55931702 -0.07264039 0.69817699 2.16895597
tidy_stat_tbl(
  tn, y, quantile, "list", na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)
$sim_number_1
2.5% 50% 97.5%
-1.59190149 -0.07264039 1.77074730
Tibble output with tibble
tidy_stat_tbl(
  tn, y, quantile, "tibble", na.rm = TRUE
)
# A tibble: 5 × 3
sim_number name quantile
<fct> <chr> <dbl>
1 1 0% -1.97
2 1 25% -0.559
3 1 50% -0.0726
4 1 75% 0.698
5 1 100% 2.17
tidy_stat_tbl(
  tn, y, quantile, "tibble", na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)
# A tibble: 3 × 3
sim_number name quantile
<fct> <chr> <dbl>
1 1 2.5% -1.59
2 1 50% -0.0726
3 1 97.5% 1.77
Tibble output with data.table
The output object is a tibble, but data.table is used to perform the calculations, which can be considerably faster when the number of simulations is large. I will showcase this further down the post.
tidy_stat_tbl(
  tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE
)
# A tibble: 5 × 3
sim_number name quantile
<fct> <fct> <dbl>
1 1 0% -1.97
2 1 25% -0.559
3 1 50% -0.0726
4 1 75% 0.698
5 1 100% 2.17
tidy_stat_tbl(
  tn, y, quantile, .use_data_table = TRUE, na.rm = TRUE,
  probs = c(0.025, 0.5, 0.975)
)
# A tibble: 3 × 3
sim_number name quantile
<fct> <fct> <dbl>
1 1 2.5% -1.59
2 1 50% -0.0726
3 1 97.5% 1.77
Now let’s take a look with multiple simulations.
Multiple Simulations
Let’s set our simulation count to 5. While this is not a large number, it will serve as a good illustration of the outputs.
ns <- 5
f  <- quantile
nr <- TRUE
p  <- c(0.025, 0.975)
Ok let’s run the same simulations but with the updated params.
Vector Output BE CAREFUL IT USES SAPPLY
set.seed(123)
tn <- tidy_normal(.num_sims = ns)

tidy_stat_tbl(
  .data = tn,
  .x = y,
  .return_type = "vector",
  .fns = f,
  na.rm = nr,
  probs = p
)
sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
2.5% -1.591901 -1.474945 -1.656679 -1.258156 -1.309749
97.5% 1.770747 1.933653 1.894424 2.098923 1.943384
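The matrix above and the long layout that the tibble return type produces hold the same numbers, just arranged differently. As a rough illustration (plain base R, with the values typed in by hand from the output above, and not how the package does it), the matrix can be unrolled into one row per simulation and quantile like this:

```r
# Values copied from the vector output above
m <- rbind(
  "2.5%"  = c(-1.591901, -1.474945, -1.656679, -1.258156, -1.309749),
  "97.5%" = c( 1.770747,  1.933653,  1.894424,  2.098923,  1.943384)
)
colnames(m) <- paste0("sim_number_", 1:5)

# as.table() + as.data.frame() unrolls the matrix column by column
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("name", "sim_number", "value")

# Keep only the simulation index and put the columns in tibble-style order
long$sim_number <- sub("sim_number_", "", long$sim_number, fixed = TRUE)
long <- long[, c("sim_number", "name", "value")]
print(long)
```

The wide matrix is handy for eyeballing simulations side by side, while the long form is usually easier to filter, group, or plot.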
tidy_stat_tbl(
  tn, y, .fns = f, .return_type = "vector", na.rm = nr
)
sim_number_1 sim_number_2 sim_number_3 sim_number_4 sim_number_5
0% -1.96661716 -2.3091689 -2.0532472 -1.31080153 -1.3598407
25% -0.55931702 -0.3612969 -0.9505826 -0.49541417 -0.7140627
50% -0.07264039 0.1525789 -0.3048700 -0.07675993 -0.2240352
75% 0.69817699 0.6294358 0.2900859 0.55145766 0.5287605
100% 2.16895597 2.1873330 2.1001089 3.24103993 2.1988103
List Output with lapply
tidy_stat_tbl(
  tn, y, f, "list", na.rm = nr
)
$sim_number_1
0% 25% 50% 75% 100%
-1.96661716 -0.55931702 -0.07264039 0.69817699 2.16895597
$sim_number_2
0% 25% 50% 75% 100%
-2.3091689 -0.3612969 0.1525789 0.6294358 2.1873330
$sim_number_3
0% 25% 50% 75% 100%
-2.0532472 -0.9505826 -0.3048700 0.2900859 2.1001089
$sim_number_4
0% 25% 50% 75% 100%
-1.31080153 -0.49541417 -0.07675993 0.55145766 3.24103993
$sim_number_5
0% 25% 50% 75% 100%
-1.3598407 -0.7140627 -0.2240352 0.5287605 2.1988103
tidy_stat_tbl(
  tn, y, f, "list", na.rm = nr,
  probs = p
)
$sim_number_1
2.5% 97.5%
-1.591901 1.770747
$sim_number_2
2.5% 97.5%
-1.474945 1.933653
$sim_number_3
2.5% 97.5%
-1.656679 1.894424
$sim_number_4
2.5% 97.5%
-1.258156 2.098923
$sim_number_5
2.5% 97.5%
-1.309749 1.943384
Tibble output with tibble
tidy_stat_tbl(
  tn, y, f, "tibble", na.rm = nr
)
# A tibble: 25 × 3
sim_number name f
<fct> <chr> <dbl>
1 1 0% -1.97
2 1 25% -0.559
3 1 50% -0.0726
4 1 75% 0.698
5 1 100% 2.17
6 2 0% -2.31
7 2 25% -0.361
8 2 50% 0.153
9 2 75% 0.629
10 2 100% 2.19
# … with 15 more rows
tidy_stat_tbl(
  tn, y, f, "tibble", na.rm = nr,
  probs = p
)
# A tibble: 10 × 3
sim_number name f
<fct> <chr> <dbl>
1 1 2.5% -1.59
2 1 97.5% 1.77
3 2 2.5% -1.47
4 2 97.5% 1.93
5 3 2.5% -1.66
6 3 97.5% 1.89
7 4 2.5% -1.26
8 4 97.5% 2.10
9 5 2.5% -1.31
10 5 97.5% 1.94
Tibble output with data.table
The output object is a tibble, but data.table is used to perform the calculations, which can be considerably faster when the number of simulations is large. I will showcase this further down the post.
tidy_stat_tbl(
  tn, y, f, .use_data_table = TRUE, na.rm = nr
)
# A tibble: 25 × 3
sim_number name f
<fct> <fct> <dbl>
1 1 0% -1.97
2 1 25% -0.559
3 1 50% -0.0726
4 1 75% 0.698
5 1 100% 2.17
6 2 0% -2.31
7 2 25% -0.361
8 2 50% 0.153
9 2 75% 0.629
10 2 100% 2.19
# … with 15 more rows
tidy_stat_tbl(
  tn, y, f, .use_data_table = TRUE, na.rm = nr,
  probs = p
)
# A tibble: 10 × 3
sim_number name f
<fct> <fct> <dbl>
1 1 2.5% -1.59
2 1 97.5% 1.77
3 2 2.5% -1.47
4 2 97.5% 1.93
5 3 2.5% -1.66
6 3 97.5% 1.89
7 4 2.5% -1.26
8 4 97.5% 2.10
9 5 2.5% -1.31
10 5 97.5% 1.94
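For readers curious what a grouped data.table computation of this kind could look like, here is a small sketch. It assumes the {data.table} package is installed and uses made-up data (dt is a hypothetical stand-in for tidy_ output); it is one plausible way to get a long result per simulation, not a description of what tidy_stat_tbl() does internally.

```r
library(data.table)

# Hypothetical stand-in data: 5 simulations of 50 standard normal draws each
set.seed(123)
dt <- data.table(
  sim_number = factor(rep(1:5, each = 50)),
  y = rnorm(250)
)

# One grouped pass: each quantile becomes its own column, one row per simulation
wide <- dt[, as.list(quantile(y, probs = c(0.025, 0.975), na.rm = TRUE)), by = sim_number]

# Reshape to long form; melt() stores the former column names as a factor by default
long <- melt(
  wide,
  id.vars = "sim_number",
  variable.name = "name",
  value.name = "quantile"
)
setorder(long, sim_number, name)
print(long)
```

Because melt() returns the variable column as a factor by default, a route like this would be consistent with the name column showing as <fct> in the data.table-backed output and <chr> in the plain tibble output, although whether the package takes this route is only a guess.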
Ok, now that we have shown that, let’s ratchet up the simulations so we can see the true difference in using the .use_data_table parameter when simulations are large. We are going to use {rbenchmark} for this.
Benchmarking
Here we go. We are going to make a tidy_bootstrap() of the mtcars$mpg data, which will produce 2000 simulations, and we will replicate each test 25 times.
library(rbenchmark)
library(TidyDensity)
library(dplyr)

# Get the interesting vector, well for this anyways
x <- mtcars$mpg

# Bootstrap the vector (2k simulations is default)
tb <- tidy_bootstrap(x) %>%
  bootstrap_unnest_tbl()

benchmark(
  "tibble" = {
    tidy_stat_tbl(tb, y, IQR, "tibble")
  },
  "data.table" = {
    tidy_stat_tbl(tb, y, IQR, .use_data_table = TRUE, type = 7)
  },
  "sapply" = {
    tidy_stat_tbl(tb, y, IQR, "vector")
  },
  "lapply" = {
    tidy_stat_tbl(tb, y, IQR, "list")
  },
  replications = 25,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self")
) %>%
  arrange(relative)
test replications elapsed relative user.self sys.self
1 data.table 25 4.11 1.000 3.33 0.11
2 lapply 25 24.14 5.873 20.02 0.38
3 sapply 25 25.11 6.109 21.01 0.28
4 tibble 25 33.18 8.073 27.45 0.51
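The relative column reads as each test's elapsed time divided by the fastest elapsed time. Here is a quick check using the numbers reported above, assuming that is how {rbenchmark} derives the column (the arithmetic does reproduce the printed values):

```r
# Elapsed times copied from the benchmark table above
elapsed <- c(data.table = 4.11, lapply = 24.14, sapply = 25.11, tibble = 33.18)

# Ratio of each test to the fastest one, rounded to three decimals
relative <- round(elapsed / min(elapsed), 3)
print(relative)
```

So in this run the data.table path came out roughly six to eight times faster than the other three options. Timings will of course vary by machine.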
Voila!