<- mtcars$mpg
x
x
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
Steven P. Sanderson II, MPH
November 4, 2022
Many times in the real world we have a data set which is actually a sample as we typically do not know what the actual population is. This is where bootstrapping tends to come into play. It allows us to get a hold on what the possible parameter values are by taking repeated samples of the data that is available to us.
At it’s core it is a resampling method with replacement where it assigns measures of accuracy to the sample estimates. Here is the Wikipedia Article for bootstrapping.
In this post I am going to go over how to use the bootstrap function set with {TidyDensity}
. You can find the pkgdown
site with all function references here: TidyDensity
The first thing we will need is a dataset, and for this we are going to pick on the mtcars
dataset and more specifically the mpg
column. So let’s get to it!
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
We see that x
is a numeric vector, which is what we need for these {TidyDensity}
functions. The first function that x
will be put through is a function called tidy_bootstrap()
Let’s take a look at the full function call and parameters of this function.
What you see above are the defaults for the function. Now lets go through the parameters.
.x
- This is of course the numeric vector that you are passing to the function, in our case right now, it is x
which we set to mtcars$mpg
.num_sims
- This is how many simulations you want to run of x
. This is done with replacement. So this is dictating how many bootstrap samples of x
we want to take.
.proportion
- How much of the data do you want to sample? The default here is 80%
.distribution_type
- What kind of distribution are you sampling from? Is it a continuous or discrete distribution. This is important for plotting.
The function returns a tibble
with the bootstrap column as a list
object. Lets take a look at tidy_bootstrap(x)
. We are going to set simulations to 50 instead of the default 2000.
# A tibble: 50 × 2
sim_number bootstrap_samples
<fct> <list>
1 1 <dbl [25]>
2 2 <dbl [25]>
3 3 <dbl [25]>
4 4 <dbl [25]>
5 5 <dbl [25]>
6 6 <dbl [25]>
7 7 <dbl [25]>
8 8 <dbl [25]>
9 9 <dbl [25]>
10 10 <dbl [25]>
# … with 40 more rows
The column bootstrap_samples
holds the bootstrapped resamples of x
at the given .proportion
, in this instance, 80%.
From this point we can go straight into use the bootstrap_stat_plot()
function if we choose. Under-the-hood it will make use of bootstrap_unnest_tbl()
. All this function does is act as a helper to unnest the bootstrap_samples
column of the returned tibble from tidy_bootstrap()
Let’s take a look below.
# A tibble: 1,250 × 2
sim_number y
<fct> <dbl>
1 1 22.8
2 1 15.5
3 1 21.5
4 1 15
5 1 10.4
6 1 22.8
7 1 16.4
8 1 30.4
9 1 26
10 1 19.2
# … with 1,240 more rows
Now let’s get into the bootstrap_stat_plot()
function of {TidyDensity}
The function bootstrap_stat_plot()
was designed to handle data either from the tidy_bootstrap()
or bootstrap_unnest_tbl()
functions only. This was to ensure that the right type of data was being passed in and to ensure that the right type of output was guaranteed.
Let’s take a full look at the function call.
There are a few interesting parameters here, but like before we will go through all of them.
.data
- This is the data that gets passed from either tidy_bootstrap()
or bootstrap_unnest_tbl()
.value
- This is the column from bootstrap_unnest_tbl()
that you want to visualize, this is typically y
.stat
- There are multiple cumulative stats that will work with this plot. These are all built directly into the {TidyDensity} package. You can find the supported ones that are built into this package at the reference page.
.show_groups
- Do you want to show all of the simulation groups TRUE/FALSE
.show_ci_labels
- If set to TRUE then the confidence interval labels will be shows on the graph as the final value.
.interactive
- Do you want a plotly
plot? Who doesn’t?
Now let’s walk though a few examples.
You can see from this output that the statistic you choose is printed in the chart title and on the y axis, the caption will also tell you how many simulations are present. Lets look at skewness as another example.
Volia!