Compare Empirical Data to Distributions
Source: R/utils-distribution-comparison.R
tidy_distribution_comparison.Rd
Compare an empirical data set against several candidate distributions to help identify the one that fits best.
Arguments
- .x
The data set being passed to the function
- .distribution_type
The kind of data being passed; must be either "continuous" or "discrete"
- .round_to_place
The number of decimal places to which the parameter estimates are rounded before the distributions are constructed. The default is 3
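As a minimal sketch of how the three arguments fit together, the call below names each one explicitly. It assumes the package that provides this function (TidyDensity) is installed; the data and the choice of two decimal places are illustrative only.

```r
# Assumes the TidyDensity package is installed.
library(TidyDensity)

x <- mtcars$mpg  # illustrative continuous data

# Name every argument explicitly; .round_to_place = 2 is an arbitrary choice.
out <- tidy_distribution_comparison(
  .x = x,
  .distribution_type = "continuous",
  .round_to_place = 2
)

# The result is a list of tibbles; list their names.
names(out)
```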
Details
The purpose of this function is to take the data set provided and try to find the distribution that may fit it best. The .distribution_type parameter must be set to either continuous or discrete so that the function tries the appropriate types of distributions.
The following distributions are used:
Continuous:
tidy_beta
tidy_cauchy
tidy_chisquare
tidy_exponential
tidy_gamma
tidy_logistic
tidy_lognormal
tidy_normal
tidy_pareto
tidy_uniform
tidy_weibull
Discrete:
tidy_binomial
tidy_geometric
tidy_hypergeometric
tidy_poisson
The function itself returns a list output of tibbles. Here are the tibbles that are returned:
comparison_tbl
deviance_tbl
total_deviance_tbl
aic_tbl
kolmogorov_smirnov_tbl
multi_metric_tbl
The comparison_tbl is a long tibble that lists the values of the density function against the given data.
The deviance_tbl and the total_deviance_tbl give the simple difference between the actual density and the estimated density for each estimated distribution.
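The idea of a simple density difference can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own computation: the grid size, the use of density() for the empirical side, the normal candidate, and the names point_diff and abs_tot_diff are all choices made for this example, and the function may scale or align values differently.

```r
x <- mtcars$mpg

# Kernel density estimate of the data (standing in for the "actual" density).
emp <- density(x, n = 50)

# Density of a candidate normal distribution on the same grid, using the
# sample mean and the maximum-likelihood standard deviation.
mu <- mean(x)
sigma <- sqrt(mean((x - mu)^2))
est_dy <- dnorm(emp$x, mean = mu, sd = sigma)

# Point-wise difference between the two densities, then one summary number.
point_diff <- emp$y - est_dy
abs_tot_diff <- abs(sum(point_diff))
abs_tot_diff
```

A smaller absolute total suggests that the candidate density tracks the empirical one more closely on this grid.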
The aic_tbl provides the AIC for the likelihood of each distribution.
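For readers unfamiliar with AIC, the sketch below computes it by hand for a single candidate using the standard definition AIC = 2k - 2 log L. It assumes a normal candidate fitted by maximum likelihood; how the function fits and scores each distribution internally is not shown here.

```r
x <- mtcars$mpg
n <- length(x)

# Maximum-likelihood estimates for a normal distribution.
mu <- mean(x)
sigma <- sqrt(sum((x - mu)^2) / n)

# Log-likelihood of the data under the fitted normal.
log_lik <- sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))

# AIC = 2k - 2 * logLik, with k = 2 estimated parameters (mean and sd).
k <- 2
aic_normal <- 2 * k - 2 * log_lik
aic_normal
```

Lower values indicate a better relative fit. The value here rounds to 209, in line with the Gaussian row of aic_tbl in the Examples.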
The kolmogorov_smirnov_tbl currently provides the two.sided estimate from ks.test of the estimated density against the empirical density.
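A two-sided ks.test comparison of this kind can be sketched in base R as shown below. The normal candidate, the seed, and the simulated sample are assumptions made for illustration. The example output reports a Monte-Carlo method, whereas this sketch uses the default p-value computation, so the numbers will not match the tables.

```r
set.seed(123)  # make the simulated draw reproducible

x <- mtcars$mpg

# Simulate a sample from a candidate normal distribution fitted to the data.
mu <- mean(x)
sigma <- sqrt(mean((x - mu)^2))
sim <- rnorm(length(x), mean = mu, sd = sigma)

# Two-sample, two-sided Kolmogorov-Smirnov test of simulated vs. observed data.
ks <- ks.test(sim, x, alternative = "two.sided")

ks$statistic  # largest gap between the two empirical CDFs
ks$p.value    # small values suggest the samples differ in distribution
```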
The multi_metric_tbl summarises all of these metrics in a single tibble.
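One way to read the combined table is to rank the candidates by a single metric. The sketch below assumes the TidyDensity package is installed and uses the column names that appear in the example output (dist_type, aic_value, ks_pvalue); the ranking rules themselves are a suggestion, not something the function prescribes.

```r
# Assumes the TidyDensity package is installed.
library(TidyDensity)

out <- tidy_distribution_comparison(mtcars$mpg, "continuous")
mm <- out$multi_metric_tbl

# Rank candidates by KS p-value, largest first.
mm[order(-mm$ks_pvalue), c("dist_type", "ks_pvalue")]

# Rank by AIC, smallest first, dropping rows where AIC is NA, NaN or Inf.
aic_ok <- mm[is.finite(mm$aic_value), ]
aic_ok[order(aic_ok$aic_value), c("dist_type", "aic_value")]
```

The metrics need not agree on a single winner, so it is worth inspecting more than one column before settling on a distribution.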
Examples
xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.
#> Warning: NaNs produced
#> There was no need to scale the data.
#> Warning: There were 97 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `aic_value = dplyr::case_when(...)`.
#> Caused by warning in `actuar::dpareto()`:
#> ! NaNs produced
#> ℹ Run dplyr::last_dplyr_warnings() to see the 96 remaining warnings.
xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")
#> There was no need to scale the data.
#> Warning: There were 12 warnings in `dplyr::mutate()`.
#> The first warning was:
#> ℹ In argument: `aic_value = dplyr::case_when(...)`.
#> Caused by warning in `actuar::dpareto()`:
#> ! NaNs produced
#> ℹ Run dplyr::last_dplyr_warnings() to see the 11 remaining warnings.
output_c
#> $comparison_tbl
#> # A tibble: 384 × 8
#> sim_number x y dx dy p q dist_type
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 21 2.97 0.000114 0.625 10.4 Empirical
#> 2 1 2 21 4.21 0.000455 0.625 10.4 Empirical
#> 3 1 3 22.8 5.44 0.00142 0.781 13.3 Empirical
#> 4 1 4 21.4 6.68 0.00355 0.688 14.3 Empirical
#> 5 1 5 18.7 7.92 0.00721 0.469 14.7 Empirical
#> 6 1 6 18.1 9.16 0.0124 0.438 15 Empirical
#> 7 1 7 14.3 10.4 0.0192 0.125 15.2 Empirical
#> 8 1 8 24.4 11.6 0.0281 0.812 15.2 Empirical
#> 9 1 9 22.8 12.9 0.0395 0.781 15.5 Empirical
#> 10 1 10 19.2 14.1 0.0516 0.531 15.8 Empirical
#> # ℹ 374 more rows
#>
#> $deviance_tbl
#> # A tibble: 384 × 2
#> name value
#> <chr> <dbl>
#> 1 Empirical 0.451
#> 2 Beta c(1.107, 1.577, 0) 0.287
#> 3 Cauchy c(19.2, 7.375) -0.0169
#> 4 Chisquare c(20.243, 0) -0.106
#> 5 Exponential c(0.05) 0.230
#> 6 Gamma c(11.47, 1.752) -0.0322
#> 7 Logistic c(20.091, 3.27) 0.193
#> 8 Lognormal c(2.958, 0.293) 0.283
#> 9 Pareto c(10.4, 1.624) 0.446
#> 10 Uniform c(8.341, 31.841) 0.242
#> # ℹ 374 more rows
#>
#> $total_deviance_tbl
#> # A tibble: 11 × 2
#> dist_with_params abs_tot_deviance
#> <chr> <dbl>
#> 1 Gamma c(11.47, 1.752) 0.0235
#> 2 Chisquare c(20.243, 0) 0.462
#> 3 Beta c(1.107, 1.577, 0) 0.640
#> 4 Uniform c(8.341, 31.841) 1.11
#> 5 Weibull c(3.579, 22.288) 1.34
#> 6 Cauchy c(19.2, 7.375) 1.56
#> 7 Logistic c(20.091, 3.27) 2.74
#> 8 Lognormal c(2.958, 0.293) 4.72
#> 9 Gaussian c(20.091, 5.932) 4.74
#> 10 Pareto c(10.4, 1.624) 6.95
#> 11 Exponential c(0.05) 7.67
#>
#> $aic_tbl
#> # A tibble: 11 × 3
#> dist_type aic_value abs_aic
#> <fct> <dbl> <dbl>
#> 1 Beta c(1.107, 1.577, 0) NA NA
#> 2 Cauchy c(19.2, 7.375) 218. 218.
#> 3 Chisquare c(20.243, 0) NA NA
#> 4 Exponential c(0.05) 258. 258.
#> 5 Gamma c(11.47, 1.752) 206. 206.
#> 6 Logistic c(20.091, 3.27) 209. 209.
#> 7 Lognormal c(2.958, 0.293) 206. 206.
#> 8 Pareto c(10.4, 1.624) 260. 260.
#> 9 Uniform c(8.341, 31.841) 206. 206.
#> 10 Weibull c(3.579, 22.288) 209. 209.
#> 11 Gaussian c(20.091, 5.932) 209. 209.
#>
#> $kolmogorov_smirnov_tbl
#> # A tibble: 11 × 6
#> dist_type ks_statistic ks_pvalue ks_method alternative dist_char
#> <fct> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Beta c(1.107, 1.577, … 0.75 0.000500 Monte-Ca… two-sided Beta c(1…
#> 2 Cauchy c(19.2, 7.375) 0.469 0.00200 Monte-Ca… two-sided Cauchy c…
#> 3 Chisquare c(20.243, 0) 0.219 0.446 Monte-Ca… two-sided Chisquar…
#> 4 Exponential c(0.05) 0.469 0.00100 Monte-Ca… two-sided Exponent…
#> 5 Gamma c(11.47, 1.752) 0.156 0.847 Monte-Ca… two-sided Gamma c(…
#> 6 Logistic c(20.091, 3.… 0.125 0.976 Monte-Ca… two-sided Logistic…
#> 7 Lognormal c(2.958, 0.… 0.281 0.160 Monte-Ca… two-sided Lognorma…
#> 8 Pareto c(10.4, 1.624) 0.719 0.000500 Monte-Ca… two-sided Pareto c…
#> 9 Uniform c(8.341, 31.8… 0.188 0.621 Monte-Ca… two-sided Uniform …
#> 10 Weibull c(3.579, 22.2… 0.219 0.443 Monte-Ca… two-sided Weibull …
#> 11 Gaussian c(20.091, 5.… 0.156 0.833 Monte-Ca… two-sided Gaussian…
#>
#> $multi_metric_tbl
#> # A tibble: 11 × 8
#> dist_type abs_tot_deviance aic_value abs_aic ks_statistic ks_pvalue ks_method
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Gamma c(… 0.0235 206. 206. 0.156 0.847 Monte-Ca…
#> 2 Chisquar… 0.462 NA NA 0.219 0.446 Monte-Ca…
#> 3 Beta c(1… 0.640 NA NA 0.75 0.000500 Monte-Ca…
#> 4 Uniform … 1.11 206. 206. 0.188 0.621 Monte-Ca…
#> 5 Weibull … 1.34 209. 209. 0.219 0.443 Monte-Ca…
#> 6 Cauchy c… 1.56 218. 218. 0.469 0.00200 Monte-Ca…
#> 7 Logistic… 2.74 209. 209. 0.125 0.976 Monte-Ca…
#> 8 Lognorma… 4.72 206. 206. 0.281 0.160 Monte-Ca…
#> 9 Gaussian… 4.74 209. 209. 0.156 0.833 Monte-Ca…
#> 10 Pareto c… 6.95 260. 260. 0.719 0.000500 Monte-Ca…
#> 11 Exponent… 7.67 258. 258. 0.469 0.00100 Monte-Ca…
#> # ℹ 1 more variable: alternative <chr>
#>
#> attr(,".x")
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32
output_d
#> $comparison_tbl
#> # A tibble: 160 × 8
#> sim_number x y dx dy p q dist_type
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 21 2.95 0.000120 0.719 10 Empirical
#> 2 1 2 21 4.14 0.000487 0.719 10 Empirical
#> 3 1 3 22 5.34 0.00154 0.781 13 Empirical
#> 4 1 4 21 6.54 0.00383 0.719 14 Empirical
#> 5 1 5 18 7.74 0.00766 0.469 14 Empirical
#> 6 1 6 18 8.93 0.0129 0.469 15 Empirical
#> 7 1 7 14 10.1 0.0194 0.156 15 Empirical
#> 8 1 8 24 11.3 0.0282 0.812 15 Empirical
#> 9 1 9 22 12.5 0.0397 0.781 15 Empirical
#> 10 1 10 19 13.7 0.0524 0.562 15 Empirical
#> # ℹ 150 more rows
#>
#> $deviance_tbl
#> # A tibble: 160 × 2
#> name value
#> <chr> <dbl>
#> 1 Empirical 0.478
#> 2 Binomial c(32, 0.031) 0.145
#> 3 Geometric c(0.048) -0.000463
#> 4 Hypergeometric c(21, 11, 21) -0.322
#> 5 Poisson c(19.688) -0.0932
#> 6 Empirical 0.478
#> 7 Binomial c(32, 0.031) 0.478
#> 8 Geometric c(0.048) 0.361
#> 9 Hypergeometric c(21, 11, 21) -0.122
#> 10 Poisson c(19.688) 0.193
#> # ℹ 150 more rows
#>
#> $total_deviance_tbl
#> # A tibble: 4 × 2
#> dist_with_params abs_tot_deviance
#> <chr> <dbl>
#> 1 Hypergeometric c(21, 11, 21) 2.52
#> 2 Binomial c(32, 0.031) 2.81
#> 3 Poisson c(19.688) 3.19
#> 4 Geometric c(0.048) 6.07
#>
#> $aic_tbl
#> # A tibble: 4 × 3
#> dist_type aic_value abs_aic
#> <fct> <dbl> <dbl>
#> 1 Binomial c(32, 0.031) Inf Inf
#> 2 Geometric c(0.048) 258. 258.
#> 3 Hypergeometric c(21, 11, 21) NaN NaN
#> 4 Poisson c(19.688) 210. 210.
#>
#> $kolmogorov_smirnov_tbl
#> # A tibble: 4 × 6
#> dist_type ks_statistic ks_pvalue ks_method alternative dist_char
#> <fct> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Binomial c(32, 0.031) 0.719 0.000500 Monte-Ca… two-sided Binomial…
#> 2 Geometric c(0.048) 0.5 0.00200 Monte-Ca… two-sided Geometri…
#> 3 Hypergeometric c(21, 1… 0.625 0.000500 Monte-Ca… two-sided Hypergeo…
#> 4 Poisson c(19.688) 0.156 0.850 Monte-Ca… two-sided Poisson …
#>
#> $multi_metric_tbl
#> # A tibble: 4 × 8
#> dist_type abs_tot_deviance aic_value abs_aic ks_statistic ks_pvalue ks_method
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Hypergeom… 2.52 NaN NaN 0.625 0.000500 Monte-Ca…
#> 2 Binomial … 2.81 Inf Inf 0.719 0.000500 Monte-Ca…
#> 3 Poisson c… 3.19 210. 210. 0.156 0.850 Monte-Ca…
#> 4 Geometric… 6.07 258. 258. 0.5 0.00200 Monte-Ca…
#> # ℹ 1 more variable: alternative <chr>
#>
#> attr(,".x")
#> [1] 21 21 22 21 18 18 14 24 22 19 17 16 17 15 10 10 14 32 30 33 21 15 15 13 19
#> [26] 27 26 30 15 19 15 21
#> attr(,".n")
#> [1] 32