tidyAML: Now supporting gee models

rtip
tidyaml
regression
classification
Author

Steven P. Sanderson II, MPH

Published

December 1, 2023

Introduction

I am happy to announce that a new version of tidyAML is now available on CRAN. This version includes support for gee models. This is a big step forward for tidyAML as it now supports a wide variety of regression and classification models.

What is tidyAML?

The package is meant to be a way to quickly create many models of different algorithms, and this release is a small step forward in supporting that mission. The package is built upon the tidymodels packages.

What is gee?

gee stands for Generalized Estimating Equations. It is a way to model correlated data. For example, if you have a dataset where you have multiple observations per person, you may want to use gee to account for the correlation between observations within a person. This is a very common situation in longitudinal studies. Here is a link to the gee parsnip model.

Example

Let’s see how it works, we will examine how it is setup to faily gracefully and how it works when the multilevelmod package is installed and loaded into your environment.

Load Library

library(tidyAML)

Now, let’s build a model that will fail, it’s important I think to see the failure message so you can understand what is happening. It’s likely because the library is not loaded, let’s face it, it has happened to all of us.

library(recipes)
library(dplyr)

rec_obj <- recipe(mpg ~ ., data = mtcars)
frt_tbl <- fast_regression(
  mtcars, 
  rec_obj, 
  .parsnip_eng = c("lm","glm","gee"),
  .parsnip_fns = "linear_reg"
)

glimpse(frt_tbl)
Rows: 3
Columns: 8
$ .model_id       <int> 1, 2, 3
$ .parsnip_engine <chr> "lm", "gee", "glm"
$ .parsnip_mode   <chr> "regression", "regression", "regression"
$ .parsnip_fns    <chr> "linear_reg", "linear_reg", "linear_reg"
$ model_spec      <list> [~NULL, ~NULL, NULL, regression, TRUE, NULL, lm, TRUE]…
$ wflw            <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ fitted_wflw     <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ pred_wflw       <list> [<tbl_df[8 x 1]>], <NULL>, [<tbl_df[8 x 1]>]
frt_tbl |> pull(pred_wflw) 
[[1]]
# A tibble: 8 × 1
  .pred
  <dbl>
1 22.9 
2 18.0 
3 21.1 
4 32.9 
5 18.5 
6 10.4 
7  9.59
8 24.7 

[[2]]
NULL

[[3]]
# A tibble: 8 × 1
  .pred
  <dbl>
1 22.9 
2 18.0 
3 21.1 
4 32.9 
5 18.5 
6 10.4 
7  9.59
8 24.7 
frt_tbl |> pull(fitted_wflw) |> purrr::map(broom::tidy)
[[1]]
# A tibble: 11 × 5
   term         estimate std.error statistic p.value
   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept) -40.3       30.5       -1.32   0.209 
 2 cyl           0.907      1.09       0.830  0.421 
 3 disp          0.0105     0.0189     0.557  0.587 
 4 hp           -0.00487    0.0248    -0.196  0.847 
 5 drat          1.73       2.04       0.846  0.413 
 6 wt           -5.71       2.35      -2.43   0.0302
 7 qsec          3.39       1.34       2.54   0.0248
 8 vs           -3.85       2.99      -1.29   0.220 
 9 am            2.16       2.29       0.942  0.364 
10 gear          1.40       1.68       0.838  0.417 
11 carb          0.200      0.835      0.239  0.815 

[[2]]
# A tibble: 0 × 0

[[3]]
# A tibble: 11 × 5
   term         estimate std.error statistic p.value
   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept) -40.3       30.5       -1.32   0.209 
 2 cyl           0.907      1.09       0.830  0.421 
 3 disp          0.0105     0.0189     0.557  0.587 
 4 hp           -0.00487    0.0248    -0.196  0.847 
 5 drat          1.73       2.04       0.846  0.413 
 6 wt           -5.71       2.35      -2.43   0.0302
 7 qsec          3.39       1.34       2.54   0.0248
 8 vs           -3.85       2.99      -1.29   0.220 
 9 am            2.16       2.29       0.942  0.364 
10 gear          1.40       1.68       0.838  0.417 
11 carb          0.200      0.835      0.239  0.815 

We see that the gee model failed. This is because we did not load the multilevelmod package. Let’s load it and try again.

library(multilevelmod)

frt_tbl <- fast_regression(
  mtcars, 
  rec_obj, 
  .parsnip_eng = c("lm","glm","gee"),
  .parsnip_fns = "linear_reg"
)

glimpse(frt_tbl)
Rows: 3
Columns: 8
$ .model_id       <int> 1, 2, 3
$ .parsnip_engine <chr> "lm", "gee", "glm"
$ .parsnip_mode   <chr> "regression", "regression", "regression"
$ .parsnip_fns    <chr> "linear_reg", "linear_reg", "linear_reg"
$ model_spec      <list> [~NULL, ~NULL, NULL, regression, TRUE, NULL, lm, TRUE]…
$ wflw            <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ fitted_wflw     <list> [cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mp…
$ pred_wflw       <list> [<tbl_df[8 x 1]>], [<tbl_df[8 x 1]>], [<tbl_df[8 x 1]>…
frt_tbl |> pull(pred_wflw) 
[[1]]
# A tibble: 8 × 1
  .pred
  <dbl>
1  20.6
2  15.6
3  15.8
4  26.1
5  27.8
6  16.6
7  25.4
8  21.6

[[2]]
# A tibble: 8 × 1
  .pred
  <dbl>
1  21.3
2  15.2
3  15.3
4  25.7
5  27.3
6  16.5
7  25.7
8  20.5

[[3]]
# A tibble: 8 × 1
  .pred
  <dbl>
1  20.6
2  15.6
3  15.8
4  26.1
5  27.8
6  16.6
7  25.4
8  21.6
frt_tbl |> pull(fitted_wflw) |> purrr::map(broom::tidy)
[[1]]
# A tibble: 11 × 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept) -3.70      18.9       -0.195  0.848 
 2 cyl          0.668      1.03       0.647  0.529 
 3 disp         0.00831    0.0153     0.544  0.596 
 4 hp          -0.0124     0.0173    -0.717  0.486 
 5 drat         2.87       1.44       2.00   0.0674
 6 wt          -2.87       1.40      -2.05   0.0609
 7 qsec         0.777      0.618      1.26   0.231 
 8 vs           0.169      1.64       0.103  0.920 
 9 am           1.90       1.79       1.07   0.306 
10 gear         1.31       1.44       0.907  0.381 
11 carb        -0.601      0.730     -0.823  0.425 

[[2]]
# A tibble: 10 × 6
   term         estimate std.error statistic p.value        ``
   <chr>           <dbl>     <dbl>     <dbl>   <dbl>     <dbl>
 1 (Intercept)  6.54       10.2     0.643     4.01    1.63    
 2 disp         0.0119      0.0140  0.849     0.0134  0.887   
 3 hp          -0.0149      0.0165 -0.901     0.0104 -1.43    
 4 drat         2.36        1.18    2.01      0.859   2.75    
 5 wt          -3.01        1.35   -2.23      1.35   -2.23    
 6 qsec         0.577       0.524   1.10      0.193   2.99    
 7 vs           0.000922    1.58    0.000582  1.07    0.000860
 8 am           1.35        1.54    0.880     0.886   1.53    
 9 gear         1.00        1.33    0.752     0.527   1.90    
10 carb        -0.355       0.611  -0.582     0.378  -0.940   

[[3]]
# A tibble: 11 × 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept) -3.70      18.9       -0.195  0.848 
 2 cyl          0.668      1.03       0.647  0.529 
 3 disp         0.00831    0.0153     0.544  0.596 
 4 hp          -0.0124     0.0173    -0.717  0.486 
 5 drat         2.87       1.44       2.00   0.0674
 6 wt          -2.87       1.40      -2.05   0.0609
 7 qsec         0.777      0.618      1.26   0.231 
 8 vs           0.169      1.64       0.103  0.920 
 9 am           1.90       1.79       1.07   0.306 
10 gear         1.31       1.44       0.907  0.381 
11 carb        -0.601      0.730     -0.823  0.425 

The recipe object

Let’s take a look at the model spec, the workflow and the fitted workflow to see what has happened behind the scenes. First let’s look at the formula in the recipe object.

formula(prep(rec_obj))
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
<environment: 0x0000012c521f0c98>

So we see what the formula is, but we also see that there is no id_var listed which means it would fail for the gee specification, but, our model worked! That is because behind the scenes in the fast sense at this point in development the first variable is taken to be the id variable. Let’s see what the workflow looks like.

gee_wflw <- frt_tbl |> slice(2) |> pull(wflw)

gee_wflw[[1]]$fit$actions$model$formula
mpg ~ id_var(cyl) + disp + hp + drat + wt + qsec + vs + am + 
    gear + carb
<environment: 0x0000012c5206d0f8>

As we can see the first predictor variable has been chosen as the id variable. This will change in the future, where you will be able to choose the id_var.

Conclusion

Thank you for reading and I hope you enjoy the new version of tidyAML. Please let me know if you have any questions or comments. We will continue our development of tidyAML and hope to have more updates soon.