Auto Prep data for XGBoost with {healthyR.ai}

code
rtip
healthyrai
xgboost
Author

Steven P. Sanderson II, MPH

Published

November 15, 2022

Introduction

Sometimes we want to quickly format some data and pass it through an algorithm just to see what happens, how crazy things are, and how much prep may lie ahead.

My R package {healthyR.ai} has a set of prepper functions that make a ‘best effort’ to automatically format your data for the algorithm you choose (should it be supported).

Today we will talk about hai_xgboost_data_prepper().

Function

Let’s take a look at the function call.

hai_xgboost_data_prepper(.data, .recipe_formula)

Now let’s go over the arguments that are passed to the function.

  • .data - The data that you are passing to the function. Can be any type of data that is accepted by the data parameter of the recipes::recipe() function.
  • .recipe_formula - The formula that is going to be passed. For example, if you are using the diamonds data, then the formula would most likely be something like price ~ .
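
To make the formula argument a little more concrete, here is a minimal sketch that uses recipes::recipe() directly (not the prepper itself) to show how a formula assigns roles. The object names rec_all and rec_some are made up for illustration.

```r
library(recipes)

# price ~ . means: price is the outcome, every other column is a predictor
rec_all <- recipe(price ~ ., data = ggplot2::diamonds)

# Naming predictors explicitly restricts the recipe to those columns
rec_some <- recipe(price ~ carat + cut, data = ggplot2::diamonds)

# summary() on a recipe lists each variable together with its role
table(summary(rec_all)$role)
table(summary(rec_some)$role)
```

The first table should show one outcome and nine predictors, matching the Inputs block printed by the prepper in the regression example.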

Example

Let’s go over some examples.

library(ggplot2)
library(healthyR.ai)

# Regression
hai_xgboost_data_prepper(.data = diamonds, .recipe_formula = price ~ .)
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          9

Operations:

Factor variables from tidyselect::vars_select_helpers$where(is.charac...
Novel factor level assignment for recipes::all_nominal_predictors()
Dummy variables from recipes::all_nominal_predictors()
Zero variance filter on recipes::all_predictors()

reg_obj <- hai_xgboost_data_prepper(diamonds, price ~ .)
get_juiced_data(reg_obj)
# A tibble: 53,940 × 27
   carat depth table     x     y     z price  cut_1  cut_2  cut_3  cut_4   cut_5
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1  0.23  61.5    55  3.95  3.98  2.43   326  0.359 -0.109 -0.522 -0.567 -0.315 
 2  0.21  59.8    61  3.89  3.84  2.31   326  0.120 -0.436 -0.298  0.378  0.630 
 3  0.23  56.9    65  4.05  4.07  2.31   327 -0.359 -0.109  0.522 -0.567  0.315 
 4  0.29  62.4    58  4.2   4.23  2.63   334  0.120 -0.436 -0.298  0.378  0.630 
 5  0.31  63.3    58  4.34  4.35  2.75   335 -0.359 -0.109  0.522 -0.567  0.315 
 6  0.24  62.8    57  3.94  3.96  2.48   336 -0.120 -0.436  0.298  0.378 -0.630 
 7  0.24  62.3    57  3.95  3.98  2.47   336 -0.120 -0.436  0.298  0.378 -0.630 
 8  0.26  61.9    55  4.07  4.11  2.53   337 -0.120 -0.436  0.298  0.378 -0.630 
 9  0.22  65.1    61  3.87  3.78  2.49   337 -0.598  0.546 -0.373  0.189 -0.0630
10  0.23  59.4    61  4     4.05  2.39   338 -0.120 -0.436  0.298  0.378 -0.630 
# … with 53,930 more rows, and 15 more variables: color_1 <dbl>, color_2 <dbl>,
#   color_3 <dbl>, color_4 <dbl>, color_5 <dbl>, color_6 <dbl>, color_7 <dbl>,
#   clarity_1 <dbl>, clarity_2 <dbl>, clarity_3 <dbl>, clarity_4 <dbl>,
#   clarity_5 <dbl>, clarity_6 <dbl>, clarity_7 <dbl>, clarity_8 <dbl>
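
The four operations printed for the recipe can be approximated by hand with {recipes}. The sketch below is inferred only from that printed output; it is an assumption about what the prepper does, not the package's actual source, and the names rec_sketch and baked are made up for illustration.

```r
library(recipes)

# Assumed hand-written equivalent of the four printed operations
rec_sketch <- recipe(price ~ ., data = ggplot2::diamonds) %>%
  # 1. Turn character columns into factors (diamonds has none, so this is a no-op here)
  step_string2factor(tidyselect::vars_select_helpers$where(is.character)) %>%
  # 2. Reserve a level for factor values not seen during training
  step_novel(all_nominal_predictors()) %>%
  # 3. Encode factor predictors as numeric columns
  step_dummy(all_nominal_predictors()) %>%
  # 4. Drop predictors that contain a single unique value
  step_zv(all_predictors())

# Estimate the recipe and return the processed training data
baked <- bake(prep(rec_sketch), new_data = NULL)
dplyr::glimpse(baked)
```

Because cut, color, and clarity are ordered factors, step_dummy() encodes them with polynomial contrasts, which is why the columns in the tibble hold values like 0.359 rather than 0 and 1.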

# Classification
hai_xgboost_data_prepper(Titanic, Survived ~ .)
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Operations:

Factor variables from tidyselect::vars_select_helpers$where(is.charac...
Novel factor level assignment for recipes::all_nominal_predictors()
Dummy variables from recipes::all_nominal_predictors()
Zero variance filter on recipes::all_predictors()

cla_obj <- hai_xgboost_data_prepper(Titanic, Survived ~ .)
get_juiced_data(cla_obj)
# A tibble: 32 × 7
       n Survived Class_X2nd Class_X3rd Class_Crew Sex_Male Age_Child
   <dbl> <fct>         <dbl>      <dbl>      <dbl>    <dbl>     <dbl>
 1     0 No                0          0          0        1         1
 2     0 No                1          0          0        1         1
 3    35 No                0          1          0        1         1
 4     0 No                0          0          1        1         1
 5     0 No                0          0          0        0         1
 6     0 No                1          0          0        0         1
 7    17 No                0          1          0        0         1
 8     0 No                0          0          1        0         1
 9   118 No                0          0          0        1         0
10   154 No                1          0          0        1         0
# … with 22 more rows
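
The classification result has only 32 rows because Titanic is a contingency table: each row is a combination of Class, Sex, Age, and Survived, and n is the number of passengers in that cell, which then gets treated as a predictor. If you would rather have one row per passenger, a sketch along these lines may help. It assumes the {tibble} and {tidyr} packages, and the names titanic_counts and titanic_long are made up for illustration.

```r
library(tibble)
library(tidyr)

# Convert the 4-way table to a tibble; the cell counts land in a column named n
titanic_counts <- as_tibble(Titanic)

# Repeat each row n times and drop the count column: one row per passenger
titanic_long <- uncount(titanic_counts, n)

nrow(titanic_counts)  # 32 cells
nrow(titanic_long)    # 2201 passengers
```

The expanded data could then be passed to hai_xgboost_data_prepper() with the same Survived ~ . formula, so that n no longer shows up among the predictors.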

Voila!