Introduction
Introduction
Minimal-coding machine learning is nothing unheard of; in fact, it is rather prolific. Think of h2o and pycaret, just to name two. There is also no shortage of options for R, with the h2o interface and tidyfit. My R package {healthyR.ai} also offers similar low-code workflows. Today I will specifically go through the workflow for automatic KNN classification on the iris data set, where we will classify the Species.
Function
Let’s take a look at the two {healthyR.ai} functions that we will be using. First we have the data prepper, hai_knn_data_prepper(), which gets the data ready for use with the KNN algorithm, and then we have the actual Auto ML function, hai_auto_knn(). Let’s take a look at the function calls.
First, hai_knn_data_prepper():
hai_knn_data_prepper(.data, .recipe_formula)
Now let’s look at the arguments to those parameters.
- .data - The data that you are passing to the function. This can be any type of data that is accepted by the data parameter of the recipes::recipe() function.
- .recipe_formula - The formula that is going to be passed. For example, if you are using the iris data then the formula would most likely be something like Species ~ . (a quick sketch of the call follows this list).
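To make the prepper concrete, here is a minimal sketch that builds the recipe on iris and then peeks at the transformed data using the standard {recipes} verbs prep() and bake(); the preview step is my addition for illustration, not part of the workflow shown later.

library(healthyR.ai)
library(recipes)

# Build the automatic recipe for the KNN algorithm
rec <- hai_knn_data_prepper(iris, Species ~ .)

# Preview what the recipe does: estimate the steps with prep(),
# then apply them to the training data with bake()
baked <- bake(prep(rec), new_data = NULL)
head(baked)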
Now let’s take a look at the Auto ML function.
hai_auto_knn(
.data,
.rec_obj,
.splits_obj = NULL,
.rsamp_obj = NULL,
.tune = TRUE,
.grid_size = 10,
.num_cores = 1,
.best_metric = "rmse",
.model_type = "regression"
)
Again, let’s look at the arguments to the parameters.
- .data - The data being passed to the function.
- .rec_obj - This is the recipe object you want to use. You can use hai_knn_data_prepper() to create an automatic recipe object.
- .splits_obj - NULL is the default; when NULL, one will be created.
- .rsamp_obj - NULL is the default; when NULL, one will be created. It will default to creating an rsample::mc_cv() object. (A sketch of supplying your own split and resampling objects follows this list.)
- .tune - Default is TRUE; this will create a tuning grid and a tuned workflow.
- .grid_size - Default is 10.
- .num_cores - Default is 1.
- .best_metric - Default is "rmse". You can choose a metric depending on the model_type used. If regression then see hai_default_regression_metric_set(); if classification then see hai_default_classification_metric_set().
- .model_type - Default is "regression"; this can also be "classification".
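If you would rather control the splitting and resampling yourself instead of letting hai_auto_knn() create them, a hedged sketch using {rsample} looks like this; the proportions here are assumptions chosen to mirror the defaults shown in the output later on.

library(rsample)

set.seed(123)
# A 75/25 train/test split and a Monte Carlo CV object built by hand
splits    <- initial_split(iris, prop = 0.75)
resamples <- mc_cv(training(splits), prop = 0.75, times = 25)

# These could then be passed as .splits_obj = splits and
# .rsamp_obj = resamples in the call below.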
Example
For this example we are going to use the iris data set and hai_auto_knn() to classify the Species.
library(healthyR.ai)

data <- iris

rec_obj <- hai_knn_data_prepper(data, Species ~ .)

auto_knn <- hai_auto_knn(
  .data = data,
  .rec_obj = rec_obj,
  .best_metric = "f_meas",
  .model_type = "classification",
  .num_cores = 12
)
Now let’s take a look at the complete output of the auto_knn
object. The outputs are as follows:
- recipe
- model specification
- workflow
- tuned model (grid, etc.)
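Before walking through each piece, here is a quick sketch of how you might inspect the top-level structure of the returned list yourself; the element names are taken from the output printed below.

# Inspect the components of the returned list
names(auto_knn)
# [1] "recipe_info" "model_info"  "tuned_info"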
Recipe Output
auto_knn$recipe_info
Recipe
Inputs:
role #variables
outcome 1
predictor 4
Operations:
Novel factor level assignment for recipes::all_nominal_predictors()
Dummy variables from recipes::all_nominal_predictors()
Zero variance filter on recipes::all_predictors()
Centering and scaling for recipes::all_numeric()
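For reference, the operations above correspond to the standard {recipes} steps listed in the workflow printout below. A roughly equivalent recipe built by hand would look something like this; it is a sketch based only on the printed steps, not on the package source.

library(recipes)

rec_manual <- recipe(Species ~ ., data = iris) |>
  step_novel(all_nominal_predictors()) |>  # novel factor level assignment
  step_dummy(all_nominal_predictors()) |>  # dummy variables
  step_zv(all_predictors()) |>             # zero variance filter
  step_normalize(all_numeric())            # centering and scaling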
Model Info
auto_knn$model_info$was_tuned
[1] "tuned"
auto_knn$model_info$model_spec
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
neighbors = tune::tune()
weight_func = tune::tune()
dist_power = tune::tune()
Computational engine: kknn
auto_knn$model_info$wflw
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_novel()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
neighbors = tune::tune()
weight_func = tune::tune()
dist_power = tune::tune()
Computational engine: kknn
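The model specification printed above can also be written out by hand with {parsnip}; here is a hedged sketch matching the main arguments and engine shown, again based only on the printout.

library(parsnip)
library(tune)

knn_spec <- nearest_neighbor(
  neighbors   = tune(),  # tuned over the grid below
  weight_func = tune(),
  dist_power  = tune()
) |>
  set_mode("classification") |>
  set_engine("kknn")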
auto_knn$model_info$fitted_wflw
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_novel()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(13L, data, 5), distance = ~1.69935879141092, kernel = ~"rank")
Type of response variable: nominal
Minimal misclassification: 0.03571429
Best kernel: rank
Best k: 13
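Since fitted_wflw is a trained {workflows} object, you can use it for prediction like any other tidymodels fit. A minimal sketch, predicting back on the first rows of iris purely for illustration:

# Pull the fitted workflow out of the returned list and predict with it
fitted_wflw <- auto_knn$model_info$fitted_wflw
predict(fitted_wflw, new_data = head(iris))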
Tuning Info
auto_knn$tuned_info$tuning_grid
# A tibble: 10 × 3
neighbors weight_func dist_power
<int> <chr> <dbl>
1 4 triangular 0.764
2 11 rectangular 0.219
3 5 gaussian 1.35
4 14 triweight 0.351
5 5 biweight 1.05
6 9 optimal 1.87
7 7 cos 0.665
8 11 inv 1.18
9 13 rank 1.70
10 1 epanechnikov 1.58
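As an aside, one common way to build a space-filling grid like this in tidymodels is dials::grid_latin_hypercube(); a hedged sketch follows, and this is not necessarily how hai_auto_knn() builds its grid internally.

library(dials)

set.seed(123)
grid_latin_hypercube(
  neighbors(),    # number of neighbors
  weight_func(),  # kernel function
  dist_power(),   # Minkowski distance parameter
  size = 10
)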
auto_knn$tuned_info$cv_obj
# Monte Carlo cross-validation (0.75/0.25) with 25 resamples
# A tibble: 25 × 2
splits id
<list> <chr>
1 <split [84/28]> Resample01
2 <split [84/28]> Resample02
3 <split [84/28]> Resample03
4 <split [84/28]> Resample04
5 <split [84/28]> Resample05
6 <split [84/28]> Resample06
7 <split [84/28]> Resample07
8 <split [84/28]> Resample08
9 <split [84/28]> Resample09
10 <split [84/28]> Resample10
# … with 15 more rows
auto_knn$tuned_info$tuned_results
# Tuning results
# Monte Carlo cross-validation (0.75/0.25) with 25 resamples
# A tibble: 25 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [84/28]> Resample01 <tibble [110 × 7]> <tibble [0 × 3]>
2 <split [84/28]> Resample02 <tibble [110 × 7]> <tibble [0 × 3]>
3 <split [84/28]> Resample03 <tibble [110 × 7]> <tibble [0 × 3]>
4 <split [84/28]> Resample04 <tibble [110 × 7]> <tibble [0 × 3]>
5 <split [84/28]> Resample05 <tibble [110 × 7]> <tibble [0 × 3]>
6 <split [84/28]> Resample06 <tibble [110 × 7]> <tibble [0 × 3]>
7 <split [84/28]> Resample07 <tibble [110 × 7]> <tibble [0 × 3]>
8 <split [84/28]> Resample08 <tibble [110 × 7]> <tibble [0 × 3]>
9 <split [84/28]> Resample09 <tibble [110 × 7]> <tibble [0 × 3]>
10 <split [84/28]> Resample10 <tibble [110 × 7]> <tibble [0 × 3]>
# … with 15 more rows
auto_knn$tuned_info$grid_size
[1] 10
auto_knn$tuned_info$best_metric
[1] "f_meas"
auto_knn$tuned_info$best_result_set
# A tibble: 1 × 9
neighbors weight_func dist_power .metric .estima…¹ mean n std_err .config
<int> <chr> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 13 rank 1.70 f_meas macro 0.957 25 0.00655 Prepro…
# … with abbreviated variable name ¹.estimator
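The best_result_set above is also something you can recover yourself from the tuned results with {tune}; a small sketch:

library(tune)

# Show the top parameter combinations ranked by the chosen metric
show_best(auto_knn$tuned_info$tuned_results, metric = "f_meas", n = 5)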
auto_knn$tuned_info$tuning_grid_plot
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
(static ggplot of the tuning grid results appears here)
auto_knn$tuned_info$plotly_grid_plot
(interactive plotly version of the same plot appears here)
Voila!
Thank you for reading.