my_formula <- formula(mpg ~ wt + hp)
my_formula
mpg ~ wt + hp
Steven P. Sanderson II, MPH
June 14, 2023
The formula() function in R is a generic function used to create and manipulate formulas. Formulas specify the relationship between variables in statistical models. The basic syntax for a formula is:
response ~ predictors
The response is the variable that you are trying to predict, and the predictors are the variables that you are using to predict the response. You can use multiple predictors by separating them with + signs. For example, the following formula predicts the mpg (miles per gallon) of a car based on the wt (weight) and hp (horsepower) of the car:
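As a small illustration of this syntax, the sketch below writes the formula from the text and inspects it with base R, using the built-in mtcars data. The object names f and fit are only illustrative and do not come from the original post.

```r
# The formula from the text: mpg is the response, wt and hp are the predictors
f <- mpg ~ wt + hp

class(f)      # "formula"
all.vars(f)   # "mpg" "wt" "hp"

# A formula is typically handed to a modeling function, here a linear model
fit <- lm(f, data = mtcars)
coef(fit)
```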
The formula() function can be used to create formulas from scratch, or to extract formulas from existing objects. For example, the following code creates a formula object called my_formula that predicts the mpg of a car based on the wt and hp of the car:
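The "extract from an existing object" case can be sketched as follows. This is a minimal example assuming a fitted linear model as the existing object; the names fit and fit_formula are hypothetical and chosen only for illustration.

```r
# Create a formula object from scratch
my_formula <- formula(mpg ~ wt + hp)

# Extract the formula from an existing object, here a fitted linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_formula <- formula(fit)

# Both print as the same model specification
deparse(my_formula)
deparse(fit_formula)
```

Because formula() is generic, the same call works on other objects that have a formula method.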
The formula() function can also be used to manipulate formulas. For example, the following code adds a new predictor called drat (rear axle ratio) to the my_formula formula:
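The original code for this step is not shown here. One common way to add a predictor to an existing formula in base R is update(), so the sketch below assumes that approach rather than reproducing the author's exact code; new_formula is an illustrative name.

```r
my_formula <- formula(mpg ~ wt + hp)

# In update(), "." stands for "whatever was already on this side"
new_formula <- update(my_formula, . ~ . + drat)
new_formula
```

The result keeps mpg as the response and appends drat to the existing predictors.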
The formula() function is a powerful tool that can be used to create, manipulate, and analyze formulas in R.
Here are some additional things to know about the formula() function:
Now that we have a decent understanding of the function, I want to shift focus a little and show how we can use the generic function formula() to extract a formula from a recipe object.
Here is the full code that we are going to look at:
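The listing itself is missing at this point. Judging from the line-by-line walkthrough later in the post, it was presumably along the following lines. The summary() call is an assumption, included because it produces a variable/type/role/source table like the one shown in the output.

```r
library(recipes)

# Define a recipe: predict mpg from every other column of mtcars
rec_obj <- recipe(mpg ~ ., data = mtcars)

# Assumed call: summary() lists each variable with its type, role, and source
summary(rec_obj)

# Prepare the recipe, then extract its formula
rec_obj |> prep() |> formula()
```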
# A tibble: 11 × 4
   variable type      role      source
   <chr>    <list>    <chr>     <chr>
 1 cyl      <chr [2]> predictor original
 2 disp     <chr [2]> predictor original
 3 hp       <chr [2]> predictor original
 4 drat     <chr [2]> predictor original
 5 wt       <chr [2]> predictor original
 6 qsec     <chr [2]> predictor original
 7 vs       <chr [2]> predictor original
 8 am       <chr [2]> predictor original
 9 gear     <chr [2]> predictor original
10 carb     <chr [2]> predictor original
11 mpg      <chr [2]> outcome   original
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
<environment: 0x0000013e3255d2f0>
Let’s break down each line and understand what it does:
The first line loads the recipes package, which is a powerful tool for preparing and preprocessing data in a structured and reproducible manner.
Here, we create a recipe object named rec_obj. This object represents a set of instructions for data transformation. In this case, we specify the formula mpg ~ ., which means we want to predict the miles per gallon (mpg) using all other variables in the mtcars dataset.
The next line uses R’s native pipe operator (|>), available since R 4.1.0, to chain multiple operations. Let’s break it down:
rec_obj is passed to the prep() function. This function estimates whatever the recipe’s steps need from the training data, such as the values used for handling missing values, feature scaling, or encoding categorical variables.
The output of prep() is then piped to the formula() function, which extracts the formula representation from the prepared recipe object. The resulting formula can be used in subsequent modeling steps.
That’s it! With just a few lines of code, we have defined a recipe, prepared it, and obtained the formula representation for further modeling.
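To illustrate what using the extracted formula in a later modeling step might look like, here is a minimal sketch that passes it to lm(). The choice of lm() and the names model_formula and fit are assumptions made for illustration, not part of the original post.

```r
library(recipes)

rec_obj <- recipe(mpg ~ ., data = mtcars)
model_formula <- rec_obj |> prep() |> formula()

# The extracted formula behaves like any hand-written one
fit <- lm(model_formula, data = mtcars)
length(coef(fit))  # intercept plus the 10 predictors
```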
Now, let’s dive into a couple more examples to showcase the versatility of the recipes package:
rec_obj <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_predictors())
rec_obj |> prep() |> formula()
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
<environment: 0x0000013e2e8346f0>
In this example, we create a recipe to predict the species (Species) using all other variables in the iris dataset. We then use the step_normalize() function to standardize all predictor variables in the recipe. This step ensures that variables are on a similar scale, which can be beneficial for certain machine learning algorithms.
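One way to see the effect of the normalization step is to apply the prepared recipe to its own training data with bake() and check the resulting means and standard deviations. This is a sketch; the names baked and num are illustrative.

```r
library(recipes)

rec_obj <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_predictors())

# bake(new_data = NULL) returns the processed training data
baked <- rec_obj |> prep() |> bake(new_data = NULL)

# Keep only the numeric (predictor) columns
num <- baked[sapply(baked, is.numeric)]

# Each predictor should now have mean close to 0 and standard deviation close to 1
round(colMeans(num), 10)
sapply(num, sd)
```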
Here, we define a recipe to predict the sale price (SalePrice) using all other variables in the train_data dataset. The step_dummy() function is used to convert the nominal variables in the recipe into dummy variables. The all_nominal() selector picks every nominal (categorical) variable, while -all_outcomes() excludes the outcome variable (SalePrice) so that it is not transformed.
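The code for this example does not appear above, and train_data is never defined in the post. The sketch below therefore uses a small, entirely hypothetical train_data (its columns LotArea and Neighborhood are made up for illustration) to show how the described recipe could be written.

```r
library(recipes)

# Hypothetical stand-in for train_data; the real data set is not shown in the post
train_data <- data.frame(
  SalePrice    = c(200000, 150000, 320000, 275000),
  LotArea      = c(8450, 9600, 11250, 9550),
  Neighborhood = factor(c("A", "B", "A", "C"))
)

# Convert nominal variables to dummy variables, leaving the outcome untouched
rec_obj <- recipe(SalePrice ~ ., data = train_data) |>
  step_dummy(all_nominal(), -all_outcomes())

# The extracted formula refers to the dummy columns, not the original factor
dummy_formula <- rec_obj |> prep() |> formula()
dummy_formula
```

Recent versions of recipes also provide all_nominal_predictors(), which selects the same columns without the explicit exclusion.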
These examples provide a glimpse into the power and flexibility of the recipes package for data preprocessing in R. It enables you to define a clear and reproducible data transformation pipeline that can greatly simplify your machine learning workflows.
Happy coding! 🚀