# Library Loading
pacman::p_load(tidyverse, glue, unglue)
Introduction
Regular expressions, or regex, are incredibly powerful tools for pattern matching and extracting specific information from text data. Today, we’ll explore how to harness the might of regex in R with a practical example.
Let’s dive into a scenario where we need to clean our data by extracting numerical values from strings. Our data, stored in a dataframe named df, consists of four columns: x1 holds a unit label, while x2, x3, and x4 hold strings made up of a numerical value followed by a percentage in parentheses, such as 11 (1%). Our goal is to extract the numerical values and compute a total for each row.
Loading Libraries
Before we begin, we need to load the necessary libraries. We’ll be using the tidyverse package for data manipulation, along with glue and unglue for string manipulation.
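If pacman is not available on your machine, the same three packages can be attached with plain library() calls. This is only a sketch of an equivalent setup, and it assumes the packages are already installed, since library() does not install anything for you.

```r
# Equivalent setup without pacman (assumes the packages are already installed)
library(tidyverse) # dplyr, tidyr, stringr, tibble, ...
library(glue)      # string interpolation
library(unglue)    # the reverse: pull values out of strings by pattern
```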
Exploring the Data
Let’s build the example data and take a sneak peek at it using the head() function to understand its structure.
df <- tibble(
  x1 = rep("Unit A", 11),
  x2 = c(glue("{11:20} ({1:10}%)"), glue("{251} ({13}%)")),
  x3 = c(glue("{21:30} ({11:20}%)"), glue("{252} ({14}%)")),
  x4 = c(glue("{31:40} ({21:30}%)"), glue("{253} ({15}%)"))
)
head(df, 3)
# A tibble: 3 × 4
  x1     x2      x3       x4
  <chr>  <chr>   <chr>    <chr>
1 Unit A 11 (1%) 21 (11%) 31 (21%)
2 Unit A 12 (2%) 22 (12%) 32 (22%)
3 Unit A 13 (3%) 23 (13%) 33 (23%)
This command displays the first three rows of our dataframe df, giving us an idea of what our data looks like.
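The strings in x2 to x4 come out this way because glue() is vectorised: each expression in curly braces is evaluated as R code and the results are pasted together element by element. A small sketch on three made-up values (the name demo is just for illustration):

```r
library(glue)

# glue() evaluates each {...} and combines the results element by element
demo <- glue("{11:13} ({1:3}%)")

# Drop the glue class to get an ordinary character vector
demo_chr <- as.character(demo)
demo_chr
```

Each element pairs the n-th value with the n-th percentage, which is the same layout we see in the rows of df.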
Creating a Regex Function
Now, we’ll define a custom function named reg_val_fns to extract numerical values from strings using regular expressions. This function takes two parameters: .col_data (column data) and .pattern (regex pattern). If no pattern is provided, it defaults to a pattern that matches the first run of digits immediately followed by a non-word character or by the end of the string. The match is converted to numeric before it is returned.
# Make regex function
reg_val_fns <- function(.col_data, .pattern = NULL){
  ptrn <- .pattern
  if(is.null(ptrn)){
    ptrn <- "\\d+(?=\\W|$)"
  }
  reged_val <- .col_data |>
    str_extract(ptrn) |>
    as.numeric()
  return(reged_val)
}
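To get a feel for how the function behaves, we can try it on a couple of strings outside the dataframe. The sketch below repeats a compact stand-in for reg_val_fns so that it runs on its own. The sample strings and the two custom patterns are assumptions made up for illustration, not part of the original data.

```r
library(stringr)

# Compact stand-in for reg_val_fns, repeated here so the snippet is self-contained
reg_val_fns <- function(.col_data, .pattern = NULL) {
  ptrn <- if (is.null(.pattern)) "\\d+(?=\\W|$)" else .pattern
  as.numeric(str_extract(.col_data, ptrn))
}

sample_strings <- c("11 (1%)", "251 (13%)") # illustrative values

# Default pattern: first run of digits followed by a non-word character
default_vals <- reg_val_fns(sample_strings)

# Custom pattern: digits sitting immediately before a percent sign
pct_vals <- reg_val_fns(sample_strings, "\\d+(?=%)")

# Caveat: the default pattern stops at a decimal point, so 12.5 becomes 12
truncated <- reg_val_fns("12.5 (3%)")

# A hypothetical pattern that also accepts an optional decimal part
with_decimals <- reg_val_fns("12.5 (3%)", "\\d+(\\.\\d+)?")
```

The decimal case is worth keeping in mind: if your own data contains fractional values, the default pattern will silently drop everything after the point, so passing an explicit .pattern is the safer choice there.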
Applying the Regex Function
With our regex function defined, we apply it across the desired columns using mutate() together with across() from the dplyr package. This extracts the numerical value from the strings in each column and converts it to numeric. We then compute the total for each row using rowSums().
# Apply the function across the desired columns
df |>
  mutate(across(-x1, reg_val_fns)) |>
  mutate(total_val = rowSums(across(-x1)))
# A tibble: 11 × 5
   x1        x2    x3    x4 total_val
   <chr>  <dbl> <dbl> <dbl>     <dbl>
 1 Unit A    11    21    31        63
 2 Unit A    12    22    32        66
 3 Unit A    13    23    33        69
 4 Unit A    14    24    34        72
 5 Unit A    15    25    35        75
 6 Unit A    16    26    36        78
 7 Unit A    17    27    37        81
 8 Unit A    18    28    38        84
 9 Unit A    19    29    39        87
10 Unit A    20    30    40        90
11 Unit A   251   252   253       756
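Because across() also accepts an anonymous function, the same helper can be reused with a different pattern. The sketch below works on a two-row stand-in for df and pulls out the percentages instead of the counts. The names df_small, pct_tbl, and total_pct, as well as the percent pattern, are assumptions introduced for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Compact stand-in for reg_val_fns so the snippet runs on its own
reg_val_fns <- function(.col_data, .pattern = NULL) {
  ptrn <- if (is.null(.pattern)) "\\d+(?=\\W|$)" else .pattern
  as.numeric(str_extract(.col_data, ptrn))
}

# Two-row stand-in for df (illustrative values)
df_small <- tibble(
  x1 = c("Unit A", "Unit A"),
  x2 = c("11 (1%)", "251 (13%)"),
  x3 = c("21 (11%)", "252 (14%)"),
  x4 = c("31 (21%)", "253 (15%)")
)

pct_tbl <- df_small |>
  # Pass a custom pattern through a lambda: digits right before "%"
  mutate(across(-x1, \(x) reg_val_fns(x, "\\d+(?=%)"))) |>
  # Sum the extracted percentages per row
  mutate(total_pct = rowSums(across(-x1)))

pct_tbl
```

Wrapping the call in a lambda keeps reg_val_fns unchanged while letting each pipeline decide which part of the string it wants.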
Alternative Approach: Using unglue
An alternative method to extract values from strings is using the unglue package. Here, we apply the unglue_data() function across columns (excluding x1) to extract values and percentages separately, then unnest the resulting dataframe and compute the total value for each row. Only the val columns are converted to numeric, so the percentages stay as character and are left out of the sum.
# Use unglue
df |>
  mutate(across(-x1, \(x) unglue_data(x, "{val} ({pct}%)"))) |>
  unnest(cols = everything(), names_sep = "_") |>
  mutate(across(.cols = contains("val"), \(x) as.numeric(x))) |>
  mutate(total_val = rowSums(across(where(is.numeric))))
# A tibble: 11 × 8
   x1     x2_val x2_pct x3_val x3_pct x4_val x4_pct total_val
   <chr>   <dbl> <chr>   <dbl> <chr>   <dbl> <chr>      <dbl>
 1 Unit A     11 1          21 11         31 21            63
 2 Unit A     12 2          22 12         32 22            66
 3 Unit A     13 3          23 13         33 23            69
 4 Unit A     14 4          24 14         34 24            72
 5 Unit A     15 5          25 15         35 25            75
 6 Unit A     16 6          26 16         36 26            78
 7 Unit A     17 7          27 17         37 27            81
 8 Unit A     18 8          28 18         38 28            84
 9 Unit A     19 9          29 19         39 29            87
10 Unit A     20 10         30 20         40 30            90
11 Unit A    251 13        252 14        253 15           756
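To see what unglue_data() hands back before the unnest step, it helps to call it on a plain character vector. The sketch below uses made-up values and a two-column stand-in for df (the names parts, df_small, and totals are illustrative). It also shows one possible variation: converting every extracted column to numeric and then selecting the columns to sum by name with ends_with("_val"), so the total does not depend on which columns happen to be numeric.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(unglue)

# unglue_data() returns a data frame with one column per {name} in the pattern
parts <- unglue_data(c("11 (1%)", "251 (13%)"), "{val} ({pct}%)")
# parts$val and parts$pct are character vectors at this point

# Two-row stand-in for df (illustrative values)
df_small <- tibble(
  x1 = c("Unit A", "Unit A"),
  x2 = c("11 (1%)", "251 (13%)"),
  x3 = c("21 (11%)", "252 (14%)")
)

totals <- df_small |>
  mutate(across(-x1, \(x) unglue_data(x, "{val} ({pct}%)"))) |>
  unnest(cols = everything(), names_sep = "_") |>
  # Convert every extracted column, percentages included
  mutate(across(-x1, as.numeric)) |>
  # Sum only the *_val columns, selected by name rather than by type
  mutate(total_val = rowSums(across(ends_with("_val"))))

totals
```

Selecting by name is a small design choice, but it means the percentages can be numeric for later use without leaking into total_val.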
Conclusion
In this tutorial, we’ve explored how to leverage the power of regular expressions in R to extract numerical values from strings within a dataframe. By defining custom regex functions and using packages like dplyr and unglue, we can efficiently clean and manipulate text data for further analysis.
I encourage you to try out these techniques on your own datasets and explore the endless possibilities of regex in R. Happy coding!