Extracting Numbers from Strings in R

Author: Steven P. Sanderson II, MPH

Published: June 11, 2024

Introduction

Hello! Today, we’ll look at a task that comes up all the time in data processing: extracting numbers from strings. We’ll explore three approaches: base R, the stringr package, and the stringi package. Each has its own strengths, so let’s get started!

Examples

Extracting Numbers with Base R

Base R can do this with regular expressions and no extra packages: gregexpr() locates the matches and regmatches() pulls them out. Here’s a simple example:

# Sample string
text <- "The price is 45 dollars and 50 cents."

# Extract numbers using regular expressions
numbers <- gregexpr("[0-9]+", text)
result <- regmatches(text, numbers)

# Convert to numeric
numeric_result <- as.numeric(unlist(result))

print(numeric_result)
[1] 45 50

Explanation:

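  1. gregexpr("[0-9]+", text) locates every run of one or more digits in the text. It returns the match positions and lengths, not the matched text itself.
  2. regmatches(text, numbers) uses those positions to pull the matching substrings out of the text. The result is a list with one character vector per input string.
  3. unlist(result) flattens that list into a single character vector.
  4. as.numeric() converts the character strings to numeric values.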
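
One thing to keep in mind: the pattern [0-9]+ only matches runs of digits, so a value like 45.50 would come back as two separate numbers and a minus sign would be dropped. Here is a minimal sketch of one way to handle that, assuming your numbers use a plain minus sign and a dot as the decimal separator. The sample string and the pattern are illustrative assumptions, not part of the example above.

```r
# Hypothetical sample string with a negative value and decimals
text <- "Temperature fell to -3.5 degrees, then rose by 12 over 2.25 hours."

# Optional minus sign, one or more digits, optional fractional part
pattern <- "-?[0-9]+(\\.[0-9]+)?"

# Locate the matches, pull them out, and flatten the list
matches <- regmatches(text, gregexpr(pattern, text))

# Convert to numeric
numeric_result <- as.numeric(unlist(matches))

print(numeric_result)
```

This should give the values -3.5, 12, and 2.25. Note that the pattern treats any hyphen directly in front of a digit as a minus sign, so a string like "model-3" would yield -3. Adjust the pattern to suit your data.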

Extracting Numbers with stringr

The stringr package offers a more user-friendly approach to string manipulation. Here’s how you can extract numbers:

library(stringr)

# Sample string
text <- "The price is 45 dollars and 50 cents."

# Extract numbers using stringr
numbers <- str_extract_all(text, "\\d+")

# Convert to numeric
numeric_result <- as.numeric(unlist(numbers))

print(numeric_result)
[1] 45 50

Explanation:

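  1. str_extract_all(text, "\\d+") extracts every run of digits from the text and returns a list with one character vector per input string. In the R string "\\d+", the doubled backslash produces the regular expression \d+, which matches one or more digits.
  2. unlist(numbers) flattens the list and as.numeric() converts the matches to numeric values, just as in the base R method.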
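
Because str_extract_all() returns one list element per input string, it also works on a whole character vector at once. The sketch below uses made-up strings to show how you might keep the numbers grouped by the string they came from instead of flattening everything with unlist(). The variable names are just for illustration.

```r
library(stringr)

# Hypothetical vector of strings; the second one contains no digits
texts <- c("Order 12 shipped in 3 boxes",
           "No digits here",
           "Invoice 2024 total 199")

# One list element per input string
matches <- str_extract_all(texts, "\\d+")

# Convert each element separately so the grouping is preserved
numbers_by_string <- lapply(matches, as.numeric)

# str_extract() returns only the first match per string (NA if none)
first_number <- as.numeric(str_extract(texts, "\\d+"))

print(numbers_by_string)
print(first_number)
```

The string without digits produces an empty numeric vector in numbers_by_string and an NA in first_number, which makes missing values easy to spot.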

Extracting Numbers with stringi

The stringi package, which stringr itself is built on, is another excellent tool for string manipulation and provides a large set of efficient functions. Here’s an example:

library(stringi)

# Sample string
text <- "The price is 45 dollars and 50 cents."

# Extract numbers using stringi
numbers <- stri_extract_all_regex(text, "\\d+")

# Convert to numeric
numeric_result <- as.numeric(unlist(numbers))

print(numeric_result)
[1] 45 50

Explanation:

  1. stri_extract_all_regex(text, "\\d+") extracts every run of digits from the text and returns a list with one character vector per input string.
  2. As before, unlist(numbers) flattens the list and as.numeric() converts the matches to numeric values.
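
One detail worth knowing when you run this on several strings: by default, stri_extract_all_regex() returns NA for a string with no match. If you would rather get an empty vector, the omit_no_match argument handles that. Here is a small sketch with made-up strings:

```r
library(stringi)

# Hypothetical vector of strings; the second one contains no digits
texts <- c("Order 12 shipped in 3 boxes", "No digits here")

# Default behaviour: a string with no match yields NA
default_matches <- stri_extract_all_regex(texts, "\\d+")

# With omit_no_match = TRUE, a string with no match yields character(0)
strict_matches <- stri_extract_all_regex(texts, "\\d+", omit_no_match = TRUE)

print(default_matches)
print(strict_matches)
```

Which behaviour you want depends on what happens next. With the default, unlist() followed by as.numeric() carries an NA into the result, while omit_no_match = TRUE silently drops the string that had no numbers.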

Comparison and Conclusion

  • Base R needs no additional packages, but it takes two calls (gregexpr() plus regmatches()) to do what the other packages do in one.
  • stringr wraps the task in a single function, str_extract_all(), making the code easier to read and write.
  • stringi, the package that stringr is built on, offers a larger set of efficient string operations and is a good fit for performance-critical tasks.
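
If you want to convince yourself that the three approaches agree, a quick check along these lines may help. It reuses the sample string from the examples and simply compares the results. Treat it as a sanity check on this one input, not as a benchmark.

```r
library(stringr)
library(stringi)

# Same sample string as in the examples
text <- "The price is 45 dollars and 50 cents."

# One result per method
base_result    <- as.numeric(unlist(regmatches(text, gregexpr("[0-9]+", text))))
stringr_result <- as.numeric(unlist(str_extract_all(text, "\\d+")))
stringi_result <- as.numeric(unlist(stri_extract_all_regex(text, "\\d+")))

# Both comparisons should be TRUE for this input
print(identical(base_result, stringr_result))
print(identical(base_result, stringi_result))
```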

Try It Yourself!

I encourage you to try these methods on your own data. Extracting numbers from strings is a useful skill, especially when working with messy data. Experiment with different strings and see which method you prefer. Happy coding!
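
As a starting point, here is a rough sketch of how you might pull the first number out of each row of a data frame column using only base R. The orders data frame and its note column are invented for illustration, so swap in your own data.

```r
# Hypothetical data frame with a messy text column
orders <- data.frame(
  note = c("shipped 4 boxes", "shipped 12 boxes", "backordered"),
  stringsAsFactors = FALSE
)

# regexpr() finds only the first match per string and returns -1 when there is none
first_match <- regexpr("[0-9]+", orders$note)
has_match <- as.integer(first_match) > 0

# regmatches() returns only the strings that matched, so fill the other rows with NA
orders$boxes <- NA_real_
orders$boxes[has_match] <- as.numeric(regmatches(orders$note, first_match))

print(orders)
```

Rows without a number keep NA in the new boxes column, so they are easy to filter or inspect later.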