Summarizing Data in R: tapply() vs. group_by() and summarize()

Author: Steven P. Sanderson II, MPH

Published: July 26, 2023

Introduction

Are you tired of manually calculating summary statistics for your data in R? Look no further! In this blog post, we will explore two powerful ways to summarize data: using the tapply() function and the group_by() and summarize() functions from the dplyr package. Both methods are incredibly useful and can save you time and effort in your data analysis projects.

Using the tapply() Function:

The tapply() function in R allows you to apply a function to subsets of a vector or array, split by one or more factors. It’s a fundamental tool for aggregating data in R. The basic syntax for tapply() is as follows:

tapply(X, INDEX, FUN, ...)

  • X: The vector (or other atomic object) you want to summarize.
  • INDEX: A factor, or a list of factors, each the same length as X, used to split the data.
  • FUN: The function you want to apply to each subset.
  • ...: Additional arguments that are passed on to FUN.
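To make the role of ... concrete, here is a minimal sketch with made-up data (the heights and groups vectors are hypothetical, not part of the examples in this post). Extra arguments such as na.rm = TRUE are handed through tapply() to the function being applied:

```r
# Hypothetical data for illustration: one value is missing
heights <- c(150, 160, NA, 170, 180)
groups  <- c("x", "x", "y", "y", "y")

# Without na.rm, any group that contains an NA returns NA
with_na <- tapply(heights, groups, mean)
print(with_na)

# na.rm = TRUE is forwarded through ... to mean()
without_na <- tapply(heights, groups, mean, na.rm = TRUE)
print(without_na)
```

The same mechanism works for any argument the summary function accepts, for example probs when the function is quantile().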

Example 1: Summarizing a Numeric Vector with tapply()

Suppose you have a dataset with students’ exam scores and their corresponding grades. You want to calculate the average score for each grade.

# Sample data
scores <- c(85, 90, 78, 92, 88, 76, 84, 92, 95, 89)
grades <- c("A", "A", "B", "A", "B", "C", "B", "A", "A", "B")

# Using tapply() to calculate the average score for each grade
avg_scores <- tapply(scores, grades, mean)

print(avg_scores)
    A     B     C 
90.80 84.75 76.00 

Or using the built-in iris dataset:

mean_width_by_species <- tapply(iris$Sepal.Width, iris$Species, mean)

print(mean_width_by_species)
    setosa versicolor  virginica 
     3.428      2.770      2.974 

In the first example, tapply() splits the scores vector by the values in the grades vector and calculates the average score for each grade. The iris example works the same way, splitting Sepal.Width by Species.
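One detail worth knowing: when the function returns a single value per group, tapply() gives back a named array rather than a data frame. Below is a short sketch of how you might pull out one group or convert the result, reusing the scores and grades vectors from Example 1 (avg_df is just an illustrative name):

```r
scores <- c(85, 90, 78, 92, 88, 76, 84, 92, 95, 89)
grades <- c("A", "A", "B", "A", "B", "C", "B", "A", "A", "B")
avg_scores <- tapply(scores, grades, mean)

# The group labels are stored as names
names(avg_scores)

# Pull out a single group by its label
avg_scores["A"]

# Convert to a regular data frame if you need one downstream
avg_df <- data.frame(
  grade = names(avg_scores),
  avg_score = as.vector(avg_scores)
)
print(avg_df)
```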

Using the group_by() and summarize() Functions from dplyr:

The dplyr package is a powerful tool for data manipulation in R. It provides the group_by() function to group data based on specific variables and the summarize() function to calculate summary statistics for each group.

Example 2: Summarizing a Data Frame with group_by() and summarize()

Suppose you have a dataset with information about employees, including their department, salary, and years of experience. You want to find the average salary and the maximum years of experience for each department.

The group_by() and summarize() functions from the dplyr package provide a concise, readable way to do this. The general pattern for these functions is as follows:

data %>%
  group_by(INDEX) %>%
  summarize(FUN(...))

Where:

  • data is the data frame that you want to summarize.
  • INDEX is the column (or columns) in data that you want to group by.
  • FUN is the summary function that you want to apply within each group.
  • ... are the arguments passed to FUN, typically the column you want to summarize.

# Load dplyr (install it first with install.packages("dplyr") if needed)
library(dplyr)

# Sample data frame
employees <- data.frame(
  department = c("HR", "Engineering", "HR", "Engineering", "Marketing", "Marketing"),
  salary = c(50000, 65000, 48000, 70000, 55000, 60000),
  experience = c(3, 5, 2, 7, 4, 6)
)

# Using group_by() and summarize() to calculate average salary 
# and max experience by department
summary_data <- employees %>%
  group_by(department) %>%
  summarize(
    avg_salary = mean(salary), 
    max_experience = max(experience)
  )

print(summary_data)
# A tibble: 3 × 3
  department  avg_salary max_experience
  <chr>            <dbl>          <dbl>
1 Engineering      67500              7
2 HR               49000              3
3 Marketing        57500              6

The group_by() function groups the data by the department variable, and then summarize() calculates the average salary and maximum years of experience for each group.
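As a sketch of how the same pipeline can be extended, the block below adds a group size and a median to the employee summary. It assumes dplyr 1.0.0 or later for the .groups argument; dept_summary, n_employees, and median_salary are illustrative names, not from the original example:

```r
library(dplyr)

employees <- data.frame(
  department = c("HR", "Engineering", "HR", "Engineering", "Marketing", "Marketing"),
  salary = c(50000, 65000, 48000, 70000, 55000, 60000),
  experience = c(3, 5, 2, 7, 4, 6)
)

dept_summary <- employees %>%
  group_by(department) %>%
  summarize(
    n_employees = n(),               # number of rows in each group
    median_salary = median(salary),  # any summary function can go here
    .groups = "drop"                 # return an ungrouped tibble
  )

print(dept_summary)
```

Each named argument inside summarize() becomes one column of the result, so adding another statistic is a one-line change.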

Now let’s see both approaches side by side, producing the same results on the iris dataset:

tapply(iris$Sepal.Width, iris$Species, mean)
    setosa versicolor  virginica 
     3.428      2.770      2.974 

iris %>%
  group_by(Species) %>%
  summarize(mean_width = mean(Sepal.Width))
# A tibble: 3 × 2
  Species    mean_width
  <fct>           <dbl>
1 setosa           3.43
2 versicolor       2.77
3 virginica        2.97
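The printed values look slightly different (3.428 vs. 3.43) only because tibbles show about three significant digits by default when printing. A quick way to confirm that the underlying numbers match, as a sketch (base_result and dplyr_result are illustrative names):

```r
library(dplyr)

base_result <- tapply(iris$Sepal.Width, iris$Species, mean)

dplyr_result <- iris %>%
  group_by(Species) %>%
  summarize(mean_width = mean(Sepal.Width))

# Compare the raw numbers, ignoring names and container type.
# Both results are ordered by the levels of the Species factor.
same_values <- all.equal(as.vector(base_result), dplyr_result$mean_width)
print(same_values)
```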

Which method should you use?

Both approaches can apply any summary function to grouped data, including grouping by more than one variable. The tapply() function is part of base R, so it needs no extra packages, and it works directly on vectors. The group_by() and summarize() functions work on data frames, can compute several summaries in one step, and are generally easier to read.

In general, I would recommend using the group_by() and summarize() functions when your data lives in a data frame, especially if you want several summary statistics at once. However, if you are working with plain vectors, want to avoid loading a package, or want the result as a cross-tabulated array when grouping by multiple variables, then the tapply() function may be a better choice.
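When grouping by more than one variable, the two approaches differ mainly in the shape of the result. Here is a rough sketch using the built-in mtcars dataset (not used elsewhere in this post; mpg_matrix and mpg_long are illustrative names, and .groups assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# tapply() with a list of two factors returns a cross-tabulated matrix;
# combinations with no data are filled with NA
mpg_matrix <- tapply(
  mtcars$mpg,
  list(cyl = mtcars$cyl, gear = mtcars$gear),
  mean
)
print(mpg_matrix)

# group_by() with two variables returns one row per combination
# that actually occurs in the data
mpg_long <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")
print(mpg_long)
```

The matrix form is handy for reading values across two dimensions at a glance, while the long form is usually easier to filter, join, or plot.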

Encouragement

Summarizing data is an essential skill in data analysis, and using the tapply() function and the group_by() and summarize() functions from dplyr can significantly simplify your workflow. I encourage you to experiment with your own datasets and try different summary functions (e.g., median(), sd(), etc.) to gain deeper insights into your data.
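As a starting point for that kind of experimentation, here is a small sketch that swaps mean() for median() and sd(), reusing the exam score data from Example 1 (med_scores and sd_scores are illustrative names):

```r
scores <- c(85, 90, 78, 92, 88, 76, 84, 92, 95, 89)
grades <- c("A", "A", "B", "A", "B", "C", "B", "A", "A", "B")

# Median score for each grade
med_scores <- tapply(scores, grades, median)
print(med_scores)

# Standard deviation for each grade; grade C has a single score,
# so its standard deviation is NA
sd_scores <- tapply(scores, grades, sd)
print(sd_scores)
```

Groups with only one observation are worth watching for, since statistics such as sd() cannot be computed from a single value.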

Feel free to explore other functions and packages in R that offer powerful data manipulation and summarization capabilities. R provides a vast ecosystem of packages to make your data analysis journey even more enjoyable. Happy coding!