How to Count Duplicates in R: A Comprehensive Guide with Base R, dplyr, and data.table Examples

Learn multiple methods to count duplicates in R using base R, dplyr, and data.table. Includes practical examples, performance tips, and best practices for efficient duplicate detection.
Author

Steven P. Sanderson II, MPH

Published

January 27, 2025


Introduction

Counting duplicates is a fundamental task in data analysis and cleaning. As an R programmer working with healthcare data at Stony Brook Medicine, I’ve encountered numerous scenarios where identifying and counting duplicates is crucial for data quality assurance. This guide covers multiple approaches using base R, dplyr, and data.table.

Understanding Duplicates in R

Before diving into methods, let’s create sample data to work with:

# Sample patient data
patient_data <- data.frame(
  patient_id = c(101, 102, 101, 103, 102, 104),
  visit_date = c("2025-01-01", "2025-01-01", "2025-01-02", 
                 "2025-01-02", "2025-01-03", "2025-01-03")
)

Base R Methods

Using duplicated() Function

The most straightforward approach in base R:

# Count the repeat occurrences (every appearance after the first)
sum(duplicated(patient_data$patient_id))
[1] 2
# Keep only the values that appear more than once, with their counts
id_counts <- table(patient_data$patient_id)
id_counts[id_counts > 1]

101 102 
  2   2 
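
Note that duplicated() flags only the second and later appearances of a value, so the total above counts extra copies rather than every row that belongs to a repeated value. The following is a minimal sketch of the difference; the names ids, extra_copies, rows_involved, and repeated_values are illustrative and not part of the original example.

```r
# Same ids as the sample patient data
ids <- c(101, 102, 101, 103, 102, 104)

# Repeat occurrences only: the first appearance of each value is not flagged
extra_copies <- sum(duplicated(ids))

# Every element whose value appears more than once, first appearance included
in_dup_group <- duplicated(ids) | duplicated(ids, fromLast = TRUE)
rows_involved <- sum(in_dup_group)

# The distinct values that are repeated
repeated_values <- unique(ids[duplicated(ids)])

extra_copies     # 2
rows_involved    # 4
repeated_values  # 101 102
```

Which number you want depends on the question: extra_copies tells you how many rows a de-duplication would drop, while rows_involved tells you how many rows need review.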

Using table() Function

A more detailed view of frequencies:

# Get frequency count of all values
patient_counts <- table(patient_data$patient_id)
print(patient_counts)

101 102 103 104 
  2   2   1   1 
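
If you would rather work with the frequencies as a data frame, one option is to convert the table. This is a sketch only; the column name patient_id and the objects freq and dups are chosen here for illustration.

```r
ids <- c(101, 102, 101, 103, 102, 104)

# Naming the argument to table() sets the name of the resulting column
freq <- as.data.frame(table(patient_id = ids), stringsAsFactors = FALSE)

# table() stores the values as labels, so convert them back to numbers
freq$patient_id <- as.numeric(freq$patient_id)

# Keep repeated ids only, most frequent first
dups <- freq[freq$Freq > 1, ]
dups <- dups[order(-dups$Freq), ]
dups
```

The count column is called Freq by default; the responseName argument of as.data.frame() can rename it.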

Modern Approaches with dplyr

Using group_by() and count()

library(dplyr)

patient_data %>%
  group_by(patient_id) %>%
  count() %>%
  filter(n > 1)
# A tibble: 2 × 2
# Groups:   patient_id [2]
  patient_id     n
       <dbl> <int>
1        101     2
2        102     2

Advanced dplyr Techniques

# Count duplicates across multiple columns
patient_data %>%
  group_by(patient_id, visit_date) %>%
  summarise(count = n(), .groups = 'drop') %>%
  filter(count > 1)
# A tibble: 0 × 3
# ℹ 3 variables: patient_id <dbl>, visit_date <chr>, count <int>
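
The result above is empty because no patient has two visits on the same date in the sample data. To see a non-empty result, here is a small sketch with a hypothetical data frame, visits, that contains one repeated patient and date pair. It uses count() with several columns, which is shorthand for the group_by() and summarise() steps.

```r
library(dplyr)

# Hypothetical data: patient 101 appears twice on 2025-01-01
visits <- data.frame(
  patient_id = c(101, 102, 101, 101),
  visit_date = c("2025-01-01", "2025-01-01", "2025-01-02", "2025-01-01")
)

# Count each patient/date combination and keep the repeated ones
dup_pairs <- visits %>%
  count(patient_id, visit_date, name = "count") %>%
  filter(count > 1)

dup_pairs

# Base R equivalent for the number of fully repeated rows
n_extra_rows <- sum(duplicated(visits))
```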

High-Performance Solutions with data.table

For large healthcare datasets, data.table's grouped aggregation is typically faster and more memory-efficient than the approaches above:

library(data.table)
dt_patients <- as.data.table(patient_data)

# Count duplicates
dt_patients[, .N, by = patient_id][N > 1]
   patient_id     N
        <num> <int>
1:        101     2
2:        102     2
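
The same pattern extends to several columns by listing them in by. The sketch below uses a hypothetical table, dt_visits, with one repeated patient and date pair, and also shows the duplicated() method that data.table provides for counting extra rows.

```r
library(data.table)

# Hypothetical data: patient 101 appears twice on 2025-01-01
dt_visits <- data.table(
  patient_id = c(101, 102, 101, 101),
  visit_date = c("2025-01-01", "2025-01-01", "2025-01-02", "2025-01-01")
)

# Group on both columns, then keep combinations seen more than once
dup_pairs <- dt_visits[, .N, by = .(patient_id, visit_date)][N > 1]
dup_pairs

# duplicated() on a data.table accepts `by` to choose the columns compared
n_extra <- sum(duplicated(dt_visits, by = c("patient_id", "visit_date")))
n_extra
```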

Your Turn!

Try this exercise:

Problem: Create a function that returns both the count of duplicates and the duplicate values from a vector.

# Your code here

Solution:

count_duplicates <- function(x) {
  dup_counts <- table(x)
  list(
    duplicate_values = names(dup_counts[dup_counts > 1]),
    counts = dup_counts[dup_counts > 1]
  )
}

# Test the function
test_vector <- c(1, 2, 2, 3, 3, 3, 4)
count_duplicates(test_vector)
$duplicate_values
[1] "2" "3"

$counts
x
2 3 
2 3 
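
Because the solution is built on table(), the duplicate values come back as character strings. If you need them in their original type, one possible variant is sketched below. The function name count_duplicates_typed is hypothetical, and the sketch assumes the input has no NA values.

```r
# Variant that keeps the input's type (assumes x contains no NA)
count_duplicates_typed <- function(x) {
  dup_values <- unique(x[duplicated(x)])
  list(
    n_duplicates = sum(duplicated(x)),   # extra copies beyond the first
    duplicate_values = dup_values,       # same type as x
    counts = vapply(dup_values, function(v) sum(x == v), integer(1))
  )
}

test_vector <- c(1, 2, 2, 3, 3, 3, 4)
result <- count_duplicates_typed(test_vector)
result$n_duplicates      # 3
result$duplicate_values  # 2 3
result$counts            # 2 3
```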

Quick Takeaways

  • Base R’s duplicated() and table() cover simple cases with no extra packages
  • dplyr offers readable and chainable operations
  • data.table is typically the fastest option for large datasets
  • Consider memory usage when working with large healthcare datasets
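
Performance depends on your data and machine, so it is worth timing the options yourself. The following is a rough sketch using system.time() on simulated ids; the sizes, the seed, and the object names are arbitrary choices for illustration, and no particular timings are implied.

```r
library(data.table)

# Simulated ids: 500,000 rows drawn from 100,000 possible values
set.seed(42)
ids <- sample.int(1e5, 5e5, replace = TRUE)

# Base R: frequency table of every id
base_time <- system.time(
  base_counts <- table(ids)
)

# data.table: grouped row count
dt <- data.table(patient_id = ids)
dt_time <- system.time(
  dt_counts <- dt[, .N, by = patient_id]
)

# Compare elapsed times on your own machine
base_time["elapsed"]
dt_time["elapsed"]

# Sanity check: both approaches find the same number of repeated ids
same_answer <- sum(base_counts > 1) == nrow(dt_counts[N > 1])
same_answer
```

For more careful comparisons, a dedicated benchmarking package that repeats each expression many times gives steadier numbers than a single system.time() call.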

Conclusion

Choosing the right method for counting duplicates depends on your specific needs. For healthcare data analysis, I recommend using data.table for large datasets and dplyr for better code readability in smaller datasets.

Frequently Asked Questions

  1. Q: Which method is fastest for large datasets? A: data.table is usually the fastest of the three on large datasets, though the gap depends on data size and the number of groups.

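  2. Q: Can these methods handle missing values? A: Yes, but the behavior differs by method. duplicated() treats NA values as equal to one another, table() drops NA unless you pass useNA = "ifany", and grouping in dplyr or data.table puts NA in its own group. None of these functions takes an na.rm argument.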

  3. Q: How do I count duplicates across multiple columns? A: Use group_by() with multiple columns in dplyr or multiple columns in data.table’s by parameter.

  4. Q: Will these methods work with character vectors? A: Yes, all methods work with character, numeric, and factor data types.

  5. Q: How can I improve performance when working with millions of rows? A: Use data.table and consider keying or indexing frequently used columns with setkey() or setindex().
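
On missing values (question 2), a short sketch of how the base R functions treat NA may help. The vector ids and the result names here are made up for illustration; for table(), the argument that controls NA handling is useNA.

```r
ids <- c(101, NA, 101, NA, 103)

# duplicated() treats NA values as equal to each other,
# so the second NA counts as a repeat
n_repeats <- sum(duplicated(ids))
n_repeats                      # 2

# table() leaves NA out of the counts by default
counts_default <- table(ids)
sum(counts_default)            # 3

# useNA = "ifany" adds an NA entry when missing values are present
counts_with_na <- table(ids, useNA = "ifany")
sum(counts_with_na)            # 5
```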

Share Your Experience

If you’ve found this guide helpful, consider sharing it with your R programming colleagues. Have you discovered other efficient methods for counting duplicates? Share your approaches in the comments below.


Happy Coding! 🚀

