## Introduction

Counting duplicates is a fundamental task in data analysis and cleaning. As an R programmer working with healthcare data at Stony Brook Medicine, I’ve encountered numerous scenarios where identifying and counting duplicates is crucial for data quality assurance. This guide covers multiple approaches using base R, dplyr, and data.table.

## Understanding Duplicates in R

Before diving into the methods, let’s create sample data to work with:

```r
# Sample patient data
patient_data <- data.frame(
  patient_id = c(101, 102, 101, 103, 102, 104),
  visit_date = c("2025-01-01", "2025-01-01", "2025-01-02",
                 "2025-01-02", "2025-01-03", "2025-01-03")
)
```

## Base R Methods

### Using the duplicated() Function

The most straightforward approach in base R:
```r
# Count all duplicates
sum(duplicated(patient_data$patient_id))
```

```
[1] 2
```

```r
# Get duplicate counts for each value
table(patient_data$patient_id)[table(patient_data$patient_id) > 1]
```

```
101 102
  2   2
```
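Note that duplicated() only flags the second and later occurrences of each value. If you want to flag every row that belongs to a duplicated group, a common base R idiom combines a forward and a backward pass (a quick sketch using the same sample data):

```r
# Flag every row whose patient_id appears more than once,
# not just the repeats after the first occurrence
dup_flags <- duplicated(patient_data$patient_id) |
  duplicated(patient_data$patient_id, fromLast = TRUE)

patient_data[dup_flags, ]  # all rows for patients 101 and 102
sum(dup_flags)             # 4
```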
### Using the table() Function

A more detailed view of frequencies:

```r
# Get frequency count of all values
patient_counts <- table(patient_data$patient_id)
print(patient_counts)
```

```
101 102 103 104
  2   2   1   1
```
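If you prefer those frequencies as a data frame, say for downstream filtering or joining, one option is to convert the table with as.data.frame() (a sketch; the column names here are my own choice):

```r
# Convert the frequency table to a data frame for easier filtering
freq_df <- as.data.frame(patient_counts, stringsAsFactors = FALSE)
names(freq_df) <- c("patient_id", "n")

# Keep only the duplicated IDs
freq_df[freq_df$n > 1, ]
```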
## Modern Approaches with dplyr

### Using group_by() and count()

```r
library(dplyr)

patient_data %>%
  group_by(patient_id) %>%
  count() %>%
  filter(n > 1)
```

```
# A tibble: 2 × 2
# Groups:   patient_id [2]
  patient_id     n
       <dbl> <int>
1        101     2
2        102     2
```
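Two compact variants worth knowing (my sketches, using the same data): count() accepts the grouping column directly, and add_count() keeps every original row while appending the group size, which is handy for flagging duplicate rows in place.

```r
# count() can take the grouping column directly
patient_data %>%
  count(patient_id) %>%
  filter(n > 1)

# add_count() keeps all columns and appends n per group
patient_data %>%
  add_count(patient_id) %>%
  filter(n > 1)
```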
### Advanced dplyr Techniques

```r
# Count duplicates across multiple columns
patient_data %>%
  group_by(patient_id, visit_date) %>%
  summarise(count = n(), .groups = 'drop') %>%
  filter(count > 1)
```

```
# A tibble: 0 × 3
# ℹ 3 variables: patient_id <dbl>, visit_date <chr>, count <int>
```

The empty result tells us that no patient has more than one visit recorded on the same date.
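If you don't mind an extra dependency, the janitor package wraps this whole pattern in get_dupes(), which returns the duplicated rows along with a dupe_count column (an optional alternative, not required for anything above):

```r
library(janitor)

# Rows whose patient_id appears more than once, plus a dupe_count column
patient_data %>%
  get_dupes(patient_id)
```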
## High-Performance Solutions with data.table

For large healthcare datasets, data.table offers superior performance:

```r
library(data.table)

dt_patients <- as.data.table(patient_data)

# Count duplicates
dt_patients[, .N, by = patient_id][N > 1]
```

```
   patient_id     N
        <num> <int>
1:        101     2
2:        102     2
```
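The same pattern extends to several columns at once; this sketch mirrors the earlier multi-column dplyr example:

```r
# Count duplicates across multiple columns; returns zero rows here
# because no patient has two visits on the same date
dt_patients[, .N, by = .(patient_id, visit_date)][N > 1]
```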
## Your Turn!

Try this exercise:

**Problem:** Create a function that returns both the count of duplicates and the duplicate values from a vector.

```r
# Your code here
```
**Solution:**
```r
count_duplicates <- function(x) {
  dup_counts <- table(x)
  list(
    duplicate_values = names(dup_counts[dup_counts > 1]),
    counts = dup_counts[dup_counts > 1]
  )
}

# Test the function
test_vector <- c(1, 2, 2, 3, 3, 3, 4)
count_duplicates(test_vector)
```

```
$duplicate_values
[1] "2" "3"

$counts
x
2 3
2 3
```
## Quick Takeaways
- Base R’s duplicated() is perfect for simple cases
- dplyr offers readable and chainable operations
- data.table provides the best performance for large datasets
- Consider memory usage when working with large healthcare datasets
## Conclusion
Choosing the right method for counting duplicates depends on your specific needs. For healthcare data analysis, I recommend using data.table for large datasets and dplyr for better code readability in smaller datasets.
## Frequently Asked Questions

**Q: Which method is fastest for large datasets?**
A: data.table consistently outperforms the other methods on large datasets.
**Q: Can these methods handle missing values?**
A: Yes, but their defaults differ: duplicated() treats repeated NAs as duplicates of one another, while table() silently drops NAs unless you pass useNA = "ifany". Check each function's NA behavior rather than reaching for a single na.rm-style switch.
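A quick illustration of those defaults (a throwaway vector, just for demonstration):

```r
x <- c(1, 2, NA, NA)

duplicated(x)              # FALSE FALSE FALSE TRUE: the second NA counts as a duplicate
table(x)                   # NAs dropped: counts for 1 and 2 only
table(x, useNA = "ifany")  # adds a count for NA
```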
**Q: How do I count duplicates across multiple columns?**
A: Use group_by() with multiple columns in dplyr, or list multiple columns in data.table's by argument.

**Q: Will these methods work with character vectors?**
A: Yes, all of the methods shown here work with character, numeric, and factor data.

**Q: How can I improve performance when working with millions of rows?**
A: Use data.table and consider setting a key on the columns you group or join on most often.
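On that last point, here is a minimal sketch of keying (it assumes the dt_patients table created above):

```r
# setkey() physically sorts the table by the given column(s), which
# speeds up subsequent joins, subsets, and grouped operations on them
setkey(dt_patients, patient_id)

# keyby groups on the already-sorted key column and keys the result
dt_patients[, .N, keyby = patient_id]
```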