## Introduction

Counting duplicates is a fundamental task in data analysis and cleaning. As an R programmer working with healthcare data at Stony Brook Medicine, I’ve encountered numerous scenarios where identifying and counting duplicates is crucial for data quality assurance. This guide covers multiple approaches using base R, dplyr, and data.table.

## Understanding Duplicates in R

Before diving into the methods, let’s create sample data to work with:

```r
# Sample patient data
patient_data <- data.frame(
  patient_id = c(101, 102, 101, 103, 102, 104),
  visit_date = c("2025-01-01", "2025-01-01", "2025-01-02",
                 "2025-01-02", "2025-01-03", "2025-01-03")
)
```

## Base R Methods

### Using the duplicated() Function

The most straightforward approach in base R:
```r
# Count all duplicates
sum(duplicated(patient_data$patient_id))
```

```
[1] 2
```

```r
# Get duplicate counts for each value
table(patient_data$patient_id)[table(patient_data$patient_id) > 1]
```

```
101 102
  2   2
```
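Note that duplicated() only flags the second and later occurrences of each value. If you want to flag every row that belongs to a duplicated group, a common base R idiom combines a forward and a backward pass (a quick sketch using the same sample data):

```r
# Flag every row whose patient_id appears more than once,
# not just the repeats after the first occurrence
dup_flags <- duplicated(patient_data$patient_id) |
  duplicated(patient_data$patient_id, fromLast = TRUE)

patient_data[dup_flags, ]  # all rows for patients 101 and 102
sum(dup_flags)             # 4
```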
### Using the table() Function

A more detailed view of frequencies:

```r
# Get frequency count of all values
patient_counts <- table(patient_data$patient_id)
print(patient_counts)
```

```
101 102 103 104
  2   2   1   1
```
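If you prefer those frequencies as a data frame, say for downstream filtering or joining, one option is to convert the table with as.data.frame() (a sketch; the column names here are my own choice):

```r
# Convert the frequency table to a data frame for easier filtering
freq_df <- as.data.frame(patient_counts, stringsAsFactors = FALSE)
names(freq_df) <- c("patient_id", "n")

# Keep only the duplicated IDs
freq_df[freq_df$n > 1, ]
```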
## Modern Approaches with dplyr

### Using group_by() and count()

```r
library(dplyr)

patient_data %>%
  group_by(patient_id) %>%
  count() %>%
  filter(n > 1)
```

```
# A tibble: 2 × 2
# Groups:   patient_id [2]
  patient_id     n
       <dbl> <int>
1        101     2
2        102     2
```
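Two compact variants worth knowing (my sketches, using the same data): count() accepts the grouping column directly, and add_count() keeps every original row while appending the group size, which is handy for flagging duplicate rows in place.

```r
# count() can take the grouping column directly
patient_data %>%
  count(patient_id) %>%
  filter(n > 1)

# add_count() keeps all columns and appends n per group
patient_data %>%
  add_count(patient_id) %>%
  filter(n > 1)
```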
### Advanced dplyr Techniques

```r
# Count duplicates across multiple columns
patient_data %>%
  group_by(patient_id, visit_date) %>%
  summarise(count = n(), .groups = 'drop') %>%
  filter(count > 1)
```

```
# A tibble: 0 × 3
# ℹ 3 variables: patient_id <dbl>, visit_date <chr>, count <int>
```

The empty result tells us that no patient has more than one visit recorded on the same date.
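If you don't mind an extra dependency, the janitor package wraps this whole pattern in get_dupes(), which returns the duplicated rows along with a dupe_count column (an optional alternative, not required for anything above):

```r
library(janitor)

# Rows whose patient_id appears more than once, plus a dupe_count column
patient_data %>%
  get_dupes(patient_id)
```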
## High-Performance Solutions with data.table

For large healthcare datasets, data.table offers superior performance:

```r
library(data.table)

dt_patients <- as.data.table(patient_data)

# Count duplicates
dt_patients[, .N, by = patient_id][N > 1]
```

```
   patient_id     N
        <num> <int>
1:        101     2
2:        102     2
```
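The same pattern extends to several columns at once; this sketch mirrors the earlier multi-column dplyr example:

```r
# Count duplicates across multiple columns; returns zero rows here
# because no patient has two visits on the same date
dt_patients[, .N, by = .(patient_id, visit_date)][N > 1]
```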
## Your Turn!

Try this exercise:

**Problem:** Create a function that returns both the count of duplicates and the duplicate values from a vector.

```r
# Your code here
```
**Solution:**
```r
count_duplicates <- function(x) {
  dup_counts <- table(x)
  list(
    duplicate_values = names(dup_counts[dup_counts > 1]),
    counts = dup_counts[dup_counts > 1]
  )
}

# Test the function
test_vector <- c(1, 2, 2, 3, 3, 3, 4)
count_duplicates(test_vector)
```

```
$duplicate_values
[1] "2" "3"

$counts
x
2 3
2 3
```
## Quick Takeaways
- Base R’s duplicated() is perfect for simple cases
- dplyr offers readable and chainable operations
- data.table provides the best performance for large datasets
- Consider memory usage when working with large healthcare datasets
## Conclusion
Choosing the right method for counting duplicates depends on your specific needs. For healthcare data analysis, I recommend using data.table for large datasets and dplyr for better code readability in smaller datasets.
## Frequently Asked Questions

**Q: Which method is fastest for large datasets?**
A: data.table consistently outperforms the other methods on large datasets.
**Q: Can these methods handle missing values?**
A: Yes, but their defaults differ: duplicated() treats repeated NAs as duplicates of one another, while table() silently drops NAs unless you pass useNA = "ifany". Check each function's NA behavior rather than reaching for a single na.rm-style switch.
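A quick illustration of those defaults (a throwaway vector, just for demonstration):

```r
x <- c(1, 2, NA, NA)

duplicated(x)              # FALSE FALSE FALSE TRUE: the second NA counts as a duplicate
table(x)                   # NAs dropped: counts for 1 and 2 only
table(x, useNA = "ifany")  # adds a count for NA
```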
**Q: How do I count duplicates across multiple columns?**
A: Use group_by() with multiple columns in dplyr, or list multiple columns in data.table's by argument.

**Q: Will these methods work with character vectors?**
A: Yes, all of the methods shown here work with character, numeric, and factor data.

**Q: How can I improve performance when working with millions of rows?**
A: Use data.table and consider setting a key on the columns you group or join on most often.
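On that last point, here is a minimal sketch of keying (it assumes the dt_patients table created above):

```r
# setkey() physically sorts the table by the given column(s), which
# speeds up subsequent joins, subsets, and grouped operations on them
setkey(dt_patients, patient_id)

# keyby groups on the already-sorted key column and keys the result
dt_patients[, .N, keyby = patient_id]
```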