How to Remove Duplicate Rows in R: A Complete Guide to Data Cleaning

Learn how to effectively remove duplicate rows in R using both Base R and dplyr methods. Complete guide with practical examples and best practices for data cleaning.
Categories

code, rtip
Author

Steven P. Sanderson II, MPH

Published

January 30, 2025

Keywords

Programming, remove duplicate rows in R, R duplicate removal, remove duplicates R dataframe, R data cleaning duplicates, unique rows R, distinct function R, duplicated function R, dplyr remove duplicates, base R duplicate removal, R data frame unique values, how to remove duplicate rows in R using dplyr, remove duplicates from multiple columns in R, fastest way to remove duplicates in R dataframe, compare unique vs distinct function in R, how to keep track of removed duplicates in R, remove duplicate rows R dplyr, unique rows in R dataframe, R remove duplicates multiple columns, distinct() function R, duplicated() function base R, data cleaning R duplicates, R data frame unique rows, remove duplicate observations R, R data manipulation duplicates, efficient duplicate removal R

Introduction

Dealing with duplicate rows is a common challenge in data analysis. Whether you’re working with large datasets or small data frames, knowing how to effectively remove duplicates in R is crucial for maintaining data quality and ensuring accurate analyses.

Understanding Duplicate Rows in R

Duplicate rows are identical observations that appear multiple times in your dataset. They can occur due to data collection errors, system glitches, or merging operations. Identifying and removing these duplicates is essential for accurate data analysis.
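Before removing anything, it helps to confirm that duplicates actually exist. As a minimal sketch (the sales data frame below is made up purely for illustration), duplicated() flags rows that repeat an earlier row and sum() counts them:

# A small, made-up data frame with one repeated observation
sales <- data.frame(
  region = c("East", "West", "East", "North"),
  amount = c(100, 250, 100, 175)
)

# duplicated() flags rows that repeat an earlier row
duplicated(sales)
#> [1] FALSE FALSE  TRUE FALSE

# Count the duplicate rows
sum(duplicated(sales))
#> [1] 1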

Base R Methods for Removing Duplicates

Using unique() Function

The unique() function is the simplest way to remove duplicate rows in base R. Here’s how to use it:

# Remove all duplicate rows
clean_data <- unique(data)

This call drops repeated rows, keeping the first occurrence of each and leaving only distinct rows in the dataset.
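Continuing the made-up sales example from above, note that unique() keeps the original row names of the rows that survive; you can reset them afterwards if you prefer a clean sequence:

# Keep only distinct rows (the first occurrence of each wins)
clean_sales <- unique(sales)
clean_sales
#>   region amount
#> 1   East    100
#> 2   West    250
#> 4  North    175

# Optional: reset row names to 1, 2, 3, ...
rownames(clean_sales) <- NULL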

Using duplicated() Function

The duplicated() function provides more control over duplicate removal:

# Remove duplicates using duplicated()
clean_data <- data[!duplicated(data), ]

Here duplicated() returns a logical vector marking rows that repeat an earlier row; negating it with ! and using the result to subset the data frame keeps only the first occurrence of each row.
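Two useful variations on this pattern, sketched with placeholder names (data and column1 stand in for your own data frame and column): fromLast = TRUE keeps the last occurrence of each duplicate instead of the first, and comparing on a subset of columns restricts which fields define a duplicate.

# Keep the last occurrence of each duplicate instead of the first
clean_data <- data[!duplicated(data, fromLast = TRUE), ]

# Treat rows as duplicates based on the column1 column only
clean_data <- data[!duplicated(data["column1"]), ]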

Using dplyr for Duplicate Removal

The distinct() Function

The dplyr package offers the distinct() function, which is particularly efficient for large datasets:

library(dplyr)
clean_data <- data %>% distinct()

This method is typically faster than the base R functions when working with large datasets.
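You can check this on your own data. Here is a rough benchmarking sketch using the microbenchmark package (assumed to be installed; any timing tool works, and results will vary with your machine and data):

library(dplyr)
library(microbenchmark)

# A larger, made-up data frame with many repeated rows
big <- data.frame(
  id    = sample(1:1000, 100000, replace = TRUE),
  group = sample(letters, 100000, replace = TRUE)
)

# Compare the two approaches; lower median time is faster
microbenchmark(
  base_r = unique(big),
  dplyr  = distinct(big),
  times  = 10
)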

Working with Multiple Columns

To remove duplicates based on specific columns:

# Remove duplicates based on selected columns
clean_data <- data %>% distinct(column1, column2, .keep_all = TRUE)
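Here .keep_all = TRUE keeps every column while de-duplicating on the named ones; without it, distinct() returns only the columns you listed. A quick sketch with a made-up orders data frame:

library(dplyr)

# Made-up orders data: customer 1 appears twice
orders <- data.frame(
  customer_id = c(1, 1, 2, 3),
  order_total = c(20, 35, 50, 15)
)

# First row per customer_id, all columns retained
orders %>% distinct(customer_id, .keep_all = TRUE)

# Only the customer_id column is returned
orders %>% distinct(customer_id)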

Best Practices for Handling Duplicates

  1. Always inspect your data before removal
  2. Consider which columns should determine uniqueness
  3. Document your duplicate removal process
  4. Verify results after removal (see the sketch after this list)
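
The sketch below ties these points together, using a small hypothetical data frame to stand in for your real dataset:

library(dplyr)

# Hypothetical data standing in for your real dataset
data <- data.frame(
  id    = c(1, 2, 2, 3),
  score = c(10, 20, 20, 30)
)

# 1. Inspect: how many duplicate rows are there?
sum(duplicated(data))

# 3. Document: keep a record of the rows being dropped
removed_rows <- data[duplicated(data), ]

# Remove the duplicates
clean_data <- distinct(data)

# 4. Verify: the cleaned data should contain no remaining duplicates
stopifnot(sum(duplicated(clean_data)) == 0)
nrow(data) - nrow(clean_data)  # number of rows removed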

Your Turn!

Try this practice problem:

Create a data frame with duplicate rows and remove them using both base R and dplyr methods:

library(dplyr)

# Problem
# Create this data frame:
df <- data.frame(
  id = c(1, 2, 2, 3, 3),
  value = c("A", "B", "B", "C", "C")
)

# Remove duplicates using both methods
# Your code here...

# Solution
# Base R
unique(df)
#>   id value
#> 1  1     A
#> 2  2     B
#> 4  3     C

# dplyr
df %>% distinct()
#>   id value
#> 1  1     A
#> 2  2     B
#> 3  3     C

Quick Takeaways

  • Use unique() for simple cases in base R
  • Choose distinct() for better performance with large datasets
  • Always verify your results after duplicate removal
  • Consider column-specific duplicate removal when needed

FAQs

Q: Which method is faster for large datasets? A: The distinct() function from dplyr typically performs faster with large datasets.

Q: Can I remove duplicates based on specific columns? A: Yes, using either distinct() with column selection or duplicated() with specific columns.

Q: Will duplicate removal maintain the original row order? A: Both unique() and distinct() generally preserve the order of first appearance.

Q: Can I keep track of removed duplicates? A: Yes, by using duplicated() to create a logical vector before removal.

Q: How do I handle missing values when removing duplicates? A: Both methods treat NA values as equal when comparing rows.
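A minimal illustration of that behavior:

# Two rows that are identical, including their NA values
df_na <- data.frame(x = c(1, 1), y = c(NA, NA))

unique(df_na)           # one row remains
dplyr::distinct(df_na)  # likewise one row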

Conclusion

Mastering duplicate row removal in R is essential for data cleaning and analysis. Whether you choose base R functions or dplyr methods, understanding these techniques will help you maintain clean, accurate datasets.

Engage!

Have you tried these methods in your data analysis? Share your experience in the comments below and let us know which approach works best for your needs. Don’t forget to bookmark this guide for future reference!

Happy Coding! 🚀

You can connect with me at any one of the below:

Telegram Channel here: https://t.me/steveondata

LinkedIn Network here: https://www.linkedin.com/in/spsanderson/

Mastodon Social here: https://mstdn.social/@stevensanderson

RStats Network here: https://rstats.me/@spsanderson

GitHub Network here: https://github.com/spsanderson

Bluesky Network here: https://bsky.app/profile/spsanderson.com

My Book: Extending Excel with Python and R here: https://packt.link/oTyZJ