Hi, I'm working with a dataset and trying to clean the data. There are over 500 rows, and 3 of them are duplicates. I ran the duplicated() function and it gave me a whole list of FALSE and TRUE values (obviously TRUE marks a duplicate), but I was wondering if there is a more efficient way to do this step that could highlight or tell me exactly where the duplicates are. Thanks!
# Small example data frame containing duplicated rows.
(exampledata <- data.frame(a = c(3, 1:3, 1),
                           b = letters[c(3, 1:3, 1)]))
# Logical vector: TRUE marks a row that duplicates an earlier one.
(tf_dup <- duplicated(exampledata))
# Row indices of the duplicates.
(location_dup <- which(tf_dup))
# Show the duplicated rows themselves.
exampledata[location_dup, ]
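A small base-R addition to the above, in case it helps: duplicated() alone only flags the later copies. If you also want to see each duplicate together with its first occurrence, you can check from both directions (this is just a sketch using the same exampledata as above):

```r
# Same example data frame as above.
exampledata <- data.frame(a = c(3, 1:3, 1),
                          b = letters[c(3, 1:3, 1)])

# Combining the forward and backward passes marks every row that occurs
# more than once, so originals and duplicates show up side by side.
all_dup <- duplicated(exampledata) | duplicated(exampledata, fromLast = TRUE)
exampledata[all_dup, ]
```

Here rows 1, 2, 4 and 5 are returned: the two (3, "c") rows and the two (1, "a") rows.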
If you want to know not just which rows are duplicates but which rows are the corresponding "originals", you can perform an inner join using the dplyr library.
library(dplyr)
# Create source data.
df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3), b = c("c", "a", "a", "b", "c", "a", "c"))
# Find the indices of duplicated rows.
dup <- df |> duplicated() |> which()
# Split the source data into two data frames.
df1 <- df[-dup, ] # originals (rows 1, 2 and 4)
df2 <- df[dup, ] # duplicates (rows 3, 5, 6 and 7)
# The row names are the row indices in the original data frame df. Assign them to columns.
df1$Original <- row.names(df1)
df2$Duplicate <- row.names(df2)
# Perform an inner join to find the original/duplicate pairings. The "NULL" value for "by"
# (which is actually the default and can be omitted) means rows of df1 and df2 are paired
# based on identical values in all columns they have in common (i.e., all the original
# columns of df).
inner_join(df1, df2, by = NULL) |> select(Original, Duplicate)
# Result:
# Original Duplicate
# 1 1 5
# 2 1 7
# 3 2 3
# 4 2 6
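As an alternative sketch (assuming the same df as above), the original/duplicate pairing can also be computed with a group_by()/mutate() pipeline instead of a join, which avoids splitting the data frame in two:

```r
library(dplyr)

df <- data.frame(a = c(3, 1, 1, 2, 3, 1, 3),
                 b = c("c", "a", "a", "b", "c", "a", "c"))

pairs <- df |>
  mutate(row = row_number()) |>     # remember each row's index
  group_by(a, b) |>                 # group identical rows together
  mutate(Original = first(row)) |>  # first occurrence within each group
  ungroup() |>
  filter(row != Original) |>        # keep only the later copies
  select(Original, Duplicate = row)
pairs
```

This yields the same pairings as the inner join, just ordered by the duplicate's row index (3, 5, 6, 7) rather than by the original's.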