Hi!
I've a large dataset, where there are lots of duplicates. For the analysis, I need to know:
- which records are duplicates
- of which record they are duplicates of
Let me provide something similar to what I have, what I want and what I've done till now:
# setting seed for reproducibility
set.seed(seed = 1701)
# generating random data
x <- y <- z <- rep(x = 1:4, each = 4)
d <- cbind(sample(x = x),
sample(x = y),
sample(x = z))
# checking duplicates
(t <- duplicated(d))
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE
# desired result
s <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 10, 11, 12, 10, 13, 14)
(e <- data.frame(d, t = t, s = s))
#> X1 X2 X3 t s
#> 1 1 3 4 FALSE 1
#> 2 3 4 3 FALSE 2
#> 3 1 2 4 FALSE 3
#> 4 2 4 4 FALSE 4
#> 5 4 2 1 FALSE 5
#> 6 1 3 3 FALSE 6
#> 7 2 1 2 FALSE 7
#> 8 2 4 3 FALSE 8
#> 9 4 3 2 FALSE 9
#> 10 2 1 2 TRUE 7
#> 11 3 2 1 FALSE 10
#> 12 4 3 4 FALSE 11
#> 13 4 1 2 FALSE 12
#> 14 3 2 1 TRUE 10
#> 15 1 4 3 FALSE 13
#> 16 3 1 1 FALSE 14
# attempt
start <- 1
index <- integer(length = nrow(x = d))
for (i in 1:nrow(x = d))
{
if (t[i] == FALSE)
{
index[i] <- start
start <- (start + 1)
} else
{
for (j in 1:(i - 1))
{
if (identical(x = d[i,], y = d[j,]))
{
index[i] <- index[j]
break
}
}
}
}
# result
(f <- data.frame(e, r = index))
#> X1 X2 X3 t s r
#> 1 1 3 4 FALSE 1 1
#> 2 3 4 3 FALSE 2 2
#> 3 1 2 4 FALSE 3 3
#> 4 2 4 4 FALSE 4 4
#> 5 4 2 1 FALSE 5 5
#> 6 1 3 3 FALSE 6 6
#> 7 2 1 2 FALSE 7 7
#> 8 2 4 3 FALSE 8 8
#> 9 4 3 2 FALSE 9 9
#> 10 2 1 2 TRUE 7 7
#> 11 3 2 1 FALSE 10 10
#> 12 4 3 4 FALSE 11 11
#> 13 4 1 2 FALSE 12 12
#> 14 3 2 1 TRUE 10 10
#> 15 1 4 3 FALSE 13 13
#> 16 3 1 1 FALSE 14 14
Created on 2019-02-09 by the reprex package (v0.2.1)
So, basically my problem is solved, but I suppose there are better ways to do it.
Any suggestions?
Thanks.