Is there a way to count the number of unique rows in a data.frame while accounting for unknown values? I need uniqueness under the assumption that a missing value (NA) can be anything. Here are a few examples of what I mean:
df1 <- data.frame(
  V1 = c("A","B","C","D"),
  V2 = c("X","Y","Z","W")
)
> df1
V1 V2
1 A X
2 B Y
3 C Z
4 D W
Would return 4, as there are 4 unique rows; this is the same as nrow(unique(df1)). However, the following:
df2 <- data.frame(
  V1 = c("A","B","C","C"),
  V2 = c("X","Y","Z",NA)
)
> df2
V1 V2
1 A X
2 B Y
3 C Z
4 C <NA>
Would return 3, as the bottom row could be identical to the 3rd row.
df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X")
)
> df3
V1 V2
1 A X
2 B Y
3 C Z
4 C <NA>
5 B W
6 B W
7 A <NA>
8 <NA> X
Would return 4, since rows 1, 2 and 3 count as unique, and rows 5 and 6 are identical to each other and so add one more. Row 4 could be the same as row 3, so it does not increase the count, and rows 7 and 8 could each be the same as row 1, so they do not affect the count either.
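
To make the rule I have in mind concrete, here is a rough base-R sketch of it, not a definitive solution. The names rows_compatible and count_unique_wild are hypothetical helpers made up for illustration. It assumes that two rows "could be the same" when, in every column, the values are equal or at least one of them is NA, and it counts greedily: rows with the fewest NAs are considered first, and a row only adds to the count if it is incompatible with every row already kept. Because this kind of compatibility is not transitive, the greedy count is only a heuristic and may not be the smallest possible count for every data set, although it reproduces the three results above.

```r
# Hypothetical helper: two rows are compatible if, in every column,
# the values are equal or at least one of them is NA (NA as wildcard).
rows_compatible <- function(a, b) {
  all(is.na(a) | is.na(b) | a == b)
}

# Hypothetical helper: greedy count of "unique" rows under the rule above.
# This is a heuristic sketch, not a guaranteed-minimal count.
count_unique_wild <- function(df) {
  m <- as.matrix(df)                                # character matrix, NAs kept
  m <- m[order(rowSums(is.na(m))), , drop = FALSE]  # fewest NAs first
  kept <- integer(0)                                # indices of representative rows
  for (i in seq_len(nrow(m))) {
    matches_kept <- any(vapply(
      kept,
      function(k) rows_compatible(m[i, ], m[k, ]),
      logical(1)
    ))
    if (!matches_kept) kept <- c(kept, i)           # row i is a new distinct row
  }
  length(kept)
}

df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X")
)
count_unique_wild(df3)  # 4
```

Sorting by the number of NAs means complete rows become the representatives first, so a partially missing row such as row 4 is absorbed by row 3 rather than the other way round.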