compare objects in a dataframe

StephanieBR · October 5, 2020, 6:56am

I have a data frame with X columns and Y rows. I want to compare every pair of rows, column by column, and sum the mismatches. For example:

. A B C D
1 1 4 3 2
2 3 1 5 3
3 5 2 4 3
I want to compare rows 1 and 2, 1 and 3, and 2 and 3.

For each mismatch I record 1, and 0 if the values are equal, then sum the mismatches for each pair:
1 1 1 1 = 4
1 1 1 1 = 4
1 1 1 0 = 3

Then return the object with the smallest number of mismatches.

woodward · October 5, 2020, 8:32am

Are they all numbers? I think it might be more efficient to use matrices than data frames. You can loop through the rows and then use matrix operations to calculate the mismatches.
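
A minimal sketch of the matrix approach suggested above, assuming all values are numeric. The names `m`, `mismatch_counts` and `pairwise` are made up for illustration; the data are the three rows from the question.

```r
# Example data from the question, stored as a matrix (rows are objects).
m <- matrix(c(1, 4, 3, 2,
              3, 1, 5, 3,
              5, 2, 4, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("A", "B", "C", "D")))

# Hypothetical helper: count the mismatches between row i and every row of m.
# t(m) has one column per original row, so m[i, ] is recycled down each column.
mismatch_counts <- function(m, i) colSums(t(m) != m[i, ])

# Square matrix of pairwise mismatch counts; entry [i, j] compares rows i and j.
pairwise <- sapply(seq_len(nrow(m)), function(i) mismatch_counts(m, i))
pairwise
```

For this data the off-diagonal entries are 4, 4 and 3, matching the sums in the question, and the diagonal is 0 because each row matches itself.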

nirgrahamuk · October 5, 2020, 8:43am

(mydf <- structure(list(A = c(1L, 3L, 5L), B = c(4L, 1L, 2L), C = c(
  3L,
  5L, 4L
), D = c(2L, 3L, 3L)), row.names = c(NA, -3L), class = c(
  "tbl_df",
  "tbl", "data.frame"
)))

(index_combs <- combn(seq_len(nrow(mydf)),2,simplify = FALSE))

distfunc <- function(x,y){
   as.integer(x!=y)
}

(raw_evals <- purrr::map(index_combs,
           ~distfunc(mydf[.[[1]],],
                     mydf[.[[2]],])) )
# map_int returns a plain integer vector, which is what which.min expects
(sum_evals <- purrr::map_int(raw_evals, sum))

# which is the min ?
(theminis <- which.min(sum_evals))

raw_evals[[theminis]]

jmcvw · October 5, 2020, 9:41am

This is really quick to do using base R. I added an extra row to your example data frame.
This returns a vector with the number of non-matches for each pair of rows. If you need the whole matrix of 1s and 0s, you could replace the `sum` function with `as.integer`.

# create data
df1 <- structure(list(A = c(1L, 3L, 5L, 5L), B = c(4L, 1L, 2L, 1L),
    C = c(3L, 5L, 4L, 4L), D = c(2L, 3L, 3L, 4L)),
    class = "data.frame", row.names = c(NA, -4L))

 # Matrix of all combinations of rows
com <- combn(nrow(df1), 2) 

# Loop through all the row combos and count the values that do not match
apply(com, 2, function(i) sum(df1[i[1], ] != df1[i[2], ])) 

#> [1] 4 4 4 3 3 2
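
The original question also asked for the pair with the fewest mismatches. One way to pull that out of the counts, as a sketch (`mismatches` and `closest` are names introduced here, not part of the answer above):

```r
# Same four-row example data as in this post.
df1 <- data.frame(A = c(1L, 3L, 5L, 5L), B = c(4L, 1L, 2L, 1L),
                  C = c(3L, 5L, 4L, 4L), D = c(2L, 3L, 3L, 4L))

# Each column of com holds one pair of row indices.
com <- combn(nrow(df1), 2)

# Number of mismatching columns for each pair of rows.
mismatches <- apply(com, 2, function(i) sum(df1[i[1], ] != df1[i[2], ]))

# Row indices of the pair with the fewest mismatches (the first pair if tied).
closest <- com[, which.min(mismatches)]
closest
```

With this data the smallest count is 2, for rows 3 and 4. Note that `which.min` only reports the first minimum, so ties would need `which(mismatches == min(mismatches))` instead.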

StephanieBR · October 5, 2020, 8:47pm

I am using numbers. Thank you very much. It is easier with matrices.

StephanieBR · October 5, 2020, 8:59pm

Hi, thank you very much. It is very fast. But what if I wanted to select the first object and then compare it with the rest of the data points? From your example:
- Select A
- Compare: A and B, A and C, A and D
- Sum the mismatches for each pair: 4, 4, 3
- Sum all of them: 11
- Then do the same with the next object, B, and so on

jmcvw · October 7, 2020, 9:21am

Hi @StephanieBR, I hope you have already managed to find a solution for this yourself.

I am not sure if I quite understand what you're looking for, but maybe this, using `gtools::permutations`?

# Data
df1 <- structure(list(A = c(1L, 3L, 5L, 2L), B = c(4L, 1L, 2L, 3L), C = c(3L, 5L, 4L, 3L), D = c(2L, 3L, 3L, 4L)), class = "data.frame", row.names = c(NA, -4L))

# Get all ordered pairs of rows using gtools::permutations
combs <- gtools::permutations(nrow(df1), 2)

# Loop through all the row pairs and record each mismatch as 1 (0 for a match)
# Note that we use `1` here instead of `2` as in the previous answer, because each pair is now a row of `combs` rather than a column
result <- apply(combs, 1, function(i) as.integer(df1[i[1], ] != df1[i[2], ]))

# Identify the results if needed
colnames(result) <- paste(combs[, 1], combs[, 2], sep = '_')

# Sum the mismatches
colSums(result)
#> 1_2 1_3 1_4 2_1 2_3 2_4 3_1 3_2 3_4 4_1 4_2 4_3 
#>   4   4   3   4   3   4   4   3   4   3   4   4

# Or view the whole matrix of results. I have transposed the results here with `t()` because I think it is easier to view
t(result)

#>     [,1] [,2] [,3] [,4]
#> 1_2    1    1    1    1
#> 1_3    1    1    1    1
#> 1_4    1    1    0    1
#> 2_1    1    1    1    1
#> 2_3    1    1    1    0
#> 2_4    1    1    1    1
#> 3_1    1    1    1    1
#> 3_2    1    1    1    0
#> 3_4    1    1    1    1
#> 4_1    1    1    0    1
#> 4_2    1    1    1    1
#> 4_3    1    1    1    1
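
If the goal is one total per object (for example 4 + 4 + 3 = 11 for the first row), the per-pair sums can be grouped by the first index of each pair. This is a sketch under that reading of the question. It uses base R's `expand.grid` as a stand-in for `gtools::permutations` so that it runs without extra packages, and `pairs`, `pair_sums` and `totals` are names introduced here.

```r
# Same four-row example data as above.
df1 <- data.frame(A = c(1L, 3L, 5L, 2L), B = c(4L, 1L, 2L, 3L),
                  C = c(3L, 5L, 4L, 3L), D = c(2L, 3L, 3L, 4L))

# All ordered pairs of distinct row indices.
pairs <- expand.grid(first = seq_len(nrow(df1)), second = seq_len(nrow(df1)))
pairs <- pairs[pairs$first != pairs$second, ]

# Mismatch count for each ordered pair.
pair_sums <- mapply(function(i, j) sum(df1[i, ] != df1[j, ]),
                    pairs$first, pairs$second)

# Total mismatches for each row against all the other rows.
totals <- tapply(pair_sums, pairs$first, sum)
totals

# Row with the smallest total (the first one if there is a tie).
which.min(totals)
```

For this particular data every row totals 11, so `which.min` simply returns the first row; with real data the totals would normally differ.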
