Comparing Multiple Columns Amongst Different Rows

Hi,

Welcome to the RStudio community!

First of all, next time please try to include a reprex when you post a question with data and code. This helps us get started quickly and help you out. A reprex consists of the minimal code and data needed to recreate the issue or question you're having. You can find instructions on how to build and share one here:

That said, what you are trying to do is not 'unconventional' at all. It can be solved with some basic dplyr magic.

library(dplyr)

#Get the data
myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2,4,2,3,4,4,2),
  colorComplexion= c(0.34,0.23,0.68,0.11,0.47,0.25, 0.42)
)
myData
#>   fruitNumber  fruit length colorComplexion
#> 1           1  Apple      2            0.34
#> 2           2 Banana      4            0.23
#> 3           3 Orange      2            0.68
#> 4           4  Peach      3            0.11
#> 5           5  Guava      4            0.47
#> 6           6 Banana      4            0.25
#> 7           7 Orange      2            0.42

#Calculate the similarities per group
myData = myData %>% group_by(fruit, length) %>% 
  mutate(
    similarity = min(colorComplexion) / max(colorComplexion),
    n = n()) %>% ungroup()
myData
#> # A tibble: 7 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           2 Banana      4            0.23      0.92      2
#> 3           3 Orange      2            0.68      0.618     2
#> 4           4 Peach       3            0.11      1         1
#> 5           5 Guava       4            0.47      1         1
#> 6           6 Banana      4            0.25      0.92      2
#> 7           7 Orange      2            0.42      0.618     2

#Remove those with high similarity
#i.e. keep the ones with low similarity or only one sample
myData = myData %>% filter(similarity < 0.75 | n == 1)
myData
#> # A tibble: 5 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           3 Orange      2            0.68      0.618     2
#> 3           4 Peach       3            0.11      1         1
#> 4           5 Guava       4            0.47      1         1
#> 5           7 Orange      2            0.42      0.618     2

Created on 2021-07-14 by the reprex package (v2.0.0)

EXPLANATION

  • I have added an extra Orange to show the case where the similarity between two would be low (and thus both are kept)
  • I grouped data by fruit and length as requested
  • I defined similarity in a group as the ratio of the smallest to the largest colorComplexion (min / max), so a value close to 1 means the items are very similar. This also works if there are more than 2 in a group, and you can easily change this function if needed
  • I also calculated the number of items in a group
  • Finally, if there is more than one item per group and the group's similarity is 0.75 or higher, the items in that group get removed.
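
If you want to swap in a different similarity measure, here is a rough sketch of one way to do it. The helper name `ratio_similarity` and the `threshold` variable are made up for illustration; the data and the 0.75 cut-off are the same as in the reprex above. It assumes dplyr is installed and does the grouping and filtering in one step, without the extra columns.

```r
library(dplyr)

# Same toy data as in the reprex above
myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2, 4, 2, 3, 4, 4, 2),
  colorComplexion = c(0.34, 0.23, 0.68, 0.11, 0.47, 0.25, 0.42)
)

# Hypothetical helper: ratio of the smallest to the largest value
# (1 means all values in the group are identical)
ratio_similarity = function(x) min(x) / max(x)

# Assumed cut-off, same value as used above
threshold = 0.75

result = myData %>%
  group_by(fruit, length) %>%
  # Keep groups with a single item, and groups whose items are not too similar
  filter(n() == 1 | ratio_similarity(colorComplexion) < threshold) %>%
  ungroup()

result
```

To try another measure, only the body of the helper needs to change, as long as it still returns a single number per group where higher means more similar.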

If you don't know the dplyr package (part of the Tidyverse), check out everything it can do here.

Hope this helps,
PJ