# Comparing Multiple Columns Amongst Different Rows

Hi,

Welcome to the RStudio community!

First of all, next time try to generate a reprex when you post a question with data and code. This helps us get started quickly and help you out. A reprex consists of the minimal code and data needed to recreate the issue or question you're having. You can find instructions on how to build and share one here:

That said, what you are trying to do is not 'unconventional' at all. It can be solved with some basic dplyr magic.

``````r
library(dplyr)

#Get the data
myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2, 4, 2, 3, 4, 4, 2),
  colorComplexion = c(0.34, 0.23, 0.68, 0.11, 0.47, 0.25, 0.42)
)
myData
#>   fruitNumber  fruit length colorComplexion
#> 1           1  Apple      2            0.34
#> 2           2 Banana      4            0.23
#> 3           3 Orange      2            0.68
#> 4           4  Peach      3            0.11
#> 5           5  Guava      4            0.47
#> 6           6 Banana      4            0.25
#> 7           7 Orange      2            0.42

#Calculate the similarities per group
myData = myData %>%
  group_by(fruit, length) %>%
  mutate(
    similarity = min(colorComplexion) / max(colorComplexion),
    n = n()
  ) %>%
  ungroup()
myData
#> # A tibble: 7 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           2 Banana      4            0.23      0.92      2
#> 3           3 Orange      2            0.68      0.618     2
#> 4           4 Peach       3            0.11      1         1
#> 5           5 Guava       4            0.47      1         1
#> 6           6 Banana      4            0.25      0.92      2
#> 7           7 Orange      2            0.42      0.618     2

#Remove those with high similarity
#i.e. keep the ones with low similarity or only one sample
myData = myData %>% filter(similarity < 0.75 | n == 1)
myData
#> # A tibble: 5 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           3 Orange      2            0.68      0.618     2
#> 3           4 Peach       3            0.11      1         1
#> 4           5 Guava       4            0.47      1         1
#> 5           7 Orange      2            0.42      0.618     2
``````

Created on 2021-07-14 by the reprex package (v2.0.0)

EXPLANATION

• I have added an extra Orange to show the case where the similarity between two items is low (and thus both are kept)
• I grouped data by fruit and length as requested
• I defined similarity in a group as the ratio of the smallest to the largest colorComplexion (min / max), so a value close to 1 means the items are very similar. This also works if there are more than 2 items in a group, though you can easily change this function if needed
• I also calculated the number of items in a group
• Finally, groups with more than one item and a similarity of 0.75 or higher get removed; single-item groups and groups with a similarity < 0.75 are kept.
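
If you want to swap in a different similarity measure, one option is to pass the function as an argument. The sketch below is only an illustration: the helper names `ratioSimilarity`, `relRangeSimilarity` and `filterSimilar` are made up for this example and are not part of dplyr, and the 0.75 threshold is simply the value used above.

```r
library(dplyr)

# Same example data as above
myData <- data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2, 4, 2, 3, 4, 4, 2),
  colorComplexion = c(0.34, 0.23, 0.68, 0.11, 0.47, 0.25, 0.42)
)

# Hypothetical helper: the measure used above (smallest / largest value)
ratioSimilarity <- function(x) min(x) / max(x)

# Hypothetical alternative: 1 minus the range relative to the group mean
relRangeSimilarity <- function(x) 1 - (max(x) - min(x)) / mean(x)

# Hypothetical wrapper: compute similarity per (fruit, length) group, then
# keep single-item groups and groups whose similarity is below the threshold
filterSimilar <- function(data, similarityFun, threshold = 0.75) {
  data %>%
    group_by(fruit, length) %>%
    mutate(similarity = similarityFun(colorComplexion), n = n()) %>%
    ungroup() %>%
    filter(similarity < threshold | n == 1)
}

resultRatio <- filterSimilar(myData, ratioSimilarity)
resultRange <- filterSimilar(myData, relRangeSimilarity)
resultRatio$fruitNumber
```

On this small data set both measures drop the two Bananas and keep everything else, but they can disagree on other data, so pick the one that matches what "similar" means for your case.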

If you don't know the dplyr package (part of the Tidyverse), just check out everything it can do here.

Hope this helps,
PJ