Comparing Multiple Columns Amongst Different Rows

pieterjanvc · July 14, 2021, 12:15pm

Hi,

Welcome to the RStudio community!

First of all, next time please include a reprex when you post a question with data and code. This helps us get started quickly and help you out. A reprex consists of the minimal code and data needed to recreate the issue or question you're having. You can find instructions on how to build and share one here:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs
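
As a rough illustration of the "minimal dataset" part of a reprex, base R's dput() prints code that recreates an object, so it can be pasted straight into a post. This is only a sketch: the name sampleData and the choice of the first three rows of the built-in iris data are my own assumptions for the example, not part of your question.

```r
# Small illustrative data frame: the first three rows of the built-in iris data
# (sampleData is a hypothetical name chosen for this example)
sampleData <- head(iris, 3)

# dput() prints R code that rebuilds the object; paste that output into
# your post so others can recreate exactly the same data
dput(sampleData)
```

Whoever answers can then assign the pasted output to a variable and run your code against it. The reprex package's reprex() function can render the code together with its output, as in the session further down.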

That said, what you are trying to do is not 'unconventional' at all. It can be solved with some basic dplyr magic.

library(dplyr)

#Get the data
myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2,4,2,3,4,4,2),
  colorComplexion= c(0.34,0.23,0.68,0.11,0.47,0.25, 0.42)
)
myData
#>   fruitNumber  fruit length colorComplexion
#> 1           1  Apple      2            0.34
#> 2           2 Banana      4            0.23
#> 3           3 Orange      2            0.68
#> 4           4  Peach      3            0.11
#> 5           5  Guava      4            0.47
#> 6           6 Banana      4            0.25
#> 7           7 Orange      2            0.42

#Calculate the similarities per group
myData = myData %>% group_by(fruit, length) %>% 
  mutate(
    similarity = min(colorComplexion) / max(colorComplexion),
    n = n()) %>% ungroup()
myData
#> # A tibble: 7 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           2 Banana      4            0.23      0.92      2
#> 3           3 Orange      2            0.68      0.618     2
#> 4           4 Peach       3            0.11      1         1
#> 5           5 Guava       4            0.47      1         1
#> 6           6 Banana      4            0.25      0.92      2
#> 7           7 Orange      2            0.42      0.618     2

#Remove those with high similarity
#i.e. keep the ones with low similarity or only one sample
myData = myData %>% filter(similarity < 0.75 | n == 1)
myData
#> # A tibble: 5 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           3 Orange      2            0.68      0.618     2
#> 3           4 Peach       3            0.11      1         1
#> 4           5 Guava       4            0.47      1         1
#> 5           7 Orange      2            0.42      0.618     2

Created on 2021-07-14 by the reprex package (v2.0.0)

EXPLANATION

I added an extra Orange to show the case where the similarity between two items is low (and thus both are kept).
I grouped the data by fruit and length, as requested.
I defined the similarity within a group as the ratio of the smallest to the largest colorComplexion (so this also works if there are more than 2 items in a group), though you can easily change this function if needed.
I also calculated the number of items in each group.
Finally, if a group has more than one item and its similarity is 0.75 or higher, all of its rows get removed.
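
If you want to swap in a different similarity measure, one option is to wrap the steps above in a small function that takes the measure as an argument. This is only a sketch: ratioSimilarity, dropSimilar, simFun and threshold are names I made up for the example, and the default simply reproduces the min/max ratio used above.

```r
library(dplyr)

# Hypothetical helper: the same min/max ratio used in the code above
ratioSimilarity <- function(x) min(x) / max(x)

# Hypothetical wrapper around the pipeline above; simFun and threshold
# are illustrative parameter names, not part of dplyr
dropSimilar <- function(data, simFun = ratioSimilarity, threshold = 0.75) {
  data %>%
    group_by(fruit, length) %>%
    mutate(similarity = simFun(colorComplexion), n = n()) %>%
    ungroup() %>%
    # keep single-item groups and groups whose similarity is below the threshold
    filter(similarity < threshold | n == 1)
}

# Same example data as above
fruits <- data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2, 4, 2, 3, 4, 4, 2),
  colorComplexion = c(0.34, 0.23, 0.68, 0.11, 0.47, 0.25, 0.42)
)

result <- dropSimilar(fruits)
result$fruitNumber
#> [1] 1 3 4 5 7
```

To try another measure, pass it as simFun, for example dropSimilar(fruits, simFun = function(x) 1 - (max(x) - min(x))) for a similarity based on the absolute difference. Whether the 0.75 threshold still makes sense for a different measure is something you would have to decide for your own data.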

If you don't know the dplyr package (part of the Tidyverse), just check out what it can do here.

Hope this helps,
PJ