Hi,
Welcome to the RStudio community!
First of all, next time please include a reprex when you post a question with data and code. This helps us get started quickly and help you out. A reprex consists of the minimal code and data needed to recreate the issue or question you're having. You can find instructions on how to build and share one here:
That said, what you are trying to do is not 'unconventional' at all. It can be solved with some basic dplyr magic.
library(dplyr)
#Get the data
myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2, 4, 2, 3, 4, 4, 2),
  colorComplexion = c(0.34, 0.23, 0.68, 0.11, 0.47, 0.25, 0.42)
)
myData
#>   fruitNumber  fruit length colorComplexion
#> 1           1  Apple      2            0.34
#> 2           2 Banana      4            0.23
#> 3           3 Orange      2            0.68
#> 4           4  Peach      3            0.11
#> 5           5  Guava      4            0.47
#> 6           6 Banana      4            0.25
#> 7           7 Orange      2            0.42
#Calculate the similarities per group
myData = myData %>%
  group_by(fruit, length) %>%
  mutate(
    similarity = min(colorComplexion) / max(colorComplexion),
    n = n()
  ) %>%
  ungroup()
myData
#> # A tibble: 7 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           2 Banana      4            0.23      0.92      2
#> 3           3 Orange      2            0.68      0.618     2
#> 4           4 Peach       3            0.11      1         1
#> 5           5 Guava       4            0.47      1         1
#> 6           6 Banana      4            0.25      0.92      2
#> 7           7 Orange      2            0.42      0.618     2
#Remove those with high similarity
#i.e. keep the ones with low similarity or only one sample
myData = myData %>% filter(similarity < 0.75 | n == 1)
myData
#> # A tibble: 5 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           3 Orange      2            0.68      0.618     2
#> 3           4 Peach       3            0.11      1         1
#> 4           5 Guava       4            0.47      1         1
#> 5           7 Orange      2            0.42      0.618     2
Created on 2021-07-14 by the reprex package (v2.0.0)
EXPLANATION
- I have added an extra Orange to show the case where the similarity between two would be low (and thus kept)
- I grouped data by fruit and length as requested
- I defined similarity in a group as the ratio of the smallest to the largest colorComplexion (min / max), so 1 means identical values and lower values mean a bigger difference. This also works if there are more than 2 in a group, and you can easily change this function if needed
- I also calculated the number of items in a group
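- Finally, if there is more than one item in a group and the group's similarity is 0.75 or higher, all items in that group get removed (both Bananas here). Single-item groups and groups with similarity < 0.75 are kept.
If you don't know the dplyr package (part of the Tidyverse), just check out what it can do here.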
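To show how the similarity definition could be swapped out, here is a minimal sketch that moves the ratio into a small helper function. The names `ratioSimilarity`, `threshold` and `result` are my own for illustration and are not part of dplyr. The data are the same example values as above.

```r
library(dplyr)

# Hypothetical helper: ratio of smallest to largest value (1 = identical values)
ratioSimilarity = function(x) min(x) / max(x)

# Assumed cut-off, same value as in the filter above
threshold = 0.75

myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2, 4, 2, 3, 4, 4, 2),
  colorComplexion = c(0.34, 0.23, 0.68, 0.11, 0.47, 0.25, 0.42)
)

# Same steps as above: group, compute similarity and group size, then filter
result = myData %>%
  group_by(fruit, length) %>%
  mutate(similarity = ratioSimilarity(colorComplexion), n = n()) %>%
  ungroup() %>%
  filter(similarity < threshold | n == 1)

result$fruitNumber
#> [1] 1 3 4 5 7
```

Replacing `ratioSimilarity` with another function of the group's values changes the definition without touching the rest of the pipeline.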
Hope this helps,
PJ