Remove rows using filter() and conditions

sluyet · January 11, 2022, 10:12am

Dear Community Members,

I have a database with over 20 million entries that capture the behavior (history of buying and selling) of more than 600'000 consumers. The database looks like that :

year <-  c(1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1)
month <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
client <- c("C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", 
            "C2", "C2", "C2", "C2", "C2", 
            "C3", "C3", "C3", "C3", "C3")
code_prod <- c(101100, 101100, 101100, 101100, 1035500, 1038000, 1038000, 105400,  105400, 105400, 
               101100, 101100, 1038000, 105400, 1035500, 
               1038000, 105400, 105400, 101100, 101100)
story <- c("SELL", "SELL", "SELL", "SELL", "BUY < 100", "SELL", "SELL", "SELL", "SELL", "BUY > 99", 
           "SELL", "SELL", "SELL", "BUY < 100", "SELL", 
           "SELL", "SELL", "SELL", "SELL", "BUY > 99")

   year month client code_prod     story
1     1     1     C1    101100      SELL
2     1     2     C1    101100      SELL
3     1     3     C1    101100      SELL
4     1     4     C1    101100      SELL
5     1     5     C1   1035500 BUY < 100
6     2     1     C1   1038000      SELL
7     2     2     C1   1038000      SELL
8     2     3     C1    105400      SELL
9     2     4     C1    105400      SELL
10    2     5     C1    105400  BUY > 99
11    2     1     C2    101100      SELL
12    2     2     C2    101100      SELL
13    2     3     C2   1038000      SELL
14    2     4     C2    105400 BUY < 100
15    2     5     C2   1035500      SELL
16    1     1     C3   1038000      SELL
17    1     2     C3    105400      SELL
18    1     3     C3    105400      SELL
19    1     4     C3    101100      SELL
20    1     5     C3    101100  BUY > 99

I need to keep only those customers who have a "BUY >99" story code, as well as their "SELL" history. In my database example above, this means that I need to delete customer C1 with "BUY < 100", as well as its "SELL" corresponding history, but keep that customer and its "SELL" corresponding history when "BUY > 99". Similarly, I should keep client C3 and delete client C2. As a result, the database should look like that

   year month client code_prod    story
1     2     1     C1   1038000     SELL
2     2     2     C1   1038000     SELL
3     2     3     C1    105400     SELL
4     2     4     C1    105400     SELL
5     2     5     C1    105400 BUY > 99
6     1     1     C3   1038000     SELL
7     1     2     C3    105400     SELL
8     1     3     C3    105400     SELL
9     1     4     C3    101100     SELL
10    1     5     C3    101100 BUY > 99

I tried with filter(), but I'm going in circles. The last solution I used is the following:

new.df  <-  df %>%
  group_by(year, client) %>%
  filter(story == "SELL" | (story == "BUY > 99"))

Obviously, this is not the right solution since only "BUY < 100" rows are removed, but not the corresponding "SELL" rows. In a last desperate attempt, I also tried with ifelse():

filter(ifelse(story == "BUY > 99", keep(story), drop(story)))

Any help would be greatly appreciated!

Many thanks!

JackDavison · January 11, 2022, 10:36am

From my perspective, it looks as though the key identifier here is the client-year combination. If that key has any value that inclues "BUY < 100" we want to get rid of that whole client-year chunk.

To that end, I've used anti_join(), which looks at the values in tibble "y" and removes the corresponding rows in tibble "x". If we detect the string "100" in the dataframe, anti_join()-ing that by "year" and "client" with the original dataframe gets you your result:

library(tidyverse)

year <-  c(1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1)
month <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
client <- c("C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", 
            "C2", "C2", "C2", "C2", "C2", 
            "C3", "C3", "C3", "C3", "C3")
code_prod <- c(101100, 101100, 101100, 101100, 1035500, 1038000, 1038000, 105400,  105400, 105400, 
               101100, 101100, 1038000, 105400, 1035500, 
               1038000, 105400, 105400, 101100, 101100)
story <- c("SELL", "SELL", "SELL", "SELL", "BUY < 100", "SELL", "SELL", "SELL", "SELL", "BUY > 99", 
           "SELL", "SELL", "SELL", "BUY < 100", "SELL", 
           "SELL", "SELL", "SELL", "SELL", "BUY > 99")

df = tibble(year = year, month = month, client = client, code_prod = code_prod, story = story)

anti_join(df, filter(df, str_detect(story, "100")), by = c("year", "client"))
#> # A tibble: 10 x 5
#>     year month client code_prod story   
#>    <dbl> <dbl> <chr>      <dbl> <chr>   
#>  1     2     1 C1       1038000 SELL    
#>  2     2     2 C1       1038000 SELL    
#>  3     2     3 C1        105400 SELL    
#>  4     2     4 C1        105400 SELL    
#>  5     2     5 C1        105400 BUY > 99
#>  6     1     1 C3       1038000 SELL    
#>  7     1     2 C3        105400 SELL    
#>  8     1     3 C3        105400 SELL    
#>  9     1     4 C3        101100 SELL    
#> 10     1     5 C3        101100 BUY > 99

^{Created on 2022-01-11 by the reprex package (v2.0.1)}

sluyet · January 11, 2022, 1:47pm

Your solution is just magic! It works perfectly well! Many many thanks!

system · January 18, 2022, 1:47pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.