Dear Community Members,
I have a database with over 20 million entries that capture the behavior (history of buying and selling) of more than 600'000 consumers. The database looks like that :
year <- c(1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1)
month <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
client <- c("C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1",
"C2", "C2", "C2", "C2", "C2",
"C3", "C3", "C3", "C3", "C3")
code_prod <- c(101100, 101100, 101100, 101100, 1035500, 1038000, 1038000, 105400, 105400, 105400,
101100, 101100, 1038000, 105400, 1035500,
1038000, 105400, 105400, 101100, 101100)
story <- c("SELL", "SELL", "SELL", "SELL", "BUY < 100", "SELL", "SELL", "SELL", "SELL", "BUY > 99",
"SELL", "SELL", "SELL", "BUY < 100", "SELL",
"SELL", "SELL", "SELL", "SELL", "BUY > 99")
year month client code_prod story
1 1 1 C1 101100 SELL
2 1 2 C1 101100 SELL
3 1 3 C1 101100 SELL
4 1 4 C1 101100 SELL
5 1 5 C1 1035500 BUY < 100
6 2 1 C1 1038000 SELL
7 2 2 C1 1038000 SELL
8 2 3 C1 105400 SELL
9 2 4 C1 105400 SELL
10 2 5 C1 105400 BUY > 99
11 2 1 C2 101100 SELL
12 2 2 C2 101100 SELL
13 2 3 C2 1038000 SELL
14 2 4 C2 105400 BUY < 100
15 2 5 C2 1035500 SELL
16 1 1 C3 1038000 SELL
17 1 2 C3 105400 SELL
18 1 3 C3 105400 SELL
19 1 4 C3 101100 SELL
20 1 5 C3 101100 BUY > 99
I need to keep only those customers who have a "BUY >99" story code, as well as their "SELL" history. In my database example above, this means that I need to delete customer C1 with "BUY < 100", as well as its "SELL" corresponding history, but keep that customer and its "SELL" corresponding history when "BUY > 99". Similarly, I should keep client C3 and delete client C2. As a result, the database should look like that
year month client code_prod story
1 2 1 C1 1038000 SELL
2 2 2 C1 1038000 SELL
3 2 3 C1 105400 SELL
4 2 4 C1 105400 SELL
5 2 5 C1 105400 BUY > 99
6 1 1 C3 1038000 SELL
7 1 2 C3 105400 SELL
8 1 3 C3 105400 SELL
9 1 4 C3 101100 SELL
10 1 5 C3 101100 BUY > 99
I tried with filter(), but I'm going in circles. The last solution I used is the following:
new.df <- df %>%
group_by(year, client) %>%
filter(story == "SELL" | (story == "BUY > 99"))
Obviously, this is not the right solution since only "BUY < 100" rows are removed, but not the corresponding "SELL" rows. In a last desperate attempt, I also tried with ifelse():
filter(ifelse(story == "BUY > 99", keep(story), drop(story)))
Any help would be greatly appreciated!
Many thanks!