Products per customer ID -> pair permutations of products into two columns

pathos · July 23, 2023, 7:10pm

I'm using this data:

df = df |>
  mutate(customer_id = str_replace(customer_id, '.0$', '')) |>
  filter(customer_id != 'no customer id')

My focus is on two columns -- customer_id and stock_code.

df |> 
  distinct(customer_id, stock_code) |> 
  arrange(customer_id, stock_code)

# A tidytable: 261,019 × 2
   customer_id stock_code
   <chr>       <chr>     
 1 12369       23166AP   
 2 12370       16008AP   
 3 12370       17021AP   
 4 12370       20665AP   
 5 12370       20719AP   
 6 12370       20780AP   
 7 12370       20782AP   
 8 12370       20966AP   
 9 12370       21035AP   
10 12370       21041AP   
# ℹ 261,009 more rows

Above is products per customer ID.
You can see that customer_id 12370 has multiple stock_codes.
The end result I would like is two columns of products.
A combination of two products per row, but the pairs exist as a purchase per customer_id.
However, because it's in alphabetical order, there doesn't need to be a reversal (i.e. X -> Y and Y -> X)
End result I would like, just from above 10 rows of data, would be something like this:

column1|column2|customer_id
16008AP|17021AP|12370
16008AP|20665AP|12370
16008AP|20719AP|12370
16008AP|20780AP|12370
16008AP|20782AP|12370
16008AP|20966AP|12370
16008AP|21035AP|12370
16008AP|21041AP|12370
20665AP|20719AP|12370
20665AP|20780AP|12370
20665AP|20782AP|12370
20665AP|20966AP|12370
20665AP|21035AP|12370
20665AP|21041AP|12370
20719AP|20780AP|12370
20719AP|20782AP|12370
20719AP|20966AP|12370
20719AP|21035AP|12370
20719AP|21041AP|12370
20780AP|20782AP|12370
20780AP|20782AP|12370
20780AP|20782AP|12370
20780AP|20782AP|12370
20780AP|20782AP|12370
20966AP|21035AP|12370
20966AP|21041AP|12370
21035AP|21041AP|12370

Would it be just a for loop within a for loop?

MarekGierlinski · July 24, 2023, 8:31am

You can split the data frame into customer chunks and use expand_grid to create all combinations of stock codes. Filtering by column1 < column2 ensures unique pairs.

df |>
  group_split(customer_id) |>
  map(function(d) {
    expand_grid(column1 = d$stock_code, column2 = d$stock_code) |>
      filter(column1 < column2) |>
      add_column(customer_id = first(d$customer_id))
  }) |>
  list_rbind()

# A tibble: 57,701 × 3
   column1 column2 customer_id
   <chr>   <chr>   <chr>      
 1 16008AP 17021AP 12370      
 2 16008AP 20665AP 12370      
 3 16008AP 20719AP 12370      
 4 16008AP 20780AP 12370      
 5 16008AP 20782AP 12370      
 6 16008AP 20966AP 12370      
 7 16008AP 21035AP 12370      
 8 16008AP 21041AP 12370      
 9 16008AP 21064AP 12370      
10 16008AP 21154AP 12370      
...

pathos · July 24, 2023, 5:02pm

rbind_list()seems to be obsolete, from dplyr version 0.3, but I'm using bind_rows(), which seems to work.

Thanks for the solution, it's actually pretty creative. I would not have approached this like this.

I like map(), but I will try to find an alternative for first(customer_id), group_split(), the (anonymous) function, and $.

MarekGierlinski · July 25, 2023, 6:44am

list_rbind() is not obsolete, it is actually recommended by tidyverse team. There is a group of functions map_()*, e.g., map_dfr()which would do the mapping and then binding the list by rows in one go. In the most recent documentation map_dfr() is now marked "superseded", with a recommendation to use map() with list_rbind() instead.

I agree that group_split() and first(customer_id) are not elegant. If you find a better solution, post it here for everyone to see.

system · August 1, 2023, 6:44am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.