Group one set of values into another set of values (categorize a set of values into another)

expense <- as.tibble (data_frame(date =(Sys.Date() + sort(sample(1:30, 18))), 
                shops = c("Brooklyn", "SunriseMart","Fairway","Westside","Hapanowicz", "Applestone", "Pyramid","BP","Speedway", 
         "Sunoco","LaEsquina", "Xian", "EMo","Mos","HughEsan","MVM","Budget","Uber")))

categories = as.tibble (c("Groceries","Takeout", "Petrol","Travel"))

takeout_master = as_tibble(data_frame(shops= c("LaEsquina", "Xian", "EMo","Mos","HughEsan"), categ = c("Takeout")))

Assume I have 10,000 observations in expenses which are more or less repeats of the above values. How can I group them into the categories?

For example: Brooklyn & Sunrise into Groceries, MVM & Uber into Travel, Xian & Mos into Takeout, Speedway & Sunoco in Petrol etc.

Thanks in advance

Edit: After reading FJCC's response below, I realise my query may be confusing. It is not just Xian & Mos that need to be categorised as Takeout but also LaEsquina, EMo, HughEsan and any other Takeouts that may pop up in the remaining 10,000 values under shops.

It maybe that I need another data_frame like takeout_master and then match these.

I was hoping for a cleaner, advanced and sophisticated way to do this

I think you have to make a table listing each unique vendor and its category. I did a few examples in the code below. The NA values are due to the incomplete list of vendors.

expense <- as.tibble (data_frame(date =(Sys.Date() + sort(sample(1:30, 18))), 
                                 shops = c("Brooklyn", "SunriseMart","Fairway","Westside","Hapanowicz", "Applestone", "Pyramid","BP","Speedway", 
                                           "Sunoco","LaEsquina", "Xian", "EMo","Mos","HughEsan","MVM","Budget","Uber")))
#> Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
#> This warning is displayed once per session.
#> Warning: `data_frame()` is deprecated, use `tibble()`.
#> This warning is displayed once per session.

Correl <- tribble(~Vendor, ~Category,
        "Brooklyn", "Groceries",
        "SunriseMart", "Groceries",
        "MVM", "Travel",
        "Uber", "Travel",
        "Xian", "Takeout",
        "Mos", "Takeout")
expense <- left_join(expense, Correl, by = c(shops = "Vendor"))
#> # A tibble: 18 x 3
#>    date       shops       Category 
#>    <date>     <chr>       <chr>    
#>  1 2020-04-07 Brooklyn    Groceries
#>  2 2020-04-08 SunriseMart Groceries
#>  3 2020-04-09 Fairway     <NA>     
#>  4 2020-04-10 Westside    <NA>     
#>  5 2020-04-11 Hapanowicz  <NA>     
#>  6 2020-04-13 Applestone  <NA>     
#>  7 2020-04-14 Pyramid     <NA>     
#>  8 2020-04-15 BP          <NA>     
#>  9 2020-04-17 Speedway    <NA>     
#> 10 2020-04-19 Sunoco      <NA>     
#> 11 2020-04-22 LaEsquina   <NA>     
#> 12 2020-04-23 Xian        Takeout  
#> 13 2020-04-25 EMo         <NA>     
#> 14 2020-04-28 Mos         Takeout  
#> 15 2020-05-01 HughEsan    <NA>     
#> 16 2020-05-02 MVM         Travel   
#> 17 2020-05-04 Budget      <NA>     
#> 18 2020-05-06 Uber        Travel

Created on 2020-04-06 by the reprex package (v0.3.0)

Thanks FJCC. After reading your response, I have edited my question to make it a bit more clearer.

Regards, OR

I don't think you want to have one data frame for each category of shop. Use one data frame to list every shop and its category. This is what I started with the tibble named Correl. I did not list every vendor because I am lazy and I do not know the categories of the shops you have listed.

There are variations possible in this approach but somewhere you will have to write the appropriate category down for every shop.

1 Like

FJCC, I misunderstood your earlier post. I have marked its as the solution. I'll try it out on the larger dataset. Thank you for taking the time out to respond. Much appreciated.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.