Counting number of times a row value in one column matches a row value in another column in a data frame

AndersdoesR · July 27, 2023, 3:46pm

Lets say that I have a data.frame with fruits and prices.

df <- data.frame(
  fruit = c("bananas", "melons", "bananas", "melons", "apples", "bananas",
            "apples", "melons", "lemons", "melons", "bananas", "apples",
            "oranges", "apples", "apples", "lemons", "melons", "apples"),
  prize = c(1,1,3,2,3,2,1,3,1,2,2,1,3,2,2,1,2,1)
)

For every unique value in the categorical "fruits" column, I want to count the number of times it is equal to each unique value of my other column "prize".

I have tried to do i as follows:

smry_bananas <- df %>%
  filter(fruit == 'bananas') %>%
  group_by(prize) %>%
  summarise(n_1 = sum(prize == '1'),
            n_2 = sum(prize == '2'),
            n_3 = sum(prize == '3')) %>%
  mutate(freq_1 = n_1/sum(n_1 + n_2 + n_3)) %>%
  mutate(freq_2 = n_2/sum(n_1 + n_2 + n_3)) %>%
  mutate(freq_3 = n_3/sum(n_1 + n_2 + n_3))

This works, but I would like to iterate over the unique values of the "fruits" column and not have to write all these lines for each unique value of the column.

I suppose I should use a for loop ir lapply, but I'm not that experienced with these. Thanks in advance, it is much appreciated

scottyd22 · July 27, 2023, 6:00pm

Welcome to the community @AndersdoesR! There may be a more elegant solution, but below is one way to achieve the desired output using the count() and pivot_wider()/pivot_longer() functions.

library(tidyverse)

out = df |>
  group_by(fruit) |>
  mutate(total = n()) |>
  ungroup() |>
  count(fruit, total, prize) |>
  mutate(freq = n/total) |>
  select(fruit, prize, n, freq) |>
  pivot_longer(cols = c(-'fruit', -'prize', -'freq')) |>
  mutate(prize_label = prize) |>
  pivot_wider(names_from = c(name, prize_label), values_from = value) |>
  mutate(prize_label = prize,
         name = 'freq') |>
  pivot_wider(names_from = c(name, prize_label), values_from = freq) |>
  mutate_if(is.numeric, ~replace_na(., 0)) 

# bananas only
out |> filter(fruit == 'bananas')
#> # A tibble: 3 × 8
#>   fruit   prize   n_1   n_2   n_3 freq_1 freq_2 freq_3
#>   <chr>   <dbl> <int> <int> <int>  <dbl>  <dbl>  <dbl>
#> 1 bananas     1     1     0     0   0.25    0     0   
#> 2 bananas     2     0     2     0   0       0.5   0   
#> 3 bananas     3     0     0     1   0       0     0.25

# all fruit
out
#> # A tibble: 11 × 8
#>    fruit   prize   n_1   n_2   n_3 freq_1 freq_2 freq_3
#>    <chr>   <dbl> <int> <int> <int>  <dbl>  <dbl>  <dbl>
#>  1 apples      1     3     0     0   0.5   0      0    
#>  2 apples      2     0     2     0   0     0.333  0    
#>  3 apples      3     0     0     1   0     0      0.167
#>  4 bananas     1     1     0     0   0.25  0      0    
#>  5 bananas     2     0     2     0   0     0.5    0    
#>  6 bananas     3     0     0     1   0     0      0.25 
#>  7 lemons      1     2     0     0   1     0      0    
#>  8 melons      1     1     0     0   0.2   0      0    
#>  9 melons      2     0     3     0   0     0.6    0    
#> 10 melons      3     0     0     1   0     0      0.2  
#> 11 oranges     3     0     0     1   0     0      1

Created on 2023-07-27 with reprex v2.0.2

FactOREO · July 28, 2023, 5:49pm

What about using data.table? First transform the data.frame to a data.table with data.table::as.data.table() or collapse::qDT() and then do a oneliner:

res <- dt[, .(count = .N), .(fruit, prize)][, freq := count / sum(count), .(fruit)][order(fruit, prize)]
res
      fruit prize count      freq
 1:  apples     1     3 0.5000000
 2:  apples     2     2 0.3333333
 3:  apples     3     1 0.1666667
 4: bananas     1     1 0.2500000
 5: bananas     2     2 0.5000000
 6: bananas     3     1 0.2500000
 7:  lemons     1     2 1.0000000
 8:  melons     1     1 0.2000000
 9:  melons     2     3 0.6000000
10:  melons     3     1 0.2000000
11: oranges     3     1 1.0000000

If it is desired to have a column for each frequency and count as well, use data.table::dcast.data.table():

> dcast(res, fruit ~ prize, value.var = c("count", "freq"), fill = 0)
     fruit count_1 count_2 count_3 freq_1    freq_2    freq_3
1:  apples       3       2       1   0.50 0.3333333 0.1666667
2: bananas       1       2       1   0.25 0.5000000 0.2500000
3:  lemons       2       0       0   1.00 0.0000000 0.0000000
4:  melons       1       3       1   0.20 0.6000000 0.2000000
5: oranges       0       0       1   0.00 0.0000000 1.0000000

Explanation
1.) dt[, .(count = .N), .(fruit, prize)] Gets the count by group with data.table syntax (Groups are combinations of fruit and prize, indicated by .(fruit, prize) which is the data.table syntax for list()
2.) [, freq := count / sum(count), .(fruit)] adds a column freq by reference to the result from 1.) also by group, but this time only by fruits to obtain the correct result of sum(count) from the data.table
3.) [order(fruit, prize)] just rearranges the output for a better looking result
4.) dcast(res, fruit ~ prize, value.var = c("count", "freq"), fill = 0)uses the formula fruit ~ prize which maps fruit to the "y-axis" and prize to the "x-axis". The value.varnow spans all combinaitons of prize with count and freq respectively and fills the entries with the values from count/freq

AndersdoesR · July 29, 2023, 8:46am

Thank you very much scottyd22! I really appreciate the help!

AndersdoesR · July 29, 2023, 9:05am

Thanks FactOREO this is a great solution also! Didn't know the data.table but thats really great. Thanks a lot

system · September 9, 2023, 9:05am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.