I want to use count with group_by but it's not giving me the right answer

zennarooh · July 28, 2022, 6:08pm

I have this dataset: chocolate_bars_dataset
I want to know how many bars were reviewed for each country (bean_origin).
I tried this code:

bars_reviewd <- chocolate_bars %>%
 group_by(bean_origin)%>%
 count(review)

bars_reviewd

But it's not giving me the right answer. Could someone please help me? I would really appreciate it.

jrkrideau · July 28, 2022, 7:29pm

Can you show us some of the data in chocolate_bars?

A handy way to supply some sample data is the dput() function. In the case of a large dataset, something like dput(head(mydata, 100)) should supply the data we need.
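
As a rough illustration of what dput() gives you, here is a minimal sketch. The data frame `mydata` and its values are made up for the example; only the column names are borrowed from the question. dput() prints R code that rebuilds the object, so whoever answers can paste it straight into their session.

```r
# Hypothetical stand-in for the real data; the values are invented.
mydata <- data.frame(
  bean_origin = c("Peru", "Peru", "Belize"),
  review      = c(2019, 2020, 2019)
)

# dput() prints R code that recreates the object.
dput(head(mydata, 2))

# Round trip: capture that printed code as text, then evaluate it again.
code     <- paste(capture.output(dput(head(mydata, 2))), collapse = "\n")
restored <- eval(parse(text = code))

# Compare the rebuilt object with the original subset.
all.equal(restored, head(mydata, 2))
```

In a forum post you would just paste the text printed by dput() and let readers assign it to a name of their choice.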

fcas80 · July 28, 2022, 8:13pm

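library(dplyr)

# Read the tab-delimited file; read.delim() turns spaces in column names into dots
df <- read.delim("C:/Users/Jerry/Desktop/R_files/chocolate.txt", header = TRUE, sep = "\t")
head(df)
colnames(df)

# Count the rows in each bean-origin group
bars_reviewed <- df %>%
  group_by(Country.of.Bean.Origin) %>%
  count()
bars_reviewed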

1 Australia 2
2 Bali 1
3 Belize 56
4 Blend 92
5 Bolivia 47
6 Brazil 60

vkatti · July 29, 2022, 6:23am

Try using only count():

bars_reviewd <- chocolate_bars %>%
 count(bean_origin)
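
To see why the original attempt looked wrong, here is a small sketch on made-up data. The column names `bean_origin` and `review` come from the question; the values are invented, and treating `review` as a review year is an assumption. Naming `review` inside count() after group_by(bean_origin) counts each origin/review combination, not each origin.

```r
library(dplyr)

# Toy data: column names mirror the question, values are invented.
chocolate_bars <- tibble(
  bean_origin = c("Peru", "Peru", "Peru", "Belize", "Belize"),
  review      = c(2019, 2019, 2020, 2019, 2021)
)

# Original attempt: one row per (bean_origin, review) combination.
by_origin_and_review <- chocolate_bars %>%
  group_by(bean_origin) %>%
  count(review)

# Suggested fix: one row per bean_origin.
by_origin <- chocolate_bars %>%
  count(bean_origin)

by_origin_and_review
by_origin
```

On this toy data the first result has one row for every distinct origin and review pair, while the second has a single row per origin with the total in column `n`.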

mikecrobp · July 29, 2022, 7:32am

Agreed. Omit the argument to count().

Or, as the man page shows, you can omit the group_by() and name the grouping column in count(), the way OP did.

I had been treating count() as a shorthand for summarise(n = n()) with an optional sort.
I have only just realised that its advertised function includes the group_by(), and that none of the examples have a group_by() beforehand. See the dplyr reference page "Count observations by group" for count().
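
A quick way to convince yourself of this is to run both forms side by side. The sketch below uses invented data with the column name from the question; `sort = TRUE` is the optional sort mentioned above.

```r
library(dplyr)

# Invented data; only the column name comes from the question.
chocolate_bars <- tibble(
  bean_origin = c("Peru", "Belize", "Peru", "Peru", "Belize", "Bali")
)

# count() does the grouping itself; sort = TRUE orders by descending n.
with_count <- chocolate_bars %>%
  count(bean_origin, sort = TRUE)

# The longer spelling with an explicit group_by() and summarise().
with_summarise <- chocolate_bars %>%
  group_by(bean_origin) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# Compare the two results as plain data frames.
all.equal(as.data.frame(with_count), as.data.frame(with_summarise))
```

The data here have no tied counts, so the row order of the two results is unambiguous; with ties the two forms may order the tied rows differently.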

Ven · July 29, 2022, 9:33am

Here are two solutions, one using data.table and one using dplyr.

data.table

# data.table
library(data.table)

dt_1 <- fread("~/R/20220729_chocolate.csv", header = TRUE)

dt_2 <- dt_1[, .(count = .N), .(`Country of Bean Origin`)] |>
  setorder(-count)

head(dt_2)

   Country of Bean Origin count
1:              Venezuela   254
2:                   Peru   248
3:     Dominican Republic   234
4:                Ecuador   223
5:             Madagascar   184
6:                  Blend   156

dplyr

# dplyr
library(dplyr)

df_1 <- read.csv("~/R/20220729_chocolate.csv", header = TRUE)

df_2 <- df_1 |>
  group_by(Country.of.Bean.Origin) |>
  summarise(count = n()) |>
  arrange(-count)

head(df_2)

# A tibble: 6 x 2
  Country.of.Bean.Origin count
  <chr>                  <int>
1 Venezuela                254
2 Peru                     248
3 Dominican Republic       234
4 Ecuador                  223
5 Madagascar               184
6 Blend                    156

olibravo · August 3, 2022, 8:03am

This answer should be the best answer. It shows the simplest way of getting the result.
And this is the documentation of count(): "count() lets you quickly count the unique values of one or more variables: df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())". I wonder why the author of the question didn't even take a look at it.

Andrzej · August 3, 2022, 8:10am

How can I download the data from the website linked in the first post?
