Graph clumping together even with group_by() and unique() function

nathasyaeca · November 10, 2023, 6:12am

I have a df of YouTube trending videos, and I'm trying to find the most liked and disliked videos by category_id. I've run into a problem that I can't crack because I'm fairly new to the data analysis world. If you could help me, that would be great! Here's the code:

youtube %>% 
  group_by(category_id) %>%
  mutate(ratio = likes/views,
         category_id = fct_reorder(category_id, ratio)) %>% 
  unique() %>%
  head(7) %>% 
  ggplot(aes(category_id, ratio)) +
  geom_segment(aes(x = category_id, xend = category_id,
                   y = 0, yend = ratio), color = "skyblue") +
  geom_point(color = "blue", size = 4, alpha = .7) +
  labs(x = "Category ID",
       y = "Ratio",
       title = "Categories of Video That Generated Most Likes",
       subtitle = "Calculating Ratio Between Likes in Videos and Total Viewers") +
  theme_classic()

What I want is a graph with a distinct, unique x-axis position for each value, but instead the points at x-axis value 24 are clumping together. Why is that?
[Image: trendingVideo]

P.S. I already checked the unique values in category_id, and there is indeed only one value of 24, meaning there's no duplicate. But somehow my calculation still plots 24 as four different values. Besides unique(), I also tried dplyr::distinct() and the result is still the same.

nirgrahamuk · November 10, 2023, 9:34am

It may be considered a style choice, but for me it's simply more reliable to explicitly separate the data manipulations that lead up to drawing a graph from the graphing code itself. It makes it easier to write better code and avoid mistakes
(primarily because it's trivial to look at the frame you pass to the graph code to understand what you are doing).

In my own code, I practically never use a pipe to go into ggplot; I always pass a data.frame with a name.

That said, you didn't provide example data, but it was simple enough for me to quickly produce a trivial example frame to show the effects of the code you wrote and my proposal for an alternative. Hope it helps you.

library(tidyverse)
(youtube <- data.frame(
  category_id = c("a", "b", "b"),
  views = rep(100, 3),
  likes = 10 * (1:3)
))


(what_you_are_doing <- youtube |>
  group_by(category_id) |>
  mutate(
    ratio = likes / views,
    category_id = fct_reorder(category_id, ratio)
  ) |>
  unique() |>
  head(7))

#  category_id views likes ratio
# <fct>       <dbl> <dbl> <dbl>
#1 a             100    10   0.1
#2 b             100    20   0.2
#3 b             100    30   0.3
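
The output above also suggests why unique() and distinct() made no difference: they compare whole rows, and rows that share a category_id still differ in likes and ratio. A minimal sketch on the same toy frame (the name with_ratio is only illustrative):

```r
library(dplyr)

# Toy data with the same shape as the example frame (not the real trending data)
youtube <- data.frame(
  category_id = c("a", "b", "b"),
  views = rep(100, 3),
  likes = 10 * (1:3)
)

# Row-wise ratio, as in the original mutate() call
with_ratio <- youtube %>%
  mutate(ratio = likes / views)

# distinct() only drops rows that are identical in every column,
# so both "b" rows survive because their likes and ratio differ
nrow(distinct(with_ratio))          # 3

# The category itself has just two distinct values
n_distinct(with_ratio$category_id)  # 2
```

Each surviving row becomes its own point on the plot, which is how one category can end up with several points stacked at the same x position.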

(what_you_probably_intended <- group_by(
  youtube,
  category_id
) |>
  summarise(ratio = sum(likes) / sum(views)) |>
  mutate(
    category_id = fct_reorder(category_id, ratio, .desc = TRUE)
  ) |>
  head(7))

#  category_id ratio
# <fct>       <dbl>
#1 a            0.1 
#2 b            0.25

bvbun · November 10, 2023, 9:54am

In addition to the explanation nirgrahamuk already gave:
group_by() is (almost) always followed by a function that operates on the groups, such as summarise(), group_map(), group_nest(), group_split(), or group_trim().

martin.R · November 10, 2023, 1:58pm

This is a perfectly legitimate order of operations: group_by() %>% mutate(). This would be used to calculate a value per row based on a grouped stat, e.g. a grouped ratio.

The actions were just not done correctly in this instance.
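
As an illustration of that pattern, here is a minimal sketch on toy data where group_by() %>% mutate() keeps every row and attaches a grouped statistic to it. The column names category_ratio and share_of_category are invented for this example.

```r
library(dplyr)

# Toy data: one row per video (not the real trending data)
youtube <- data.frame(
  category_id = c("a", "b", "b"),
  views = rep(100, 3),
  likes = 10 * (1:3)
)

per_video <- youtube %>%
  group_by(category_id) %>%
  mutate(
    # grouped stat, repeated on every row of the group
    category_ratio = sum(likes) / sum(views),
    # each video's likes relative to its category total
    share_of_category = likes / sum(likes)
  ) %>%
  ungroup()

per_video
```

The result still has one row per video, so it suits per-row comparisons. A plot with one point per category needs summarise() instead, which collapses each group to a single row.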

nathasyaeca · November 10, 2023, 2:29pm

You explained it so well (and your code is so clean and easy to understand!). Thank you so much for the answer!

nathasyaeca · November 10, 2023, 2:31pm

Oh, thanks for the tips. Will definitely remember it!
