Plotting bar graph for average rating of all movies in their respective genres? (ggplot - data-wrangling)

technocrat · March 18, 2020, 8:48am

Hi, and welcome!

Two preliminaries:

Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract quicker and more numerous answers.
Check the community homework policy, which requires some disclosure of the assignment and explains that members are here to help you get unstuck, but not to "give you the answer".

Let's start by looking at the structure of the movies data set:

library(ggplot2movies)
str(movies)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    58788 obs. of  24 variables:
#>  $ title      : chr  "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
#>  $ year       : int  1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
#>  $ length     : int  121 71 7 70 71 91 93 25 97 61 ...
#>  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ rating     : num  6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
#>  $ votes      : int  348 20 5 6 17 45 200 24 18 51 ...
#>  $ r1         : num  4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
#>  $ r2         : num  4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
#>  $ r3         : num  4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
#>  $ r4         : num  4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
#>  $ r5         : num  14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
#>  $ r6         : num  24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
#>  $ r7         : num  24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
#>  $ r8         : num  14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
#>  $ r9         : num  4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
#>  $ r10        : num  4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
#>  $ mpaa       : chr  "" "" "" "" ...
#>  $ Action     : int  0 0 0 0 0 0 1 0 0 0 ...
#>  $ Animation  : int  0 0 1 0 0 0 0 0 0 0 ...
#>  $ Comedy     : int  1 1 0 1 0 0 0 0 0 0 ...
#>  $ Drama      : int  1 0 0 0 0 1 1 0 1 0 ...
#>  $ Documentary: int  0 0 0 0 0 0 0 1 0 0 ...
#>  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Short      : int  0 0 1 0 0 0 0 1 0 0 ...

Created on 2020-03-18 by the reprex package (v0.3.0)

Ok, it's a data frame with 24 variables capturing various aspects of the 58,788 movies it describes.

What's needed? Average rating by genre. Which variable holds the rating for a movie? I'm going to call that SCORE to not spoil the fun.

Which variables indicate the genre? No spoilers here: Action, Animation, Comedy, Drama, Documentary, Romance, and Short.

Using the dplyr package's select function, you can create a skinnier data frame to work with for this problem:

library(dplyr)
movies %>% select(SCORE, Action, Animation, Comedy, Drama, Documentary, Romance, Short) -> genres

Not strictly needed, but easier on the eyes.

genres <- structure(list(SCORE = c(6.4, 6, 8.2, 8.2, 3.4, 4.3), Action = c(0L, 0L, 0L, 0L, 0L, 0L), Animation = c(0L, 0L, 1L, 0L, 0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 0L, 0L), Drama = c(1L, 0L, 0L, 0L, 0L, 1L), Documentary = c(0L, 0L, 0L, 0L, 0L, 0L), Romance = c(0L, 0L, 0L, 0L, 0L, 0L), Short = c(0L, 0L, 1L, 0L, 0L, 0L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
genres
#>   SCORE Action Animation Comedy Drama Documentary Romance Short
#> 1   6.4      0         0      1     1           0       0     0
#> 2   6.0      0         0      1     0           0       0     0
#> 3   8.2      0         1      0     0           0       0     1
#> 4   8.2      0         0      1     0           0       0     0
#> 5   3.4      0         0      0     0           0       0     0
#> 6   4.3      0         0      0     1           0       0     0

Created on 2020-03-18 by the reprex package (v0.3.0)

(These are just the first few rows, of course.)

Assuming you were just interested in Comedy, how would you further reduce genres to just those films?

suppressPackageStartupMessages(library(dplyr)) 
# OMITTED genres <- structure(list ...
comedies <- genres %>% filter(Comedy == 1) %>% select(SCORE,Comedy)
comedies
#> # A tibble: 3 x 2
#>   SCORE Comedy
#>   <dbl>  <int>
#> 1   6.4      1
#> 2   6        1
#> 3   8.2      1

Created on 2020-03-18 by the reprex package (v0.3.0)
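The same pattern carries over to any other genre column. As a rough sketch, here it is for Drama, using only the six sample rows printed above. The data frame is rebuilt by hand with just the columns needed, and `dramas` is a name made up for this illustration.

```r
library(dplyr)

# The six sample rows of genres shown above, reduced to three columns
# (rebuilt by hand for illustration; SCORE is still the placeholder name)
genres <- data.frame(
  SCORE  = c(6.4, 6, 8.2, 8.2, 3.4, 4.3),
  Comedy = c(1L, 1L, 0L, 1L, 0L, 0L),
  Drama  = c(1L, 0L, 0L, 0L, 0L, 1L)
)

# Keep only the rows flagged as Drama, then only the columns of interest
dramas <- genres %>% filter(Drama == 1) %>% select(SCORE, Drama)
dramas
```

Note that the first sample row is flagged as both Comedy and Drama, so it turns up in both subsets. A film can belong to more than one genre, which is why each genre gets its own 0/1 column.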

The function mean() will find your average SCORE, so back to you to fill in the blank:

mean(_____)
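If you haven't used mean() before, here is a small sketch of how it behaves on a made-up vector (`toy_scores` is hypothetical, not taken from movies). The thing to watch for is missing values: the str() output above shows that budget is full of NA, and if the column you average has any, you will need `na.rm = TRUE`.

```r
# A made-up numeric vector with one missing value (not from the movies data)
toy_scores <- c(7, 8, NA, 6)

# With an NA present, mean() returns NA by default
mean(toy_scores)

# na.rm = TRUE drops the NA first: (7 + 8 + 6) / 3 = 7
mean(toy_scores, na.rm = TRUE)
```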