Plot Error - Summarize function

A_Naik · June 12, 2020, 2:56pm

Whenever I try to run the plot function, it shows me an error. How do I get rid of this error? Please help.

edx %>% group_by(movieId) %>% summarize(n = n()) %>% ggplot(aes(n)) +
geom_histogram(fill = "rosybrown2", col .... [TRUNCATED]
summarise() ungrouping output (override with .groups argument)

Code:

edx %>% group_by(movieId) %>%
summarize(n = n()) %>%
ggplot(aes(n)) +
geom_histogram(fill = "rosybrown2", color = "black", bins = 10) +
scale_x_log10() +
ggtitle("Total number of movies Ratings")

martin.R · June 12, 2020, 2:57pm

The bit in bold is info, not an error.
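For context, the message comes from summarise() reporting how it handled the grouping of its result, and it names the .groups argument itself. A minimal sketch of using that argument, assuming dplyr 1.0.0 or later; the data frame DF is a made-up stand-in for edx, which is not available here:

```r
library(dplyr)

# Hypothetical stand-in for edx: 500 ratings spread over up to 100 movies
set.seed(1)
DF <- data.frame(movieId = sample(1:100, size = 500, replace = TRUE),
                 rating  = sample(1:5,   size = 500, replace = TRUE))

# Stating the grouping of the result explicitly keeps summarise() quiet;
# the counts are the same as without the argument
counts <- DF %>%
  group_by(movieId) %>%
  summarize(n = n(), .groups = "drop")
```

The message is informational either way, so adding the argument changes only what is printed, not the plot.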

A_Naik · June 12, 2020, 6:14pm

But I don't get a plot. What must I do to get a plot? How do I modify the code?

andresrcs · June 12, 2020, 7:36pm

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue

The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…

A_Naik · June 13, 2020, 5:39am

This is also one of the errors I keep getting. I cannot understand my mistake. Could anybody please help?

edx %>%

group_by(userId) %>% ggplot(aes(n)) +
geom_histogram(color = "cyan", bins = 10) +
scale_x_log10() + xlab("Number of ratings") + .... [TRUNCATED]
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = n.
Did you mistype the name of a data column or forget to add after_stat()?

FJCC · June 13, 2020, 6:06am

In your latest example, it seems that the edx data frame does not have a column named n. Did you leave out the summarize step by mistake?
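One way to check this is to look at the column names before and after the summarize step. This is only a sketch on a made-up data frame DF standing in for edx:

```r
library(dplyr)

# Hypothetical stand-in for edx
set.seed(1)
DF <- data.frame(movieId = sample(1:100, size = 500, replace = TRUE),
                 rating  = sample(1:5,   size = 500, replace = TRUE))

# Without summarize(), the grouped data still has only the original columns
grouped_only <- DF %>% group_by(movieId)
has_n_before <- "n" %in% names(grouped_only)   # FALSE

# summarize() is the step that creates the column n
counted <- grouped_only %>% summarize(n = n(), .groups = "drop")
has_n_after <- "n" %in% names(counted)         # TRUE
```

aes(n) can only find a column that exists in the data passed to ggplot(), which is why the summarize step has to come first.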

Please post a reproducible example, as requested by andresrcs, if you need more help. It is difficult to help you if we cannot work with the same data set you are using. It is a good idea to make a simplified data set to illustrate your problem. Please see the link provided earlier.

A_Naik · June 13, 2020, 12:02pm

This is my code and at the bottom, the information pops up of summarise().
The plot is not created.

[Image: rsz_picture_1]

A_Naik · June 13, 2020, 12:05pm

Image 2 shows the error I'm getting for Aesthetics.

[Image: rsz_picture_2]

What modifications should I do in both these codes to get a plot?

FJCC · June 13, 2020, 12:46pm

Try this simplified reproducible example. Does it work for you?

Please notice that I have included the data and the complete code, not images of them.

library(ggplot2)
library(dplyr)

DF <- data.frame(movieId = sample(1:100,size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))

# This works
DF %>% group_by(movieId) %>%
  summarize(n = n()) %>%
  ggplot(aes(n)) + geom_histogram(fill = "rosybrown2")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


# This does not work because of the missing summarize()
DF %>% group_by(movieId) %>%
  ggplot(aes(n)) + geom_histogram(fill = "rosybrown2")
#> Don't know how to automatically pick scale for object of type function. Defaulting to continuous.
#> Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = n. 
#> Did you mistype the name of a data column or forget to add after_stat()?

Created on 2020-06-13 by the reprex package (v0.3.0)

A_Naik · June 13, 2020, 6:42pm

Should I do this for every plot further in the code?
The code did run, but it's taking hours to form the plot. It's been 5 hours and the plot still hasn't formed.
My computer is slow.
Could you explain why you created DF instead of sticking to edx?

FJCC · June 13, 2020, 7:22pm

I created DF as a convenience because I do not have your data. You should substitute edx where I wrote

DF %>% group_by(movieId) %>%

This calculation should not take so much time. I believe you have 9 million rows, so the code could take a noticeable amount of time but nothing like five hours. I increased my data frame DF to 9 million rows and the calculation ran in about 1 second on my laptop that has 8 GB of memory.
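A timing check along these lines can be sketched with system.time(). The data below is synthetic, and the choice of 10,000 distinct movie ids is an assumption made for illustration; it is not necessarily how the test above was set up. On a machine with little memory, a smaller n_rows is safer.

```r
library(dplyr)

# Hypothetical data: 9 million ratings over up to 10,000 movie ids
n_rows <- 9e6
set.seed(1)
DF <- data.frame(movieId = sample(1:10000, size = n_rows, replace = TRUE),
                 rating  = sample(1:5,     size = n_rows, replace = TRUE))

# system.time() reports how long the grouping and counting take
timing <- system.time(
  counts <- DF %>%
    group_by(movieId) %>%
    summarize(n = n(), .groups = "drop")
)
print(timing["elapsed"])
```

If this step finishes quickly on the real data as well, the delay is more likely in drawing the plot than in the calculation.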

A_Naik · June 14, 2020, 7:00am

The code runs but the plot section remains empty.
How do I increase the speed of the plot formation? It's been over half an hour now since I ran the code you provided above. The plot section is still empty.

FJCC · June 14, 2020, 12:27pm

This code

library(ggplot2)
library(dplyr)

DF <- data.frame(movieId = sample(1:100,size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))


DF %>% group_by(movieId) %>%
  summarize(n = n()) %>%
  ggplot(aes(n)) + geom_histogram(fill = "rosybrown2")

should run very quickly. The data frame has 500 rows and the plotted data has 100 rows. Try breaking it up into three steps, run each step individually, and find which step is taking so long.

library(ggplot2)
library(dplyr)

DF <- data.frame(movieId = sample(1:100,size = 500, replace = TRUE),
                 rating = sample(1:5, size = 500, replace = TRUE))

DF2 <- DF %>% group_by(movieId) %>% 
  summarize(n = n()) 

ggplot(DF2, aes(n)) + geom_histogram(fill = "rosybrown2")
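To see which of the steps is slow, each one can be wrapped in system.time(). The following is a sketch only: it uses the same made-up DF, and it draws into a temporary PDF file (an assumption made here so that the drawing time does not depend on the RStudio plot pane).

```r
library(ggplot2)
library(dplyr)

set.seed(1)
DF <- data.frame(movieId = sample(1:100, size = 500, replace = TRUE),
                 rating  = sample(1:5,   size = 500, replace = TRUE))

# Step 1: group and count
t_summary <- system.time(
  DF2 <- DF %>% group_by(movieId) %>% summarize(n = n(), .groups = "drop")
)

# Step 2: build the plot object (nothing is drawn yet)
t_build <- system.time(
  p <- ggplot(DF2, aes(n)) + geom_histogram(fill = "rosybrown2", bins = 10)
)

# Step 3: draw the plot into a temporary PDF file
out_file <- tempfile(fileext = ".pdf")
t_draw <- system.time({
  pdf(out_file)
  print(p)
  dev.off()
})

# Elapsed seconds for each step
timings <- c(summary = t_summary[["elapsed"]],
             build   = t_build[["elapsed"]],
             draw    = t_draw[["elapsed"]])
print(timings)
```

If drawing to a file is fast but the plot pane stays empty, the problem is more likely with the graphics device or the IDE than with the code.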

A_Naik · June 15, 2020, 6:00am

The plot still isn't forming. Would you know how to speed up the plot formation?

nirgrahamuk · June 15, 2020, 11:13am

Run sessionInfo() in the console and share the output with us; it might reveal an issue.

Also try dev.off() to see whether it temporarily affects the plot window or not;
if it does, run the plotting again.
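These checks can be sketched as below, using only functions from base R. Note that dev.off() raises an error when no graphics device is open, so this sketch looks at dev.list() first.

```r
# Versions of R, the operating system, and the loaded packages
info <- sessionInfo()
print(info)

# dev.list() returns NULL when no graphics device is open
open_devices <- dev.list()
print(open_devices)

# Close devices only when there are any; graphics.off() closes all of
# them at once, whereas dev.off() closes just the current one
if (!is.null(open_devices)) {
  graphics.off()
}
devices_after <- dev.list()
```

After the devices are closed, running the plotting code again opens a fresh device.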

A_Naik · June 16, 2020, 5:39am

The plot formed. It took over an hour though.

vinaychuri · June 16, 2020, 6:08am

Quick question on your "group_by(movieId) %>% summarize(n = n())":

Since you have a single column in group_by, is that not the same as "count(movieId)", which would let you do away with group_by() + summarise()?

A_Naik · June 16, 2020, 1:32pm

Yes both are the same.
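The equivalence can be checked on a small example. This is a sketch on a made-up data frame DF, not on edx:

```r
library(dplyr)

set.seed(1)
DF <- data.frame(movieId = sample(1:100, size = 500, replace = TRUE),
                 rating  = sample(1:5,   size = 500, replace = TRUE))

# The two-step version from the thread
by_summarize <- DF %>%
  group_by(movieId) %>%
  summarize(n = n(), .groups = "drop")

# The one-step version
by_count <- DF %>% count(movieId)

# Compare the ids and the counts column by column
same_counts <- identical(by_summarize$movieId, by_count$movieId) &&
  identical(by_summarize$n, by_count$n)
```

count() also avoids the summarise() grouping message, since no explicit summarise() call is involved.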

A_Naik · June 17, 2020, 6:28pm

Hey, could you please help me by running the code below in your RStudio and telling me if you get the same error?

avg_users <- edx %>%
  left_join(avg_movie_rating, by='movieId') %>%
  group_by(userId) %>%
  filter(n() >= 100) %>%
  summarize(b_u = mean(rating - mu - b_i))
Error: cannot allocate a vector of size 52.9 MB

FJCC · June 17, 2020, 6:44pm

I do not have the objects edx or avg_movie_rating, so I cannot conclude anything from running that command. Try this

nrow(edx)
tmp <- edx %>%
  left_join(avg_movie_rating, by='movieId')
nrow(tmp)

Does tmp have the number of rows you expect? You may get the error just running that part of the code. If so, I would suspect that each movieId appears more than once in avg_movie_rating and the left_join ends up making multiple versions of each row in edx. On the other hand, 52.9MB is not very large. Do you have a computer with little memory?
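The effect of duplicate keys can be illustrated on toy data. Everything below is hypothetical: the small ratings table stands in for edx, and the avg_movie_rating table is invented with one movieId repeated on purpose. The suppressWarnings() call is there because some dplyr versions may warn about joins of this kind.

```r
library(dplyr)

# Hypothetical ratings table standing in for edx
ratings <- data.frame(userId  = c(1, 1, 2, 2),
                      movieId = c(10, 20, 10, 30),
                      rating  = c(4, 3, 5, 2))

# Hypothetical lookup table in which movieId 10 appears twice by mistake
avg_movie_rating <- data.frame(movieId = c(10, 10, 20, 30),
                               b_i     = c(0.5, 0.5, -0.2, 0.1))

# Movie ids that occur more than once in the lookup table
dup_ids <- avg_movie_rating %>% count(movieId) %>% filter(n > 1)

# Each rating of movie 10 matches two lookup rows, so the join grows
joined <- suppressWarnings(
  left_join(ratings, avg_movie_rating, by = "movieId")
)

# With one row per movieId, the join keeps the original number of rows
fixed <- left_join(ratings, distinct(avg_movie_rating), by = "movieId")

row_counts <- c(ratings = nrow(ratings),
                joined  = nrow(joined),
                fixed   = nrow(fixed))
print(row_counts)   # 4, 6, 4
```

distinct() is enough here only because the repeated rows are identical; if the duplicates carried different b_i values, the lookup table itself would need to be recomputed.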