Hoping someone can help me out! I am struggling with this situation.
Somehow my RStudio runs the dplyr function summarise_at() really slowly. The problem was solved for two weeks after I updated RStudio, but it has come back again. I have tried reinstalling R and RStudio, and even reinstalling my computer, but nothing really helps. The function still works, just with an extremely long run time.
Forgot to say: this also happens with the aggregate() function. I tried using aggregate() to avoid summarise_at(), but it runs really, really slowly too.
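To show what I mean, the two calls look roughly like this (a sketch with made-up data and column names, not my real data):

library(dplyr)

# Made-up stand-in: two grouping columns and three value columns
df <- data.frame(
  g1 = sample(1:5, 1000, replace = TRUE),
  g2 = sample(1:5, 1000, replace = TRUE),
  v1 = rnorm(1000), v2 = rnorm(1000), v3 = rnorm(1000)
)

# the dplyr call I am using
df %>%
  group_by(g1, g2) %>%
  summarise_at(vars(v1:v3), sum) %>%
  ungroup()

# the base R aggregate() alternative I tried
aggregate(cbind(v1, v2, v3) ~ g1 + g2, data = df, FUN = sum)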
Hi Mingmei, welcome!
Do you have the same problem if you run your code in R (not in RStudio)? What are your versions of R, RStudio, dplyr and OS? Any chance that you could make a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.
If you've never heard of a reprex before, you might want to start by reading this FAQ:
Thank you so much for your advice. A brief introduction: I did test it in plain R as well, and it runs really slowly there too. But last time the problem was solved by updating RStudio, not R, which is why I came here for help. My version of R is 3.5.3 and my version of RStudio is 1.2.1335. I use a Windows 10 computer, and last week I ran the Novabench score test and it was still fine.
I tried to create a reproducible example with the same dimensions as the data I use. Somehow the simple example (mydata and test) doesn't have the problem, but with the real-data RX part the problem still exists. I wish I could upload the data online for you to check, since I'm desperate to fix this problem. LOL. Please let me know if I could do that, or maybe paste the data on the forum. Thank you.
New update: to make the two cases match 100%, I converted DUPERSID and EVNTIDX from factor to numeric, and the problem went away. However, I don't think this is a desirable way to do it. BTW, my DUPERSID has a format like 60001101 and EVNTIDX has a similar but longer format, 600011011361.
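Roughly what the conversion looked like, sketched with made-up IDs in the same format (not the actual MEPS file):

# Stand-in data: the ID columns arrive as factors
RX <- data.frame(
  DUPERSID = factor(c("60001101", "60001102", "60001101")),
  EVNTIDX  = factor(c("600011011361", "600011021362", "600011011363"))
)

# as.numeric() on a factor returns the level codes,
# so go through as.character() to keep the actual ID values
RX$DUPERSID <- as.numeric(as.character(RX$DUPERSID))
RX$EVNTIDX  <- as.numeric(as.character(RX$EVNTIDX))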
On my desktop, creating mydata does feel a little slow. It has 10 million values in it.
Have a look at the reprex below. (On making a good reproducible example, note that it's important to include any libraries you call on. Also note that I needed to comment out the lines referring to the RX object, since it was never set up before it was called. =)
library(dplyr)

# Build a 100,000 x 100 data frame of standard-normal draws (10 million values)
system.time({
  mydata <- as.data.frame(replicate(100, rnorm(100000, 0, 1)))
})
#>    user  system elapsed
#>   0.779   0.101   0.886

# Add two grouping columns
system.time({
  mydata$group <- rpois(100000, 5)
  mydata$group.2 <- rpois(100000, 500)
})
#>    user  system elapsed
#>   0.010   0.001   0.011

# Grouped column sums over V1:V10
system.time({
  test <- mydata %>%
    group_by(group, group.2) %>%
    summarise_at(vars(V1:V10), sum) %>%
    ungroup()
})
#>    user  system elapsed
#>   0.023   0.001   0.024

# Commented out: RX is never created in this reprex
# RX <- RX %>%
#   group_by(DUPERSID, EVNTIDX) %>%
#   summarise_at(vars(RXXP[i]), sum) %>%
#   ungroup()
Sorry for the late response, and thank you for your reply. Actually, I figured out the reason. The thing is that my original data (the real data, RX, from the MEPS dataset) has numeric columns with labels, e.g.:
AMOUNT PAID, PRIVATE INSURANCE (IMPUTED)
[1] 0 0 0 0 0 0 0 0 0 0
If I remove the label and turn everything into plain numeric values, as in the example I created, the computation time is back to normal.
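In case anyone else runs into this: a sketch of one way to strip those labels, assuming they are haven-style label attributes (the object and column names here are stand-ins, not the real MEPS file):

library(haven)

# Stand-in for a labelled MEPS column: a numeric vector carrying a variable label
RX <- data.frame(DUPERSID = c(60001101, 60001101, 60001102),
                 RXPV = c(0, 12.5, 0))
attr(RX$RXPV, "label") <- "AMOUNT PAID, PRIVATE INSURANCE (IMPUTED)"

# zap_label() drops variable labels and zap_labels() drops value labels,
# leaving plain numeric columns
RX_clean <- zap_labels(zap_label(RX))
str(RX_clean$RXPV)  # plain numeric, no label attribute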