Should I rbind a list of data frames into one 'BIG' data frame?

spidey12354 · February 28, 2020, 4:47am

Hi all,
I have used R for only half a year and have no programming background (total newbie).
I would sincerely appreciate any suggestions.

The list I process contains 400 data frames, each with about 2000 rows. If I rbindlist() it, the result contains 1,956,970 rows.
The beauty of the "big data frame" is that I can use the handy tidyverse functions directly, but I have to group_by year many times for different calculations. With lapply, the data is already split nicely by year, but I have to write lousy functions. So which is better? Is there any other way to deal with a big data frame?

(To my surprise, the dplyr package is super fast when dealing with a data frame of a million rows.)
My code looks like the following; the two methods seem comparable in speed.

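library(dplyr)  # provides %>%, mutate(), group_by() and top_n()

# toy data: two identical "years", each a one-column data frame
elementone=data.frame(occurrence=100:109)
elementtwo=elementone
names_list=list(elementone,elementtwo)
names(names_list)=c("year1","year2")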

names_list%>%                            #element-wise 
  lapply(FUN=.%>%
           mutate(total=sum(.[,1]))%>%
           top_n(1,wt=occurrence))%>%
  do.call(rbind,.)

names_list%>%                            #big data.frame
  data.table::rbindlist(use.names = FALSE, idcol = "year")%>%
  group_by(year)%>%
  mutate(total=sum(occurrence))%>%
  top_n(1,wt=occurrence)
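
To get a rough feel for how the two approaches compare at the scale described above, a timing sketch along these lines might help. The simulated data (400 data frames of 2000 random `occurrence` values) and the names `sim_list`, `by_element` and `by_big` are assumptions for illustration, not the original data, and `bind_rows()` stands in for `rbindlist()` so that only dplyr is needed.

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the real data: 400 data frames of 2000 rows each
sim_list <- lapply(1:400, function(i) data.frame(occurrence = sample.int(1e5, 2000)))
names(sim_list) <- paste0("year", 1:400)

# Element-wise: process each data frame, then row-bind the results
by_element <- function(x) {
  res <- lapply(x, function(df) {
    df %>%
      mutate(total = sum(occurrence)) %>%
      top_n(1, wt = occurrence)
  })
  bind_rows(res, .id = "year")
}

# Big data frame: row-bind first, then group by year
by_big <- function(x) {
  bind_rows(x, .id = "year") %>%
    group_by(year) %>%
    mutate(total = sum(occurrence)) %>%
    top_n(1, wt = occurrence) %>%
    ungroup()
}

# Timings depend on the machine and package versions
system.time(res_element <- by_element(sim_list))
system.time(res_big <- by_big(sim_list))
```

Both functions should return one row per year with the same values, so the choice between them can rest on readability rather than on the result.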

dromano · February 28, 2020, 6:04am

You might want to consider using map_dfr(), which row-binds to yield a data frame:

elementone=data.frame(occurrence=100:109)
elementtwo=elementone
names_list=list(elementone,elementtwo)
names(names_list)=c("year1","year2")

library(tidyverse)
# row-bind data frames, add names in new 'year' column
map_dfr(names_list, ~ .x, .id = 'year') %>% 
  group_by(year) %>% 
  mutate(total = sum(occurrence)) %>% 
  top_n(1, wt = occurrence) %>% 
  ungroup()
#> # A tibble: 2 x 3
#>   year  occurrence total
#>   <chr>      <int> <int>
#> 1 year1        109  1045
#> 2 year2        109  1045

Created on 2020-02-27 by the reprex package (v0.3.0)

I'm not sure how fast it is, but it seems to combine benefits of both of your approaches.
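
Since the mapped function here is just the identity (`~ .x`), the same row-binding can probably be written with `dplyr::bind_rows()` and its `.id` argument. The following is a minimal sketch of that alternative on the same toy data; the names `big_df` and `result` are introduced only for illustration.

```r
library(dplyr)

elementone <- data.frame(occurrence = 100:109)
names_list <- list(year1 = elementone, year2 = elementone)

# bind_rows() stacks the list and stores each element's name in the 'year' column
big_df <- bind_rows(names_list, .id = "year")

result <- big_df %>%
  group_by(year) %>%
  mutate(total = sum(occurrence)) %>%
  top_n(1, wt = occurrence) %>%
  ungroup()

result
```

On this toy input the result should match the output shown above: one row per year, with `occurrence` 109 and `total` 1045.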

spidey12354 · February 28, 2020, 1:39pm

Thank you for your time. Done this way, there will be one big data frame, and group_by will be used frequently. I was wondering which approach is the better habit: list-wise operations, or always converting into one data frame?
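
One way to look at this trade-off, sketched on the toy data from this thread: a single `group_by()` can feed several per-year calculations in one `summarise()` call, and base R's `split()` turns the big data frame back into a per-year list if list-wise code is needed later. The names `big_df`, `per_year` and `back_to_list`, and the particular statistics computed, are assumptions for illustration.

```r
library(dplyr)

elementone <- data.frame(occurrence = 100:109)
names_list <- list(year1 = elementone, year2 = elementone)
big_df <- bind_rows(names_list, .id = "year")

# One group_by() can feed several per-year calculations at once
per_year <- big_df %>%
  group_by(year) %>%
  summarise(
    total = sum(occurrence),
    top = max(occurrence),
    mean_occurrence = mean(occurrence)
  )

# If a list is needed again, split the big data frame back by year
# (each element keeps its 'year' column)
back_to_list <- split(big_df, big_df$year)
```

Under these assumptions the two representations are cheap to convert between, so neither choice has to be permanent.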
