I am trying to write a R dataframe(50K * 206) with mutiple dataype columns as table in Hadoop(Cloudera, Impala, Hue) with dbWriteTable. It is exceptionally slow. Even if I split the dataframe into cunks of 1000, i have been able to write only 5000 rows in last 2 hour.
Is there a way or method to speed this up?
Following is a reprex
> #create a big dataframe
> type <- sample(c("integer()", "character()"),205, replace = T)
> col <- paste0("col", c(1:205))
>
> create_big_df <- function(x){
>
> if(x == "integer()"){
>
> sample(letters, 50000, replace = T)
> }else{
>
> sample(1:10000, 50000, replace = T)
> }
> }
>
> df <- data.frame(map(type, create_big_df))
> colnames(df) <- col
>
> > dim(df)
> [1] 50000 205
>
>
> #splitting df into row chunks
> >split_df <-
> split(df , cut(1:nrow(df), nrow(df) / 100))
>
> #function to append chunks of df
> write_tbl_sandbox_1 <- function(df, n) {
> print(n)
> dbAppendTable(con,
> SQL("sandbox_schema.df"),
> df, append = TRUE)