sapply function doesn't work with long lists

schianod · July 13, 2020, 10:08am

I have this code that takes data from a database. Since every six rows hold information about a single unit, I used the split function to group them into a list of tibbles, one tibble per unit. After that, if the string "#ERROR" is found in one of the tibbles, the code deletes that element of the list (the six-row tibble).

library(tidyverse)
library(readxl)  # read_excel() comes from readxl, which library(tidyverse) does not attach
dataset <- read_excel('test.xlsx')


my.list<-split(dataset, rep(1:119, each = 6)) 
new.list <- my.list[sapply(1:length(my.list), function(i) 
  all(my.list[[i]] != "#ERROR"))]

This code works, but when I use another database (the real one; I have been using this smaller one to make trials faster), my.list is split into more than 17k elements, and I think that causes problems for the sapply function: the new.list it returns has fewer elements, but those elements are empty. I have already checked for formatting differences in the database, and that is not the problem. How else can I write this to make it work?
I also thought it could be a memory problem, as my.list is already almost 2 GB; I have 8 GB of RAM and an i7 processor.

nirgrahamuk · July 13, 2020, 11:31am

I'm skeptical that it is sapply and not your function's interaction with the data.
Here is a reprex where a list of 250k items, about 500 MB in size, is processed with sapply and doesn't 'break'.

library(tidyverse) 
dataset <- iris


my.list_short<-split(dataset, rep(1:(nrow(dataset)/6), each = 6)) 

#make it biiiiig
my.list <- rep(my.list_short,10000)

new.list <- my.list[sapply(1:length(my.list), function(i) 
  all(my.list[[i]] != "#ERROR"))]

identical(my.list,new.list)
#TRUE
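
The grouping vector in the reprex is derived from nrow(dataset) rather than a hard-coded count like 1:119, which matters once the input changes size. A minimal sketch of that idea, assuming a fixed number of rows per unit; split_every is a hypothetical helper name, not something from this thread or from any package:

```r
# Hypothetical helper (not from the thread): split a data frame into
# consecutive chunks of `size` rows, and fail loudly if rows are left over.
split_every <- function(df, size = 6) {
  n <- nrow(df)
  stopifnot(n > 0, n %% size == 0)
  split(df, rep(seq_len(n / size), each = size))
}

chunks <- split_every(iris, 6)
length(chunks)      # iris has 150 rows, so 150 / 6 = 25 chunks
nrow(chunks[[1]])   # 6
```

With a check like this, a row count that is not a multiple of six stops with an error instead of producing a silently misaligned split.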

schianod · July 13, 2020, 1:32pm

OK, but still: why does it do exactly what it should when I use a database with 1000 rows and 360 columns, but not when the data are arranged in exactly the same way and I have 120k rows and 360 columns?

nirgrahamuk · July 13, 2020, 1:41pm

Either your database returns abnormal results when a large number of rows is involved, and that's the issue to deal with,
or your expectations for the data in the 119k rows you haven't observed before don't quite match what your function expects.
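
One way the data could differ from what the function expects is missing values. This is only a guess, since the thread never shows the real data: if a tibble contains NA and no "#ERROR", all(... != "#ERROR") returns NA rather than TRUE, and indexing a list with an NA logical yields a NULL element. A small sketch on made-up toy data (two rows per unit instead of six; every name here is illustrative):

```r
# Toy data, invented for illustration: three units of two rows each.
toy <- data.frame(
  id    = rep(1:3, each = 2),
  value = c("a", "b", "#ERROR", "c", NA, "d"),
  stringsAsFactors = FALSE
)
units <- split(toy, toy$id)

# Same test as in the question, one logical per unit.
keep <- vapply(units, function(d) all(d != "#ERROR"), logical(1))
keep                           # unit 1: TRUE, unit 2: FALSE, unit 3: NA
table(keep, useNA = "ifany")   # quick check for NA in the index

# Indexing with an NA keeps a slot for unit 3 but fills it with NULL.
bad <- units[keep]
length(bad)                    # 2
is.null(bad[[2]])              # TRUE

# One possible fix, assuming an NA cell should not count as an error:
keep_narm <- vapply(units, function(d) all(d != "#ERROR", na.rm = TRUE),
                    logical(1))
good <- units[keep_narm]
names(good)                    # "1" "3"
```

If NA cells should instead cause a unit to be dropped, units[!is.na(keep) & keep] does that. Either way, running table(keep, useNA = "ifany") on the real list shows whether NA is involved at all.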
