Which one? If you mean the second part (the one with `table(nb_cols)`), that's on purpose: I read each file into a list without making any assumptions about its columns, so that we can then inspect the list with `ncol()` or `colnames()`, see what is actually in the files, and decide on the right way to process them. From there, we could either reprocess this list before assembling it with `bind_rows()`, or write a new function that reads and pre-processes each file, and start over in a fresh script.
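For concreteness, here is a minimal sketch of that first, inspection-only step, assuming `file_list` holds the paths to your files and `read_dat()` is your reading function:

```r
# Read each file into a list, with no assumptions about its columns
measurement_list <- lapply(file_list, read_dat)

# How many files have each column count?
table(sapply(measurement_list, ncol))

# What columns does, e.g., the first file contain?
colnames(measurement_list[[1]])
```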
If you mean my alternate `read_dat()`, that's an example of such a function: it pre-processes each file as it is read, before we try to assemble anything. The assumption is that some files have 20 columns, with the names given by `mycols`, and some files have 29 columns, namely the 20 columns from `mycols` plus others we're not interested in. In that case, the function first reads the file: if it has 20 columns we're done; if it has 29 columns, we select the columns whose names are in `mycols` and return that subset data frame.
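In sketch form, that logic is roughly this (the `read.delim()` call is a placeholder for whatever reader fits your format, and `mycols` is assumed to be your character vector of the 20 wanted column names):

```r
read_dat <- function(path) {
  df <- read.delim(path)  # placeholder: use whatever reader fits your files
  if (ncol(df) == 29) {
    # 29-column file: keep only the 20 columns named in mycols
    df <- df[, mycols]
  }
  # 20-column files pass through unchanged
  df
}
```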
I might be misunderstanding what you want: I thought you had a bunch of files with 20 or 29 columns and wanted to assemble them into a single data frame. For that, you first need to ensure they all have the same columns, then assemble them with `dplyr::bind_rows()`, `do.call(rbind, ...)`, `purrr::list_rbind()`, or something similar.
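For example, once every element of the list has the same 20 columns, any of these calls will stack them into one data frame:

```r
combined <- dplyr::bind_rows(measurement_list)
# or, in base R:
combined <- do.call(rbind, measurement_list)
# or with purrr (>= 1.0.0):
combined <- purrr::list_rbind(measurement_list)
```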
If instead you want separate lists depending on the number of columns in each input file, I think the easiest approach is to first read all the files into a single list (whose elements have varying numbers of columns), then use `split()` to divide it into one sub-list per column count, and finally `bind_rows()` each of those sub-lists. Something like this (untested code):
```r
# Read every file into one list of data frames
measurement_list <- lapply(file_list, read_dat)
# Count the columns of each element
nb_cols <- sapply(measurement_list, ncol)
# One sub-list per distinct column count
list_of_list_of_dfs <- split(measurement_list, nb_cols)
# Row-bind within each sub-list
list_dfs_by_ncol <- lapply(list_of_list_of_dfs, dplyr::bind_rows)
```
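Note that `split()` names the resulting list by the column counts, so `list_dfs_by_ncol[["20"]]` would be the single data frame combining all the 20-column files.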
If I'm still misunderstanding, maybe it would help to re-explain what the input files look like and what the desired output is.