Loading and sorting several ugly XLS files

I am trying to write R code that loads about 2000 xls files from a folder and combines them into one large table.
The problem I am facing is the way the files are built when we download them from the software.
I have already managed to import the files, but the table only shows me the names of the files, the type of file, etc., not their contents. And since I am very much a beginner at programming, I don't know where to start fixing things.


This is how the tables look in Excel.

What functions do I need to use, and is there a good tutorial/article on how to load all the files in the folder into one working file?

This is how I want them rearranged. I couldn't attach it directly.

Hi, welcome to the forum.
The best idea would be to upload that Excel file to a file sharing service such as Dropbox or Mediafire and post a link to it here. Or, perhaps even better, zip 3 or 4 files together and upload the zip file.

It is weird that you are getting "the names of the files, type of file, etc." but nothing else.
Could you post the code that you are using here? Copy the code and paste it here between
```

```

And remember to say a pious little curse every night on spreadsheet designers who produce such monstrosities.


@jrkrideau is right that uploading some files would be very helpful for providing specific guidance. For a general answer, I would read the data in two steps: read rows 1:11 and columns 1:2 in one step, and rows 14:15 and columns 1:20 in a second step. The read.xlsx() function in the openxlsx package allows you to pick the rows and columns to read, and other packages probably have similar options. You can then use the pivot_wider() function from the tidyr package to pivot the first table into a single row of 11 columns. Finally, you can column-bind the pivoted table and the second table into a single one-row table with all the information.
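To make the idea concrete, here is a minimal sketch of the pivot-and-bind step with made-up in-memory data standing in for the two blocks of a spreadsheet (the key names and values are invented for illustration):

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the key/value block in rows 1:11, columns 1:2
pd <- tibble(
  key   = c("ID", "Name", "Date"),
  value = c("0005", "Example", "2024-05-30")
)

# Pivot the two-column key/value block into a single row
pd_wide <- pivot_wider(pd, names_from = key, values_from = value)

# Hypothetical stand-in for the measurement block in rows 14:15
measurements <- tibble(Sys = 120, Dia = 80)

# Column-bind the one-row identifier table to the measurement row
full <- cbind(pd_wide, measurements)
full
```

With real files, the two tibbles would instead come from the two partial reads described above.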


Thank you very much for the quick responses.
I don't know why, but the R script I worked on until yesterday can't be opened anymore.
And even though I think I used the same code as before, the data doesn't open at all anymore. It only shows me "# A tibble: 0 × 0" at the end of the console.

I can't upload the original data sets, since they contain information on the patients and students we tested, but I made three example files and a final-form example and uploaded them to this folder: Dropbox

#load library----
library(tidyverse)
library(ggpubr)
library(readxl)



#load files Sandesneben ----

folder_path <- "C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Messung Sandesneben"
file_list <- list.files(path = folder_path, pattern = ".xls", full.names = TRUE)
data_list <- lapply(file_list, read_excel) 
combined_data <- bind_rows(data_list)  
print(combined_data)

View(combined_data)

I thought I had found my mistake: I had forgotten to move the data into my RStudio project folder. But moving it to the specific folder did not help, it still cannot open the file. Could it be because it is an xls file and not a more modern xlsx file?

Error: 
  filepath: C:\Users\paula\OneDrive\Desktop\Mobil-O-Graph\data\Schulmessungen\Messung Sandesneben\2024-05-30_09-56_22_ID-2024-0000000005_PWA.xls
  libxls error: Unable to open file

Or is there an easier way to do it in Excel and then transfer it to RStudio as a complete table?

How do I select just specific rows and move them? I have only ever used pivot_longer and pivot_wider to reshape whole tables, never on specific parts, and I don't remember this part from the online learning courses I did.

Thank you for the data.

I think one of the things that was giving you a problem is that

data_list <- lapply(file_list, read_excel) 

returns a list, and bind_rows() is for data.frames, tibbles, and data.tables.

I *think* something like

dat1 <- do.call(rbind.data.frame, data_list)

would work, but you would still be getting gibberish.

The code below is not a solution to your problem but it is a simple example of how you can deal with a single file.

library(tidyverse)
library(readxl)

dat1 <- read_excel("Example1.xls", skip = 13) ## read the data on lines 14 & 15

pd1 <- read_excel("Example1.xls", n_max = 11, col_names = c("xx1", "xx2")) ## Read identifier data

pdt1 <- as_tibble(t(pd1[,2])) ## Turn identifier data into one row tibble
names(pdt1) <- pull(pd1, xx1)  ##  Assign column names

full_data <- cbind(pdt1, dat1)

full_data

Thank you very much. At first it didn't work due to the old xls format, but after I converted a few files to csv and adapted the code, it gave me a complete table.

With the CSV files the first part also works, but I now have to figure out how to connect the two parts to get one complete table.

My code so far, which I will hopefully improve tomorrow:

#load library----
library(tidyverse)
library(ggpubr)
library(readxl)


#load files ----

folder_path <- "C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Test"
file_list <- list.files(path = folder_path, pattern = ".csv", full.names = TRUE)
data_list <- lapply(file_list, read_csv)
combined_data <- bind_rows(data_list)  
print(combined_data)


dat2 <- read_csv("C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Test/2024-06-14_10-19_00_ID-2024-0000000065_PWA.csv", skip = 13) 
pd2 <- read_csv("C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Test/2024-06-14_10-19_00_ID-2024-0000000065_PWA.csv", n_max = 11, col_names = c("xx3", "xx4"))
pdt2 <- as_tibble(t(pd2[,2])) ## Turn identifier data into one row tibble
names(pdt2) <- pull(pd2, xx3)  ##  Assign column names (xx3 holds the labels, xx4 the values)

full_data2 <- cbind(pdt2, dat2)

full_data2

Ah yes the horrors of old formats. A former colleague of mine had all his dissertation data on 11 inch floppy disks.

With the CSV files the first part also works, but I now have to figure out how to connect the two parts to get one complete table.

Can you elaborate? I don't understand. I would have thought that

full_data <- cbind(pdt2, dat2)

in my example would have worked but I do not trust that

bind_rows(data_list) 

I still think that you are getting data_list back as a list and not a tibble or data.frame.
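For what it's worth, a quick check with made-up tibbles suggests that bind_rows() does accept a list of data frames directly and gives the same result as the do.call(rbind, ...) idiom, so the original error probably came from the reads themselves rather than the binding step:

```r
library(dplyr)
library(tibble)

# Two small made-up tibbles standing in for per-file reads
a <- tibble(ID = 1, x = 10)
b <- tibble(ID = 2, x = 20)
data_list <- list(a, b)

# bind_rows() accepts a list of data frames/tibbles directly ...
combined1 <- bind_rows(data_list)

# ... and row-binds them just like the base-R idiom
combined2 <- do.call(rbind.data.frame, data_list)
```

If the per-file reads return empty tibbles (as with the libxls "Unable to open file" error above), either approach will produce an empty result.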

I am glad to have helped.

folder_path <- "C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Test"
file_list <- list.files(path = folder_path, pattern = ".csv", full.names = TRUE)
data_list <- lapply(file_list, read_csv)
combined_data <- bind_rows(data_list)  
print(combined_data)

This part does give me an overview list of all the files and one large table of all the files, but not in the right format: I have the 21 files of the old format stacked one after another.

full_data2 <- cbind(pdt2, dat2)

It does work. I copied yours and duplicated it to test it with two different files and how to combine them, so I have one as final_data1 and one as final_data2, with all the data names from before adapted. But the final_data1 copy's file name contains sensitive data, and with final_data I tried to combine final_data1 and 2.

But I am too tired now and will look into it after work tomorrow.

This is not complete, but I wonder if it might help? You will need to adjust your file path.

suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
library(readxl)

# Personal Data Function --------------------------------------------------

read_pd <- function(mylist){
  pd1 <- read_excel(mylist, n_max = 11, col_names = c("xx1", "xx2")) ## Read identifier data
  pdt1 <- as_tibble(t(pd1[,2])) ## Turn identifier data into one row tibble
}


# Read personal data ------------------------------------------------------
file_list <- list.files(path = "Raw_data", pattern = ".xls", full.names = TRUE)
pd_list <- lapply(file_list, read_pd)

persons <- do.call(rbind.data.frame, pd_list)

That helped immensely.
My code is now:

library(data.table)
library(tidyverse)
library(ggpubr)
library(readxl)

#load files ----

folder_path <- "C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Test"
file_list <- list.files(path = folder_path, pattern = ".csv", full.names = TRUE)
data_list <- lapply(file_list, read_csv)

# Personal Data Function --------------------------------------------------

read_pd <- function(file_list){
  pd1 <- read_csv(file_list, n_max = 11, col_names = c("xx1", "xx2"))
  pdt1 <- as_tibble(t(pd1[,2]))
}

read_dat <- function(file_list){
  dat1 <- read_csv(file_list, skip = 13)
}

# Read personal data ------------------------------------------------------
file_list <- list.files(path = "C:/Users/paula/OneDrive/Desktop/Mobil-O-Graph/data/Schulmessungen/Test", pattern = ".csv", full.names = TRUE)
pd_list <- lapply(file_list, read_pd)
measurement_list <- lapply(file_list, read_dat)

persons <- do.call(rbind.data.frame, pd_list)
measurements <- do.call(rbind.data.frame, measurement_list)
mobil_o_graph <- cbind(persons, measurements)  ## hyphens are not valid in R object names

The problem I am facing now is that some measurements have 29 columns while most have 20 columns, and they are arranged differently, so I can't just join them together.

After that, I think I only have to work on combining the correct measurements with the correct person file.

This is nasty.

Is there any way to easily distinguish which are which? If so, it should be fairly easy to read them into separate tibbles, rearrange the column order of the "29-column" files to conform to the 20-column ones, and then "rbind" them together.

It might make sense to have two separate folders.

You will just end up with those 9 extra columns having a lot of NAs, but that is not a problem.

I will try that tomorrow. The problem with splitting them into 20- and 29-column sets is merging them back together and bringing them into the right order. The IDs are only in the person part, and we need the IDs to connect the measurements to the other tests we performed.

I think we are visualizing this in different ways.

As I visualize it, you simply split all the .xls (or .csv) files into two groups, say two folders: one with all the 20-column files and one with all the 29-column files. We process them independently, then sort the 29-column output into the same column order as the 20-column output.

Then rbind the two data.frames together. This should not affect the ID variable at all; we never touch the 9 extra columns independently.

Am I misunderstanding the data structures?

That might work. I will keep that as a last resort, because I would still have to sort 2000 files by hand.

Actually, I will have to think about it a bit more, but unless I am seriously misunderstanding the data structure, it is a "simple" approach.

If you are sorting 2000 files by hand(!!!) just toss the 20-column ones into one folder and the 29-column ones into another folder.

Conceptually, you are doing the same thing to each file. At the end you have two data.frames or tibbles; you rearrange one to get the same column order as the other, with the nine extra columns stuck on the end, and do an rbind.

First set of files
ID, AA, BB, CC

Second set of files

XX, ID, AA, ZZ, BB, KK, CC

Reorder Second set of files
  
ID, AA, BB, CC,  KK, XX, ZZ

You can then do an rbind. The files from the first set will have NAs in the KK, XX, ZZ columns, but that is no problem.
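In fact, dplyr's bind_rows() does the reordering and NA-filling automatically, because it matches columns by name. A tiny sketch with the made-up column names from the schematic above:

```r
library(dplyr)
library(tibble)

# Made-up stand-ins: one "first set" row and one wider "second set" row
set1 <- tibble(ID = "A", AA = 1, BB = 2, CC = 3)
set2 <- tibble(XX = 9, ID = "B", AA = 4, ZZ = 8, BB = 5, KK = 7, CC = 6)

# bind_rows() matches columns by name, so the differing order does not
# matter, and the columns missing from set1 are filled with NA
combined <- bind_rows(set1, set2)
combined
```

So with two tibbles (one per folder), a single bind_rows() call replaces the manual reorder-then-rbind step.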

Cancel most of that last message; I misread it and thought you were sorting by hand.

It actually wasn't that hard to separate the 29-column files by hand: they had a slight difference in their names. With the 29-column files everything works after I used a Python program to convert them to csv. Only now, going through all of the 20-column files, I found several 28-column files that have no difference in their names, and there are a lot more of them.

Can I sort the files by number of columns?

I don't know. I've been trying to think of an obvious way and I cannot see one.

I'd suggest starting a new thread on exactly this problem. I think it should be possible, but I don't have enough knowledge of the .xls file structure to know if it can be done there.

It may be possible to load all the files into one R data.frame or tibble with the varying numbers of columns, then filter the three different column-length sets into different tibbles, clean them up, and rebuild the original tibble, but I have no idea how at the moment.
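One possible way to group the converted csv files by column count without sorting by hand is to read only the header row of each file and count the fields. This is a sketch using throw-away files created in a temporary folder; with real data, `files` would come from list.files() as in the earlier code:

```r
# Create three throw-away CSVs with different column counts for illustration
dir <- tempdir()
f1 <- file.path(dir, "a.csv"); write.csv(data.frame(ID = 1, x = 2),        f1, row.names = FALSE)
f2 <- file.path(dir, "b.csv"); write.csv(data.frame(ID = 1, x = 2, y = 3), f2, row.names = FALSE)
f3 <- file.path(dir, "c.csv"); write.csv(data.frame(ID = 2, x = 5),        f3, row.names = FALSE)

files <- c(f1, f2, f3)

# Count the columns of each file by reading only its first row
n_cols <- vapply(files, function(f) ncol(read.csv(f, nrows = 1)), integer(1))

# Split the file list into one group per column count
groups <- split(files, n_cols)
names(groups)  # "2" "3"
```

Each element of `groups` can then be fed to the existing lapply() pipeline separately, so the 20-, 28-, and 29-column sets never have to be sorted manually.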

I was just thinking of creating a couple of example files with 29 columns and seeing if I could come up with anything.

I'd really suggest a new post that some real munging experts might find a challenge. This thread is old enough that a lot of people will not be following it.

Sorry not to be of more help.