Error when loading .xlsx files in a for loop

Carlos_Cajas · May 27, 2021, 10:12pm

Hi everybody, I have almost 500 .xlsx files that I want to load to R but I need to make a lot of changes to them once they´re on R because there are a lot of rows and columns that I don´t need. At the end they must be unite with rbind() to get one data.frame.

One of these changes is to create a column with the date of the file. The thing is that the date is located in a certain cell datos[2,3] so I take it into an object named fecha.dia and transform it to get a date object. But it´s an object with just one element. What I do is create a new variable in the data.frame call Fecha and I put into it the object fecha.dia so it fills the whole column with the date.

Other thing is that the .xlsx files don´t have the same number of columns, that´s why I turn it into a long format and then do the rbind().

When I run the code without the loop, just for one file, it worked.

This is a sample of one file, it' really a small sample because I can't copy all the columns, I think it's to big.

datos <- tibble::tribble(
           ~DESPACHO.ECONOMICO.DIARIO.PROGRAMADO, ~`HORA.PUBLICACION.WEB:`, ~`14:25`,
                                              NA,                       NA,       NA,
                                              NA,                       NA,       "miercoles, 1 de enero de 2020",
                                         "CCSIH",                  "PUCAH",  "NORMH",
                                              NA,                       NA,       NA,
                                              NA,                       NA,       NA,
                                         "991.9",                   "20.0",   "49.6",
                                         "926.7",                   "20.0",   "49.6",
                                         "749.5",                   "20.0",   "49.6",
                                         "530.9",                   "20.0",   "49.6",
                                         "464.6",                   "20.0",   "49.6",
                                         "356.6",                   "20.0",   "49.6",
                                         "200.0",                   "20.0",   "49.6",
                                         "200.0",                   "20.0",   "49.6",
                                         "205.1",                   "20.0",   "49.6",
                                         "429.1",                   "20.0",   "49.6",
                                         "705.7",                   "20.0",   "49.6",
                                         "908.6",                   "20.0",   "49.6",
                                         "917.1",                   "20.0",   "49.6"
           )




head(datos)
#> # A tibble: 6 x 3
#>   DESPACHO.ECONOMICO.DIARIO.PROGR~ `HORA.PUBLICACION.WE~ `14:25`                
#>   <chr>                            <chr>                 <chr>                  
#> 1 <NA>                             <NA>                  <NA>                   
#> 2 <NA>                             <NA>                  miercoles, 1 de enero ~
#> 3 CCSIH                            PUCAH                 NORMH                  
#> 4 <NA>                             <NA>                  <NA>                   
#> 5 <NA>                             <NA>                  <NA>                   
#> 6 991.9                            20.0                  49.6

Here is my complete code with the loop.

# Cargar Librerias----
library(fs)
library(tidyverse)
library(readxl)
library(readr)

# OBJETOS----
path <- "Datos"
datos <- list()
fecha.dia <- list()
encabezados <- list()
i <- 1

# FOR LOOP----
for (file in list.files(path, pattern = "*.xlsx")) {
  datos[[i]] <- read_excel(paste(path, file, sep = "/"), 
                           sheet = "DESPACHO", 
                           n_max = 29)
  ## 1. Eliminar columnas vacías----
    datos[[i]] <- datos[[i]][1:(length(datos[[i]])-6)]
  
  ## 2. Tomar fecha de celda----
    fecha.dia[[i]] <- datos[[i]][2,3]
    fecha.dia[[i]] <- regmatches(fecha.dia[[i]], regexpr(pattern = "/d.*.*/d$", fecha.dia[[i]]))
    fecha.dia[[i]] <- gsub(" de ", " ", fecha.dia[[i]])
    fecha.dia[[i]] <- readr::parse_date(fecha.dia[[i]], format = "%d %B %Y", locale = locale("es"))
    datos[[i]]$Fecha <- fecha.dia[[i]]
  
  ## 3. Manejo de encabezados----
    encabezados[[i]] <- datos[[i]][3, ]
    encabezados[[i]][1,1] <- "HORA"
    encabezados[[i]] <- encabezados[[i]] %>%
      pivot_longer(1:length(datos[[i]]), "colum", values_to =  "unidad")
    encabezados[[i]] <- na.omit(encabezados[[i]]["unidad"])
    encabezados[[i]] <- encabezados[[i]][["unidad"]]

    colnames(datos[[i]]) <- encabezados[[i]]
    
  ## 4. Manipular y preparar tabla----
    datos[[i]] <- datos[[i]][-c(1:5), ]
    datos[[i]]$DEMANDA <- NULL
    datos[[i]]$PRECIO <- NULL
    datos[[i]]$RSF <- NULL
    datos[[i]]$ECUACOLO230 <- NULL
    datos[[i]]$ECUACOLO138 <- NULL
    datos[[i]]$ECUAPERU230 <- NULL
    datos[[i]]$HORA <- seq(0,23,1)
    datos[[i]] <- select(datos[[i]], Fecha, everything()) #ordenar tabla
    datos[[i]] <- pivot_longer(datos[[i]], 3:length(datos[[i]]), "unidad", values_to = "potencia")
    datos[[i]]$potencia <- round(as.numeric(datos[[i]]$potencia),2)
    datos[[i]] <- rbind(datos[[i]])
  i <- i + 1
}

When I run it I get this error:

Error: Assigned data fecha.dia[[i]] must be compatible with existing data.
x Existing data has 29 rows.
x Assigned data has 0 rows.
i Only vectors of size 1 are recycled.
Run rlang::last_error() to see where the error occurred.

I don´t know how to deal with that object. It's my first loop so I hope you can help me.

Please any advice.

FJCC · May 28, 2021, 2:49am

I am not sure what you are trying to do with the code above. The pattern does not make sense to me, though I am not a regex expert, so it could be ignorance on my part. If I try to run the date from your example data through that line, I get a zero length vector.

Text <- "miercoles, 1 de enero de 2020"
regmatches(Text, regexpr(pattern = "/d.*.*/d$", Text))
character(0)

I think that would explain the error you are getting that says "Assigned data has 0 rows". Can you test whether that line of code works for you using miercoles, 1 de enero de 2020?

Do you perhaps mean something like this?

regmatches(Text, regexpr(pattern = "\\d{1,2}.*\\d$", Text))
[1] "1 de enero de 2020"

Carlos_Cajas · May 28, 2021, 3:47am

@FJCC
Thanks a lot!!! That was all the problem.

I don't know why this pattern = "/d.*.*/d$" worked when I tried.

Now I have another problem but it's because my xlsx files have different number of columns.

datos[[i]] <- datos[[i]][1:(length(datos[[i]])-6)]

I use this line to delete the last 6 columns because they all have variables names like NA so I don't need them. But I just found a file that has 7 columns from the last variable.

This is an example, from the column BT to BZ there are 6 columns I want to delete. In this case my data ends in column BT.

But I found this file that has 7 columns from BR to BY as you can see in the image. In this case my data ends in column BR.

Any idea how can I delete those column if I don't know how many are there at the end?

Thanks a lot again!!

andresrcs · May 28, 2021, 4:13am

Any chance you could share a link to a few representative xlsx files? If needed you could sanitize them first (replace any sensible information).

I think it is very likely that there is a better approach than using a for loop, but we need sample data to try.

Carlos_Cajas · May 28, 2021, 4:36am

@andresrcs

Here is a link with 2 files.

https://mega.nz/folder/TjwQEIBK#jIcZZ22GjZnFxRQhSOdJ4w

Both files have different number of columns.

I hope this help.

Thank you.

DavoWW · May 28, 2021, 5:59am

Hi @Carlos_Cajas,
Those Excel files are quite "messy" especially given the merged cells in the header rows, but the openxlsx package can do a reasonable job:

setwd("~/DATA")
lst_files <- list.files(pattern="\\.xlsx$")

library(openxlsx)

# Header rows have merged cells = problems!
#----- Package {openxlsx} has a trick when reading merged cells -----
dat <- read.xlsx(xlsxFile=lst_files[1],
                  startRow=9,
                  fillMergedCells=TRUE, 
                  colNames=TRUE)

The resultant data frame is still pretty messy but it should be possible to trim the unrequired columns and rows using non-NA values as the key. Then put the working code in a user-defined function, and run it for all your Excel files using lapply (or a dplyr/purrr equivalent).

Hope this helps.

DavoWW · May 28, 2021, 6:40am

Couldn't stop myself.....

library(openxlsx)

input_files <- function(path) {
  require(stringr)
  dat <- read.xlsx(xlsxFile=path,
                  startRow=9,
                  fillMergedCells=TRUE, 
                  colNames=TRUE)
  cols <- names(dat)
  cutoff <- min(which(str_detect(cols, "^X[0-9][0-9]")))
  col_end <- length(cols)

  dat <- dat[-c(1:2), -c(cutoff:col_end)]
  names(dat)[1] <- "HORA"
  return(dat)
}

paths <- list.files(path="~/DATA/", 
                    pattern="\\.xlsx$",
                    full.names=TRUE)

# Testing with one file (not run here)
# input_files(paths[1])

out.lst <- lapply(paths, input_files)
#> Loading required package: stringr

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# All columns are preserved whether or not their names match
full.df <- bind_rows(out.lst, .id="ID")
head(full.df)
#>   ID    HORA              CCSIH              PAUTH MAZAH SOPLH
#> 1  1   0-  1             991.94 571.66156000000001     0   212
#> 2  1   1-  2 926.71991000000003 560.37750000000005     0   212
#> 3  1   2-  3 749.50549000000001          549.39337     0   212
#> 4  1   3-  4 530.94244000000003 534.26153999999997     0   212
#> 5  1   4-  5 464.62414999999999 548.28332999999998     0   212
#> 6  1   5-  6 356.63895000000002 538.22942999999998     0   212
#>                MSFOH MLANH AGOYH SFRAH DELSH PUCAH MANDH BABAH NORMH SBARH
#> 1 142.14090999999999   100    90   110    45    20    60     0 49.56  49.9
#> 2          101.85377   100    90   110    45    20    60     0 49.56  49.9
#> 3 94.642750000000007   100    90   110    45    20    60     0 49.56  49.9
#> 4 88.723550000000003   100    90   110    45    20    60     0 49.56  49.9
#> 5 84.830699999999993   100    90   110    45    20    60     0 49.56  49.9
#> 6 74.675870000000003   100    90   110    45    20    60     0 49.56  49.9
#>   TOPOH SIGCH DUEGH AZANH CUMBH0A NAYOH0A GUANH0A HVICH PASOH0A CHILH0A OCAÑH
#> 1    26     9  49.7     3      18    11.7    10.5     7       3     1.8    26
#> 2    26     9  49.7     3      18    11.7    10.5     7       3     1.8    26
#> 3    26     9  49.7     3      18    11.7    10.5     7       3     1.8    26
#> 4    26     9  49.7     3      18    11.7    10.5     7       3     1.8    26
#> 5    26     9  49.7     3      18    11.7    10.5     7       3     1.8    26
#> 6    26     9  49.7     3      18    11.7    10.5     7       3     1.8    26
#>   PUS1H PUS2H SAYMH0A SAY5H0A ALAOH0A RBLAH0A AMBIH0A LPLAH0A
#> 1    22  16.2       3    7.52       8       1       6     0.8
#> 2    22  16.2       3    7.52       8       1       6     0.8
#> 3    22  16.2       3    7.52       8       1       6     0.8
#> 4    22  16.2       3    7.52       8       1       6     0.8
#> 5    22  16.2       3    7.52       8       1       6     0.8
#> 6    22  16.2       3    7.52       8       1       6     0.8
#>                RVERH ILL1H0A ILL2H0A CARMH0A LOREH0A PAPAH0A HABAH SJTAH SIBIH
#> 1 9.9998000000000005     2.1     1.8     7.4     1.5     2.4  30.8     3    10
#> 2 9.9998000000000005     2.1     1.8     7.4     1.5     2.4  30.8     3    10
#> 3 9.9998000000000005     2.1     1.8     7.4     1.5     2.4  30.8     3    10
#> 4 9.9998000000000005     2.1     1.8     7.4     1.5     2.4  30.8     3    10
#> 5 9.9998000000000005     2.1     1.8     7.4     1.5     2.4  30.8     3    10
#> 6 9.9998000000000005     2.1     1.8     7.4     1.5     2.4  30.8     3    10
#>   CALOH PENIH0A CMORH0A              PALMH CALEH0A          VILLRAG01 SALFRFV01
#> 1    11     1.5     2.4 9.8000000000000007     0.8 2.2999999999999998         0
#> 2    11     1.5     2.4 9.8000000000000007     0.8                2.1         0
#> 3    11     1.5     2.4 9.8000000000000007     0.8                1.8         0
#> 4    11     1.5     2.4 9.8000000000000007     0.8                  2         0
#> 5    11     1.5     2.4 9.8000000000000007     0.8                1.8         0
#> 6    11     1.5     2.4 9.8000000000000007     0.8                2.6         0
#>   GINGTMC01 PCAYTMC01 SASCTNC01          TGMATTG02 TGM2TTG05 TGM2TTG01
#> 1         5      0.85        36 49.389000000000003        13        13
#> 2         5      0.85        36 49.389000000000003        13        13
#> 3         5      0.85        36 49.389000000000003        13        13
#> 4         5      0.85        36 49.389000000000003        13        13
#> 5         5      0.85        36 49.389000000000003        13        13
#> 6         5      0.85        36 49.389000000000003        13        13
#>   TGM2TTG02 GZEVTVA03 GZEVTVA02 ECOETNC01 DAYUTMC02            DEMANDA PRECIO
#> 1        13        26        26      11.5         0 2718.4612700000012    0.2
#> 2        13        26        26      11.5         0 2630.4699800000008    0.2
#> 3        13        26        26      11.5         0 2519.7604100000008    0.2
#> 4        13        26        26      11.5         0 2426.3463300000008    0.2
#> 5        13        26        26      11.5         0 2369.9569800000008    0.2
#> 6        13        26        26      11.5         0 2242.5630500000002    0.2
#>     RSF ECUACOLO230 ECUACOLO138 ECUAPERU230 PERLH SAUCH0A SJMCH TGMATTG01
#> 1 187.5         260        <NA>        <NA>  <NA>    <NA>  <NA>      <NA>
#> 2 187.5         231        <NA>        <NA>  <NA>    <NA>  <NA>      <NA>
#> 3 187.5         146        <NA>        <NA>  <NA>    <NA>  <NA>      <NA>
#> 4 187.5           0        <NA>        <NA>  <NA>    <NA>  <NA>      <NA>
#> 5 187.5           0        <NA>        <NA>  <NA>    <NA>  <NA>      <NA>
#> 6 187.5           0        <NA>        <NA>  <NA>    <NA>  <NA>      <NA>
#>   TGM2TTG04 SEL3TMC02 TRINTVA01 JARATMC01 MAN2TMC01 ESM2TMC02
#> 1      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
#> 2      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
#> 3      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
#> 4      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
#> 5      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>
#> 6      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>

^{Created on 2021-05-28 by the reprex package (v2.0.0)}

Carlos_Cajas · May 28, 2021, 4:26pm

Hi @DavoWW ,

Thanks a lot!!!

Your code is easier than a for loop.

I run it and it was faster than the loop.

Now I have all my data (around 500 xlsx files) in one df.

Thanks again.

system · June 4, 2021, 4:27pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.