POSIX timestamp not recognized after 109000 data points

Tom1961 · August 5, 2025, 10:52am

I have R code that analyses solar panel data. Today when I added a new day's data I received the following message:

Error in as.POSIXlt.character(x, tz = tz(x)) : 
  character string is not in a standard unambiguous format

The data are in the correct timestamp format; the code has run perfectly for the last 6 months.
The line of code that fails is the call to year(timestamp) in the second row below. This code runs perfectly until the imported csv file (and subsequent data frame) have 109000 rows!

processed_Measures<-raw_Measures%>%
   mutate(SID=paste0(as.character(year(timestamp)),"-",sprintf("%03d",yday(timestamp))),theTime=as_hms(timestamp),
   theDate=date(timestamp),
   theHour=sprintf("%02d",hour(timestamp)),
   DoW=wday(timestamp, label = TRUE),
   theWeek=sprintf("%02d",week(timestamp)),
   yearWeek=paste0(as.character(year(timestamp)),"-",sprintf("%02d",week(timestamp))),
   theMonth=sprintf("%02d",month(timestamp)),
   MoY=month(timestamp, label = TRUE),
   theYear=year(timestamp),
   yearMonth=paste0(as.character(year(timestamp)),"-",sprintf("%02d",month(timestamp))),
   thePower=4*WattHour
   )%>%relocate(SID)

I have found that, by eliminating rows one at a time in the import csv file, the code runs correctly with 109000 rows but always fails on adding one further row.

2025-07-28 18:15:00;;Smart Meter;Production;85;Wh
2025-07-28 18:30:00;;Smart Meter;Production;61;Wh
2025-07-28 18:45:00;;Smart Meter;Production;53;Wh

The bottom row above is row 109001 in the import csv file and data frame and it is this row that returns the error.

Beyond the addition of a new row nothing else has changed.

The data in the csv file are imported (with read_csv2) into a data frame, raw_Measures, that I then process further into a data frame that I have named processed_Measures.

I would appreciate any advice you might provide.

Tom

FJCC · August 6, 2025, 1:56am

Please post the output of

dput(raw_Measures[108995:109005, ])

That will make code that replicates 11 rows of data around the point where you have trouble.
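
For readers unfamiliar with dput(), here is a minimal sketch of what it does, using a made-up two-row data frame (the values are illustrative only). dput() prints R code that rebuilds the object, including column classes, which is why it is useful for diagnosing type problems:

```r
# Illustrative data frame; not taken from the real file.
df <- data.frame(
  timestamp = c("2025-07-28 18:30:00", "2025-07-28 18:45:00"),
  WattHour  = c(61, 53),
  stringsAsFactors = FALSE
)

# dput() prints a structure(...) call that recreates the object.
dput(df)

# Round trip: capture the printed code and evaluate it again.
code    <- paste(capture.output(dput(df)), collapse = "")
rebuilt <- eval(parse(text = code))

# The rebuilt object has the same columns and the same column classes.
print(sapply(rebuilt, class))
```

Because the printed code records whether timestamp is a character vector or a POSIXct vector, it shows the column type as well as the values.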

Tom1961 · August 6, 2025, 6:49am

Thank you for your response. The requested output is shown below

structure(list(timestamp = c("2025-07-28 17:30:00", "2025-07-28 17:45:00", 
"2025-07-28 18:00:00", "2025-07-28 18:15:00", "2025-07-28 18:30:00", 
"2025-07-28 18:45:00", "2025-07-28 19:00:00", "2025-07-28 19:15:00", 
"2025-07-28 19:30:00", "2025-07-28 19:45:00", "2025-07-28 20:00:00"
), WattHour = c(176, 139, 103, 85, 61, 53, 47, 44, 35, 29, 24
)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", "data.frame"
), spec = structure(list(cols = list(timestamp = structure(list(), class = c("collector_character", 
"collector")), meter.identification = structure(list(), class = c("collector_skip", 
"collector")), meter.name = structure(list(), class = c("collector_skip", 
"collector")), meter.type = structure(list(), class = c("collector_skip", 
"collector")), value = structure(list(), class = c("collector_double", 
"collector")), meter.unit = structure(list(), class = c("collector_skip", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ";"), class = "col_spec"))

The raw_Measures data frame displays data correctly but the code stops when the next step (which creates processed_Measures) executes. I changed the name of the field "value" to "WattHour".

This has worked correctly for the last six months.

Tom

Tom1961 · August 6, 2025, 10:23am

Here's a listing of all the code up to the failure point. I have added dputs at different points in the code. I hope that helps. This run is with a csv file with a number of lines that executes correctly.

#start==== 
library(tidyverse)
library(readxl)
library(readr)
library(lubridate)
library(hms)
library(nortest)
library("viridis")

setwd("C:\\Users\\tom_s\\OneDrive\\Documents\\Solar Energy Project")
getwd()
source("Tukey_Plot_Function.R")
source("Useful Functions.R")
# DATA IMPORT====
#Set up year for analysis
# Choose a year in the range 2020-2025 or 1 for analysis of all years data
analyse_year<-2025

# Import Raw Data====
# Import the raw data from downloaded Brusol data. Select only columns 1 and 5.
# The function read_csv2 uses semi-colons as the separator and stores the data in a variable called raw_Measures

raw_Measures <- read_csv2("measures_solar_all.csv",col_select = c(1,5),show_col_types = FALSE)

# We can test to see if there are any errors (NA) in the imported data. We find these using the following instruction
which(is.na(raw_Measures$timestamp), arr.ind=TRUE)
# In this case there are two at 5996 and 13980
# We can check the source of the NAs (they are on month boundaries and are, in fact, user error!
# They are both copy and paste errors made while aggregating monthly data)
raw_Measures[c(5995:5997,13979:13981),]
dput(raw_Measures[108995:109005, ])
# Rather than read in data each time from a csv, we can create an rds object and then read that in for the analysis.
write_rds(raw_Measures,"all_raw_measures.rds")
# In future we use the data in the rds object
rm(raw_Measures)

# We can read in the rds object as follows. using the same variable (dataframe)
# name means we can use all the following code without modification
# First, I rename value to WattHour, a useful name in the context
# Omit the "known" errors used to illustrate how to find nas in the input file
raw_Measures <- read_rds("all_raw_measures.rds")%>% rename(WattHour = value)%>%na.omit()
summary(raw_Measures$WattHour)
energy_raw_1=sum(raw_Measures$WattHour)
dput(raw_Measures[108995:109005, ])
# We need now to remove the zeros that occur before sunrise and after sunset
# (while accounting for occlusion in the built environment!)

dput(raw_Measures[108995:109005, ])
# We are now able to extract information for use in the grouping and plotting of data for the analysis and charts.
# In this section I use a number of functions from the lubridate library.
# The plots are prepared using the ggplot library in tidyverse.
# The variable called processed_measures will hold the modified data for subsequent editing.

#Processed Data====
# Add a sunshine ID (SID) to allow easy joining of datasets or filtering by joins
#Extract variety of data from the timestamp using lubridate and pass the data to a new variable called processed_Measures
processed_Measures<-raw_Measures%>%
   mutate(SID=paste0(as.character(year(timestamp)),"-",sprintf("%03d",yday(timestamp))),theTime=as_hms(timestamp),
   theDate=date(timestamp),
   theHour=sprintf("%02d",hour(timestamp)),
   DoW=wday(timestamp, label = TRUE),
   theWeek=sprintf("%02d",week(timestamp)),
   yearWeek=paste0(as.character(year(timestamp)),"-",sprintf("%02d",week(timestamp))),
   theMonth=sprintf("%02d",month(timestamp)),
   MoY=month(timestamp, label = TRUE),
   theYear=year(timestamp),
   yearMonth=paste0(as.character(year(timestamp)),"-",sprintf("%02d",month(timestamp))),
   thePower=4*WattHour
   )%>%relocate(SID)

The output screen looks like this:

#start==== 
> library(tidyverse)
> library(readxl)
> library(readr)
> library(lubridate)
> library(hms)
> library(nortest)
> library("viridis")
> 
> setwd("C:\\Users\\tom_s\\OneDrive\\Documents\\Solar Energy Project")
> getwd()
[1] "C:/Users/tom_s/OneDrive/Documents/Solar Energy Project"
> source("Tukey_Plot_Function.R")
> source("Useful Functions.R")
> # DATA IMPORT====
> #Set up year for analysis
> # Choose a year in the range 2020-2025 or 1 for analysis of all years data
> analyse_year<-2025
> 
> # Import Raw Data====
> # Import the raw data from downloaded Brusol data. Select only columns 1 and 5.
> # The function read_csv2 uses semi-colons as the separator and stores the data in a variable called raw_Measures
> 
> raw_Measures <- read_csv2("measures_solar_all.csv",col_select = c(1,5),show_col_types = FALSE)
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> 
> # We can test to see if there are any errors (NA) in the imported data. We find these using the following instruction
> which(is.na(raw_Measures$timestamp), arr.ind=TRUE)
[1]   5996  13980 106620
> # In this case there are two at 5996 and 13980
> # We can check the source of the NAs (they are on month boundaries and are, in fact, user error!
> # They are both copy and paste errors made while aggregating monthly data)
> raw_Measures[c(5995:5997,13979:13981),]
# A tibble: 6 × 2
  timestamp           value
  <dttm>              <dbl>
1 2021-02-28 19:45:00     0
2 NA                      0
3 2021-03-01 06:30:00     0
4 2021-06-30 22:30:00     0
5 NA                      0
6 2021-07-01 05:45:00     0
> dput(raw_Measures[108995:109005, ])
structure(list(timestamp = structure(c(1753723800, 1753724700, 
1753725600, 1753726500, 1753727400, NA, NA, NA, NA, NA, NA), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), value = c(176, 139, 103, 85, 61, NA, NA, NA, NA, 
NA, NA)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"), spec = structure(list(cols = list(timestamp = structure(list(
    format = ""), class = c("collector_datetime", "collector"
)), meter.identification = structure(list(), class = c("collector_skip", 
"collector")), meter.name = structure(list(), class = c("collector_skip", 
"collector")), meter.type = structure(list(), class = c("collector_skip", 
"collector")), value = structure(list(), class = c("collector_double", 
"collector")), meter.unit = structure(list(), class = c("collector_skip", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ";"), class = "col_spec"))
> # Rather than read in data each time from a csv, we can create an rds object and then read that in for the analysis.
> write_rds(raw_Measures,"all_raw_measures.rds")
> # In future we use the data in the rds object
> rm(raw_Measures)
> 
> # We can read in the rds object as follows. using the same variable (dataframe)
> # name means we can use all the following code without modification
> # First, I rename value to WattHour, a useful name in the context
> # Omit the "known" errors used to illustrate how to find nas in the input file
> raw_Measures <- read_rds("all_raw_measures.rds")%>% rename(WattHour = value)%>%na.omit()
> summary(raw_Measures$WattHour)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0    52.0   167.4   258.0   916.0 
> energy_raw_1=sum(raw_Measures$WattHour)
> dput(raw_Measures[108995:109005, ])
structure(list(timestamp = structure(c(1753726500, 1753727400, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), WattHour = c(85, 61, NA, NA, NA, NA, NA, NA, NA, 
NA, NA)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"), spec = structure(list(cols = list(timestamp = structure(list(
    format = ""), class = c("collector_datetime", "collector"
)), meter.identification = structure(list(), class = c("collector_skip", 
"collector")), meter.name = structure(list(), class = c("collector_skip", 
"collector")), meter.type = structure(list(), class = c("collector_skip", 
"collector")), value = structure(list(), class = c("collector_double", 
"collector")), meter.unit = structure(list(), class = c("collector_skip", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ";"), class = "col_spec"), na.action = structure(c(`5996` = 5996L, 
`13980` = 13980L, `106620` = 106620L), class = "omit"))
> # We need now to remove the zeros that occur before sunrise and after sunset
> # (while accounting for occlusion in the built environment!)
> 
> # Import sunrise/sunset data from https://robinfo.oma.be/en/astro-info/sun/sunrise-sunset-2025/ 
> # The Royal Observatory of Belgium
> # This is currently in the Excel file in the working directory where the data have been wrangled into shape.
> # This will let us filter out fixed-time zeros in the data to ensure accurate and meaningful data analysis
> 
> dput(raw_Measures[108995:109005, ])
structure(list(timestamp = structure(c(1753726500, 1753727400, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), WattHour = c(85, 61, NA, NA, NA, NA, NA, NA, NA, 
NA, NA)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", 
"data.frame"), spec = structure(list(cols = list(timestamp = structure(list(
    format = ""), class = c("collector_datetime", "collector"
)), meter.identification = structure(list(), class = c("collector_skip", 
"collector")), meter.name = structure(list(), class = c("collector_skip", 
"collector")), meter.type = structure(list(), class = c("collector_skip", 
"collector")), value = structure(list(), class = c("collector_double", 
"collector")), meter.unit = structure(list(), class = c("collector_skip", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ";"), class = "col_spec"), na.action = structure(c(`5996` = 5996L, 
`13980` = 13980L, `106620` = 106620L), class = "omit"))
> # We are now able to extract information for use in the grouping and plotting of data for the analysis and charts.
> # In this section I use a number of functions from the lubridate library.
> # The plots are prepared using the ggplot library in tidyverse.
> # The variable called processed_measures will hold the modified data for subsequent editing.
> 
> #Processed Data====
> # Add a sunshine ID (SID) to allow easy joining of datasets or filtering by joins
> #Extract variety of data from the timestamp using lubridate and pass the data to a new variable called processed_Measures
> processed_Measures<-raw_Measures%>%
+    mutate(SID=paste0(as.character(year(timestamp)),"-",sprintf("%03d",yday(timestamp))),theTime=as_hms(timestamp),
+    theDate=date(timestamp),
+    theHour=sprintf("%02d",hour(timestamp)),
+    DoW=wday(timestamp, label = TRUE),
+    theWeek=sprintf("%02d",week(timestamp)),
+    yearWeek=paste0(as.character(year(timestamp)),"-",sprintf("%02d",week(timestamp))),
+    theMonth=sprintf("%02d",month(timestamp)),
+    MoY=month(timestamp, label = TRUE),
+    theYear=year(timestamp),
+    yearMonth=paste0(as.character(year(timestamp)),"-",sprintf("%02d",month(timestamp))),
+    thePower=4*WattHour
+    )%>%relocate(SID)
>

The global environment shows the successful processing of data. When I add a single extra, correctly-formatted row to the csv file the code stops.
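
Comparing the two dput() outputs in this thread is informative: the failing run shows timestamp as a character vector (collector_character), while the successful run above shows POSIXct (collector_datetime). A minimal sketch of a guard that could be run right after import to catch this early; the helper name assert_datetime_column is hypothetical and the two small data frames are illustrative:

```r
# Hypothetical helper: stop early if a column was not imported as a date-time.
assert_datetime_column <- function(df, col = "timestamp") {
  if (!inherits(df[[col]], "POSIXct")) {
    stop(sprintf("Column '%s' has class '%s', expected POSIXct",
                 col, class(df[[col]])[1]))
  }
  invisible(df)
}

# Illustrative inputs: one parsed as date-time, one left as character.
good <- data.frame(
  timestamp = as.POSIXct("2025-07-28 18:30:00", tz = "UTC"),
  WattHour  = 61
)
bad <- data.frame(
  timestamp = "2025-07-28 18:30:00",
  WattHour  = 61,
  stringsAsFactors = FALSE
)

assert_datetime_column(good)  # passes silently

# The character column triggers the error; capture its message for display.
result <- tryCatch(assert_datetime_column(bad),
                   error = function(e) conditionMessage(e))
print(result)
```

A check like this turns a confusing downstream error in year(timestamp) into an explicit message at the import step.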

margusl · August 6, 2025, 12:30pm

How exactly are you adding row(s)?

I'd disable readr's date/time parsing for a moment to check records that caused issues:

library(readr)
library(lubridate)

csv2 <-
"timestamp;WattHour
2025-07-28 17:30:00;176
2025-00-28 17:45:00;139
2025-07-28 18:00:00;103"

raw_Measures <- 
  read_csv2(csv2, col_types = cols(.default = col_character()))

raw_Measures[is.na(ymd_hms(raw_Measures$timestamp)), ]
#> Warning: 1 failed to parse.
#> # A tibble: 1 × 2
#>   timestamp           WattHour
#>   <chr>               <chr>   
#> 1 2025-00-28 17:45:00 139

Created on 2025-08-06 with reprex v2.1.1

FJCC · August 6, 2025, 11:43pm

The length of the data frame is not a problem on my system. I would have been shocked if R could not handle more than 109000 rows. Try running the following code and see if it works for you.

library(tidyverse)
library(hms)
raw_Measures <- structure(list(timestamp = c("2025-07-28 17:30:00", "2025-07-28 17:45:00", 
                             "2025-07-28 18:00:00", "2025-07-28 18:15:00", "2025-07-28 18:30:00", 
                             "2025-07-28 18:45:00", "2025-07-28 19:00:00", "2025-07-28 19:15:00", 
                             "2025-07-28 19:30:00", "2025-07-28 19:45:00", "2025-07-28 20:00:00"), 
                             WattHour = c(176, 139, 103, 85, 61, 53, 47, 44, 35, 29, 24
)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", "data.frame"), 
spec = structure(list(cols = list(timestamp = structure(list(), class = c("collector_character", "collector")), 
                                     meter.identification = structure(list(), class = c("collector_skip", "collector")), 
                                     meter.name = structure(list(), class = c("collector_skip", "collector")), 
                                     meter.type = structure(list(), class = c("collector_skip", "collector")), 
                                     value = structure(list(), class = c("collector_double", "collector")), 
                                     meter.unit = structure(list(), class = c("collector_skip", "collector"))), 
                         default = structure(list(), class = c("collector_guess", "collector")), delim = ";"), class = "col_spec"))

processed_Measures<-raw_Measures %>%
   mutate(SID=paste0(as.character(year(timestamp)),"-",sprintf("%03d",yday(timestamp))),#theTime=as_hms(timestamp),
   theDate=date(timestamp),
   theHour=sprintf("%02d",hour(timestamp)),
   DoW=wday(timestamp, label = TRUE),
   theWeek=sprintf("%02d",week(timestamp)),
   yearWeek=paste0(as.character(year(timestamp)),"-",sprintf("%02d",week(timestamp))),
   theMonth=sprintf("%02d",month(timestamp)),
   MoY=month(timestamp, label = TRUE),
   theYear=year(timestamp),
   yearMonth=paste0(as.character(year(timestamp)),"-",sprintf("%02d",month(timestamp))),
   thePower=4*WattHour
   )  %>%  relocate(SID)

raw_Measures_long <- data.frame(timestamp = rep(raw_Measures$timestamp, 20000),
                                WattHour = rep(raw_Measures$WattHour, 20000))
processed_Measures_long <- raw_Measures_long %>%
  mutate(SID=paste0(as.character(year(timestamp)),"-",sprintf("%03d",yday(timestamp))),#theTime=as_hms(timestamp),
         theDate=date(timestamp),
         theHour=sprintf("%02d",hour(timestamp)),
         DoW=wday(timestamp, label = TRUE),
         theWeek=sprintf("%02d",week(timestamp)),
         yearWeek=paste0(as.character(year(timestamp)),"-",sprintf("%02d",week(timestamp))),
         theMonth=sprintf("%02d",month(timestamp)),
         MoY=month(timestamp, label = TRUE),
         theYear=year(timestamp),
         yearMonth=paste0(as.character(year(timestamp)),"-",sprintf("%02d",month(timestamp))),
         thePower=4*WattHour
  )  %>%  relocate(SID)

Notice that I commented out the calculation of the column named theTime. The as_hms() function does not parse the full timestamps, at least on my system and package set.

library(hms)
theTime=as_hms("2025-07-28 17:30:00")
#> Error in `abort_lossy_cast()`:
#> ! Lossy cast from <character> to <hms> at position(s) 1

Created on 2025-08-06 with reprex v2.1.1
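
as_hms() refuses a full date-time string, but it does accept a POSIXct value. A minimal sketch, assuming the timestamps are strings in the format shown in this thread, that parses them first with lubridate::ymd_hms() and then extracts the time of day:

```r
library(lubridate)
library(hms)

# Character timestamps in the format used in this thread.
ts_chr <- c("2025-07-28 17:30:00", "2025-07-28 18:45:00")

# Parse to POSIXct first; ymd_hms() gives NA (with a warning) for strings
# it cannot parse, which also makes faulty rows easy to find.
ts <- ymd_hms(ts_chr, tz = "UTC")

# as_hms() on a POSIXct value keeps only the time-of-day part.
the_time <- as_hms(ts)
print(the_time)
```

This is why theTime=as_hms(timestamp) works when readr has already parsed the column as a date-time and fails when the column arrives as character.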

Tom1961 · August 7, 2025, 8:43am

Once again, thank you for your comprehensive answer. I will try these suggestions but remain at a loss as to why it's happened just now. The code has always worked and allowed me to extract data (including something I can use to represent time) as some of my ggplots depend on this.

I will let you know what happens.

Tom1961 · August 7, 2025, 8:56am

Thank you for your response.
I add rows each day by downloading the data from the panel supplier and pasting them into the csv file! This has produced copy/paste errors in the past but I'm much more cautious now. I need the time part of the timestamp as the next part of my script combines the data in the table with data from a sunrise/set table which allows me to filter out the zeroes returned before sunrise and after sunset (with a little allowance for the local built environment that I calculated using trial and error).
This project is certainly a WIP and I'm learning as I go along.

Tom1961 · August 7, 2025, 10:09am

I get the same message as you do on running your code. This is clearly saying something but, unfortunately, still fails to account for the fact that the code I use runs perfectly, extracts all the data I need, in the format that I need it and then allows me to create output such as ggplots until I add one more row to the csv file!
I have checked all the downloaded source files and every row's timestamp is in the same format.
You can see the data types of the processed_Measures data frame in the image.

(The four 'missing rows' are omitted NAs that I have kept to explain how to remove NAs; there are 108896 rows in the data frame.)

Tom1961 · August 7, 2025, 10:10am

The top few rows of the processed_Measures data frame are shown below:

Tom1961 · August 7, 2025, 10:14am

Your code runs correctly and shows that the dataframe can hold at least 200,000 rows. So that's not an issue. I have analysed bigger data sets during the Harvard Data Science course.
The theTime variable is listed in the data frame data types as an 'hms' with units of 'sec', as shown in the image.

margusl · August 7, 2025, 10:29am

My proposal to disable date/time parsing in readr::read_csv2() was not meant as a change to your workflow but as a quick debugging step to identify the offending rows in the input CSV. Parsing the raw_Measures$timestamp character(!) vector with lubridate::ymd_hms() and subsetting by NAs (i.e. timestamp strings that did not match the ymd hms pattern) should also answer your "why now" question.

Perhaps an update from the panel supplier changed the timestamp format or something else in the export, or perhaps something crashed or restarted and one of the records is now malformed. Such issues can be hard to spot unless your editor can display non-printing characters and different line endings (e.g. Notepad++). Though with any process that involves manual steps, I'd first double- and triple-check the results of those steps.
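
A complementary check that needs no packages is to test the first field of every raw line against the expected pattern. This is a minimal sketch; the sample lines are hypothetical (a truncated year and a trailing space stand in for typical paste errors), and in practice the vector would come from readLines() on the real file:

```r
# Hypothetical sample in the export layout shown earlier in the thread.
lines <- c(
  "timestamp;meter.identification;meter.name;meter.type;value;meter.unit",
  "2025-07-28 18:15:00;;Smart Meter;Production;85;Wh",
  "025-07-28 18:30:00;;Smart Meter;Production;61;Wh",
  "2025-07-28 18:45:00 ;;Smart Meter;Production;53;Wh"
)

# First field of each data line (everything before the first semicolon).
first_field <- sub(";.*$", "", lines[-1])

# Strict "YYYY-MM-DD HH:MM:SS" pattern; stray characters or spaces fail it.
pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$"
bad_idx <- which(!grepl(pattern, first_field))

# Add 1 to account for the header, giving line numbers as seen in an editor.
bad_lines <- bad_idx + 1
print(data.frame(file_line = bad_lines, timestamp = first_field[bad_idx]))
```

The reported line numbers can then be looked up directly in the editor.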

If you feel stuck with your CSV and you don't consider the data to be too sensitive, you could share it through Google Drive or similar service so others can take a look.

Tom1961 · August 7, 2025, 10:33am

I understand and am thankful for any help. The data are not sensitive at all and, having checked, are always in a good format. Having said that, I'm happy to upload the data for inspection. I'll add a link to them when I've done that.

The Google Drive link is:

FJCC · August 7, 2025, 11:10am

The file you linked to does not allow public access.

margusl · August 7, 2025, 11:16am

In addition to allowing public access (I requested read permission), is this the exact file causing those issues? It only includes 109000 rows (incl. header), from your description I'd guess this one works for you too but fails once you append more rows.

Tom1961 · August 7, 2025, 2:19pm

I will sort this out. I have little experience using shared file systems!

Tom1961 · August 7, 2025, 2:20pm

Correct! If you add one more correctly-formatted row it falls over.
This is shown in the screenshot uploaded to Google Drive. Just adding one more row, stops the code running.

It is a (Latin word for I see) but that's a forbidden word in Google Drive!

jrkrideau · August 7, 2025, 4:50pm

I am only seeing 108999 rows of data. I think Tom1961 is using the spreadsheet row count. I get 109000 rows if I open the file in LibreOffice Calc.

Tom1961 · August 7, 2025, 5:15pm

I have 109000 rows including the header row. My apologies for not making that explicit.
I am using Notepad++ to edit CSV files. As soon as I add a row to make the number of data rows 109000 the script stops executing as seen in the MP4 file.

margusl · August 7, 2025, 10:34pm

By default readr uses 1000 rows for guessing column types, evenly spaced from the first to the last row (see "Automatic guessing" in the readr documentation).

Your file has 3 rows with faulty timestamps. With the increased row count, one or more of those rows seem to end up in the "guess sample", and as a result the timestamp column type is guessed to be character instead of datetime:

library(readr)

# duplicate last line for reprex
l <- read_lines("measures_solar_all.csv")
write_lines(c(l, tail(l, n = 1)), "measures_solar_all+1.csv")

# guessed column spec for original csv, timestamp = col_datetime(format = ""):
read_csv2("measures_solar_all.csv",   col_select = c(1,5), show_col_types = FALSE) |> spec()
#> cols(
#>   timestamp = col_datetime(format = ""),
#>   meter.identification = col_skip(),
#>   meter.name = col_skip(),
#>   meter.type = col_skip(),
#>   value = col_double(),
#>   meter.unit = col_skip()
#> )

# guessed column spec for +1.csv, timestamp = col_character():
read_csv2("measures_solar_all+1.csv", col_select = c(1,5), show_col_types = FALSE) |> spec()
#> cols(
#>   timestamp = col_character(),
#>   meter.identification = col_skip(),
#>   meter.name = col_skip(),
#>   meter.type = col_skip(),
#>   value = col_double(),
#>   meter.unit = col_skip()
#> )

To handle this properly, you could define the column types yourself. The default format for col_datetime() is a somewhat lax ISO8601, which works well with your (valid) timestamp values:

raw_Measures <- 
  read_csv2(
    "measures_solar_all+1.csv", 
    col_types =  cols_only(timestamp = col_datetime(), value = col_integer())
  )
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

raw_Measures
#> # A tibble: 109,000 × 2
#>    timestamp           value
#>    <dttm>              <int>
#>  1 2020-11-12 06:15:00     0
#>  2 2020-11-12 06:30:00     0
#>  3 2020-11-12 06:45:00     0
#>  4 2020-11-12 07:00:00     0
#>  5 2020-11-12 07:15:00     0
#>  6 2020-11-12 07:30:00     0
#>  7 2020-11-12 07:45:00     0
#>  8 2020-11-12 08:00:00     4
#>  9 2020-11-12 08:15:00    28
#> 10 2020-11-12 08:30:00    71
#> # ℹ 108,990 more rows


spec(raw_Measures)
#> cols_only(
#>   timestamp = col_datetime(format = ""),
#>   meter.identification = col_skip(),
#>   meter.name = col_skip(),
#>   meter.type = col_skip(),
#>   value = col_integer(),
#>   meter.unit = col_skip()
#> )

# timestamps that failed to parse:
problems(raw_Measures)
#> # A tibble: 3 × 5
#>      row   col expected        actual             file                          
#>    <int> <int> <chr>           <chr>              <chr>                         
#> 1   5997     1 date in ISO8601 021-03-01 06:15:00 measures_solar_all+1.csv
#> 2  13981     1 date in ISO8601 021-07-01 05:30:00 measures_solar_all+1.csv
#> 3 106621     1 date in ISO8601 05:30:00           measures_solar_all+1.csv
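
With explicit column types, rows whose timestamp fails to parse arrive as NA and can be inspected and dropped. A minimal self-contained sketch using inline data in the same layout; the three data rows are illustrative rather than taken from the real file, and col_double() is used for value as in the original import:

```r
library(readr)

# Inline CSV in the export layout; I() marks the string as literal data.
csv_text <- I("timestamp;meter.identification;meter.name;meter.type;value;meter.unit
2025-07-28 18:15:00;;Smart Meter;Production;85;Wh
05:30:00;;Smart Meter;Production;0;Wh
2025-07-28 18:45:00;;Smart Meter;Production;53;Wh
")

# Explicit types: no guessing, so the result does not depend on row count.
raw <- read_csv2(
  csv_text,
  col_types = cols_only(timestamp = col_datetime(), value = col_double())
)

# Rows that failed to parse are listed here and appear as NA in the data.
print(problems(raw))

# Keep only rows with a valid timestamp.
clean <- raw[!is.na(raw$timestamp), ]
print(clean)
```

Since the column type is now fixed, the later lubridate and as_hms() calls always receive a date-time column, whether or not a faulty row is present.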