Complete missing rows

mihefra · July 3, 2020, 9:33am

Hi!

I have the following problem. The data frames I am using have hourly observations, but occasionally there are values missing due to faults in the observing station. In these cases, rows were completely skipped, instead of inserting NAs. So it looks like this e.g.:

time code value
201901010000 3.5
201901010100 4.2
201901010200 3.2
201901010300 1.5
201901010400 5.3
201901010800 3.8
201901010900 2.9
201901011000 4.6

What I need, is to fix this, so it looks like that:

time code value
201901010000 3.5
201901010100 4.2
201901010200 3.2
201901010300 1.5
201901010400 5.3
201901010500 NA
201901010600 NA
201901010700 NA
201901010800 3.8
201901010900 2.9
201901011000 4.6

It is okay for me to have NAs, I just need full samples without gaps. I found that you can achieve this with the "complete" function of tidyr, but I don't have dates or years etc. How can I solve this using this hourly sequence?

Thank you in advance! If there any questions, feel free to ask any time.

Best regards,
Michael

nirgrahamuk · July 3, 2020, 9:58am

there may be a more elegant way but this is a start

library(lubridate)
library(tidyverse)

exdf <- data.frame(
  timecode = c(
    "201901010300", "201901010400", # skip5.
    "201901010600", "201901010700"
  ),
  value = letters[1:4]
)

exdf <- mutate(exdf,
  mytime = as_datetime(ymd_hm(timecode), )
) %>% group_by(mytime)


unique_days <- pull(exdf, mytime) %>%
  date() %>%
  unique()

total_times_df <- expand_grid(
  unique_days,
  hour = 0:23
)
total_times_df$dtimes <- as.POSIXct(as_datetime(total_times_df$unique_days 
                                                + hours(total_times_df$hour)))

completed_df <- left_join(total_times_df, exdf,
  by = c("dtimes" = "mytime")
)

siddharthprabhu · July 3, 2020, 10:36am

EDIT: I didn't consider multiple dates in my approach. See Yarnabrina's post below for a more robust solution.

You rightly identified tidyr::complete() as the most appropriate function for the job. It works for all types of values not just dates. All you need to do is supply it with the sequence (increments of 100 in this case).

library(tidyverse)

df <- tribble(~timecode, ~value,
              201901010000, 3.5,
              201901010100, 4.2,
              201901010200, 3.2,
              201901010300, 1.5,
              201901010400, 5.3,
              201901010800, 3.8,
              201901010900, 2.9,
              201901011000, 4.6)

complete(df, timecode = seq(min(timecode), max(timecode), by = 100))
#> # A tibble: 11 x 2
#>        timecode value
#>           <dbl> <dbl>
#>  1 201901010000   3.5
#>  2 201901010100   4.2
#>  3 201901010200   3.2
#>  4 201901010300   1.5
#>  5 201901010400   5.3
#>  6 201901010500  NA  
#>  7 201901010600  NA  
#>  8 201901010700  NA  
#>  9 201901010800   3.8
#> 10 201901010900   2.9
#> 11 201901011000   4.6

^{Created on 2020-07-03 by the reprex package (v0.3.0)}

mihefra · July 3, 2020, 10:43am

Thanks for the quick reply. I am looking into it, there is a lot of new syntax for me.

Yarnabrina · July 3, 2020, 10:48am

100 is risky. What if column has times of another date?

Combining neatness of Siddharth's solution and time approach of Nir:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(readr)
library(tidyr)

sample_input <- read_delim(file = "timecode value
201901010000 3.5
201901010100 4.2
201901010200 3.2
201901010300 1.5
201901010400 5.3
201901010800 3.8
201901010900 2.9
201901011000 4.6
201901020000 3.5
201901020100 4.2
201901020200 3.2
201901020300 1.5
201901020400 5.3
201901020800 3.8
201901020900 2.9
201901021000 4.6",
                           delim = " ",
                           col_names = TRUE)

sample_input %>%
    # converting character to date time object
    mutate(timecode = ymd_hm(timecode)) %>%
    # completing with hourly intervals in the entire range
    complete(timecode = seq(from = min(timecode),
                            to = max(timecode),
                            by = "hour")) %>%
    # if character required, skip otherwise
    mutate(timecode = format(x = timecode,
                             format = "%Y%m%d%H%M")) %>%
    # just to show the result, skip this also
    as.data.frame()
#>        timecode value
#> 1  201901010000   3.5
#> 2  201901010100   4.2
#> 3  201901010200   3.2
#> 4  201901010300   1.5
#> 5  201901010400   5.3
#> 6  201901010500    NA
#> 7  201901010600    NA
#> 8  201901010700    NA
#> 9  201901010800   3.8
#> 10 201901010900   2.9
#> 11 201901011000   4.6
#> 12 201901011100    NA
#> 13 201901011200    NA
#> 14 201901011300    NA
#> 15 201901011400    NA
#> 16 201901011500    NA
#> 17 201901011600    NA
#> 18 201901011700    NA
#> 19 201901011800    NA
#> 20 201901011900    NA
#> 21 201901012000    NA
#> 22 201901012100    NA
#> 23 201901012200    NA
#> 24 201901012300    NA
#> 25 201901020000   3.5
#> 26 201901020100   4.2
#> 27 201901020200   3.2
#> 28 201901020300   1.5
#> 29 201901020400   5.3
#> 30 201901020500    NA
#> 31 201901020600    NA
#> 32 201901020700    NA
#> 33 201901020800   3.8
#> 34 201901020900   2.9
#> 35 201901021000   4.6

^{Created on 2020-07-03 by the reprex package (v0.3.0)}

Hi @mihefra! Saw you following comment #6, but I'm not sure I understood the context, and hence can't help. I hope others will be able to guide you. Good luck!

mihefra · July 3, 2020, 11:02am

Thanks a lot, I'll try it! Otherwise, there are also correct and "clean" samples, that have all the values and no gaps. My first idea was that I can use the time code vector of the correct samples as kind of a reference sequence. Can that be useful or beneficial? I tried a double for loop that compares the defective sequences with the correct ones, but it takes too much computation and R freezes.

mihefra · July 3, 2020, 3:35pm

Hi again! It seems to work so far, thanks again. But how can I set definite values for start and end of the sequence? Some of my sample start at a time later than the start of the whole observation. If I place the starting date (which is 201601010000) instead of "min(timecode)", I get an error message. Not sure, where the problem is.

toryn_stat · July 3, 2020, 5:08pm

I think you need to put the starting date in the ymd_hm format so try replacing min(timecode) with ymd_hm(201601010000)

mihefra · July 3, 2020, 10:49pm

This works, thank you.

system · July 18, 2020, 4:59am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.