Data Cleaning - Dealing with dates, partial imputation and format issues

Hello everyone !

I'm currently cleaning up a dataset and dealing with dates.

The dataset has a few problems with inconsistent formatting, some variables having fluctuating precision ("Y" vs "Y-m "vs "Y-m-d"). I would like to standardize and partially imputate the data.
I'm doing my homework and have been reading about/trying out different packages (lubridate, cleanepi, datefixR). Lubridate has been the most helpful but I'm looking for a way to choose the particular date to imput the data with (it goes to January 1st as default). I haven't found a way to change it myself and would really like to set my missing dates to the 15th of the month when day is missing, or July 1st when day+month are missing.
datefixR would have been perfect for my case, but it only works with character variables and only works with "mdy" and "dmy" formats.
I couldn't get cleanepi to work because of some NA's in my dataframe.

Some forum topic answers seem to combine lubridate with admiral for the effect. It looks interesting, but I'm not the biggest coder and find the code hard to understand/re-use.

Right now, the code for one of the variables with inconsistent precision looks like that :

df$variable <- parse_date_time(df$variable, orders = c("Y-m-d", "Y-m", "Y"))

This would seem to be a recurrent problem for people working with free text in their code/old data, so I was wondering if anyone else had an effective solution already !
Any help would be appreciated !

This sounds like a nightmare, Is there any way that you can supply us with some sample raw data that shows the problems? We probably don't need a lot of data but we need to see all the problems if possible.

A handy way to supply data is to use the dput() function. Do dput(mydata) where "mydata" is the name of your dataset. For really large datasets probably dput(head(mydata, 100). Paste it here between

```

```

1 Like

Hello !
Here's a short example of what the problematic object looks like.
(All data is modified from the original dataset, but has the same format.)
As you can see, it's a mixed bag.

dput(df$InconsistentDates)
c("2001/NUL/NUL", "2001/NUL/NUL", "2000-06-02", "1991/NUL/NUL", 
  "1984/NUL/NUL", "1985/NUL/NUL", "2013-09-12", "1995", "1994/NUL/NUL", 
  "2009-09-05", "2011-03-22", "1990/NUL/NUL", "1999-06-17", "2000-09-11", 
  "1993/NUL/NUL", "2007/NUL/NUL", "2005/NUL/NUL", "1991/NUL/NUL", 
  "1992/NUL/NUL", "2011/NUL/NUL", "1994/NUL/NUL", "2003-09-30",
  "1993", "2010-06", NA, NA, NA)

Other variables have this form (ex : 2020-06-20, 02-AUG-19, 13/17/2006 or 1965/08/22). Though much easier to deal with as they retain a consistent format for the whole variable.

As you seem to have a fixed and relatively small number of different patterns there, pre-processing strings yourself into a common format (e.g. yyyy-mm-dd for lubridate::ymd()) is probably one of the easiest approaches and we can achieve this through replacements with regular expressions here.
And {stringr} to make it a bit more convenient:

library(dplyr)
library(lubridate)
library(stringr)

df <- 
  tibble(InconsistentDates = 
  c("2001/NUL/NUL", "2001/NUL/NUL", "2000-06-02", "1991/NUL/NUL", 
  "1984/NUL/NUL", "1985/NUL/NUL", "2013-09-12", "1995", "1994/NUL/NUL", 
  "2009-09-05", "2011-03-22", "1990/NUL/NUL", "1999-06-17", "2000-09-11", 
  "1993/NUL/NUL", "2007/NUL/NUL", "2005/NUL/NUL", "1991/NUL/NUL", 
  "1992/NUL/NUL", "2011/NUL/NUL", "1994/NUL/NUL", "2003-09-30",
  "1993", "2010-06", NA, NA, NA)
)

df |> 
  mutate(
    dstr_ymd = str_replace_all(InconsistentDates, pattern = c(
      # pattern1 = replacement1
      # pattern: 
      # ^ - string start anchor; () - group; \\d - digits; 
      # {4} - repeated exactly 4 times; ? - repeated 0 or 1 times; 
      # "/NUL/NUL" - literal string
      # $ - string end anchor
      # replacement: \\1 - 1st group from pattern; "-07-01" - literal string
      "^(\\d{4})(/NUL/NUL)?$" = "\\1-07-01",
      "^(\\d{4}-\\d{2})$" = "\\1-15")),
    d = ymd(dstr_ymd)
    ) |> 
  print(n = Inf)

Result :

#> # A tibble: 27 × 3
#>    InconsistentDates dstr_ymd   d         
#>    <chr>             <chr>      <date>    
#>  1 2001/NUL/NUL      2001-07-01 2001-07-01
#>  2 2001/NUL/NUL      2001-07-01 2001-07-01
#>  3 2000-06-02        2000-06-02 2000-06-02
#>  4 1991/NUL/NUL      1991-07-01 1991-07-01
#>  5 1984/NUL/NUL      1984-07-01 1984-07-01
#>  6 1985/NUL/NUL      1985-07-01 1985-07-01
#>  7 2013-09-12        2013-09-12 2013-09-12
#>  8 1995              1995-07-01 1995-07-01
#>  9 1994/NUL/NUL      1994-07-01 1994-07-01
#> 10 2009-09-05        2009-09-05 2009-09-05
#> 11 2011-03-22        2011-03-22 2011-03-22
#> 12 1990/NUL/NUL      1990-07-01 1990-07-01
#> 13 1999-06-17        1999-06-17 1999-06-17
#> 14 2000-09-11        2000-09-11 2000-09-11
#> 15 1993/NUL/NUL      1993-07-01 1993-07-01
#> 16 2007/NUL/NUL      2007-07-01 2007-07-01
#> 17 2005/NUL/NUL      2005-07-01 2005-07-01
#> 18 1991/NUL/NUL      1991-07-01 1991-07-01
#> 19 1992/NUL/NUL      1992-07-01 1992-07-01
#> 20 2011/NUL/NUL      2011-07-01 2011-07-01
#> 21 1994/NUL/NUL      1994-07-01 1994-07-01
#> 22 2003-09-30        2003-09-30 2003-09-30
#> 23 1993              1993-07-01 1993-07-01
#> 24 2010-06           2010-06-15 2010-06-15
#> 25 <NA>              <NA>       NA        
#> 26 <NA>              <NA>       NA        
#> 27 <NA>              <NA>       NA

For testing pattern matches, you could try stringr::str_view():

str_view(df$InconsistentDates, "^(\\d{4})(/NUL/NUL)?$")
#>  [1] │ <2001/NUL/NUL>
#>  [2] │ <2001/NUL/NUL>
#>  [4] │ <1991/NUL/NUL>
#>  [5] │ <1984/NUL/NUL>
#>  [6] │ <1985/NUL/NUL>
#>  [8] │ <1995>
#>  [9] │ <1994/NUL/NUL>
#> [12] │ <1990/NUL/NUL>
#> [15] │ <1993/NUL/NUL>
#> [16] │ <2007/NUL/NUL>
#> [17] │ <2005/NUL/NUL>
#> [18] │ <1991/NUL/NUL>
#> [19] │ <1992/NUL/NUL>
#> [20] │ <2011/NUL/NUL>
#> [21] │ <1994/NUL/NUL>
#> [23] │ <1993>
str_view(df$InconsistentDates, "^(\\d{4}-\\d{2})$")
#> [24] │ <2010-06>

If you are new to regex, {stringr} Regular expressions Vignette is a great place to start:

Thanks for the data. It's a lovely mess.

BTW is this a typo?

Oh yeah it is !!! Sorry, I manually changed the stuff.

No problem, I thought it was but time/date formats are wacky enough there may be a specialized need for a 13th month.

Thank you so much !!

This is exactly what I was trying to do ! Thanks for explaining it so well, it helped a lot.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.