Your code is fine; the problem is that the strings do not contain enough information. They give only the last two digits of the year, so lubridate has to guess the century. It follows the usual strptime convention for two-digit years: 00 to 68 are read as 2000 to 2068, and 69 to 99 as 1969 to 1999. So 94 is interpreted as 1994, but 42 is interpreted as 2042, not 1942.
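A quick way to see the guess in action is the sketch below. The two input strings are made up for illustration; only the two-digit years 94 and 42 matter.

```r
library(lubridate)

# Made-up ddmmyy strings; lubridate must pick a century for the year.
d94 <- dmy("030994")  # year 94
d42 <- dmy("220542")  # year 42

year(d94)  # 1994
year(d42)  # 2042, not 1942
```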
There is no foolproof way to fix this (apart from having the full year), because you can never know for sure which century the two digits refer to. Of course, you know that a birthday can't be in the future, but that rule is not enough: someone who was born in 1915 and is now 104 years old will be interpreted as born in 2015, i.e. 4 years old.
I hope you see what the problem is.
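To make the "birthdays can't be in the future" rule and its limit concrete, here is a minimal base R sketch. The helper name `parse_birthday` and the fixed reference date are my own assumptions for illustration, not part of the original code.

```r
# Hypothetical helper: parse "ddmmyy" strings and move any date that falls
# after `today` back by 100 years.
parse_birthday <- function(x, today = Sys.Date()) {
  d <- as.Date(x, format = "%d%m%y")  # base R: 00-68 -> 20xx, 69-99 -> 19xx
  future <- !is.na(d) & d > today
  lt <- as.POSIXlt(d)
  lt$year <- lt$year - ifelse(future, 100L, 0L)
  as.Date(lt)
}

# Fixed reference date (assumed) so the result does not depend on the run date.
ref <- as.Date("2019-09-10")
res <- parse_birthday(c("220542", "030919", "010115"), today = ref)
res
# "220542" -> 1942-05-22 (2042 would be in the future, so it is shifted)
# "030919" -> 2019-09-03 (not in the future, kept)
# "010115" -> 2015-01-01 (wrong if the person was actually born in 1915)
```

The last value is exactly the ambiguity described above: nothing in the string distinguishes 1915 from 2015.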
I created one workaround, but it is not perfect.
library(stringr)
library(dplyr)
library(lubridate)

myData <- data.frame(original = c("030919-3460", "220522-2567"))

myData <- myData %>% mutate(
  # Two-digit year: the two characters just before the hyphen
  birthday = as.integer(str_extract(original, "..(?=-)")),
  # Insert the century: "19" if the year is greater than 19 (for 2019), else "20"
  birthday = ifelse(birthday > 19,
                    str_replace(original, "(....)(..(?=-))()", "\\119\\2"),
                    str_replace(original, "(....)(..(?=-))()", "\\120\\2")),
  # Parse the leading digits (now ddmmYYYY) as a date
  birthday = as.Date(str_extract(birthday, "^\\d+"), "%d%m%Y"),
  # Numeric part after the hyphen
  id = as.numeric(str_extract(original, "\\d+$")))

myData <- myData %>% mutate(
  age = floor(interval(birthday, Sys.Date()) / years(1)),
  # Even id -> "Female", odd id -> "Male"
  gender = ifelse(id %% 2 == 0, "Female", "Male"))
> print(myData)
     original   birthday   id age gender
1 030919-3460 2019-09-03 3460   0 Female
2 220522-2567 1922-05-22 2567  97   Male
I first extract the two-digit year from the original string, then check whether it is greater than 19 (the current year, 2019). If so, I put 19 in front of it, otherwise 20. This changes 15 into 2015 but 35 into 1935. It will still give wrong results for people who are more than 100 years old.
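As a possible shorter variant of the same idea: lubridate's `fast_strptime()` is documented to take a `cutoff_2000` argument (default 68), which does the same "19 or 20" decision during parsing. The sketch below assumes a cutoff of 19 to match the example above; it is an alternative I have not compared against every edge case of the regex version.

```r
library(lubridate)

original <- c("030919-3460", "220522-2567")

# Two-digit years <= cutoff are read as 20xx, larger ones as 19xx.
# 19 matches the example above; year(Sys.Date()) %% 100 would track the current year.
cutoff <- 19L

# The first six characters are the ddmmyy part.
birthday <- as.Date(fast_strptime(substr(original, 1, 6), "%d%m%y",
                                  cutoff_2000 = cutoff))
birthday
```

This has the same limitation as the workaround: anyone older than 100 still ends up in the wrong century.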
PJ