Separating integers in a column

naja · September 27, 2019, 12:11pm

Hi There,
I have a column with a birth date and a unique identification number that I want to split into two seperate columns giving me the information "age" and "gender"
The cell values look like this:
Ex: 031096-3460

Whereby the first six integers are in the format ddmmyy and the remaining four integers represent a unique identification number. If this number is an even number i want to code it 'female' and if it is uneven I want to code it 'Male' in a seperate column.

In this way, i need to turn the column into a birth day column and convert that to age and the other column into a gender column.
Can anyone help me with some code to do this?
Many Thanks,
Naja

pieterjanvc · September 27, 2019, 12:32pm

Hi,

First of all, welcome to the RStudio Community!

Find below a way to implement your request

library(stringr)
library(dplyr)
library(lubridate)

#Create a fake dataset with the original column
myData = data.frame(original = c("031096-3460", "220506-2567"))

#Create two new columns birthday and id
myData = myData %>% mutate(birthday = as.Date(str_extract(original, "^\\d+"), "%d%m%y"),
                           id = as.numeric(str_extract(original, "\\d+$")))

#Convert birthday to age and id to gender
myData = myData %>% mutate(age = floor(interval(birthday, Sys.Date()) / years(1)),
                           gender = ifelse(id %% 2 == 0, 'Female', "Male"))

The first part uses the str_extract function from stringr to get the right data and convert it to the type needed

The second part then converts the birthday to age using the interval function from lubridate and a simple if-else to see if the id is even or not to assign the gender

NOTE: you could actually combine step 1 and 2 into just a single step, but I split it to make it more readable.

#All in one (just get age and gender)
myData = myData %>% 
  mutate(age = floor(interval(
                as.Date(original, "%d%m%y"), Sys.Date()) 
                / years(1)),
         gender = ifelse(
           as.numeric(str_extract(original, "\\d+$")) %% 2 == 0, 
           'Female', "Male"))

Hope this helps,
PJ

naja · October 2, 2019, 12:22pm

Hi Pieterjanvc!

Thank you so much for your useful guide and your welcome to the community. As you can hear I'm still fairly new at using R. When I run your code many of the age integers now appear in a negative format. Ages range only from +50 to -50.

I'm having a problem with the first bit of code as the manual entering of the numbers (c("031096-3460", "220506-2567") wouldn't work since it is a dataset of roughly 30.000 dataentries. I made the code as follows, but that results in these negative integers, probably because I cannot seem to get the first step right. My second bit of code looks as follows:

###############################################

donations = donations %>%
mutate(age = floor(interval(
as.Date(CPR, "%d%m%y"), Sys.Date())
/ years(1)),
gender = ifelse(
as.numeric(str_extract(CPR, "\d+$")) %% 2 == 0,
'Female', "Male"))

######################################################
Whereby 'donations' is the name of the dataset and 'CPR' is the name of the column I wish to split.

Many Thanks,
Naja

andresrcs · October 2, 2019, 12:49pm

You are not supposed to do any manual data entry, that is just made up data used for exemplifying purposes, since you haven't provided any sample data for us to work with.

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

pieterjanvc · October 2, 2019, 8:25pm

Your code is OK, but it's the strings that do not contain enough information. You only provide the last two digits of the year, so lubridate assumes that from the current year 2019 or 19, 50 years before and after are displayed as two numbers (i.e. the year closest to the current year). So 94 is interpreted as 1994, but 42 is interpreted as 2042 (since 2042 is closer to 2019 than 1942).

There is no fool proof way to fix this (apart from having the full year), as you'll never know for sure what the two digits mean. Of course you know for birthdays that they can't be in the future, but someone who was born in 1915, and now is 104 years old will be interpreted as 4 years old now.

I hope you see what the problem is.

I created one workaround, but it is not perfect.

library(stringr)
library(dplyr)
library(lubridate)

myData = data.frame(original = c("030919-3460", "220522-2567"))

myData = myData %>% mutate(birthday = as.integer(str_extract(original, "..(?=-)")),
                           birthday = ifelse(birthday > 19, 
                                             str_replace(original, "(....)(..(?=-))()", "\\119\\2"),
                                             str_replace(original, "(....)(..(?=-))()", "\\120\\2")),
                           birthday = as.Date(str_extract(birthday, "^\\d+"), "%d%m%Y"),
                           id = as.numeric(str_extract(original, "\\d+$")))

myData = myData %>% mutate(age = floor(interval(birthday, Sys.Date()) / years(1)),
                           gender = ifelse(id %% 2 == 0, 'Female', "Male"))

> print(myData)
     original   birthday   id age gender
1 030919-3460 2019-09-03 3460   0 Female
2 220522-2567 1922-05-22 2567  97   Male

I first extract the two digit year from the original string, then look if it's > 19 (current year). If so, I put 19 in from of it, else 20. This changes 15 into 2015 but 35 into 1935. It still will make errors for people > 100 years old.

PJ

andresrcs · October 2, 2019, 9:57pm

There is also a lower level lubridate function that allows you to change the default cutoff value (cutoff_2000 = 68L)

library(stringr)
library(dplyr)
library(lubridate)

myData <- data.frame(original = c("030919-3460", "220522-2567"))

myData %>%
    mutate(birthday = parse_date_time2(str_extract(original, "\\d+(?=-)"),
                                       orders = "dmy",
                                       cutoff_2000 = 19),
           id = as.numeric(str_extract(original, "\\d+$")),
           age = floor(interval(birthday, Sys.Date()) / years(1)),
           gender = ifelse(id %% 2 == 0, 'Female', "Male"))
#>      original   birthday   id age gender
#> 1 030919-3460 2019-09-03 3460   0 Female
#> 2 220522-2567 1922-05-22 2567  97   Male

naja · October 9, 2019, 11:03am

Thank you so much to both of you. This seems to work now although there appears to be a small number of NA's with both your codes. Still a huge help though as I can just eliminate the NA's from the dataset.
Andresrcs, I have tried for the first time to do a reproducable example in another question on here and really appreciated your FAQ for that. Thank you!

Naja

andresrcs · October 9, 2019, 12:13pm

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

system · October 30, 2019, 12:13pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.