Creating a new variable based on conditions of two other variables

gustavobrp · February 5, 2020, 10:25pm

I'm trying to create a new variable in a dataset based on conditions on two other variables. Basically, I want to simplify the information about the parents' education, which is split between father and mother, into a single variable that takes into account the highest level of education of the two parents. For example, if the father's education level is 1 and the mother's is 0, the value for this row in the new variable would be 1.

I'm trying to use mutate() with case_when(), which worked for another variable, but I don't understand why it isn't working now. When I try, it creates a column with only NAs, and when I print a table of it, the result is:

< table of extent 0 >

The class of the two variables that I'm using for conditions is 'labelled' and 'factor'.

First, I tried the following command (I'm simplifying the codes):

dataset <- dataset %>%
           mutate(NEW_EDUCATIONAL_VAR = case_when(MOTHER_EDUCATIONAL_VAR == '0' & FATHER_EDUCATIONAL_VAR == '0' ~ '0',
                                                  MOTHER_EDUCATIONAL_VAR == '0' & FATHER_EDUCATIONAL_VAR == '1' ~ '1'))

Then, I tried to handle the cases with NA values, since some rows contain NA:

dataset <- dataset %>%
           mutate(NEW_EDUCATIONAL_VAR = case_when(is.na(MOTHER_EDUCATIONAL_VAR) & is.na(FATHER_EDUCATIONAL_VAR) ~ '99',
                                                  MOTHER_EDUCATIONAL_VAR == '0' & FATHER_EDUCATIONAL_VAR == '1' ~ '1'))

When I used these functions to create a new categorical variable for age, it worked:

dataset <- dataset %>% mutate(AGE_CAT = case_when(AGE >= 16 & AGE <= 18 ~ '0',
                                                   AGE >= 19 & AGE <= 24 ~ '1',
                                                   AGE >= 25 & AGE <= 29 ~ '2',
                                                   AGE >= 30 ~ '3'))

So, what am I doing wrong? Thanks a lot.

technocrat · February 5, 2020, 11:01pm

Try ifelse(var1 == 1 | var2 == 1, 1,0)

gustavobrp · February 6, 2020, 12:28am

Doesn't | mean 'or'? Shouldn't I use 'and'?

Anyway, I tried the following steps, and I always get the last number (the false response):

dataset$NEW_EDUCATIONAL_VAR <- NA
dataset$NEW_EDUCATIONAL_VAR <- ifelse(dataset$MOTHER_EDUCATIONAL_VAR == '0' | dataset$FATHER_EDUCATIONAL_VAR == '0', '0', '99')
dataset$NEW_EDUCATIONAL_VAR <- ifelse(dataset$MOTHER_EDUCATIONAL_VAR == '1' | dataset$FATHER_EDUCATIONAL_VAR == '1', '1', '99')

I only get '99' in the column, even though I know there are values in the two other columns that meet the conditions in ifelse(). I have a dataset with 5081 observations, and I get a new column with 4813 rows of 99 and the rest NA.

nirgrahamuk · February 6, 2020, 12:31am

I'm thinking that if all education levels are numbers with a clear order (0, 1, 2, 3, etc.), why not convert them from character to numeric?
You could then use the max() function to find the greater of the two parents' education levels. max() also has an na.rm parameter that you can set to TRUE to ignore NA values, so that max(1, NA, na.rm = TRUE) returns 1.
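A quick illustration of the na.rm behaviour, as a minimal base R sketch. The vectors `mother` and `father` are made up for illustration. Note that max() collapses all of its arguments into a single value, so applied to whole columns it does not give a row-by-row result.

```r
# max() propagates NA unless na.rm = TRUE
max(1, NA)                # NA
max(1, NA, na.rm = TRUE)  # 1

# Hypothetical columns, for illustration only
mother <- c(0, 0, 1, NA)
father <- c(0, 1, 0, 1)

# max() over two vectors returns ONE number for all values combined,
# not one value per row
overall_max <- max(mother, father, na.rm = TRUE)
overall_max               # 1
```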

nirgrahamuk · February 6, 2020, 1:00am

Possibly pmax for a vectorized approach will work. Here is a very detailed lesson/example
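A minimal sketch of the pmax() idea, assuming both columns are already numeric. The data frame and its column names are hypothetical, not taken from the original dataset.

```r
# Hypothetical sample data; column names are illustrative only
df <- data.frame(
  mother = c(0, 0, 1, NA, NA),
  father = c(0, 1, 0, 1, NA)
)

# pmax() compares the two vectors element by element.
# With na.rm = TRUE, the non-missing value is kept when only one parent is NA;
# the result is NA only when both are NA.
df$highest <- pmax(df$mother, df$father, na.rm = TRUE)

df$highest
# 0 1 1 1 NA
```

The same call should also work inside mutate() if you prefer the dplyr style.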

gustavobrp · February 6, 2020, 1:55am

I'm posting the answer from another user on Stack Overflow that resolved my issue. But I will also try the suggestion given by @nirgrahamuk.

https://stackoverflow.com/posts/60086150/revisions

You can play around with the values. Hope this helps.

#packages
library(tidyverse)

#sample data
Mother <- c(0,0,0,1,1,NA)
Father <- c(0,1,1,0,0,1)
df <- data.frame(Mother, Father)
str(df) #both Mother and Father columns are numeric

#mutate + case_when
df %>% 
  mutate(New = case_when(Mother == 0 & Father == 0 ~ 0, #condition 1
                         Mother == 0 & Father == 1 ~ 1, #condition 2
                         is.na(Mother) & Father == 1 ~ NA_real_, #condition 3
                         TRUE ~ 99)) #all other cases

Output:

  Mother Father New
1      0      0   0
2      0      1   1
3      0      1   1
4      1      0  99
5      1      0  99
6     NA      1  NA
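The sample data above is numeric, whereas the columns in the original question were reported as 'labelled' and 'factor'. Here is a minimal sketch of converting a factor column before comparing it to numbers. It assumes the factor levels are the digit strings "0" and "1", which is an assumption because the real labels are not shown in this thread. Columns of class 'labelled' may need a different conversion, depending on the package that created them.

```r
# Hypothetical factor column; the levels "1" and "0" are an assumption
mother_f <- factor(c("1", "0", "1", NA), levels = c("1", "0"))

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(mother_f)
# 1 2 1 NA

# Going through as.character() first recovers the values shown in the labels
mother_num <- as.numeric(as.character(mother_f))
mother_num
# 1 0 1 NA
```

Once converted, comparisons such as `mother_num == 0` behave like the numeric columns in the example above.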

technocrat · February 6, 2020, 1:56am

See FAQ: What's a reproducible example (`reprex`) and how do I do one?

If I understood your test correctly, you want to create a new variable representing the highest score between the parents.

The | operator tests both parents and returns 1 if any of the following conditions is true:

Both parents are coded 1
Father is coded 1
Mother is coded 1

If none of these is satisfied, the new variable is coded 0, because neither parent is coded 1.
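The logic described here can be sketched with a single ifelse() call. The vectors below are made-up 0/1 codes, and the sketch assumes the values are numeric rather than factor or labelled.

```r
# Hypothetical 0/1 codes for each parent
mother <- c(0, 0, 1, 1)
father <- c(0, 1, 0, 1)

# 1 if either parent (or both) is coded 1, otherwise 0
highest <- ifelse(mother == 1 | father == 1, 1, 0)
highest
# 0 1 1 1

# Missing values: NA | TRUE is TRUE, but NA | FALSE stays NA
na_with_one  <- ifelse(NA == 1 | 1 == 1, 1, 0)  # 1
na_with_zero <- ifelse(NA == 1 | 0 == 1, 1, 0)  # NA
```

Note that running two separate ifelse() assignments to the same column, one after the other, keeps only the result of the last one, so all conditions need to be combined in one expression.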

gustavobrp · February 6, 2020, 3:43pm

Thanks a lot for the tip about the reprex.

I see, I'm going to try that. I posted an answer that resolved the problem, but I see your point. Thanks for replying!
