Recoding gender variable

I have two questions about cleaning and recoding a GENDER variable from a survey I ran.

My original numeric data has the following values under the GENDER variable:
"1" (which is female)
"2" (which is male)
"4" (which is other)
"5" (which is unsure)
I also apparently have two NAs.
NOTE: nobody selected option "3", which was "non-binary" in our survey.

I start by running this code in R:

# convert from numeric survey responses to factor variable

newdata$gender <- as.factor(newdata$gender)
str(newdata$gender) # looks OK I got following: factor w/ 4 levels "1", "2", "4", "5"

table(newdata$gender) #
summary(newdata$gender) #
Question #1
Why do the table() and summary() outputs differ? table() shows only the 4 levels, but summary() also includes the NAs. See the results pasted below.

[Screenshot: table() and summary() output]

Question #2
How do I recode this factor so that it has only 3 levels (Male, Female, Other), where anybody who selected 3, 4 or 5 (or is NA) is simply subsumed into "Other"? Remember that nobody selected "3" in the survey.


table()'s default behaviour is to ignore NA; summary()'s is to include it. You can change table()'s behaviour by passing the useNA argument. See below, which also includes an example for Q2.

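library(magrittr)  # provides the %>% pipe

raw <- c(1,2,4,5,NA,NA)

(gndr_all <- as.factor(raw))

# count the NAs as well as the observed levels
table(gndr_all,useNA = "always")

# Q2: make NA an explicit level, keep "1" and "2", lump everything else into "other"
gndr_all %>% 
  forcats::fct_explicit_na(na_level = "missing") %>% 
  forcats::fct_other(keep = c("1", "2"),
                      other_level = "other") -> new_gndr

table(new_gndr,useNA = "always")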

In addition, forcats::fct_recode() is useful for recoding categorical variables; note that it leaves unmentioned levels as they are. So a second solution is:

gndr_all %>% 
  forcats::fct_explicit_na(na_level = "missing") %>% 
  forcats::fct_recode(other = "missing",
                      other = "4",
                      other = "5")
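For the named levels asked for in Question #2 (Male, Female, Other), here is a minimal base R sketch that does not depend on forcats at all. It swaps in a plain lookup vector instead of fct_recode(); the names `code_labels`, `gender_chr` and `gender3` are made up for illustration, and `raw` is the toy vector from the reply above.

```r
raw <- c(1, 2, 4, 5, NA, NA)

# Lookup from survey code to label; only the codes we want to keep are listed.
code_labels <- c("1" = "Female", "2" = "Male")

# Codes that are not in the lookup (3, 4, 5) and NAs all come back as NA here.
gender_chr <- unname(code_labels[as.character(raw)])

# Everything that is not Female/Male becomes "Other".
gender_chr[is.na(gender_chr)] <- "Other"

gender3 <- factor(gender_chr, levels = c("Male", "Female", "Other"))

table(gender3, useNA = "always")
```

Note that this folds the NAs into "Other", as the question requested. If missing responses should stay missing, skip the NA replacement for rows where `raw` itself is NA.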

Great, thanks. A follow-up question about including/excluding NAs. Rather than "including" NAs, my main problem is getting rid of them, especially when I calculate percentage tables and draw histograms/bar charts. How do I get rid of them so they don't interfere with my presentation of the data (tables/visuals)? See the attached picture for an example of the problem.

You can take your whole data frame and call na.omit() on it to throw away the records/observations that have NA values in them, if that's what you wish to do.
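For a single variable, a percentage table can also lean on the fact mentioned earlier in the thread that table() ignores NA by default. A small sketch with a made-up vector (the values and the names `counts` and `pct` are only for illustration):

```r
# Toy responses, including two missing values.
raw <- c(1, 2, 2, 4, NA, NA)
gender <- factor(raw)

# table() drops NA unless useNA is set, so the NAs never enter the counts.
counts <- table(gender)

# Percentages are therefore computed over the non-missing responses only.
pct <- round(100 * prop.table(counts), 1)
pct
```

Passing `pct` to `barplot()` would then draw bars for the observed levels only.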

Hey, thanks. I'm familiar with the na.omit() function for the whole dataset, but it does not seem to work. Any idea what is going wrong?

Below I paste the following:

  1. the result when I run na.omit
  2. the result when I (subsequently) run a simple table count/percent

Thx for your input ....

  1. the result when I run na.omit:
> na.omit(newdata)
# A tibble: 0 x 75
# … with 75 variables: id <dbl>, block <dbl>, timestart <dttm>, answertime <dbl>, email <chr>, city <dbl>,   city25MultipleCho <dbl>, city25tx <chr>, greenspace <dbl>, rectrain <dbl>, recwalk <dbl>, recpicnik <dbl>,...... etc etc etc ...
  2. the result when I (subsequently) run a simple table count/percent
> newdata %>%   group_by(gender) %>%   summarise(count = n() / nrow(.)) 
# A tibble: 4 x 2
  gender   count
  <fct>    <dbl>
1 female 0.471  
2 male   0.518  
3 other  0.00660
4 NA     0.00377
Warning message:
Factor `gender` contains implicit NA, consider using `forcats::fct_explicit_na`
> na.omit(newdata) 
> # A tibble: 0 x 75

This implies that there isn't a single observation in newdata without an NA somewhere in one of its columns, which is bad.
However, you also don't assign the result of na.omit(newdata) to any name with <-, as I would have expected. So when you group and summarise newdata, it is the same newdata as before you ran na.omit().
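Both points can be seen on a tiny made-up data frame (the column names echo the thread, but `toy`, `dropped` and `na_per_col` are hypothetical names and the values are invented). The sketch also counts NAs per column, which is one way to find out which columns are responsible for every row being dropped:

```r
# Every row has an NA in at least one column.
toy <- data.frame(
  gender     = factor(c("female", "male", NA, "other")),
  acceptance = c(3, NA, 4, 5),
  email      = c(NA, "a@example.org", NA, NA)
)

# na.omit() returns a new object; it has to be assigned to be kept.
dropped <- na.omit(toy)

nrow(toy)      # the original is unchanged
nrow(dropped)  # no complete rows survive

# How many NAs does each column contribute?
na_per_col <- colSums(is.na(toy))
na_per_col
```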

Wow, that is good to know, thx. I see now what you mean.

I therefore think (?) the solution might be to use na.omit on certain variables included in some calculation/tabulation, rather than the whole dataset, right? In other words, when analyzing variables like GENDER (independent var.) and ACCEPTANCE (dependent variable) I could use these commands (if I understand you correctly)

gender2 <- na.omit(newdata$gender)
acceptance2 <- na.omit(newdata$acceptance)

then run for example.. (or whatever I am analyzing)
xtabs(~ gender2 + acceptance2 )

or would that only cause another problem?

Rather than working with separate vectors and using na.omit() on each, which would break the relationship between an element of one vector and the corresponding element of the other, it makes more sense to me to select() the two columns of interest into their own table, na.omit() that table, and send the result to your summarising function or plot.
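A rough sketch of that idea on invented data. It uses base R column subsetting in place of dplyr::select() so that it runs without extra packages; `toy`, `pair` and `pair_complete` are hypothetical names.

```r
toy <- data.frame(
  gender     = factor(c("female", "male", NA, "other", "male")),
  acceptance = factor(c("yes", NA, "no", "yes", "no")),
  email      = c(NA, NA, NA, NA, NA)  # an unrelated column that is all NA
)

# Keep only the two columns of interest
# (dplyr::select(toy, gender, acceptance) would do the same).
pair <- toy[, c("gender", "acceptance")]

# Drop rows where either of the two is missing; rows stay aligned.
pair_complete <- na.omit(pair)

tab <- xtabs(~ gender + acceptance, data = pair_complete)
tab
```

Because the rows are dropped from the two-column table as a unit, each gender value is still paired with its own acceptance value, and NAs in unrelated columns such as email no longer cost you observations.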
