Cleaning Data for Logistic Regression

Hello Friends,

My goal is to use logistic regression on a data set to determine which factors are significant in predicting a binary factor. I want to tidy the data by assigning binary values to the levels of the columns (lumping together levels that are equivalent in meaning), and change the data types of the columns so as to facilitate logistic regression. I'm trying to convert categorical data, currently stored as factors, into numeric data.

So far, nothing that I've found online has worked. If you could help me solve this problem I would really appreciate it.

Here is an example of a fictional column in my data set. Keep in mind that each column has hundreds of entries that match each of the categorical entries in the "messy" vector below.

messy <- c("N","Y","","Big Y", "(Other)","NA's")
problem <- as.factor(messy)

Thank you, in advance!

Is messy your outcome variable? You don't need to convert binary categories to 0/1 numeric. For logistic regression, the glm function will take care of that internally if you provide it with a two-level factor as the outcome variable.
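
As a minimal sketch of that point, here is a toy fit where the outcome is left as a two-level factor. The data frame `toy` and its columns `hours` and `passed` are made up for illustration and are not from your data.

```r
# Hypothetical toy data: `passed` is a two-level factor, `hours` is numeric.
toy <- data.frame(
  hours  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  passed = factor(c("N", "N", "N", "Y", "N", "Y", "N", "Y", "Y", "Y"),
                  levels = c("N", "Y"))
)

# For a binomial family, glm() treats the first factor level ("N") as
# failure and the remaining level as success, so no 0/1 recoding is needed.
fit <- glm(passed ~ hours, data = toy, family = binomial)

# Fitted values are the estimated probabilities that passed == "Y".
round(fitted(fit), 2)
```

Setting `levels = c("N", "Y")` explicitly controls which level counts as "success"; otherwise the levels are ordered alphabetically.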

To recode messy, you could do something like this:


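library(tidyverse)  # provides %>% (dplyr) and fct_collapse() (forcats)

messy <- c("N","Y","","Big Y", "(Other)","NA's")

# Turn the literal string "NA's" into a real missing value, then merge equivalent levels
clean = fct_collapse(messy %>% replace(., .=="NA's", NA_character_), 
                     Y=c("Y", "Big Y"),
                     N=c("N", "(Other)"))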
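
If you'd rather not depend on a package for this step, a rough base-R sketch of the same idea uses a named lookup vector. The name `lookup` is hypothetical, and treating the empty string as missing is my assumption; adjust it if blanks mean something else in your data.

```r
messy <- c("N", "Y", "", "Big Y", "(Other)", "NA's")

# Named lookup: names are the raw values, elements are the cleaned values.
# Anything not listed (here "" and "NA's") comes back as NA.
lookup <- c("Y" = "Y", "Big Y" = "Y", "N" = "N", "(Other)" = "N")

clean <- factor(unname(lookup[messy]), levels = c("N", "Y"))

# Count each cleaned value, including the missing ones.
table(clean, useNA = "ifany")
```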

If you provide a sample data frame that reflects your use case, we can provide additional suggestions and code more directly tailored to your needs.


Hi joels,

I appreciate your suggestions and timeliness!

While messy is not my outcome variable, it is a contrived example representing a few of the independent variable columns that I'm trying to use in my glm function. I'm excited that you knew to ask that! You're awesome!

Here's maybe a better example of the data frame that I'm actually using:

outcome_var <- as.factor(c("","barely Yes","","","big yes","","","","Big Yes",
             "","big yes and crazy","","bigg yes","","","","","Yes",
indep_var1 <- as.factor(c("","N","Y","Y/n","","","","","N","Y",
indep_var2 <- as.factor(c("kinda","","Y","N","non","none","Yup 1","yup 2","","",
                "kinda","","Y","N","non","none","Yup 1","yup 2","","",
                "kinda","","Y","N","non","none","Yup 1","yup 2","",""))
indep_var3 <- as.factor(sample(20:60, 30, replace = TRUE))
indep_var4 <- as.factor(sample(0:1, 30, replace = TRUE))

messy_2.0 <- data.frame(outcome_var, indep_var1, indep_var2, indep_var3, indep_var4)

(*Note, this example is also contrived. If any of the entries are unclear, please let me know.)

(**Note, I want the independent variables' entries to also be converted into binary numeric values.)

In your solution, I see that you essentially created a new object called "clean" (genius, by the way!). For messy_2.0, would you recommend that I try your same solution for each column, then build a new data frame with the resulting cleaned up columns, and then use the new cleaned up columns in my analysis?

Or, given the slightly hairier nature of the data set, would you change your recommendations at all?

Thanks so much for your help!

In your actual data, are the numeric columns (indep_var3 and indep_var4 in your sample data) coded as factors?

Does the following help? Let me know if I should provide additional explanation about the code.


library(dplyr)

cleaned = messy_2.0 %>%
  # Convert numeric columns coded as factor back to numeric
  mutate_at(vars(paste0("indep_var", 3:4)), ~ as.numeric(as.character(.))) %>% 
  # Convert all categorical columns to character and replace blanks or empty strings with missing value
  mutate_if(is.factor, ~ replace(as.character(.), . %in% c("", " "), NA_character_)) %>% 
  # For all character columns, recode answers to "Y","N", or "Ambiguous" as applicable
  mutate_if(is.character, 
            ~ case_when(grepl("Yes|Yup|Si|big Y", ., ignore.case=TRUE) ~ "Y",
                        grepl("No", ., ignore.case=TRUE) ~ "N",
                        grepl("kinda|y/n", ., ignore.case=TRUE) ~ "Ambiguous",
                        TRUE ~ .)) %>%
  # Convert outcome_var back to factor
  mutate(outcome_var = factor(outcome_var, levels=c("N", "Y")))

model = glm(outcome_var ~ ., data=cleaned, family=binomial)     
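
Pattern-based recoding can easily catch more (or less) than intended, so it's worth cross-tabulating raw values against recoded ones before trusting the result. Below is a rough base-R sketch of that check. `recode_yn` is a hypothetical helper that mirrors the patterns above, assuming case-insensitive matching, and `raw` is a made-up sample of entries.

```r
# Hypothetical helper: same patterns as above, applied in the same priority order.
recode_yn <- function(x) {
  x <- as.character(x)
  x[x %in% c("", " ")] <- NA_character_   # blanks become missing
  is_yes <- grepl("Yes|Yup|Si|big Y", x, ignore.case = TRUE)
  is_no  <- !is_yes & grepl("No", x, ignore.case = TRUE)
  is_amb <- !is_yes & !is_no & grepl("kinda|y/n", x, ignore.case = TRUE)
  out <- x                                # unmatched values pass through unchanged
  out[is_yes] <- "Y"
  out[is_no]  <- "N"
  out[is_amb] <- "Ambiguous"
  out
}

raw <- c("kinda", "", "Y", "N", "non", "none", "Yup 1", "yup 2",
         "Y/n", "big yes and crazy")

# Each raw value should land in exactly the column you expect.
table(raw = raw, recoded = recode_yn(raw), useNA = "ifany")
```

Note that a short pattern like "No" also matches "non" and "none", which is wanted here but could be a surprise with other entries.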

Hi joels,

That helps immensely, thank you!

I'm unfamiliar with most of the functions and notation that you've used. Thank you for your helpful comments in your code!

To answer your earlier question about my actual data, yes: all of the original columns are read in as factors.

I do have a few questions concerning what you've done:

  1. How did you learn how to use all of those functions?!? That's amazing! I'm completely baffled! You're fantastic!

  2. Let's say that indep_var3 and indep_var4 aren't so conveniently named. Let's pretend they're named "Factor A" and "Effect B". How would you incorporate that into your response?

  3. This question is a longer one, please forgive me:
    When I run the first step (as indicated by the first # sign, after the first %>%, in your response), my R Console informs me that "NAs introduced by coercion". I can't tell whether each step introduces NAs, but in any case, there are NAs in the cleaned data frame. Having NAs may not be the problem, but I suspect that it might be important. The problem is that, when I run the glm function now, there are over a hundred "observations deleted due to 'missingness'", according to the glm output. What does R mean by: "deletion" (which I'm interpreting as "exclusion from the logistic regression"), "observations" (which I'm interpreting as "rows"), and "missingness" (which I'm interpreting as NAs)? Am I interpreting those correctly?

  4. Another question that comes to mind is, "Is each row that has an NA excluded from the analysis, or are the individual entries that have NA themselves not included in the logistic regression?"

  5. My data set only has ~3000 rows, so ignoring ~100 rows may affect the accuracy of the logistic regression, and of the p-values that I care about. Supposing that imputation is out of the question, is there anything I can do to resolve this problem?

  6. Let's say that I discover from those who performed the data entry that each NA should be considered as "N". How would you handle that? (I don't know if this is the case, it's just hypothetical)

  7. Is it alright that I'm asking so many questions? I don't want to overstep my bounds. I'm just impressed by how effective your suggestions have been so far, and I really want to learn how to solve this problem.

Thank you so much for your help, patience, and proficiency! I really appreciate you taking the time to teach me how to tackle this problem!

You're awesome!

10 years of working with R; almost every day for the last seven. It helped that I had studied Pascal in programming courses in college and used FORTRAN regularly for a few years in grad school before coming to R (after many intervening years in the wilderness of Excel and point-and-click software, such as Statistica). Check out the free book R for Data Science to help you get going.

You're learning a new language and a new way of thinking. Do it regularly and you will become fluent in an ever wider array of techniques. And as you reach each new level of proficiency, you'll learn how to do things more efficiently and with greater generality. Even though I know a lot about using R, what I know how to do is only a fraction of what I want to be able to do.

This one can be tricky. The idea is to look for regularities and exploit them. mutate_at and mutate_if allow you to create or change multiple columns with a single function. In this case, you could do:

mutate_at(vars(`Factor A`, `Effect B`), ~ as.numeric(as.character(.))) %>%

If you have a lot of columns that follow this pattern (e.g., the word Factor or Effect followed by a space, followed by a letter), you could use a regular expression (another language that's worth learning for convenience when working with text):

mutate_at(vars(matches("(Factor|Effect) [A-Z]")), ~ as.numeric(as.character(.))) %>%

Or, you could operate on them separately, using mutate:

mutate(`Factor A` = as.numeric(as.character(`Factor A`)),  
       `Effect B` = as.numeric(as.character(`Effect B`)))
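
Before relying on a regular expression to pick columns, it can help to test it against the column names directly. Here is a small base-R sketch; the names in `cols` are hypothetical. One thing to keep in mind is that `matches()` ignores case by default, so the middle call is the closest analogue to what it would select.

```r
# Hypothetical column names for illustration.
cols <- c("outcome", "Factor A", "Effect B", "Factor notes", "effect c")

# Case-sensitive: [A-Z] only accepts an upper-case letter after the space.
strict <- grepl("(Factor|Effect) [A-Z]", cols)

# Case-insensitive: "Factor notes" now matches too, because "Factor n" fits.
loose <- grepl("(Factor|Effect) [A-Z]", cols, ignore.case = TRUE)

# Anchoring with ^ and $ requires the whole name to fit the pattern.
anchored <- grepl("^(Factor|Effect) [A-Z]$", cols, ignore.case = TRUE)

data.frame(cols, strict, loose, anchored)
```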

You have some columns that are really numeric, but maybe they had some elements with text in them, like x = factor(c(10, 5, 8, "Missing", 6, " ")).

If you do x = as.numeric(as.character(x)) you get that warning. This is because you're converting the vector (or column of a data frame) to numeric, but "Missing" and " " aren't numbers, so R changes them to missing values, which are coded as NA in R. This is called "coercion". This will also happen if you have a character column that you convert to numeric: x = c(10, 5, 8, "Missing", 6, " "); as.numeric(x).
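
To see which entries are responsible for the new NAs, one approach is to compare the text version of the column with the converted version. This is only a sketch using the example vector above; the names `chr`, `num`, and `bad` are just for illustration.

```r
x <- factor(c(10, 5, 8, "Missing", 6, " "))

chr <- as.character(x)
num <- suppressWarnings(as.numeric(chr))  # silence the coercion warning on purpose

# Entries that were present as text but turned into NA as numbers.
bad <- unique(chr[is.na(num) & !is.na(chr)])
bad
```

If `bad` contains something that should have been a number (say, "1,200" or "12 "), you can clean those strings first and convert afterwards.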

Yes, exactly. glm (and many other R modeling functions) will automatically exclude ("delete") from the model fit all rows (the entire observation) that have at least one missing value in a data column used in the regression formula.
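
You can check how many rows a fit actually used. The sketch below uses a made-up data frame `toy` with one missing value in each of two predictors; `nobs()`, `na.action()`, and `complete.cases()` are base R functions.

```r
# Hypothetical data: row 3 is missing x1, row 4 is missing x2.
toy <- data.frame(
  y  = factor(c("N", "Y", "N", "Y", "Y", "N", "Y", "N", "Y", "N")),
  x1 = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
  x2 = c(2, 1, 3, NA, 5, 4, 6, 8, 7, 9)
)

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

nrow(toy)                   # rows supplied
nobs(fit)                   # rows actually used in the fit
na.action(fit)              # which rows were dropped
sum(complete.cases(toy))    # rows with no NA in any column
```

Both incomplete rows are dropped entirely, even though each is missing only one value.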

See the answer above: the entire row is excluded from the fit, not just the individual missing entry.

Not for a classical regression model. You can only fit the model on data that includes an outcome value and values for each independent variable (whether present or imputed) that you want to include in the model. However, in general, you'll want to try to evaluate whether missing data are missing at random or missing in some systematic way that could bias your results.
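
As a crude first look at missingness (not a formal test), you could count NAs per column and compare the outcome between complete and incomplete rows. The data frame `toy` and its columns below are hypothetical.

```r
# Hypothetical data with missing values in two predictors.
toy <- data.frame(
  outcome = c("Y", "N", "Y", "N", "Y", "N", "Y", "N"),
  age     = c(25, NA, 31, 40, NA, 52, 47, NA),
  group   = c("a", "b", NA, "a", "b", "a", NA, "b"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
colSums(is.na(toy))

# Flag rows that would be dropped by a model using every column.
incomplete <- !complete.cases(toy)

# If the outcome is distributed very differently in the two groups,
# the missingness may not be random.
table(incomplete, outcome = toy$outcome)
```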

You can recode these values in a similar way to recoding other values.

x = c("Y", NA, "N", "Y")
x[is.na(x)] = "N"

Or in a data frame:

d = data.frame(x = c("Y", NA, "N", "Y"), stringsAsFactors = FALSE)

# Method 1
d$x[is.na(d$x)] = "N"

# Method 2
d = d %>% mutate(x = replace(x, is.na(x), "N"))

# Method 3
d = d %>% mutate(x = case_when(is.na(x) ~ "N",
                               TRUE ~ x))
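
If the same rule applied to several columns at once, one possible base-R sketch is to apply the replacement to every character column and leave the numeric ones alone. The data frame and column names here are made up, and whether NA really means "N" in every text column is an assumption you'd need to confirm.

```r
# Hypothetical data: two text columns and one numeric column.
d <- data.frame(
  a = c("Y", NA, "N"),
  b = c(NA, "Y", NA),
  n = c(1, NA, 3),
  stringsAsFactors = FALSE
)

# Identify the character columns.
is_chr <- vapply(d, is.character, logical(1))

# Replace NA with "N" in those columns only; numeric NAs are left untouched.
d[is_chr] <- lapply(d[is_chr], function(col) replace(col, is.na(col), "N"))
d
```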

You, sir, are phenomenal!

Thank you so much for your helpful explanations, detailed examples, and friendly encouragement!

This has been very educational and motivating! Thanks again for your timely and thorough help!

You're awesome!
