Cleaning Data for Logistic Regression

joels · May 23, 2019, 3:46am

I've been working with R for 10 years, almost every day for the last seven. It helped that I had studied Pascal in programming courses in college and used FORTRAN regularly for a few years in grad school before coming to R (after many intervening years in the wilderness of Excel and point-and-click software, such as Statistica). Check out the free book R for Data Science to help you get going.

You're learning a new language and a new way of thinking. Do it regularly and you will become fluent in an ever wider array of techniques. And as you reach each new level of proficiency, you'll learn how to do things more efficiently and with greater generality. Even though I know a lot about using R, what I know how to do is only a fraction of what I want to be able to do.

This one can be tricky. The idea is to look for regularities and exploit them. mutate_at and mutate_if allow you to create or change multiple columns with a single function. In this case, you could do:

mutate_at(vars(`Factor A`, `Effect B`), ~ as.numeric(as.character(.))) %>%

If you have a lot of columns that follow this pattern (e.g., the word Factor or Effect, followed by a space, followed by a letter), you could use a regular expression (another language that's worth learning for convenience when working with text):

mutate_at(vars(matches("(Factor|Effect) [A-Z]")), ~ as.numeric(as.character(.))) %>%

Or, you could operate on them separately, using mutate:

mutate(`Factor A` = as.numeric(as.character(`Factor A`)),  
       `Effect B` = as.numeric(as.character(`Effect B`)))
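To see the pattern-based approach end to end, here is a minimal sketch on a made-up data frame. The column values are hypothetical, and it assumes dplyr 1.0 or later, where across() is the newer way to do what mutate_at does. Note that matches() ignores case by default and does not anchor the pattern, so the sketch first checks the regular expression against the column names with base R's grepl() and anchors it with ^ and $.

```r
library(dplyr)

# Hypothetical data: numbers stored as factors, with some stray text.
d <- data.frame(
  `Factor A` = factor(c("10", "5", "Missing")),
  `Effect B` = factor(c("1.5", " ", "2")),
  Outcome    = c("Y", "N", "Y"),
  check.names = FALSE  # keep the spaces in the column names
)

# Check which column names the pattern selects before using it.
pattern  <- "^(Factor|Effect) [A-Z]$"
selected <- names(d)[grepl(pattern, names(d))]

# Convert every matching column; text that is not a number becomes NA.
# suppressWarnings() hides the coercion warning, so only use it once
# you know which values are being turned into NA.
d2 <- d %>%
  mutate(across(matches(pattern),
                ~ suppressWarnings(as.numeric(as.character(.x)))))
```

Columns that do not match the pattern (Outcome here) are left alone.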

You have some columns that are really numeric, but maybe they had some elements with text in them, like x = factor(c(10, 5, 8, "Missing", 6, " ")).

If you do x = as.numeric(as.character(x)) you get that warning ("NAs introduced by coercion"). This is because you're converting the vector (or column of a data frame) to numeric, but "Missing" and " " aren't numbers, so R changes them to missing values, which are coded as NA in R. This is called "coercion". The same thing happens if you convert a character column to numeric: x = c(10, 5, 8, "Missing", 6, " "); as.numeric(x).
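A small sketch of why the as.character() step matters, and of how you might find the entries that get turned into NA. It reuses the example vector from above; the variable names codes, vals, and bad are just for illustration.

```r
x <- factor(c(10, 5, 8, "Missing", 6, " "))

# as.numeric() directly on a factor returns the internal level codes,
# not the values you see printed.
codes <- as.numeric(x)

# Going through character first recovers the values; anything that
# is not a number becomes NA.
vals <- suppressWarnings(as.numeric(as.character(x)))

# The original entries that could not be converted.
bad <- as.character(x)[is.na(vals)]
```

Looking at bad before you convert a real column is a cheap way to confirm that you're only losing placeholders like "Missing" and not mistyped numbers.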

Yes, exactly. By default (na.action = na.omit), glm (and many other R modeling functions) will automatically exclude ("delete") from the model fit all rows (the entire observation) that have at least one missing value in a data column used in the regression formula.
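You can check how many rows were dropped after fitting. This is a sketch using the built-in mtcars data with three values deliberately set to NA; the model itself (am ~ mpg) is only there for illustration.

```r
d <- mtcars
d$mpg[c(1, 5, 9)] <- NA   # introduce three missing predictor values

fit <- glm(am ~ mpg, data = d, family = binomial)

n_rows  <- nrow(d)          # rows in the data
n_used  <- nobs(fit)        # rows actually used in the fit
dropped <- na.action(fit)   # row indices removed because of NA
```

Here n_used is three less than n_rows, and dropped identifies the rows that were excluded.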

See answer above.

Not for a classical regression model. You can only fit the model on data that includes an outcome value and values for each independent variable (whether present or imputed) that you want to include in the model. However, in general, you'll want to try to evaluate whether missing data are missing at random or missing in some systematic way that could bias your results.
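One rough first check is to compare how often a variable is missing across values of the outcome (or of other observed variables). The sketch below again uses mtcars with artificially inserted NAs. A difference between groups suggests the values are not missing completely at random, but a check like this can't detect missingness that depends on the unobserved values themselves.

```r
d <- mtcars
d$mpg[c(1, 5, 9)] <- NA   # artificial missing values for illustration

# Share of rows with missing mpg within each outcome group (am = 0 or 1).
miss_by_outcome <- tapply(is.na(d$mpg), d$am, mean)
```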

You can recode these values in a similar way to recoding other values.

x = c("Y", NA, "N", "Y")
x[is.na(x)] = "N"
x

Or in a data frame:

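library(dplyr)  # needed for %>%, mutate, and case_when in Methods 2 and 3

# Keep x as character; if x were a factor, Method 3 would fail on the type mismatch
d = data.frame(x = c("Y", NA, "N", "Y"), stringsAsFactors = FALSE)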

# Method 1
d$x[is.na(d$x)] = "N"

# Method 2
d = d %>% mutate(x = replace(x, is.na(x), "N"))

# Method 3
d = d %>% mutate(x = case_when(is.na(x) ~ "N",
                               TRUE ~ x))
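Another option, not one of the three methods above, is dplyr's coalesce(), which fills each NA with the first non-missing value among its arguments. A minimal sketch, assuming x is a character column:

```r
library(dplyr)

d <- data.frame(x = c("Y", NA, "N", "Y"), stringsAsFactors = FALSE)

# Replace each NA in x with "N"; non-missing values are left unchanged.
d <- d %>% mutate(x = coalesce(x, "N"))
```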