You don't need to convert the columns to numeric before converting them to factors; factor() will coerce them back to character anyway.
This is what I prefer when there are a lot of factor columns and/or levels. To keep hardcoded data out of the code, you can save each column's lookup table in a separate CSV. This has the benefits of being (1) language agnostic and (2) easy to edit in Excel (it has to be good for something).
For example, suppose you had a lookup table file named income_levels.csv like this:
level,label
"1","Under $5,000"
"2","$5,000-$7,999"
"3","$8,000-$9,999"
"4","$10,000+"
Then you can use that in the code like this:
library(readr)
income_levels <- read_csv("income_levels.csv")
x[["INCOME"]] <- factor(x[["INCOME"]], income_levels[["level"]], income_levels[["label"]])
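Here's a quick sanity check of that mapping. The data and the inline lookup table are made up for illustration; the inline data frame just mirrors what read_csv would return from income_levels.csv:

```r
# Hypothetical example data: INCOME coded as character "1".."4"
x <- data.frame(INCOME = c("3", "1", "4", "2", "1"))

# Same mapping as income_levels.csv, built inline so this snippet is
# self-contained (no file needed)
income_levels <- data.frame(
  level = c("1", "2", "3", "4"),
  label = c("Under $5,000", "$5,000-$7,999", "$8,000-$9,999", "$10,000+")
)

x[["INCOME"]] <- factor(x[["INCOME"]], income_levels[["level"]], income_levels[["label"]])

levels(x[["INCOME"]])
#> [1] "Under $5,000"  "$5,000-$7,999" "$8,000-$9,999" "$10,000+"
```

Note that the level order in the CSV becomes the factor's level order, which matters for ordered displays and modeling contrasts.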
Even when there are a lot of factor columns, I still prefer a separate recode file for each: it keeps things easy to track, and each file stays easy to edit on its own. The code can rearrange them into the "optimal" shape.
library(dplyr)
library(readr)
factor_columns <- c("INCOME", "RACE", "SEX", "EDUCATION")
level_files <- paste0("level_files/", factor_columns, "_levels.csv")
mappings <- setNames(lapply(level_files, read_csv), factor_columns)
for (fc in factor_columns) {
fc_map <- mappings[[fc]]
x[[fc]] <- factor(x[[fc]], fc_map[["level"]], fc_map[["label"]])
}
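To see the whole round trip working end to end, here is a self-contained sketch using a temporary directory, two made-up columns, and base read.csv in place of readr (the column names and codings are invented for the example):

```r
# Write two hypothetical lookup CSVs to a temp directory
dir <- file.path(tempdir(), "level_files")
dir.create(dir, showWarnings = FALSE)

writeLines(c("level,label", "1,Male", "2,Female"),
           file.path(dir, "SEX_levels.csv"))
writeLines(c("level,label", "1,High school", "2,College", "3,Graduate"),
           file.path(dir, "EDUCATION_levels.csv"))

# Made-up data coded as characters
x <- data.frame(SEX = c("2", "1", "2"), EDUCATION = c("3", "1", "2"))

factor_columns <- c("SEX", "EDUCATION")
level_files <- file.path(dir, paste0(factor_columns, "_levels.csv"))

# Name the list so we can index it by column name in the loop;
# colClasses keeps the "level" column as character, matching x
mappings <- setNames(
  lapply(level_files, read.csv, colClasses = "character"),
  factor_columns
)

for (fc in factor_columns) {
  fc_map <- mappings[[fc]]
  x[[fc]] <- factor(x[[fc]], fc_map[["level"]], fc_map[["label"]])
}

levels(x[["SEX"]])
#> [1] "Male"   "Female"
```

The naming step matters: lapply returns an unnamed list, so without it, mappings[[fc]] would fail when indexed by column name.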
Not counting the library
calls, that's 7 lines of code for handling four columns. Expanding it to 20 columns will only add two or three extra lines for including them in factor_columns
. Sure, you'll need to manually create the level/label mappings, but that's unavoidable.*
* It's totally avoidable if the mappings are already defined in an HTML or text file. You can use string processing to extract and arrange the information you want. Just be sure to save the results rather than the fancy regex code that generates them: it's easier to clean up plain text than to fine-tune a regex. Keep the regex part as a separate script.
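For example, here is a rough sketch of that extraction step, assuming a hypothetical codebook with one "code = label" line per level (your codebook's format will differ, so treat this as a template):

```r
# Hypothetical codebook excerpt: "code = label" lines, one per level
codebook <- c(
  "1 = Under $5,000",
  "2 = $5,000-$7,999",
  "3 = $8,000-$9,999",
  "4 = $10,000+"
)

# Split each line on the first " = " into level and label
parts <- regmatches(codebook, regexpr(" = ", codebook), invert = TRUE)
income_levels <- data.frame(
  level = vapply(parts, `[`, character(1), 1),
  label = vapply(parts, `[`, character(1), 2)
)

# Save the *result* once; keep this script separate from the analysis code
write.csv(income_levels, "income_levels.csv", row.names = FALSE)
```

After this runs once, the analysis code only ever touches income_levels.csv, and hand-fixing a mangled label means editing one CSV cell rather than re-tuning the regex.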