How to use janitor::clean_names() in a pipeline for an MLR learner

gabrielsaul · February 6, 2024, 5:20pm

I am trying to clean the column names of my data after scaling and dummy encoding it.

I set up the task and learner like this:

# Task for classification.
    data.task = makeClassifTask(id = data_id,
                                data = data,
                                target = target,
                                positive = target_values[POS_CLV_INDEX])

# Learner: Random Forest.
    lrn = makeLearner("classif.randomForest", 
                      predict.type = "prob", 
                      fix.factors.prediction = TRUE)

Then I form a CPO pipeline with the learner:

# Normalisation/dummy encode.
    data.lrn = cpoScale() %>>% cpoDummyEncode() %>>% lrn

Ideally, I want to have something like:

# Normalisation/dummy encode, then clean names.
    data.lrn = cpoScale() %>>% cpoDummyEncode() %>>% janitor::clean_names() %>>% lrn

that will clean the column names after the scaling and dummy encoding (as new column names will have been formed by then). However, I get an error saying that an argument to clean_names() is missing, with no default. The documentation says that clean_names() should work in a pipeline, but I'm not sure how to use it in this context.

The underlying issue is that some values in the data frame contain spaces, so dummy encoding creates column names that the model/predictor does not recognise.
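
One possible workaround, sketched here under assumptions rather than as a confirmed fix, is to clean the factor levels before the task is created, so that dummy encoding never sees values with spaces. The toy data frame, the helper clean_factor_levels() and its skip argument are hypothetical names made up for illustration; only janitor::make_clean_names() comes from the janitor package.

```r
library(janitor)

# Hypothetical toy data: "city" stands in for a factor whose values contain spaces.
df <- data.frame(
  city   = factor(c("New York", "Los Angeles", "New York")),
  income = c(52, 61, 48)
)

# Hypothetical helper: rewrite the levels of every factor column with
# make_clean_names(), leaving the columns listed in `skip` untouched
# (for example the target, whose positive class value would otherwise change).
clean_factor_levels <- function(d, skip = character(0)) {
  is_fac <- vapply(d, is.factor, logical(1)) & !(names(d) %in% skip)
  d[is_fac] <- lapply(d[is_fac], function(f) {
    levels(f) <- janitor::make_clean_names(levels(f))
    f
  })
  d
}

df_clean <- clean_factor_levels(df)
levels(df_clean$city)
```

The cleaned data frame could then be passed as the data argument of makeClassifTask(). Passing the target column name in skip keeps the target levels, and therefore the positive class value, unchanged.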

DBScan · February 7, 2024, 12:28pm

If you want to clean the contents of a column, you can use make_clean_names().
clean_names() needs a data.frame as input; maybe you don't have one after cpoScale() %>>% cpoDummyEncode()?
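
To illustrate the difference, here is a minimal sketch with made-up column names (they are an assumption, not the names cpoDummyEncode() actually produces). make_clean_names() works on a character vector, while clean_names() expects a data.frame as its first argument. Writing janitor::clean_names() with empty parentheses calls it immediately with no data, which is the likely source of the missing-argument error; the %>>% operator composes CPO objects and does not pass data into an ordinary function call.

```r
library(janitor)

# Hypothetical column names of the kind dummy encoding might produce.
raw_names <- c("city.New York", "city.Los Angeles", "income")

# make_clean_names() takes a character vector and returns cleaned names.
cleaned <- make_clean_names(raw_names)
cleaned

# clean_names() takes a data.frame and returns it with cleaned column names.
df <- data.frame(1:2, 3:4, 5:6)
names(df) <- raw_names
df_clean <- clean_names(df)
names(df_clean)

# Calling clean_names() with no data.frame fails, as in the question.
res <- try(clean_names(), silent = TRUE)
inherits(res, "try-error")
```

Within an mlrCPO pipeline, a renaming step would presumably have to be wrapped as a CPO itself (mlrCPO provides makeCPO() for custom operators); that is not shown here.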
