xgboost works with add_formula but not with recipe

avargas · April 9, 2021, 5:58pm

Hi! I'm trying to fit an xgboost model (regression) for some Airbnb data. I'm using the tidymodels framework. I go through my usual steps when working with tidymodels:

Split data

data_split <- initial_split(listings_regre,
                            strata = "y",
                            prop = 0.8)
data_train <- training(data_split)
data_test  <- testing(data_split)

Create recipe

rec <- recipe(y  ~ ., data = data_train) %>% 
  step_nzv(all_nominal()) %>%
  step_dummy(all_nominal())

Create model

xgb_mod <-
  boost_tree() %>% 
  set_engine('xgboost') %>%
  set_mode('regression')

Create workflow

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>% 
  add_recipe(rec)

Fit model

xgb_fit <- xgb_flow %>% 
  last_fit(split = data_split)

Then I get:

preprocessor 1/1, model 1/1: Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 682192.\n  'data' accepts either a numeric matrix or a single filename."

But if I change the workflow to

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>% 
  add_formula(y ~ .)

Everything works just fine.

I understood from here that both of these should work, but that is not happening. Does anybody know what is wrong with my recipe? I prefer working with recipes, so I'd rather use the first option.

Thank you in advance

nirgrahamuk · April 10, 2021, 12:41pm

You haven't provided a reprex, so it might be difficult to help you directly with what you are doing.

But I'm going to go out on a limb and guess that the issue is with the data type of the y column: the recipe is probably converting all the predictors but not the outcome, and xgboost doesn't know how to target a character outcome?

I'm away from the computer so can't yet test my theory on a made up example, at this time.
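One way to check a theory like this without the real data is to look at the class of every column. The sketch below uses base R only and a made-up data frame (`toy` and its column names are hypothetical, not the poster's data). It lists the columns that are neither numeric nor factor, since those would pass through `step_dummy(all_nominal())` unchanged, and it shows that converting a data frame with such a column to a matrix gives a character matrix. Whether that conversion is exactly what happens inside the xgboost engine is an assumption here, but it is consistent with the error message above.

```r
# Hypothetical toy data standing in for the listings; names are made up.
toy <- data.frame(
  y = c(100, 150, 90),
  room_type = factor(c("Entire", "Private", "Entire")),
  last_review = as.Date(c("2021-01-05", "2020-12-24", "2021-02-01")),
  beds = c(2L, 1L, 3L)
)

# First class of every column (some types carry more than one class).
col_classes <- vapply(toy, function(col) class(col)[1], character(1))
print(col_classes)

# Columns that are neither numeric nor factor: a dummy-encoding step
# aimed at nominal columns would leave these untouched.
is_handled <- vapply(
  toy,
  function(col) is.numeric(col) || is.factor(col),
  logical(1)
)
survivors <- names(toy)[!is_handled]
print(survivors)

# A data frame that still contains such a column no longer converts
# to a numeric matrix.
mat <- as.matrix(toy[c("beds", "last_review")])
print(is.character(mat))
```

Running the same two checks on the output of `bake()` for the real recipe should point to the offending columns, if any.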

avargas · April 12, 2021, 2:51pm

You're right, no reprex was provided. Here's my attempt to do so:

Data from: Inside Airbnb, "Get the Data" page (the first city listed, Amsterdam; download listings.csv.gz).
My first, failed attempt:

listings <- read_csv(here("listings.csv") )

listings <- listings %>%
  mutate(Price = parse_number(Price),
         across(where(is.character), as.factor)
  )

data_split <- initial_split(listings,
                            strata = "Price",
                            prop = 0.8)
train <- training(data_split)
test  <- testing(data_split)

rec_xgb <- recipe(Price ~ ., data =  train) %>%  
  step_nzv(all_nominal()) %>%
  step_dummy(all_nominal())

xgb_mod <-
  boost_tree() %>% 
  set_engine('xgboost') %>%
  set_mode('regression')

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>% 
  add_recipe(rec_xgb)

xgb_fit <- xgb_flow %>% 
  last_fit(split = data_split)

Then changed the workflow to:

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>% 
  add_formula(Price ~ .)

And it worked.

As you can see, my dependent variable is not a character; it is numeric.

Thank you for your response and help, hope this helps to clear up the situation

nirgrahamuk · April 12, 2021, 4:21pm

I don't think your demonstration shows that. Doesn't it rather demonstrate that your flow executes when using the formula approach, even though price is a character column?

Furthermore, isn't the column named price, not Price, in the data?


listings <- read_csv("http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2021-02-08/data/listings.csv.gz")
str(listings$price)
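If `str()` reports a character column, the values need to be parsed before they can serve as a regression outcome. A minimal sketch with `readr::parse_number()`, assuming the prices are stored as strings with a currency symbol and thousands separators (the sample values are made up, not taken from the file):

```r
library(readr)

# Made-up price strings in the assumed "$1,200.00" style.
price_raw <- c("$1,200.00", "$85.00")

# parse_number() drops the currency symbol and the grouping commas.
price_num <- parse_number(price_raw)

str(price_raw)
str(price_num)
```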

avargas · April 12, 2021, 4:36pm

I'm sorry for the poor reprex. I couldn't reproduce my problem as it is because we did a lot of preprocessing and translation of variables.

Nonetheless, I think we've figured it out:

Dates were the problem. We were passing date columns to the model without any coercion, and that was causing the error. I will look into step_date and other methods that may help with this issue.

Thank you so much for all your help
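For anyone landing here with the same error, a rough sketch of what the date handling could look like in a recipe. The data frame and column names are made up for illustration, and the choice of features is only an example: `step_date()` derives day-of-week, month and year columns, `step_rm()` drops the original date, and `step_dummy()` then encodes the factor columns, including the ones `step_date()` created.

```r
library(recipes)

# Hypothetical toy data; the real listings have many more columns.
toy <- data.frame(
  price = c(100, 150, 90, 120),
  room_type = factor(c("Entire", "Private", "Entire", "Shared")),
  last_review = as.Date(c("2021-01-05", "2020-12-24",
                          "2021-02-01", "2020-11-15"))
)

rec <- recipe(price ~ ., data = toy) %>%
  # Derive features from the date instead of passing it through raw.
  step_date(last_review, features = c("dow", "month", "year")) %>%
  # Drop the original Date column so it never reaches the model.
  step_rm(last_review) %>%
  # Encode all factor predictors, including the new dow/month columns.
  step_dummy(all_nominal_predictors())

baked <- bake(prep(rec), new_data = NULL)

# Every remaining column should now be numeric.
print(vapply(baked, is.numeric, logical(1)))
```

With a recipe along these lines added to the workflow via add_recipe(), the model should only ever see numeric columns.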

nirgrahamuk · April 12, 2021, 4:47pm

Glad you figured it out, and thanks for sharing the date info
