Error: `data` and `reference` should be factors with the same levels.

gabrielburcea · May 29, 2020, 11:55am

Hello,

I get an error when I try to compute a confusion matrix:

> Error: `data` and `reference` should be factors with the same levels.

This is what I am doing:
I am writing my own function to find the best ROC cutoff points for the training and validation datasets.

AccuracyCutoffInfo <- function(train, test, predict, actual) 
{
  
  #change the cutoff value's range as you please
  
  cutoff <- seq(.1 , .9, by = .05)
  
  accuracy <- lapply(cutoff, function(c)
    
  {
    
    # use the confusionMatrix from the caret package
    
    cm_train <- confusionMatrix(train[[predict]] > c, train[[actual]])
    cm_test <- confusionMatrix(test[[predict]] > c, test[[actual]] )

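    # assumed definition of dt: one row holding the cutoff and both accuracies
    dt <- data.table(cutoff = c,
                     train = cm_train$overall[["Accuracy"]],
                     test = cm_test$overall[["Accuracy"]])

    return(dt)    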
    
  }) %>% rbindlist()
  
  accuracy_long <- gather(accuracy, "data", "accuracy", -1)
  
  plot <- ggplot(accuracy_long, aes(cutoff, accuracy, group = data, color = data)) +
    geom_line(size = 1) + geom_point(size = 3) +
    scale_y_continuous(label = percent) +
    ggtitle("Train/Test Accuracy for Different Cutoff")
  
  return(list(data = accuracy, plot = plot))
}
  
theme_set(theme_minimal())

I then call the function like this:

> accuracy_info <- AccuracyCutoffInfo(train = train, test = validate, predict = "pred", actual = "real")

Both my training and validation datasets already store the class column (real) as a factor, as the dput output shows.

For training data set my dput is:

structure(list(shortness_breath = structure(c(1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(2L, 
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    asthma = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), diabetes_type_one = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    diabetes_type_two = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), hypertension = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), lung_condition = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), kidney_disease = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    Covid_tested = c("negative", "negative", "negative", "negative", 
    "negative", "negative"), Age = c(42, 53, 42, 50, 27, 26), 
    Gender = c("Female", "Female", "Female", "Male", "Female", 
    "Male"), pred = c(`1` = 0.194445511752173, `2` = 0.157691990854952, 
    `3` = 0.158715363855891, `4` = 0.157970559536371, `5` = 0.160119548044875, 
    `6` = 0.160213516891202), real = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"), problems = structure(list(
    row = c(2910L, 35958L), col = c("how_unwell", "how_unwell"
    ), expected = c("a double", "a double"), actual = c("How Unwell", 
    "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'", 
    "'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
    )), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))

And then for validation dataset my dput is:

structure(list(shortness_breath = structure(c(1L, 2L, 2L, 1L, 
1L, 1L), .Label = c("No", "Yes"), class = "factor"), obesity = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    asthma = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), diabetes_type_one = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    diabetes_type_two = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), hypertension = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    heart_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), lung_condition = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    liver_disease = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
    "Yes"), class = "factor"), kidney_disease = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
    Covid_tested = c("negative", "negative", "negative", "negative", 
    "negative", "negative"), Age = c(63, 19, 31, 26, 30, 45), 
    Gender = c("Male", "Female", "Male", "Male", "Female", "Female"
    ), pred = c(`1` = 0.26594006201297, `2` = 0.160872548705087, 
    `3` = 0.159744118695227, `4` = 0.160213516891202, `5` = 0.159837909145038, 
    `6` = 0.15843572889978), real = structure(c(1L, 2L, 2L, 1L, 
    1L, 1L), .Label = c("No", "Yes"), class = "factor")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"), problems = structure(list(
    row = c(2910L, 35958L), col = c("how_unwell", "how_unwell"
    ), expected = c("a double", "a double"), actual = c("How Unwell", 
    "How Unwell"), file = c("'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'", 
    "'/Users/gabrielburcea/Rprojects/data/data_lev_categorical_no_sev.csv'"
    )), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)))

Max · May 29, 2020, 3:54pm

Is this from caret?

If so:

Error: data and reference should be factors with the same levels.

means that you need to give it factors as inputs (train[[predict]] > c is a logical vector, not a factor). Try using factor(ifelse(...), levels = ...) instead.
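To make the point concrete, here is a minimal sketch on made-up toy data (the values below are not from the thread). It assumes that `pred` is the predicted probability of the second factor level ("Yes"); if your model predicts the first level, swap `lvl[2]` and `lvl[1]`. The caret call only runs if the package is installed.

```r
# Toy stand-ins for train[[actual]] and train[[predict]] (made-up values)
obs    <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))
pred   <- c(0.20, 0.70, 0.40, 0.10)
cutoff <- 0.5

# The comparison yields a logical vector, which confusionMatrix() rejects
raw <- pred > cutoff
class(raw)

# Assumption: pred is the probability of the second level ("Yes")
lvl <- levels(obs)
pred_class <- factor(ifelse(raw, lvl[2], lvl[1]), levels = lvl)

# Both inputs are now factors with identical levels
identical(levels(pred_class), levels(obs))

# Run the caret call only when caret is available
if (requireNamespace("caret", quietly = TRUE)) {
  cm <- caret::confusionMatrix(data = pred_class, reference = obs)
  print(cm$overall[["Accuracy"]])
}
```

Passing `levels = lvl` explicitly matters: if every prediction falls on one side of the cutoff, `factor()` would otherwise create a single-level factor and the same error would come back.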

gabrielburcea · May 29, 2020, 5:10pm

hello Max,

Yes, it is caret. However, I am a bit confused about where and how I should supply those levels. Could you please illustrate with a full code example?

Max · May 29, 2020, 5:39pm

This might be the solution:

obs <- train[[actual]]
lvl <- levels(obs)
new_pred <- ifelse(train[[predict]] > c, lvl[1], lvl[2])
new_pred <- factor(new_pred, levels = lvl)

I haven't loaded the data so you'll have to check.
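For a fuller picture, here is a rough sketch of how the conversion could be applied at every cutoff. The names `to_class` and `accuracy_by_cutoff` are hypothetical helpers, the toy data frames are made up, and accuracy is computed with base R (`mean()` of matches) so the example runs without caret. The factors it builds could equally be passed to `confusionMatrix()`. It assumes the probability column refers to the second factor level; swap `lvl[2]` and `lvl[1]` if yours refers to the first.

```r
# Hypothetical helper: turn probabilities into a factor with the observed levels.
# Assumption: prob is the probability of the second level of obs.
to_class <- function(prob, obs, cutoff) {
  lvl <- levels(obs)
  factor(ifelse(prob > cutoff, lvl[2], lvl[1]), levels = lvl)
}

# Hypothetical helper: train/test accuracy for each cutoff, in base R
accuracy_by_cutoff <- function(train, test, predict, actual,
                               cutoffs = seq(0.1, 0.9, by = 0.05)) {
  rows <- lapply(cutoffs, function(co) {
    train_class <- to_class(train[[predict]], train[[actual]], co)
    test_class  <- to_class(test[[predict]],  test[[actual]],  co)
    data.frame(
      cutoff = co,
      train  = mean(train_class == train[[actual]]),
      test   = mean(test_class  == test[[actual]])
    )
  })
  do.call(rbind, rows)
}

# Made-up toy data with the same column names as in the question
train <- data.frame(
  pred = c(0.19, 0.16, 0.62, 0.81),
  real = factor(c("No", "No", "Yes", "Yes"), levels = c("No", "Yes"))
)
validate <- data.frame(
  pred = c(0.27, 0.55, 0.16, 0.70),
  real = factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))
)

acc <- accuracy_by_cutoff(train, validate, predict = "pred", actual = "real")
head(acc)
```

The resulting data frame has one row per cutoff with `cutoff`, `train` and `test` columns, so it can be reshaped to long format and plotted the same way as in the original function.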
