Hi,
I'm playing around with a dummy dataset that contains locations along with time and day information. I am reverse engineering an exercise from a website, but I am unable to recreate the expected results on my local machine. I keep getting the following warnings and error:
Warning: predict.naive_bayes(): only 1 feature(s) out of 2 defined in the naive_bayes object "locmodel" are used for prediction.
Warning: predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
Error: predict.naive_bayes():
1 feature is discrete, and compared to the corresponding probability table it misses some levels or has more levels.
Other possibility: there is type mismatch between training data and newdata (for instance, some variable should be numeric but is character/factor).
I did not know the best way to share the data, so I uploaded it to Kaggle. I wanted the code to download the dataset directly from Kaggle, but I was unsure how to do that, so I'll just share the link in the hope that you can download it.
Locations_Dummy_NaiveBayes | Kaggle
library(naivebayes)
library(tidyverse)
locations <- read_csv("locations.csv")
# The exercise relies on two objects that were preloaded in its
# environment, so I had to reverse engineer them.
# The objects in question are:
# weekend_evening & weekend_afternoon
# To determine what these objects were, I checked their classes in the
# online R console and printed them to see what came up.
# Below is the online R console output:
# > weekend_afternoon
#    daytype  hourtype location
# 85 weekend afternoon     home
# > class(weekend_afternoon)
# [1] "data.frame"
# > weekend_evening
#    daytype hourtype location
# 91 weekend  evening     home
# > class(weekend_evening)
# [1] "data.frame"
# Based on the console output I deduced that these were "simple" data frames,
# which I could quickly build as seen below:
weekday_afternoon <- tibble(
  datatype = "weekday",
  hourtype = "aternoon",
  location = "office"
) %>% mutate(datatype = as.factor(datatype),
             hourtype = as.factor(hourtype),
             location = as.factor(location))
weekday_evening <- tibble(
  datatype = "weekday",
  hourtype = "evening",
  location = "home"
) %>% mutate(datatype = as.factor(datatype),
             hourtype = as.factor(hourtype),
             location = as.factor(location))
# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)
# Predict my location on a weekday afternoon
predict(locmodel, weekday_afternoon)
# Predict my location on a weekday evening
predict(locmodel, weekday_evening)
I believe I was successfully recreating the exercise's behavior; nonetheless, I keep getting the error described above.
What I found intriguing about these objects, weekend_afternoon and weekend_evening, is that when printed they appear to hold a single observation of 3 variables, yet their factors carry more levels than that one observation could account for. Maybe, for this to work, I need to set these factor levels on my data frames as well. If so, how? I thought that factor levels were derived from the observations present for those variables. Can I stipulate all the different factor levels regardless of whether an observation with that level exists in the data frame?
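To make the question concrete, here is a minimal sketch of what I mean by stipulating levels up front. It uses base R's factor() with its levels argument. The names day_levels, hour_levels and one_row are my own placeholders, and the last two hourtype levels are guesses, since the console output truncates them.

```r
# daytype levels are taken from the online console's str() output.
# The last two hourtype levels are ASSUMED (str() truncates them),
# so they would need to be replaced with the real ones.
day_levels  <- c("weekday", "weekend")
hour_levels <- c("afternoon", "evening", "morning", "night")

# One observation, but each factor declares its full set of levels.
one_row <- data.frame(
  daytype  = factor("weekend", levels = day_levels),
  hourtype = factor("evening", levels = hour_levels)
)

nlevels(one_row$hourtype)  # counts declared levels, not observed values
levels(one_row$daytype)
```

If this is the right idea, I suppose the same levels could be taken from the training data instead of being typed by hand.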
# Don't run this part since this output comes from the online console.
str(weekend_evening)
'data.frame': 1 obs. of 3 variables:
$ daytype : Factor w/ 2 levels "weekday","weekend": 2
$ hourtype: Factor w/ 4 levels "afternoon","evening",..: 2
$ location: Factor w/ 7 levels "appointment",..: 3
# Calling weekend_evening outputs only this:
> weekend_evening
   daytype hourtype location
91 weekend  evening     home
str(weekend_afternoon)
'data.frame': 1 obs. of 3 variables:
$ daytype : Factor w/ 2 levels "weekday","weekend": 2
$ hourtype: Factor w/ 4 levels "afternoon","evening",..: 1
$ location: Factor w/ 7 levels "appointment",..: 3
# Calling weekend_afternoon outputs only this:
> weekend_afternoon
   daytype  hourtype location
85 weekend afternoon     home
How can weekend_evening and weekend_afternoon be data frames with a single observation and still have many factor levels? I believe this might be the issue; otherwise I have no clue why I get this error.
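One guess on my side, and it is only a guess: the row names 85 and 91 look as if these objects were subset from a larger data frame, and base R keeps a factor's full level set when rows are subset. A small sketch with made-up toy data (the names toy and single are hypothetical and not part of the exercise):

```r
# Toy data standing in for the real data set (values are made up).
toy <- data.frame(
  daytype  = factor(c("weekday", "weekend", "weekday")),
  hourtype = factor(c("afternoon", "evening", "evening"))
)

# Take one row; the original row name "2" is kept.
single <- toy[2, ]

str(single)
levels(single$daytype)              # both levels survive the subset
levels(droplevels(single)$daytype)  # droplevels() removes the unused ones
```

If that is what happened in the exercise, it would explain a single observation carrying every level of the original columns.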
Thanks for your time.