The following code illustrates the problem:
library(caret)
library(NeuralNetTools)
Fruit<-c("Banana", "Apple", "Banana", "Orange", "Apple")
Origin<-c("New Guinea", "China","Germany", "USA", "Germany")
Quality<-c("Good", "Bad", "Good", "Very bad", "Decent")
Value<-c(50,75,80,60,30) #cents
Price<-c(1,2,1,3,1) #euros
Fruits<-data.frame(Fruit, Origin, Quality, Value, Price)
n <- 5
Fruits.replicated<-do.call("rbind", replicate(n, Fruits, simplify = FALSE))
KI <- train(
  x = as.data.frame(subset(Fruits.replicated, select=-c(`Price`))),
  y = Fruits.replicated$`Price`,
  method = "nnet",
  preProc = c("center", "scale"),
  trControl = trainControl(
    search = "random",
    allowParallel = TRUE,
    savePredictions = "final"
  ),
  tuneLength = 5,
  maxit = 500,
  MaxNWts = 5000,
  linout = TRUE,
  trace = TRUE
)
Olden<-olden(KI$finalModel, bar_plot = FALSE)
Garson<-garson(KI$finalModel, bar_plot = FALSE)
This is a simplified example of my problem. It trains an AI (a neural network) that, given a few parameters, tries to predict the price of a product.
I also applied the olden and garson functions, which give the relative importance of each input parameter. If a parameter's importance is low, it may have nothing to do with the outcome and only confuse the AI, so I could drop it and hope for a better result with the remaining parameters. This works well for the column "Value": it is numeric, so it gets a single relative importance, and I can easily see whether to drop it (low importance) or keep it (high importance). It does not work well for string columns, though, because every distinct string gets its own relative importance. In my actual, bigger example I have a huge data frame with 100+ distinct strings, each of which gets a relative importance. The output is messy and I cannot do anything with it, since I am only interested in the column as a whole: should I use it for the AI or not?
Is it possible to "average" the result for each string column, so that garson and olden give me only one value per column, as in the numeric case? For example, if in ColumnA String1 appears in 75% of cases with relative importance 2, and String2 appears in 25% of cases with relative importance 1, I would get 0.75 * 2 + 0.25 * 1 = 1.75 for the whole column, showing how important the column is on average.
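To make the averaging I have in mind concrete, here is a minimal base R sketch of the frequency-weighted mean described above. The helper name column_importance is hypothetical, and so is the assumption that each string input is named "<column><string>" (for example "ColumnAString1"); the real names produced by the model may differ.

```r
# Hypothetical helper: frequency-weighted mean importance per original column.
# importance: named numeric vector with one entry per network input
#             (e.g. "ColumnAString1", "Value"); this naming scheme is assumed.
# data:       data frame holding the predictor columns used for training.
column_importance <- function(importance, data) {
  sapply(names(data), function(col) {
    values <- data[[col]]
    if (is.numeric(values)) {
      # A numeric column has a single input, so there is nothing to average.
      return(unname(importance[col]))
    }
    # Share of rows in which each distinct string occurs.
    freq <- prop.table(table(values))
    # Look up the importance of each string via "<column><string>".
    imp <- importance[paste0(col, names(freq))]
    # Strings without an input of their own are counted as 0 here;
    # this is a choice made for the sketch, not a rule.
    imp[is.na(imp)] <- 0
    sum(as.numeric(freq) * imp)
  })
}

# Numbers from the question: String1 in 75% of rows with importance 2,
# String2 in 25% of rows with importance 1. The value 0.5 is made up.
toy <- data.frame(ColumnA = c("String1", "String1", "String1", "String2"),
                  Value = c(50, 75, 80, 60))
imp <- c(ColumnAString1 = 2, ColumnAString2 = 1, Value = 0.5)
result <- column_importance(imp, toy)
print(result)
```

For the real model, the importance vector would have to be built from the olden or garson output first. Assuming that output is a data frame with the input names as row names, something like setNames(Olden$importance, rownames(Olden)) might work, but I have not verified this. Since olden importances can be negative, positive and negative strings within one column could cancel out in such an average, so averaging absolute values might be the safer variant.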