I have built a neural network in R using the neuralnet package. I used the Adult Income Census dataset and set hidden = 3 (which, as far as I can tell, means a single hidden layer with 3 neurons), as follows:
# Select point of interest & remove from training data.
x.interest = data[1,]
data = data[-1,]
form1 = as.formula(paste("~ ",
                         paste(names(data), collapse = " + ")))
m = model.matrix(form1, data = data)
# Hackish solution here: Select the first value of the class variable in the dataframe.
# Concatenate onto class label.
targ = gsub(" ", "", paste(adult_data@target, data[1,adult_data@target]))
form2 = as.formula(paste(targ,
                         "~",
                         paste(colnames(m)[colnames(m) != targ & colnames(m) != "(Intercept)"], collapse = " + ")))
nn = neuralnet(form2,
               m,
               hidden = 3,
               err.fct = "ce",
               linear.output = FALSE)
As I understand it, the data frame has to be converted to a model matrix for building the neural network as NNs operate only on numeric features. As such, qualitative data has to be converted into dummy variables that signify the presence/absence of a given value.
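To illustrate the dummy-variable conversion described above, here is a minimal sketch on a toy data frame (the `toy` data and its columns are made up for illustration and are not the Adult dataset). With R's default treatment contrasts, `model.matrix` leaves numeric columns unchanged and expands each factor into one 0/1 column per non-reference level:

```r
# Hypothetical toy data; not the Adult Income data.
toy <- data.frame(
  age   = c(52, 50, 38),
  sex   = factor(c("Male", "Female", "Male")),
  class = factor(c("GT50K", "LTE50K", "LTE50K"))
)

# Expand every column into numeric form; factors become 0/1 dummy columns.
mm <- model.matrix(~ age + sex + class, data = toy)

print(colnames(mm))     # intercept, age, and one dummy per non-reference level
print(mm[, "sexMale"])  # 1 where sex == "Male", 0 otherwise
```

Note that the dummy column is named after the variable plus the non-reference factor level, which is determined by the order of the factor levels rather than by the order of the rows.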
I created a new predictor as such, casting the model matrix to a data frame for the data parameter:
pred = Predictor$new(model = nn,
                     data = as.data.frame(m),
                     class = adult_data@target_values[POS_CLV_INDEX],
                     conditional = FALSE)
I then convert the previously selected point of interest to a model matrix and cast it to a data frame for the prediction:
pred$predict(as.data.frame(model.matrix(form1, x.interest)))
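As a sanity check on this step, a sketch like the following (toy data again, all names hypothetical) confirms that a single row taken from a data frame keeps its factor levels, so its model matrix ends up with the same columns as the training matrix:

```r
# Hypothetical toy data standing in for the real data frame.
toy <- data.frame(
  age = c(52, 50, 38, 41),
  sex = factor(c("Male", "Female", "Male", "Female"))
)
point <- toy[1, ]   # single row; the factor levels are retained
train <- toy[-1, ]  # remaining rows used for training

f <- ~ age + sex
m_train <- model.matrix(f, data = train)
m_point <- model.matrix(f, data = point)

# The columns should line up exactly before calling predict.
print(identical(colnames(m_train), colnames(m_point)))
```

If the column names differed, the prediction input would not match what the network was trained on, so this seems worth ruling out first.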
I'm pretty sure something must be wrong, because I've tried a prediction on a point of interest with a positive class value:
age workclass fnlwgt education education_num marital_status occupation relationship race sex
1 52 Self_emp_not_inc 209642 HS_grad 9 Married_civ_spouse Exec_managerial Husband White Male
capital_gain capital_loss hours_per_week native_country class
1 0 0 45 United_States GT50K
pred
1 0.7611381
... and a point of interest with a negative class value:
age workclass fnlwgt education education_num marital_status occupation relationship race sex
2 50 Self_emp_not_inc 83311 Bachelors 13 Married_civ_spouse Exec_managerial Husband White Male
capital_gain capital_loss hours_per_week native_country class
2 0 0 13 United_States LTE50K
pred
1 0.7600538
As shown, both predicted results are similar. Here are my questions:
- What have I done wrong here such that the predictions for these two points of interest are similar?
- As shown in my first code block, I employ a hackish solution for extracting the class label from the model matrix. Essentially, I noted that the class label in the model matrix tends to be the original class label followed by the value of the class label in the first row of the data frame, so I simply concatenated these. However, I know this isn't consistent and could lead to issues when I apply this to other datasets. Is there a general solution to set this column name to, say, the class label followed by the positive class value? I.e. "classGT50K". The end goal is to use this across 5 different datasets, hence why I need a general solution for consistent column names.
- I require that the predictor should give its prediction for a point of interest being in the positive class. I have had this same issue with setting up a predictor with a random forest. I set the class parameter in the predictor constructor call to the positive class value, but this doesn't seem to do anything. How can I guarantee that the predictor is always returning its prediction for the positive class and also includes the positive class value as the column name for the result?
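On the column-naming question, my understanding is that `model.matrix` with default treatment contrasts names the dummy column after the variable plus the non-reference level, and that the reference level is the first factor level, not the first row's value. A minimal sketch of forcing the positive class into the column name, assuming a binary target and using toy data with the hypothetical names `target` and `positive`:

```r
# Hypothetical toy data; not the Adult Income data.
toy <- data.frame(
  age   = c(52, 50, 38),
  class = factor(c("GT50K", "LTE50K", "LTE50K"))
)
target   <- "class"   # hypothetical: name of the class column
positive <- "GT50K"   # hypothetical: positive class value

# Put the negative level first so it becomes the reference level;
# the remaining dummy column is then named after the positive level.
negative <- setdiff(levels(toy[[target]]), positive)
toy[[target]] <- factor(toy[[target]], levels = c(negative, positive))

mm   <- model.matrix(~ ., data = toy)
targ <- paste0(target, positive)  # "classGT50K", independent of row order
print(targ %in% colnames(mm))
```

With the levels ordered this way, the target column equals 1 exactly for the positive class, so the same two variables could be reused across datasets.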
Thank you. I am quite new to R, so apologies for my hacky code.