Hi everyone,

I couldn't find the answer and got so confused by standardization in glmnet...

I have 500 variables (chemicals), and each of them has 3 estimated levels, which means I actually have 1500 variables (X) in the dataset. Now I want to rule out the chemicals that do not play an important role in the outcome (Y), so I'm using glmnet to select them.

I'm fitting glmnet to my training data as follows:

```
# Grid of alpha values for the elastic-net mixing parameter
a <- seq(0.1, 0.9, 0.05)
search <- foreach(i = a, .combine = rbind, .packages = 'glmnet') %dopar% {
  cv <- cv.glmnet(mdlX, mdlY, family = 'binomial', nfolds = 10,
                  type.measure = 'auc', parallel = TRUE,
                  alpha = i, standardize = TRUE)
  data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.min],
             lambda.min = cv$lambda.min, alpha = i)
}
# With type.measure = 'auc', cvm is an AUC, so pick the maximum, not the minimum
cv3 <- search[search$cvm == max(search$cvm), ]
md3 <- glmnet(mdlX, mdlY, family = 'binomial', alpha = cv3$alpha,
              lambda = cv3$lambda.min, standardize = TRUE)
```

I read that the default is `standardize = TRUE` if `family = 'gaussian'`, so I added it to my code explicitly. But then it was indicated that the coefficients would be returned on the original scale.
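For context, here is a small sketch (with simulated data; the variable names are made up, not from my real dataset) of what I understand `standardize = TRUE` to mean: glmnet standardizes internally before applying the penalty, then back-transforms, so the reported coefficients are in each column's own raw units. Re-expressing a column in different units leaves the fit unchanged and only rescales its reported coefficient:

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.5 * X[, 2]))

fit1 <- glmnet(X, y, family = "binomial", alpha = 0.5, lambda = 0.02,
               standardize = TRUE)

# Re-express column 2 in different "units" (e.g. mg instead of g)
X2 <- X
X2[, 2] <- X2[, 2] * 1000
fit2 <- glmnet(X2, y, family = "binomial", alpha = 0.5, lambda = 0.02,
               standardize = TRUE)

# coef() rows: 1 = intercept, 3 = column 2's coefficient
b1 <- coef(fit1)[3, 1]
b2 <- coef(fit2)[3, 1]
b1 / b2  # ≈ 1000: same fit, coefficient just reported in the column's raw units
```

So the internal standardization already handles the different units for the penalty; the coefficients just come back on the original scale.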

So my question is:

Should I still apply `scale(X)` in addition to `standardize = TRUE` in both `cv.glmnet` and `glmnet` if the variables (chemicals) have different units? I ask because, in the end, I'm using the fits to select the variables I need by the Variable Inclusion Probability:

```
# % of the 5 fits in which each coefficient is nonzero
Result <- apply(coeff_df, 2, function(x) sum(x != 0) / 5 * 100)
```
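To make the question concrete, here is a hypothetical reconstruction of how my `coeff_df` is built (simulated stand-ins for `mdlX`/`mdlY`, and I'm assuming 5 bootstrap refits; the resampling scheme and all names here are illustrative, not my exact code):

```r
library(glmnet)

set.seed(3)
mdlX <- matrix(rnorm(200 * 10), 200, 10)     # stand-in for the 1500 predictors
mdlY <- rbinom(200, 1, plogis(mdlX[, 1]))

# One row of coefficients per refit on a bootstrap resample
coeff_df <- do.call(rbind, lapply(1:5, function(k) {
  idx <- sample(nrow(mdlX), replace = TRUE)
  fit <- glmnet(mdlX[idx, ], mdlY[idx], family = "binomial",
                alpha = 0.5, lambda = 0.05, standardize = TRUE)
  as.vector(coef(fit))[-1]                   # drop the intercept
}))

# Variable Inclusion Probability: % of the 5 fits where each coefficient is nonzero
Result <- apply(coeff_df, 2, function(x) sum(x != 0) / 5 * 100)
```

My worry is whether the nonzero/zero pattern used here is affected by the units of the columns, which is why I'm asking about `scale(X)`.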

But I also read in another post that I do not need to standardize beforehand if I use `predict()`. I'm not sure which is the best option.
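On the `predict()` point, my current understanding (sketched below with simulated data, so the names are placeholders) is that because the stored coefficients are already on the original scale, `predict()` expects `newx` in the same raw units as the training matrix, with no `scale()` call in between:

```r
library(glmnet)

set.seed(2)
X <- matrix(rnorm(300), 100, 3)
X[, 3] <- X[, 3] * 50          # a column with much larger "units"
y <- rbinom(100, 1, plogis(X[, 1]))

fit <- glmnet(X, y, family = "binomial", alpha = 0.5, lambda = 0.05,
              standardize = TRUE)

# New observations in the same raw units as the training data
Xnew <- matrix(rnorm(9), 3, 3)
Xnew[, 3] <- Xnew[, 3] * 50
predict(fit, newx = Xnew, type = "response")  # probabilities; no scale() needed
```

Please correct me if that reading is wrong.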

Thank you for reading!