Dear list,
I am trying to dummy code a factor and fit a simple regression model. I used ft_one_hot_encoder to do it, but the results differ from those of lm() in base R. Could you help me understand this, please?
Also, can the ml_* machine learning functions provide standard errors, especially for the regression models? I need them for significance tests.
I have also heard that it might be possible to do the dummy coding with the mutate function; mutate seems quite flexible for creating new columns. Could anyone give me a hint on how to create dummy codes with mutate? Thank you very much.
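To make the question concrete, here is roughly what I have in mind with mutate (only a sketch, assuming sc_mtcars is mtcars copied into Spark via copy_to(); the gear4/gear5 column names are ones I made up):

library(sparklyr)
library(dplyr)

# Sketch: build 0/1 indicator columns by hand with mutate(), treating
# gear == 3 as the reference level, then regress on them directly.
sc_mtcars %>%
  mutate(gear4 = ifelse(gear == 4, 1, 0),
         gear5 = ifelse(gear == 5, 1, 0)) %>%
  ml_linear_regression(hp ~ gear4 + gear5 + wt)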
Please find the code and results below:
sc_mtcars %>%
  ft_one_hot_encoder("gear", "gear1") %>%
  ml_linear_regression(hp ~ gear1 + wt)
Formula: hp ~ gear1 + wt
Coefficients:
(Intercept)     gear1_0     gear1_1     gear1_2     gear1_3     gear1_4          wt
   69.84297     0.00000     0.00000     0.00000   -79.65578  -105.33888    47.76914
> summary(lm(hp ~ as.factor(gear) + wt, data = mtcars))
Call:
lm(formula = hp ~ as.factor(gear) + wt, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-81.069 -21.774  -3.935  11.983  94.621
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        -9.813     38.648  -0.254  0.80143
as.factor(gear)4  -25.683     19.513  -1.316  0.19878
as.factor(gear)5   79.656     23.600   3.375  0.00218 **
wt                 47.769      9.581   4.986 2.88e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.27 on 28 degrees of freedom
Multiple R-squared: 0.7037, Adjusted R-squared: 0.672
F-statistic: 22.17 on 3 and 28 DF, p-value: 1.482e-07
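For comparison, this is the treatment-coded design matrix that lm() builds by default (the first level, gear == 3, is dropped as the reference), which is the coding I am trying to reproduce on the Spark side:

# Base R's default treatment contrasts: only gear4 and gear5 indicators
# appear next to wt, with gear == 3 absorbed into the intercept.
head(model.matrix(hp ~ as.factor(gear) + wt, data = mtcars))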