Dear list,
I am trying to dummy code a factor and fit a simple regression model. I used ft_one_hot_encoder to do it, but the results differ from those of lm() in base R. Could you help me understand this, please?
Also, can the ml_* machine learning functions provide standard errors, especially for the regression models? I need them for significance tests.
I have also heard that it might be possible to do the dummy coding with the mutate function; mutate seems quite flexible for creating new columns. Could anyone give me a hint on how to create dummy codes with mutate? Thank you very much.
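To make the question concrete, here is roughly what I have in mind with mutate (only a sketch, assuming sc_mtcars is mtcars copied into Spark via copy_to(); the gear4/gear5 column names are ones I made up):

library(sparklyr)
library(dplyr)

# Sketch: build 0/1 indicator columns by hand with mutate(), treating
# gear == 3 as the reference level, then regress on them directly.
sc_mtcars %>%
  mutate(gear4 = ifelse(gear == 4, 1, 0),
         gear5 = ifelse(gear == 5, 1, 0)) %>%
  ml_linear_regression(hp ~ gear4 + gear5 + wt)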
Please find the code and results below:
sc_mtcars %>%
  ft_one_hot_encoder("gear", "gear1") %>%
  ml_linear_regression(hp ~ gear1 + wt)
Formula: hp ~ gear1 + wt
Coefficients:
(Intercept)     gear1_0     gear1_1     gear1_2     gear1_3     gear1_4          wt
   69.84297     0.00000     0.00000     0.00000   -79.65578  -105.33888    47.76914
> summary(lm(hp ~ as.factor(gear) + wt, data = mtcars))
Call:
lm(formula = hp ~ as.factor(gear) + wt, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-81.069 -21.774  -3.935  11.983  94.621
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        -9.813     38.648  -0.254  0.80143
as.factor(gear)4  -25.683     19.513  -1.316  0.19878
as.factor(gear)5   79.656     23.600   3.375  0.00218 **
wt                 47.769      9.581   4.986 2.88e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.27 on 28 degrees of freedom
Multiple R-squared: 0.7037, Adjusted R-squared: 0.672
F-statistic: 22.17 on 3 and 28 DF, p-value: 1.482e-07
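For comparison, this is the treatment-coded design matrix that lm() builds by default (the first level, gear == 3, is dropped as the reference), which is the coding I am trying to reproduce on the Spark side:

# Base R's default treatment contrasts: only gear4 and gear5 indicators
# appear next to wt, with gear == 3 absorbed into the intercept.
head(model.matrix(hp ~ as.factor(gear) + wt, data = mtcars))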