Is there a way to display standard errors with ml_linear_regression in sparklyr?

aquev · September 29, 2020, 2:06pm

When running a linear regression using sparklyr, such as:

cached_cars %>%
  ml_linear_regression(mpg ~ .) %>%
  summary()

The results do not include standard errors:

Deviance Residuals:
     Min       1Q   Median       3Q      Max 
-3.47339 -1.37936 -0.06554  1.05105  4.39057 

Coefficients:
(Intercept) cyl_cyl_8.0 cyl_cyl_4.0        disp          hp        drat
16.15953652  3.29774653  1.66030673  0.01391241 -0.04612835  0.02635025
         wt        qsec          vs          am        gear        carb
-3.80624757  0.64695710  1.74738689  2.61726546  0.76402917  0.50935118

R-Squared: 0.8816
Root Mean Squared Error: 2.041

1) Is there a way to display standard errors when running this regression?
2) Is there a way to cluster standard errors in sparklyr?
3) I have also been trying to run a linear model with multiple group fixed effects in sparklyr. In regular R, I have done so with felm from the lfe package. Does anyone have experience doing this in sparklyr?

Solutions using SparkR are also highly appreciated.

Link to StackOverflow question

yitaoli · September 30, 2020, 5:03pm

@aquev Hi, thanks for your interest in sparklyr!

For question 1, you can print the standard error of the coefficients and the intercept with the following:

library(sparklyr)

spark_version <- "2.4.4" # This is the version of Spark I ran this example code with,
# but I think everything that follows should work in all versions of Spark anyways

sc <- spark_connect(master = "local", version = spark_version)

cached_cars <- copy_to(sc, mtcars)
model <- cached_cars %>%
  ml_linear_regression(mpg ~ .)

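# Call summary() on the underlying Spark model object, then read the
# standard errors of the coefficients and the intercept from it
coeff_std_errs <- invoke(model$model$.jobj, "summary") %>%
  invoke("coefficientStandardErrors")

print(coeff_std_errs)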

We probably should make those numbers part of the summary output in R.
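
As a rough sketch of what such a summary could look like, the same invoke() approach can be extended to Spark's tValues and pValues methods and combined with the coefficients sparklyr already returns. The helper names intercept_first and spark_coef_table are made up for illustration and are not part of sparklyr. The sketch assumes the model was fit with an intercept, in which case Spark reports the intercept's value as the last element of each vector, while sparklyr lists (Intercept) first.

```r
# Hypothetical helper (not part of sparklyr): move the last element to the
# front, so Spark's "intercept last" ordering matches sparklyr's
# "(Intercept) first" ordering. Assumes the model has an intercept.
intercept_first <- function(x) {
  n <- length(x)
  c(x[n], x[-n])
}

# Hypothetical helper: build a coefficient table from a model fitted with
# ml_linear_regression(). Nothing here runs until the function is called
# with a fitted model on a live Spark connection.
spark_coef_table <- function(model) {
  spark_summary <- sparklyr::invoke(model$model$.jobj, "summary")

  # Fetch one statistic from the Spark summary and reorder it
  get_stat <- function(method) {
    intercept_first(as.numeric(unlist(sparklyr::invoke(spark_summary, method))))
  }

  data.frame(
    term      = names(model$coefficients),
    estimate  = unname(model$coefficients),
    std.error = get_stat("coefficientStandardErrors"),
    statistic = get_stat("tValues"),
    p.value   = get_stat("pValues")
  )
}
```

With the model from the example above, calling spark_coef_table(model) should give one row per term. These statistics are only available from Spark when the model was fit with the "normal" solver, so treat this as a starting point rather than a general solution.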

I'm not sure I understand exactly what questions 2 and 3 mean. Could you elaborate, ideally with a small example or a link to the relevant formulas? I'll be more than happy to see what can be done in sparklyr to address those use cases.

aquev · September 30, 2020, 5:54pm

This works, thank you!

@yitaoli
For questions 2 and 3, I am essentially trying to run a linear model with multiple fixed effects. In regular R, I would use felm from the lfe package: https://www.rdocumentation.org/packages/lfe/versions/2.8-5.1/topics/felm. Would you know how to run the equivalent of this in sparklyr?

kevinykuo · September 30, 2020, 11:13pm

@aquev For 1), you can use the tidy() function, e.g.

lm_cars <- cached_cars %>%
  ml_linear_regression(mpg ~ .)
tidy(lm_cars)
# A tibble: 11 x 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)  12.3      18.7        0.657  0.518 
 2 cyl          -0.111     1.05      -0.107  0.916 
 3 disp          0.0133    0.0179     0.747  0.463 
 4 hp           -0.0215    0.0218    -0.987  0.335 
 5 drat          0.787     1.64       0.481  0.635 
 6 wt           -3.72      1.89      -1.96   0.0633
 7 qsec          0.821     0.731      1.12   0.274 
 8 vs            0.318     2.10       0.151  0.881 
 9 am            2.52      2.06       1.23   0.234 
10 gear          0.655     1.49       0.439  0.665 
11 carb         -0.199     0.829     -0.241  0.812
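
For readers wondering how the columns relate, here is a small plain-R check (no Spark needed) that recomputes statistic and p.value for the wt row from its estimate and std.error. It assumes a two-sided t-test with residual degrees of freedom equal to the number of rows minus the number of estimated coefficients. Because the inputs are the rounded values printed above, the results only approximately match the table.

```r
# Values copied from the `wt` row of the tidy() output (rounded as printed)
estimate  <- -3.72
std_error <- 1.89

# mtcars has 32 rows; the model estimates 11 coefficients
# (10 predictors plus the intercept)
df_resid <- 32 - 11

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(t_stat)   # close to the -1.96 shown in the table
print(p_value)  # close to the 0.0633 shown in the table
```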

Regarding 2) and 3), Spark ML doesn't support multilevel modeling. A quick search turned up https://github.com/linkedin/photon-ml, which might be worth considering if it has features many users want.
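
One possible workaround for 3), offered only as a sketch: when the number of groups is small, group fixed effects can be approximated by casting the grouping columns to strings, so that the formula interface encodes them as dummy variables (as with the cyl_cyl_8.0 terms in the original output). This is not equivalent to felm: it does not absorb high-dimensional fixed effects efficiently and it does not provide clustered standard errors. The function name fit_with_group_dummies and the choice of cyl and gear as grouping columns are made up for illustration.

```r
# Hypothetical helper (not part of sparklyr): fit a linear model with
# dummy-variable fixed effects for two grouping columns.
# `tbl` is assumed to be a Spark table with the mtcars columns.
fit_with_group_dummies <- function(tbl) {
  # Casting to character makes Spark treat the columns as categorical
  prepared <- dplyr::mutate(
    tbl,
    cyl  = as.character(cyl),
    gear = as.character(gear)
  )

  # String columns in the formula are expanded into dummy variables
  sparklyr::ml_linear_regression(prepared, mpg ~ wt + hp + cyl + gear)
}
```

Calling fit_with_group_dummies(cached_cars) on the table from the question should return a model whose coefficients include one term per non-reference level of cyl and gear.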
