Is there a way to display standard errors with ml_linear_regression in sparklyr?

When running a linear regression using sparklyr, such as:

cached_cars %>%
  ml_linear_regression(mpg ~ .) %>%
  summary()

The results do not include standard errors:

Deviance Residuals:
     Min       1Q   Median       3Q      Max 
-3.47339 -1.37936 -0.06554  1.05105  4.39057 

(Intercept) cyl_cyl_8.0 cyl_cyl_4.0        disp          hp        drat
16.15953652  3.29774653  1.66030673  0.01391241 -0.04612835  0.02635025
         wt        qsec          vs          am        gear        carb
-3.80624757  0.64695710  1.74738689  2.61726546  0.76402917  0.50935118

R-Squared: 0.8816
Root Mean Squared Error: 2.041
  1. Is there a way to display standard errors when running this regression?
  2. Is there a way to cluster standard errors in sparklyr?
  3. I have also been trying to run a linear model with multiple group fixed effects in sparklyr. In R, I have done so with felm from the lfe package. Does anyone have experience doing this in sparklyr?

Solutions using SparkR are also highly appreciated.

@aquev Hi, thanks for your interest in sparklyr!

For question 1, you can print the standard errors of the coefficients and the intercept with the following:


library(sparklyr)
library(dplyr)

spark_version <- "2.4.4" # This is the version of Spark I ran this example code with,
# but I think everything that follows should work in other versions of Spark too

sc <- spark_connect(master = "local", version = spark_version)

cached_cars <- copy_to(sc, mtcars)
model <- cached_cars %>%
  ml_linear_regression(mpg ~ .)

# Read the standard errors from the underlying Spark model's training summary.
# With an intercept fitted, Spark reports the intercept's standard error last.
coeff_std_errs <- invoke(model$model$.jobj, "summary") %>%
  invoke("coefficientStandardErrors")
print(coeff_std_errs)


We probably should make those numbers part of the summary output in R.
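
Until then, the raw vector can be labeled by hand. The following is a minimal sketch in base R: `label_std_errs` is a hypothetical helper, not part of sparklyr, and it assumes the model was fitted with an intercept, so that the intercept's standard error comes last in what Spark returns.

```r
# Hypothetical helper (not part of sparklyr): attach coefficient names to the
# vector returned by Spark's coefficientStandardErrors.
# Assumption: the model was fitted with an intercept, so the intercept's
# standard error is the LAST element.
label_std_errs <- function(std_errs, feature_names) {
  std_errs <- unlist(std_errs)
  stopifnot(length(std_errs) == length(feature_names) + 1)
  n <- length(std_errs)
  # Move the intercept from the last position to the first, matching the
  # order R users know from summary(lm(...)).
  stats::setNames(
    c(std_errs[n], std_errs[-n]),
    c("(Intercept)", feature_names)
  )
}

# Toy input with made-up numbers, only to show the reordering:
label_std_errs(c(0.5, 0.25, 2), c("wt", "hp"))
```

With a fitted model, the feature names would have to be supplied in the same order Spark used for the coefficients; where to read them from the sparklyr model object is left open here.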

I'm not sure I understood exactly what questions 2 and 3 mean. Please elaborate with a small example or a link to the relevant math formulas, if possible. I'll be more than happy to see what can be done in sparklyr to address those use cases.

This works, thank you!

For questions 2 and 3, I am essentially trying to run a linear model with multiple fixed effects. In R, I would use felm. Would you know how to run the equivalent of this in sparklyr?
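
For reference, the kind of model being asked about can be written in standard textbook notation as below. This is a sketch, and the symbols are illustrative assumptions rather than anything defined in this thread: g(i) and h(i) are the two groups that observation i belongs to, alpha and gamma are the corresponding group effects, and in the second formula X is the full design matrix (regressors plus group indicators), theta is the full coefficient vector, and c indexes the clusters.

```latex
% Linear model with two sets of group fixed effects
y_i = x_i^\top \beta + \alpha_{g(i)} + \gamma_{h(i)} + \varepsilon_i

% Cluster-robust ("sandwich") covariance of the estimated coefficients,
% shown without any small-sample correction
\widehat{V}(\hat\theta) = (X^\top X)^{-1}
  \Big( \sum_{c=1}^{C} X_c^\top \hat\varepsilon_c \hat\varepsilon_c^\top X_c \Big)
  (X^\top X)^{-1}
```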

@aquev For 1) you can use the tidy function, e.g.

lm_cars <- cached_cars %>%
  ml_linear_regression(mpg ~ .)
tidy(lm_cars)
# A tibble: 11 x 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)  12.3      18.7        0.657  0.518 
 2 cyl          -0.111     1.05      -0.107  0.916 
 3 disp          0.0133    0.0179     0.747  0.463 
 4 hp           -0.0215    0.0218    -0.987  0.335 
 5 drat          0.787     1.64       0.481  0.635 
 6 wt           -3.72      1.89      -1.96   0.0633
 7 qsec          0.821     0.731      1.12   0.274 
 8 vs            0.318     2.10       0.151  0.881 
 9 am            2.52      2.06       1.23   0.234 
10 gear          0.655     1.49       0.439  0.665 
11 carb         -0.199     0.829     -0.241  0.812 

Regarding 2) and 3) Spark ML doesn't support multilevel modeling. A quick search turned up which might be worth considering if it has features many users want.
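
For a single group fixed effect, one possible workaround is to demean the data by group before fitting an ordinary linear regression. The sketch below uses base R and the built-in mtcars data only, with no Spark involved, and treats `cyl` as the grouping variable purely for illustration. It shows that the dummy-variable form and the demeaned ("within") form give the same slope estimates.

```r
# Base R sketch on the built-in mtcars data; no Spark needed.
# One group fixed effect (cyl) estimated in two equivalent ways.

# 1. Dummy-variable form: one indicator per cyl level.
fit_dummy <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# 2. Within form: subtract each cyl group's mean from every variable,
#    then regress without an intercept.
demean <- function(x, g) x - ave(x, g)
within_data <- data.frame(
  mpg = demean(mtcars$mpg, mtcars$cyl),
  wt  = demean(mtcars$wt,  mtcars$cyl),
  hp  = demean(mtcars$hp,  mtcars$cyl)
)
fit_within <- lm(mpg ~ 0 + wt + hp, data = within_data)

# The slope estimates agree up to floating-point error.
all.equal(coef(fit_dummy)[c("wt", "hp")], coef(fit_within)[c("wt", "hp")])
```

In sparklyr, the demeaning step could presumably be written with `group_by()` and `mutate()` before calling `ml_linear_regression`, though that is an assumption and is not tested here. Two caveats apply: the standard errors from the demeaned regression do not account for the degrees of freedom used by the group means, and with two or more fixed effects a single pass of demeaning is generally not exact, which is why felm iterates.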

