What is the difference between tbl_* and sdf_* functions in sparklyr?

talegari · September 1, 2021, 3:30pm

The amazing sparklyr API provides methods for common dplyr generics (like mutate), plus some tbl_* and sdf_* functions. Sometimes it's confusing to understand which one to use, as they seem to do the same thing.

pivot_wider and sdf_pivot seem to provide similar functionality, with the former being more intuitive. Is the former Spark SQL functionality only?
tbl_cache and sdf_persist seem to do the same thing (both have scanty documentation).

Please help a regular user understand which of these similar-looking functions is appropriate in which situation.

yitaoli · September 2, 2021, 9:54am

pivot_wider() is intended to provide the same interface that tidyr provides, and it offers much more functionality than sdf_pivot() does.
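As a rough illustration of the two interfaces side by side, here is a minimal sketch. It assumes a local Spark installation and a sparklyr version recent enough to support tidyr verbs on Spark dataframes; the data frame `long_df` and the table name `long_tbl` are made up for the example.

```r
library(sparklyr)
library(dplyr)
library(tidyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Hypothetical long-format data: one row per (id, key) pair.
long_df <- data.frame(
  id    = c(1, 1, 2, 2),
  key   = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)
long_tbl <- copy_to(sc, long_df, name = "long_tbl", overwrite = TRUE)

# tidyr interface: same names_from / values_from arguments
# you would use on a local data frame.
wide_tidyr <- long_tbl %>%
  pivot_wider(names_from = key, values_from = value)

# Spark wrapper: formula interface (rows ~ columns), with the aggregation
# given as a named list mapping a column to an aggregate function name.
wide_sdf <- long_tbl %>%
  sdf_pivot(id ~ key, fun.aggregate = list(value = "sum"))

spark_disconnect(sc)
```

Both calls should give one row per `id` with columns `a` and `b`; the practical difference is that pivot_wider() accepts the other tidyr arguments as well, while sdf_pivot() is limited to the formula and the aggregation.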

More generally, the sdf_* family of functions are simply R wrappers for Spark Dataset API functions.

The tbl_* family of functions are S3 methods required for implementing the dplyr backend for Spark dataframes. They should not really be considered part of the user-facing API of sparklyr.

One key difference between *_cache and *_persist is that the latter has a parameter for the level of persistence (none, memory, disk, etc.).
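To make that difference concrete, here is a small sketch, assuming a local Spark installation; the table name `mtcars_tbl` is only for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# tbl_cache() works on a connection plus a table *name* and has
# no argument for choosing a storage level.
tbl_cache(sc, "mtcars_tbl")
tbl_uncache(sc, "mtcars_tbl")

# sdf_persist() works on the Spark dataframe itself and accepts
# an explicit storage level.
persisted <- mtcars_tbl %>%
  filter(cyl == 4) %>%
  sdf_persist(storage.level = "DISK_ONLY")

spark_disconnect(sc)
```

If the storage level does not matter to you, the default of sdf_persist() ("MEMORY_AND_DISK") is usually a reasonable starting point.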
