merge statistics for dplyr joins -- new feature in tidylog

benj · July 26, 2019, 10:32am

Hi,
some of you may remember the tidylog package. Right now, I'm working on improving the output for join operations such as left_join, inner_join, etc., and would welcome feedback on what the package should report.

This is a first draft, loosely modeled on what Stata reports for merges:

> tidylog::left_join(flights[1:10000, ], airlines[1:10, ], by = "carrier")
#>left_join: added one column (name)
#>           rows only in x    2,783
#>           rows only in y  (     0)
#>           matched rows      7,217
#>                           ========
#>           rows total       10,000

(Any time a number is printed in parentheses, it means that those rows are not included in the result.)
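For anyone who wants to sanity-check numbers like these by hand, here is a minimal sketch using dplyr's filtering joins. It is not how tidylog computes its output internally; the toy data and the helper `join_counts()` are made up for illustration.

```r
library(dplyr)

# Toy data, invented for illustration (not the flights/airlines tables above).
x <- tibble(carrier = c("AA", "AA", "UA", "ZZ"), flight = 1:4)
y <- tibble(carrier = c("AA", "UA", "DL"),
            name    = c("American", "United", "Delta"))

# Hypothetical helper, not part of tidylog: counts rows by match status.
join_counts <- function(x, y, by) {
  c(
    only_in_x = nrow(anti_join(x, y, by = by)),  # rows of x without a match in y
    only_in_y = nrow(anti_join(y, x, by = by)),  # rows of y without a match in x
    matched_x = nrow(semi_join(x, y, by = by))   # rows of x with at least one match
  )
}

counts <- join_counts(x, y, by = "carrier")
result <- left_join(x, y, by = "carrier")

print(counts)
print(nrow(result))
```

Because the keys in `y` are unique here, `only_in_x + matched_x` equals the number of rows in the left join result. The rows only in `y` are the ones a left join drops, which is what the parentheses in the output above indicate.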

Because joins are complicated and cover many different use cases, I would welcome additional input on this. It's also possible to test the current implementation (which surely still has some bugs). See the GitHub issue here for more information: https://github.com/elbersb/tidylog/issues/25

Another interesting thing to report would be the number of rows that were duplicated, but I'm not sure yet how to approach this.
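One possible way to think about duplication, sketched under my own assumptions (toy data, a left join, and a made-up column name `n_matches`; this is not an existing tidylog feature): compare the row count before and after the join, and count the rows of `x` whose key occurs more than once in `y`.

```r
library(dplyr)

# Toy data, invented for illustration: id 1 appears twice in y.
x <- tibble(id = c(1, 2, 3))
y <- tibble(id = c(1, 1, 2), value = c("a", "b", "c"))

joined <- left_join(x, y, by = "id")

# Extra rows created because some x rows matched several y rows.
# A left join keeps every x row, so any surplus comes from duplication.
extra_rows <- nrow(joined) - nrow(x)

# Rows of x whose key occurs more than once in y.
y_key_counts <- count(y, id, name = "n_matches")
x_multi <- x %>%
  inner_join(y_key_counts, by = "id") %>%
  filter(n_matches > 1)

print(extra_rows)
print(nrow(x_multi))
```

The two numbers answer slightly different questions: the first is how much the result grew, the second is how many input rows caused the growth. Which of them (if any) belongs in the log output is still open.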

Ben
