merge statistics for dplyr joins -- new feature in tidylog

Some of you may remember the tidylog package. Right now, I'm working on improving the output for join operations such as left_join, inner_join, etc., and would welcome feedback on what the package should report.

This is a first draft, loosely modeled on what Stata reports for merges:

> tidylog::left_join(flights[1:10000, ], airlines[1:10, ], by = "carrier")
#>left_join: added one column (name)
#>           rows only in x    2,783
#>           rows only in y  (     0)
#>           matched rows      7,217
#>                           ========
#>           rows total       10,000

(Any time a number is printed in parentheses, it means that those rows are not included in the result.)
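To make the three categories concrete, here is a rough sketch of how such counts could be derived with plain dplyr verbs. It is not necessarily how tidylog computes them internally; the toy data frames `x` and `y` are made up for illustration and are not the nycflights13 tables used above.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
x <- data.frame(carrier = c("AA", "AA", "UA", "ZZ"), flight = 1:4)
y <- data.frame(carrier = c("AA", "UA", "DL"),
                name = c("American", "United", "Delta"))

# Rows of x whose key has no match in y
only_in_x <- nrow(anti_join(x, y, by = "carrier"))
# Rows of y whose key has no match in x
only_in_y <- nrow(anti_join(y, x, by = "carrier"))
# Rows produced by matching keys
matched <- nrow(inner_join(x, y, by = "carrier"))

result <- left_join(x, y, by = "carrier")

# A left join keeps the x-only rows and the matched rows;
# the y-only rows are dropped, which is why they would be
# printed in parentheses.
stopifnot(nrow(result) == only_in_x + matched)
```

For an inner_join, both the x-only and the y-only counts would be in parentheses, and for a full_join neither would be.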

Because joins are complicated and cover many different use cases, I would welcome additional input on this. You can also test the current implementation (which surely still has some bugs). See the GitHub issue here for more information:

Another interesting thing to report would be the number of rows that were duplicated, but I'm not sure yet how to approach this.
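One possible way to count duplication, offered only as a sketch and not as the planned tidylog behavior: count how often each key occurs in y, attach that count to x, and treat any x row with more than one match as duplicated. The data frames and the names `n_matches`, `duplicated_x_rows`, and `extra_rows` are invented for this example.

```r
library(dplyr)

# Hypothetical toy data: id 1 occurs three times in y
x <- data.frame(id = c(1, 2, 3), val = c("a", "b", "c"))
y <- data.frame(id = c(1, 1, 1, 2), other = c("p", "q", "r", "s"))

# Number of y rows per key value
y_counts <- count(y, id, name = "n_matches")

# Attach the match count to every x row (NA if there is no match)
x_matches <- left_join(x, y_counts, by = "id")

# x rows that appear more than once in the join result
duplicated_x_rows <- sum(x_matches$n_matches > 1, na.rm = TRUE)

# Additional rows created by those duplications
extra_rows <- sum(pmax(x_matches$n_matches - 1, 0), na.rm = TRUE)

joined <- left_join(x, y, by = "id")
stopifnot(nrow(joined) == nrow(x) + extra_rows)
```

This only looks at duplication of x rows caused by repeated keys in y. If keys repeat on both sides, the same count would need to be done in the other direction as well.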


