Dynamic grid which summarizes

Hello,

do you know if there is a way to produce a simple grid which would automatically summarize depending on which columns were selected ?

So for the following dataset....

Sales Profit Category Region
20 10 FISH UK
15 12 FISH US
25 18 FISH AUS
10 10 MEAT UK
19 15 MEAT US
11 20 MEAT AUS
10 30 VEG US

If only the Region column was selected the result would be :-

Sales Profit Region
30 20 UK
44 57 US
36 38 AUS

I'm using an R visual in PowerBI, and I'm passing a column called 'chosen_column' into the R Script which is the name of the column chosen by the user.

Thanks for your help.

Hi,

Welcome to the RStudio community!

Let me start by saying that I have not heard or used PowerBI, so my approach below is just the solution as if it were performed in a regular R script.

I'd also like to point you to the Reprex guide. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

The reason for this is that the way you shared your dataset here is not easy to transfer to R just by copy-paste. Reading the guide, you can learn how to share the data like this in future:

data.frame(
  stringsAsFactors = FALSE,
             Sales = c(20L, 15L, 25L, 10L, 19L, 11L, 10L),
            Profit = c(10L, 12L, 18L, 10L, 15L, 20L, 30L),
          Category = c("FISH", "FISH", "FISH", "MEAT", "MEAT", "MEAT", "VEG"),
            Region = c("UK", "US", "AUS", "UK", "US", "AUS", "US")
)

OK, now for the solution. Here is my approach, assuming you have the dataset as a variable in R and the column of interest as one too:

library(dplyr) #make sure you have the latest version, dplyr 1.0.0+
library(tidyr)

> myData
  Sales Profit Category Region
1    20     10     FISH     UK
2    15     12     FISH     US
3    25     18     FISH    AUS
4    10     10     MEAT     UK
5    19     15     MEAT     US
6    11     20     MEAT    AUS
7    10     30      VEG     US

colOfInterest = "Region"

myData = myData %>% group_by(across(all_of(colOfInterest))) %>% 
  summarise(Sales = sum(Sales), Profit = sum(Profit), .groups = "drop")

> myData
# A tibble: 3 x 3
  Region Sales Profit
  <chr>  <int>  <int>
1 AUS       36     38
2 UK        30     20
3 US        44     57

EXPLANATION
I used the tidyverse implementation here. If you are unfamiliar with this, please read some more here

The group_by and summarise function are basic tidyverse functions (part of dplyr package). The first groups the data by a certain column, the second allows to summarise the other columns. In this case, I assumed that Sales and Profit are always to be summarised.

The more difficult to grasp part is the across(all_of(colOfInterest)) bit. Normally, you would just type the column name in the group_by function like this: myData %>% group_by(Region). However, you don't know beforehand which column to group by, so you need to convert a string given by the user's selection (from PowerBI) to a column name that the tidyverse language accepts.

To do this, you use the new across function (dplyr 1.0.0) combined with the all_of function. Here you can have a vector of one or more strings that convert to column names.

I know this all sounds a bit confusing, and I still struggle with it myself as it's all new in dplyr 1.0.0, so maybe someone else comes up with even a more elegant solution :slight_smile:

Hope this helps,
PJ

Another approach would be to write a function that accepts an arbitrary number of arguments for the grouping variables by passing the dots.

library(dplyr, warn.conflicts = FALSE)

MyData <- data.frame(
  stringsAsFactors = FALSE,
  Sales = c(20L, 15L, 25L, 10L, 19L, 11L, 10L),
  Profit = c(10L, 12L, 18L, 10L, 15L, 20L, 30L),
  Category = c("FISH", "FISH", "FISH", "MEAT", "MEAT", "MEAT", "VEG"),
  Region = c("UK", "US", "AUS", "UK", "US", "AUS", "US")
)

MyFunc <- function(.data, ...) {
  .data %>% 
    group_by(...) %>% 
    summarise(Sales = sum(Sales), Profit = sum(Profit), .groups = "drop")
}

MyFunc(MyData, Category)
#> # A tibble: 3 x 3
#>   Category Sales Profit
#>   <chr>    <int>  <int>
#> 1 FISH        60     40
#> 2 MEAT        40     45
#> 3 VEG         10     30
MyFunc(MyData, Region)
#> # A tibble: 3 x 3
#>   Region Sales Profit
#>   <chr>  <int>  <int>
#> 1 AUS       36     38
#> 2 UK        30     20
#> 3 US        44     57

Created on 2020-07-16 by the reprex package (v0.3.0)

1 Like

Question for @siddharthprabhu :

Your proposed function expects to get a column name that is not in string format or a variable containing one. So if you'd give it a variable, this happens:

colOfInterest = c("Region")

MyFunc(MyData, colOfInterest)

Error: Must group by variables found in `.data`.
* Column `colOfInterest` is not found.

How would you fix that?

PJ

EDIT:
Looking at my related post, I guess now it would be:

colOfInterest = c("Region")

MyFunc(MyData, across(all_of(colOfInterest)))
# A tibble: 3 x 3
  Region Sales Profit
  <chr>  <int>  <int>
1 AUS       36     38
2 UK        30     20
3 US        44     57
1 Like

Hi @pieterjanvc & siddharthprabhu - thanks for your detailed explanations, that's really helpful - and they both work perfectly in PowerBI!

I've been attempting to add a second column of interest, so it would display 2 dimensions -

colOfInterest = "Region"
colOfInterest2 = "Category"

do you know if there is any easy way to do this ?

Hi,

Just use a vector like this:

colOfInterest = c("Region", "Category")

PJ

Thanks @pieterjanvc , most appreciated !

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.