do you know if there is a way to produce a simple grid which would automatically summarize depending on which columns were selected ?
So for the following dataset....
Sales Profit Category Region
20 10 FISH UK
15 12 FISH US
25 18 FISH AUS
10 10 MEAT UK
19 15 MEAT US
11 20 MEAT AUS
10 30 VEG US
If only the Region column was selected the result would be :-
Sales Profit Region
30 20 UK
44 57 US
36 38 AUS
I'm using an R visual in PowerBI, and I'm passing a column called 'chosen_column' into the R Script which is the name of the column chosen by the user.
Let me start by saying that I have not heard or used PowerBI, so my approach below is just the solution as if it were performed in a regular R script.
I'd also like to point you to the Reprex guide. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:
The reason for this is that the way you shared your dataset here is not easy to transfer to R just by copy-paste. Reading the guide, you can learn how to share the data like this in future:
OK, now for the solution. Here is my approach, assuming you have the dataset as a variable in R and the column of interest as one too:
library(dplyr) #make sure you have the latest version, dplyr 1.0.0+
library(tidyr)
> myData
Sales Profit Category Region
1 20 10 FISH UK
2 15 12 FISH US
3 25 18 FISH AUS
4 10 10 MEAT UK
5 19 15 MEAT US
6 11 20 MEAT AUS
7 10 30 VEG US
colOfInterest = "Region"
myData = myData %>% group_by(across(all_of(colOfInterest))) %>%
summarise(Sales = sum(Sales), Profit = sum(Profit), .groups = "drop")
> myData
# A tibble: 3 x 3
Region Sales Profit
<chr> <int> <int>
1 AUS 36 38
2 UK 30 20
3 US 44 57
EXPLANATION
I used the tidyverse implementation here. If you are unfamiliar with this, please read some more here
The group_by and summarise function are basic tidyverse functions (part of dplyr package). The first groups the data by a certain column, the second allows to summarise the other columns. In this case, I assumed that Sales and Profit are always to be summarised.
The more difficult to grasp part is the across(all_of(colOfInterest)) bit. Normally, you would just type the column name in the group_by function like this: myData %>% group_by(Region). However, you don't know beforehand which column to group by, so you need to convert a string given by the user's selection (from PowerBI) to a column name that the tidyverse language accepts.
I know this all sounds a bit confusing, and I still struggle with it myself as it's all new in dplyr 1.0.0, so maybe someone else comes up with even a more elegant solution
Your proposed function expects to get a column name that is not in string format or a variable containing one. So if you'd give it a variable, this happens:
colOfInterest = c("Region")
MyFunc(MyData, colOfInterest)
Error: Must group by variables found in `.data`.
* Column `colOfInterest` is not found.
How would you fix that?
PJ
EDIT:
Looking at my related post, I guess now it would be:
colOfInterest = c("Region")
MyFunc(MyData, across(all_of(colOfInterest)))
# A tibble: 3 x 3
Region Sales Profit
<chr> <int> <int>
1 AUS 36 38
2 UK 30 20
3 US 44 57