Summarize by distinct count of another column (Trying to replicate Excel's pivot table feature, 'summarize by distinct count' and 'summarize by distinct count percent of grant total')

neonowl · May 15, 2022, 5:24pm

Hi RStudio Community,

I'm trying to replicate a summary table in R, which I have successfully created in Excel.

Goal:

Create a summary table of 2 columns (country, and type), and summarize by the distinct count of another column : customer_id.

Given the following sample dataframe:

library(tidyverse)

customer_id <- c('A1', 'A1', 'A1', 'B2', 'B2', 'C3', 'C3', 'D4', 'D4', 'E5')
country <- c('Brazil', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Hawaii')
type <- c('Arabica', 'Robusta', 'Excelsa', 'Arabica', 'Robusta', 'Arabica', 'Excelsa','Robusta', 'Excelsa', 'Arabica')

df <- data.frame(customer_id, country, type)

I want to pivot country and type to the rows, and create two summary columns:

distinct_count_of_customer_id
percentage_of_grand_total_distinct_count_of_customer_id

The attached screenshot from what I've done in Excel should help explain better what I'm trying to accomplish: a pivot table with country and type, summarized by the distinct count of customer ids in the customer_id column.

As you can see, there are 5 customers, so the grand total is 5. (And the grand total also sums to 100%)

I have tried various combinations of group_by(), summarize(), distinct(), and/or distinct_n()... but I just can't make it work after after a lot of research, and trial and error. Really appreciate any advice.

FJCC · May 15, 2022, 6:30pm

Getting the information in the pivot table is easy enough. Melding the two types of summaries together takes a little more work and, frankly, I find the combination confusing. For example, the 3 and the 2 in your summary of Brazil by Type do not result in the 4 in the Brazil Total row. And the 5 and the 100% in your Grand Total are not obviously the sum of any of the preceding numbers. However, this gets you most of the way to the final table.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
customer_id <- c('A1', 'A1', 'A1', 'B2', 'B2', 'C3', 'C3', 'D4', 'D4', 'E5')
country <- c('Brazil', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Hawaii')
type <- c('Arabica', 'Robusta', 'Excelsa', 'Arabica', 'Robusta', 'Arabica', 'Excelsa','Robusta', 'Excelsa', 'Arabica')

df <- data.frame(customer_id, country, type)
df
#>    customer_id    country    type
#> 1           A1     Brazil Arabica
#> 2           A1     Brazil Robusta
#> 3           A1       Peru Excelsa
#> 4           B2     Brazil Arabica
#> 5           B2 Costa Rica Robusta
#> 6           C3     Brazil Arabica
#> 7           C3       Peru Excelsa
#> 8           D4     Brazil Robusta
#> 9           D4 Costa Rica Excelsa
#> 10          E5     Hawaii Arabica

TotalCust <- length(unique(df$customer_id))

SummCountryType <- df |> group_by(country, type) |> 
  summarize(N = n()) |> 
  mutate(Perc = N/TotalCust)
#> `summarise()` has grouped output by 'country'. You can override using the
#> `.groups` argument.
SummCountryType
#> # A tibble: 6 x 4
#> # Groups:   country [4]
#>   country    type        N  Perc
#>   <chr>      <chr>   <int> <dbl>
#> 1 Brazil     Arabica     3   0.6
#> 2 Brazil     Robusta     2   0.4
#> 3 Costa Rica Excelsa     1   0.2
#> 4 Costa Rica Robusta     1   0.2
#> 5 Hawaii     Arabica     1   0.2
#> 6 Peru       Excelsa     2   0.4

SummaryCountry <- df |> group_by(country) |> 
  summarize(N = length(unique(customer_id))) |> 
  mutate(Perc = N/TotalCust)
SummaryCountry
#> # A tibble: 4 x 3
#>   country        N  Perc
#>   <chr>      <int> <dbl>
#> 1 Brazil         4   0.8
#> 2 Costa Rica     2   0.4
#> 3 Hawaii         1   0.2
#> 4 Peru           2   0.4

SummaryCountry$type <- ""
GlueFunc <- function(Country) {
  tmp1 <- filter(SummCountryType, country == Country)
  tmp2 <- filter(SummaryCountry, country == Country) |> 
    mutate(country = paste(country, "Total"))
  rbind(tmp1, tmp2)
}

COUNTRIES <- unique(df$country)

FINAL <- map_dfr(COUNTRIES, GlueFunc)
FINAL
#> # A tibble: 10 x 4
#> # Groups:   country [8]
#>    country          type          N  Perc
#>    <chr>            <chr>     <int> <dbl>
#>  1 Brazil           "Arabica"     3   0.6
#>  2 Brazil           "Robusta"     2   0.4
#>  3 Brazil Total     ""            4   0.8
#>  4 Peru             "Excelsa"     2   0.4
#>  5 Peru Total       ""            2   0.4
#>  6 Costa Rica       "Excelsa"     1   0.2
#>  7 Costa Rica       "Robusta"     1   0.2
#>  8 Costa Rica Total ""            2   0.4
#>  9 Hawaii           "Arabica"     1   0.2
#> 10 Hawaii Total     ""            1   0.2

^{Created on 2022-05-15 by the reprex package (v2.0.1)}

neonowl · May 16, 2022, 2:58am

This did the trick, THANK YOU!

I agree with you, it is a weird way to represent the data (where the totals don't add intuitively in the table) but it works for my use case. I wonder if there's a better way to represent the data in a tabular format -- open to any ideas you have.

(My made-up data probably made it more confusing, but you managed to solve it anyway.)

system · May 25, 2022, 11:39pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.