Summarize by distinct count of another column (Trying to replicate Excel's pivot table feature, 'summarize by distinct count' and 'summarize by distinct count percent of grant total')

Getting the information in the pivot table is easy enough. Melding the two types of summaries together takes a little more work and, frankly, I find the combination confusing. For example, the 3 and the 2 in your summary of Brazil by Type do not result in the 4 in the Brazil Total row. And the 5 and the 100% in your Grand Total are not obviously the sum of any of the preceding numbers. However, this gets you most of the way to the final table.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
customer_id <- c('A1', 'A1', 'A1', 'B2', 'B2', 'C3', 'C3', 'D4', 'D4', 'E5')
country <- c('Brazil', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Brazil', 'Peru', 'Brazil', 'Costa Rica', 'Hawaii')
type <- c('Arabica', 'Robusta', 'Excelsa', 'Arabica', 'Robusta', 'Arabica', 'Excelsa','Robusta', 'Excelsa', 'Arabica')

df <- data.frame(customer_id, country, type)
df
#>    customer_id    country    type
#> 1           A1     Brazil Arabica
#> 2           A1     Brazil Robusta
#> 3           A1       Peru Excelsa
#> 4           B2     Brazil Arabica
#> 5           B2 Costa Rica Robusta
#> 6           C3     Brazil Arabica
#> 7           C3       Peru Excelsa
#> 8           D4     Brazil Robusta
#> 9           D4 Costa Rica Excelsa
#> 10          E5     Hawaii Arabica

TotalCust <- length(unique(df$customer_id))

SummCountryType <- df |> group_by(country, type) |> 
  summarize(N = n()) |> 
  mutate(Perc = N/TotalCust)
#> `summarise()` has grouped output by 'country'. You can override using the
#> `.groups` argument.
SummCountryType
#> # A tibble: 6 x 4
#> # Groups:   country [4]
#>   country    type        N  Perc
#>   <chr>      <chr>   <int> <dbl>
#> 1 Brazil     Arabica     3   0.6
#> 2 Brazil     Robusta     2   0.4
#> 3 Costa Rica Excelsa     1   0.2
#> 4 Costa Rica Robusta     1   0.2
#> 5 Hawaii     Arabica     1   0.2
#> 6 Peru       Excelsa     2   0.4

SummaryCountry <- df |> group_by(country) |> 
  summarize(N = length(unique(customer_id))) |> 
  mutate(Perc = N/TotalCust)
SummaryCountry
#> # A tibble: 4 x 3
#>   country        N  Perc
#>   <chr>      <int> <dbl>
#> 1 Brazil         4   0.8
#> 2 Costa Rica     2   0.4
#> 3 Hawaii         1   0.2
#> 4 Peru           2   0.4

SummaryCountry$type <- ""
GlueFunc <- function(Country) {
  tmp1 <- filter(SummCountryType, country == Country)
  tmp2 <- filter(SummaryCountry, country == Country) |> 
    mutate(country = paste(country, "Total"))
  rbind(tmp1, tmp2)
}

COUNTRIES <- unique(df$country)

FINAL <- map_dfr(COUNTRIES, GlueFunc)
FINAL
#> # A tibble: 10 x 4
#> # Groups:   country [8]
#>    country          type          N  Perc
#>    <chr>            <chr>     <int> <dbl>
#>  1 Brazil           "Arabica"     3   0.6
#>  2 Brazil           "Robusta"     2   0.4
#>  3 Brazil Total     ""            4   0.8
#>  4 Peru             "Excelsa"     2   0.4
#>  5 Peru Total       ""            2   0.4
#>  6 Costa Rica       "Excelsa"     1   0.2
#>  7 Costa Rica       "Robusta"     1   0.2
#>  8 Costa Rica Total ""            2   0.4
#>  9 Hawaii           "Arabica"     1   0.2
#> 10 Hawaii Total     ""            1   0.2

Created on 2022-05-15 by the reprex package (v2.0.1)

1 Like