Classification into 2 groups using dplyr

Hi Team,
I hope you will help me to classify the test result count by specialty unit.
Repex is pasted.
Expected result is in the last column called 'grp' where top10 and other will be seen.
The purpose is to show top 10 test ordering specialties and the rest in a single aggregate report.
Thank you very much.
G

result <- tibble::tribble(
                                            ~RESULT_UNIT,     ~n,    ~grp,
                                         "Emergency Med", 41792L, "top10",
                                      "Liver Transplant", 10909L, "top10",
                                       "Cardiac Surgery",  9838L, "top10",
                                          "Neurosurgery",  5498L, "top10",
                              "Hepatabiliary/Transplant",  5356L, "top10",
                                      "Resp & Sleep Med",  5173L, "top10",
                                            "Cardiology",  4884L, "top10",
                                    "Colorectal Surgery",  4227L, "top10",
                                                 "Renal",  3957L, "top10",
                                         "Gastro/Hepato",  3578L, "top10",
                                      "Upper GI Surgery",  3399L, "other",
                                                      NA,  3042L, "other",
                                       "Emergency 24 Hr",  2877L, "other",
                                         "Endocrinology",  2842L, "other",
                                         "Haematology B",  2674L, "other",
                                                "Spinal",  2639L, "other",
                                                "Stroke",  2512L, "other",
                                   "Infectious Diseases",  2469L, "other",
                                               "Urology",  2278L, "other",
                                              "Vascular",  1957L, "other",
                                         "Haematology A",  1822L, "other",
                                             "Gen Med 3",  1787L, "other",
                                             "Gen Med 2",  1782L, "other",
                                          "Ortho Surg 2",  1621L, "other",
                                             "Gen Med 6",  1573L, "other",
                                      "Thoracic Surgery",  1546L, "other",
                                       "ENT/Head & Neck",  1541L, "other",
                                             "Gen Med 1",  1534L, "other",
                  "General And Trauma Emergency Surgery",  1479L, "other",
                                             "Gen Med 4",  1450L, "other",
                                             "Gen Med 5",  1403L, "other",
                                            "Oncology A",  1305L, "other",
                                            "Oncology B",  1068L, "other",
                              "Maxillary Facial Surgery",  1030L, "other",
                                            "Toxicology",   921L, "other",
                                          "Ortho Surg 3",   907L, "other",
                                             "Neurology",   897L, "other",
                          "Medical Admission Unit (NEW)",   859L, "other",
                                    "Aged Care Services",   833L, "other",
                                          "Ortho Surg 4",   768L, "other",
                                              "Epilepsy",   737L, "other",
                                    "Vic Resp Supp Serv",   651L, "other",
                                       "Plastic Surgery",   617L, "other",
                                        "Intensive Care",   464L, "other",
                                           "Haematology",   327L, "other",
                                            "Gen Med 1B",   303L, "other",
                                        "Ortho 2 - Geri",   284L, "other",
                                        "Breast Surgery",   228L, "other",
                                   "Paediatric Medicine",   222L, "other",
                                          "Rheumatology",   202L, "other",
                                   "Renal Transplant SN",   197L, "other",
                                            "Gen Med 2B",   189L, "other",
                                            "Gen Med 1A",   178L, "other",
                                      "Emerg/Toxicology",   169L, "other",
                                        "Ortho 3 - Geri",   161L, "other",
                                      "General Medicine",   156L, "other",
                                            "Gen Med 3A",   142L, "other",
                              "Ventilation Weaning Unit",   137L, "other",
                                            "Gen Med 2A",   134L, "other",
                                              "Oncology",    91L, "other",
                                         "Haematem/Med5",    88L, "other",
                                            "Gen Med 3B",    81L, "other",
                                              "Dialysis",    62L, "other",
                "Vascular Surgery/Diabetic Foot Service",    61L, "other",
                                         "Haematem/Med3",    45L, "other",
                              "Mental Health - Northern",    33L, "other",
                                         "Haematem/Med1",    32L, "other",
                                      "NEMPS Psychiatry",    26L, "other",
                                             "H&M/Med 6",    23L, "other",
                                           "Gynaecology",    21L, "other",
                                       "Palliative Care",    19L, "other",
                                               "H&M/Med",    17L, "other",
                                           "Geriatric H",    13L, "other",
                                         "Haematem/Med2",    13L, "other",
                                        "Ortho 4 - Geri",    13L, "other",
                                 "General Amputee Rehab",    10L, "other",
                                         "Haematem/Med4",    10L, "other",
                                         "Neuro Rehab 2",    10L, "other",
                                             "Radiology",     9L, "other",
                                          "Spinal Rehab",     8L, "other",
                 "NE Adult Prevention and Recovery Care",     7L, "other",
                                      "Rehabilitation H",     6L, "other",
                                   "Acq Brain Inj Rehab",     5L, "other",
                                      "Child Psychiatry",     5L, "other",
            "Mental Health - Community Recovery Program",     5L, "other",
                                         "Ophthalmology",     5L, "other",
                "Mental Health - Eating & Mood Disorder",     4L, "other",
                            "Mental Health - North West",     4L, "other",
                                         "Radiation Onc",     3L, "other",
                                   "Medical Oncology 6S",     2L, "other",
                              "Psychiatry - St Vincents",     2L, "other",
                                "Rehabilitation - Repat",     2L, "other",
                       "Mental Health - Baby and Mother",     1L, "other",
                                            "Oncology H",     1L, "other"
            )

result
#> # A tibble: 94 x 3
#>    RESULT_UNIT                  n grp  
#>    <chr>                    <int> <chr>
#>  1 Emergency Med            41792 top10
#>  2 Liver Transplant         10909 top10
#>  3 Cardiac Surgery           9838 top10
#>  4 Neurosurgery              5498 top10
#>  5 Hepatabiliary/Transplant  5356 top10
#>  6 Resp & Sleep Med          5173 top10
#>  7 Cardiology                4884 top10
#>  8 Colorectal Surgery        4227 top10
#>  9 Renal                     3957 top10
#> 10 Gastro/Hepato             3578 top10
#> # ... with 84 more rows

Created on 2025-03-25 with reprex v2.1.1

Hi @genghiskhan,

The following summarises the data to keep the top 10 most frequent RESULT_UNIT and collapsing the remaining to a single 'Other' category:

result <- tibble::tribble(
  ~RESULT_UNIT,     ~n,    ~grp,
  "Emergency Med", 41792L, "top10",
  "Liver Transplant", 10909L, "top10",
  "Cardiac Surgery",  9838L, "top10",
  "Neurosurgery",  5498L, "top10",
  "Hepatabiliary/Transplant",  5356L, "top10",
  "Resp & Sleep Med",  5173L, "top10",
  "Cardiology",  4884L, "top10",
  "Colorectal Surgery",  4227L, "top10",
  "Renal",  3957L, "top10",
  "Gastro/Hepato",  3578L, "top10",
  "Upper GI Surgery",  3399L, "other",
  NA,  3042L, "other",
  "Emergency 24 Hr",  2877L, "other",
  "Endocrinology",  2842L, "other",
  "Haematology B",  2674L, "other",
  "Spinal",  2639L, "other",
  "Stroke",  2512L, "other",
  "Infectious Diseases",  2469L, "other",
  "Urology",  2278L, "other",
  "Vascular",  1957L, "other",
  "Haematology A",  1822L, "other",
  "Gen Med 3",  1787L, "other",
  "Gen Med 2",  1782L, "other",
  "Ortho Surg 2",  1621L, "other",
  "Gen Med 6",  1573L, "other",
  "Thoracic Surgery",  1546L, "other",
  "ENT/Head & Neck",  1541L, "other",
  "Gen Med 1",  1534L, "other",
  "General And Trauma Emergency Surgery",  1479L, "other",
  "Gen Med 4",  1450L, "other",
  "Gen Med 5",  1403L, "other",
  "Oncology A",  1305L, "other",
  "Oncology B",  1068L, "other",
  "Maxillary Facial Surgery",  1030L, "other",
  "Toxicology",   921L, "other",
  "Ortho Surg 3",   907L, "other",
  "Neurology",   897L, "other",
  "Medical Admission Unit (NEW)",   859L, "other",
  "Aged Care Services",   833L, "other",
  "Ortho Surg 4",   768L, "other",
  "Epilepsy",   737L, "other",
  "Vic Resp Supp Serv",   651L, "other",
  "Plastic Surgery",   617L, "other",
  "Intensive Care",   464L, "other",
  "Haematology",   327L, "other",
  "Gen Med 1B",   303L, "other",
  "Ortho 2 - Geri",   284L, "other",
  "Breast Surgery",   228L, "other",
  "Paediatric Medicine",   222L, "other",
  "Rheumatology",   202L, "other",
  "Renal Transplant SN",   197L, "other",
  "Gen Med 2B",   189L, "other",
  "Gen Med 1A",   178L, "other",
  "Emerg/Toxicology",   169L, "other",
  "Ortho 3 - Geri",   161L, "other",
  "General Medicine",   156L, "other",
  "Gen Med 3A",   142L, "other",
  "Ventilation Weaning Unit",   137L, "other",
  "Gen Med 2A",   134L, "other",
  "Oncology",    91L, "other",
  "Haematem/Med5",    88L, "other",
  "Gen Med 3B",    81L, "other",
  "Dialysis",    62L, "other",
  "Vascular Surgery/Diabetic Foot Service",    61L, "other",
  "Haematem/Med3",    45L, "other",
  "Mental Health - Northern",    33L, "other",
  "Haematem/Med1",    32L, "other",
  "NEMPS Psychiatry",    26L, "other",
  "H&M/Med 6",    23L, "other",
  "Gynaecology",    21L, "other",
  "Palliative Care",    19L, "other",
  "H&M/Med",    17L, "other",
  "Geriatric H",    13L, "other",
  "Haematem/Med2",    13L, "other",
  "Ortho 4 - Geri",    13L, "other",
  "General Amputee Rehab",    10L, "other",
  "Haematem/Med4",    10L, "other",
  "Neuro Rehab 2",    10L, "other",
  "Radiology",     9L, "other",
  "Spinal Rehab",     8L, "other",
  "NE Adult Prevention and Recovery Care",     7L, "other",
  "Rehabilitation H",     6L, "other",
  "Acq Brain Inj Rehab",     5L, "other",
  "Child Psychiatry",     5L, "other",
  "Mental Health - Community Recovery Program",     5L, "other",
  "Ophthalmology",     5L, "other",
  "Mental Health - Eating & Mood Disorder",     4L, "other",
  "Mental Health - North West",     4L, "other",
  "Radiation Onc",     3L, "other",
  "Medical Oncology 6S",     2L, "other",
  "Psychiatry - St Vincents",     2L, "other",
  "Rehabilitation - Repat",     2L, "other",
  "Mental Health - Baby and Mother",     1L, "other",
  "Oncology H",     1L, "other"
)

# process the data
result2 <-
  result |> 
  dplyr::mutate(
    RESULT_UNIT = RESULT_UNIT |> 
      forcats::fct() |> # convert to a factor
      forcats::fct_lump_n(n = 10, w = n) # keep n most frequent
  ) |> 
  # sum up the 'other' RESULT_UNITS
  dplyr::summarise(
    n = sum(n, na.rm = TRUE),
    .by = RESULT_UNIT
  )

result2
#> # A tibble: 12 × 2
#>    RESULT_UNIT                  n
#>    <fct>                    <int>
#>  1 Emergency Med            41792
#>  2 Liver Transplant         10909
#>  3 Cardiac Surgery           9838
#>  4 Neurosurgery              5498
#>  5 Hepatabiliary/Transplant  5356
#>  6 Resp & Sleep Med          5173
#>  7 Cardiology                4884
#>  8 Colorectal Surgery        4227
#>  9 Renal                     3957
#> 10 Gastro/Hepato             3578
#> 11 Other                    56008
#> 12 <NA>                      3042

Created on 2025-03-25 with reprex v2.1.1

The NA is retained as its own level - if you wanted to include this within the summary then add in a call to forcats::fct_na_value_to_level() in the mutate:

result2 <-
  result |> 
  dplyr::mutate(
    RESULT_UNIT = RESULT_UNIT |> 
      forcats::fct() |> # convert to a factor
      forcats::fct_na_value_to_level() |> # convert NA to a level
      forcats::fct_lump_n(n = 10, w = n) # keep n most frequent
  ) |> 
  # sum up the 'other' RESULT_UNITS
  dplyr::summarise(
    n = sum(n, na.rm = TRUE),
    .by = RESULT_UNIT
  )
2 Likes

@craig.parylo Thanks for your prompt response. The script fails to work in my R session.
The result2 becomes odd.

I modified your pipe with %>% and .by with group_by.

result %>%
dplyr::mutate(
RESULT_UNIT = RESULT_UNIT %>%
forcats::fct() %>% # convert to a factor
forcats::fct_lump_n(n = 10, w = n) # keep n most frequent
) %>%
group_by(RESULT_UNIT) %>%
summarise(n = sum(n))

Then, the result is identical to yours.

My R is 4.0.5 and updated forcats to 1.0.0.

If run your original script, then I got the result like this.

A tibble: 94 x 2

    n .by                     
<int> <fct>                   

1 154262 Emergency Med
2 154262 Liver Transplant
3 154262 Cardiac Surgery
4 154262 Neurosurgery
5 154262 Hepatabiliary/Transplant
6 154262 Resp & Sleep Med
7 154262 Cardiology
8 154262 Colorectal Surgery
9 154262 Renal
10 154262 Gastro/Hepato

n has all the same values where |> replaced by %>% only to work in my R 4.0.5.

I hope how to run your script and get the same results as yours.
I have been figuring out the whole night and this morning.

Thanks.
G

1 Like

Thanks for the update, @genghiskhan, well done on refactoring the script to suit your environment. I hope you get your script working. Let us know if we can be of any help.

1 Like