Assign a category based on the maximum value from several other columns

slygoose0 · May 31, 2024, 5:52pm

I have value for several columns in a df and want to assign the column names as categorical variables in a new column in the df based on which value is the greatest. I'll try to demonstrate.

|Column 1 | Column 2 | Column 3 | Column 4 | Category|
|5 | 11 | 25 | 50 | Column 4|
|71 | 7 | 12 | 1 | Column 1|
|17 | 9 | 18 | 10 | Column 3|
|25 | 3 | 20 | 19 | Column 1|
|6 | 62 | 11 | 13 | Column 2|

I would be trying to create the column "Cat". I was able to do something similar, but it was based on which column was above 50. I used mutate() and case_when() to achieve that.
The values in my actual "A-D" columns are percentages that add to roughly 100 (with rounding error), I also have 7 of those columns rather than 4, so there is not always one percent with the majority, so the >50 barrier creates a lot of NAs in the "Cat" column.

Any help would be greatly appreciated!

startz · May 31, 2024, 6:09pm

Try max.col(df).

slygoose0 · May 31, 2024, 6:40pm

@startz
Thanks for the suggestion! it almost works. I should've said this originally, but these are a range on columns inside a larger df.
Trying out ways to put a range of columns in max.col() now

prubin · May 31, 2024, 7:00pm

See if the following yields what you want. It requires the dplyr library.

df |> mutate(Category = names(df)[max.col(df)])

joels · May 31, 2024, 7:17pm

Here's one way to identify the maximum column from a subset of current columns. I feel like there must be a less verbose approach, but this is what I've got for now.

library(tidyverse)

mtcars %>% 
  slice(1:5) %>% 
  mutate(name.of.max.col = select(.,  drat, wt, carb) %>% {names(.)[max.col(.)]})
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#>                   name.of.max.col
#> Mazda RX4                    carb
#> Mazda RX4 Wag                carb
#> Datsun 710                   drat
#> Hornet 4 Drive                 wt
#> Hornet Sportabout              wt

slygoose0 · May 31, 2024, 7:35pm

@prubin
hi! this worked with the range of columns subsetted out, thanks!

df <- mutate(Category = names(df[1:4])[max.col(df[1:4])])

joels · May 31, 2024, 7:45pm

I suggest changing this to:

df %>% mutate(Category = names(.[1:4])[max.col(.[1:4])])

In the original version, the df inside mutate is referring to the version of df in the global environment, rather than the current version of the data at that point in the chain. Switching to . ensures that you're operating on the current version of the data within the chain, rather than the (possibly different) version of the data in the global environment.

dromano · June 2, 2024, 12:05am

And here's a tweak of @prubin's solution with @joels pipe modification that allows you select the columns by name:

library(tidyverse)

tibble::tribble(
  ~Column.1, ~Column.2, ~Column.3, ~Column.4,  ~Category,
  5L,       11L,       25L,       50L, "Column 4",
  71L,        7L,       12L,        1L, "Column 1",
  17L,        9L,       18L,       10L, "Column 3",
  25L,        3L,       20L,       19L, "Column 1",
  6L,       62L,       11L,       13L, "Column 2"
) -> df

# pass df to magrittr pipe so can later use '.'
df %>% 
  mutate(
    Cat = 
      . |> 
      # pass seletcted columns to magrittr pipe so can later use '.'
      select(Column.1:Column.4) %>%
      {names(.)[max.col(.)]}
  )
#> # A tibble: 5 × 6
#>   Column.1 Column.2 Column.3 Column.4 Category Cat     
#>      <int>    <int>    <int>    <int> <chr>    <chr>   
#> 1        5       11       25       50 Column 4 Column.4
#> 2       71        7       12        1 Column 1 Column.1
#> 3       17        9       18       10 Column 3 Column.3
#> 4       25        3       20       19 Column 1 Column.1
#> 5        6       62       11       13 Column 2 Column.2

^{Created on 2024-06-01 with reprex v2.0.2}

system · June 9, 2024, 12:06am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.