Categorizing a variable

Hi,
I want to categorize an age variable into 4 categories using 3 division points, so that each category covers 25% of the percentile range.

Here is my attempt and code:

df$age <- structure(c(61, 44, 65, 44, 45, 46, 65, 42, 48, 82, 37, 74, 55,
55, 42, 74, 35, 23, 72, 63, 79, 50, 68, 48, 51, 46, 56, 54, 58,
78, 67, 54, 61, 60, 60, 56, 52, 48, 60, 73, 77, 85, 47, 62, 58,
51, 52, 49, 74, 59, 52, 46, 29, 43, 70, 78, 55, 63, 69, 46, 74,
80, 71, 56, 82, 31, 53, 36, 58, 58, 51, 56, 58, 51, NA, 80, 66,
60, 22, 65, 65, 57, 84, 51, 45, NA, 34, 45, 53, 77, 61, 55, 43,
62, 55, 54, 61, 47, 74, 32, 49, 62, 56, 60, 55, 54, 76, 59, 51,
57, 42, 56, 99, 90, 71, 42, 60, 69, 47, 62, 82, 93), format.spss = "F2.0", display_width = 12L)

quantiles <- quantile(df$age, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Use cut() with custom quantiles to create the categories
labels <- c("Group 1", "Group 2", "Group 3", "Group 4")

I want to know which breaks were used for age, so I would like to create another variable that shows them.
How do I do that, please?
I would also like to know which side of each break is open and which is closed, in the notation that is sometimes written as (15, 23].
So, let's say an age of 61 belongs to the third group, whose break spans 56.5 to 66.5 (if I am not mistaken).
Any help will be much appreciated.


Does this do what you want?

df <- data.frame(age = c(61, 44, 65, 44, 45, 46, 65, 42, 48, 82, 37, 74, 55,
                      55, 42, 74, 35, 23, 72, 63, 79, 50, 68, 48, 51, 46, 56, 54, 58,
                      78, 67, 54, 61, 60, 60, 56, 52, 48, 60, 73, 77, 85, 47, 62, 58,
                      51, 52, 49, 74, 59, 52, 46, 29, 43, 70, 78, 55, 63, 69, 46, 74,
                      80, 71, 56, 82, 31, 53, 36, 58, 58, 51, 56, 58, 51, NA, 80, 66,
                      60, 22, 65, 65, 57, 84, 51, 45, NA, 34, 45, 53, 77, 61, 55, 43,
                      62, 55, 54, 61, 47, 74, 32, 49, 62, 56, 60, 55, 54, 76, 59, 51,
                      57, 42, 56, 99, 90, 71, 42, 60, 69, 47, 62, 82, 93))

quantiles <- quantile(df$age, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
df$bin <- cut(df$age,breaks = quantiles)
head(df)
#>   age         bin
#> 1  61 (56.5,66.2]
#> 2  44   (22,48.8]
#> 3  65 (56.5,66.2]
#> 4  44   (22,48.8]
#> 5  45   (22,48.8]
#> 6  46   (22,48.8]

Created on 2023-08-24 with reprex v2.0.2
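
One thing to watch with the cut() call above: by default the intervals are open on the left, so a value equal to the lowest break (here the minimum age) does not fall into any bin and comes back as NA. Here is a minimal sketch of the difference that include.lowest = TRUE makes. It uses a small made-up vector rather than the full data, and the names bin_default and bin_closed are only for illustration.

```r
# Small made-up sample (an assumption for illustration, not the original data)
age <- c(22, 30, 44, 61, 65, 99, NA)
breaks <- quantile(age, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Default: every interval is (lo, hi], so the minimum falls outside the first bin
bin_default <- cut(age, breaks = breaks)

# include.lowest = TRUE closes the first interval on the left: [lo, hi]
bin_closed <- cut(age, breaks = breaks, include.lowest = TRUE)

print(data.frame(age, bin_default, bin_closed))
```

With include.lowest = TRUE the label of the first level starts with a square bracket, which is how you can tell that the minimum is included.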


Yes, thank you very much.
How can I create another variable that says which group (label) each age value belongs to?
I also assume that (x,y] means the lower end is open and the upper end is closed, so in the first row, for example,
the break covers every value greater than 56.5 and up to and including 66.2. Is that correct?

How is this different than the new column bin that I made in the data frame?
Your interpretation of the (x,y] notation is correct.
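
To see the open and closed ends in action, here is a small sketch with made-up round breaks (not the age quantiles), checking values that sit exactly on a break. The variable names are only for illustration.

```r
# Made-up breaks and values, chosen so that some values sit exactly on a break
breaks <- c(0, 10, 20, 30)
x <- c(10, 10.5, 20)

# right = TRUE (the default): intervals are (lo, hi], the upper end is included
right_closed <- cut(x, breaks = breaks)

# right = FALSE: intervals are [lo, hi), the lower end is included
left_closed <- cut(x, breaks = breaks, right = FALSE)

print(data.frame(x, right_closed, left_closed))
```

Also keep in mind that the labels are rounded for display (dig.lab = 3 significant digits by default), so a label such as 66.2 may stand for a slightly different underlying break. Raising dig.lab in cut() shows more digits.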

What I mean is that, because I created the labels vector in my first post, I was thinking about something like this, maybe using if_else:
if a value belongs to the first break then Group1, if it belongs to the second break then Group2, and so on.
Something like:

library(dplyr)

# Compare the bin column against each interval label as a string
df %>% mutate(groups = case_when(bin == "(22,48.8]" ~ "Group1",
                                 bin == "(48.8,56.5]" ~ "Group2",
                                 bin == "(56.5,66.2]" ~ "Group3",
                                 bin == "(66.2,99]" ~ "Group4"))

You can use the labels argument in cut().

quantiles <- quantile(df$age, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
df$bin <- cut(df$age,breaks = quantiles, 
               labels = c("Group 1", "Group 2", "Group 3", "Group 4"))
head(df)
  age     bin
1  61 Group 3
2  44 Group 1
3  65 Group 3
4  44 Group 1
5  45 Group 1
6  46 Group 1

I would like to have both the breaks and the groups, not just one of them. With the labels argument, the breaks disappeared.

I would run cut() twice to get both the breaks and the group assignment. If you have some reason to do it with case_when(), I suggest you store the levels of the bin column and use those instead of manually entering things like bin == "(22,48.8]" ~ "Group1". Hard-coded labels would have to be edited whenever your data changed.

library(dplyr)  # provides mutate() and case_when()

quantiles <- quantile(df$age, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
df$bin <- cut(df$age,breaks = quantiles)
LevelVec <- levels(df$bin)
df <- df |> mutate(Groups = case_when(
   bin == LevelVec[1] ~ "Group 1",
   bin == LevelVec[2] ~ "Group 2",
   bin == LevelVec[3] ~ "Group 3",
   bin == LevelVec[4] ~ "Group 4"
 ))
head(df)
  age         bin  Groups
1  61 (56.5,66.2] Group 3
2  44   (22,48.8] Group 1
3  65 (56.5,66.2] Group 3
4  44   (22,48.8] Group 1
5  45   (22,48.8] Group 1
6  46   (22,48.8] Group 1
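
For comparison, the first suggestion above (running cut() twice) could look something like the sketch below. It uses a small made-up data frame so that it runs on its own. The include.lowest = TRUE argument is an addition that is not in the code above; it is there so the minimum age is binned instead of becoming NA.

```r
# Made-up sample (an assumption for illustration); substitute your own data frame
df <- data.frame(age = c(22, 30, 44, 61, 65, 99, NA))
quantiles <- quantile(df$age, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# First pass: keep the default interval labels so the breaks stay visible
df$bin <- cut(df$age, breaks = quantiles, include.lowest = TRUE)

# Second pass: same breaks, custom labels for the group names
df$Groups <- cut(df$age, breaks = quantiles, include.lowest = TRUE,
                 labels = c("Group 1", "Group 2", "Group 3", "Group 4"))

print(df)
```

Because both calls use the same breaks, the two columns always agree, and nothing has to be edited by hand when the data change.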

This is fantastic, thank you very much indeed.
