Creating additional columns by splitting observations from another column

Akash01 · September 11, 2024, 1:47pm

Here is some data.

structure(list(Name = c("Porbandar", "Porbandar", "Porbandar", 
"Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
"Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
"Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
"Porbandar", "Porbandar"), Level = c("CD BLOCK", "CD BLOCK", 
"CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", 
"CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", 
"CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK"
), TRU = c("Total", "Rural", "Urban", "Total", "Rural", "Urban", 
"Total", "Rural", "Urban", "Total", "Rural", "Urban", "Total", 
"Rural", "Urban", "Total", "Rural", "Urban", "Total", "Rural"
), variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L), levels = c("TOT_WORK_P", 
"TOT_WORK_M", "TOT_WORK_F", "MAINWORK_P", "MAINWORK_M", "MAINWORK_F", 
"MAIN_CL_P", "MAIN_CL_M", "MAIN_CL_F", "MAIN_AL_P", "MAIN_AL_M", 
"MAIN_AL_F", "MAIN_HH_P", "MAIN_HH_M", "MAIN_HH_F", "MAIN_OT_P", 
"MAIN_OT_M", "MAIN_OT_F"), class = "factor"), Population = c(83780, 
77412, 6368, 56447, 51184, 5263, 27333, 26228, 1105, 66862, 61215, 
5647, 52006, 47115, 4891, 14856, 14100, 756, 31635, 31321)), row.names = c(NA, 
20L), class = "data.frame")

I want to create two columns from the 'levels' column, named as:
'work_type' and 'pop_division'.

Pop_division shall be formed from the transformation of the last character in observations of levels such that 'P ' gets listed as 'All', 'F' as 'Female', M as 'Male'.
While work_type will include everything except the last character(and the _ before it) so 'MAIN_AL_F' shall be listed as 'MAIN_AL'.

craig.parylo · September 11, 2024, 2:04pm

Hi @Akash01 ,

Here I do two steps:

Classify variable to include gender along with a parseable character - I've used the pipe |,
Then use seprate_wider_delim() (an experimental tidyr function) to split the single field into two on the pipe.

library(tidyverse)
df <- structure(list(Name = c("Porbandar", "Porbandar", "Porbandar", 
                        "Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
                        "Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
                        "Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
                        "Porbandar", "Porbandar"), Level = c("CD BLOCK", "CD BLOCK", 
                                                             "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", 
                                                             "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", 
                                                             "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK"
                        ), TRU = c("Total", "Rural", "Urban", "Total", "Rural", "Urban", 
                                   "Total", "Rural", "Urban", "Total", "Rural", "Urban", "Total", 
                                   "Rural", "Urban", "Total", "Rural", "Urban", "Total", "Rural"
                        ), variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
                                                  4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L), levels = c("TOT_WORK_P", 
                                                                                                          "TOT_WORK_M", "TOT_WORK_F", "MAINWORK_P", "MAINWORK_M", "MAINWORK_F", 
                                                                                                          "MAIN_CL_P", "MAIN_CL_M", "MAIN_CL_F", "MAIN_AL_P", "MAIN_AL_M", 
                                                                                                          "MAIN_AL_F", "MAIN_HH_P", "MAIN_HH_M", "MAIN_HH_F", "MAIN_OT_P", 
                                                                                                          "MAIN_OT_M", "MAIN_OT_F"), class = "factor"), Population = c(83780, 
                                                                                                                                                                       77412, 6368, 56447, 51184, 5263, 27333, 26228, 1105, 66862, 61215, 
                                                                                                                                                                       5647, 52006, 47115, 4891, 14856, 14100, 756, 31635, 31321)), row.names = c(NA, 
                                                                                                                                                                                                                                                  20L), class = "data.frame")
df2 <- df |> 
  mutate(
    variable2 = str_replace_all(
      string = variable,
      pattern = c('_P' = '|All', '_F' = '|Female', '_M' = '|Male')
    )
  ) |> 
  separate_wider_delim(
    cols = variable2,
    delim = '|',
    names = c('work_type', 'pop_division')
  )

df2
#> # A tibble: 20 × 7
#>    Name      Level    TRU   variable   Population work_type pop_division
#>    <chr>     <chr>    <chr> <fct>           <dbl> <chr>     <chr>       
#>  1 Porbandar CD BLOCK Total TOT_WORK_P      83780 TOT_WORK  All         
#>  2 Porbandar CD BLOCK Rural TOT_WORK_P      77412 TOT_WORK  All         
#>  3 Porbandar CD BLOCK Urban TOT_WORK_P       6368 TOT_WORK  All         
#>  4 Porbandar CD BLOCK Total TOT_WORK_M      56447 TOT_WORK  Male        
#>  5 Porbandar CD BLOCK Rural TOT_WORK_M      51184 TOT_WORK  Male        
#>  6 Porbandar CD BLOCK Urban TOT_WORK_M       5263 TOT_WORK  Male        
#>  7 Porbandar CD BLOCK Total TOT_WORK_F      27333 TOT_WORK  Female      
#>  8 Porbandar CD BLOCK Rural TOT_WORK_F      26228 TOT_WORK  Female      
#>  9 Porbandar CD BLOCK Urban TOT_WORK_F       1105 TOT_WORK  Female      
#> 10 Porbandar CD BLOCK Total MAINWORK_P      66862 MAINWORK  All         
#> 11 Porbandar CD BLOCK Rural MAINWORK_P      61215 MAINWORK  All         
#> 12 Porbandar CD BLOCK Urban MAINWORK_P       5647 MAINWORK  All         
#> 13 Porbandar CD BLOCK Total MAINWORK_M      52006 MAINWORK  Male        
#> 14 Porbandar CD BLOCK Rural MAINWORK_M      47115 MAINWORK  Male        
#> 15 Porbandar CD BLOCK Urban MAINWORK_M       4891 MAINWORK  Male        
#> 16 Porbandar CD BLOCK Total MAINWORK_F      14856 MAINWORK  Female      
#> 17 Porbandar CD BLOCK Rural MAINWORK_F      14100 MAINWORK  Female      
#> 18 Porbandar CD BLOCK Urban MAINWORK_F        756 MAINWORK  Female      
#> 19 Porbandar CD BLOCK Total MAIN_CL_P       31635 MAIN_CL   All         
#> 20 Porbandar CD BLOCK Rural MAIN_CL_P       31321 MAIN_CL   All

^{Created on 2024-09-11 with reprex v2.1.0}

Akash01 · September 12, 2024, 12:18pm

craig, my man, thanks for taking your time out. Your solution works. However I would like to see other solutions that would not involve this special tidyr function.

craig.parylo · September 12, 2024, 1:13pm

You're welcome, @Akash01 .

I understand you'd like to avoid the use of separate_wider_delim() from tidyr().

The version below uses stable functions from the tidyverse collection; specifically stringr. Does this meet your needs?

library(tidyverse)
df <- structure(list(Name = c("Porbandar", "Porbandar", "Porbandar", 
                              "Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
                              "Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
                              "Porbandar", "Porbandar", "Porbandar", "Porbandar", "Porbandar", 
                              "Porbandar", "Porbandar"), Level = c("CD BLOCK", "CD BLOCK", 
                                                                   "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", 
                                                                   "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", 
                                                                   "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK", "CD BLOCK"
                              ), TRU = c("Total", "Rural", "Urban", "Total", "Rural", "Urban", 
                                         "Total", "Rural", "Urban", "Total", "Rural", "Urban", "Total", 
                                         "Rural", "Urban", "Total", "Rural", "Urban", "Total", "Rural"
                              ), variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
                                                        4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L), levels = c("TOT_WORK_P", 
                                                                                                                "TOT_WORK_M", "TOT_WORK_F", "MAINWORK_P", "MAINWORK_M", "MAINWORK_F", 
                                                                                                                "MAIN_CL_P", "MAIN_CL_M", "MAIN_CL_F", "MAIN_AL_P", "MAIN_AL_M", 
                                                                                                                "MAIN_AL_F", "MAIN_HH_P", "MAIN_HH_M", "MAIN_HH_F", "MAIN_OT_P", 
                                                                                                                "MAIN_OT_M", "MAIN_OT_F"), class = "factor"), Population = c(83780, 
                                                                                                                                                                             77412, 6368, 56447, 51184, 5263, 27333, 26228, 1105, 66862, 61215, 
                                                                                                                                                                             5647, 52006, 47115, 4891, 14856, 14100, 756, 31635, 31321)), row.names = c(NA, 
                                                                                                                                                                                                                                                        20L), class = "data.frame")
df2 <- df |> 
  mutate(
    variable2 = str_replace_all(
      string = variable,
      pattern = c('_P' = '|All', '_F' = '|Female', '_M' = '|Male')
    ),
    
    # find out where the pipe character is
    pipe_position = str_locate(
      string = variable2,
      pattern = '\\|'
    )[,1],
    
    # extract work type
    work_type = str_sub(
      string = variable2,
      start = 1L,
      end = pipe_position - 1
    ),
    
    # extract pop division
    pop_division = str_sub(
      string = variable2,
      start = pipe_position + 1
    )
  ) |> 
  select(-c(variable2, pipe_position))

df2
#>         Name    Level   TRU   variable Population work_type pop_division
#> 1  Porbandar CD BLOCK Total TOT_WORK_P      83780  TOT_WORK          All
#> 2  Porbandar CD BLOCK Rural TOT_WORK_P      77412  TOT_WORK          All
#> 3  Porbandar CD BLOCK Urban TOT_WORK_P       6368  TOT_WORK          All
#> 4  Porbandar CD BLOCK Total TOT_WORK_M      56447  TOT_WORK         Male
#> 5  Porbandar CD BLOCK Rural TOT_WORK_M      51184  TOT_WORK         Male
#> 6  Porbandar CD BLOCK Urban TOT_WORK_M       5263  TOT_WORK         Male
#> 7  Porbandar CD BLOCK Total TOT_WORK_F      27333  TOT_WORK       Female
#> 8  Porbandar CD BLOCK Rural TOT_WORK_F      26228  TOT_WORK       Female
#> 9  Porbandar CD BLOCK Urban TOT_WORK_F       1105  TOT_WORK       Female
#> 10 Porbandar CD BLOCK Total MAINWORK_P      66862  MAINWORK          All
#> 11 Porbandar CD BLOCK Rural MAINWORK_P      61215  MAINWORK          All
#> 12 Porbandar CD BLOCK Urban MAINWORK_P       5647  MAINWORK          All
#> 13 Porbandar CD BLOCK Total MAINWORK_M      52006  MAINWORK         Male
#> 14 Porbandar CD BLOCK Rural MAINWORK_M      47115  MAINWORK         Male
#> 15 Porbandar CD BLOCK Urban MAINWORK_M       4891  MAINWORK         Male
#> 16 Porbandar CD BLOCK Total MAINWORK_F      14856  MAINWORK       Female
#> 17 Porbandar CD BLOCK Rural MAINWORK_F      14100  MAINWORK       Female
#> 18 Porbandar CD BLOCK Urban MAINWORK_F        756  MAINWORK       Female
#> 19 Porbandar CD BLOCK Total  MAIN_CL_P      31635   MAIN_CL          All
#> 20 Porbandar CD BLOCK Rural  MAIN_CL_P      31321   MAIN_CL          All

^{Created on 2024-09-12 with reprex v2.1.0}

AlexisW · September 12, 2024, 5:42pm

If needed, you can also completely do that with base R functions, with a basis like:

dat <- structure(...)

nb_characters <- nchar(as.character(dat$variable))

substr(dat$variable, nb_characters, nb_characters)
#>  [1] "P" "P" "P" "M" "M" "M" "F" "F" "F" "P" "P" "P" "M" "M" "M" "F" "F" "F" "P"
#> [20] "P"

substr(dat$variable, 1, nb_characters - 2)
#>  [1] "TOT_WORK" "TOT_WORK" "TOT_WORK" "TOT_WORK" "TOT_WORK" "TOT_WORK"
#>  [7] "TOT_WORK" "TOT_WORK" "TOT_WORK" "MAINWORK" "MAINWORK" "MAINWORK"
#> [13] "MAINWORK" "MAINWORK" "MAINWORK" "MAINWORK" "MAINWORK" "MAINWORK"
#> [19] "MAIN_CL"  "MAIN_CL"

and then either use mutate() and case_when() from tidyverse, or in pure base R, something like:

dat$work_type <- substr(dat$variable, 1, nb_characters - 2)

ending <- substr(dat$variable, nb_characters, nb_characters)

dat$Pop_division <- NA_character_
dat$Pop_division[ending == "P"] <- "both"
dat$Pop_division[ending == "M"] <- "male"
dat$Pop_division[ending == "F"] <- "female"

Akash01 · September 13, 2024, 8:13pm

Wow!! Awesome @AlexisW

Akash01 · September 13, 2024, 8:16pm

Would you make a case of using library functions over base R functions . I mean both work, but maybe calculation efficiency wise, amount of code or any other factors wise.

AlexisW · September 13, 2024, 9:07pm

I don't think there is a definitive answer to that.

One problem with libraries is stability: if the library changes, your code may break. That's still the case with base R, but that's less likely; and every dependency you add increases the risk a bit. That being said, the tidyverse tends to be quite stable, the amount of risk is relatively low. Still, I regularly have to rerun code that I wrote a few years ago, and it's painful when you have to edit your code that used to run, just because a function has changed.

Another factor is how more readable your code is. In this case, I actually prefer Craig's first answer to mine, in that, if you use the tidyverse consistently, a simple mutate() or separate_wider() tends to be clearer and more readable than my version where I have to edit columns (although I would say Craig's second answer with a lot of string manipulation inside the mutate is less clear in my opinion [1]).

There is more discussion here. More to your specific questions:

maybe calculation efficiency wise, amount of code or any other factors wise.

Here calculation efficiency matters little, if an operation takes 0.01 seconds instead of 0.001, does it matter? If you have big datasets and operations that are repeated, then it can matter. But the tidyverse is not particularly aiming for performance, so you would consider other libraries.

Amount of code I think isn't the best metric, clarity of the code matters more (and in general shorter code is clearer but not always). Here I'd say the tidyverse tends to win, that's what made it successful. But that's mostly true if you use the tidyverse consistently in all your code and keep your data in a "tidy" format. I personally have scripts where I do everything with tidyverse, and other scripts where I don't use it at all (because genomic or matrix data representations are better suited to my problem).

[1] looking at the question, in my opinion the clearest version would be to use separate_wider_regex: even if you don't know that function, the name is clear, you can easily guess what it's doing (if you know regexes). And by using a regex instead of several string processing calls, it keeps the code short which helps reading it fast:

df |>
  separate_wider_regex(
    cols = variable,
    patterns = c(work_type = "^[[:alnum:]_]+",
                 "_",
                 pop_division = "[PMF]$")
  ) |>
  mutate(pop_division = recode(pop_division,
                               "P" = "All",
                               "M" = "Male",
                               "F" = "Female"))

system · September 20, 2024, 9:08pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.