regex to automatically match to different column names

ollie2022 · August 7, 2022, 5:11pm

Hi,

I have spent some time trying to solve this today but not really been able to come up with a solution.

I have some messy spreadsheets (about 20) where the data is divided into a separate spreadsheet for each year and the column naming is not consistent across them, with some not having a column name at all in the spreadsheet and so get assigned "...1" etc after being imported using readxl.

My aim was to create a regular expression which would capture the different column names after being imported and then transform it to a uniform name of "area".

I would ideally like to match the column named "...1" etc first, then if they are not available, to match to a different string. Turning off the global flag seem to be the way to do this but I cannot find anywhere online which discusses this in R/tidyverse. I am thinking it might not be possible, or, there might be a more clever way of writing the regex which I cannot figure out.

I've tried to produce a minimal reprex below which illustrates the problem. Particularly in year_3, I end up with 2 columns which the "area" name.

All helped appreciated including a simpler way to tackle this problem.

library(tidyverse)

year_1 <- tribble(
  ~`...1`, ~admissions,
  "Hospital 1", 10,
  "Hospital 2", 100,
  "hospital 3", 200
)

year_2 <- tribble(
  ~provider_code, ~`...2`, ~admissions,
  "H1", "Hospital 1", 20,
  "H2", "Hospital 2", 400,
  "H3", "hospital 3", 500
)

year_3 <- tribble(
  ~"Hospital provider code", ~"Commissioning region/Provider", ~admissions,
  "H1", "Hospital 1", 350,
  "H2", "Hospital 2", 350,
  "H3", "hospital 3", 550
)


clean_up_area_column_name <- function(x){
  rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region|hospital provider", ignore.case = TRUE))
  }

datasets <- list(year_1, year_2, year_3)

map_df(datasets, clean_up_area_column_name)

# A tibble: 9 × 5
  area       admissions provider_code area1 area2     
  <chr>           <dbl> <chr>         <chr> <chr>     
1 Hospital 1         10 NA            NA    NA        
2 Hospital 2        100 NA            NA    NA        
3 hospital 3        200 NA            NA    NA        
4 Hospital 1         20 H1            NA    NA        
5 Hospital 2        400 H2            NA    NA        
6 hospital 3        500 H3            NA    NA        
7 NA                350 NA            H1    Hospital 1
8 NA                350 NA            H2    Hospital 2
9 NA                550 NA            H3    hospital 3



<sup>Created on 2022-08-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

nirgrahamuk · August 12, 2022, 9:37am

its quite possible I don't understand your issue because

clean_up_area_column_name <- function(x){
  rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region", ignore.case = TRUE))
}

gives

# A tibble: 9 x 4
  area       admissions provider_code `Hospital provider code`
  <chr>           <dbl> <chr>         <chr>                   
1 Hospital 1         10 NA            NA                      
2 Hospital 2        100 NA            NA                      
3 hospital 3        200 NA            NA                      
4 Hospital 1         20 H1            NA                      
5 Hospital 2        400 H2            NA                      
6 hospital 3        500 H3            NA                      
7 Hospital 1        350 NA            H1                      
8 Hospital 2        350 NA            H2                      
9 hospital 3        550 NA            H3

seemingly assigning area for all 3 sets.

ollie2022 · August 24, 2022, 11:53am

thanks for reply. I think I probably didn't explain it very well.

although the regex \\.{3}[0-9]|commissioning region matches to the columns, it matches both strings in the regex, so I end up with the hospital names split between 2 columns.

Ideally I want the regex to match the first part of the regex then stop matching; and only go onto match the second part of the regex if the first isn't matched.

If I understand correctly, the current behaviour is known as global matching but can't seem to find anyway of turning this off in R's regex engine.

I am now wondering if using an if-then-else conditional is probably the best option.

Any thoughts?

nirgrahamuk · August 24, 2022, 11:58am


clean_up_area_column_name <- function(x){
  rename({{x}}, area = matches("\\.{3}[0-9]|commissioning region", ignore.case = TRUE))
}

datasets <- list(year_1, year_2, year_3)

map_df(datasets, clean_up_area_column_name) |>
  mutate(prov_code=coalesce(provider_code,
                            `Hospital provider code`)) |>
  select(area,admissions,prov_code)

# A tibble: 9 x 3
  area       admissions prov_code
  <chr>           <dbl> <chr>    
1 Hospital 1         10 NA       
2 Hospital 2        100 NA       
3 hospital 3        200 NA       
4 Hospital 1         20 H1       
5 Hospital 2        400 H2       
6 hospital 3        500 H3       
7 Hospital 1        350 H1       
8 Hospital 2        350 H2       
9 hospital 3        550 H3

ollie2022 · August 26, 2022, 11:56am

thank you. I actually didn't know about the coalesce function; I'd been using unite.

system · September 16, 2022, 11:56am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.