Regex to match word(s) between n-th occurrences of spaces

Andrzej · October 30, 2022, 8:54pm

Hi,
I have got a df as follows:

df <- structure(list(Product.Description.Pack = c("8oz Commemorative Classic  4-6pk",
"8oz Commemorative Diet Coke  4-6pk", "12oz PET Coca-Cola Classic 8 pk  3-8pks",
"8oz Sprite 4-6pk", "12pk can Diet Coke 2-12pk", "8oz Commemorative Dr Pepper 4-6pk",
"8.5oz Coca-Cola AlumBottle 4-6pk", "20oz NR Sprite 4pk", "8.5oz Dt Coke AlumBottle 4-6pk",
"8.5oz Coke Zero AlumBottle 4-6pk", "8.5oz Sprite AlumBottle 4-6pk",
"12pk can CF Diet Coke 2-12pk", "10oz 6 pk Tonic Water 4-6pk",
"12oz PET Coca-Cola Zero  8 pk  3-8pks", "20oz NR Mello Yello 4pk",
"20oz NR Pibb XT 24pk", "12oz PET Dr Pepper 8pk  3-8pks", "12oz PET Dt Dr Pepper 8pk  3-8pks",
"12oz V8 100% Vegetable Juice 12pk")), row.names = c(NA, 19L), class = "data.frame")

I would like to match the words between the second and fourth spaces and create a new column from them.

I have tried a few things and I am still working on it:

 (.*?\s){2}(.*?)

^(?:[^\s]*){2}([^\s]*)(.+)$

 (?<=\s)(.*?)(?=\s)

but so far no success. Any ideas will be greatly appreciated.

The desired results would be as follows:

HanOostdijk · October 30, 2022, 11:33pm

You could use this:

suppressPackageStartupMessages(
  { library(dplyr); library(purrr); library(stringr)  }
) 

df <- structure(list(Product.Description.Pack = c("8oz Commemorative Classic  4-6pk",
"8oz Commemorative Diet Coke  4-6pk", "12oz PET Coca-Cola Classic 8 pk  3-8pks",
"8oz Sprite 4-6pk", "12pk can Diet Coke 2-12pk", "8oz Commemorative Dr Pepper 4-6pk",
"8.5oz Coca-Cola AlumBottle 4-6pk", "20oz NR Sprite 4pk", "8.5oz Dt Coke AlumBottle 4-6pk",
"8.5oz Coke Zero AlumBottle 4-6pk", "8.5oz Sprite AlumBottle 4-6pk",
"12pk can CF Diet Coke 2-12pk", "10oz 6 pk Tonic Water 4-6pk",
"12oz PET Coca-Cola Zero  8 pk  3-8pks", "20oz NR Mello Yello 4pk",
"20oz NR Pibb XT 24pk", "12oz PET Dr Pepper 8pk  3-8pks", "12oz PET Dt Dr Pepper 8pk  3-8pks",
"12oz V8 100% Vegetable Juice 12pk")), row.names = c(NA, 19L), class = "data.frame")

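df2 <- df |>
  rowwise() |>
  # split on runs of one or more spaces, so a double space yields no empty token
  mutate(x = stringr::str_split(Product.Description.Pack, " +"),
         # take the 3rd and 4th tokens; .default = " " covers rows with fewer tokens
         x = paste0(purrr::pluck(x, 3, .default = " "), " ", purrr::pluck(x, 4, .default = " ")),
         # trim and collapse any leftover whitespace
         x = stringr::str_squish(x)
         ) |>
  ungroup()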
  
head(df2)
#> # A tibble: 6 × 2
#>   Product.Description.Pack                x                
#>   <chr>                                   <chr>            
#> 1 8oz Commemorative Classic  4-6pk        Classic 4-6pk    
#> 2 8oz Commemorative Diet Coke  4-6pk      Diet Coke        
#> 3 12oz PET Coca-Cola Classic 8 pk  3-8pks Coca-Cola Classic
#> 4 8oz Sprite 4-6pk                        4-6pk            
#> 5 12pk can Diet Coke 2-12pk               Diet Coke        
#> 6 8oz Commemorative Dr Pepper 4-6pk       Dr Pepper
Created on 2022-10-31 with reprex v2.0.2
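
To make the split-and-pick idea above easier to follow, here is a minimal base R sketch on a single string. It is not the reprex code itself: it uses strsplit() instead of stringr/purrr, and the variable names are only illustrative.

```r
# Base R sketch of the split-and-pick idea (an illustration, not the reprex code)
s <- "8oz Commemorative Classic  4-6pk"   # note the double space before "4-6pk"

# " +" is a regex: a run of one or more spaces counts as a single separator
tokens_plus <- strsplit(s, " +")[[1]]

# a single " " splits at every space and leaves an empty token at the double space
tokens_single <- strsplit(s, " ")[[1]]

print(tokens_plus)
print(tokens_single)

# third and fourth tokens, joined by one space
picked <- paste(tokens_plus[3:4], collapse = " ")
print(picked)
```

With a string of fewer than four tokens, tokens_plus[3:4] would contain NA; the reprex avoids that with the .default argument of pluck().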

Andrzej · October 31, 2022, 5:42am

Thank you.
Could you please explain what is happening here:

HanOostdijk:

mutate (x = stringr::str_split(Product.Description.Pack," +"),
          x = paste0(purrr::pluck(x,3,.default=" ")," ",purrr::pluck(x,4,.default=" ")),
          x = stringr::str_squish(x)

I mean, why did you use a space followed by a plus sign? Does it mean "one or more spaces" here?

Is it possible to use a "classic" regex as well? I ask because purrr has always been difficult for me to use. But of course it works here, and thank you very much for the code.
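
For reference, one possible "classic" regex version in base R is sketched below. It assumes, like " +" in the code quoted above, that a run of spaces counts as one separator. The helper name between_2nd_4th is hypothetical, and the pattern is only one of several that would work.

```r
# Hypothetical helper: text between the 2nd and 4th runs of spaces
between_2nd_4th <- function(x) {
  # ^(?:\S+ +){2}     skip the first two words and the spaces after them
  # (\S+(?: +\S+)?)   capture the third word and, if present, the fourth
  # .*$               consume the rest so sub() replaces the whole string
  out <- sub("^(?:\\S+ +){2}(\\S+(?: +\\S+)?).*$", "\\1", x, perl = TRUE)
  # collapse any run of spaces inside the capture to a single space
  gsub(" +", " ", out)
}

x <- c("8oz Commemorative Diet Coke  4-6pk",
       "12oz PET Coca-Cola Classic 8 pk  3-8pks",
       "8oz Commemorative Classic  4-6pk",
       "8oz Sprite 4-6pk")
res <- between_2nd_4th(x)
print(res)
```

It could be applied to the data frame with df$x <- between_2nd_4th(df$Product.Description.Pack). Strings with fewer than three words do not match the pattern, so sub() returns them unchanged.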

technocrat · October 31, 2022, 6:18am

That algorithm does not produce the results specified. On row 2 Diet Coke has Diet following the second space, but it is immediately followed by the third space so Coke won't be picked up. On row 13, pk Tonic is the result.

Leaving aside how, focus on what. Are AlumBottle, Sprite 4pk, Dt. Dr and 100% Vegetable actually the desired results, rather than Coca-Cola AlumBottle, Diet Coke AlumBottle, Coke Zero AlumBottle, Sprite AlumBottle, Dt. Dr Pepper and V8 100% Vegetable Juice?

If the universe of beverage names is known, a better approach would be the named entity routines in natural language processing packages.

Andrzej · October 31, 2022, 10:28am

Maybe not fully, but it is very helpful for cleaning my data, and I have learnt something new about using purrr. But here:
https://stackoverflow.com/questions/32755703/regex-between-two-nth-position-characters

there is a similar case. I wanted to extract

U10

but despite the solution given there on Stack Overflow:

`^(?:[^_]*_){2}([^_]*)`

it doesn't work as I expected. What would be the correct code to extract the word U10, which sits between the second and third underscores?
The author (Wiktor) of the accepted answer mentioned something about Group 1 and Group 2. Could someone explain to me what is going on in that solution, please?
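
A possible reason for the surprise, sketched under assumptions: the pattern as a whole matches everything from the start of the string through the third field, and the wanted text is only in its first capture group. The input string below is hypothetical (it is not the one from the linked question), and the code is base R.

```r
x <- "AB_CD_U10_EF"   # hypothetical input, not the string from the linked question
pattern <- "^(?:[^_]*_){2}([^_]*)"

# regexec() reports the whole match first, then each capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
whole  <- m[1]   # everything the pattern consumed
group1 <- m[2]   # only the part inside ( ... )

# alternative: let the pattern consume the full string and keep group 1
via_sub <- sub(paste0(pattern, ".*$"), "\\1", x, perl = TRUE)

print(whole)
print(group1)
print(via_sub)
```

Here (?:...) groups without capturing, so the only numbered group is the ([^_]*) part. A function that returns the full match, rather than the group, will therefore include the first two fields as well.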
