How to remove last word from a column

ksingh19 · December 1, 2023, 1:41pm

Hi,

I would like to remove last word from the column "Category" where country is "CA" in this example below.
I have been using separate() for a while, but in this case separate doesn't help as the number of words in a column can vary and I just want to remove the last one.

Another option is to remove words containing "WORD" or "TXT"

Is it possible? I am looking to learn both of these options.

Thanks for your help!

library(tidyverse)
library(datapasta)

# Example data
df <-  data.frame(
  stringsAsFactors = FALSE,
             Country = c("USA","USA","USA","USA",
                         "USA","CA","CA","CA","CA","CA","CA","CA","CA",
                         "CA","CA","USA","USA","USA","CA","CA","CA",
                         "CA","CA"),
            Category = c("Soft Goods RU",
                         "Soft Goods TXT1","Soft Goods TXT2","Hard Goods TXT3",
                         "Hard Goods TXT4","Hard Goods WORD1","Hard Goods WORD2",
                         "Hard Goods WORD3","Hard Goods ABC  WORD12",
                         "Hard Goods XYZ ABC LMN  WORD1","Hard Goods","Hard Goods",
                         "Hard Goods","Hard Goods","Hard Goods","Soft Goods WORD",
                         "Soft Goods WORD1","Soft Goods","Hard Goods",
                         "Hard Goods","Hard Goods","Hard Goods","Hard Goods")
  )

FJCC · December 1, 2023, 3:08pm

I think this does what you want.

library(tidyverse)
df <-  data.frame(
  stringsAsFactors = FALSE,
  Country = c("USA","USA","USA","USA",
              "USA","CA","CA","CA","CA","CA","CA","CA","CA",
              "CA","CA","USA","USA","USA","CA","CA","CA",
              "CA","CA"),
  Category = c("Soft Goods RU",
               "Soft Goods TXT1","Soft Goods TXT2","Hard Goods TXT3",
               "Hard Goods TXT4","Hard Goods WORD1","Hard Goods WORD2",
               "Hard Goods WORD3","Hard Goods ABC  WORD12",
               "Hard Goods XYZ ABC LMN  WORD1","Hard Goods","Hard Goods",
               "Hard Goods","Hard Goods","Hard Goods","Soft Goods WORD",
               "Soft Goods WORD1","Soft Goods","Hard Goods",
               "Hard Goods","Hard Goods","Hard Goods","Hard Goods")
)

df |> mutate(Category = ifelse(Country == "CA", 
                               str_remove(Category, "\\b\\w+$"), Category))
#>    Country                 Category
#> 1      USA            Soft Goods RU
#> 2      USA          Soft Goods TXT1
#> 3      USA          Soft Goods TXT2
#> 4      USA          Hard Goods TXT3
#> 5      USA          Hard Goods TXT4
#> 6       CA              Hard Goods 
#> 7       CA              Hard Goods 
#> 8       CA              Hard Goods 
#> 9       CA         Hard Goods ABC  
#> 10      CA Hard Goods XYZ ABC LMN  
#> 11      CA                    Hard 
#> 12      CA                    Hard 
#> 13      CA                    Hard 
#> 14      CA                    Hard 
#> 15      CA                    Hard 
#> 16     USA          Soft Goods WORD
#> 17     USA         Soft Goods WORD1
#> 18     USA               Soft Goods
#> 19      CA                    Hard 
#> 20      CA                    Hard 
#> 21      CA                    Hard 
#> 22      CA                    Hard 
#> 23      CA                    Hard


df |> mutate(Category = str_remove(Category, "\\b\\w*(TXT|WORD)\\w*"))
#>    Country                 Category
#> 1      USA            Soft Goods RU
#> 2      USA              Soft Goods 
#> 3      USA              Soft Goods 
#> 4      USA              Hard Goods 
#> 5      USA              Hard Goods 
#> 6       CA              Hard Goods 
#> 7       CA              Hard Goods 
#> 8       CA              Hard Goods 
#> 9       CA         Hard Goods ABC  
#> 10      CA Hard Goods XYZ ABC LMN  
#> 11      CA               Hard Goods
#> 12      CA               Hard Goods
#> 13      CA               Hard Goods
#> 14      CA               Hard Goods
#> 15      CA               Hard Goods
#> 16     USA              Soft Goods 
#> 17     USA              Soft Goods 
#> 18     USA               Soft Goods
#> 19      CA               Hard Goods
#> 20      CA               Hard Goods
#> 21      CA               Hard Goods
#> 22      CA               Hard Goods
#> 23      CA               Hard Goods

^{Created on 2023-12-01 with reprex v2.0.2}

AlexisW · December 1, 2023, 3:13pm

You can use separate_wider_regex() and define a regex like ^([[:alnum:]⁠ ]+) ([:alnum:]⁠+)$

here the first block ([[:alnum:]⁠ ]+ is "any number of words (including spaces)", the second block "a word without space", so you're ensuring that the second block is the last word.

I guess something like str_replace() where you match a regex like WORD[[:digit:]]+ and replace with "".

ksingh19 · December 1, 2023, 6:11pm

Thanks @FJCC !
How do we read \b\w+$

FJCC · December 1, 2023, 10:19pm

The \\b means a word boundary, \\w means a word character, + means "one or more", and $ matches the end of the text. So, \\b\\w+$ means a word boundary followed by one or more word characters followed by the end of the text. That is the last word boundary followed by the last word in the text.

system · December 8, 2023, 10:19pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.