Regex - not picking up some string values properly

Slavek · August 21, 2019, 3:22pm

Hi, I have prepared this simple sample file:

source <- data.frame(stringsAsFactors=FALSE,
                      InterviewID = c(94, 59, 100, 86, 60, 101, 61),
                       DataTypeID = c(1, 1, 1, 1, 1, 1, 1),
                 QuestionnaireVID = c(6, 6, 6, 6, 6, 6, 6),
                       CustomerID = c(198, 239, 215, 249, 246, 209, 281),
                              URN = c("10BE0002047", "10BE0002051", "10BE0002052",
                                      "10BE0002057", "10BE0002061",
                                      "10BE0002065", "10BE0002067"),
                          OrgCode = c("BE02104", "BE09702", "BE02021", "BE02077", "BE02023",
                                      "BE02095", "BE02124"),
                        CountryID = c(15, 15, 15, 15, 15, 15, 15),
                    InterviewDate = c("2019-05-23 21:48:00", "2019-05-17 12:32:00",
                                      "2019-05-20 16:52:00",
                                      "2019-05-17 20:19:00", "2019-05-17 12:35:00",
                                      "2019-05-20 16:49:00", "2019-05-17 12:50:00"),
                       LoadedDate = c("2019-05-24 02:15:16", "2019-05-18 02:15:08",
                                      "2019-05-21 02:15:03",
                                      "2019-05-18 02:15:08", "2019-05-18 02:15:08",
                                      "2019-05-21 02:15:03", "2019-05-18 02:15:08"),
                             ETID = c(31, 29, 30, 29, 29, 30, 29),
                      Transferred = c(1, 1, 1, 1, 1, 1, 1),
                            Model = c("A", "A", "A", "B", "B", "B", "B"),
                               A1 = c(10, 9, 10, 9, 10, 10, 10),
                          AComm_1 = c("Nom", "neen", "l'accueil fut excellent ,
                                      les explications complètes et la photo prise devant l'A est une très bonne idée et un superbe souvenir .",
                                      "Steeds zeer vriendelijk", "geen commentaar",
                                      "geen commentaren", "Zeer vriendelijke service!"),
                          AComm_2 = c(NA, NA, NA, NA, NA, NA, NA),
                          AComm_3 = c(NA, NA, NA, NA, NA, NA, NA),
                          AComm_4 = c(NA, NA, NA, NA, NA, NA, NA),
                            NEW_0 = c(10, 9, 10, 9, 10, 10, 10),
                            NEW_2 = c("Nom", "Het rijgedrag",
                                      "l'I 10 est très bien équipée avec tout le confort des nouvelles technologies", NA,
                                      "zoals hierboven", "zoals hiervoor",
                                      "zie boven"),
                           NEW_2A = c(NA, NA, NA, NA, NA, NA, NA),
                            NEW_4 = c("Rien",
                                      "De waarschuwingsseinen bij het achteruitrijden werkten tot hiertoe maar 1 keer",
                                      "y permettre une option avec la caméra de recul .", NA, NA, NA, "Niks"),
                           NEW_4A = c(NA, NA, NA, NA, NA, NA, NA),
                               B1 = c("Model", "Ich kann nicht sagen",
                                      "Déjà répondu au-dessus",
                                      "Zonder problemen", "RAS",
                                      "J’aimais le modèle B. La garantie de 5 ans est rassurante.",
                                      "oben kommentiert"),
                               B2 = c(1, 4, 32, 4, 2, 32, 3),
                             B2_1 = c(1, 0, 0, 0, 0, 0, 1),
                             B2_2 = c(0, 0, 0, 0, 1, 0, 1),
                             B2_3 = c(0, 1, 0, 1, 0, 0, 0),
                             B2_4 = c(0, 0, 0, 0, 0, 0, 0),
                             B2_5 = c(0, 0, 0, 0, 0, 0, 0),
                             B2_6 = c(0, 0, 1, 0, 0, 1, 0),
                               B3 = c(NA, NA, "facilité d'accès depuis mon domicile .", NA,
                                      NA, "Rien à dire", NA),
                               C1 = c(10, 6, 10, 9, 10, 7, 10),
                          CComm_1 = c("Nom", NA, "je ne dirai qu'un mot \" proficiat \"", NA,
                                      "Alles was top in orde", NA,
                                      "Tot op heden prima service!"),
                          CComm_2 = c(NA, NA, NA, NA, NA, "Garage un peu loin de chez moi.",
                                      NA),
                          CComm_3 = c(NA, "niet van toepassing", NA, NA, NA, NA, NA),
                          CComm_4 = c(NA, NA, NA, NA, NA, NA, NA),
                               D1 = c(10, 8, 10, 9, 10, 10, 10),
                          DComm_1 = c("Nom", NA, "Non .", NA, "zoals aangegeven hiervoor",
                                      "zoals hiervoor aangegeven",
                                      "Zeer vriendelijke personen!"),
                          DComm_2 = c(NA, "neen", NA, NA, NA, NA, NA),
                          DComm_3 = c(NA, NA, NA, NA, NA, NA, NA),
                          DComm_4 = c(NA, NA, NA, NA, NA, NA, NA),
                              OS2 = c(2, 2, 2, 2, 2, 1, 2),
                               E1 = c(NA, NA, NA, NA, NA, 10, NA),
                               E2 = c(2, 1, 1, 2, 1, NA, 1),
                               F1 = c(10, 9, 10, 9, 9, 10, 10),
                               F2 = c(2, 2, 1, 2, 1, 2, 2),
                               G1 = c(1, 2, 1, 1, 1, 1, 1),
                               H1 = c(3, 1, 1, 1, 1, 1, 3),
                               H2 = c(NA, 3, 1, 3, 3, 3, NA),
                               I1 = c(1, 2, 1, 1, 1, 2, 2),
                          IComm_1 = c("Nom", NA, NA, NA, "Gewoon zo verder doen,
                                     alles was tip top in orde", NA, NA),
                          IComm_2 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_3 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_4 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_5 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_6 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_7 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_8 = c(NA, NA, NA, NA, NA, NA, NA),
                          IComm_9 = c(NA, NA, NA, NA, NA, NA, NA),
                         IComm_10 = c(NA, NA, NA, NA, NA, NA, NA),
                              VIN = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"),
                        ModelLong = c("A (2013~ )", "A (2013~ )", "A (2013~ )",
                                      "B (2014 ~ )", "B (2014 ~ )",
                                      "B (2014 ~ )", "B (2014 ~ )")
              )

Then I specified previous_statements and blank_statements using regex:

blank_statements <- regex("geen\\scommentaar|geen\\sspeciale\\scommentaar|
geen\\scommentaren", ignore_case = TRUE)

previous_statements <- regex("zoals\\shierboven|zoals\\shiervoor|ervoor|zie\\sboven|hierboven|zie\\shiervoor", ignore_case = TRUE)

Now, I've got this code to create two new variables: all_comment and A_comment:

merged.comments <- source %>%
  mutate_at(vars(matches("comm|new|B1$|B3$")), ~str_remove_all(.x, "^.{1,5}$")) %>% # Remove values of 5 characters or fewer
  mutate(all_comment = paste(AComm_1, AComm_2, AComm_3, AComm_4, NEW_2, NEW_2A, NEW_4, NEW_4A, B1, B3,  
                             CComm_1, CComm_2, CComm_3, CComm_4, DComm_1, DComm_2, DComm_3, DComm_4, 
                             IComm_1, IComm_2, IComm_3, IComm_4, IComm_5, IComm_6, IComm_7, IComm_8, IComm_9, IComm_10, sep="/"), # Merges comment variables
         all_comment = str_remove_all(all_comment, blank_statements), # Removes blanks
         all_comment = str_remove_all(all_comment, "^(neen|RAS|nom|nee|non)$"), # Removes blanks 2
         all_comment = str_remove_all(all_comment, "NA"), # Removes NAs
         all_comment = str_remove_all(all_comment, "(.)\\1{2,}"), # Removes runs of 3+ identical characters
         all_comment = str_remove_all(all_comment, "[:cntrl:]"), # Removes control characters like \n and \r
         all_comment = str_replace_all(all_comment, "\\s\\s+", " "), # Collapses extra spaces
         all_comment = str_replace_all(all_comment, "//+", "/"), # Collapses duplicated /
         A_comment = paste(AComm_1, AComm_2, AComm_3, AComm_4), # Merges comment variables
         A_comment = str_remove_all(A_comment, blank_statements), # Removes blanks
         A_comment = str_remove_all(A_comment, "^(neen|RAS|nom|nee|non)$"), # Removes blanks 2
         A_comment = str_remove_all(A_comment, "NA"), # Removes NAs
         A_comment = str_remove_all(A_comment, "(.)\\1{2,}"), # Removes runs of 3+ identical characters
         A_comment = str_remove_all(A_comment, "[:cntrl:]"), # Removes control characters like \n and \r
         A_comment = str_replace_all(A_comment, "\\s\\s+", " "), # Collapses extra spaces
         A_comment = str_replace_all(A_comment, "//+", "/")) # Collapses duplicated /

Unfortunately, "geen commentaar" is removed properly but "geen commentaren" stays unchanged (both are in AComm_1).
Also, for some reason, some merged string values look as they should (with the "/" divider) but others don't.

For example, final result for the sixth record is:
"geen commentarenzoals hiervoorJ’aimais le modèle B. La garantie de 5 ans est rassurante./Rien à dire/Garage un peu loin de chez moi.zoals hiervoor aangegeven"

but should be:
"zoals hiervoor/J’aimais le modèle B. La garantie de 5 ans est rassurante./Rien à dire/Garage un peu loin de chez moi.zoals hiervoor aangegeven"

I cannot get my head around it

Can you help?

andresrcs · August 22, 2019, 2:00am

You are making your example unnecessarily large and complex. When making a reprex, you are supposed to narrow your code down to just the problematic part (i.e., str_remove_all() not removing all the options in your regular expression) and provide sample data just large enough to reproduce your issue (you are including 64 variables, but you only need one to exemplify your problem).

If you remove the new line in your blank_statements regex, all blank options get removed. See this minimal example (notice how I have narrowed everything down to just the essential part):

library(tidyverse)
library(stringr)

sample_data <- data.frame(stringsAsFactors=FALSE,
                          AComm_1 = c("Nom", "neen", "l'accueil fut excellent, les explications complètes et la photo prise devant l'A est une très bonne idée et un superbe souvenir .",
                                      "Steeds zeer vriendelijk", "geen commentaar", "geen commentaren",
                                      "Zeer vriendelijke service!")
)

blank_statements <- regex("geen\\scommentaar|geen\\sspeciale\\scommentaar|geen\\scommentaren",
                          ignore_case = TRUE)

sample_data %>%
  as_tibble() %>% 
  mutate(AComm_1 = str_remove_all(AComm_1, blank_statements))
#> # A tibble: 7 x 1
#>   AComm_1                                                                  
#>   <chr>                                                                    
#> 1 Nom                                                                      
#> 2 neen                                                                     
#> 3 l'accueil fut excellent, les explications complètes et la photo prise de…
#> 4 Steeds zeer vriendelijk                                                  
#> 5 ""                                                                       
#> 6 ""                                                                       
#> 7 Zeer vriendelijke service!

Both "geen commentaar" and "geen commentaren" get removed
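To make the cause explicit, here is a minimal sketch of why the line break matters. It assumes stringr's default behaviour, where whitespace inside a pattern is taken literally; the names `comments`, `broken`, `blank_options` and `fixed` are only illustrative. Building the pattern from a vector is one possible way to keep long patterns readable without putting a newline inside the string.

```r
library(stringr)

comments <- c("geen commentaar", "geen commentaren", "Zeer vriendelijke service!")

# A line break inside the quotes becomes a literal newline in the pattern,
# so the third alternative only matches "geen commentaren" preceded by "\n".
broken <- regex("geen\\scommentaar|geen\\sspeciale\\scommentaar|
geen\\scommentaren", ignore_case = TRUE)

# Assumed alternative: keep each option as its own string and join with "|".
blank_options <- c("geen\\scommentaar",
                   "geen\\sspeciale\\scommentaar",
                   "geen\\scommentaren")
fixed <- regex(str_c(blank_options, collapse = "|"), ignore_case = TRUE)

broken_result <- str_remove_all(comments, broken)
fixed_result  <- str_remove_all(comments, fixed)

print(broken_result)  # "geen commentaren" is still there
print(fixed_result)   # both blank statements are removed
```

With the vector approach, the options can be wrapped over several lines of code without changing the pattern itself.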

Slavek · August 22, 2019, 8:41am

Thank you so much for your response.
Problem partially resolved ("geen commentaren" is now removed). I still cannot fix this:

Also, for a weird reason, some merged string values look as they should (so with "/" divider) but others don't.

For example, final result for the fourth record is:
"Steeds zeer vriendelijkZonder problemen"

but should be:
"Steeds zeer vriendelijk/Zonder problemen"

I cannot get my head around it

Can you help?

Slavek

P.S. I really wanted to use reprex, but I have some issues which have not been resolved (https://forum.posit.co/t/a-good-chance-to-set-up-a-reprex/24357/3), so I followed your suggestion (Search results for 'datapasta topic:22701' - Posit Community) and used datapasta with small samples instead.

I have provided a sample with a few variables previously, but the solutions given on this excellent forum were not 100% helpful without taking the entire data file into account (for example, ISSUE 1 in "Mutate (not) all plus stringr issues" was a result of that).

I'm still learning (no professional training or previous experience with R) but believe me, I always try to solve the problems myself first.

I really appreciate your help and I admire your knowledge!

I think this forum is the best way of finding solutions and the most helpful forum I have found in my career!

andresrcs · August 22, 2019, 1:53pm

That is not related to your original question (the one in your topic title). As I said before, topics in this forum are not supposed to be support chats where we guide you through all your different problems at once, because that wouldn't be useful for people other than you. The idea is to have well-defined questions and answers so that other people with similar issues can find these topics and benefit from them.

That said, the cause of this problem is that the order of your commands matters when working with regular expressions. After the "NA" strings are removed, the empty fields leave runs of consecutive "/" characters, and "(.)\\1{2,}" deletes any run of three or more identical characters, separators included, before "//+" can collapse them into a single "/". If you move the line all_comment = str_remove_all(all_comment, "(.)\\1{2,}") to the end of your mutate process, the problem goes away:

source %>%
  mutate_at(vars(matches("comm|new|B1$|B3$")), ~str_remove_all(.x, "^.{1,5}$")) %>% # Remove values of 5 characters or fewer
  mutate(all_comment = paste(AComm_1, AComm_2, AComm_3, AComm_4, NEW_2, NEW_2A, NEW_4, NEW_4A, B1, B3,  
                             CComm_1, CComm_2, CComm_3, CComm_4, DComm_1, DComm_2, DComm_3, DComm_4, 
                             IComm_1, IComm_2, IComm_3, IComm_4, IComm_5, IComm_6, IComm_7, IComm_8, IComm_9, IComm_10, sep="/"),
         all_comment = str_remove_all(all_comment, blank_statements),
         all_comment = str_remove_all(all_comment, "^(neen|RAS|nom|nee|non)$"),
         all_comment = str_remove_all(all_comment, "NA"),
         all_comment = str_remove_all(all_comment, "[:cntrl:]"),
         all_comment = str_replace_all(all_comment, "\\s\\s+", " "),
         all_comment = str_replace_all(all_comment, "//+", "/"),
         all_comment = str_remove_all(all_comment, "(.)\\1{2,}"), 
         A_comment = paste(AComm_1, AComm_2, AComm_3, AComm_4),
         A_comment = str_remove_all(A_comment, blank_statements),
         A_comment = str_remove_all(A_comment, "^(neen|RAS|nom|nee|non)$"),
         A_comment = str_remove_all(A_comment, "NA"),
         A_comment = str_remove_all(A_comment, "[:cntrl:]"),
         A_comment = str_replace_all(A_comment, "\\s\\s+", " "), 
         A_comment = str_replace_all(A_comment, "//+", "/"),
         A_comment = str_remove_all(A_comment, "(.)\\1{2,}")
         )
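The effect of the ordering can be reproduced on a single string. This is only a sketch with a shortened, made-up row (five fields instead of the full set of comment columns); the variable names are illustrative.

```r
library(stringr)

# paste() turns NA into the text "NA", as in the merged column
merged <- paste("Steeds zeer vriendelijk", NA, NA, NA, "Zonder problemen", sep = "/")

# Removing "NA" leaves a run of four slashes between the two comments
no_na <- str_remove_all(merged, "NA")

# Original order: the repeated-character rule deletes the whole "////" run,
# so there is no separator left for "//+" to collapse.
original_order <- str_replace_all(str_remove_all(no_na, "(.)\\1{2,}"), "//+", "/")

# Reordered: collapse the slashes first, then remove repeated characters.
reordered <- str_remove_all(str_replace_all(no_na, "//+", "/"), "(.)\\1{2,}")

print(original_order)
print(reordered)
```

The first result has the two comments glued together, while the second keeps a single "/" between them.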

Also, just to clarify, when people here ask you to provide a "reprex", it doesn't mean that you necessarily have to use the reprex package (and, BTW, datapasta is not a substitute); you can make a proper "REproducible EXample" just as well without it. You just have to provide two things: a minimal dataset and the minimal runnable code needed to reproduce the issue on that dataset, including the necessary information on the packages used.

jcblum · August 22, 2019, 2:54pm

If you’re learning regular expressions in R and using RStudio, definitely take a look at the regexplain add-in:

It can help you solve your own problems much more quickly!

Another fantastic tool to know about is RegExr (which inspired regexplain), although it doesn’t have R-specific features (e.g., regexplain will help you with the double backslash escaping that R requires).
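As a small illustration of the double backslash point (a sketch, not taken from either tool): inside an R string literal, "\\s" is the two-character text \s, which is what the regex engine then reads as "any whitespace". The variable names are illustrative.

```r
library(stringr)

pattern <- "geen\\scommentaar"

# writeLines() shows the pattern as the regex engine receives it
writeLines(pattern)

# The R literal "\\s" holds two characters: a backslash and an "s"
n_chars <- nchar("\\s")

# The pattern matches a space between the two words
matched <- str_detect("geen commentaar", pattern)

print(n_chars)
print(matched)
```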

Slavek · August 23, 2019, 1:45pm

Excellent!
Thank you for being so patient. I will follow your suggestions in terms of small examples from now on but I have a final question about this issue.

Adding "/" also put this character at the beginning of each merged sentence and in merged sentences which should stay blank.

Is there any simple way to remove this character from the beginning of each merged sentence?

andresrcs · August 23, 2019, 2:44pm

Yes, with a regular expression, maybe like this: "^/|/$"
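A quick sketch of that pattern in use, on made-up values: "^/" matches a slash at the start of the string and "/$" one at the end, so a value that is only "/" ends up empty. The names `merged` and `trimmed` are illustrative.

```r
library(stringr)

merged <- c("/Steeds zeer vriendelijk/Zonder problemen/",
            "/",
            "zoals hiervoor/Zonder problemen")

# Remove a leading and/or trailing "/" and leave inner separators alone
trimmed <- str_remove_all(merged, "^/|/$")

print(trimmed)
```

In the pipeline above, this step would presumably go after the "//+" replacement, so that only a single leading or trailing slash remains to be stripped.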

You would save yourself a lot of time and effort if you learn about regular expressions, I would recommend you this book on the subject.

Also, since your original question has been solved, would you mind choosing a solution?

Slavek · August 23, 2019, 3:33pm

Absolutely perfect!!!

Thank you

system · August 30, 2019, 3:33pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.