Paste function without NAs and other conditions

Hi,
I have this simple data frame:

data.frame(stringsAsFactors=FALSE,
                                                                                                                                     URN = c("test1", "test2", "test3", "test4", "test5",
                                                                                                                                             "test6", "test7", "test8", "test9", "test10", "test11"),
                                                                                                                            Test.heading = c("goede uitleg", "Goede sfeer",
                                                                                                                                             "juiste behandeling", NA, "vakkundige afhandeling",
                                                                                                                                             "No comment", NA, NA, "goede uitleg", "goede uitleg", "-"),
                                                                                                                  Recommendation.comment = c("-", "xxx", NA, NA, "zzz", "uw verkoper",
                                                                                                                                             "correcte verkoop", "goede uitleg", "professioneel !!!",
                                                                                                                                             "Goeie service", "nee"),
                                                                                                                           Other.comment = c("ab", "zeer goede uitleg over de aankoop",
                                                                                                                                             "uitleg en vriendelijkheid van de verkoper", NA,
                                                                                                                                             "Bob gaf heel goede uitleg!", NA, "eerder genoemd", NA, NA,
                                                                                                                                             "geen opmerkingen\r\n", "Genoeg info")

Now I need to merge all string variables into one. I know I can do that this way:

source$comment<- paste(source$`Test heading`,source$`Recommendation comment`,source$`Other comment`, sep = ", ")

but I would like to:

  1. Ignore NAs
  2. Ignore Blanks listed in this assignment:
blank_statements <- regex("no\\scomment?|nee|neen|^\\s*n.?a.?\\s*$", ignore_case = TRUE)
  1. Ignore comments with less than 3 characters:
str_length(string = source$`Test heading`) < 3) 
str_length(string = source$`Recommendation comment`) < 3) 
str_length(string = source$`Other comment`) < 3) 
  1. Ignore comments with same characters repeated multiple times:
str_detect(source$`Test heading`, "(.)\\1{3,}")
str_detect(source$`Recommendation comment`, "(.)\\1{3,}")
str_detect(source$`Other comment`, "(.)\\1{3,}")

in all source string variables before merging them into one called 'All Comments'. Basically, the new string should not contain words and statements mentioned above so rather than:

• having this in raw 1: "goede uitleg, -, ab", we should have "goede uitleg",
• having this in raw 2: "Goede sfeer, xxx, zeer goede uitleg over de aankoop", we should have "Goede sfeer, zeer goede uitleg over de aankoop",
• having this in raw 8: "NA, goede uitleg, NA", we should have "goede uitleg"
• having this in raw 11: " -, nee, Genoeg info", we should have " Genoeg info " etc.

Can you help please?

You can gather the columns and apply all the filtering statements you want.

library(tidyverse)
source <- data.frame(stringsAsFactors=FALSE,
                     URN = c("test1", "test2", "test3", "test4", "test5",
                             "test6", "test7", "test8", "test9", "test10", "test11"),
                     Test.heading = c("goede uitleg", "Goede sfeer",
                                      "juiste behandeling", NA, "vakkundige afhandeling",
                                      "No comment", NA, NA, "goede uitleg", "goede uitleg", "-"),
                     Recommendation.comment = c("-", "xxx", NA, NA, "zzz", "uw verkoper",
                                                "correcte verkoop", "goede uitleg", "professioneel !!!",
                                                "Goeie service", "nee"),
                     Other.comment = c("ab", "zeer goede uitleg over de aankoop",
                                       "uitleg en vriendelijkheid van de verkoper", NA,
                                       "Bob gaf heel goede uitleg!", NA, "eerder genoemd", NA, NA,
                                       "geen opmerkingen\r\n", "Genoeg info")
                     )
source %>% 
    gather(x, comment, -URN) %>% 
    select(comment) %>% 
    filter(!str_detect(comment, regex("no\\scomment?|nee|neen|^\\s*n.?a.?\\s*$", ignore_case = TRUE)),
           !str_length(comment) < 3,
           !str_detect(comment, "(.)\\1{2,}"))
#>                                      comment
#> 1                               goede uitleg
#> 2                                Goede sfeer
#> 3                         juiste behandeling
#> 4                     vakkundige afhandeling
#> 5                               goede uitleg
#> 6                               goede uitleg
#> 7                                uw verkoper
#> 8                           correcte verkoop
#> 9                               goede uitleg
#> 10                             Goeie service
#> 11         zeer goede uitleg over de aankoop
#> 12 uitleg en vriendelijkheid van de verkoper
#> 13                Bob gaf heel goede uitleg!
#> 14                            eerder genoemd
#> 15                      geen opmerkingen\r\n
#> 16                               Genoeg info

Created on 2019-07-15 by the reprex package (v0.3.0)

Thank you but if you look at the result, you'll see that something is not right (comments are not merged and we have a random comment for respondent 4 which should be blank). The result should look like the following:

A tibble: 11 x 5
   URN    `Test heading`         `Recommendation comment` `Other comment`                          comment                                                    
   <chr>  <chr>                  <chr>                    <chr>                                    <chr>                                                      
 1 test1  goede uitleg           -                        ab                                       goede uitleg                                               
 2 test2  Goede sfeer            xxx                      zeer goede uitleg over de aankoop        Goede sfeer, zeer goede uitleg over de aankoop             
 3 test3  juiste behandeling     NA                       uitleg en vriendelijkheid van de verkop~ juiste behandeling, uitleg en vriendelijkheid van de verko~
 4 test4  NA                     NA                       NA                                       NA                                                         
 5 test5  vakkundige afhandeling zzz                      Bob gaf heel goede uitleg!               vakkundige afhandeling, Bob gaf heel goede uitleg!         
 6 test6  No comment             uw verkoper              NA                                       uw verkoper                                                
 7 test7  NA                     correcte verkoop         eerder genoemd                           correcte verkoop, eerder genoemd                           
 8 test8  NA                     goede uitleg             NA                                       goede uitleg                                               
 9 test9  goede uitleg           professioneel !!!        NA                                       goede uitleg, professioneel !!!                            
10 test10 goede uitleg           Goeie service            "geen opmerkingen\r\n\r\n"               "goede uitleg, Goeie service, geen opmerkingen\r\n\r\n"    
11 test11 -                      nee                      Genoeg info                              Genoeg info           

Also, I don't really know what is going on with raw 11 as comment in the 'Other comment' field was "geen opmerkingen". I cannot work out why that was converted to "geen opmerkingen\r\n\r\n" :thinking:

I misunderstood your request, is this what you want?

library(tidyverse)
source <- data.frame(stringsAsFactors=FALSE,
                     URN = c("test1", "test2", "test3", "test4", "test5",
                             "test6", "test7", "test8", "test9", "test10", "test11"),
                     Test.heading = c("goede uitleg", "Goede sfeer",
                                      "juiste behandeling", NA, "vakkundige afhandeling",
                                      "No comment", NA, NA, "goede uitleg", "goede uitleg", "-"),
                     Recommendation.comment = c("-", "xxx", NA, NA, "zzz", "uw verkoper",
                                                "correcte verkoop", "goede uitleg", "professioneel !!!",
                                                "Goeie service", "nee"),
                     Other.comment = c("ab", "zeer goede uitleg over de aankoop",
                                       "uitleg en vriendelijkheid van de verkoper", NA,
                                       "Bob gaf heel goede uitleg!", NA, "eerder genoemd", NA, NA,
                                       "geen opmerkingen\r\n", "Genoeg info")
)

source %>%
    mutate(all_comment = paste(Test.heading, Recommendation.comment, Other.comment),
           all_comment = str_remove_all(all_comment, regex("no\\scomment?|\\snee\\s|\\sneen\\s|\\s*n.?a.?\\s*", ignore_case = TRUE)),
           all_comment = str_remove_all(all_comment, "(.)\\1{2,}"),
           all_comment = str_remove_all(all_comment, "\\b[:graph:]{1,2}\\b"),
           all_comment = str_remove_all(all_comment, "-"))
#>       URN           Test.heading Recommendation.comment
#> 1   test1           goede uitleg                      -
#> 2   test2            Goede sfeer                    xxx
#> 3   test3     juiste behandeling                   <NA>
#> 4   test4                   <NA>                   <NA>
#> 5   test5 vakkundige afhandeling                    zzz
#> 6   test6             No comment            uw verkoper
#> 7   test7                   <NA>       correcte verkoop
#> 8   test8                   <NA>           goede uitleg
#> 9   test9           goede uitleg      professioneel !!!
#> 10 test10           goede uitleg          Goeie service
#> 11 test11                      -                    nee
#>                                Other.comment
#> 1                                         ab
#> 2          zeer goede uitleg over de aankoop
#> 3  uitleg en vriendelijkheid van de verkoper
#> 4                                       <NA>
#> 5                 Bob gaf heel goede uitleg!
#> 6                                       <NA>
#> 7                             eerder genoemd
#> 8                                       <NA>
#> 9                                       <NA>
#> 10                      geen opmerkingen\r\n
#> 11                               Genoeg info
#>                                                all_comment
#> 1                                           goede uitleg  
#> 2             Goede sfeer  zeer goede uitleg over  aankoop
#> 3  juiste behandelinguitleg  vriendelijkheid van  verkoper
#> 4                                                         
#> 5       vakkundige afhandeling  Bob gaf heel goede uitleg!
#> 6                                                 verkoper
#> 7                          correcte verkoop eerder genoemd
#> 8                                             goede uitleg
#> 9                              goede uitleg professioneel 
#> 10         goede uitleg Goeie service geen opmerkingen\r\n
#> 11                                             Genoeg info

Thank you but the code it's not working on my side. Maybe I should use additional packages apart from dplyr and stringr?

The error I have is:

Error in paste(Test.heading, Recommendation.comment, Other.comment) : 
  object 'Test.heading' not found

I replaced Test.heading with 'Test heading' (and the same with 'Recommendation comment', 'Other comment'.

I don't have any errors now but results are weird:

 URN    `Test heading`         `Recommendation comment` `Other comment`                           all_comment                                  
   <chr>  <chr>                  <chr>                    <chr>                                     <chr>                                        
 1 test1  goede uitleg           -                        ab                                        Test heading Recommeion comment Other comment
 2 test2  Goede sfeer            xxx                      zeer goede uitleg over de aankoop         Test heading Recommeion comment Other comment
 3 test3  juiste behandeling     NA                       uitleg en vriendelijkheid van de verkoper Test heading Recommeion comment Other comment
 4 test4  NA                     NA                       NA                                        Test heading Recommeion comment Other comment
 5 test5  vakkundige afhandeling zzz                      Bob gaf heel goede uitleg!                Test heading Recommeion comment Other comment
 6 test6  No comment             uw verkoper              NA                                        Test heading Recommeion comment Other comment
 7 test7  NA                     correcte verkoop         eerder genoemd                            Test heading Recommeion comment Other comment
 8 test8  NA                     goede uitleg             NA                                        Test heading Recommeion comment Other comment
 9 test9  goede uitleg           professioneel !!!        NA                                        Test heading Recommeion comment Other comment
10 test10 goede uitleg           Goeie service            "geen opmerkingen\r\n"                    Test heading Recommeion comment Other comment
11 test11 -                      nee                      Genoeg info                               Test heading Recommeion comment Other comment

I don't have any other ideas...

Not really

You have to use the column names of your actual dataset, if your dataset has column names with empty spaces you have to reference them between backticks i.e. `Test heading`, although using this kind of column names is not a good practice.

Thank you but is there any solution to the errors mentioned above? Is it datapasta error?
My original heading names in the source excel have spaces (Test heading) but what you can see are names with dots replacing spaces (Test.heading).
Your result looks amazing but I cannot replicate it with the error :frowning_face:

Also, I don't really know what is going on with raw 11 as comment in the 'Other comment' field was " geen opmerkingen ". I cannot work out why that was converted to " geen opmerkingen\r\n\r\n " :thinking:

This is because using non syntactic names (e.g. with spaces) is considered a bad practice in R and datapasta gets rid of them to avoid further issues.

I already gave you a walk-around solution, use backticks, i.e. `Test heading` (not the same as 'Test heading')

Those are escape sequences
\r = carriage return and
\n = newline
They appear because there are two empty lines after "geen opmerkingen".

Thank you Master!!!
Is any way of getting rid of empty lines to avoid this issue?

Also, final thing.
I realised that your code removes all double digit words rather than double digit sentences so in line 2 and 3 merged comments are without any double digit words. I need to treat comments left in the individual variables/questions invalid if there are less than 3 characters left ("xx", "a", "--" etc.)

Ok, final attempt.
All these are trivial variations of the initial solution I gave you, so If you are going to be cleaning text a lot, I recommend you to learn about regular expressions.

library(tidyverse)
source <- data.frame(stringsAsFactors=FALSE,
                     URN = c("test1", "test2", "test3", "test4", "test5",
                             "test6", "test7", "test8", "test9", "test10", "test11"),
                     Test.heading = c("goede uitleg", "Goede sfeer",
                                      "juiste behandeling", NA, "vakkundige afhandeling",
                                      "No comment", NA, NA, "goede uitleg", "goede uitleg", "-"),
                     Recommendation.comment = c("-", "xxx", NA, NA, "zzz", "uw verkoper",
                                                "correcte verkoop", "goede uitleg", "professioneel !!!",
                                                "Goeie service", "nee"),
                     Other.comment = c("ab", "zeer goede uitleg over de aankoop",
                                       "uitleg en vriendelijkheid van de verkoper", NA,
                                       "Bob gaf heel goede uitleg!", NA, "eerder genoemd", NA, NA,
                                       "geen opmerkingen\r\n", "Genoeg info")
)

source %>%
    mutate_all(~str_remove_all(.x, "^.{1,2}$")) %>% # Remove sentences with less than 3 characters
    mutate(all_comment = paste(Test.heading, Recommendation.comment, Other.comment),
           all_comment = str_remove_all(all_comment,
                                        regex("no\\scomment?|\\snee\\s|\\sneen\\s|\\s*n.?a.?\\s*", # Remove NAs
                                              ignore_case = TRUE)),
           all_comment = str_remove_all(all_comment, "(.)\\1{2,}"), # Remove repeted characters
           all_comment = str_remove_all(all_comment, "[:cntrl:]"), # Remove control characters like /n/r
           all_comment = str_replace_all(all_comment, "\\s\\s+", " ")) # Remove extra spaces
#>       URN           Test.heading Recommendation.comment
#> 1   test1           goede uitleg                       
#> 2   test2            Goede sfeer                    xxx
#> 3   test3     juiste behandeling                   <NA>
#> 4   test4                   <NA>                   <NA>
#> 5   test5 vakkundige afhandeling                    zzz
#> 6   test6             No comment            uw verkoper
#> 7   test7                   <NA>       correcte verkoop
#> 8   test8                   <NA>           goede uitleg
#> 9   test9           goede uitleg      professioneel !!!
#> 10 test10           goede uitleg          Goeie service
#> 11 test11                                           nee
#>                                Other.comment
#> 1                                           
#> 2          zeer goede uitleg over de aankoop
#> 3  uitleg en vriendelijkheid van de verkoper
#> 4                                       <NA>
#> 5                 Bob gaf heel goede uitleg!
#> 6                                       <NA>
#> 7                             eerder genoemd
#> 8                                       <NA>
#> 9                                       <NA>
#> 10                      geen opmerkingen\r\n
#> 11                               Genoeg info
#>                                                    all_comment
#> 1                                                goede uitleg 
#> 2                Goede sfeer zeer goede uitleg over de aankoop
#> 3  juiste behandelinguitleg en vriendelijkheid van de verkoper
#> 4                                                             
#> 5            vakkundige afhandeling Bob gaf heel goede uitleg!
#> 6                                                  uw verkoper
#> 7                              correcte verkoop eerder genoemd
#> 8                                                 goede uitleg
#> 9                                  goede uitleg professioneel 
#> 10                 goede uitleg Goeie service geen opmerkingen
#> 11                                                 Genoeg info
1 Like

Your help is better than hours of searching and learning. This is the best lesson for me!
Thank you again Master :smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.