Specific type of variables merged using paste or unite and maybe renamed first?

Slavek · May 21, 2021, 11:20am

Hi,
I have this simple df with comments:

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa","bbb","ccc",
          "ddd","eee","fff","ggg","hhh"),
  Name = c("xxx","xxx","yyy",
           "yyy","yyy","zzzz","abcde","zzzz"),
  AComm1 = c("None.",NA,
             "No comments related to this exercise","Na",
             "N/A","Interesting comment","abc", "whatever is fine"),
  AComm2 = c("Nothing",
             "I have nothing in common","NA",NA,
             "Another comment","....?","xxxx", "All fine"),
  BComm1 = c("Service","All good",
             "aa","I don't know",
             "The final comment about that","Nothing.","na","Everything"),
  BComm2 = c("aaa","Nothing",
             "None","My final comments are ok", "I don't know",
             "Nothing.","Another comment","really"),
  Q4 = c(2019,2020,2020,2019,
         2020,2021,2021,2019)
)

I managed to achieve something like this (with your help):

library(dplyr)
library(stringr)
library(tidyr)

blank_statements <- regex("^(None.?|No\\scomments?.?|N.?A|Nothing.?)$", ignore_case = TRUE)

merged.comments <- source %>%
  mutate_if(~is.character(.) & any(nchar(.) > 15, na.rm = TRUE),
            ~str_remove_all(.x, blank_statements)) %>%
  mutate_if(~is.character(.) & any(nchar(.) > 15, na.rm = TRUE),
            ~str_remove_all(.x, "^.{1,5}$")) %>%
  unite("all_comments", where(~is.character(.x) & any(nchar(.x) > 15, na.rm = TRUE)), sep = "/", remove = FALSE, na.rm = FALSE) %>% # adjust na.rm argument as needed
  mutate(all_comments = str_remove_all(all_comments, "NA"), # Removes the text "NA" left by missing values (also strips "NA" inside longer words)
         all_comments = str_remove_all(all_comments, "[:cntrl:]"), # Removes control characters like \n and \r
         all_comments = str_replace_all(all_comments, "\\s\\s+", " "), # Collapses extra spaces into one
         all_comments = str_replace_all(all_comments, "//+", "/"), # Collapses duplicated / into one
         all_comments = str_remove(all_comments, "/$"), # Removes / at the end
         all_comments = str_remove(all_comments, "^/")) # Removes / at the beginning
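
To see what the blank_statements pattern does and does not catch, here is a minimal sketch. The `examples` vector and the `is_blank` name are made up for illustration; the values are taken from the sample data. Because the pattern is anchored with ^ and $, only cells that consist entirely of a blank statement are matched, so a longer sentence that merely starts with "No comments" is kept.

```r
library(stringr)

blank_statements <- regex("^(None.?|No\\scomments?.?|N.?A|Nothing.?)$", ignore_case = TRUE)

# Hypothetical check vector built from values in the sample data
examples <- c("None.", "Na", "N/A", "Nothing",
              "No comments related to this exercise", "Interesting comment")

# TRUE: the whole cell is a blank statement and would be emptied
# FALSE: the cell is left untouched by str_remove_all()
is_blank <- str_detect(examples, blank_statements)
names(is_blank) <- examples
is_blank
```

The second cleaning step, "^.{1,5}$", works the same way: it only empties cells whose entire content is one to five characters long.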

What I need is similar but I need three variables with merged comments instead of just one:

all_comments: The same as above but I think I can simplify the code stating "any variable including Comm in its name"
A_comments: The same logic but merging variables including AComm only
B_comments: The same logic but merging variables including BComm only

I think I can replace this:

~is.character(.) & any(nchar(.) > 15, na.rm = TRUE)

by something stating that variables should include Comm but do I need to repeat all last 5 lines of the code for A_comments and B_comments?

Can you help me rewrite and simplify the code above so that it produces the three new variables listed?

Once the code is ready I would like to keep it universal for other datasets.
Let's imagine we do not have comment variables clearly described by their name and instead of AComm1, AComm2, BComm1, BComm2 we have just A1, A2, B1 and B2.

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa","bbb","ccc",
          "ddd","eee","fff","ggg","hhh"),
  Name = c("xxx","xxx","yyy",
           "yyy","yyy","zzzz","abcde","zzzz"),
  A1 = c("None.",NA,
         "No comments related to this exercise","Na",
         "N/A","Interesting comment","abc", "whatever is fine"),
  A2 = c("Nothing",
         "I have nothing in common","NA",NA,
         "Another comment","....?","xxxx", "All fine"),
  B1 = c("Service","All good",
         "aa","I don't know",
         "The final comment about that","Nothing.","na","Everything"),
  B2 = c("aaa","Nothing",
         "None","My final comments are ok", "I don't know",
         "Nothing.","Another comment","really"),
  Q4 = c(2019,2020,2020,2019,
         2020,2021,2021,2019)
)

Can we add an initial step and add "Comm" in the beginning or in the end of each variable name which is character and has responses longer than 15 characters (so URN and Name would be excluded from merging)?
How would I rename just these variables?

where(~is.character(.x) & any(nchar(.x) > 15))

Is this task challenging?

Thank you for your help.

andresrcs · May 22, 2021, 12:42am

You can use the across() and contains() functions:

library(tidyverse)

source <- data.frame(
    stringsAsFactors = FALSE,
    URN = c("aaa","bbb","ccc",
            "ddd","eee","fff","ggg","hhh"),
    Name = c("xxx","xxx","yyy",
             "yyy","yyy","zzzz","abcde","zzzz"),
    AComm1 = c("None.",NA,
               "No comments related to this exercise","Na",
               "N/A","Interesting comment","abc", "whatever is fine"),
    AComm2 = c("Nothing",
               "I have nothing in common","NA",NA,
               "Another comment","....?","xxxx", "All fine"),
    BComm1 = c("Service","All good",
               "aa","I don't know",
               "The final comment about that","Nothing.","na","Everything"),
    BComm2 = c("aaa","Nothing",
               "None","My final comments are ok", "I don't know",
               "Nothing.","Another comment","really"),
    Q4 = c(2019,2020,2020,2019,
           2020,2021,2021,2019)
)

blank_statements <- regex("^(None.?|No\\scomments?.?|N.?A|Nothing.?)$", ignore_case = TRUE)

source %>%
    # Empty cells that are only a blank statement, then cells of 1 to 5 characters
    mutate(across(contains("Comm"), ~str_remove_all(.x, blank_statements)),
           across(contains("Comm"), ~str_remove_all(.x, "^.{1,5}$"))) %>%
    # Build the three merged variables; remove = FALSE keeps the original columns
    unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
    unite("A_comments", starts_with("AComm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
    unite("B_comments", starts_with("BComm"), sep = "/", remove = FALSE, na.rm = FALSE) %>%
    # Tidy the merged columns only (their names contain "comments")
    mutate(across(contains("comments"), ~ str_remove_all(.x, "NA")),               # text "NA" left by missing values
           across(contains("comments"), ~ str_remove_all(.x, "[:cntrl:]")),        # control characters
           across(contains("comments"), ~ str_replace_all(.x, "\\s\\s+", " ")),    # extra spaces
           across(contains("comments"), ~ str_replace_all(.x, "//+", "/")),        # duplicated /
           across(contains("comments"), ~ str_remove(.x, "/$")),                   # trailing /
           across(contains("comments"), ~ str_remove(.x, "^/"))) %>%               # leading /
    select(URN, Name, all_comments, A_comments, B_comments, everything())
#>   URN  Name                                              all_comments
#> 1 aaa   xxx                                                   Service
#> 2 bbb   xxx                         I have nothing in common/All good
#> 3 ccc   yyy                      No comments related to this exercise
#> 4 ddd   yyy                     I don't know/My final comments are ok
#> 5 eee   yyy Another comment/The final comment about that/I don't know
#> 6 fff  zzzz                                       Interesting comment
#> 7 ggg abcde                                           Another comment
#> 8 hhh  zzzz               whatever is fine/All fine/Everything/really
#>                             A_comments
#> 1                                     
#> 2             I have nothing in common
#> 3 No comments related to this exercise
#> 4                                     
#> 5                      Another comment
#> 6                  Interesting comment
#> 7                                     
#> 8            whatever is fine/All fine
#>                                  B_comments
#> 1                                   Service
#> 2                                  All good
#> 3                                          
#> 4     I don't know/My final comments are ok
#> 5 The final comment about that/I don't know
#> 6                                          
#> 7                           Another comment
#> 8                         Everything/really
#>                                 AComm1                   AComm2
#> 1                                                              
#> 2                                 <NA> I have nothing in common
#> 3 No comments related to this exercise                         
#> 4                                                          <NA>
#> 5                                               Another comment
#> 6                  Interesting comment                         
#> 7                                                              
#> 8                     whatever is fine                 All fine
#>                         BComm1                   BComm2   Q4
#> 1                      Service                          2019
#> 2                     All good                          2020
#> 3                                                       2020
#> 4                 I don't know My final comments are ok 2019
#> 5 The final comment about that             I don't know 2020
#> 6                                                       2021
#> 7                                       Another comment 2021
#> 8                   Everything                   really 2019

Created on 2021-05-22 by the reprex package (v2.0.0)
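
The repeated across() calls can also be folded into one cleaning function. The following is only a sketch of that idea, not the code from this reply: `clean_comment()`, the small `comments` data frame and `merged` are hypothetical names, and the approach differs in that cells are cleaned and turned into NA before merging, so unite(na.rm = TRUE) drops them and no stray "NA" text or extra slashes need to be removed afterwards.

```r
library(dplyr)
library(stringr)
library(tidyr)

blank_statements <- regex("^(None.?|No\\scomments?.?|N.?A|Nothing.?)$", ignore_case = TRUE)

# Hypothetical helper: returns NA for cells that carry no real comment
clean_comment <- function(x) {
  x <- str_replace_all(x, "[[:cntrl:]]", " ")  # control characters such as \n and \r
  x <- str_squish(x)                           # trim and collapse repeated whitespace
  x <- str_remove_all(x, blank_statements)     # whole-cell blank statements
  x <- str_remove_all(x, "^.{1,5}$")           # cells of 1 to 5 characters
  na_if(x, "")                                 # emptied cells become NA
}

# Small made-up data frame with the same column naming as the example
comments <- data.frame(
  URN    = c("aaa", "bbb", "ccc"),
  AComm1 = c("None.", NA, "Interesting comment"),
  AComm2 = c("Nothing", "I have nothing in common", "....?"),
  BComm1 = c("Service", "All good", "Nothing."),
  BComm2 = c("aaa", "Nothing", "Another comment"),
  stringsAsFactors = FALSE
)

merged <- comments %>%
  mutate(across(contains("Comm"), clean_comment)) %>%
  # na.rm = TRUE skips the NA cells, so no cleanup of the merged text is needed
  unite("all_comments", contains("Comm"), sep = "/", remove = FALSE, na.rm = TRUE) %>%
  unite("A_comments", starts_with("AComm"), sep = "/", remove = FALSE, na.rm = TRUE) %>%
  unite("B_comments", starts_with("BComm"), sep = "/", remove = FALSE, na.rm = TRUE)

merged
```

One difference to be aware of: in this variant the original comment columns end up holding NA rather than empty strings where a cell was blanked.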

Slavek · May 24, 2021, 8:17am

Excellent solution!

To make this code more general, so that it can be applied to any data frame containing string variables with long responses, I would like to rename the string variables that meet the criteria by adding Comm to their names before running your brilliant code.
Adding a step at the beginning that uses something like this:

where(~is.character(.x) & any(nchar(.x) > 15))

would perhaps do that. This is the second part of my initial request.
Is it easy?
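
One possible way to sketch such a renaming step is dplyr's rename_with() combined with where(). This is an illustration under assumptions, not a tested answer from the thread: the `survey` data frame and the `is_long_text()` predicate are hypothetical, and the suffix "Comm" is appended rather than prepended.

```r
library(dplyr)

# Made-up data frame without "Comm" in the column names
survey <- data.frame(
  URN  = c("aaa", "bbb"),
  Name = c("xxx", "yyy"),
  A1   = c("None.", "No comments related to this exercise"),
  B1   = c("The final comment about that", NA),
  Q4   = c(2019, 2020),
  stringsAsFactors = FALSE
)

# Hypothetical predicate: a character column with at least one response
# longer than 15 characters; na.rm = TRUE keeps missing values from
# turning the result into NA
is_long_text <- function(x) {
  is.character(x) && any(nchar(x) > 15, na.rm = TRUE)
}

# Append "Comm" to the names of the qualifying columns only
renamed <- survey %>%
  rename_with(~ paste0(.x, "Comm"), .cols = where(is_long_text))

names(renamed)
```

With a suffix, the later selections would need ends_with("Comm") for all comment columns and something like matches("^A.*Comm$") for the A group. The check has to run before any cleaning step, since blanking cells can shorten the longest response in a column.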

andresrcs · May 24, 2021, 12:58pm

As I explained to you before, topics here must contain only one well-defined question, not a list of steps to complete a task, since that benefits only you and nobody else. Please ask that question in a new topic with a relevant title and a reprex.

Slavek · May 25, 2021, 10:17am

I will. Thank you for your advice.
You are a great supporter andresrcs!
