Hi R Masters,
I have this challenge today.
I solved it, but the code looks awful and I'm sure it can be simplified significantly.
I have a simple dummy data frame with 3 character variables (with many spelling mistakes), and I want to recode each of them into specific categories in a new variable with a Rec prefix.
I had some issues with regex not picking up some words, so my code ended up long and complicated.
I also specified RecQ9a, RecQ9b and RecQ9c manually, one by one:
df <- data.frame(stringsAsFactors=FALSE,
URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii",
"jjj"),
Q9a = c("Satisfied", "Contentsatifiedstressfree",
"Happy satisfied please d", "Sattisfied",
"Veryeasytoarrangeappoinem", "Ease", "Reasonable", "Dissatisfied", "Happy",
"satisfying"),
Q9b = c(" satisfying", NA, NA, "Easy, good", "Reasonable", "Fabulous", " Profesionable",
"Unimpressed", " Reassured", "Safe"),
Q9c = c("Confident", NA, NA, "Better", "Professional", "Professional",
"Timing", "Disappointef", "Sattisfied", "Enjoyable")
)
df
library(dplyr)   # needed for %>%, mutate() and case_when()
library(stringr)
results <- df %>% mutate(
RecQ9a = case_when(
str_detect(Q9a, regex("Satisfied|sattisfied|satissfied|.Satisfied?|.Sattisfied?|.Satissfied?|.satisfying|Satisfying|.Satisfying|Satisified|.Satisified|Satified|.Satified|Satisfaction|.Satisfaction",
ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9a, regex(".Disatisfied?|.Dissatisfied?|.Dissattisfied?|.unsatisfied?|.unssatisfied?|.unsattisfied?|.unsatisfactory?|.unsatifactory?",
ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
str_detect(Q9a, regex("Professional|Proffesional|.Professional?|.Proffesional?|Profesionable|.Profesionable|Proffessional|.Proffessional|Profesional|.Profesional", ignore_case = TRUE, multiline = TRUE)) ~ "Professional",
str_detect(Q9a, regex("Easy|.Easy|Ease|.Ease", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9a, regex("reasonable", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9a, regex("reassured|reassuring|reasure|reasured", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9a, regex("pleased", ignore_case = TRUE, multiline = TRUE)) ~ "Easy"),
RecQ9b = case_when(
str_detect(Q9b, regex("Satisfied|sattisfied|satissfied|.Satisfied?|.Sattisfied?|.Satissfied?|.satisfying|Satisfying|.Satisfying|Satisified|.Satisified|Satified|.Satified|Satisfaction|.Satisfaction",
ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9b, regex(".Disatisfied?|.Dissatisfied?|.Dissattisfied?|.unsatisfied?|.unssatisfied?|.unsattisfied?|.unsatisfactory?|.unsatifactory?",
ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
str_detect(Q9b, regex("Professional|Proffesional|.Professional?|.Proffesional?|Profesionable|.Profesionable|Proffessional|.Proffessional|Profesional|.Profesional", ignore_case = TRUE, multiline = TRUE)) ~ "Professional",
str_detect(Q9b, regex("Easy|.Easy|Ease|.Ease", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9b, regex("reasonable", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9b, regex("reassured|reassuring|reasure|reasured", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9b, regex("pleased", ignore_case = TRUE, multiline = TRUE)) ~ "Easy"),
RecQ9c = case_when(
str_detect(Q9c, regex("Satisfied|sattisfied|satissfied|.Satisfied?|.Sattisfied?|.Satissfied?|.satisfying|Satisfying|.Satisfying|Satisified|.Satisified|Satified|.Satified|Satisfaction|.Satisfaction",
ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9c, regex(".Disatisfied?|.Dissatisfied?|.Dissattisfied?|.unsatisfied?|.unssatisfied?|.unsattisfied?|.unsatisfactory?|.unsatifactory?",
ignore_case = TRUE, multiline = TRUE)) ~ "Satisfied",
str_detect(Q9c, regex("Professional|Proffesional|.Professional?|.Proffesional?|Profesionable|.Profesionable|Proffessional|.Proffessional|Profesional|.Profesional", ignore_case = TRUE, multiline = TRUE)) ~ "Professional",
str_detect(Q9c, regex("Easy|.Easy|Ease|.Ease", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9c, regex("reasonable", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9c, regex("reassured|reassuring|reasure|reasured", ignore_case = TRUE, multiline = TRUE))
& !str_detect(Q9c, regex("pleased", ignore_case = TRUE, multiline = TRUE)) ~ "Easy"))
results
What I need is:
- I have Q9a, Q9b and Q9c variables in this file, but in other files they might be Q8a, Q8b, Q8c, so I would like the code to work on any variables ending with a, b or c (so I can apply it to different data)
- I would like to use one list of case_when conditions rather than repeating the same thing for the first, second and third variable to be analysed (Q9a, Q9b and Q9c in this case)
- I would like to simplify the regex, as I have a feeling that matching "satis" and "sattis" while excluding words starting with "dis" and "un" (like "dissatisfied", "unsattisfied" etc.) would work better than listing all possible spellings. I had similar issues with "Easy", where many irrelevant words (such as "reassured") were picked up, so I had to make the code long and complicated.
Can you help?
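
One possible direction, as a rough sketch rather than a definitive answer: put the three rules into a single helper and apply it to every matching column with across(). The helper name recode_answer, the shortened regular expressions and the column pattern "^Q\\d+[abc]$" are all assumptions made for illustration; the patterns are only checked against the spellings listed above and would need testing on real data. It also does not reproduce the long version exactly: "pleased" is handled with a negative lookbehind instead of excluding the whole answer.

```r
library(dplyr)
library(stringr)

# Same dummy data as in the post
df <- data.frame(stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii", "jjj"),
  Q9a = c("Satisfied", "Contentsatifiedstressfree", "Happy satisfied please d",
          "Sattisfied", "Veryeasytoarrangeappoinem", "Ease", "Reasonable",
          "Dissatisfied", "Happy", "satisfying"),
  Q9b = c(" satisfying", NA, NA, "Easy, good", "Reasonable", "Fabulous",
          " Profesionable", "Unimpressed", " Reassured", "Safe"),
  Q9c = c("Confident", NA, NA, "Better", "Professional", "Professional",
          "Timing", "Disappointef", "Sattisfied", "Enjoyable")
)

# Hypothetical helper: recodes one character vector into the three categories.
# The patterns are assumptions, not a complete list of possible misspellings.
recode_answer <- function(x) {
  x <- str_to_lower(x)  # lower-case once instead of ignore_case in every regex
  case_when(
    # satisf / sattisf / satissf / satisif / satif ..., unless the word
    # carries a "dis" or "un" prefix (dissatisfied, unsattisfied, ...)
    str_detect(x, "sat+is*i?f") &
      !str_detect(x, "(dis|un)s*at+is*i?f") ~ "Satisfied",
    # professional / proffesional / profesionable / proffessional ...
    str_detect(x, "prof+es+ion") ~ "Professional",
    # "easy" or "ease" anywhere, but not the "ease" inside "please(d)";
    # "reasonable" and "reassured" do not contain "ease"/"easy" at all
    str_detect(x, "(?<!pl)eas[ey]") ~ "Easy"
    # anything else, including NA input, falls through to NA
  )
}

# Apply the helper to every column named Q<number>a/b/c (assumed naming)
# and store each result under the same name with a Rec prefix.
results <- df %>%
  mutate(across(matches("^Q\\d+[abc]$"), recode_answer, .names = "Rec{.col}"))

results
```

On the dummy data this should give the same RecQ9a, RecQ9b and RecQ9c values as the long version. For a file with Q8a, Q8b and Q8c the same call should create RecQ8a, RecQ8b and RecQ8c without any changes, as long as the column names follow the assumed pattern.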