Use String to determine value in corresponding column

zmcclean · October 30, 2023, 4:23pm

I have the following example data set from a survey about injuries and the side on which they occurred. Each injury has its own column (Lat_Ankle_Sprain, Knee_Sprain, Shoulder_Sprain), abbreviated for easier coding, with values "Left", "Right", "Both", or "NA". Respondents then report which of these injuries restricted their sport participation (Injury_RestrictedParticipation); one, several, or none of the injuries may be listed. The injuries listed in Injury_RestrictedParticipation are not abbreviated (see the code chunk below):

library(tidyverse)
library(stringr)

df <- structure(list(
  Lat_Ankle_Sprain = c("Left", "Right", "Left", "Left", "NA", "NA"),
  Knee_Sprain = c("Right", "Right", "Left", "Both", "NA", "Left"),
  Shoulder_Sprain = c("Right", "Left", "NA", "Right", "Right", "Right"),
  Injury_RestrictedParticipation = c(
    "Lateral Ankle Sprain (i.e., inversion sprain),Knee Sprain (or ligament tear)",
    "Knee Sprain (or ligament tear),Lateral Ankle Sprain (i.e., inversion sprain)",
    "Knee Sprain (or ligament tear)",
    "Knee Sprain (or ligament tear),Shoulder or Arm Strain (e.g., rotator cuff, biceps),Lateral Ankle Sprain (i.e., inversion sprain)",
    "NA",
    "Shoulder or Arm Strain (e.g., rotator cuff, biceps)"
  )
), class = "data.frame", row.names = c(NA, -6L))

                                                
abbreviations <- c("Lateral Ankle Sprain (i.e., inversion sprain)" = "Lat_Ankle_Sprain",
                   "Knee Sprain (or ligament tear)" = "Knee_Sprain",
                   "Shoulder or Arm Strain (e.g., rotator cuff, biceps)" =  "Shoulder_Sprain")

The goal is to determine which side the primary injury occurred on and store it in a column called "Primary_Injury_Side".
To do this, I want to take the list of injuries in Injury_RestrictedParticipation, find the corresponding columns, and look up the side on which each injury occurred. If all the listed injuries occurred on the same side (e.g., Left), then that side is the Primary_Injury_Side (e.g., Left). If multiple injuries are listed and they occurred on different sides, Primary_Injury_Side should be "Both". If no injuries are listed, Primary_Injury_Side should be "None".

I will provide my code that is not quite working below.

get_primary_side <- function(injury_names) {
  injuries <- strsplit(as.character(injury_names), "(?<=\\.(?:i\\.e\\.|e\\.g\\.)),", perl = TRUE)[[1]] # intended to split the injury list on commas while ignoring commas that follow e.g. or i.e. inside an injury name
  side_values <- sapply(injuries, function(injury) get_side_for_injury(injury))
  
  unique_sides <- unique(side_values[!is.na(side_values)])
  if (length(unique_sides) == 1) {
    return(as.character(unique_sides))
  } else {
    return("Both")
  }
}

get_side_for_injury <- function(injury) {
  side_value <- get_abbrev_column_names(injury)
  if (!is.na(side_value)) {
    return(side_value)
  } else {
    return(NA)
  }
}

get_abbrev_column_names <- function(injury_name) {
  return(abbreviations[injury_name])
}

Injury_df <- df %>% 
             mutate(Primary_Injury_Side = case_when(
             is.na(Injury_RestrictedParticipation) ~ "None",
             TRUE ~ get_primary_side(Injury_RestrictedParticipation)))

A second aim of mine is to create a column called LowerBody_Injury_Side, which uses Injury_RestrictedParticipation and follows the same logic as above but ignores the columns associated with the upper body.

Thank you in advance for your help on this! I am happy to provide any additional information that may be helpful.

bcavinee · November 3, 2023, 11:42pm

Hello, I think this solution should get you the data frame you are looking for. I replaced each comma outside of the parentheses with a bar, then split that column into three columns, one for each injury. Next, I mapped the full injury names to the abbreviated column names given in the abbreviations vector. With those names I could look up the value in the original column that states which side the injury occurred on. From there, a case_when statement sets the Primary_Injury_Side column to right, left, both, or none, based on the number of unique values in the three injury columns.
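To show the comma replacement step in isolation, here is a minimal sketch using one of the response strings from the question. The variable names x, barred, and parts are only for illustration; the regular expression is the same one used in the full solution.

```r
library(stringr)

# One response listing two injuries, both containing parentheses
x <- "Knee Sprain (or ligament tear),Shoulder or Arm Strain (e.g., rotator cuff, biceps)"

# The negative lookahead skips any comma that is followed by a ")" with no
# "(" in between, i.e. any comma that sits inside parentheses
barred <- str_replace_all(x, regex(",(?![^()]*\\))"), "|")

# Splitting on the bar now gives one element per injury
parts <- str_split(barred, fixed("|"))[[1]]
parts
```

Only the comma between the two injuries is replaced, so the commas after "e.g." and "rotator cuff" stay part of the injury name.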

library(tidyverse)
library(stringr)

df <- structure(list(
  Lat_Ankle_Sprain = c("Left", "Right", "Left", "Left", "NA", "NA"),
  Knee_Sprain = c("Right", "Right", "Left", "Both", "NA", "Left"),
  Shoulder_Sprain = c("Right", "Left", "NA", "Right", "Right", "Right"),
  Injury_RestrictedParticipation = c(
    "Lateral Ankle Sprain (i.e., inversion sprain),Knee Sprain (or ligament tear)",
    "Knee Sprain (or ligament tear),Lateral Ankle Sprain (i.e., inversion sprain)",
    "Knee Sprain (or ligament tear)",
    "Knee Sprain (or ligament tear),Shoulder or Arm Strain (e.g., rotator cuff, biceps),Lateral Ankle Sprain (i.e., inversion sprain)",
    "NA",
    "Shoulder or Arm Strain (e.g., rotator cuff, biceps)"
  )
), class = "data.frame", row.names = c(NA, -6L))


abbreviations <- c("Lateral Ankle Sprain (i.e., inversion sprain)" = "Lat_Ankle_Sprain",
                   "Knee Sprain (or ligament tear)" = "Knee_Sprain",
                   "Shoulder or Arm Strain (e.g., rotator cuff, biceps)" =  "Shoulder_Sprain")                     


new_df_two <- df %>%
  # Replace only the commas outside parentheses with "|", then split on "|"
  mutate(injury_split_bar = str_replace_all(Injury_RestrictedParticipation, regex(",(?![^()]*\\))"), "|")) %>%
  separate(injury_split_bar, into = paste0("A", seq_along(abbreviations)), sep = "\\|", fill = "right") %>%
  # Map each full injury name to its abbreviated column name
  mutate(across(A1:A3, ~ abbreviations[.x])) %>%
  # Look up the side recorded in the matching injury column
  mutate(across(A1:A3, ~ case_when(.x == "Lat_Ankle_Sprain" ~ Lat_Ankle_Sprain,
                                   .x == "Knee_Sprain" ~ Knee_Sprain,
                                   .x == "Shoulder_Sprain" ~ Shoulder_Sprain))) %>%
  rowwise() %>%
  # Count the distinct sides in each row, ignoring missing values
  mutate(unique_value = n_distinct(c(A1, A2, A3), na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Primary_Injury_Side = case_when(unique_value >= 2 ~ "Both",
                                         unique_value == 1 ~ A1,
                                         unique_value == 0 ~ "None")) %>%
  select(-unique_value)
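
For the second aim in the question (LowerBody_Injury_Side), one possible approach is to apply the same rule but keep only the lower-body columns before counting sides. The sketch below is not tested beyond the rows shown; side_for, lower_body_cols, and demo are hypothetical names, and it assumes that Lat_Ankle_Sprain and Knee_Sprain are the lower-body columns and that a side recorded as the string "NA" should be ignored.

```r
library(stringr)

# Full injury names mapped to the abbreviated column names (from the question)
abbreviations <- c(
  "Lateral Ankle Sprain (i.e., inversion sprain)" = "Lat_Ankle_Sprain",
  "Knee Sprain (or ligament tear)" = "Knee_Sprain",
  "Shoulder or Arm Strain (e.g., rotator cuff, biceps)" = "Shoulder_Sprain"
)

# Assumption: these are the columns that count as lower body
lower_body_cols <- c("Lat_Ankle_Sprain", "Knee_Sprain")

# Hypothetical helper: `listed` is one Injury_RestrictedParticipation string,
# `row` is a named character vector holding that row's side columns
side_for <- function(listed, row, keep_cols) {
  # Split on commas outside parentheses
  injuries <- str_split(listed, ",(?![^()]*\\))")[[1]]
  # Translate to column names and keep only the requested body region
  cols <- unname(abbreviations[injuries])
  cols <- cols[!is.na(cols) & cols %in% keep_cols]
  # Collect the distinct sides, dropping missing values and the string "NA"
  sides <- unique(row[cols])
  sides <- sides[!is.na(sides) & sides != "NA"]
  if (length(sides) == 0) "None" else if (length(sides) == 1) sides[[1]] else "Both"
}

# Three rows copied from the question's data (rows 3, 4 and 6)
demo <- data.frame(
  Lat_Ankle_Sprain = c("Left", "Left", "NA"),
  Knee_Sprain = c("Left", "Both", "Left"),
  Shoulder_Sprain = c("NA", "Right", "Right"),
  Injury_RestrictedParticipation = c(
    "Knee Sprain (or ligament tear)",
    "Knee Sprain (or ligament tear),Shoulder or Arm Strain (e.g., rotator cuff, biceps),Lateral Ankle Sprain (i.e., inversion sprain)",
    "Shoulder or Arm Strain (e.g., rotator cuff, biceps)"
  ),
  stringsAsFactors = FALSE
)

# Apply the helper row by row
demo$LowerBody_Injury_Side <- vapply(seq_len(nrow(demo)), function(i) {
  row <- unlist(demo[i, unname(abbreviations)])
  side_for(demo$Injury_RestrictedParticipation[i], row, lower_body_cols)
}, character(1))

demo$LowerBody_Injury_Side
```

On these three rows the helper gives "Left", "Both", and "None": the last row lists only a shoulder injury, so nothing remains once the upper-body column is dropped. Passing all three column names as keep_cols should reproduce the Primary_Injury_Side logic.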
