Create a new variable based on multiple columns

Mayasak · September 1, 2023, 5:26am

Hi,

I have the following data frame:
LAB <- data.frame("ResOrg" = c("Positive", "Detected", "ESCHERICHIA COLI-CRE", "Ebola Disease", "Legionella Finding", " ", "Ecoli Detected", "Salmonella organism", " ", " "),
"TestResult" = c("ECOCRE", "Acinetobacter baumannii", " ", "Ecoli", "MSN", "CARBAPENEM RESISTANT KLEBSIELLA PNEUMONIAE", "CRKP", "Ebola","Candida auris"," "),
"Result" = c("CITDCR", "ETACRE", "Test", "KPS", " ", "Ecoli", "KLEPNE", "Ebola"," ","ENTEROBACTER (KLEBSIELLA) AERO"),
"Test" = c("Ebola ", "Ecoli KPS", "Organism", " ", "Klebsiella", "MSN", " ", "NDM", "KLEPNE", " "),
"Notes" = c(" ", " ", "SPECIMEN NOTES:258531008 WOUND SWAB L 20130731 V1: Disease RESISTANT EcoliTesting", "SPECIMEN:COLLECTION METHOD ID: URINE - E COLIGREATER THAN 100,000 COLONY FORMING UNITS PER MLOTHER", " ", "Positive for disease Producing", " ", "Tested for NDM, KPS and MSN", "KPS", "Carbapenem ResistantKlebsiella"))

I want to create a new column based on the existing columns as follows:

LAB_Org<-LAB %>% #order matters, the overriding text string should be placed at the top (e.g., code Klebsiella aerogenes before Enterobacter aerogenes)
mutate(Organism = (case_when(
is.na(ResOrg) ~ "No Organism",
grepl("Ecoli Detected", ResOrg, ignore.case = TRUE)~"Escherichia coli",
grepl("Candida auris", ResOrg, ignore.case = TRUE)~"Candida auris",
grepl("ENTEROBACTER (KLEBSIELLA) AERO", ResOrg, ignore.case = TRUE)~"Klebsiella aerogenes",
grepl("CARBAPENEM RESISTANT KLEBSIELLA PNEUMONIAE", ResOrg, ignore.case = TRUE)~"Klebsiella pneumoniae",
grepl("Legionella Finding", ResOrg, ignore.case = TRUE)~"Legionella",
grepl("Ecoli", ResOrg, ignore.case = TRUE)~"Escherichia coli",
grepl("Ebola Disease", ResOrg, ignore.case = TRUE)~"Ebola",
grepl("CARBAPENEM RESISTANT ENTEROBACTER (KLEBSIELLA) AEROGENES", ResOrg, ignore.case = TRUE)~"Klebsiella aerogenes",
grepl("ESCHERICHIA COLI-CRE", ResOrg, ignore.case = TRUE)~"Escherichia coli",
grepl("Salmonella organism", ResOrg, ignore.case = TRUE)~"Salmonella",
TRUE~"No Organism")))

LAB_Org2<-LAB_Org %>%
mutate(Organism= (case_when(
grepl("ECOCRE", TestResult)~"Escherichia coli",
grepl("CITDCR", TestResult)~"Citrobacter Koseri",
grepl("CRKP", TestResult)~"Klebsiella pneumoniae",
grepl("KLEPNE", TestResult)~"Klebsiella pneumoniae",
grepl("ETACRE", TestResult)~"Klebsiella aerogenes",
TRUE~Organism)))
LAB_Org3<-LAB_Org2 %>%
mutate(Organism= (case_when(
grepl("ECOCRE", Result)~"Escherichia coli",
grepl("CITDCR", Result)~"Citrobacter Koseri",
grepl("CRKP", Result)~"Klebsiella pneumoniae",
grepl("KLEPNE", Result)~"Klebsiella pneumoniae",
grepl("ETACRE", Result)~"Klebsiella aerogenes",
TRUE~Organism)))
LAB_Org4<-LAB_Org %>%
mutate(Organism= (case_when(
grepl("Disease RESISTANT Ecoli", Notes)~"Escherichia coli",
grepl("E COLIGREATER", Notes)~"Escherichia coli",
grepl("Carbapenem ResistantKlebsiella", Notes)~"Klebsiella",
TRUE~Organism)))

There are issues with this way:

RestOrg is the column that has the most values that need to be extracted into the new column. Values (organisms) need to be extracted from other columns only when the RestOrg value is missing.
I need a more efficient code since I have a very extensive list of values to be extracted into the new column.
The values that were extracted in the second step overwrote the values in the first step.
I need to add a code that takes care of positive and/or detected in "ResOrg" and to extract values from the "Test" column.

Thank you

technocrat · September 1, 2023, 9:01am

I'd start with fixing the " " values and knocking everything down to lowercase

LAB <- data.frame("ResOrg" = c("Positive", "Detected", "ESCHERICHIA COLI-CRE", "Ebola Disease", "Legionella Finding", " ", "Ecoli Detected", "Salmonella organism", " ", " "),
                  "TestResult" = c("ECOCRE", "Acinetobacter baumannii", " ", "Ecoli", "MSN", "CARBAPENEM RESISTANT KLEBSIELLA PNEUMONIAE", "CRKP", "Ebola","Candida auris"," "),
                  "Result" = c("CITDCR", "ETACRE", "Test", "KPS", " ", "Ecoli", "KLEPNE", "Ebola"," ","ENTEROBACTER (KLEBSIELLA) AERO"),
                  "Test" = c("Ebola ", "Ecoli KPS", "Organism", " ", "Klebsiella", "MSN", " ", "NDM", "KLEPNE", " "),
                  "Notes" = c(" ", " ", "SPECIMEN NOTES:258531008 WOUND SWAB L 20130731 V1: Disease RESISTANT EcoliTesting", "SPECIMEN:COLLECTION METHOD ID: URINE - E COLIGREATER THAN 100,000 COLONY FORMING UNITS PER MLOTHER", " ", "Positive for disease Producing", " ", "Tested for NDM, KPS and MSN", "KPS", "Carbapenem ResistantKlebsiella"))


replace_blanks <- function(x) gsub("^[^:alpha:]$",NA,LAB[,1])

LAB <- apply(LAB,2,replace_blanks)
LAB <- apply(LAB,2,tolower)
LAB
#>       ResOrg                 TestResult             Result                
#>  [1,] "positive"             "positive"             "positive"            
#>  [2,] "detected"             "detected"             "detected"            
#>  [3,] "escherichia coli-cre" "escherichia coli-cre" "escherichia coli-cre"
#>  [4,] "ebola disease"        "ebola disease"        "ebola disease"       
#>  [5,] "legionella finding"   "legionella finding"   "legionella finding"  
#>  [6,] NA                     NA                     NA                    
#>  [7,] "ecoli detected"       "ecoli detected"       "ecoli detected"      
#>  [8,] "salmonella organism"  "salmonella organism"  "salmonella organism" 
#>  [9,] NA                     NA                     NA                    
#> [10,] NA                     NA                     NA                    
#>       Test                   Notes                 
#>  [1,] "positive"             "positive"            
#>  [2,] "detected"             "detected"            
#>  [3,] "escherichia coli-cre" "escherichia coli-cre"
#>  [4,] "ebola disease"        "ebola disease"       
#>  [5,] "legionella finding"   "legionella finding"  
#>  [6,] NA                     NA                    
#>  [7,] "ecoli detected"       "ecoli detected"      
#>  [8,] "salmonella organism"  "salmonella organism" 
#>  [9,] NA                     NA                    
#> [10,] NA                     NA

^{Created on 2023-09-01 with reprex v2.0.2}

and then do the pattern matching with %in% as suggested here.

For dealing with the case where ResOrg has value of No Organism just give the subset (or filtered) portion of LAB_Org as the argument to the fallbacks. I don't see how #3 happens because you are creating all new objects.

system · October 13, 2023, 9:02am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.