columns searching

Having 2 datasets. The cases dataset, I want to create a column (find) and set to 'Yes' if OTHRDX1 or OTHRDX2 or OTHRDX3 contains value that can be found in vector cd1121. For example:find in row2 and row5 are 'Yes', and rest are 'No'. How to do the columns searching? Thanks in advance!

cases <- data.frame(
stringsAsFactors = FALSE,
CaseNumber = c(1, 2, 3, 4, 5),
OTHRDX1 = c("P101", "E7420", "P101", "P101", "P101"),
OTHRDX2 = c("P90", NA, "P90", "P90", "P90"),
OTHRDX3 = c("P910", NA, "P910", "P910", "E7429")
)

target <- data.frame(
stringsAsFactors = FALSE,
Table_Number = c("Table 11.21", "Table 11.21", "Table 11.21"),
Code = c("E7420", "E7421", "E7429"),
Short_Desc = c("Disorders of galactose metabolism, unspecified","Galactosemia",
"Other disorders of galactose metabolism")
)

cd1121 <- as.vector(target$Code)

For a single column, that can be tested with:

cases$OTHRDX1 %in% cd1121
#> [1] FALSE  TRUE FALSE FALSE FALSE

So you can do the same with an OR |:

cases$OTHRDX1 %in% cd1121 |
  cases$OTHRDX2 %in% cd1121 |
  cases$OTHRDX3 %in% cd1121
#> [1] FALSE  TRUE FALSE FALSE  TRUE

So now you want to transform this vector of logicals into text, the easiest way is with ifelse():

ifelse(cases$OTHRDX1 %in% cd1121 |
         cases$OTHRDX2 %in% cd1121 |
         cases$OTHRDX3 %in% cd1121,
       "Yes", "No")
#> [1] "No"  "Yes" "No"  "No"  "Yes"

And finally we can add it as a column to cases:

cases$find <- ifelse(cases$OTHRDX1 %in% cd1121 |
                       cases$OTHRDX2 %in% cd1121 |
                       cases$OTHRDX3 %in% cd1121,
                     "Yes", "No")
cases <- data.frame(
  stringsAsFactors = FALSE,
  CaseNumber = c(1, 2, 3, 4, 5),
  OTHRDX1 = c("P101", "E7420", "P101", "P101", "P101"),
  OTHRDX2 = c("P90", NA, "P90", "P90", "P90"),
  OTHRDX3 = c("P910", NA, "P910", "P910", "E7429")
)

target <- data.frame(
  stringsAsFactors = FALSE,
  Table_Number = c("Table 11.21", "Table 11.21", "Table 11.21"),
  Code = c("E7420", "E7421", "E7429"),
  Short_Desc = c("Disorders of galactose metabolism, unspecified","Galactosemia",
                 "Other disorders of galactose metabolism")
)

cd1121 <- as.vector(target$Code)

has_hit <- function(x) x %in% cd1121    
target$find <- ifelse(apply(cases[,-1],2,has_hit) |> colSums(),"Yes","No")
target
#>   Table_Number  Code                                     Short_Desc find
#> 1  Table 11.21 E7420 Disorders of galactose metabolism, unspecified  Yes
#> 2  Table 11.21 E7421                                   Galactosemia   No
#> 3  Table 11.21 E7429        Other disorders of galactose metabolism  Yes

Created on 2023-10-12 with reprex v2.0.2

1 Like

I like the style, but need little explanation. How does that apply work? what is |> and colsums() for? Thank you so much!

1 Like

apply() takes an object, in this case the data frame cases, a choice of rows (1) or columns (2) and then uses that as an argument to a function.

|> is the pipe like the OS pipe | or the R {magritter} package %>% operator. It passes on the output from the left to the next function on the right. It's just syntactic candy for

colSums(somefunction(someargument)

The left hand side (LHS) produces a truth table of TRUE FALSE values. These evaluate for 1,0 for purposes of colSums in the RHS, which ifelse() then treats as
TRUE/FALSE to make its classification.

Because most of us suffer to some degree from school algebra punctuation trauma the functional approach of {base} R can be daunting compared to the imperative approach of tidyverse—verb this then verb that. Once I got over the trauma, I found that the functional approach keeps me much more focused on the what I am trying to do, which avoids getting bogged down in figuring out how to do something, especially when I'm unclear on the concept of what it is that I am actually trying to accomplish.

1 Like

Thanks! I always were amazed by the repeat operation (or loop vertically /horizontally) in r. Hardly for me to think how it goes through all columns and rows to searching or comparing. Also it is hard for me to find explicitly which 'item' in 'array' contains the value that I were looking for.

Yes, R's vectorization can be disturbing if you come from another programming language, but once you get used to it a bit, it makes things faster and easier to reason with.

Note that you could always write things in a more C-like way (not that it's a good idea), for example my solution

ases$find <- ifelse(cases$OTHRDX1 %in% cd1121 |
                       cases$OTHRDX2 %in% cd1121 |
                       cases$OTHRDX3 %in% cd1121,
                     "Yes", "No")

could be written as

n <- nrow(cases)
cases$find <- NA
for(row in seq_len(n)){
  if(cases$OTHRDX1[[row ]] %in% cd1121 ||
                       cases$OTHRDX2[[row ]] %in% cd1121 ||
                       cases$OTHRDX3[[row ]] %in% cd1121) {
    cases$find[[row ]] <- "Yes"
  } else{
    cases$find[[row ]] <- "No"
  }
}

And technocrat's answer could be rewritten as:

cols_of_interest <- colnames(cases)[-1]
cases$find <- NA

for(row in seq_len(nrow(cases))){
  
  number_of_matches <- 0
  for(col in cols_of_interest){
    if(cases[row, col] %in% cd1121){
      number_of_matches <- number_of_matches + 1
    }
  }
  
  if(number_of_matches > 0){
    cases$find[row] <- "Yes"
  } else{
    cases$find[row] <- "No"
  }
  
}

Or replacing the inner loop with a (vectorized) call to has_hit():

has_hit <- function(x) x %in% cd1121    

cols_of_interest <- colnames(cases)[-1]
cases$find <- NA

for(row in seq_len(nrow(cases))){
  any_hit <- any(has_hit(cases[row,]))
  
  if(any_hit){
    cases$find[row] <- "Yes"
  } else{
    cases$find[row] <- "No"
  }
  
}

But these versions are less readable, and probably slower for a big data frame.

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.