Faster than row-wise

pavel · June 26, 2022, 8:58pm

Create a column based on single match, two matches, etc... tidyverse

Hello everybody! I'm trying to create a column and/or several columns that would indicate: a) at least one TRUE present in a row (within specific columns) b) one TRUE is present in a row (within specific columns) c) two TRUE present in a row (within specific columns) d) more than two TRUE present in a row (within specific columns). I was only able to solve the simplest case with str_detect. Your help would be greatly appreciated. library(tidyverse) df <- data.frame(flag1 = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), flag2 = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), …

Hello!
Is there a faster solution for these two problems that avoids rowwise()? My code worked well on a subset of the data, but on the complete sample (~10 million rows) it has been running for more than 3 hours.
Thank you!

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
df <- data.frame(flag1 = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), 
                 flag2 = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                 flag3 = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
dfNew <- df |> rowwise() |> 
  mutate(AnyTrue = sum(c_across(flag1:flag3)) > 0,
         OneTrue = sum(c_across(flag1:flag3)) == 1,
         TwoTrue = sum(c_across(flag1:flag3)) == 2,
         MoreThanTwo = sum(c_across(flag1:flag3)) > 2)
dfNew

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
df<- tribble(~A,~B,~C,~D,
             "I123","I121","I1908","I129",
             "I128","I123","I124","I109",
             "I126","I1855","I129","I183",
             "I121","I163","F121","I8773",
             "I123","I129","I1563","I121",
             "I129","I1665","I128", "F843",
             "X","Y","Z","ZZ")


df <- df |> mutate(Row=row_number()) 
Long <- df |> pivot_longer(cols = A:D,names_to = "name")
Long <-  Long |> group_by(Row) |> 
  summarize(C1 = any(str_detect(value,"I123|I128")),
            C2 = any(str_detect(value,"I121")),
            C3 = any(str_detect(value,"I129"))) |> 
  rowwise() |> 
  mutate(WHICH=which(c_across(C1:C3))[1])
FINAL <- inner_join(df,Long,by="Row")
FINAL

jrkrideau · June 26, 2022, 9:13pm

I think we need a FAQ: How to do a minimal reproducible example ( reprex ) for beginners

Just saying "row-wise" does not really give us a good idea of what you are doing.

A handy way to supply some sample data is the dput() function. In the case of a large dataset something like dput(head(mydata, 100)) should supply the data we need.
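As a minimal sketch of the dput() suggestion above: `mydata` here is a small made-up stand-in for the real dataset, and the round-trip check at the end is only there to show that the printed text rebuilds the same rows.

```r
# Hypothetical stand-in for the real dataset
mydata <- data.frame(flag1 = c(FALSE, TRUE, FALSE, TRUE),
                     flag2 = c(FALSE, FALSE, TRUE, TRUE))

# Keep at most the first 100 rows and print R code that recreates them;
# the printed text is what gets pasted into the forum post
sample_rows <- head(mydata, 100)
dput(sample_rows)

# Round-trip check: evaluating the dput() text gives back the same rows
rebuilt <- eval(parse(text = capture.output(dput(sample_rows))))
all.equal(rebuilt, sample_rows)
```

Anyone reading the post can then assign the pasted text to a name (for example `mydata <- structure(...)`) and run the rest of the code against it.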

pavel · June 26, 2022, 10:03pm

The original post was amended as requested.

nirgrahamuk · June 27, 2022, 11:45am

I got about a 1000x speedup with my solution, which avoids rowwise(), compared to your first example.

library(tidyverse)
library(bench)
df <- data.frame(flag1 = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), 
                 flag2 = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                 flag3 = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))

bigdf <- map_dfr(1:100,~{.x;
  df}) 
bench::mark(
f1={bigdf |> rowwise() |> 
  mutate(AnyTrue = sum(c_across(flag1:flag3)) > 0,
         OneTrue = sum(c_across(flag1:flag3)) == 1,
         TwoTrue = sum(c_across(flag1:flag3)) == 2,
         MoreThanTwo = sum(c_across(flag1:flag3)) > 2) |> ungroup()},
  f2 = {
    mutate(bigdf,
           x = rowSums(pick(flag1:flag3)),  # pick() needs dplyr >= 1.1.0, where cur_data() is deprecated; sums only the flag columns
           AnyTrue = x>0,
           OneTrue = x==1,
           TwoTrue = x==2,
           MoreThanTwo = x>2) |> select(-x) |> tibble()
  })
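The second example from the question can probably be handled in the same spirit, since str_detect() is already vectorised over a whole column. The following is only a sketch and has not been benchmarked on the 10-million-row data: `any_match()` and `codes` are hypothetical names introduced here, and the patterns are copied from the question. It combines the per-column matches with `|` instead of pivoting longer and going row by row, and it uses case_when() to pick the first TRUE among C1:C3.

```r
library(tidyverse)

df <- tribble(~A, ~B, ~C, ~D,
              "I123", "I121", "I1908", "I129",
              "I128", "I123", "I124", "I109",
              "I126", "I1855", "I129", "I183",
              "I121", "I163", "F121", "I8773",
              "I123", "I129", "I1563", "I121",
              "I129", "I1665", "I128", "F843",
              "X", "Y", "Z", "ZZ")

# Hypothetical helper: TRUE for each row in which at least one of the
# given columns matches `pattern`. Each column is matched in one
# vectorised call and the per-column results are combined with `|`.
any_match <- function(cols, pattern) {
  Reduce(`|`, lapply(cols, str_detect, pattern = pattern))
}

# The columns to search
codes <- select(df, A:D)

FINAL <- df |>
  mutate(Row = row_number(),
         C1 = any_match(codes, "I123|I128"),
         C2 = any_match(codes, "I121"),
         C3 = any_match(codes, "I129"),
         # Position of the first TRUE among C1:C3; NA when none is TRUE
         WHICH = case_when(C1 ~ 1L, C2 ~ 2L, C3 ~ 3L))
FINAL
```

On the seven sample rows this should reproduce the C1:C3 and WHICH columns of the rowwise version, with WHICH left as NA for the last row, where nothing matches.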

pavel · June 27, 2022, 5:57pm

Thank you @nirgrahamuk. Indeed, I think it took longer for me to install bench than for this piece of code to execute :-)

system · July 4, 2022, 5:58pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.