How to use mutate to create a variable by applying a function across rows

nmolanog · April 18, 2018, 8:51pm

I have a data frame and I want to create a new variable applying a function that works within rows. See the example below.

library(tidyverse)

n <- 100
z0 <- data.frame(A = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)),
                 B = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)),
                 C = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)))

The function is as follows:

z0 %>% apply(1, function(x) any("y" == x)) -> z0$new

The desired output is z0$new, and I would like to do it using mutate rather than apply.

Thanks for any suggestions.

cderv · April 18, 2018, 8:57pm

You can use mutate and purrr::pmap to iterate in a data.frame over rows.

There is an RStudio webinar on this topic that can give you ideas and examples:
https://www.rstudio.com/resources/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame/

EDIT: /!\ Incorrect code here, left for learning purposes. The function passed to pmap needs to deal with all three elements, not just the first one. See the other answers (and the help for pmap). /!\

library(tidyverse)

n <- 100
z0 <- data.frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
                 B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
                 C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))

z0 %>%
  mutate(new = pmap(., ~any("y" == .x)))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#>        A    B    C   new
#> 1      n    n    y FALSE
#> 2      y    y    y FALSE
#> 3      n    y    n FALSE
#> 4      n    n    y FALSE
#> 5      n    y    n FALSE
#> 6      y    y    n FALSE
#> 7      n    n    y FALSE
#> 8      n    n    n FALSE
#> 9   <NA> <NA>    n    NA
#> 10     n    y    y FALSE
#> 11     y    y    n FALSE
#> 12     y    n    n FALSE
#> 13  <NA> <NA>    n    NA
#> 14     n    n    n FALSE
#> 15     y    y    n FALSE
#> 16     n    n    n FALSE
#> 17     y <NA>    n FALSE
#> 18  <NA>    n    y    NA
#> 19     n    y    n FALSE
#> 20     y    n    n FALSE
#> 21     y    n    n FALSE
#> 22     y    n    y FALSE
#> 23     y    y    n FALSE
#> 24     n <NA>    n FALSE
#> 25  <NA>    n    n    NA
#> 26  <NA>    n    n    NA
#> 27     n    n <NA> FALSE
#> 28     n    y    y FALSE
#> 29     y <NA>    y FALSE
#> 30     n    y    n FALSE
#> 31     y    y <NA> FALSE
#> 32     n    n    y FALSE
#> 33     y    n    n FALSE
#> 34     y    y <NA> FALSE
#> 35     y    y    y FALSE
#> 36     y    y    n FALSE
#> 37     y    y    n FALSE
#> 38     y    n <NA> FALSE
#> 39     n    y    n FALSE
#> 40     n    n    n FALSE
#> 41     n <NA>    n FALSE
#> 42     n <NA>    y FALSE
#> 43     n    n    y FALSE
#> 44     y    y    y FALSE
#> 45     y    n    y FALSE
#> 46     n    y    n FALSE
#> 47     n    y    n FALSE
#> 48     y    n    n FALSE
#> 49  <NA>    n    n    NA
#> 50     y    y    n FALSE
#> 51     y    y    y FALSE
#> 52     n <NA>    y FALSE
#> 53     y    n    n FALSE
#> 54     n <NA>    n FALSE
#> 55     y <NA>    y FALSE
#> 56     n    n    y FALSE
#> 57     n    n    y FALSE
#> 58     y <NA>    n FALSE
#> 59     y    n    y FALSE
#> 60     y    y    n FALSE
#> 61     n <NA>    n FALSE
#> 62     y    n    y FALSE
#> 63     y    y    n FALSE
#> 64     y    y    n FALSE
#> 65     n    y    n FALSE
#> 66     y    y    y FALSE
#> 67  <NA>    y <NA>    NA
#> 68     y    n    y FALSE
#> 69     n <NA> <NA> FALSE
#> 70     y    n    n FALSE
#> 71     n    y    y FALSE
#> 72     n    y <NA> FALSE
#> 73     n    y    n FALSE
#> 74     n    y    y FALSE
#> 75  <NA>    y    n    NA
#> 76  <NA>    y    y    NA
#> 77     y    n    y FALSE
#> 78     n    y    n FALSE
#> 79     n    y    y FALSE
#> 80  <NA>    y    y    NA
#> 81     n <NA>    n FALSE
#> 82     y    n <NA> FALSE
#> 83     n    y    n FALSE
#> 84  <NA>    n    y    NA
#> 85     y    y    y FALSE
#> 86     y    y    n FALSE
#> 87     n    y    y FALSE
#> 88     y    y    y FALSE
#> 89     y    y    y FALSE
#> 90  <NA>    n    n    NA
#> 91     n    n    y FALSE
#> 92  <NA>    n    n    NA
#> 93     n <NA>    n FALSE
#> 94     n    n <NA> FALSE
#> 95     n    y    y FALSE
#> 96     n    y    y FALSE
#> 97     n    n    y FALSE
#> 98     n    n <NA> FALSE
#> 99     n    y    n FALSE
#> 100    y    y    y FALSE

Created on 2018-04-18 by the reprex package (v0.2.0).
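A minimal sketch of why the code above only looks at the first column: in a purrr formula, `.x` stands for the first argument only, while `..1`, `..2`, and so on refer to each argument by position. The list `rows` below is made up for illustration and is not from the thread.

```r
library(purrr)

# Two "columns" of character values, two "rows" (hypothetical data)
rows <- list(A = c("n", "y"), B = c("y", "n"))

# .x is only the first argument (column A), so column B is ignored
first_only <- pmap_lgl(rows, ~ any("y" == .x))

# ..1 and ..2 refer to the arguments by position, so both columns are checked
all_cols <- pmap_lgl(rows, ~ any("y" == c(..1, ..2)))
```

Here `first_only` misses the "y" in column B of the first row, while `all_cols` finds it.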

nmolanog · April 18, 2018, 9:25pm

If you compare your solution with my apply solution, they differ. Indeed, the first row of your example should be TRUE, since there is a "y" in column C.

EconomiCurtis · April 18, 2018, 10:00pm

I think part of the problem is that setting up z0 as a data.frame coerces A, B, and C into factors, which then get coerced into integers in pmap.

Here's one solution, but you have to be explicit about how you use columns A, B, and C by referring to them as ..1, ..2, and ..3. Check out the docs and discussions of pmap for more details.

library(tidyverse)
n <- 100
z0 <- tibble(
  A = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)),
  B = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)),
  C = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1))
  )
  
z0 %>%
  mutate(new = pmap(., .f=~any( "y" %in% c(..1, ..2, ..3))) %>% unlist)
#> # A tibble: 100 x 4
#>    A     B     C     new  
#>    <chr> <chr> <chr> <lgl>
#>  1 y     n     y     T    
#>  2 n     n     y     T    
#>  3 n     <NA>  n     F    
#>  4 y     n     y     T    
#>  5 <NA>  y     y     T    
#>  6 n     n     y     T    
#>  7 y     <NA>  y     T    
#>  8 n     n     y     T    
#>  9 n     n     n     F    
#> 10 n     y     y     T    
#> # ... with 90 more rows

Created on 2018-04-18 by the reprex package (v0.2.0).
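The factor explanation can be checked without pmap. This is a small sketch, not taken from the thread: the vector `f` is made up, and whether pmap actually passes the integer codes to the function may depend on the purrr version, so treat that part as an assumption.

```r
# A factor stores integer codes plus a table of levels (sorted: "n", "y")
f <- factor(c("y", "n", NA))

codes  <- as.integer(f)     # the underlying integer codes
labels <- as.character(f)   # the original strings

# Comparing "y" with the codes never matches; NA stays NA
cmp_codes <- "y" == codes

# Comparing "y" with the labels behaves as expected
cmp_labels <- "y" == labels
```

If a row-wise function receives the codes instead of the labels, every non-missing comparison with "y" is FALSE, which is the pattern (only FALSE and NA) in the incorrect output earlier in the thread.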

cderv · April 19, 2018, 6:14am

Sorry, I was in a hurry when I answered and only wanted to point you to mutate + pmap.
Obviously, we need a function that handles a three-component list, that is, one row of the data frame.
Sorry about that.

Thank you @EconomiCurtis for correcting my answer. This is how to use pmap here.

To add to that, you can name your function's arguments after the columns and refer to each column by name.
You can also use pmap_lgl to get a logical vector directly instead of a list.

library(tidyverse)

n <- 100
z0 <- data_frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
                 B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
                 C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))

z0 %>%
  mutate(new = pmap_lgl(., function(A, B, C) any("y" %in% c(A, B, C))))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> # A tibble: 100 x 4
#>    A     B     C     new  
#>    <chr> <chr> <chr> <lgl>
#>  1 y     n     n     TRUE 
#>  2 n     n     n     FALSE
#>  3 <NA>  y     n     TRUE 
#>  4 y     y     <NA>  TRUE 
#>  5 <NA>  n     y     TRUE 
#>  6 y     y     n     TRUE 
#>  7 <NA>  n     n     FALSE
#>  8 n     <NA>  n     FALSE
#>  9 n     n     y     TRUE 
#> 10 n     y     y     TRUE 
#> # ... with 90 more rows

Created on 2018-04-19 by the reprex package (v0.2.0).
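If the number of columns may change, one possible variation is to collect all arguments with `...` instead of naming them. This is only a sketch: the helper `any_y` and the small tibble `z` are hypothetical and not part of the thread.

```r
library(dplyr)
library(purrr)

# Small character-column example (hypothetical data)
z <- tibble(
  A = c("y", "n", "n"),
  B = c("n", "n", NA),
  C = c("n", "y", "n")
)

# Hypothetical helper: c(...) gathers every column value of the current row
any_y <- function(...) "y" %in% c(...)

res <- z %>%
  mutate(new = pmap_lgl(., any_y))
```

Because `any_y` accepts any number of arguments, the same call works if columns are added to or removed from `z`.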

nmolanog · April 19, 2018, 3:37pm

Thank you for your comments. However, the behavior of %in% is different from that of ==. See a complete discussion of this issue here.

It is worth noting that working with factors is troublesome.
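A short sketch of the difference, using a made-up vector `vals` in place of one row: `==` propagates NA, while `%in%` never returns NA.

```r
# One "row" with no "y" and one missing value (hypothetical data)
vals <- c("n", NA, "n")

eq_result <- any("y" == vals)                # NA: the missing value might be "y"
in_result <- "y" %in% vals                   # FALSE: %in% treats NA as no match
eq_na_rm  <- any("y" == vals, na.rm = TRUE)  # FALSE: NA comparisons are dropped

# When a "y" is present, any() returns TRUE despite the NA
with_y <- any("y" == c("y", NA))
```

Which behavior is right depends on whether a row with missing values and no "y" should count as FALSE or as unknown.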

martin.R · April 19, 2018, 4:05pm

Please note the policies on cross-posting to avoid duplicating people's efforts:

krose · April 19, 2018, 6:56pm

You can also use the dplyr rowwise function. It works like group_by, but it groups by row instead of by a variable.

library(tidyverse)

n <- 100
z0 <- data.frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
                 B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
                 C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))

z0$new <- z0 %>% apply(1, function(x) any("y" == x))

identical(
  as_tibble(z0),
  
  z0 %>% 
    dplyr::rowwise() %>%
    mutate(new = any("y" == A, "y" == B, "y" == C)) %>%
    ungroup()
)
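With dplyr 1.0.0 or later (newer than the versions used in this thread, so this is an assumption about your setup), `c_across()` can select the columns inside a rowwise mutate without spelling out each comparison. The tibble `z` below is made up for illustration.

```r
library(dplyr)

# Small character-column example (hypothetical data)
z <- tibble(
  A = c("y", "n", NA),
  B = c("n", "n", "n"),
  C = c("n", NA, "y")
)

res <- z %>%
  rowwise() %>%
  # c_across(A:C) returns the values of columns A to C for the current row
  mutate(new = any("y" == c_across(A:C))) %>%
  ungroup()
```

As with the apply version, a row with no "y" and at least one NA gives NA rather than FALSE.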

nmolanog · April 19, 2018, 7:11pm

Sorry about that. I will adhere to the policies.