I have a data frame and I want to create a new variable by applying a function across each row. See the example below.
library(tidyverse)
n <- 100
z0 <- data.frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))
The function is as follows:
z0 %>% apply(1, function(x) any("y" == x)) -> z0$new
The desired output is z0$new, and I would like to do it using mutate rather than apply.
Thanks for any suggestions.
cderv
April 18, 2018, 8:57pm
2
You can use mutate and purrr::pmap to iterate over the rows of a data.frame.
There is an RStudio webinar on this topic that can give you ideas and examples:
https://www.rstudio.com/resources/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame/
EDIT: /!\ The code below is incorrect and is left for learning purposes. The function passed to pmap needs to handle all three elements, not just the first one. See the other answers (and the help page for pmap). /!\
library(tidyverse)
n <- 100
z0 <- data.frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))
z0 %>%
mutate(new = pmap(., ~any("y" == .x)))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> A B C new
#> 1 n n y FALSE
#> 2 y y y FALSE
#> 3 n y n FALSE
#> 4 n n y FALSE
#> 5 n y n FALSE
#> 6 y y n FALSE
#> 7 n n y FALSE
#> 8 n n n FALSE
#> 9 <NA> <NA> n NA
#> 10 n y y FALSE
#> 11 y y n FALSE
#> 12 y n n FALSE
#> 13 <NA> <NA> n NA
#> 14 n n n FALSE
#> 15 y y n FALSE
#> 16 n n n FALSE
#> 17 y <NA> n FALSE
#> 18 <NA> n y NA
#> 19 n y n FALSE
#> 20 y n n FALSE
#> 21 y n n FALSE
#> 22 y n y FALSE
#> 23 y y n FALSE
#> 24 n <NA> n FALSE
#> 25 <NA> n n NA
#> 26 <NA> n n NA
#> 27 n n <NA> FALSE
#> 28 n y y FALSE
#> 29 y <NA> y FALSE
#> 30 n y n FALSE
#> 31 y y <NA> FALSE
#> 32 n n y FALSE
#> 33 y n n FALSE
#> 34 y y <NA> FALSE
#> 35 y y y FALSE
#> 36 y y n FALSE
#> 37 y y n FALSE
#> 38 y n <NA> FALSE
#> 39 n y n FALSE
#> 40 n n n FALSE
#> 41 n <NA> n FALSE
#> 42 n <NA> y FALSE
#> 43 n n y FALSE
#> 44 y y y FALSE
#> 45 y n y FALSE
#> 46 n y n FALSE
#> 47 n y n FALSE
#> 48 y n n FALSE
#> 49 <NA> n n NA
#> 50 y y n FALSE
#> 51 y y y FALSE
#> 52 n <NA> y FALSE
#> 53 y n n FALSE
#> 54 n <NA> n FALSE
#> 55 y <NA> y FALSE
#> 56 n n y FALSE
#> 57 n n y FALSE
#> 58 y <NA> n FALSE
#> 59 y n y FALSE
#> 60 y y n FALSE
#> 61 n <NA> n FALSE
#> 62 y n y FALSE
#> 63 y y n FALSE
#> 64 y y n FALSE
#> 65 n y n FALSE
#> 66 y y y FALSE
#> 67 <NA> y <NA> NA
#> 68 y n y FALSE
#> 69 n <NA> <NA> FALSE
#> 70 y n n FALSE
#> 71 n y y FALSE
#> 72 n y <NA> FALSE
#> 73 n y n FALSE
#> 74 n y y FALSE
#> 75 <NA> y n NA
#> 76 <NA> y y NA
#> 77 y n y FALSE
#> 78 n y n FALSE
#> 79 n y y FALSE
#> 80 <NA> y y NA
#> 81 n <NA> n FALSE
#> 82 y n <NA> FALSE
#> 83 n y n FALSE
#> 84 <NA> n y NA
#> 85 y y y FALSE
#> 86 y y n FALSE
#> 87 n y y FALSE
#> 88 y y y FALSE
#> 89 y y y FALSE
#> 90 <NA> n n NA
#> 91 n n y FALSE
#> 92 <NA> n n NA
#> 93 n <NA> n FALSE
#> 94 n n <NA> FALSE
#> 95 n y y FALSE
#> 96 n y y FALSE
#> 97 n n y FALSE
#> 98 n n <NA> FALSE
#> 99 n y n FALSE
#> 100 y y y FALSE
Created on 2018-04-18 by the reprex package (v0.2.0).
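A minimal sketch of why the formula above only inspects the first column: in a purrr formula, .x refers to the first argument only, while ..1, ..2 and ..3 refer to each argument by position. The small list `rows` is made up for illustration and is not part of the original example.

```r
library(purrr)

# Hypothetical two-row input: each list element plays the role of a column.
rows <- list(
  A = c("n", "y"),
  B = c("y", "n"),
  C = c("n", "n")
)

# .x is only the first argument (column A), so B and C are ignored.
first_only <- pmap_lgl(rows, ~ any("y" == .x))

# ..1, ..2, ..3 pick up all three arguments for the current row.
all_cols <- pmap_lgl(rows, ~ any("y" == c(..1, ..2, ..3)))

first_only
all_cols
```

In the first row "y" appears only in column B, so the two versions disagree there.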
If you compare your solution with my apply solution, they differ. Indeed, the first row of your example should be TRUE, since there is a "y" in column C.
I think part of the problem was in setting up z0 as a data.frame, which coerces A, B, and C into factors that then get coerced into ints in pmap.
Here's one solution, but you have to be explicit about how you use columns A, B, and C with ..1, ..2, ..3. Check out the docs and discussions of pmap for more details.
library(tidyverse)
n <- 100
z0 <- tibble(
A = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)),
B = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1)),
C = sample(c("y","n",NA), n, replace = TRUE, prob = c(.4,.4,.1))
)
z0 %>%
mutate(new = pmap(., .f=~any( "y" %in% c(..1, ..2, ..3))) %>% unlist)
#> # A tibble: 100 x 4
#> A B C new
#> <chr> <chr> <chr> <lgl>
#> 1 y n y T
#> 2 n n y T
#> 3 n <NA> n F
#> 4 y n y T
#> 5 <NA> y y T
#> 6 n n y T
#> 7 y <NA> y T
#> 8 n n y T
#> 9 n n n F
#> 10 n y y T
#> # ... with 90 more rows
Created on 2018-04-18 by the reprex package (v0.2.0).
cderv
April 19, 2018, 6:14am
6
Sorry, I was in a hurry when I answered, but I still wanted to point you to mutate + pmap.
Obviously, we need a function that handles a three-component list: one row of the data frame.
Sorry for that.
Thank you @EconomiCurtis for correcting my answer. This is how to use pmap here.
To complete the answer: you can name the function's arguments after the columns and refer to them by column name.
Also, you can use pmap_lgl to get a logical vector back directly instead of a list.
library(tidyverse)
n <- 100
z0 <- data_frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))
z0 %>%
mutate(new = pmap_lgl(., function(A, B, C) any("y" %in% c(A, B, C))))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> # A tibble: 100 x 4
#> A B C new
#> <chr> <chr> <chr> <lgl>
#> 1 y n n TRUE
#> 2 n n n FALSE
#> 3 <NA> y n TRUE
#> 4 y y <NA> TRUE
#> 5 <NA> n y TRUE
#> 6 y y n TRUE
#> 7 <NA> n n FALSE
#> 8 n <NA> n FALSE
#> 9 n n y TRUE
#> 10 n y y TRUE
#> # ... with 90 more rows
Created on 2018-04-19 by the reprex package (v0.2.0).
Thank you for your comments. However, the behavior of %in% is different from that of ==: %in% treats NA as a non-match and returns FALSE, while == returns NA. See a complete discussion of this issue here.
It is worth noting that working with factors is troublesome.
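To make the difference concrete, here is a small sketch comparing the two operators on a row that contains NA, and then inside pmap_lgl. The three-row tibble `z` and the column names with_in and with_eq are made up for illustration.

```r
library(dplyr)
library(purrr)

# A row with no "y" but one missing value.
row_with_na <- c("n", NA, "n")
eq_result <- any("y" == row_with_na)   # comparison with NA gives NA
in_result <- "y" %in% row_with_na      # %in% never returns NA

# Hypothetical data: the third row has a missing value and no "y".
z <- tibble(
  A = c("y", "n", NA),
  B = c("n", "n", "n"),
  C = c("n", "n", "n")
)

res <- z %>%
  mutate(
    # %in% version, as in the answers above
    with_in = pmap_lgl(list(A, B, C), function(A, B, C) "y" %in% c(A, B, C)),
    # == version, which keeps the NA behaviour of the apply() call
    with_eq = pmap_lgl(list(A, B, C), function(A, B, C) any("y" == c(A, B, C)))
  )

# Reference result from the original apply() approach.
apply_result <- unname(apply(z, 1, function(x) any("y" == x)))
res
```

Only the == version matches apply() on the third row; which one is preferable depends on whether a missing value should count as "unknown" or as "no".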
Please note the policies on cross-posting to avoid duplicating people's efforts:
Posting the same question both here and on other sites
Posting the same question to multiple forums at the same time is often considered impolite. We don't completely ban such cross-posting, but we ask you to think hard before you do it and to follow some rules.
Cross-post sparingly
Rather than post the same thing here and elsewhere from the get-go, post in one place at a time. Let enough time go by (think days, not hours) before you take your question somewhere else. Sometimes people at another site may suggest you post here if your question doesn't fit within the scope of the other site.
Always link to your other posts, and update everywhere with any solution…
krose
April 19, 2018, 6:56pm
9
You can also use dplyr's rowwise() function. It works like group_by(), but groups by row instead of by a variable.
library(tidyverse)
n <- 100
z0 <- data.frame(A = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
B = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)),
C = sample(c("y","n",NA), n, replace = T, prob = c(.4,.4,.1)))
z0$new <- z0 %>% apply(1, function(x) any("y" == x))
identical(
as_tibble(z0),
z0 %>%
dplyr::rowwise() %>%
mutate(new = any("y" == A, "y" == B, "y" == C)) %>%
ungroup()
)
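As a possible variation, assuming dplyr 1.0.0 or later, c_across() can collect the columns inside rowwise() without naming each one, and a plain vectorised | avoids row-wise iteration altogether. Neither appears in the thread; the four-row tibble `z` and the column names new_rowwise and new_vectorised are made up for illustration.

```r
library(dplyr)

# Hypothetical data covering the TRUE, FALSE and NA cases.
z <- tibble(
  A = c("y", "n", NA, "n"),
  B = c("n", "n", "n", NA),
  C = c("n", "n", "y", "n")
)

res <- z %>%
  rowwise() %>%
  # c_across(A:C) gathers the current row's values into one vector
  mutate(new_rowwise = any("y" == c_across(A:C))) %>%
  ungroup() %>%
  # vectorised alternative: | follows the same NA rules as any()
  mutate(new_vectorised = (A == "y") | (B == "y") | (C == "y"))

res
```

Both columns should agree with the apply() result, including NA for a row that has a missing value and no "y".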
Sorry about that. I will adhere to the policies.