Is there a tidy way to iterate over a data.frame/tibble and produce side effects based on the values in each row?

GraemeS · February 11, 2020, 8:42am

With purrr this can be done on lists and vectors, but I can't find a way of doing it with the rows of a data frame / tibble. This is one of the most common patterns in other languages, such as foreach in C#.

What would be the tidy equivalent of the following code?

df <- data.frame(code = letters[1:26], value = rnorm(26))

for (i in seq_along(df$code)) {
    row <- df[i, ]
    
    print(row) # or do something much more complex with the row
}

nirgrahamuk · February 11, 2020, 10:56am

library(purrr)
df <- data.frame(code = letters[1:26],
                 value = rnorm(26),
                 stringsAsFactors = FALSE)

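# Base R for loop over row indices.
forfunc <- function(df) {
  for (i in seq_len(nrow(df))) {
    row <- df[i, ]

    print(row) # or do something much more complex with the row
  }
}

# purrr::walk() over row indices; same idea as the for loop.
walkfor <- function(df) {
  walk(seq_len(nrow(df)), ~ print(df[., ]))
}

# purrr::pwalk() treats the data frame as a list of columns and passes
# one element from each column per call (..1 = code, ..2 = value).
pwalkfunc <- function(df) {
  pwalk(.l = df, .f = ~ print(paste0(..1, " ", ..2)))
}

# Fully vectorised: no explicit iteration at all.
# sep = "" stops cat() from putting a space at the start of each new line.
directfunc <- function(df) {
  cat(paste0(df[, 1], " ", df[, 2], "\n"), sep = "")
}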


library(microbenchmark)

microbenchmark(forfunc(df),times = 10L,unit="us")
microbenchmark(walkfor(df),times = 10L,unit="us")
microbenchmark(pwalkfunc(df),times = 10L,unit="us")
microbenchmark(print(df),times = 10L,unit="us")
microbenchmark(directfunc(df),times = 10L,unit="us")
EDITED: added the slide timing (slidefunc). All times are in microseconds.

# expr              min     lq   mean  median     uq    max  neval
# forfunc(df)      9770  10224  11303   11061  11814  13777     10
# walkfor(df)     10170  10864  11808   11416  11917  20878     10
# pwalkfunc(df)     857   1306   1472    1561   1722   1821     10
# print(df)         749    904   1248    1333   1501   1592     10
# directfunc(df)    521    621    900     657    735   2693     10
# slidefunc(df)      71     72    223      73     76   1249     10

In conclusion, since R is vectorised, it's usually best to go direct if you can. You don't need a for loop or any other programmer-defined iterator when the base language (and many package extensions) will iterate for you by default.
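To make "going direct" a bit more concrete, here is a minimal sketch that builds one line of text per row with no explicit iteration. The small data frame and the output format are made up for illustration; the only thing it relies on is that sprintf() is vectorised over its arguments.

```r
# Illustrative data, not the benchmark data frame from above.
df <- data.frame(code = c("a", "b", "c"),
                 value = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)

# sprintf() recycles over both columns, so this yields one string per row.
lines <- sprintf("%s: %.2f", df$code, df$value)

# A single call writes all rows; no for loop or walk() needed.
writeLines(lines)
```

This only works when the per-row action can itself be expressed as a vectorised call. For side effects that genuinely have to happen one row at a time, one of the iterating versions is still needed.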

nirgrahamuk · February 11, 2020, 11:13am

Lol. I think I deleted a digit from the mean for walkfor when I was manually trimming off the decimal places. Obviously the mean can't be less than the min!

GraemeS · February 11, 2020, 11:32am

Hi,

I was wondering about that too.

The walk solution seems almost identical to the basic for iteration, just using . instead of i. It also has the disadvantage of being more difficult to debug. One nice thing about using an explicit loop is that if something goes wrong, all the variables are still set. I'm usually creating SQL scripts based on each row, so it is easy to just copy and run the last script to see what went wrong.

However, being able to parallelize with pwalk is something I hadn't really considered, and that is quite interesting.
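One caveat on the "parallel" part: as far as I understand the purrr documentation, the p in pmap()/pwalk() means the inputs are walked in lockstep (one element from each column per call), not that the work is spread across cores. On the debugging worry, a rough sketch of one way to keep things inspectable is to build the statements first and run them second. The build_sql helper and the my_table name are hypothetical, and message() stands in for whatever would actually execute the SQL.

```r
library(purrr)

df <- data.frame(code = c("a", "b", "c"),
                 value = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)

# Hypothetical helper: pmap passes each column by name, so the arguments
# here must match the column names of df.
build_sql <- function(code, value) {
  sprintf("UPDATE my_table SET value = %.1f WHERE code = '%s';", value, code)
}

# Step 1: one statement per row, kept in a character vector.
stmts <- pmap_chr(df, build_sql)

# Step 2: the side effect, one statement at a time.
walk(stmts, message)
```

If something fails in step 2, stmts is still sitting in the workspace, so the offending statement can be copied out and run by hand, much like with the for loop.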

mfherman · February 11, 2020, 2:25pm

Another option is the new slider package, which uses map()-like syntax to iterate over rows (instead of columns).

library(purrr)
library(slider)

df <- data.frame(code = letters[1:26],
                 value = rnorm(26),
                 stringsAsFactors = FALSE)

slide(df, ~.x)
#> [[1]]
#>   code     value
#> 1    a 0.6423173
#> 
#> [[2]]
#>   code     value
#> 1    b 0.5751017
#> 
#> [[3]]
#>   code      value
#> 1    c -0.5670547
#> 
#> [[4]]
#>   code     value
#> 1    d 0.3301827
#> 
#> [[5]]
#>   code      value
#> 1    e 0.01724866
#> 
#> [[6]]
#>   code      value
#> 1    f -0.3050139
#> 
#> [[7]]
#>   code   value
#> 1    g 1.07928
#> 
#> [[8]]
#>   code     value
#> 1    h -1.555621
#> 
#> [[9]]
#>   code      value
#> 1    i -0.2743272
#> 
#> [[10]]
#>   code      value
#> 1    j -0.1304349
#> 
#> [[11]]
#>   code    value
#> 1    k 1.023238
#> 
#> [[12]]
#>   code     value
#> 1    l -1.149406
#> 
#> [[13]]
#>   code      value
#> 1    m -0.7004986
#> 
#> [[14]]
#>   code     value
#> 1    n 0.3426395
#> 
#> [[15]]
#>   code     value
#> 1    o 0.2735956
#> 
#> [[16]]
#>   code      value
#> 1    p -0.9509219
#> 
#> [[17]]
#>   code     value
#> 1    q -1.576721
#> 
#> [[18]]
#>   code      value
#> 1    r -0.9072278
#> 
#> [[19]]
#>   code     value
#> 1    s -1.280973
#> 
#> [[20]]
#>   code     value
#> 1    t -2.960007
#> 
#> [[21]]
#>   code      value
#> 1    u -0.2768247
#> 
#> [[22]]
#>   code      value
#> 1    v -0.1905839
#> 
#> [[23]]
#>   code     value
#> 1    w 0.4464869
#> 
#> [[24]]
#>   code    value
#> 1    x 1.871529
#> 
#> [[25]]
#>   code     value
#> 1    y 0.1432761
#> 
#> [[26]]
#>   code     value
#> 1    z 0.6543933

Another nice thing about this compared to pmap() is that you can refer to the columns by name inside the formula you apply to each row (with pmap() the formula shorthand only gives you ..1, ..2, and so on):

slide_dbl(df, ~.x$value + 1)
#>  [1]  1.64231731  1.57510169  0.43294532  1.33018271  1.01724866  0.69498611
#>  [7]  2.07928021 -0.55562117  0.72567283  0.86956514  2.02323772 -0.14940556
#> [13]  0.29950139  1.34263952  1.27359560  0.04907812 -0.57672104  0.09277225
#> [19] -0.28097344 -1.96000711  0.72317534  0.80941607  1.44648691  2.87152866
#> [25]  1.14327608  1.65439329

Created on 2020-02-11 by the reprex package (v0.3.0)

nirgrahamuk · February 11, 2020, 2:37pm

Nice, it's always good to have options. It seems very fast too!

GraemeS · February 13, 2020, 12:41am

Slider looks like quite a useful package. It does exactly what I was looking for and has some other useful features too, i.e. the sliding part of slide. Thanks.
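For anyone else landing here, this is a small sketch of what I mean by the sliding part, assuming the slider package is installed. The data is made up; .before = 1 asks for a window of the current row plus the one before it.

```r
library(slider)

df <- data.frame(code = c("a", "b", "c", "d"),
                 value = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

# Each .x is a data frame slice: the current row and up to one previous row.
# The first window only contains row 1 because nothing comes before it.
rolling <- slide_dbl(df, ~ mean(.x$value), .before = 1)

print(rolling)
```

With the default .before = 0 each window is a single row, which is the plain row-by-row iteration from the earlier post.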
