how to create a loop over variables and create means? (similar to foreach in STATA)

GabbyJ · March 2, 2022, 12:47am

Hi, I'm new to this forum and would be so very grateful if you could help me.

I am trying to create means for a few variables in a loop function (data sample is below).
(i.e mean of ff_l4l5 and ff_l5s1, mean of gc_l4l5 and gc_5s1)

I am able to do this one variable a time by searching for a stem (i.e. either ff_ or gc_) but not sure how to do this in a loop.
Could you help me create a function that loops though the stems? (similar to foreach function in STATA)

Here is the code that works for one stem:

library(dplyr)
df <- df %>%
mutate(ff_ave = rowMeans(x = select(.data = .,
starts_with(match = "ff_"))))

Here is the data sample:

    id   ff_l4l5      ff_l5s1   gc_l4l5   gc_l5s1   volume_l4l5   volume_l5s1
     1     24.16      30.32         2              2              28203.11           12173.37  
     2      7.61      12.27         0              0              29032.61            18699.53  
     3      3.77      7.87          1              1              16189.05             7394.11  
     4     18.67      23.86         2              2              37662.74             16002.27  
     5     16.59      31.86         2              2              26130.15             13880.79

Thank you so much! Appreciate your help.
...

mjfrigaard · March 2, 2022, 8:00pm

Hi @GabbyJ !

I don't know if this is a viable solution (it's also way more code ), but I pivoted to long, grouped on the prefix, summarized, then pivoted wider.

It looks like you are missing a 'group' variable for the prefix? i.e. ff, gc, or volume?

library(datapasta)
library(tidyverse)


df <- tibble::tribble(
    ~id, ~ff_l4l5, ~ff_l5s1, ~gc_l4l5, ~gc_l5s1, ~volume_l4l5, ~volume_l5s1,
      1,    24.16,    30.32,        2,        2,     28203.11,     12173.37,
      2,     7.61,    12.27,        0,        0,     29032.61,     18699.53,
      3,     3.77,     7.87,        1,        1,     16189.05,      7394.11,
      4,    18.67,    23.86,        2,        2,     37662.74,     16002.27,
      5,    16.59,    31.86,        2,        2,     26130.15,     13880.79
    )

df_long <- df %>% 
    # switch data to long
    pivot_longer(cols = -id, 
                 names_to = "variable", 
                 values_to = "value") %>% 
    # create grouping variable based on regex
    mutate(k = case_when(
        str_detect(variable, "ff_") ~ "ff", 
        str_detect(variable, "gc_") ~ "gc", 
        str_detect(variable, "volume_") ~ "volume", 
    )) 

df_long
#> # A tibble: 30 × 4
#>       id variable       value k     
#>    <dbl> <chr>          <dbl> <chr> 
#>  1     1 ff_l4l5        24.2  ff    
#>  2     1 ff_l5s1        30.3  ff    
#>  3     1 gc_l4l5         2    gc    
#>  4     1 gc_l5s1         2    gc    
#>  5     1 volume_l4l5 28203.   volume
#>  6     1 volume_l5s1 12173.   volume
#>  7     2 ff_l4l5         7.61 ff    
#>  8     2 ff_l5s1        12.3  ff    
#>  9     2 gc_l4l5         0    gc    
#> 10     2 gc_l5s1         0    gc    
#> # … with 20 more rows

This generates a table of means based on the 'group', so I am not sure if it's more helpful than what you're already doing?

But now we can calculate summary stats by k.

df_avgs <- df_long %>% 
    # group by id and k
    group_by(id, k) %>% 
    # get mean and any other statistics 
    summarize(avgs = mean(value)) %>% 
    # pivot back to wide 
    pivot_wider(names_from = k, values_from = avgs)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.

df_avgs
#> # A tibble: 5 × 4
#> # Groups:   id [5]
#>      id    ff    gc volume
#>   <dbl> <dbl> <dbl>  <dbl>
#> 1     1 27.2      2 20188.
#> 2     2  9.94     0 23866.
#> 3     3  5.82     1 11792.
#> 4     4 21.3      2 26833.
#> 5     5 24.2      2 20005.

If you don't have a huge number of k (and a consistent naming convention), it might be worthwhile to create a grouping variable.

^{Created on 2022-03-02 by the reprex package (v2.0.1)}

GabbyJ · March 2, 2022, 8:53pm

Dear Martin,

Thank you so much for taking time to code this and reply. Your code works great!

Could I ask you one follow-up question?
when I add some demographic columns to the df (i.e. age, sex) the code gives me mostly NA's for FF. Could you kindly help me understand how to incorporate these demographic vars in the wide->long->wide transformation code?

Thank you again!
Gabby

system · March 23, 2022, 8:53pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.