Replace looping with apply in simple Permutation Test code

BrentHutto · December 9, 2018, 1:20pm

I am learning R programming by way of reviewing basic stats with Julian Faraway's book "Linear Models with R". Very early in the book he gives example code to implement a Permutation Test on one of his datasets (included in the "faraway" package).

Am I correct that using a for() looping construct in this manner is a poor habit to get into when programming in R?

In theory I'd think the sort of thing he's doing is best implemented with one of the apply-family functions. But for some reason I seem to have a total mental block about how exactly one passes parameters into and out of a non-built-in function supplied to the apply().

Here's the way he does it. I just spent an hour-plus looking at various examples of apply() and can't seem to get my mind around it. Not sure what it is about apply() that I find so opaque.

library(faraway)
library(tidyverse)

data(gala,package="faraway")
lmod <- lm( Species ~ Nearest + Scruz, gala)
lms<-summary(lmod)

nreps<-4000
set.seed(123)
fstats<-numeric(nreps)
for(i in 1:nreps){
  lmods<-lm(sample(Species)~Nearest+Scruz,gala)
  fstats[i]<-summary(lmods)$fstat[1]
}
 
mean(fstats > lms$fstat[1])
#> [1] 0.55825

^{Created on 2018-12-09 by the reprex package (v0.2.1.9000)}

Here's where I was when I finally gave up on writing my own version, after a couple dozen variations and attempts...

fstats<-apply(fstats,1,function(x) summary(lm(sample(Species)~Nearest+Scruz,gala))$fstat[1])

mean(fstats > lms$fstat[1])

cderv · December 9, 2018, 1:53pm

As see you use tidyverse . In the tidyverse, there is the purrr that helps with everything that is iteration. You'll find a function purrr::rerun that will repeat a function n times. I think this is what you are looking for.

library(faraway)
library(tidyverse)

data(gala, package="faraway")

lmod <- lm( Species ~ Nearest + Scruz, gala)
lms<-summary(lmod)


nreps <- 4000
set.seed(123)
# Use rerun to repeat the operation nreps times
fstats <- purrr::rerun(nreps, {
  lmods <- lm(sample(Species)~Nearest+Scruz,gala)
  summary(lmods)$fstat[[1]]
}) %>%
  # flatten the list to a vector
  flatten_dbl()

# get the mean search
mean(fstats > lms$fstat[[1]])
#> [1] 0.55825

^{Created on 2018-12-09 by the reprex package (v0.2.1)}

cderv · December 9, 2018, 2:14pm

And also here is an example how to use broom to tidy model results that allow to use dplyr functions to further manipulate the result.

library(faraway)
library(tidyverse)

data(gala, package="faraway")

lmod <- lm( Species ~ Nearest + Scruz, gala)
lms<-summary(lmod)

nreps <- 4000
set.seed(123)
lmods <- rerun(nreps, {
    lm(sample(Species) ~ Nearest + Scruz, data = gala)
})

# use broom to get a summary table
summary_table <- lmods %>%
  map_dfr(broom::glance)
  
head(summary_table)
#> # A tibble: 6 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1   0.127         0.0625   111.    1.97     0.160     3  -182.  373.  378.
#> 2   0.0784        0.0101   114.    1.15     0.332     3  -183.  374.  380.
#> 3   0.0234       -0.0489   117.    0.324    0.726     3  -184.  376.  382.
#> 4   0.00191      -0.0720   119.    0.0259   0.974     3  -184.  377.  382.
#> 5   0.0629       -0.00646  115.    0.907    0.416     3  -183.  375.  380.
#> 6   0.0458       -0.0248   116.    0.649    0.531     3  -184.  375.  381.
#> # ... with 2 more variables: deviance <dbl>, df.residual <int>

# get the mean of the statistic
summary_table %>%
  summarise(fstat = mean(statistic > lms$fstatistic[[1]]))
#> # A tibble: 1 x 1
#>   fstat
#>   <dbl>
#> 1 0.558

^{Created on 2018-12-09 by the reprex package (v0.2.1)}

BrentHutto · December 9, 2018, 2:54pm

Wow, that's better advice than I'd even hoped for. Since I'm learning R nearly from scratch I've been trying to whenever possible go ahead and learn the tidy/dplyr/ggplot2/etc. forms of things rather than sticking to base R. I will take the second suggestion (using broom) under advisement!

In the meanwhile I ended up slightly elaborating your suggested approach by moving the comparison to the original fitted model F-statistic inside the loop. So the vector we end up with has 1's and 0's that can be piped directly into mean().

The drawback would be that I'm basically discarding the actual F-stats so in the big picture brooming everything into a summary "tibble" is going to be worth doing.

# Use rerun to repeat the operation nreps times
fstats_gt <- purrr::rerun(nreps, {
  lmods <- lm(sample(Species)~Nearest+Scruz,gala)
  summary(lmods)$fstat[[1]]>lms$fstat[1]
}) %>%
  # flatten the list to a vector
  flatten_dbl() %>% 
  mean()

cderv · December 9, 2018, 2:58pm

broom is really awesome to get tidy model ouput.
You should also take a look at tidymodels

And If your question's been answered, would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Max · December 9, 2018, 6:31pm

Also, if the function being permuted takes a while to compute, you could do it in parallel:

library(tidyverse)
library(furrr)
#> Loading required package: future
library(broom)

data(gala, package="faraway")

f_stat <- function(x)
  lm(sample(Species) ~ Nearest + Scruz, data = gala) %>%
  glance() %>% 
  pull("statistic")

nreps <- 4000

# on macOS or *nix
plan(multicore)

prm_f_par <- future_map_dbl(1:nreps, f_stat)
head(prm_f_par)
#> [1] 0.72107203 0.01121396 0.15114937 0.43945546 0.15784493 0.07619823

Created on 2018-12-09 by the reprex package (v0.2.1)

system · December 16, 2018, 6:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.