First, let's simplify a bit the example:
This works:
init_dta <- mtcars$disp
fit <- lm(init_dta ~ 1)
bc <- boxcox(fit)
lambda <- bc$x[bc$y == max(bc$y)]
result <- (init_dta^lambda-1)/lambda
But this doesn't work:
my_lm <- function(.x){
lm(.x)
}
init_dta <- mtcars$disp
fit2 <- my_lm(init_dta ~ 1)
bc <- boxcox(fit2)
#> Error in stats::model.frame(formula = .x, drop.unused.levels = TRUE): object '.x' not found
We can look at the traceback:
10. stats::model.frame(formula = .x, drop.unused.levels = TRUE)
9. eval(mf, parent.frame())
8. eval(mf, parent.frame())
7. lm(formula = .x, y = TRUE, qr = TRUE)
6. eval(call, parent.frame())
5. eval(call, parent.frame())
4. update.default(object, y = TRUE, qr = TRUE, ...)
3. update(object, y = TRUE, qr = TRUE, ...)
2. boxcox.lm(fit2)
1. boxcox(fit2)
So, the problem comes from the call to update()
, when running the boxcox function, it is trying to re-run the initial fit, but can not find the variables. You can check the source code of boxcox.lm()
to see that call. And we can look at our lm object:
fit2
#>
#> Call:
#> lm(formula = .x)
#>
#> Coefficients:
#> (Intercept)
#> 230.7
Here the important part is the Call
: upon running, lm()
captured the exact expression that was used; but since we called it within a function, the formula
argument is meaningless. Hence, the later call to update()
fails.
So, let's go back to the source code of boxcox.lm()
(you can display it with View(boxcox)
, or ctrl+click on function name, or F2 while on function name, use the menu on top to select boxcox.lm
). It looks like this:
function (object, lambda = seq(-2, 2, 1/10), plotit = TRUE,
interp = (plotit && (m < 100)), eps = 1/50, xlab = expression(lambda),
ylab = "log-Likelihood", ...)
{
m <- length(lambda)
if (is.null(object$y) || is.null(object$qr))
object <- update(object, y = TRUE, qr = TRUE, ...)
result <- NextMethod()
if (plotit)
invisible(result)
else result
}
So, the call to update()
only happens when y
and qr
are not present in the lm object. Let's look back at the arguments of the lm()
function:
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
By default, qr = TRUE
, but y = FALSE
, so when calling boxcox, it can't find the data for y
, and thus tries to recompute the lm
fit. But since the call
contains references to variable names instead of the names in the data, it fails to rerun. We finally get our working solution:
library(MASS)
# Box-Cox-Normalization of a numerical vector
boxcox_transform <- function(x) {
# Note that x~1 counts as a linear model in R, so you can use the
# boxcox function from MASS
bc <- boxcox(lm(x ~ 1, y = TRUE))
# lambda that maximises the likelihood
lambda <- bc$x[bc$y == max(bc$y)]
# print lambda and 95% confidence limits
# lambda
# range(out$x[out$y > max(out$y)-qchisq(0.95,1)/2])
# Box-Cox-transformed variable
result <- (x^lambda-1)/lambda
# histogram
# histogram(result)
# return transformed variable
return(result)
}
res1 <- boxcox_transform(mtcars$disp)
init_dta <- mtcars$disp
fit <- lm(init_dta ~ 1)
bc <- boxcox(fit)
lambda <- bc$x[bc$y == max(bc$y)]
res2 <- (init_dta^lambda-1)/lambda
waldo::compare(res1, res2)
#> v No differences