Let's start with the first line:
iris_col_sd <- sd(iris$Petal.Length)
becomes
iris_col_sd <- sd(df_input$column_input)
Here you might already notice a problem: outside the function, you give the column name explicitly, without quotes. The same thing happens inside the function: you are telling R to look for a column literally called column_input, whereas what you really want is to tell R to look for a column whose name is the content of the variable column_input. That's actually easy to do with base R alone:
df_input[[column_input]]
Note that I didn't put quotes around column_input, because it is a variable that holds the column name. If column_input <- "Petal.Length", this is equivalent to writing:
df_input[["Petal.Length"]]
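As a quick check of the difference, here is a minimal sketch using the built-in iris data (column_input is set by hand here, purely for illustration):

```r
# column_input holds the column name as a string
column_input <- "Petal.Length"

# [[ ]] looks up the column whose name is stored in column_input
via_brackets <- iris[[column_input]]

# $ looks for a column literally named "column_input", which iris does not have
via_dollar <- iris$column_input

identical(via_brackets, iris$Petal.Length)  # compares the two vectors
is.null(via_dollar)                         # checks whether $ found anything
```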
df_input, on the other hand, works as is: the data frame given as input gets used inside the function:
print_df <- function(df_input){
df_input
}
print_df(iris)
So we now know how to take the sd:
print_df_sd <- function(df_input, column_input){
sd(df_input[[column_input]])
}
print_df_sd(iris, "Petal.Length")
This works, and is a classic base R style of function. But, as you can see, when calling the function you need to provide the column name "Petal.Length" in quotes. In the tidyverse, many functions take column names unquoted. This relies on a special mechanism called non-standard evaluation. I would strongly recommend not trying to do that in your own functions: most of the time it just makes things a lot more complicated, for limited benefit. But if you really want to, see the "Tidy selection" section of your link. A simple way is to pass the arguments straight through to another dplyr function that already handles unquoted names:
library(dplyr)
print_df_sd <- function(df_input, ...){
sd(pull(df_input, ...))
}
print_df_sd(iris, Petal.Length)
Or if you really want to go all the way (but don't ask me to explain):
print_df_sd <- function(df_input, column_input){
quo_column_input <- rlang::enquo(column_input)
sd(pull(df_input, !!quo_column_input))
}
print_df_sd(iris, Petal.Length)
So now we have a way to rewrite the first 3 lines. I would recommend using a standard string for the column name (but if you really want to, you can use the enquo pattern):
return_outliers_params <- function (df_input, column_input, multiplier_input) {
iris_col_sd <- sd(df_input[[column_input]])
thres_min <- mean(df_input[[column_input]]) - (multiplier_input * iris_col_sd)
thres_max <- mean(df_input[[column_input]]) + (multiplier_input * iris_col_sd)
list(iris_col_sd, thres_min, thres_max)
}
return_outliers_params(iris, "Petal.Length", 3)
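One optional tweak, not required by anything above: naming the list elements makes the result easier to use afterwards. A sketch, where the function name return_outliers_params_named and the element names are my own choice, not something from your code:

```r
# Hypothetical variant of return_outliers_params() that returns a named list
return_outliers_params_named <- function(df_input, column_input, multiplier_input) {
  # Extract the column once, using the name stored in column_input
  col_data <- df_input[[column_input]]
  col_sd <- sd(col_data)
  list(
    sd = col_sd,
    thres_min = mean(col_data) - multiplier_input * col_sd,
    thres_max = mean(col_data) + multiplier_input * col_sd
  )
}

params <- return_outliers_params_named(iris, "Petal.Length", 3)
params$thres_max  # access by name instead of by position
```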
So we're left with the mutate(). If we're passing the column name as a character string, we need to use the .data pronoun, as described in the "data masking" section of your link:
return_outliers <- function (df_input, column_input, multiplier_input) {
iris_col_sd <- sd(df_input[[column_input]])
thres_min <- mean(df_input[[column_input]]) - (multiplier_input * iris_col_sd)
thres_max <- mean(df_input[[column_input]]) + (multiplier_input * iris_col_sd)
df_input %>%
mutate(outliers = if_else( .data[[column_input]] > thres_max | .data[[column_input]] < thres_min,
"outlier",""))
}
return_outliers(iris, "Petal.Length", 1.5)
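For comparison, the same flagging can be done without dplyr at all, which can be handy as a cross-check. This is only a sketch: return_outliers_base is a name I made up, and it swaps mutate() and if_else() for plain column assignment and base R's ifelse():

```r
# Hypothetical base R counterpart of return_outliers()
return_outliers_base <- function(df_input, column_input, multiplier_input) {
  col_data <- df_input[[column_input]]
  col_sd <- sd(col_data)
  thres_min <- mean(col_data) - multiplier_input * col_sd
  thres_max <- mean(col_data) + multiplier_input * col_sd
  # ifelse() plays the role of dplyr::if_else() here
  df_input$outliers <- ifelse(col_data > thres_max | col_data < thres_min,
                              "outlier", "")
  df_input
}

flagged <- return_outliers_base(iris, "Petal.Length", 1.5)
table(flagged$outliers)  # counts of flagged and unflagged rows
```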
Now, if you're passing the column name unquoted, it gets more complicated. But because I'm lazy, I notice an interesting pattern: all these sd and mean calls use the contents of the same column. So it's easier to extract these contents once at the beginning, which is a lot less text:
return_outliers_params <- function (df_input, column_input, multiplier_input) {
quo_column_input <- rlang::enquo(column_input)
col_data <- pull(df_input, !!quo_column_input)
col_sd <- sd(col_data)
thres_min <- mean(col_data) - (multiplier_input * col_sd)
thres_max <- mean(col_data) + (multiplier_input * col_sd)
list(col_sd, thres_min, thres_max)
}
return_outliers_params(iris, Petal.Length, 3)
And we're left with the mutate(), and you got it perfectly right with the {{ }}!
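For completeness, here is a sketch of how the pieces might fit together with an unquoted column name. I'm assuming the mutate() in your question uses {{ column_input }} inside if_else(), so treat that part as my reconstruction and adapt it to your actual code. I also use {{ }} inside pull() instead of the enquo and !! pair, which should behave the same way:

```r
library(dplyr)

# Sketch: unquoted column name, embraced with {{ }} wherever dplyr needs it
return_outliers <- function(df_input, column_input, multiplier_input) {
  # Extract the column contents once
  col_data <- pull(df_input, {{ column_input }})
  col_sd <- sd(col_data)
  thres_min <- mean(col_data) - multiplier_input * col_sd
  thres_max <- mean(col_data) + multiplier_input * col_sd
  df_input %>%
    mutate(outliers = if_else(
      {{ column_input }} > thres_max | {{ column_input }} < thres_min,
      "outlier", ""
    ))
}

res <- return_outliers(iris, Petal.Length, 1.5)
head(res)
```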