How can I refer to a specific variable within a function? The goal is to write a function that will easily allow me to summarize a large dataset by various variables for different analyses.
The problem is figuring out how to make the argument I pass to the function refer to an actual column of a data frame in my environment.
Here is an example of the syntax I've been using:
#ex data
place <- c("house", "car", "park", "house", "car")
year <- c("2010", "2010", "2010", "2011", "2011")
var1 <- c(20, 10, 100, 50, 200)
df <- data.frame(place, year, var1)
#what I want to do
function(myvar){
df %>% split(place) %>%
map(mutate, "change"= myvar - lag(myvar)) %>% return()
}
But how do I get myvar to refer to df$var1 as opposed to just "var1"?
$car
  place year var1 change
2   car 2010   10     NA
5   car 2011  200    190

$house
  place year var1 change
1 house 2010   20     NA
4 house 2011   50     30

$park
  place year var1 change
3  park 2010  100     NA
In your example, it's not necessary to split the data frame. You can use group_by instead.
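A minimal sketch of the group_by approach, assuming dplyr with rlang 0.4.0 or later so that the {{ }} (embrace) operator is available. The function name change_by_place and its data argument are made up for illustration; they are not from the original post.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  place = c("house", "car", "park", "house", "car"),
  year  = c("2010", "2010", "2010", "2011", "2011"),
  var1  = c(20, 10, 100, 50, 200)
)

# change_by_place() is a hypothetical name. {{ myvar }} lets the caller pass
# a bare column name, which is then looked up inside `data`.
change_by_place <- function(data, myvar) {
  data %>%
    group_by(place) %>%
    mutate(change = {{ myvar }} - lag({{ myvar }})) %>%
    ungroup()
}

result <- change_by_place(df, var1)
result
```

This returns a single tibble in the original row order rather than a list of data frames, with change computed within each place.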
This is great! But I'm still having trouble getting it to recognize my variable.
#here is a subset of my actual data
GEOID <- c(1001, 1001, 1001, 1001, 1001)
year <- c(2010, 2011, 2012, 2013, 2014)
race <- c("White", "non-white", "White", "non-white", "White")
employment <- c(19812, 4529, 19853, 4286, 19689)
`emp rate` <- c(92.83, 85.58, 92.10, 81.79, 91.18)
emp_dt <- data.frame(GEOID, year, race, employment, `emp rate`)
func <- function(data=emp_dt, groupvar1=GEOID, groupvar2=race, myvar){
tmp_dt_wt <- data %>% group_by({{groupvar1}}, {{groupvar2}}) %>%
arrange(year) %>%
mutate("change" = {{myvar}} - lag({{myvar}}))
}
func(`emp rate`)
# Error in group_by(., { : object 'emp rate' not found
#I tried with a variable without space in the name too
func(employment)
#Error in group_by(., { : object 'employment' not found
Check the errors in the code creating each vector of values, then make sure that emp_dt has all the variables you expect it to have. Does the code run once those errors are fixed?
As joels suggested, check the names of your columns:
names(emp_dt)
Also consider that functions need to be told what to return. As written, the last expression in your function is an assignment, so the result comes back invisibly and nothing prints.
So either don't assign the function internals to the name tmp_dt_wt and let them be returned directly, or keep the assignment but place that name on the final line of the function body so that it is what gets returned.
Finally, when you call your function, consider the parameter order. Do you intend the employment rate variable to be used as myvar? That won't happen unless you specify myvar = or pass it as the fourth argument.
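Putting those two points together, a sketch of how the function could look, assuming dplyr is loaded. The data frame is rebuilt here with explicitly named columns (my change, not the original code), and only the employment column is used so that the space in emp rate does not get in the way. In the original calls, the single unnamed argument is matched to data, the first parameter, which is why group_by reports the object as not found.

```r
library(dplyr)

# Subset of the data from the question, built with explicit column names
emp_dt <- data.frame(
  GEOID      = c(1001, 1001, 1001, 1001, 1001),
  year       = c(2010, 2011, 2012, 2013, 2014),
  race       = c("White", "non-white", "White", "non-white", "White"),
  employment = c(19812, 4529, 19853, 4286, 19689)
)

func <- function(data = emp_dt, groupvar1 = GEOID, groupvar2 = race, myvar) {
  # No assignment on the last expression, so the result is returned visibly
  data %>%
    group_by({{ groupvar1 }}, {{ groupvar2 }}) %>%
    arrange(year) %>%
    mutate(change = {{ myvar }} - lag({{ myvar }}))
}

# Name the argument so it is matched to myvar rather than to data
result <- func(myvar = employment)
result
```

Naming myvar in the call keeps the defaults for data and the two grouping variables, so the other three arguments do not have to be spelled out.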
Got it. Part of the problem was that the variable name emp rate was getting changed to emp.rate within the function, even though the spaces remained in the dataframe outside of the function.
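For anyone who hits the same thing: as far as I can tell, the rename happens when the data frame is built, because data.frame() defaults to check.names = TRUE and runs column names through make.names(). A small base R sketch of the difference (the object names emp_rate_values, default_dt and kept_dt are made up for illustration):

```r
emp_rate_values <- c(92.83, 85.58, 92.10, 81.79, 91.18)

# Default behaviour: check.names = TRUE turns the space into a dot
default_dt <- data.frame(`emp rate` = emp_rate_values)
names(default_dt)
# "emp.rate"

# check.names = FALSE keeps the name exactly as written
kept_dt <- data.frame(`emp rate` = emp_rate_values, check.names = FALSE)
names(kept_dt)
# "emp rate"
```

With the space kept, the column has to be wrapped in backticks wherever it is referenced, for example myvar = `emp rate` in the function call. Using a name without spaces, such as emp_rate, avoids the issue entirely.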