Function Vectorization and Input Handling

I'm building a calculator that simplifies an extremely complex calculation.

When vectorizing a function, what is the best way to handle the user leaving out given values?

The goal is to be able to pass it a tibble with a variable assigned per column. I know, from debug suffering, that I should use ifelse() throughout and & instead of && and have recently refactored around that.

In our case below, the third column (tank) may or may not have entries, as sometimes the value is needed, but when it's not, I want to skip past the more complicated function and use a simpler one.

Sometimes, the column may not exist/will not be passed to the function, and sometimes rows may simply not have the value.


test_cars <- data.frame(weight = c(2500, 3000, 4000),
                        distance = c(300, 450, 400),
                        tank = c(16, 15, NA))

test_cars2 <- data.frame(weight = c(2500, 3000, 4000),
                         distance = c(300, 450, 400))

mileage_calculation <- function( weight, distance, tank_volume=NULL) {
 
ifelse(exists(weight) & exists(distance) & exists(tank_volume), 
   gas_mileage(weight,distance, tank_volume),
   electric_mileage(weight, distance) )
}

I know that 'exists()' is probably not the way to go, as it'll trip on NULL, which I believe is needed to handle optional inputs. I'm considering '!is.null()', but is there a more effective function to use to check user inputs? What's the best practice to prevent logical mistakes?

Honestly, everything is plug-and-chug based on whether variables are given or not. Otherwise, it's all base R computation or flow logic to address the scenarios.

There are really two separate questions about columns and rows.

Columns

From your description, the function takes a tibble as input, but your example code takes the individual vectors weight, distance, tank_volume as input. You have to choose one.

It's easier here if you make a function that takes a tibble/data.frame as input, assuming a column with a particular name exists:

mileage_calculation_from_df <- function( df ) {
  
  stopifnot( "weight" %in% colnames(df) )
  stopifnot( "distance" %in% colnames(df) )
  
  if("tank" %in% colnames(df)){
    message("`Tank` is given; doing complex computation")
  } else{
    message("No `tank` column, doing simple computation")
  }
  
  message("---- This is run either way")
}

mileage_calculation_from_df(test_cars)
mileage_calculation_from_df(test_cars2)
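The two stopifnot() calls can also be collapsed into one check that names every missing column at once, which tends to give a more useful error message. A minimal sketch, where `check_required_columns` is a hypothetical helper name and not part of any package:

```r
# Hypothetical helper: fail early and name every required column that is absent.
check_required_columns <- function(df, required) {
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ",
         paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

test_cars2 <- data.frame(weight = c(2500, 3000, 4000),
                         distance = c(300, 450, 400))

# Passes silently because both required columns are present.
check_required_columns(test_cars2, c("weight", "distance"))

# An optional column is a single TRUE/FALSE value, so it is safe inside if().
has_tank <- "tank" %in% colnames(test_cars2)
```

Because `has_tank` has length one, a plain if/else is the right tool for the column question; ifelse() only becomes relevant for the row-by-row question.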

Alternatively, if you want to use individual vectors:

mileage_calculation_from_vecs <- function( weight, distance, tank_volume=NULL ) {
  
  
  if( ! missing(tank_volume) ){
    message("`tank_volume` is given; doing complex computation")
  } else{
    message("No `tank_volume` given, doing simple computation")
  }
  
  message("---- This is run either way")
}

mileage_calculation_from_vecs(test_cars$weight,
                              test_cars$distance,
                              test_cars$tank)

mileage_calculation_from_vecs(test_cars2$weight,
                              test_cars2$distance)
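One caveat with missing(): it only tells you whether the argument was supplied in the call, not whether it holds a value. Since `test_cars2$tank` evaluates to NULL, passing it explicitly makes missing() return FALSE, and the complex branch would run on a NULL input. A small sketch comparing the two checks (`supplied_by_missing` and `supplied_by_null` are names made up for this illustration):

```r
# Reports whether the argument appeared in the call at all.
supplied_by_missing <- function(weight, distance, tank_volume = NULL) {
  !missing(tank_volume)
}

# Reports whether the argument holds something other than NULL.
supplied_by_null <- function(weight, distance, tank_volume = NULL) {
  !is.null(tank_volume)
}

test_cars2 <- data.frame(weight = c(2500, 3000, 4000),
                         distance = c(300, 450, 400))

# Argument omitted: both checks agree that nothing was given.
omitted_missing <- supplied_by_missing(test_cars2$weight, test_cars2$distance)
omitted_null    <- supplied_by_null(test_cars2$weight, test_cars2$distance)

# Argument supplied, but the column does not exist, so its value is NULL.
# missing() says "given"; is.null() says "not given".
passed_missing <- supplied_by_missing(test_cars2$weight, test_cars2$distance,
                                      test_cars2$tank)
passed_null    <- supplied_by_null(test_cars2$weight, test_cars2$distance,
                                   test_cars2$tank)
```

If callers are likely to pass a column that may not exist, `!is.null()` is probably the safer test when the default is NULL.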

Note that nothing here is about vectorization. This is all input handling.

Rows

A separate question is how to handle rows with missing values, illustrated by the NA in tank. This is a different problem from the missing column: at this point you have already checked that the column exists, so that is no longer in doubt.

The difficulty is that ifelse(test, yes, no) will evaluate all of yes and no as soon as there are both TRUE and FALSE in test. This can be seen clearly with the first examples of ?ifelse:

x <- c(6:-4)
sqrt(x)  #- gives warning
sqrt(ifelse(x >= 0, x, NA))  # no warning

## Note: the following also gives the warning !
ifelse(x >= 0, sqrt(x), NA)

It is clear why if you check the source code of ifelse(). A simplified version is:

ans <- test

ypos <- which(test)
npos <- which(!test)

ans[ypos] <- yes[ypos]
ans[npos] <- no[npos]

ans
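To see this evaluation behaviour directly, the sketch below wraps each branch in a function that records how many elements it received. The names `counter`, `gas_branch`, and `electric_branch` are invented for the illustration and their arithmetic is arbitrary.

```r
# An environment is used so the wrappers can record their input length.
counter <- new.env()
assign("gas", 0L, envir = counter)
assign("electric", 0L, envir = counter)

gas_branch <- function(x) {
  assign("gas", length(x), envir = counter)
  x * 2
}

electric_branch <- function(x) {
  assign("electric", length(x), envir = counter)
  x + 1
}

x <- c(1, 2, NA, 4)

# Only the third element takes the "no" branch, yet both functions
# are called on the full vector, including the NA.
res <- ifelse(!is.na(x), gas_branch(x), electric_branch(x))

n_gas      <- get("gas", envir = counter)
n_electric <- get("electric", envir = counter)
```

Both counts equal the full length of `x`, so a branch function that errors on NA would fail here even though its result for that element is discarded.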

So in practice, you have to implement mileage_calculation() so that

  • either you do not vectorize: you run a loop and use if/else for each element,
  • or the functions called in each branch can handle NAs.
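Both options can be sketched as follows. The bodies of `gas_mileage()` and `electric_mileage()` are placeholder formulas invented for this example, since the real ones are not shown in the question.

```r
# Placeholder formulas, assumptions for illustration only.
gas_mileage <- function(weight, distance, tank_volume) distance / tank_volume
electric_mileage <- function(weight, distance) distance / (weight / 1000)

test_cars <- data.frame(weight = c(2500, 3000, 4000),
                        distance = c(300, 450, 400),
                        tank = c(16, 15, NA))

# Option 1: no vectorization, one if/else decision per row.
mileage_loop <- vapply(seq_len(nrow(test_cars)), function(i) {
  if (is.na(test_cars$tank[i])) {
    electric_mileage(test_cars$weight[i], test_cars$distance[i])
  } else {
    gas_mileage(test_cars$weight[i], test_cars$distance[i], test_cars$tank[i])
  }
}, numeric(1))

# Option 2: ifelse() works because this placeholder gas_mileage() just
# returns NA for an NA tank instead of failing.
mileage_ifelse <- ifelse(is.na(test_cars$tank),
                         electric_mileage(test_cars$weight, test_cars$distance),
                         gas_mileage(test_cars$weight, test_cars$distance,
                                     test_cars$tank))
```

The two results agree here; option 2 relies entirely on the branch functions tolerating NA input.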

An important point: here we always have the same number of input columns (input handling has already been dealt with); it's just that the column can contain NAs.

If for some reason you already have vectorized functions that don't handle NAs and that you can't easily modify, one workaround is to split the rows beforehand and run each subset separately. Schematically:

is_gas <- which(! is.na(df[["tank"]]) )
is_elec <- which( is.na(df[["tank"]]) )

df_gas <- df[ is_gas,  ]
df_elec <- df[ is_elec, ]

# schematic calls: pass whichever columns your functions actually expect
mileage_gas <- gas_mileage( df_gas )
mileage_elec <- electric_mileage( df_elec )

mileage <- double(length = nrow(df))
mileage[ is_gas ] <- mileage_gas
mileage[ is_elec] <- mileage_elec
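Putting the column check and the row split together, a runnable version might look like the sketch below. The two mileage functions are placeholders invented here; the gas one deliberately refuses NA input to mimic a function that cannot be modified. `mileage_calculation_split` is likewise a hypothetical name.

```r
# Placeholder functions, assumptions for illustration only.
gas_mileage <- function(weight, distance, tank_volume) {
  stopifnot(!anyNA(tank_volume))  # mimics a function that cannot cope with NA
  distance / tank_volume
}
electric_mileage <- function(weight, distance) distance / (weight / 1000)

mileage_calculation_split <- function(df) {
  stopifnot(all(c("weight", "distance") %in% colnames(df)))

  # Column question: no tank column means every row takes the simple path.
  if (!"tank" %in% colnames(df)) {
    return(electric_mileage(df$weight, df$distance))
  }

  # Row question: split on NA, compute each subset, then recombine.
  is_gas <- !is.na(df[["tank"]])
  mileage <- rep(NA_real_, nrow(df))
  mileage[is_gas] <- gas_mileage(df$weight[is_gas],
                                 df$distance[is_gas],
                                 df[["tank"]][is_gas])
  mileage[!is_gas] <- electric_mileage(df$weight[!is_gas],
                                       df$distance[!is_gas])
  mileage
}

test_cars <- data.frame(weight = c(2500, 3000, 4000),
                        distance = c(300, 450, 400),
                        tank = c(16, 15, NA))
test_cars2 <- data.frame(weight = c(2500, 3000, 4000),
                         distance = c(300, 450, 400))

with_tank    <- mileage_calculation_split(test_cars)
without_tank <- mileage_calculation_split(test_cars2)
```

Each function only ever sees rows it can handle, so neither needs to know about NA, and the results come back in the original row order.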