Is there an argument against translating str_sub() to SUBSTRING() in dbplyr?

Hlynur · February 25, 2018, 5:03am

Working with dbplyr on a Microsoft SQL Server odbc connection, I'm wondering why stringr::str_sub does not get translated to the SUBSTRING() window function but substr does. I've located the bit of code in the dbplyr github repo that does this translation for substr, but before submitting a pull request with a similiar code snippet for str_sub I wanted to ask around and see if anyone knew of any low-level differences between str_sub and substr that would result in this SQL translation being a bad idea, or even some other sort of reasoning (tidyverse philosophy) that would go against this. All input appreciated.

library(tidyverse)
library(stringr)
library(odbc)
library(dbplyr)

#Let's pretend we have a Microsoft SQL Server <OdbcConnection>
#where the data is nycflights13::flights

flights %>% 
  filter(substr(tailnum, 1L, 2L) != "N1") %>% 
  collect()
#This works

flights %>% 
  filter(substring(tailnum, 1L, 2L) != "N1") %>% 
  collect()
#This also works

flights %>% 
  filter(str_sub(tailnum, 1L, 2L) != "N1") %>% 
  collect()
#This doesn't work

edgararuiz · February 25, 2018, 11:07pm

Hi @Hlynur, that's a great idea. So far, from what I can see, some of the nuanced differences will be in the defaults. For example, str_sub() defaults start to 1L whereas substr() will err out if you don't pass the start arg. I'd recommend that you test your translation against the examples in the help page: http://stringr.tidyverse.org/reference/str_sub.html#examples