dbplyr Issues/Errors with Hive Native Queries

martin.R · June 5, 2019, 3:27pm

I extract data from Hive databases on our Cloudera Hadoop cluster using dbplyr combined with DBI and odbc.

Now IT is imposing a change in that odbc Hive connections should be made using native queries to reduce performance overhead. On Linux this means including UseNativeQuery=1 in the odbc.ini file. On Windows it's under the advanced options in the ODBC Data Source Adminstrator tool.

This change causes issues when using dbplyr. DBI still works with the appropriate SQL though.

Please note that this issue may share common elements with this previous post:

I have uploaded a copy of mtcars to our container, so the following is not directly reproducible, but hopefully it demonstrates the issues and errors:

suppressPackageStartupMessages(library(dplyr))  # 0.8.0.1
#> Warning: package 'dplyr' was built under R version 3.4.4
suppressPackageStartupMessages(library(dbplyr)) # 1.4.0

# odbc connection using native queries
con_native <- DBI::dbConnect(odbc::odbc(), "authentication")

# unsuccessful with DBI, but managed to upload copy of mtcars to Hive via:
# RODBC::sqlSave(con, mtcars, "container_name.tmp_mtcars", rownames = "car")

# container_name hidden
mtcars_native <- tbl(con_native, in_schema("container_name", "tmp_mtcars"))

# output is as expected with table name prefix for native queries
# `mpg` is recognised as column name even without table name prefix
mtcars_native %>% 
  filter(mpg > 30)
#> # Source:   lazy query [?? x 12]
#> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE]
#>   tmp_mtcars.car tmp_mtcars.mpg tmp_mtcars.cyl tmp_mtcars.disp
#>   <chr>                   <dbl>          <dbl>           <dbl>
#> 1 Fiat 128                 32.4              4            78.7
#> 2 Honda Civic              30.4              4            75.7
#> 3 Toyota Corolla           33.9              4            71.1
#> 4 Lotus Europa             30.4              4            95.1
#> # ... with 8 more variables: tmp_mtcars.hp <dbl>, tmp_mtcars.drat <dbl>,
#> #   tmp_mtcars.wt <dbl>, tmp_mtcars.qsec <dbl>, tmp_mtcars.vs <dbl>,
#> #   tmp_mtcars.am <dbl>, tmp_mtcars.gear <dbl>, tmp_mtcars.carb <dbl>

# same result (not run)
# head(DBI::dbGetQuery(con_native, "SELECT * FROM container_name.tmp_mtcars WHERE (`mpg` > 30.0)"))

# column names not recognised
mtcars_native %>% 
  select(car, mpg)
#> Error in .f(.x[[i]], ...): object 'car' not found

# output successful, but without table name prefix
head(DBI::dbGetQuery(con_native, "SELECT `car`, `mpg` FROM container_name.tmp_mtcars"))
#>             car  mpg
#> 1     Mazda RX4 21.0
#> 2 Mazda RX4 Wag 21.0
#> 3     Merc 280C 17.8
#> 4    Merc 450SE 16.4
#> 5    Merc 450SL 17.3
#> 6   Merc 450SLC 15.2

# same scenario with mutate (not run)
# mtcars_native %>% 
#   mutate(kpl = mpg * 0.425)

# output successful even with mutate, but without table name prefix
mtcars_native %>% 
  group_by(cyl) %>% 
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>% 
  mutate(kpl_mean = mpg_mean * 0.425)
#> # Source:   lazy query [?? x 3]
#> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE]
#>     cyl mpg_mean kpl_mean
#>   <dbl>    <dbl>    <dbl>
#> 1     4     26.7    11.3 
#> 2     6     19.7     8.39
#> 3     8     15.1     6.42

# same result (not run)
# head(DBI::dbGetQuery(con_native, "SELECT `cyl`, `mpg_mean`, `mpg_mean` * 0.425 AS `kpl_mean` FROM 
#                      (SELECT `cyl`, AVG(`mpg`) AS `mpg_mean` FROM container_name.tmp_mtcars GROUP BY `cyl`) `dbplyr_001`"))

DBI::dbDisconnect(con_native)

Created on 2019-06-05 by the reprex package (v0.3.0)

Standard select (and mutate) queries fail with dbplyr due to the column names not being recognised without the table name prefix, but are successful with DBI. However, other standard syntax (filter, group_by, summarise) works with just the raw column names. select/mutate are, however, successful following such a summarise operation.

Therefore my questions are:

Can dbplyr be enhanced to allow the use of Hive native queries?
Or are there any other workarounds using native queries which do not involve having to use SQL strings which become unreadable with complex queries?

I'm happy to provide any further system details I can, but I may be limited because the system is not managed by our team.

edgararuiz · June 5, 2019, 4:28pm

Hi, it may be better to move this to an Issue in dbplyr's GitHub repository, can you open one there please? https://github.com/tidyverse/dbplyr/issues

martin.R · June 5, 2019, 7:03pm

Done:

github.com/tidyverse/dbplyr

Hive native queries errors with select()/mutate()

opened 07:02PM - 05 Jun 19 UTC

closed 10:03PM - 14 Apr 20 UTC

martin-a-wade

Problem described here: https://community.rstudio.com/t/dbplyr-issues-errors-wi…th-hive-native-queries/32443 Hive native queries result in errors using dbplyr and select()/mutate(), but not with DBI. The following is not directly reproducible, but hopefully it demonstrates the issues and errors in differt simple scenarios: ```r suppressPackageStartupMessages(library(dplyr)) # 0.8.0.1 #> Warning: package 'dplyr' was built under R version 3.4.4 suppressPackageStartupMessages(library(dbplyr)) # 1.4.0 # odbc connection using native queries con_native <- DBI::dbConnect(odbc::odbc(), "authentication") # unsuccessful with DBI, but managed to upload copy of mtcars to Hive via: # RODBC::sqlSave(con, mtcars, "container_name.tmp_mtcars", rownames = "car") # container_name hidden mtcars_native <- tbl(con_native, in_schema("container_name", "tmp_mtcars")) # output is as expected with table name prefix for native queries # `mpg` is recognised as column name even without table name prefix mtcars_native %>% filter(mpg > 30) #> # Source: lazy query [?? x 12] #> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE] #> tmp_mtcars.car tmp_mtcars.mpg tmp_mtcars.cyl tmp_mtcars.disp #> <chr> <dbl> <dbl> <dbl> #> 1 Fiat 128 32.4 4 78.7 #> 2 Honda Civic 30.4 4 75.7 #> 3 Toyota Corolla 33.9 4 71.1 #> 4 Lotus Europa 30.4 4 95.1 #> # ... with 8 more variables: tmp_mtcars.hp <dbl>, tmp_mtcars.drat <dbl>, #> # tmp_mtcars.wt <dbl>, tmp_mtcars.qsec <dbl>, tmp_mtcars.vs <dbl>, #> # tmp_mtcars.am <dbl>, tmp_mtcars.gear <dbl>, tmp_mtcars.carb <dbl> # same result (not run) # head(DBI::dbGetQuery(con_native, "SELECT * FROM container_name.tmp_mtcars WHERE (`mpg` > 30.0)")) # column names not recognised mtcars_native %>% select(car, mpg) #> Error in .f(.x[[i]], ...): object 'car' not found # output successful, but without table name prefix head(DBI::dbGetQuery(con_native, "SELECT `car`, `mpg` FROM container_name.tmp_mtcars")) #> car mpg #> 1 Mazda RX4 21.0 #> 2 Mazda RX4 Wag 21.0 #> 3 Merc 280C 17.8 #> 4 Merc 450SE 16.4 #> 5 Merc 450SL 17.3 #> 6 Merc 450SLC 15.2 # same scenario with mutate (not run) # mtcars_native %>% # mutate(kpl = mpg * 0.425) # output successful even with mutate, but without table name prefix mtcars_native %>% group_by(cyl) %>% summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>% mutate(kpl_mean = mpg_mean * 0.425) #> # Source: lazy query [?? x 3] #> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE] #> cyl mpg_mean kpl_mean #> <dbl> <dbl> <dbl> #> 1 4 26.7 11.3 #> 2 6 19.7 8.39 #> 3 8 15.1 6.42 # same result (not run) # head(DBI::dbGetQuery(con_native, "SELECT `cyl`, `mpg_mean`, `mpg_mean` * 0.425 AS `kpl_mean` FROM # (SELECT `cyl`, AVG(`mpg`) AS `mpg_mean` FROM container_name.tmp_mtcars GROUP BY `cyl`) `dbplyr_001`")) DBI::dbDisconnect(con_native) ```

martin.R · June 26, 2019, 7:03pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.