I extract data from Hive databases on our Cloudera Hadoop cluster using dbplyr
combined with DBI
and odbc
.
Now IT is imposing a change in that odbc Hive connections should be made using native queries to reduce performance overhead. On Linux this means including UseNativeQuery=1
in the odbc.ini
file. On Windows it's under the advanced options in the ODBC Data Source Adminstrator
tool.
This change causes issues when using dbplyr
. DBI
still works with the appropriate SQL though.
Please note that this issue may share common elements with this previous post:
I have uploaded a copy of mtcars
to our container, so the following is not directly reproducible, but hopefully it demonstrates the issues and errors:
suppressPackageStartupMessages(library(dplyr)) # 0.8.0.1
#> Warning: package 'dplyr' was built under R version 3.4.4
suppressPackageStartupMessages(library(dbplyr)) # 1.4.0
# odbc connection using native queries
con_native <- DBI::dbConnect(odbc::odbc(), "authentication")
# unsuccessful with DBI, but managed to upload copy of mtcars to Hive via:
# RODBC::sqlSave(con, mtcars, "container_name.tmp_mtcars", rownames = "car")
# container_name hidden
mtcars_native <- tbl(con_native, in_schema("container_name", "tmp_mtcars"))
# output is as expected with table name prefix for native queries
# `mpg` is recognised as column name even without table name prefix
mtcars_native %>%
filter(mpg > 30)
#> # Source: lazy query [?? x 12]
#> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE]
#> tmp_mtcars.car tmp_mtcars.mpg tmp_mtcars.cyl tmp_mtcars.disp
#> <chr> <dbl> <dbl> <dbl>
#> 1 Fiat 128 32.4 4 78.7
#> 2 Honda Civic 30.4 4 75.7
#> 3 Toyota Corolla 33.9 4 71.1
#> 4 Lotus Europa 30.4 4 95.1
#> # ... with 8 more variables: tmp_mtcars.hp <dbl>, tmp_mtcars.drat <dbl>,
#> # tmp_mtcars.wt <dbl>, tmp_mtcars.qsec <dbl>, tmp_mtcars.vs <dbl>,
#> # tmp_mtcars.am <dbl>, tmp_mtcars.gear <dbl>, tmp_mtcars.carb <dbl>
# same result (not run)
# head(DBI::dbGetQuery(con_native, "SELECT * FROM container_name.tmp_mtcars WHERE (`mpg` > 30.0)"))
# column names not recognised
mtcars_native %>%
select(car, mpg)
#> Error in .f(.x[[i]], ...): object 'car' not found
# output successful, but without table name prefix
head(DBI::dbGetQuery(con_native, "SELECT `car`, `mpg` FROM container_name.tmp_mtcars"))
#> car mpg
#> 1 Mazda RX4 21.0
#> 2 Mazda RX4 Wag 21.0
#> 3 Merc 280C 17.8
#> 4 Merc 450SE 16.4
#> 5 Merc 450SL 17.3
#> 6 Merc 450SLC 15.2
# same scenario with mutate (not run)
# mtcars_native %>%
# mutate(kpl = mpg * 0.425)
# output successful even with mutate, but without table name prefix
mtcars_native %>%
group_by(cyl) %>%
summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>%
mutate(kpl_mean = mpg_mean * 0.425)
#> # Source: lazy query [?? x 3]
#> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE]
#> cyl mpg_mean kpl_mean
#> <dbl> <dbl> <dbl>
#> 1 4 26.7 11.3
#> 2 6 19.7 8.39
#> 3 8 15.1 6.42
# same result (not run)
# head(DBI::dbGetQuery(con_native, "SELECT `cyl`, `mpg_mean`, `mpg_mean` * 0.425 AS `kpl_mean` FROM
# (SELECT `cyl`, AVG(`mpg`) AS `mpg_mean` FROM container_name.tmp_mtcars GROUP BY `cyl`) `dbplyr_001`"))
DBI::dbDisconnect(con_native)
Created on 2019-06-05 by the reprex package (v0.3.0)
Standard select
(and mutate
) queries fail with dbplyr
due to the column names not being recognised without the table name prefix, but are successful with DBI
. However, other standard syntax (filter
, group_by
, summarise
) works with just the raw column names. select
/mutate
are, however, successful following such a summarise
operation.
Therefore my questions are:
- Can
dbplyr
be enhanced to allow the use of Hive native queries? - Or are there any other workarounds using native queries which do not involve having to use SQL strings which become unreadable with complex queries?
I'm happy to provide any further system details I can, but I may be limited because the system is not managed by our team.