Error using dbplyr 1.4 with row_number()

martin.R · May 1, 2019, 1:17pm

I just got the following error message using dbplyr to extract data from a Hive database:

Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: HY000: [Cloudera][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 8:7 Not yet supported place for UDAF 'row_number'

The line generating the error is filter(row_number() == 1L) as part of a windowed rank function.

I upgraded dbplyr to 1.4 just before running this and I haven't had this error prior to the upgrade, so I'm assuming this is the cause (unless the database backend changed in some way).

I'm happy to supply more detail if you cannot replicate the error.

cderv · May 1, 2019, 2:55pm

Can you share the dplyr pipe workflow and can you share the SQL query generated ?

There may have been a change that impacted HIVE translation. From the error, it seems that it is not translated...

martin.R · May 1, 2019, 4:21pm

Thanks, I should have done that. I managed to find a server still running dbplyr 1.3.0, so the differences become obvious via show_query() which should hopefully be enough as explain() is a much longer output.

Here is a simplified form of the code with names changed:

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(odbc::odbc(), "my_DSN")

my_query <- 
  tbl(con, in_schema("container_name", "table_name")) %>% 
  select(id_column, datestamp) %>% 
  filter(datestamp == "2019-04-30") %>% 
  group_by(id_column) %>% 
  filter(row_number() == 1L) %>% 
  ungroup()

This removes duplicate id values. (There may be better ways of doing this, but this is a simplified query).

dplyr 1.3.0:

> my_query %>% show_query()
<SQL>
SELECT `id_column`, `datestamp`
FROM (SELECT `id_column`, `datestamp`, row_number() OVER (PARTITION BY `id_column`) AS `zzz9`
FROM (SELECT *
FROM (SELECT `id_column`, `datestamp`
FROM container_name.table_name) `sbbdxdpnnx`
WHERE (`datestamp` = '2019-04-30')) `qcpxocjxmq`) `akxsynfdxk`
WHERE (`zzz9` = 1)

dplyr 1.4.0:

> my_query %>% show_query()
<SQL>
SELECT *
FROM (SELECT *
FROM (SELECT `id_column`, `datestamp`
FROM container_name.table_name) `dbplyr_030`
WHERE (`datestamp` = '2019-04-30')) `dbplyr_031`
WHERE (row_number() = 1)

dbplyr 1.4.0 therefore omits the ... row_number() OVER (PARTITION BYid_column) windowing and errors when it comes to WHERE (row_number() = 1).

I tried to upload mtcars as a sample dataset to our database, but failed and our admin is on holiday. I hope that this example provides enough detail to explain the cause of the error.

cderv · May 1, 2019, 8:42pm

Thanks !

There is a new vignette on how to do reprex with dbplyr, specifically to deal with translation issue. See : Reprexes for dbplyr • dbplyr

Here is a minimal reprex for the change of translation

library(dplyr)
library(dbplyr)

tab <- tbl_lazy(iris, con = simulate_hive())
query <- tab %>%
  group_by(Species) %>%
  filter(row_number() == 1L)
query %>% show_query()
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (row_number() = 1)

^{Created on 2019-05-01 by the reprex package (v0.2.1.9000)}

I think you can open an issue in the dblyr repo. I use git bisect to find where this comes from and it is this commit that has changed the behavior

Before, base odbc translation where used (base_odbc_agg and base_odbc_win) and it worked with hive. Now, the translation are customized and row_number() is missing, as window function.

You should open the issue explaining all this, and if you know the correct SQL it could be awesome. I don't have a hive database to test and try so it is better if it is you. I can help with a PR if you want - it is rather easy to add some new translation. Hive translation live there

github.com

tidyverse/dbplyr/blob/main/R/backend-hive.R

#' Backend: Hive
#'
#' @description
#' See `vignette("translation-function")` and `vignette("translation-verb")` for
#' details of overall translation technology. Key differences for this backend
#' are a scattering of custom translations provided by users.
#'
#' Use `simulate_hive()` with `lazy_frame()` to see simulated SQL without
#' converting to live access database.
#'
#' @name backend-hive
#' @aliases NULL
#' @examples
#' library(dplyr, warn.conflicts = FALSE)
#'
#' lf <- lazy_frame(a = TRUE, b = 1, d = 2, c = "z", con = simulate_hive())
#' lf %>% transmute(x = cot(b))
#' lf %>% transmute(x = bitwShiftL(c, 1L))
#' lf %>% transmute(x = str_replace_all(c, "a", "b"))
#'

This file has been truncated. show original

If you prefer that I open the issue, please do not hesitate.

Note for those who don't know git bisect (yet):
it is a very useful tool in git to find which is the first commit to introduce the bug or regression. With this tool, and a reprex it is very efficient!
Recently Jim Hester has done a video on this topic - a great way to discover: video

cderv · May 1, 2019, 8:48pm

Ok ! writing my answer and looking again I found the error. Just a typo that introduced the regression. I'll do a PR

EDIT: I pushed a fix in this PR. You could try it

github.com/tidyverse/dbplyr

Hive translations for window functions were no more working

tidyverse:master ← cderv:hive-window

opened 08:54PM - 01 May 19 UTC

cderv

+3 -1

This comes from a question on [community.rstudio.com](https://community.rstudio.…com/t/error-using-dbplyr-1-4-with-row-number/29779/5) In dbplyr 1.3.0 ```r library(dplyr) library(dbplyr) tab <- tbl_lazy(iris, src = simulate_hive()) query <- tab %>% group_by(Species) %>% filter(row_number() == 1L) query %>% show_query() <SQL> SELECT `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species` FROM (SELECT `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`, row_number() OVER (PARTITION BY `Species`) AS `zzz1` FROM `df`) `ohjggmqvbj` WHERE (`zzz1` = 1) ``` With 1.4.0 ```r tab <- tbl_lazy(iris, src = simulate_hive()) query <- tab %>% group_by(Species) %>% filter(row_number() == 1L) query %>% show_query() <SQL> SELECT * FROM `df` WHERE (row_number() = 1) ``` A `git bisect` points to this commit: f4fcfdc9f73d56bbf2d93eb705b57709ccae52e5. This comes from a small typo - odbc window function were no more used. This PR changes that and now it is working as before ```r tab <- tbl_lazy(iris, src = simulate_hive()) query <- tab %>% group_by(Species) %>% filter(row_number() == 1L) query %>% show_query() <SQL> SELECT `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species` FROM (SELECT `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`, ROW_NUMBER() OVER (PARTITION BY `Species`) AS `zzz1` FROM `df`) `dbplyr_001` WHERE (`zzz1` = 1) ``` I can add tests later if you want. There are not so many tests for hive curently, and nothing about window function

martin.R · May 2, 2019, 9:48am

@cderv, thank you so much for digging into this and for submitting the PR.

I wasn't aware of this functionality: tab <- tbl_lazy(iris, con = simulate_hive()), so this should help me produce future reprexes.

martin.R · May 9, 2019, 9:48am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.