I just got the following error message using dbplyr to extract data from a Hive database:
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: HY000: [Cloudera][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 8:7 Not yet supported place for UDAF 'row_number'
The line generating the error is filter(row_number() == 1L) as part of a windowed rank function.
I upgraded dbplyr to 1.4 just before running this and I haven't had this error prior to the upgrade, so I'm assuming this is the cause (unless the database backend changed in some way).
I'm happy to supply more detail if you cannot replicate the error.
Thanks, I should have done that. I managed to find a server still running dbplyr 1.3.0, so the differences become obvious via show_query() which should hopefully be enough as explain() is a much longer output.
Here is a simplified form of the code with names changed:
This removes duplicate id values. (There may be better ways of doing this, but this is a simplified query).
dplyr 1.3.0:
> my_query %>% show_query()
<SQL>
SELECT `id_column`, `datestamp`
FROM (SELECT `id_column`, `datestamp`, row_number() OVER (PARTITION BY `id_column`) AS `zzz9`
FROM (SELECT *
FROM (SELECT `id_column`, `datestamp`
FROM container_name.table_name) `sbbdxdpnnx`
WHERE (`datestamp` = '2019-04-30')) `qcpxocjxmq`) `akxsynfdxk`
WHERE (`zzz9` = 1)
dplyr 1.4.0:
> my_query %>% show_query()
<SQL>
SELECT *
FROM (SELECT *
FROM (SELECT `id_column`, `datestamp`
FROM container_name.table_name) `dbplyr_030`
WHERE (`datestamp` = '2019-04-30')) `dbplyr_031`
WHERE (row_number() = 1)
dbplyr 1.4.0 therefore omits the ... row_number() OVER (PARTITION BYid_column) windowing and errors when it comes to WHERE (row_number() = 1).
I tried to upload mtcars as a sample dataset to our database, but failed and our admin is on holiday. I hope that this example provides enough detail to explain the cause of the error.
Created on 2019-05-01 by the reprex package (v0.2.1.9000)
I think you can open an issue in the dblyr repo. I use git bisect to find where this comes from and it is this commit that has changed the behavior
Before, base odbc translation where used (base_odbc_agg and base_odbc_win) and it worked with hive. Now, the translation are customized and row_number() is missing, as window function.
You should open the issue explaining all this, and if you know the correct SQL it could be awesome. I don't have a hive database to test and try so it is better if it is you. I can help with a PR if you want - it is rather easy to add some new translation. Hive translation live there
If you prefer that I open the issue, please do not hesitate.
Note for those who don't know git bisect (yet): it is a very useful tool in git to find which is the first commit to introduce the bug or regression. With this tool, and a reprex it is very efficient!
Recently Jim Hester has done a video on this topic - a great way to discover: video