(This is related to a recent post however I'm asking a clean and fresh version of the question)
My dbplyr sql commands seem to be going unrecognized and causing errors. These are very basic commands (select, filter, mutate) that I believe are expected to translate well, leading me to believe I have a issue elsewhere.
I have a connection to our Hive instance using odbc and as I mentioned I'm using dbplyr package to pull data from hive into R.
Without actually providing a odbc connection to our Hive instance to people on the internet, it's tricky to provide a reproducible example. However, I'm hoping that the information I provide here can help diagnose.
Here's the top of my script as I connect to a Hive table:
library(odbc)
library(tidyverse)
library(dbplyr)
library(lubridate)
library(zoo)
# hive connection
con <- dbConnect(odbc::odbc(), "HiveProd")
# tables
order_discrepancy <- tbl(con, sql("select *
from myname.pro_programe_compare_uds_order
where (inorder is null) or (intos is null)"))
Now, if I glimpse order_discrepancy:
> glimpse(order_discrepancy)
Observations: ??
Variables: 6
$ czycmgxdvd.pro_shopper_id <chr> "100319101", "100944143", "101742363", "102141637", "10290598", "103123012", "10421410", "104398857", "10536771", "105412334", "105686556...
$ czycmgxdvd.tos_signed_date <date> NA, 2015-03-19, NA, 2016-01-21, NA, NA, 2015-10-07, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2016-10-03, NA, NA, NA, NA, 2017-02-22, ...
$ czycmgxdvd.uds_order_date <date> 2018-12-07, NA, 2018-12-07, NA, 2018-12-07, 2018-12-07, NA, 2018-09-17, 2018-12-09, 2018-12-09, 2018-12-08, 2018-12-10, 2018-12-10, 2018...
$ czycmgxdvd.date_diff_days <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ czycmgxdvd.inorder <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRU...
$ czycmgxdvd.intos <lgl> NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, TRUE, T...
So everything looks good, I see some data.
Here's a very basic command that works:
> order_discrepancy %>% filter(inorder == TRUE) %>% head()
# Source: lazy query [?? x 6]
# Database: Hive 1.2.2[@Hive/HIVE]
czuqppwbbq.pro_shopper_id czuqppwbbq.tos_signed_date czuqppwbbq.uds_order_date czuqppwbbq.date_diff_days czuqppwbbq.inorder czuqppwbbq.intos
<chr> <date> <date> <int> <lgl> <lgl>
1 100319101 NA 2018-12-07 NA TRUE NA
2 101742363 NA 2018-12-07 NA TRUE NA
3 10290598 NA 2018-12-07 NA TRUE NA
4 103123012 NA 2018-12-07 NA TRUE NA
5 104398857 NA 2018-09-17 NA TRUE NA
6 10536771 NA 2018-12-09 NA TRUE NA
So far so good. But, look what happens if I add select command:
> order_discrepancy %>% filter(inorder == TRUE) %>% select(pro_shopper_id:inorder) %>% head()
Error in is_character(x, encoding = encoding, n = 1) :
object 'pro_shopper_id' not found
So, then I tried just selecting field 'inorder' which I just successfully filtered on in the block 2 above:
> order_discrepancy %>% filter(inorder == TRUE) %>% select(inorder) %>% head()
Error in .f(.x[[i]], ...) : object 'inorder' not found
Interestingly, if I remove the filter condition and try using select only, I receive the same error message. I also tried:
> order_discrepancy %>% mutate(blah = 1) %>% head()
Error in new_result(connection@ptr, statement) :
nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'zzz1.pro_shopper_id': (possible column names are: pro_shopper_id, tos_signed_date, uds_order_date, date_diff_days, inorder, intos)
Here is the result of show_query() command for the above code blocks that failed:
> order_discrepancy %>% mutate(blah = 1) %>% head() %>% show_query()
<SQL>
SELECT `zzz1.pro_shopper_id`, `zzz1.tos_signed_date`, `zzz1.uds_order_date`, `zzz1.date_diff_days`, `zzz1.inorder`, `zzz1.intos`, 1.0 AS `blah`
FROM (select *
from gcameron.pro_programe_compare_uds_order
where (inorder is null) or (intos is null)) `tgzubexmdm`
LIMIT 6
This one won't show the query but I tried:
> order_discrepancy %>% select(inorder) %>% head() %>% show_query()
Error in .f(.x[[i]], ...) : object 'inorder' not found
Here's the one that does actually work:
> order_discrepancy %>% filter(inorder == TRUE) %>% head() %>% show_query()
<SQL>
SELECT *
FROM (select *
from gcameron.pro_programe_compare_uds_order
where (inorder is null) or (intos is null)) `smwemchtqz`
WHERE (`inorder` = TRUE)
LIMIT 6
So, it looks like I can use some commands, e.g. filter, but not others.
In case it's relevant, here is my .odbc.ini configuration with our company data hidden:
[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.
[ODBC Data Sources]
HiveProd=Hortonworks Hive ODBC Driver 64-bit
[HiveProd]
# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Hortonworks Hive ODBC Driver (64-bit) DSN
# Driver: The location where the ODBC driver is installed to.
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# When using No Service Discovery, specify the IP address or host name of the Hive server.
# When using ZooKeeper as the Service Discovery Mode, specify a comma-separated list of ZooKeeper
# servers in the following format:
# <zk_host1:zk_port1>,<zk_host2:zk_port2>,...
HOST=[our hosts here]
# The TCP port Hive server is listening. This is not required when using ZooKeeper as the service
# discovery mode as the port is specified in the HOST connection attribute.
PORT=
# The name of the database schema to use when a schema is not explicitly specified in a query.
Schema=default
# Set to 0 to when connecting directory to Hive Server 2 (No Service Discovery).
# Set to 1 to do Hive Server 2 service discovery using ZooKeeper.
# Note service discovery is not support when using Hive Server 1.
ServiceDiscoveryMode=1
# The namespace on ZooKeeper under which Hive Server 2 znodes are added. Required only when doing
# HS2 service discovery with ZooKeeper (ServiceDiscoveryMode=1).
ZKNamespace=/hive/hiveserver2
# Set to 1 if you are connecting to Hive Server 1. Set to 2 if you are connecting to Hive Server 2.
HiveServerType=2
# The authentication mechanism to use for the connection.
# Set to 0 for No Authentication
# Set to 1 for Kerberos
# Set to 2 for User Name
# Set to 3 for User Name and Password
# Note only No Authentication is supported when connecting to Hive Server 1.
AuthMech=1
# The Thrift transport to use for the connection.
# Set to 0 for Binary
# Set to 1 for SASL
# Set to 2 for HTTP
# Note for Hive Server 1 only Binary can be used.
ThriftTransport=1
# When this option is enabled (1), the driver does not transform the queries emitted by an
# application, so the native query is used.
# When this option is disabled (0), the driver transforms the queries emitted by an application and
# converts them into an equivalent from in HiveQL.
UseNativeQuery=0
# Set the UID with the user name to use to access Hive when using AuthMech 2 to 8.
UID=
# The following is settings used when using Kerberos authentication (AuthMech 1 and 10)
# The fully qualified host name part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
# hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbHostFQDN to myhs2.mydomain.com
KrbHostFQDN=hive.hadoop.ourdomain.com
# The service name part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
# hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbServiceName to hive
KrbServiceName=hive
# The realm part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
# hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbRealm to EXAMPLE.COM
KrbRealm=HADOOP.PROD.INT.OURNAME.COM
# Set to 1 to enable SSL. Set to 0 to disable.
SSL=0
# Set to 1 to enable two-way SSL. Set to 0 to disable. You must enable SSL in order to
# use two-way SSL.
TwoWaySSL=0
# The file containing the client certificate in PEM format. This is required when using two-way SSL.
ClientCert=
# The client private key. This is used for two-way SSL authentication.
ClientPrivateKey=
# The password for the client private key. Password is only required for password protected
# client private key.
ClientPrivateKeyPassword=
Any information to help overcome these issues would be very much appreciated