Hi,
I usually connect to hive / impala and manipulate data in a Cloudera platform by RJDBC
(for example, con <- dbConnect(drv, url="jdbc:hive2://xxx", user="yyy", password="zzz"))
My problem is that 'tbl(con, "table") %>% group_by( c ) %>% mutate( a = sum( b ) )' does not work in my hive / impala setting and the command generates an error message.
I expected that "group_by + mutate(sum)" combination should work well in hive / impala setting as it does in RPostgreSQL or maybe sparklyr setting (sparklyr - Manipulating Data with dplyr).
Is the "group_by + mutate( window function )" combination not supported by dbplyr in hive or impala setting? Otherwise, did I make some mistake?
practice = tbl(con, "practice0")
# Source: lazy query [?? x 2]
# Database: JDBCConnection
# pt_dt stat_value
# <dbl> <dbl>
# 1 6 21
# 2 6 21
# 3 4 22.8
# 4 6 21.4
# 5 8 18.7
# 6 6 18.1
# 7 8 14.3
# 8 4 24.4
# 9 4 22.8
#10 6 19.2
dat = practice %>% group_by(pt_dt) %>% mutate(a = sum(stat_value))
#######This is an output when the above command works in Postgresql setting#####
# Source: lazy query [?? x 3]
# Database: JDBCConnection
# Groups: pt_dt
# pt_dt stat_value a
# <dbl> <dbl> <dbl>
# 1 4 22.8 70
# 2 4 22.8 70
# 3 4 24.4 70
# 4 6 18.1 101.
# 5 6 19.2 101.
# 6 6 21 101.
# 7 6 21.4 101.
# 8 6 21 101.
# 9 8 18.7 33
#10 8 14.3 33
dat = practice %>% group_by(pt_dt) %>% mutate(a = sum(stat_value))
###########This is an error message in hive setting###########
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set", :
#Unable to retrieve JDBC result set
#JDBC ERROR: [Cloudera] [HiveJDBCDriver] (5000051)
#ERROR processing query/statement. Error Code: 40000, SQL state:
#TStatus(statusCode:ERROR_STATUS,
#infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error
#while compiling statement: FAILED:
#ParseException line 1:105 cannot recognize input near 'AS' '"a"' 'FROM' in
#selection target:17:16, #org.apache.hive.service.cli.operation.Operation:
#toSQLException:Operation.java:362, #org.apache.hive.service.cli.operation.SQLOperation:prepare:
#SQLOperation.java:206,
#org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:260,
#org.apache.hive.service.cli.operation.Operation:run:Operation.java:274,
#org.apache.hive.service.cli.session.HiveSessionImpL
#executeStatementInternalLHiveSessionImpl.javaL565,
#org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:
#HiveSessionImpl.java:551,
#org.apache.hive.service.clu.CLIService:executestatementAsync:
#CLIService.java:315, or
I also tried
dat = practice %>% group_by(pt_dt) %>% mutate(a = sql("sum(stat_value)")) #It even generates incorrect SQL
dat = practice %>% mutate(a = sql("sum(stat_value) over (partition by pt_dt) ")).
However, both generate similar errors.
Please let me know some solution.
Thank you