Does "group_by() %>% mutate(a = sum(b)) work in connection to hive / impala by rjdbc?

Hi,

I usually connect to hive / impala and manipulate data in a Cloudera platform by RJDBC
(for example, con <- dbConnect(drv, url="jdbc:hive2://xxx", user="yyy", password="zzz"))

My problem is that 'tbl(con, "table") %>% group_by( c ) %>% mutate( a = sum( b ) )' does not work in my hive / impala setting and the command generates an error message.

I expected that "group_by + mutate(sum)" combination should work well in hive / impala setting as it does in RPostgreSQL or maybe sparklyr setting (sparklyr - Manipulating Data with dplyr).

Is the "group_by + mutate( window function )" combination not supported by dbplyr in hive or impala setting? Otherwise, did I make some mistake?

practice = tbl(con, "practice0") 

# Source:   lazy query [?? x 2]
# Database: JDBCConnection
#   pt_dt    stat_value
#   <dbl>      <dbl>
# 1     6         21  
# 2     6         21  
# 3     4         22.8
# 4     6         21.4
# 5     8         18.7
# 6     6         18.1
# 7     8         14.3
# 8     4         24.4
# 9     4         22.8
#10     6         19.2

dat = practice %>% group_by(pt_dt) %>% mutate(a = sum(stat_value)) 

#######This is an output when the above command works in Postgresql setting#####
# Source:   lazy query [?? x 3]
# Database: JDBCConnection
# Groups:   pt_dt
#   pt_dt    stat_value     a
#   <dbl>      <dbl>      <dbl>
# 1     4       22.8       70 
# 2     4       22.8       70 
# 3     4       24.4       70 
# 4     6       18.1       101.
# 5     6       19.2       101.
# 6     6       21         101.
# 7     6       21.4       101.
# 8     6       21         101.
# 9     8       18.7       33 
#10     8       14.3      33 

dat = practice %>% group_by(pt_dt) %>% mutate(a = sum(stat_value)) 

###########This is an error message in hive setting###########
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set", :
#Unable to retrieve JDBC result set
#JDBC ERROR: [Cloudera] [HiveJDBCDriver] (5000051) 
#ERROR processing query/statement. Error Code: 40000, SQL state: 
#TStatus(statusCode:ERROR_STATUS, 
#infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error 
#while compiling statement: FAILED: 
#ParseException line 1:105 cannot recognize input near 'AS' '"a"' 'FROM' in 
#selection target:17:16, #org.apache.hive.service.cli.operation.Operation:
#toSQLException:Operation.java:362, #org.apache.hive.service.cli.operation.SQLOperation:prepare:
#SQLOperation.java:206, 
#org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:260, 
#org.apache.hive.service.cli.operation.Operation:run:Operation.java:274, 
#org.apache.hive.service.cli.session.HiveSessionImpL
#executeStatementInternalLHiveSessionImpl.javaL565,
#org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:
#HiveSessionImpl.java:551, 
#org.apache.hive.service.clu.CLIService:executestatementAsync:
#CLIService.java:315, or

I also tried
dat = practice %>% group_by(pt_dt) %>% mutate(a = sql("sum(stat_value)")) #It even generates incorrect SQL
dat = practice %>% mutate(a = sql("sum(stat_value) over (partition by pt_dt) ")).
However, both generate similar errors.

Please let me know some solution.

Thank you

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.