Does "group_by() %>% mutate(a = sum(b)) work in connection to hive / impala by rjdbc?

SimonD · April 18, 2022, 11:52am

Hi,

I usually connect to hive / impala and manipulate data in a Cloudera platform by RJDBC
(for example, con <- dbConnect(drv, url="jdbc:hive2://xxx", user="yyy", password="zzz"))

My problem is that 'tbl(con, "table") %>% group_by( c ) %>% mutate( a = sum( b ) )' does not work in my hive / impala setting and the command generates an error message.

I expected that "group_by + mutate(sum)" combination should work well in hive / impala setting as it does in RPostgreSQL or maybe sparklyr setting (sparklyr - Manipulating Data with dplyr).

Is the "group_by + mutate( window function )" combination not supported by dbplyr in hive or impala setting? Otherwise, did I make some mistake?

practice = tbl(con, "practice0") 

# Source:   lazy query [?? x 2]
# Database: JDBCConnection
#   pt_dt    stat_value
#   <dbl>      <dbl>
# 1     6         21  
# 2     6         21  
# 3     4         22.8
# 4     6         21.4
# 5     8         18.7
# 6     6         18.1
# 7     8         14.3
# 8     4         24.4
# 9     4         22.8
#10     6         19.2

dat = practice %>% group_by(pt_dt) %>% mutate(a = sum(stat_value)) 

#######This is an output when the above command works in Postgresql setting#####
# Source:   lazy query [?? x 3]
# Database: JDBCConnection
# Groups:   pt_dt
#   pt_dt    stat_value     a
#   <dbl>      <dbl>      <dbl>
# 1     4       22.8       70 
# 2     4       22.8       70 
# 3     4       24.4       70 
# 4     6       18.1       101.
# 5     6       19.2       101.
# 6     6       21         101.
# 7     6       21.4       101.
# 8     6       21         101.
# 9     8       18.7       33 
#10     8       14.3      33 

dat = practice %>% group_by(pt_dt) %>% mutate(a = sum(stat_value)) 

###########This is an error message in hive setting###########
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set", :
#Unable to retrieve JDBC result set
#JDBC ERROR: [Cloudera] [HiveJDBCDriver] (5000051) 
#ERROR processing query/statement. Error Code: 40000, SQL state: 
#TStatus(statusCode:ERROR_STATUS, 
#infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error 
#while compiling statement: FAILED: 
#ParseException line 1:105 cannot recognize input near 'AS' '"a"' 'FROM' in 
#selection target:17:16, #org.apache.hive.service.cli.operation.Operation:
#toSQLException:Operation.java:362, #org.apache.hive.service.cli.operation.SQLOperation:prepare:
#SQLOperation.java:206, 
#org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:260, 
#org.apache.hive.service.cli.operation.Operation:run:Operation.java:274, 
#org.apache.hive.service.cli.session.HiveSessionImpL
#executeStatementInternalLHiveSessionImpl.javaL565,
#org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:
#HiveSessionImpl.java:551, 
#org.apache.hive.service.clu.CLIService:executestatementAsync:
#CLIService.java:315, or

I also tried
dat = practice %>% group_by(pt_dt) %>% mutate(a = sql("sum(stat_value)")) #It even generates incorrect SQL
dat = practice %>% mutate(a = sql("sum(stat_value) over (partition by pt_dt) ")).
However, both generate similar errors.

Please let me know some solution.

Thank you

system · May 9, 2022, 11:53am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.