I get this error message when trying to send json to S3:
Error : Unable to append onto table: revenue_predictions.mtcars
org.apache.hadoop.hive.ql.io.orc.OrcSerde: Is currently not supported by noctua
I am using the noctua package to send data to S3. When I create the table in question in Hive, I can use e.g. the parquet file format:
CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
    mpg double,
    cyl double
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
stored as parquet
LOCATION
's3://ourco-emr/tables/example.db/mtcars';
Then, I can send this to S3 using noctua:
library(DBI)    # dbWriteTable()
library(dplyr)  # %>% and select()

dbWriteTable(conn = con_s3,
             name = "examples.mtcars",
             value = mtcars %>% select(mpg, cyl),
             append = TRUE,
             overwrite = FALSE,
             file.type = "parquet",
             partition = c(year = yr, month = mt, day = d),
             s3.location = "s3://ourco-emr/tables/example.db")
So far so good. After running this I can see the data in S3 and can query it with Athena or Hive.
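For example, a quick sanity check in Athena or Hive returns the rows I just wrote (the partition values here are hypothetical placeholders for whatever yr, mt and d held):

SELECT mpg, cyl
FROM revenue_predictions.mtcars
WHERE year = '2020' AND month = '01' AND day = '01'
LIMIT 10;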
I would like to change this to json format instead of parquet format. noctua's dbWriteTable() does seem to offer this functionality with the file.type parameter, where I'd change the above call to file.type = "json".
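In other words, the write call would stay identical to the parquet version apart from that one parameter:

dbWriteTable(conn = con_s3,
             name = "examples.mtcars",
             value = mtcars %>% select(mpg, cyl),
             append = TRUE,
             overwrite = FALSE,
             file.type = "json",  # the only change from the parquet call
             partition = c(year = yr, month = mt, day = d),
             s3.location = "s3://ourco-emr/tables/example.db")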
To use json format, our engineers told me to change the Hive create-table statement above to this:
CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
    mpg double,
    cyl double
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
row format serde
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as inputformat
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://ourco-emr/tables/example.db/mtcars';
So, the line stored as parquet has been changed to:
row format serde
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as inputformat
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
outputformat
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
When I create the table this way and then try to send data with file.type = "json", I get the error shown at the top of this post; the json does not seem to be detected.
In order to get Hive and Athena to recognize the json that noctua sends to S3, how should I define the json format when creating the table? I found several differing methods when searching; it seems it's not as straightforward as parquet, e.g. I cannot just write stored as json but instead have to refer to a serde. Having tried the above and failed, what is the correct way to create a json-format table in Hive that will recognize the json I send to S3 via noctua?
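For what it's worth, based on the Hive and Athena documentation my current best guess is something like the sketch below, using Hive's hcatalog JsonSerDe and plain text files. Both the serde choice and stored as textfile are assumptions on my part; the Athena docs also mention an alternative OpenX serde (org.openx.data.jsonserde.JsonSerDe), and I don't know which one, if either, noctua expects:

CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
    mpg double,
    cyl double
)
PARTITIONED BY (
    year string,
    month string,
    day string
)
row format serde
'org.apache.hive.hcatalog.data.JsonSerDe'  -- assumption: a json serde instead of OrcSerde
stored as textfile                         -- assumption: json is written as plain text files
LOCATION
's3://ourco-emr/tables/example.db/mtcars';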