This may or may not be specifically an R question, but I'm posting here since my environment is RStudio and I'm working on an R application that needs to send data to AWS S3.
We use AWS S3 for data storage and query it using Athena. I would like to send a data frame to S3. It appears I was able to do this, yet the data do not appear as expected. I suspect it may have to do with file types, since I'm sending a data frame from memory to S3 rather than, e.g., a CSV file.
I created a new table in Hive for mtcars, partitioned by year, month, and day:
CREATE EXTERNAL TABLE IF NOT EXISTS adhoc.mtcars
(
  mpg double,
  cyl integer,
  disp double,
  hp integer,
  drat double,
  wt double,
  qsec double,
  vs integer,
  am integer,
  gear integer,
  carb integer)
partitioned by (year string, month string, day string)
stored as orc
location 's3://ourcompany-emr/tables/adhoc.db/mtcars';
This results in a new, empty table, adhoc.mtcars.
With my AWS key and secret in .Renviron, I run this code:
library(tidyverse)
library(aws.s3)

# get a particular bucket
b <- get_bucket(bucket = 'ourcompany-emr')

my_mtcars <- mtcars %>%
  mutate(year = "2020",
         month = "02",
         day = "01")

# send my_mtcars to S3
s3save(my_mtcars, bucket = "s3://ourcompany-emr/", object = "tables/adhoc.db/mtcars/")
This seems to run fine, and I can see a new file in S3 at that location.
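(For what it's worth, I know aws.s3 can also upload a serialized flat file rather than an R image, via s3write_using(). A sketch of what I mean, using my same bucket layout and a hypothetical partition-style key; I assume Athena still won't read a CSV from a table declared as ORC, though:)

```r
library(aws.s3)

# Alternative I've considered: write a plain CSV instead of an .Rdata image.
# s3write_using() serializes the object with the supplied function, then uploads it.
# (Object key below is my guess at a Hive-style partition path; adjust as needed.)
s3write_using(
  my_mtcars,
  FUN = readr::write_csv,
  object = "tables/adhoc.db/mtcars/year=2020/month=02/day=01/mtcars.csv",
  bucket = "ourcompany-emr"
)
```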
However, when I attempt to query it:
select *
from adhoc.mtcars
it returns an empty table.
The only thing I can think of is the file format. aws.s3 seems to send a data frame straight from memory in R to S3, and I'm not sure whether I should expect it to appear in our table that way. I was expecting/hoping that the table adhoc.mtcars would now contain the data I sent with s3save().

How can I populate the newly created adhoc.mtcars table with the data frame I sent?
[edit]
Since my table is an ORC table, some Google searching suggests that some of R's Spark packages allow converting data frames to ORC format. If that's the case, how could I save a regular data frame as an ORC file on S3?
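Here's a sketch of what I think the Spark route would look like with sparklyr (untested; it assumes a working Spark installation with S3 credentials configured, and that the s3a:// scheme points at the same bucket as above):

```r
library(sparklyr)

# Connect to a local Spark instance (assumption: Spark is installed and
# configured with credentials that can write to the bucket).
sc <- spark_connect(master = "local")

# Copy the in-memory data frame into Spark.
my_mtcars_tbl <- copy_to(sc, my_mtcars, "my_mtcars", overwrite = TRUE)

# Write ORC files, partitioned the way the Hive table expects.
spark_write_orc(
  my_mtcars_tbl,
  path = "s3a://ourcompany-emr/tables/adhoc.db/mtcars/",
  partition_by = c("year", "month", "day")
)

spark_disconnect(sc)
```

I'm guessing I'd also need to register the new partitions with the metastore (e.g. MSCK REPAIR TABLE adhoc.mtcars) before Athena sees them — is that right?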