This may or may not be specifically an R question, but I'm posting here since my environment is RStudio and I'm working on an R application that needs to send data to AWS S3.
We use AWS S3 for data storage and query it using Athena. I would like to send a data frame to S3. It appears I was able to do this, yet the data do not appear as expected. I suspect it may have to do with file types, since I'm sending a data frame from memory to S3 rather than, e.g., a CSV file.
I created a new table in Hive for mtcars, partitioned by year, month, and day:
CREATE EXTERNAL TABLE IF NOT EXISTS adhoc.mtcars
(
  mpg double,
  cyl integer,
  disp double,
  hp integer,
  drat double,
  wt double,
  qsec double,
  vs integer,
  am integer,
  gear integer,
  carb integer)
partitioned by (year string, month string, day string)
stored as orc
location 's3://ourcompany-emr/tables/adhoc.db/mtcars';
This results in a new, empty table, adhoc.mtcars.
With my AWS key and secret in .Renviron, I run this code:
library(tidyverse)
library(aws.s3)

# get a particular bucket
b <- get_bucket(bucket = 'ourcompany-emr')

my_mtcars <- mtcars %>%
  mutate(year = "2020",
         month = "02",
         day = "01")

# send my_mtcars to S3
s3save(my_mtcars, bucket = "s3://ourcompany-emr/", object = "tables/adhoc.db/mtcars/")
This seems to run fine, and I can see a new file in S3 at that location.
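(For what it's worth, I know aws.s3 can also upload a serialized flat file rather than an R image, via s3write_using(). A sketch of what I mean, using my same bucket layout and a hypothetical partition-style key; I assume Athena still won't read a CSV from a table declared as ORC, though:)

```r
library(aws.s3)

# Alternative I've considered: write a plain CSV instead of an .Rdata image.
# s3write_using() serializes the object with the supplied function, then uploads it.
# (Object key below is my guess at a Hive-style partition path; adjust as needed.)
s3write_using(
  my_mtcars,
  FUN = readr::write_csv,
  object = "tables/adhoc.db/mtcars/year=2020/month=02/day=01/mtcars.csv",
  bucket = "ourcompany-emr"
)
```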
However, when I attempt to query it:
select *
from adhoc.mtcars
it returns an empty table.
The only thing I can think of is the file format. aws.s3 seems to send a data frame straight from memory in R to S3, and I'm not sure whether I should expect it to appear in our table that way. I was expecting/hoping that the table adhoc.mtcars would now contain the data I sent with s3save().

How can I populate the newly created adhoc.mtcars table with the data frame I sent?
[edit]
Since my table is an ORC table, some Google searching suggests that some of R's Spark packages allow converting data frames to ORC format. If that's the case, how could I save a regular data frame as an ORC file on S3?
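Here's a sketch of what I think the Spark route would look like with sparklyr (untested; it assumes a working Spark installation with S3 credentials configured, and that the s3a:// scheme points at the same bucket as above):

```r
library(sparklyr)

# Connect to a local Spark instance (assumption: Spark is installed and
# configured with credentials that can write to the bucket).
sc <- spark_connect(master = "local")

# Copy the in-memory data frame into Spark.
my_mtcars_tbl <- copy_to(sc, my_mtcars, "my_mtcars", overwrite = TRUE)

# Write ORC files, partitioned the way the Hive table expects.
spark_write_orc(
  my_mtcars_tbl,
  path = "s3a://ourcompany-emr/tables/adhoc.db/mtcars/",
  partition_by = c("year", "month", "day")
)

spark_disconnect(sc)
```

I'm guessing I'd also need to register the new partitions with the metastore (e.g. MSCK REPAIR TABLE adhoc.mtcars) before Athena sees them — is that right?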