I'm having a hard time getting sparklyr (Spark 2.4.3) to connect to AWS S3 (s3a://) data sources when using instance roles (the EC2 metadata service). Even though known-working IAM credentials are available from the EC2 metadata service (tested via cloudyr/aws.ec2metadata and cloudyr/aws.s3), I'm getting error messages that start:
Error: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: REDACTED, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: REDACTED
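(For reference, the metadata-service check that works on this same instance looks roughly like the following; the bucket name is a placeholder.)
# sanity check outside Spark: the instance role is visible and usable from R
library(aws.ec2metadata)   # lets the cloudyr packages pick up instance-role credentials
library(aws.s3)

is_ec2()                                # TRUE on this instance
bucketlist()                            # lists buckets using the instance role
head(get_bucket("PLACEHOLDER_BUCKET"))  # placeholder bucket name, not my real one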
My Spark initialization is pretty simple (following https://spark.rstudio.com/guides/aws-s3/ ) and looks like:
library(sparklyr)

conf <- spark_config()
# pull in the hadoop-aws package so the s3a:// filesystem is available
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.7"
conf$fs.s3a.endpoint <- "s3.us-east-2.amazonaws.com"

# blank out static credentials so the instance role should be picked up
Sys.setenv(AWS_ACCESS_KEY_ID = "")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "")

sc <- spark_connect(master = "local", config = conf)
stream_read_text(sc, "s3a://REDACTED_BUT_KNOWN_WORKING_PATH")
I've tried both newer and older versions of the hadoop-aws package, and tried both with and without setting those AWS environment variables to empty strings (the env var approach came from https://stackoverflow.com/questions/45924785/how-to-access-s3-data-from-rstudio-on-ec2-using-iam-role-authentication ).
Would be very grateful for any tips to get this working!
This took a tremendous amount of work, but I finally cracked the code to get this working.
library(sparklyr)

conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
# V4 request signing has to be enabled as a JVM option on the driver
conf$sparklyr.shell.conf <- "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4"
sc <- spark_connect(master = "local", config = conf, version = "2.4.4")
# grab the JavaSparkContext so we can reach the underlying Hadoop configuration
ctx <- spark_context(sc)
jsc <- invoke_static(
  sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx
)
hconf <- jsc %>% invoke("hadoopConfiguration")

# we always want the s3a file system with V4 signatures
hconf %>% invoke("set", "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf %>% invoke("set", "com.amazonaws.services.s3a.enableV4", "true")
# connect to us-east-2 endpoint
hconf %>% invoke("set", "fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
# ensure we always use bucket owner full control ACL in case of cross-account access
hconf %>% invoke("set", "fs.s3a.acl.default", "BucketOwnerFullControl")
# use EC2 metadata service to authenticate
hconf %>% invoke("set", "fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider")
I have to say, the documentation on this (particularly the distinction between spark_config() settings and the Hadoop configuration) is a bit rough around the edges.
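In principle, the same settings can also be expressed purely through spark_config(), since Spark copies spark.hadoop.*-prefixed properties into the Hadoop configuration at startup. A sketch of that variant (same values as above; treat it as a sketch rather than something verified):
# sketch: pass the s3a settings through spark_config() instead of invoke()
library(sparklyr)

conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.shell.conf <- "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4"
# spark.hadoop.* properties are copied into the Hadoop configuration by Spark
conf$spark.hadoop.fs.s3a.impl <- "org.apache.hadoop.fs.s3a.S3AFileSystem"
conf$spark.hadoop.fs.s3a.endpoint <- "s3.us-east-2.amazonaws.com"
conf$spark.hadoop.fs.s3a.acl.default <- "BucketOwnerFullControl"
conf$spark.hadoop.fs.s3a.aws.credentials.provider <-
  "com.amazonaws.auth.InstanceProfileCredentialsProvider"

sc <- spark_connect(master = "local", config = conf, version = "2.4.4")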