I'm having a hard time getting sparklyr (Spark 2.4.3) to connect to AWS S3 (s3a://) data sources when using instance roles (the EC2 metadata service). Even though known-working IAM credentials are available from the EC2 metadata service (tested via cloudyr/aws.ec2metadata and cloudyr/aws.s3), I'm getting error messages that start:
Error: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: REDACTED, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: REDACTED
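(For reference, the metadata-service check that works on this same instance looks roughly like the following; the bucket name is a placeholder.)
# sanity check outside Spark: the instance role is visible and usable from R
library(aws.ec2metadata)   # lets the cloudyr packages pick up instance-role credentials
library(aws.s3)

is_ec2()                                # TRUE on this instance
bucketlist()                            # lists buckets using the instance role
head(get_bucket("PLACEHOLDER_BUCKET"))  # placeholder bucket name, not my real one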
My Spark initialization is pretty simple (following https://spark.rstudio.com/guides/aws-s3/ ) and looks like:
library(sparklyr)

conf <- spark_config()
# pull in the hadoop-aws package so the s3a:// filesystem is available
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.7"
conf$fs.s3a.endpoint <- "s3.us-east-2.amazonaws.com"

# blank out static credentials so the instance role should be picked up
Sys.setenv(AWS_ACCESS_KEY_ID = "")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "")

sc <- spark_connect(master = "local", config = conf)
stream_read_text(sc, "s3a://REDACTED_BUT_KNOWN_WORKING_PATH")
I've tried both newer and older versions of the hadoop-aws package, and tried both with and without setting those AWS environment variables to empty strings (the env var approach came from https://stackoverflow.com/questions/45924785/how-to-access-s3-data-from-rstudio-on-ec2-using-iam-role-authentication ).
Would be very grateful for any tips to get this working!
This took a tremendous amount of work, but I finally cracked the code to get this working.
library(sparklyr)

conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
# V4 request signing has to be enabled as a JVM option on the driver
conf$sparklyr.shell.conf <- "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4"
sc <- spark_connect(master = "local", config = conf, version = "2.4.4")
# grab the JavaSparkContext so we can reach the underlying Hadoop configuration
ctx <- spark_context(sc)
jsc <- invoke_static(
  sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx
)
hconf <- jsc %>% invoke("hadoopConfiguration")

# we always want the s3a file system with V4 signatures
hconf %>% invoke("set", "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hconf %>% invoke("set", "com.amazonaws.services.s3a.enableV4", "true")
# connect to us-east-2 endpoint
hconf %>% invoke("set", "fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
# ensure we always use bucket owner full control ACL in case of cross-account access
hconf %>% invoke("set", "fs.s3a.acl.default", "BucketOwnerFullControl")
# use EC2 metadata service to authenticate
hconf %>% invoke("set", "fs.s3a.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider")
I have to say, the documentation on this (particularly the distinction between spark_config() settings and the Hadoop configuration) is a bit rough around the edges.
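In principle, the same settings can also be expressed purely through spark_config(), since Spark copies spark.hadoop.*-prefixed properties into the Hadoop configuration at startup. A sketch of that variant (same values as above; treat it as a sketch rather than something verified):
# sketch: pass the s3a settings through spark_config() instead of invoke()
library(sparklyr)

conf <- spark_config()
conf$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3"
conf$sparklyr.shell.conf <- "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4"
# spark.hadoop.* properties are copied into the Hadoop configuration by Spark
conf$spark.hadoop.fs.s3a.impl <- "org.apache.hadoop.fs.s3a.S3AFileSystem"
conf$spark.hadoop.fs.s3a.endpoint <- "s3.us-east-2.amazonaws.com"
conf$spark.hadoop.fs.s3a.acl.default <- "BucketOwnerFullControl"
conf$spark.hadoop.fs.s3a.aws.credentials.provider <-
  "com.amazonaws.auth.InstanceProfileCredentialsProvider"

sc <- spark_connect(master = "local", config = conf, version = "2.4.4")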