Accessing your own S3 buckets

You can access your own S3 buckets from Sedona-DB. This chapter describes how to configure the Sedona context for publicly readable buckets, for private buckets using access keys, and for private buckets using IAM roles.

Accessing Public Buckets

To access publicly readable buckets, you need to specify the following configuration when creating the Sedona context:

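from sedona.spark import SedonaContext

# Use anonymous credentials for this bucket only (per-bucket S3A setting).
sedona = SedonaContext. \
    builder(). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()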

This will allow Sedona-DB to access the publicly readable bucket named <bucket_name> anonymously.
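
When several public buckets are involved, the per-bucket key pattern shown above can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: `anonymous_bucket_conf` and the bucket names are hypothetical, and only the key pattern and the provider class name are taken from the configuration above.

```python
ANONYMOUS_PROVIDER = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"


def anonymous_bucket_conf(bucket_names):
    """Return Spark settings that enable anonymous access per bucket.

    Hypothetical helper: it only builds key/value pairs and does not
    touch Spark or S3.
    """
    return {
        f"spark.hadoop.fs.s3a.bucket.{name}.aws.credentials.provider": ANONYMOUS_PROVIDER
        for name in bucket_names
    }


# Example bucket names are placeholders, not real buckets.
conf = anonymous_bucket_conf(["public-bucket-a", "public-bucket-b"])
for key, value in conf.items():
    print(key, "=", value)
```

Each pair can then be passed to `config(key, value)` on the builder before calling `getOrCreate()`.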

Access Private S3 Bucket Using Access Keys

To use access keys to access your private bucket named <bucket_name>, you need to specify the following configuration when creating the Sedona context:

sedona = SedonaContext. \
    builder(). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.access.key", "<your-access-key-id>"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.secret.key", "<your-secret-access-key>"). \
    getOrCreate()

Replace <your-access-key-id> and <your-secret-access-key> with your own access key ID and secret access key.
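
To keep the keys out of notebook source, the values can be read from the environment instead of being pasted into the code. The sketch below assumes the keys are exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; that choice, the helper name `access_key_conf`, and the demo values are illustrative assumptions, not part of the original instructions.

```python
import os


def access_key_conf(bucket_name, environ=os.environ):
    """Build per-bucket access-key settings from environment variables.

    Assumption: the keys live in AWS_ACCESS_KEY_ID and
    AWS_SECRET_ACCESS_KEY. A missing variable raises KeyError.
    """
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket_name}"
    return {
        f"{prefix}.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        f"{prefix}.access.key": environ["AWS_ACCESS_KEY_ID"],
        f"{prefix}.secret.key": environ["AWS_SECRET_ACCESS_KEY"],
    }


# Demo with a fake environment so the sketch runs without real credentials.
demo_env = {
    "AWS_ACCESS_KEY_ID": "example-access-key-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret-access-key",
}
for key in sorted(access_key_conf("my-bucket", demo_env)):
    print(key)
```

The resulting pairs can be applied to the builder with `config(key, value)`, in the same way as the literal values above.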

We generally recommend using an S3 bucket in the same region as the Wherobots cluster; otherwise, performance may suffer and additional costs may be incurred.

Access Private S3 Bucket Using IAM Roles

To use IAM roles to access your private bucket named <bucket_name>, you need an IAM role in your AWS account that has access to the bucket. We refer to the ARN of this IAM role as <your-iam-role-arn> in the following sections.

Configure IAM roles for cross-account access

This section describes how to configure IAM roles for cross-account access. Conceptually, two things are required: your IAM role (the one with access to your S3 bucket) must trust the IAM role of the Wherobots cluster, and the IAM role of the Wherobots cluster must be allowed to assume your IAM role. Refer to the AWS documentation for more details.

1. Add a trust relationship to your IAM role

Add a trust relationship to your IAM role so that it trusts the IAM role of the Wherobots cluster. The trust relationship should look like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<wherobots-cloud-iam-role-arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}

You can obtain <wherobots-cloud-iam-role-arn> by opening a terminal window in JupyterLab on your Wherobots cluster and running the following command:

echo $AWS_ROLE_ARN
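
The trust relationship can also be generated from that environment variable so the ARN does not have to be copied by hand. This is a minimal sketch: the function name `trust_policy` is hypothetical, and the document structure simply mirrors the JSON shown above.

```python
import json
import os


def trust_policy(wherobots_role_arn):
    """Return the trust relationship document for the given principal ARN."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": wherobots_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {},
            }
        ],
    }


# Falls back to the placeholder when AWS_ROLE_ARN is not set.
role_arn = os.environ.get("AWS_ROLE_ARN", "<wherobots-cloud-iam-role-arn>")
print(json.dumps(trust_policy(role_arn), indent=2))
```

The printed JSON can be pasted into the trust relationship of your IAM role.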

2. Add permissions to the IAM role of Wherobots cluster

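Warning

This step cannot currently be done by the end user. A way for end users to add permissions to the IAM role of the Wherobots cluster still needs to be exposed.

The permission that must be added to the IAM role of the Wherobots cluster looks like the following: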

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "<your-iam-role-arn>"
  }
}

3. Verify the configuration

To verify the configuration, open a terminal window in JupyterLab on your Wherobots cluster and run the following command:

aws sts assume-role --role-arn <your-iam-role-arn> --role-session-name cross-account-s3-access

If the command succeeds and outputs STS credentials, the configuration is correct.
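
If you want to check the result in a script rather than by eye, the command output can be inspected programmatically. The sketch below assumes the AWS CLI is using its JSON output format and that the output was captured as a string; the helper name `has_sts_credentials` and the sample values are hypothetical.

```python
import json


def has_sts_credentials(cli_output):
    """Return True if the text is JSON with a populated Credentials object."""
    try:
        data = json.loads(cli_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    creds = data.get("Credentials")
    if not isinstance(creds, dict):
        return False
    # Field names follow the STS AssumeRole response.
    required = ("AccessKeyId", "SecretAccessKey", "SessionToken")
    return all(creds.get(field) for field in required)


# Made-up sample shaped like a successful response; values are not real.
sample = json.dumps({
    "Credentials": {
        "AccessKeyId": "example-id",
        "SecretAccessKey": "example-secret",
        "SessionToken": "example-token",
        "Expiration": "2024-01-23T03:04:53+00:00",
    }
})
print(has_sts_credentials(sample))
print(has_sts_credentials("An error occurred (AccessDenied)"))
```

An error message from the command is not valid JSON, so the helper reports it as a failed check.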

Configure Sedona Context

To access your S3 bucket through the IAM role, configure the Sedona context as follows:

sedona = SedonaContext. \
    builder(). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.assumed.role.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.assumed.role.arn", "<your-iam-role-arn>"). \
    getOrCreate()

We generally recommend using an S3 bucket in the same region as the Wherobots cluster; otherwise, performance may suffer and additional costs may be incurred.


Last update: January 23, 2024 02:04:53