
Access Your Own S3 Buckets

You can use your own S3 buckets as the storage for Havasu. This chapter describes how to configure the Sedona context to store Havasu tables in your own S3 buckets.

Using Access Keys

Prerequisites

To use access keys, you need to have the following:

  1. A bucket in your AWS account that you want to use as the Havasu warehouse. We'll refer to the bucket as <bucket_name> in the following sections.
  2. An IAM user in your AWS account that has access to the bucket you want to use as the Havasu warehouse.
  3. The access key ID and secret access key of the IAM user. We will refer to the access key ID as <your-access-key-id> and the secret access key as <your-secret-access-key> in the following sections.
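
Before wiring the keys into the Sedona context, you can optionally confirm that they actually grant access to the warehouse prefix. The snippet below is a minimal sketch assuming boto3 is available in your environment; the object key _access_check is a hypothetical marker name, and the other placeholders are the ones introduced above.

import boto3

# Sanity check: confirm the access keys can write to and list the warehouse prefix.
# The marker object written here is only for verification; delete it afterwards if you like.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<your-access-key-id>",
    aws_secret_access_key="<your-secret-access-key>",
    region_name="<your-region>",
)
s3.put_object(
    Bucket="<bucket_name>",
    Key="path/to/your/warehouse/_access_check",
    Body=b"ok",
)
resp = s3.list_objects_v2(Bucket="<bucket_name>", Prefix="path/to/your/warehouse/")
print(resp["KeyCount"])  # should be at least 1 after the marker object is written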

Configure Sedona Context

You can configure Havasu to use your S3 bucket as the warehouse path by providing the access keys in the Sedona context:

from sedona.spark import SedonaContext

sedona = SedonaContext. \
    builder(). \
    config("spark.sql.catalog.<your_catalog>.type", "hadoop"). \
    config("spark.sql.catalog.<your_catalog>", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.<your_catalog>.warehouse", "s3://<bucket_name>/path/to/your/warehouse"). \
    config("spark.sql.catalog.<your_catalog>.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    config("spark.sql.catalog.<your_catalog>.client.region", "<your-region>"). \
    config("spark.sql.catalog.<your_catalog>.s3.access-key-id",  "<your-access-key-id>"). \
    config("spark.sql.catalog.<your_catalog>.s3.secret-access-key", "<your-secret-access-key>"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.access.key", "<your-access-key-id>"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.secret.key", "<your-secret-access-key>"). \
    getOrCreate()
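
Once the context is created, a quick smoke test confirms that tables land in your bucket. The names my_catalog, test_db, and test_table below are hypothetical placeholders for <your_catalog> and your own namespace and table names.

# Create a namespace and a small table in the custom catalog, then read it back.
sedona.sql("CREATE NAMESPACE IF NOT EXISTS my_catalog.test_db")
sedona.sql("CREATE TABLE IF NOT EXISTS my_catalog.test_db.test_table (id BIGINT, name STRING)")
sedona.sql("INSERT INTO my_catalog.test_db.test_table VALUES (1, 'hello havasu')")
sedona.sql("SELECT * FROM my_catalog.test_db.test_table").show()

After the insert, the table's metadata and data files should appear under s3://<bucket_name>/path/to/your/warehouse/.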

Note

The AWS credentials need to be configured for both Havasu (spark.sql.catalog.<your_catalog>.*) and the Hadoop FileSystem (spark.hadoop.fs.s3a.bucket.<bucket_name>.*).

It is generally recommended to use an S3 bucket in the same region as the Wherobots cluster; otherwise, performance may be affected and additional costs may be incurred. spark.sql.catalog.<your_catalog>.client.region can be omitted if your S3 bucket is in the same region as the Wherobots cluster. If your S3 bucket is in a different region, you must set spark.sql.catalog.<your_catalog>.client.region to the region of your S3 bucket.

Although it is easy to set up a cross-account Havasu warehouse using access keys, access keys are not recommended in production environments. Use IAM roles instead.

Using IAM Roles

Prerequisites

To use IAM roles, you need to have the following:

  1. A bucket in your AWS account that you want to use as the Havasu warehouse. We'll refer to the bucket as <bucket_name> in the following sections.
  2. An IAM role in your AWS account that has access to the bucket you want to use as the Havasu warehouse. We'll refer to the ARN of the IAM role as <your-iam-role-arn> in the following sections.
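
If your IAM role does not yet have access to the bucket, the sketch below attaches an inline policy with boto3. The role name my-havasu-warehouse-role and policy name havasu-warehouse-s3-access are hypothetical examples; adjust the actions and resources to your own security requirements.

import json
import boto3

# Attach an inline policy to the role granting read/write access to the
# warehouse bucket and prefix (example role and policy names).
iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="my-havasu-warehouse-role",
    PolicyName="havasu-warehouse-s3-access",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": "arn:aws:s3:::<bucket_name>",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::<bucket_name>/path/to/your/warehouse/*",
            },
        ],
    }),
)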

Configure IAM roles for cross account access

This section describes how to configure IAM roles for cross-account access. Conceptually, the IAM role for accessing your S3 bucket needs to be configured to trust the IAM role of the Wherobots cluster, and the IAM role of the Wherobots cluster needs to be configured to allow assuming your IAM role. Refer to the AWS documentation on cross-account access for more details.

1. Add a trust relationship to your IAM role

You need to add a trust relationship to your IAM role so that the IAM role of the Wherobots cluster is allowed to assume it. The trust relationship should look like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<wherobots-cloud-iam-role-arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}

The <wherobots-cloud-iam-role-arn> can be obtained by running the following command in your Wherobots cluster. Launch a terminal window in JupyterLab and run:

echo $AWS_ROLE_ARN

2. Add permissions to the IAM role of the Wherobots cluster

Warning

This step cannot currently be performed by the end user.

The IAM role of the Wherobots cluster needs a policy statement like the following, allowing it to assume your IAM role:

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "<your-iam-role-arn>"
  }
}

3. Verify the configuration

You can verify the configuration by running the following command in your Wherobots cluster. Launch a terminal window in JupyterLab and run:

aws sts assume-role --role-arn <your-iam-role-arn> --role-session-name cross-account-s3-access

If the command succeeds and outputs temporary STS credentials, the configuration is correct.
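
You can also run the same verification from a notebook. The sketch below assumes the role with boto3 and uses the temporary credentials to list the warehouse prefix, which additionally exercises the role's S3 permissions.

import boto3

# Assume your IAM role from the Wherobots cluster, then list the warehouse
# prefix with the resulting temporary credentials.
creds = boto3.client("sts").assume_role(
    RoleArn="<your-iam-role-arn>",
    RoleSessionName="cross-account-s3-access",
)["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
resp = s3.list_objects_v2(Bucket="<bucket_name>", Prefix="path/to/your/warehouse/")
print(resp["KeyCount"])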

Configure Sedona Context

You can configure Havasu to use your S3 bucket as the warehouse path by configuring the IAM role used to access your S3 bucket in the Sedona context:

from sedona.spark import SedonaContext

sedona = SedonaContext. \
    builder(). \
    config("spark.sql.catalog.<your_catalog>.type", "hadoop"). \
    config("spark.sql.catalog.<your_catalog>", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.<your_catalog>.warehouse", "s3://<bucket_name>/path/to/your/warehouse"). \
    config("spark.sql.catalog.<your_catalog>.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    config("spark.sql.catalog.<your_catalog>.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory"). \
    config("spark.sql.catalog.<your_catalog>.client.assume-role.arn", "<your-iam-role-arn>"). \
    config("spark.sql.catalog.<your_catalog>.client.assume-role.region",  "<your-region>"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.assumed.role.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"). \
    config("spark.hadoop.fs.s3a.bucket.<bucket_name>.assumed.role.arn", "<your-iam-role-arn>"). \
    getOrCreate()
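
As with the access-key setup, a short end-to-end check confirms the assumed role can write to your warehouse. The catalog, namespace, and table names below are hypothetical, and the GEOMETRY column type assumes a Havasu runtime that supports spatial DDL as described in the Havasu documentation.

# Write and read back a small spatial table in the cross-account warehouse.
sedona.sql("CREATE NAMESPACE IF NOT EXISTS my_catalog.test_db")
sedona.sql("CREATE TABLE IF NOT EXISTS my_catalog.test_db.cities (name STRING, geom GEOMETRY)")
sedona.sql("INSERT INTO my_catalog.test_db.cities SELECT 'Seattle', ST_GeomFromText('POINT(-122.33 47.61)')")
sedona.sql("SELECT name, ST_AsText(geom) FROM my_catalog.test_db.cities").show()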

Note

The IAM role ARN needs to be configured for both Havasu (spark.sql.catalog.<your_catalog>.*) and the Hadoop FileSystem (spark.hadoop.fs.s3a.bucket.<bucket_name>.*).

It is generally recommended to use an S3 bucket in the same region as the Wherobots cluster; otherwise, performance may be affected and additional costs may be incurred. When using the assume-role client factory shown above, the bucket region is configured via spark.sql.catalog.<your_catalog>.client.assume-role.region; set it to the region of your S3 bucket.

