Configuring Havasu

Havasu needs to be configured before it can be used. Wherobots Cloud provides a set of default configurations so users can get started quickly, but users can also customize the configurations to store data in their own S3 buckets.

Custom Configuration

Havasu is configured when building the Sedona context. Here is an example of configuring Havasu to use a custom S3 bucket:

config = SedonaContext. \
    builder(). \
    config("spark.sql.catalog.my_catalog.type", "hadoop"). \
    config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.my_catalog.warehouse", "s3://path/to/your/warehouse"). \
    config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    getOrCreate()
sedona = SedonaContext.create(config)

In the configured Sedona context, users can access Havasu tables stored in their own S3 path through the my_catalog namespace. For example, users can create a table named my_catalog.<db_name>.<table_name>:

sedona.sql("CREATE DATABASE IF NOT EXISTS my_catalog.test_db.test_table")
sedona.sql("""
CREATE TABLE IF NOT EXISTS my_catalog.test_db.test_table (
  id STRING,
  geom GEOMETRY
)
""")
sedona.sql("CREATE DATABASE IF NOT EXISTS my_catalog.test_db.test_table")
sedona.sql("""
CREATE TABLE IF NOT EXISTS my_catalog.test_db.test_table (
  id STRING,
  geom GEOMETRY
)
""")
sedona.sql("CREATE DATABASE IF NOT EXISTS my_catalog.test_db.test_table");
sedona.sql(
  "CREATE TABLE IF NOT EXISTS my_catalog.test_db.test_table (" +
  "id STRING," +
  "geom GEOMETRY" +
")");

Users can also configure multiple Havasu catalogs in the same Spark session. For example, users can configure two Havasu catalogs, my_catalog and my_catalog2:

config = SedonaContext. \
    builder(). \
    config("spark.sql.catalog.my_catalog.type", "hadoop"). \
    config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.my_catalog.warehouse", "s3://path/to/your/warehouse1"). \
    config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    config("spark.sql.catalog.my_catalog2.type", "hadoop"). \
    config("spark.sql.catalog.my_catalog2", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.my_catalog2.warehouse", "s3://path/to/your/warehouse2"). \
    config("spark.sql.catalog.my_catalog2.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    getOrCreate()
sedona = SedonaContext.create(config)

Tables named my_catalog.<db_name>.<table_name> will be stored in s3://path/to/your/warehouse1, while tables named my_catalog2.<db_name>.<table_name> will be stored in s3://path/to/your/warehouse2.
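
For instance, the following statements place each table's files under the corresponding warehouse path (db1, db2, table_a, and table_b are placeholder names):

sedona.sql("CREATE DATABASE IF NOT EXISTS my_catalog.db1")
# data and metadata files go under s3://path/to/your/warehouse1
sedona.sql("CREATE TABLE my_catalog.db1.table_a (id STRING, geom GEOMETRY)")

sedona.sql("CREATE DATABASE IF NOT EXISTS my_catalog2.db2")
# data and metadata files go under s3://path/to/your/warehouse2
sedona.sql("CREATE TABLE my_catalog2.db2.table_b (id STRING, geom GEOMETRY)")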

The configuration of Havasu is almost identical to the catalog configuration of Apache Iceberg on Spark. Please refer to Spark Configuration for details. Users who run their own catalog service, such as a Hive metastore or AWS Glue, can also configure Havasu to use it.
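
For example, a catalog backed by AWS Glue can be configured following the standard Iceberg pattern. This is a sketch based on the Iceberg AWS integration options; glue_catalog is a placeholder name, and the necessary AWS credentials and permissions are assumed to be in place:

config = SedonaContext. \
    builder(). \
    config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog"). \
    config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"). \
    config("spark.sql.catalog.glue_catalog.warehouse", "s3://path/to/your/warehouse"). \
    config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"). \
    getOrCreate()
sedona = SedonaContext.create(config)

Note that a Glue-backed catalog sets catalog-impl instead of type, since the catalog implementation is provided by the Iceberg AWS module rather than built in.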

S3 Cross-Account Access

Havasu can use users' own S3 buckets as table storage. This requires configuration on both the AWS IAM side and the Wherobots Sedona context. Please refer to Access Your Own S3 Buckets for details.

Default Configuration

Wherobots Cloud provides a set of default configurations that store data under the S3 bucket allocated for each user. The pre-configured catalog is named wherobots:

spark.sql.catalog.wherobots            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.wherobots.type       hadoop
spark.sql.catalog.wherobots.warehouse  <path to S3 allocated for the user>
spark.sql.catalog.wherobots.io-impl    org.apache.iceberg.aws.s3.S3FileIO

Specifically, Havasu uses a Hadoop catalog on AWS S3 by default. This setup is not safe for concurrent writes. Please see Limitations for more details.
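
With the default configuration, tables can be created under the wherobots catalog without any extra setup (test_db and test_table are placeholder names):

# The default wherobots catalog is pre-configured, so no config() calls are needed
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
sedona.sql("CREATE DATABASE IF NOT EXISTS wherobots.test_db")
sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
  id STRING,
  geom GEOMETRY
)
""")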

