Skip to content

Geometry Support in Havasu

Havasu supports geometry data type and allows users to use geometry functions in Sedona for manipulating geometry data. This document describes how to use geometry data type and geometry functions in Havasu.

Geometry as a Primitive Type

Besides the primitive types supported by Apache Iceberg, Havasu introduced a new data type GEOMETRY to represent geospatial data. For instance, user can create a table with a geometry column using SQL:

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")
sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")
sedona.sql(
    "CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (" +
    "id bigint," +
    "data string," +
    "geom geometry" +
    ") PARTITIONED BY (bucket(4, id))"
);

This will create an empty table partitioned by the bucketed value of id. Notice that Havasu introduced a new data type GEOMETRY to represent geospatial data. We can inspect the table schema using DESCRIBE command:

sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show()
sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show()
sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show();
+--------------+-------------+-------+
|      col_name|    data_type|comment|
+--------------+-------------+-------+
|            id|       bigint|   null|
|          data|       string|   null|
|          geom|     geometry|   null|
|              |             |       |
|# Partitioning|             |       |
|        Part 0|bucket(4, id)|       |
+--------------+-------------+-------+

Or using .printSchema() function:

sedona.table("wherobots.test_db.test_table").printSchema()
sedona.table("wherobots.test_db.test_table").printSchema()
sedona.table("wherobots.test_db.test_table").printSchema();
root
 |-- id: long (nullable = true)
 |-- data: string (nullable = true)
 |-- geom: geometry (nullable = true)

Creating Table with Geometry Column

As mentioned above, user can create a Havasu table using SQL:

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")
sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")
sedona.sql(
    "CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (" +
    "id bigint," +
    "data string," +
    "geom geometry" +
    ") PARTITIONED BY (bucket(4, id))"
);

If user has a DataFrame containing geometry data, they can also create a Havasu table using the DataFrame:

# Create a DataFrame with geometry data
df = sedona.createDataFrame([
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
], ["id", "data", "geom"])\
.withColumn("geom", expr("ST_GeomFromText(geom)"))

# Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create()
// Create a DataFrame with geometry data
val df = Seq(
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
).toDF("id", "data", "geom")
.withColumn("geom", expr("ST_GeomFromText(geom)"))

// Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create()
// Create a DataFrame with geometry data
List<Row> data = Arrays.asList(
    RowFactory.create(1, "a", "POINT (1 2)"),
    RowFactory.create(2, "b", "POINT (2 3)")
);

List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("id", DataTypes.IntegerType, true),
    DataTypes.createStructField("data", DataTypes.StringType, true),
    DataTypes.createStructField("geom", DataTypes.StringType, true)
);

StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = sedona.createDataFrame(data, schema)
    .withColumn("geom", expr("ST_GeomFromText(geom)"));

// Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create();// Create a DataFrame with geometry data

Writing Data

INSERT INTO

User can insert data into a Havasu table using INSERT INTO table_name VALUES:

sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")
sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")
sedona.sql(
    "INSERT INTO wherobots.test_db.test_table " +
    "VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))"
);

Or using INSERT INTO table_name SELECT ... to insert result set of a query into the table:

sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)")
sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)")
sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)");

Writing DataFrame to Havasu table

User can write a DataFrame containing geometry data to a Havasu table:

# Create a DataFrame with geometry data
df = sedona.createDataFrame([
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
], ["id", "data", "geom"])\
.withColumn("geom", expr("ST_GeomFromText(geom)"))

# Write the DataFrame to a Havasu table
df.writeTo("wherobots.test_db.test_table").append()
// Create a DataFrame with geometry data
val df = Seq(
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
).toDF("id", "data", "geom")
.withColumn("geom", expr("ST_GeomFromText(geom)"))

// Write the DataFrame to a Havasu table
df.writeTo("wherobots.test_db.test_table").append()
List<Row> data = Arrays.asList(
    RowFactory.create(1, "a", "POINT (1 2)"),
    RowFactory.create(2, "b", "POINT (2 3)")
);

List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("id", DataTypes.IntegerType, true),
    DataTypes.createStructField("data", DataTypes.StringType, true),
    DataTypes.createStructField("geom", DataTypes.StringType, true)
);

StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = sedona.createDataFrame(data, schema)
    .withColumn("geom", expr("ST_GeomFromText(geom)"));

df.writeTo("wherobots.test_db.test_table").append();

This is semantically equivalent to the INSERT INTO table_name SELECT ... statement.

Updating data in Havasu table

Havasu supports UPDATE queries that update matching rows in tables. Update queries accept a filter to match rows to update. Spatial filters are also supported in Havasu.

sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1")

# Use ST_Intersects to update rows that intersect with a polygon
sedona.sql("""
UPDATE wherobots.test_db.test_table
SET data = 'd'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1")

// Use ST_Intersects to update rows that intersect with a polygon
sedona.sql("""
UPDATE wherobots.test_db.test_table
SET data = 'd'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1");

// Use ST_Intersects to update rows that intersect with a polygon
sedona.sql(
    "UPDATE wherobots.test_db.test_table " +
    "SET data = 'd' " +
    "WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);

Deleting data from Havasu table

Havasu supports DELETE FROM queries to remove data from tables. Delete queries accept a filter to match rows to delete. Spatial filters are also supported in Havasu.

sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1")

# Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql("""
DELETE FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1")

// Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql("""
DELETE FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1");

// Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql(
    "DELETE FROM wherobots.test_db.test_table " +
    "WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);

Merging DataFrame into Havasu table using MERGE INTO

Havasu supports MERGE INTO by rewriting data files that contain rows that need to be updated in an overwrite commit. The syntax is identical to the open source Apache Iceberg, please refer to Apache Iceberg - MERGE INTO for more information.

Querying Data

User can load data from a Havasu table using sedona.table(...):

df = sedona.table("wherobots.test_db.test_table")
df = sedona.table("wherobots.test_db.test_table")
Dataset<Row> df = sedona.table("wherobots.test_db.test_table");

You can apply some configurations for reading the table, such as the split size if you want to read the table into a DataFrame with more partitions:

df = sedona.read.option('split-size', '1000').table("wherobots.test_db.test_table")
df = sedona.read.option('split-size', '1000').table("wherobots.test_db.test_table")
Dataset<Row> df = sedona.read().option("split-size", "1000").table("wherobots.test_db.test_table");

You can run spatial range query on the table using Sedona SQL:

df = sedona.sql("""
SELECT * FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
df = sedona.sql("""
SELECT * FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
Dataset<Row> df = sedona.sql(
    "SELECT * FROM wherobots.test_db.test_table " +
    "WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);

Or using the DataFrame API:

df = sedona.table("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
df = sedona.table("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
Dataset<Row> df = sedona.table("wherobots.test_db.test_table")
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))");

Users can also load a Havasu table by specifying the name of the data source explicitly using .format("havasu.iceberg"), this will load an isolated table reference that will not automatically refresh tables used by queries.

df = sedona.read.format("havasu.iceberg").load("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
val df = sedona.read.format("havasu.iceberg").load("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
Dataset<Row> df = sedona.read().format("havasu.iceberg").load("wherobots.test_db.test_table")
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))");

Spatial range queries are very efficient in Havasu. Havasu supports spatial filter based data skipping. This feature allows user to skip reading data files that don’t contain data that satisfy the spatial filter. Please refer to Cluster by geospatial fields for faster queries for more information.

Working with Geometry Data

Data in columns with geometry type will be loaded as GeometryUDT values in Sedona, user can use any ST_ funtions provided by Sedona to manipulate the geospatial data. For example, user can use ST_Buffer to create a buffer around the geometry column:

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show();
+----------------------+
|st_buffer(geom, 0.001)|
+----------------------+
|  POLYGON ((3.001 4...|
|  POLYGON ((1.001 2...|
|  POLYGON ((2.001 3...|
|  POLYGON ((1.001 2...|
|  POLYGON ((2.001 3...|
+----------------------+

The resulting DataFrame can also be written back to a Havasu table using a CTAS statement:

sedona.sql("""
CREATE TABLE wherobots.test_db.test_table_buf AS
SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table
""")
sedona.sql("""
CREATE TABLE wherobots.test_db.test_table_buf AS
SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table
""")
sedona.sql(
    "CREATE TABLE wherobots.test_db.test_table_buf AS " +
    "SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table"
);

Or simply call writeTo function on the resulting DataFrame of the query:

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")\
    .writeTo("wherobots.test_db.test_table_buf").create()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")
    .writeTo("wherobots.test_db.test_table_buf").create()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")
    .writeTo("wherobots.test_db.test_table_buf").create();

Further Reading

Havasu is based on Apache Iceberg, all features of Apache Iceberg except MOR tables are supported in Havasu. Please refer to Apache Iceberg documentation for Spark for more information.

If you have spatial data stored in Iceberg table or parquet files as WKT or WKB, you can migrate your data to Havasu very efficiently without scanning or rewriting your data files. Please refer to Convert Existing Table to Havasu Table and Migrating Parquet Files to Havasu for more information.

Havasu supports spatial filter push down and can optimize spatial range queries. Please refer to Cluster by geospatial fields for faster queries for how to organize your spatial data for better performance.


Last update: October 14, 2023 01:17:01