Geometry Support in Havasu¶

Havasu supports geometry data type and allows users to use geometry functions in Sedona for manipulating geometry data. This document describes how to use geometry data type and geometry functions in Havasu.

Geometry as a Primitive Type¶

Besides the primitive types supported by Apache Iceberg, Havasu introduced a new data type GEOMETRY to represent geospatial data. For instance, user can create a table with a geometry column using SQL:

PythonScalaJava

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")

sedona.sql(
    "CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (" +
    "id bigint," +
    "data string," +
    "geom geometry" +
    ") PARTITIONED BY (bucket(4, id))"
);

This will create an empty table partitioned by the bucketed value of id. Notice that Havasu introduced a new data type GEOMETRY to represent geospatial data. We can inspect the table schema using DESCRIBE command:

PythonScalaJava

sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show()

sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show()

sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show();

+--------------+-------------+-------+
|      col_name|    data_type|comment|
+--------------+-------------+-------+
|            id|       bigint|   null|
|          data|       string|   null|
|          geom|     geometry|   null|
|              |             |       |
|# Partitioning|             |       |
|        Part 0|bucket(4, id)|       |
+--------------+-------------+-------+

Or using .printSchema() function:

PythonScalaJava

sedona.table("wherobots.test_db.test_table").printSchema()

sedona.table("wherobots.test_db.test_table").printSchema()

sedona.table("wherobots.test_db.test_table").printSchema();

root
 |-- id: long (nullable = true)
 |-- data: string (nullable = true)
 |-- geom: geometry (nullable = true)

Creating Table with Geometry Column¶

As mentioned above, user can create a Havasu table using SQL:

PythonScalaJava

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")

sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
    id bigint,
    data string,
    geom geometry
) PARTITIONED BY (bucket(4, id))""")

sedona.sql(
    "CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (" +
    "id bigint," +
    "data string," +
    "geom geometry" +
    ") PARTITIONED BY (bucket(4, id))"
);

If user has a DataFrame containing geometry data, they can also create a Havasu table using the DataFrame:

PythonScalaJava

# Create a DataFrame with geometry data
df = sedona.createDataFrame([
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
], ["id", "data", "geom"])\
.withColumn("geom", expr("ST_GeomFromText(geom)"))

# Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create()

// Create a DataFrame with geometry data
val df = Seq(
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
).toDF("id", "data", "geom")
.withColumn("geom", expr("ST_GeomFromText(geom)"))

// Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create()

// Create a DataFrame with geometry data
List<Row> data = Arrays.asList(
    RowFactory.create(1, "a", "POINT (1 2)"),
    RowFactory.create(2, "b", "POINT (2 3)")
);

List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("id", DataTypes.IntegerType, true),
    DataTypes.createStructField("data", DataTypes.StringType, true),
    DataTypes.createStructField("geom", DataTypes.StringType, true)
);

StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = sedona.createDataFrame(data, schema)
    .withColumn("geom", expr("ST_GeomFromText(geom)"));

// Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create();// Create a DataFrame with geometry data

Writing Data¶

INSERT INTO¶

User can insert data into a Havasu table using INSERT INTO table_name VALUES:

PythonScalaJava

sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")

sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")

sedona.sql(
    "INSERT INTO wherobots.test_db.test_table " +
    "VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))"
);

Or using INSERT INTO table_name SELECT ... to insert result set of a query into the table:

PythonScalaJava

sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)")

sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)")

sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)");

Writing DataFrame to Havasu table¶

User can write a DataFrame containing geometry data to a Havasu table:

PythonScalaJava

# Create a DataFrame with geometry data
df = sedona.createDataFrame([
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
], ["id", "data", "geom"])\
.withColumn("geom", expr("ST_GeomFromText(geom)"))

# Write the DataFrame to a Havasu table
df.writeTo("wherobots.test_db.test_table").append()

// Create a DataFrame with geometry data
val df = Seq(
    (1, "a", "POINT (1 2)"),
    (2, "b", "POINT (2 3)")
).toDF("id", "data", "geom")
.withColumn("geom", expr("ST_GeomFromText(geom)"))

// Write the DataFrame to a Havasu table
df.writeTo("wherobots.test_db.test_table").append()

List<Row> data = Arrays.asList(
    RowFactory.create(1, "a", "POINT (1 2)"),
    RowFactory.create(2, "b", "POINT (2 3)")
);

List<StructField> fields = Arrays.asList(
    DataTypes.createStructField("id", DataTypes.IntegerType, true),
    DataTypes.createStructField("data", DataTypes.StringType, true),
    DataTypes.createStructField("geom", DataTypes.StringType, true)
);

StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = sedona.createDataFrame(data, schema)
    .withColumn("geom", expr("ST_GeomFromText(geom)"));

df.writeTo("wherobots.test_db.test_table").append();

This is semantically equivalent to the INSERT INTO table_name SELECT ... statement.

Updating data in Havasu table¶

Havasu supports UPDATE queries that update matching rows in tables. Update queries accept a filter to match rows to update. Spatial filters are also supported in Havasu.

PythonScalaJava

sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1")

# Use ST_Intersects to update rows that intersect with a polygon
sedona.sql("""
UPDATE wherobots.test_db.test_table
SET data = 'd'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1")

// Use ST_Intersects to update rows that intersect with a polygon
sedona.sql("""
UPDATE wherobots.test_db.test_table
SET data = 'd'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1");

// Use ST_Intersects to update rows that intersect with a polygon
sedona.sql(
    "UPDATE wherobots.test_db.test_table " +
    "SET data = 'd' " +
    "WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);

Deleting data from Havasu table¶

Havasu supports DELETE FROM queries to remove data from tables. Delete queries accept a filter to match rows to delete. Spatial filters are also supported in Havasu.

PythonScalaJava

sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1")

# Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql("""
DELETE FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1")

// Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql("""
DELETE FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1");

// Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql(
    "DELETE FROM wherobots.test_db.test_table " +
    "WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);

Merging DataFrame into Havasu table using `MERGE INTO`¶

Havasu supports MERGE INTO by rewriting data files that contain rows that need to be updated in an overwrite commit. The syntax is identical to the open source Apache Iceberg, please refer to Apache Iceberg - MERGE INTO for more information.

Querying Data¶

User can load data from a Havasu table using sedona.table(...):

PythonScalaJava

df = sedona.table("wherobots.test_db.test_table")

df = sedona.table("wherobots.test_db.test_table")

Dataset<Row> df = sedona.table("wherobots.test_db.test_table");

You can apply some configurations for reading the table, such as the split size if you want to read the table into a DataFrame with more partitions:

PythonScalaJava

df = sedona.read.option('split-size', '1000').table("wherobots.test_db.test_table")

df = sedona.read.option('split-size', '1000').table("wherobots.test_db.test_table")

Dataset<Row> df = sedona.read().option("split-size", "1000").table("wherobots.test_db.test_table");

You can run spatial range query on the table using Sedona SQL:

PythonScalaJava

df = sedona.sql("""
SELECT * FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

df = sedona.sql("""
SELECT * FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")

Dataset<Row> df = sedona.sql(
    "SELECT * FROM wherobots.test_db.test_table " +
    "WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);

Or using the DataFrame API:

PythonScalaJava

df = sedona.table("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")

df = sedona.table("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")

Dataset<Row> df = sedona.table("wherobots.test_db.test_table")
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))");

Users can also load a Havasu table by specifying the name of the data source explicitly using .format("havasu.iceberg"), this will load an isolated table reference that will not automatically refresh tables used by queries.

PythonScalaJava

df = sedona.read.format("havasu.iceberg").load("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")

val df = sedona.read.format("havasu.iceberg").load("wherobots.test_db.test_table")\
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")

Dataset<Row> df = sedona.read().format("havasu.iceberg").load("wherobots.test_db.test_table")
    .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))");

Spatial range queries are very efficient in Havasu. Havasu supports spatial filter based data skipping. This feature allows user to skip reading data files that don’t contain data that satisfy the spatial filter. Please refer to Cluster by geospatial fields for faster queries for more information.

Working with Geometry Data¶

Data in columns with geometry type will be loaded as GeometryUDT values in Sedona, user can use any ST_ funtions provided by Sedona to manipulate the geospatial data. For example, user can use ST_Buffer to create a buffer around the geometry column:

PythonScalaJava

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show()

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show()

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show();

+----------------------+
|st_buffer(geom, 0.001)|
+----------------------+
|  POLYGON ((3.001 4...|
|  POLYGON ((1.001 2...|
|  POLYGON ((2.001 3...|
|  POLYGON ((1.001 2...|
|  POLYGON ((2.001 3...|
+----------------------+

The resulting DataFrame can also be written back to a Havasu table using a CTAS statement:

PythonScalaJava

sedona.sql("""
CREATE TABLE wherobots.test_db.test_table_buf AS
SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table
""")

sedona.sql("""
CREATE TABLE wherobots.test_db.test_table_buf AS
SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table
""")

sedona.sql(
    "CREATE TABLE wherobots.test_db.test_table_buf AS " +
    "SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table"
);

Or simply call writeTo function on the resulting DataFrame of the query:

PythonScalaJava

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")\
    .writeTo("wherobots.test_db.test_table_buf").create()

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")
    .writeTo("wherobots.test_db.test_table_buf").create()

sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")
    .writeTo("wherobots.test_db.test_table_buf").create();

Geometry Support in Havasu¶

Geometry as a Primitive Type¶

Creating Table with Geometry Column¶

Writing Data¶

INSERT INTO¶

Writing DataFrame to Havasu table¶

Updating data in Havasu table¶

Deleting data from Havasu table¶

Merging DataFrame into Havasu table using MERGE INTO¶

Querying Data¶

Working with Geometry Data¶

Further Reading¶

Merging DataFrame into Havasu table using `MERGE INTO`¶