Run geospatial queries

Transform the Coordinate Reference System¶

Sedona doesn't control the coordinate unit (degree-based or meter-based) of all geometries in a Geometry column. The unit of all related distances in SedonaSQL is same as the unit of all geometries in a Geometry column.

To convert Coordinate Reference System of the Geometry column created before, use the following code:

SELECT ST_Transform(countyshape, "epsg:4326", "epsg:3857") AS newcountyshape, _c1, _c2, _c3, _c4, _c5, _c6, _c7
FROM spatialdf

The first EPSG code EPSG:4326 in ST_Transform is the source CRS of the geometries. It is WGS84, the most common degree-based CRS.

The second EPSG code EPSG:3857 in ST_Transform is the target CRS of the geometries. It is the most common meter-based CRS.

This ST_Transform transform the CRS of these geometries from EPSG:4326 to EPSG:3857. The details CRS information can be found on EPSG.io

The coordinates of polygons have been changed. The output will be like this:

+--------------------+---+---+--------+-----+-----------+--------------------+---+
|      newcountyshape|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|
+--------------------+---+---+--------+-----+-----------+--------------------+---+
|POLYGON ((-108001...| 31|039|00835841|31039|     Cuming|       Cuming County| 06|
|POLYGON ((-137408...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06|
|POLYGON ((-116403...| 35|011|00933054|35011|    De Baca|      De Baca County| 06|
|POLYGON ((-107880...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06|

Range query¶

Use ST_Contains, ST_Intersects, ST_Within to run a range query over a single column.

The following example finds all counties that are within the given polygon:

SELECT *
FROM spatialdf
WHERE ST_Contains (ST_PolygonFromEnvelope(1.0,100.0,1000.0,1100.0), newcountyshape)

Note

Read SedonaSQL API to learn how to create a Geometry type query window

KNN query¶

Use ST_Distance to calculate the distance and rank the distance.

The following code returns the 5 nearest neighbor of the given polygon.

SELECT countyname, ST_Distance(ST_PolygonFromEnvelope(1.0,100.0,1000.0,1100.0), newcountyshape) AS distance
FROM spatialdf
ORDER BY distance DESC
LIMIT 5

Range join¶

Tip

Sedona join performance is heavily affected by the number of partitions. If the join performance is not ideal, please increase the number of partitions by doing df.repartition(XXX) right after you create the original DataFrame.

A range join query is to find geometries from A and geometries from B such that each geometry pair satisfies a certain predicate. Most predicates supported by SedonaSQL can trigger a range join.

SQL example:

SELECT *
FROM polygondf, pointdf
WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)

SELECT *
FROM polygondf, pointdf
WHERE ST_Intersects(polygondf.polygonshape,pointdf.pointshape)

SELECT *
FROM pointdf, polygondf
WHERE ST_Within(pointdf.pointshape, polygondf.polygonshape)

SQL Physical plan:

== Physical Plan ==
RangeJoin polygonshape#20: geometry, pointshape#43: geometry, false
:- Project [st_polygonfromenvelope(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), cast(_c2#2 as decimal(24,20)), cast(_c3#3 as decimal(24,20)), mypolygonid) AS polygonshape#20]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#31 as decimal(24,20)), cast(_c1#32 as decimal(24,20)), myPointId) AS pointshape#43]
   +- *FileScan csv

Distance join¶

Introduction: Find geometries from A and geometries from B such that the distance of each geometry pair is less or equal than a certain distance. It supports the planar Euclidean distance calculators ST_Distance, ST_HausdorffDistance, ST_FrechetDistance and the meter-based geodesic distance calculators ST_DistanceSpheroid and ST_DistanceSphere.

Below are SQL examples for planar Euclidean distance predicates.

Fully within a certain distance:

SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) < 2

SELECT *
FROM pointDf, polygonDF
WHERE ST_HausdorffDistance(pointDf.pointshape, polygonDf.polygonshape, 0.3) < 2

SELECT *
FROM pointDf, polygonDF
WHERE ST_FrechetDistance(pointDf.pointshape, polygonDf.polygonshape) < 2

Intersects within a certain distance:

SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) <= 2

SELECT *
FROM pointDf, polygonDF
WHERE ST_HausdorffDistance(pointDf.pointshape, polygonDf.polygonshape) <= 2

SELECT *
FROM pointDf, polygonDF
WHERE ST_FrechetDistance(pointDf.pointshape, polygonDf.polygonshape) <= 2

SQL Physical plan:

== Physical Plan ==
DistanceJoin pointshape1#12: geometry, pointshape2#33: geometry, 2.0, true
:- Project [st_point(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), myPointId) AS pointshape1#12]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#21 as decimal(24,20)), cast(_c1#22 as decimal(24,20)), myPointId) AS pointshape2#33]
   +- *FileScan csv

Warning

If you use planar euclidean distance functions like ST_Distance, ST_HausdorffDistance or ST_FrechetDistance as the predicate, Sedona doesn't control the distance's unit (degree or meter). It is same with the geometry. If your coordinates are in the longitude and latitude system, the unit of distance should be degree instead of meter or mile. To change the geometry's unit, please either transform the coordinate reference system to a meter-based system. See ST_Transform. If you don't want to transform your data, please consider using ST_DistanceSpheroid or ST_DistanceSphere.

SQL example for meter-based geodesic distance ST_DistanceSpheroid (works for ST_DistanceSphere too):

Less than a certain distance:

SELECT *
FROM pointdf1, pointdf2
WHERE ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) < 2

Less than or equal to a certain distance:

SELECT *
FROM pointdf1, pointdf2
WHERE ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) <= 2

Warning

If you use ST_DistanceSpheroid or ST_DistanceSphere as the predicate, the unit of the distance is meter. Currently, distance join with geodesic distance calculators work best for point data. For non-point data, it only considers their centroids. The distance join algorithm internally uses an approximate distance buffer which might lead to inaccurate results if your data is close to the poles or antimeridian.

Broadcast index join¶

Introduction: Perform a range join or distance join but broadcast one of the sides of the join. This maintains the partitioning of the non-broadcast side and doesn't require a shuffle.

Sedona will create a spatial index on the broadcasted table.

Sedona uses broadcast join only if the correct side has a broadcast hint. The supported join type - broadcast side combinations are:

Inner - either side, preferring to broadcast left if both sides have the hint
Left semi - broadcast right
Left anti - broadcast right
Left outer - broadcast right
Right outer - broadcast left

pointDf.alias("pointDf").join(broadcast(polygonDf).alias("polygonDf"), expr("ST_Contains(polygonDf.polygonshape, pointDf.pointshape)"))

SQL Physical plan:

== Physical Plan ==
BroadcastIndexJoin pointshape#52: geometry, BuildRight, BuildRight, false ST_Contains(polygonshape#30, pointshape#52)
:- Project [st_point(cast(_c0#48 as decimal(24,20)), cast(_c1#49 as decimal(24,20))) AS pointshape#52]
:  +- FileScan csv
+- SpatialIndex polygonshape#30: geometry, QUADTREE, [id=#62]
   +- Project [st_polygonfromenvelope(cast(_c0#22 as decimal(24,20)), cast(_c1#23 as decimal(24,20)), cast(_c2#24 as decimal(24,20)), cast(_c3#25 as decimal(24,20))) AS polygonshape#30]
      +- FileScan csv

This also works for distance joins with ST_Distance, ST_DistanceSpheroid, ST_DistanceSphere, ST_HausdorffDistance or ST_FrechetDistance:

pointDf1.alias("pointDf1").join(broadcast(pointDf2).alias("pointDf2"), expr("ST_Distance(pointDf1.pointshape, pointDf2.pointshape) <= 2"))

SQL Physical plan:

== Physical Plan ==
BroadcastIndexJoin pointshape#52: geometry, BuildRight, BuildLeft, true, 2.0 ST_Distance(pointshape#52, pointshape#415) <= 2.0
:- Project [st_point(cast(_c0#48 as decimal(24,20)), cast(_c1#49 as decimal(24,20))) AS pointshape#52]
:  +- FileScan csv
+- SpatialIndex pointshape#415: geometry, QUADTREE, [id=#1068]
   +- Project [st_point(cast(_c0#48 as decimal(24,20)), cast(_c1#49 as decimal(24,20))) AS pointshape#415]
      +- FileScan csv

Note: If the distance is an expression, it is only evaluated on the first argument to ST_Distance (pointDf1 above).

Automatic broadcast index join¶

When one table involved a spatial join query is smaller than a threadhold, Sedona will automatically choose broadcast index join instead of Sedona optimized join. The current threshold is controlled by sedona.join.autoBroadcastJoinThreshold and set to the same as spark.sql.autoBroadcastJoinThreshold.

Create a User-Defined Function (UDF)¶

User-Defined Functions (UDFs) are user-created procedures that can perform operations on a single row of information. To cover almost all use cases, we will showcase 4 types of UDFs for a better understanding of how to use geometry with UDFs. Sedona's serializer deserializes the SQL geometry type to JTS Geometry (Scala/Java) or Shapely Geometry (Python). You can implement any custom logic using the rich ecosystem around these two libraries.

Geometry to primitive¶

This UDF example takes a geometry type input and returns a primitive type output:

PythonScalaJava

from sedona.sql.types import GeometryType
from pyspark.sql.types import DoubleType

def lengthPoly(geom: GeometryType()):
    return geom.length

sedona.udf.register("udf_lengthPoly", lengthPoly, DoubleType())

df.selectExpr("udf_lengthPoly(geom)").show()

import org.locationtech.jts.geom.Geometry
import org.apache.spark.sql.types._

def lengthPoly(geom: Geometry): Double = {
    geom.getLength
}

sedona.udf.register("udf_lengthPoly", lengthPoly _)

df.selectExpr("udf_lengthPoly(geom)").show()

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// using lambda function to register the UDF
sparkSession.udf().register(
        "udf_lengthPoly",
        (UDF1<Geometry, Double>) Geometry::getLength,
        DataTypes.DoubleType);

df.selectExpr("udf_lengthPoly(geom)").show()

Output:

+--------------------+
|udf_lengthPoly(geom)|
+--------------------+
|   3.414213562373095|
+--------------------+

Geometry to Geometry¶

This UDF example takes a geometry type input and returns a geometry type output:

PythonScalaJava

from sedona.sql.types import GeometryType
from pyspark.sql.types import DoubleType

def bufferFixed(geom: GeometryType()):
    return geom.buffer(5.5)

sedona.udf.register("udf_bufferFixed", bufferFixed, GeometryType())

df.selectExpr("udf_bufferFixed(geom)").show()

import org.locationtech.jts.geom.Geometry
import org.apache.spark.sql.types._

def bufferFixed(geom: Geometry): Geometry = {
    geom.buffer(5.5)
}

sedona.udf.register("udf_bufferFixed", bufferFixed _)

df.selectExpr("udf_bufferFixed(geom)").show()

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// using lambda function to register the UDF
sparkSession.udf().register(
        "udf_bufferFixed",
        (UDF1<Geometry, Geometry>) geom ->
            geom.buffer(5.5),
        new GeometryUDT());

df.selectExpr("udf_bufferFixed(geom)").show()

Output:

+--------------------------------------------------+
|                             udf_bufferFixed(geom)|
+--------------------------------------------------+
|POLYGON ((1 -4.5, -0.0729967710887076 -4.394319...|
+--------------------------------------------------+

Geometry, primitive to geometry¶

This UDF example takes a geometry type input and a primitive type input and returns a geometry type output:

PythonScalaJava

from sedona.sql.types import GeometryType
from pyspark.sql.types import DoubleType

def bufferIt(geom: GeometryType(), distance: DoubleType()):
    return geom.buffer(distance)

sedona.udf.register("udf_buffer", bufferIt, GeometryType())

df.selectExpr("udf_buffer(geom, distance)").show()

import org.locationtech.jts.geom.Geometry
import org.apache.spark.sql.types._

def bufferIt(geom: Geometry, distance: Double): Geometry = {
    geom.buffer(distance)
}

sedona.udf.register("udf_buffer", bufferIt _)

df.selectExpr("udf_buffer(geom, distance)").show()

import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// using lambda function to register the UDF
sparkSession.udf().register(
        "udf_buffer",
        (UDF2<Geometry, Double, Geometry>) Geometry::buffer,
        new GeometryUDT());

df.selectExpr("udf_buffer(geom, distance)").show()

Output:

+--------------------------------------------------+
|                        udf_buffer(geom, distance)|
+--------------------------------------------------+
|POLYGON ((1 -9, -0.9509032201612866 -8.80785280...|
+--------------------------------------------------+

Geometry, primitive to Geometry, primitive¶

This UDF example takes a geometry type input and a primitive type input and returns a geometry type and a primitive type output:

PythonScalaJava

from sedona.sql.types import GeometryType
from pyspark.sql.types import *

schemaUDF = StructType([
    StructField("buffed", GeometryType()),
    StructField("length", DoubleType())
    ])

def bufferAndLength(geom: GeometryType(), distance: DoubleType()):
    buffed = geom.buffer(distance)
    length = buffed.length
    return [buffed, length]

sedona.udf.register("udf_bufferLength", bufferAndLength, schemaUDF)

df.withColumn("bufferLength", expr("udf_bufferLength(geom, buffer)"))
            .select("geom", "buffer", "bufferLength.*")
            .show()

import org.locationtech.jts.geom.Geometry
import org.apache.spark.sql.types._
import org.apache.spark.sql.api.java.UDF2

val schemaUDF = StructType(Array(
    StructField("buffed", GeometryUDT),
    StructField("length", DoubleType)
))

val udf_bufferLength = udf(
    new UDF2[Geometry, Double, (Geometry, Double)] {
        def call(geom: Geometry, distance: Double): (Geometry, Double) = {
            val buffed = geom.buffer(distance)
            val length = geom.getLength
            (buffed, length)
        }
    }, schemaUDF)

sedona.udf.register("udf_bufferLength", udf_bufferLength)

data.withColumn("bufferLength", expr("udf_bufferLengths(geom, distance)"))
    .select("geom", "distance", "bufferLength.*")
    .show()

import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

StructType schemaUDF = new StructType()
            .add("buffedGeom", new GeometryUDT())
            .add("length", DataTypes.DoubleType);

// using lambda function to register the UDF
sparkSession.udf().register("udf_bufferLength",
            (UDF2<Geometry, Double, Tuple2<Geometry, Double>>) (geom, distance) -> {
                Geometry buffed = geom.buffer(distance);
                Double length = buffed.getLength();
                return new Tuple2<>(buffed, length);
            },
            schemaUDF);

df.withColumn("bufferLength", functions.expr("udf_bufferLength(geom, distance)"))
            .select("geom", "distance", "bufferLength.*")
            .show();

Output:

+------------------------------+--------+--------------------------------------------------+-----------------+
|                          geom|distance|                                        buffedGeom|           length|
+------------------------------+--------+--------------------------------------------------+-----------------+
|POLYGON ((1 1, 1 2, 2 1, 1 1))|    10.0|POLYGON ((1 -9, -0.9509032201612866 -8.80785280...|66.14518337329191|
+------------------------------+--------+--------------------------------------------------+-----------------+

S2/H3 based approximate equi-join¶

If the performance of Sedona optimized join is not ideal, which is possibly caused by complicated and overlapping geometries, you can resort to Sedona built-in Google S2 or Uber H3 -based approximate equi-join. This equi-join leverages Spark's internal equi-join algorithm and might be performant given that you can opt to skip the refinement step by sacrificing query accuracy.

Please use the following steps:

1. Generate cell ids for both tables¶

Use ST_S2CellIds or ST_H3CellIds to generate cell IDs. Each geometry may produce one or more IDs.

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM lefts

SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM rights

2. Perform equi-join¶

Join the two tables by their cellId

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId

3. Optional: Refine the result¶

Due to the nature of S2/H3 Cellid, the equi-join results might have a few false-positives depending on the S2/H3 level you choose. A smaller level indicates bigger cells, less exploded rows, but more false positives.

To ensure the correctness, you can use one of the Spatial Predicates to filter out them. Use this query instead of the query in Step 2.

SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs, rcs
WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)

As you see, compared to the query in Step 2, we added one more filter, which is ST_Contains, to remove false positives. You can also use ST_Intersects and so on.

Tip

You can skip this step if you don't need 100% accuracy and want faster query speed.

4. Optional: De-duplicate¶

Due to the explode function used when we generate S2/H3 Cell Ids, the resulting DataFrame may have several duplicate <lcs_geom, rcs_geom> matches. You can remove them by performing a GroupBy query.

SELECT lcs_id, rcs_id , first(lcs_geom), first(lcs_name), first(rcs_geom), first(rcs_name)
FROM joinresult
GROUP BY (lcs_id, rcs_id)

The first function is to take the first value from a number of duplicate values.

If you don't have a unique id for each geometry, you can also group by geometry itself. See below:

SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
FROM joinresult
GROUP BY (lcs_geom, rcs_geom)

Note

If you are doing point-in-polygon join, this is not a problem and you can safely discard this issue. This issue only happens when you do polygon-polygon, polygon-linestring, linestring-linestring join.

S2/H3 for distance join¶

This also works for distance join. You first need to use ST_Buffer(geometry, distance) to wrap one of your original geometry column. If your original geometry column contains points, this ST_Buffer will make them become circles with a radius of distance.

For example. run this query first on the left table before Step 1.

SELECT id, ST_Buffer(geom, DISTANCE), name
FROM lefts

Since the coordinates are in the longitude and latitude system, so the unit of distance should be degree instead of meter or mile. You will have to estimate the corresponding degrees based on your meter values. Please use this calculator.

Other queries¶

There are lots of other functions can be combined with these queries. Please read SedonaSQL API.

Last update: February 9, 2024 03:04:11