Spatial Filter Push-down for Rasters¶

Havasu supports spatial filter push-down for rasters. Havasu collects the WGS84 bounding box of rasters when writing data files containing raster columns, and use them to skip unrelated data files when running raster spatial filter such as RS_Intersects, RS_Contains and RS_Within. This can greatly reduce the amount of data that needs to be processed by the application, and thus improve the performance of the application. Spatial filter push-down works for both in-db rasters and out-db rasters.

Spatial Filter Push-down on Partitioned Tables¶

Raster spatial filter push-down works best on raster tables partitioned by location-related columns. For example, if you have a table containing rasters of different cities, you can partition the table by the city name column, so that the rasters of the same city will be stored in the same partition. This will make the spatial filter push-down more effective.

Another example is to partition the table by SRID when the rasters in your table have UTM coordinate reference system and are scattered in different UTM zones. Here is an example of partitioning EuroSAT dataset by SRID to optimize spatial range queries:

PythonScalaJava

# Load rasters from GeoTIFF files
df_binary = sedona.read.format("binaryFile")\
    .option("pathGlobFilter", "*.tif")\
    .option("recursiveFileLookup", "true")\
    .load('s3a://wherobots-examples/data/eurosat_small')
df_geotiff = df_binary.withColumn("rast", expr("RS_FromGeoTiff(content)"))\
    .select("path", "length", "rast")

# Write the data into a table partitioned by SRID
df_geotiff.withColumn("srid", expr("RS_SRID(rast) as srid"))\
    .sort('srid')\
    .write.format("havasu.iceberg").partitionBy("srid")\
    .saveAsTable("wherobots.test_db.eurosat_ms_srid")

// Load rasters from GeoTIFF files
val df_binary = sedona.read.format("binaryFile")
    .option("pathGlobFilter", "*.tif")
    .option("recursiveFileLookup", "true")
    .load("s3a://wherobots-examples/data/eurosat_small")

val df_geotiff = df_binary.withColumn("rast", expr("RS_FromGeoTiff(content)"))
    .select("path", "length", "rast")

// Write the data into a table partitioned by SRID
df_geotiff.withColumn("srid", expr("RS_SRID(rast) as srid"))
    .sort('srid')
    .write.format("havasu.iceberg").partitionBy("srid")
    .saveAsTable("wherobots.test_db.eurosat_ms_srid")

// Load rasters from GeoTIFF files
Dataset<Row> df_binary = sedona.read().format("binaryFile")
    .option("pathGlobFilter", "*.tif")
    .option("recursiveFileLookup", "true")
    .load("s3a://wherobots-examples/data/eurosat_small");

Dataset<Row> df_geotiff = df_binary.withColumn("rast", expr("RS_FromGeoTiff(content)"))
    .select("path", "length", "rast");

// Write the data into a table partitioned by SRID
df_geotiff.withColumn("srid", expr("RS_SRID(rast) as srid"))
    .sort('srid')
    .write().format("havasu.iceberg").partitionBy("srid")
    .saveAsTable("wherobots.test_db.eurosat_ms_srid");

Subsequent queries such as

SELECT * FROM wherobots.test_db.eurosat_ms_srid WHERE RS_Intersects(rast, ...)

will be much faster than the same query on a non-partitioned table.

Clustering by Raster Columns¶

Havasu also supports clustering the data by raster columns. This can be achieved by using CREATE SPATIAL INDEX statement to rewrite the table and cluster the data by the raster column. For example, we can cluster the data in table wherobots.db.test_table by the raster column rast:

CREATE SPATIAL INDEX FOR wherobots.db.test_table USING hilbert(rast, 10);

This is only recommended for the following scenarios:

The original table does not have a good column for partitioning.
The table contains out-db rasters or very small in-db rasters.

Last update: January 26, 2024 05:53:04