
Out-DB Rasters

Overview

Out-DB rasters hold only the geo-referencing information of the raster, together with the path to the actual raster data. The band data itself lives in remote storage, such as S3. Out-DB rasters are useful when the raster data is too large to be stored in Parquet files, or when the raster data is already stored in remote storage.

In-DB and Out-DB Rasters

Using Out-DB Rasters

Out-DB rasters have the same data type as In-DB rasters: both are represented as the raster primitive type in Havasu and can be loaded as RasterUDT objects in Sedona. A table with a raster column can therefore hold both In-DB and Out-DB rasters in that column.

Creating Out-DB Rasters

Users can create an Out-DB raster using the RS_FromPath function:

Python:
df = sedona.sql("SELECT RS_FromPath('s3a://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast")

Scala:
val df = sedona.sql("SELECT RS_FromPath('s3a://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast")

Java:
Dataset<Row> df = sedona.sql("SELECT RS_FromPath('s3a://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast");

SedonaDB can also create many Out-DB rasters at once by scanning a directory. The example below drops the content column read by the binaryFile data source, so the actual raster data is retrieved from remote storage only when required for computation:

Python:
df = (sedona.read.format("binaryFile")
          .load("s3a://wherobots-examples/data/eurosat_small/AnnualCrop/*.tif")
          .drop("content").withColumn("rast", expr("RS_FromPath(path)")))

Scala:
val df = sedona.read.format("binaryFile")
           .load("s3a://wherobots-examples/data/eurosat_small/AnnualCrop/*.tif")
           .drop("content").withColumn("rast", expr("RS_FromPath(path)"))

Java:
Dataset<Row> df = sedona.read.format("binaryFile")
           .load("s3a://wherobots-examples/data/eurosat_small/AnnualCrop/*.tif")
           .drop("content").withColumn("rast", expr("RS_FromPath(path)"));

The result of RS_FromPath is an Out-DB raster object, which can be stored in a raster column of a Havasu table.

Out-DB rasters can be accessed in the same way as In-DB rasters. For example, users can call the RS_Metadata function to retrieve the metadata of the raster, or use band data accessors to read its pixel data.

Python:
df.selectExpr("RS_Metadata(rast) meta", "RS_Value(rast, ST_Point(601215, 4326949)) value").show()

Scala:
df.selectExpr("RS_Metadata(rast) meta", "RS_Value(rast, ST_Point(601215, 4326949)) value").show()

Java:
df.selectExpr("RS_Metadata(rast) meta", "RS_Value(rast, ST_Point(601215, 4326949)) value").show();
+--------------------+------+
|                meta| value|
+--------------------+------+
|[600895.002182237...|1113.0|
+--------------------+------+

The remote storage path of an Out-DB raster can be retrieved using the RS_BandPath function, which returns the path of the raster file in remote storage. If the argument is an In-DB raster, RS_BandPath returns NULL.

Python:
df.selectExpr("RS_BandPath(rast)").show()

Scala:
df.selectExpr("RS_BandPath(rast)").show()

Java:
df.selectExpr("RS_BandPath(rast)").show();
+--------------------+
|   rs_bandpath(rast)|
+--------------------+
|s3a://wherobots-e...|
+--------------------+

Large Out-DB rasters can be subdivided into tiles using the RS_TileExplode function:

Python:
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()

Scala:
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()

Java:
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show();

The rasters RS_TileExplode produces are also Out-DB rasters as long as the input raster is an Out-DB raster. These tiles reference different parts of the same raster file in remote storage, so they are very cheap to create.

Python:
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show()

Scala:
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show()

Java:
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show();
+---+---+--------------------+
|  x|  y|                rast|
+---+---+--------------------+
|  0|  0|OutDbGridCoverage...|
|  1|  0|OutDbGridCoverage...|
|  0|  1|OutDbGridCoverage...|
|  1|  1|OutDbGridCoverage...|
+---+---+--------------------+
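Conceptually, each tile is just a pixel window over the same source file, which is why producing tiles is cheap. The grid arithmetic can be sketched in plain Python (this is an illustration, not the Wherobots implementation; `tile_windows` is a hypothetical helper):

```python
def tile_windows(width, height, tile_w, tile_h):
    """Return (x, y, px_off, py_off, w, h) pixel windows covering the raster.

    Edge tiles are clipped so every pixel is covered exactly once.
    No pixel data is touched; only offsets are computed.
    """
    tiles = []
    for y in range((height + tile_h - 1) // tile_h):
        for x in range((width + tile_w - 1) // tile_w):
            px, py = x * tile_w, y * tile_h
            w = min(tile_w, width - px)
            h = min(tile_h, height - py)
            tiles.append((x, y, px, py, w, h))
    return tiles

# A 64x64 raster with 32x32 tiles yields the 2x2 grid shown above.
print(tile_windows(64, 64, 32, 32))
```

Each resulting tile carries a window into the original file, so creating them involves no I/O at all.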

Supported Out-DB Storages

Havasu supports any storage that Hadoop supports, as long as the storage has an implementation of org.apache.hadoop.fs.FileSystem on the classpath of Wherobots Cloud. In the examples above, we accessed Out-DB rasters stored on S3 using the S3A file system provided by Hadoop AWS. The following storage backends have been tested on Wherobots Cloud:

Storage   Scheme    File System Class
Local     file://   org.apache.hadoop.fs.LocalFileSystem
S3        s3a://    org.apache.hadoop.fs.s3a.S3AFileSystem

Performance Characteristics

Out-DB rasters keep the geo-referencing information of the raster, but not the actual band data. If the application only needs the non-band portion of the raster, Out-DB rasters are very efficient: functions such as RS_Metadata, RS_GeoReference and RS_SRID run quickly on them. However, if the application needs the band data, the raster file must be read from remote storage, which is much slower than reading band data from an In-DB raster. Functions such as RS_Value, RS_MapAlgebra and RS_Resample are therefore slower on Out-DB rasters than on In-DB rasters.

Caching of Out-DB Rasters

Havasu supports caching of Out-DB rasters: if a raster is read from the same path multiple times, the read may hit the cache instead of fetching the raster from remote storage again, avoiding repeated remote reads. The cache is per-executor core; caches for different cores in the same executor are not shared, because the cache is implemented with thread-local variables. We plan to change the implementation in the future so that the cache is shared across all threads in the same executor, which will raise the cache hit rate.
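The pool's eviction behavior can be illustrated with a small least-recently-used pool (a plain-Python sketch under our own naming, not the actual Wherobots implementation):

```python
from collections import OrderedDict

class OutDbRasterPool:
    """LRU pool of open raster handles, keyed by remote storage path (sketch)."""

    def __init__(self, pool_size=100):
        self.pool_size = pool_size
        self._entries = OrderedDict()  # path -> cached handle, LRU order

    def get(self, path, open_fn):
        if path in self._entries:
            self._entries.move_to_end(path)    # cache hit: mark most recently used
            return self._entries[path]
        handle = open_fn(path)                 # cache miss: fetch from remote storage
        self._entries[path] = handle
        while len(self._entries) > self.pool_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return handle
```

Note that a pool_size of 0 makes every entry evict immediately after insertion, which matches the documented way to disable the cache.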

The size of each Out-DB raster cache can be configured with spark.wherobots.raster.outdb.pool.size; the default value is 100. When the cache is full, the least recently used raster is evicted. The cache is stored on local disk; the directory can be configured with spark.wherobots.raster.outdb.cache.dir, which defaults to /data/outdb-cache. This is the same directory the Spark block manager uses for storing shuffle data. If disk space runs short, tune spark.wherobots.raster.outdb.pool.size to a smaller value, or use a larger EBS volume. If your Out-DB rasters rarely need to be read more than once, you can also set spark.wherobots.raster.outdb.pool.size to 0 to disable the cache.

The raster cache does not download the entire Out-DB raster at once; it only caches the parts of the raster file that the application actually accesses. For example, if the application calls RS_Value(rast, pt) to retrieve the value of one pixel at point pt, only the part of the raster file containing the hit tile is fetched from remote storage and cached. The size of each read is controlled by spark.wherobots.raster.outdb.readahead; the default value is 64k. A larger readahead value such as 4m may improve the performance of reading an entire raster file, but may hurt performance when only a small portion of the file is needed.
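The effect of the readahead setting amounts to simple byte-range arithmetic: a small read is widened to at least the readahead size, so nearby reads hit bytes that are already cached. A simplified sketch (`fetch_range` is a hypothetical helper, not part of the Wherobots API):

```python
READAHEAD = 64 * 1024  # spark.wherobots.raster.outdb.readahead = 64k (default)

def fetch_range(offset, length, readahead=READAHEAD):
    """Return the (start, end) byte range actually requested from remote storage.

    The request is widened to at least `readahead` bytes, so subsequent reads
    of neighbouring pixels in the same tile are served from the cache.
    """
    end = offset + max(length, readahead)
    return offset, end

# Reading 8 bytes at offset 1000 still pulls a full 64 KiB chunk:
print(fetch_range(1000, 8))  # (1000, 66536)
```

This also shows the trade-off: a 4m readahead turns many small requests into one large one, which helps full-file scans but wastes bandwidth for sparse pixel lookups.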

Improve Cache Hit Rate

Caching the DataFrame

The cache is per-executor core, so if the same raster is read from different executor threads, it is cached multiple times. To raise the cache hit rate, it is recommended to cache the DataFrame containing Out-DB rasters before running any spatial operations on it. For example:

Python:
df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
df_tiles.cache()
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()

Scala:
val df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
df_tiles.cache()
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()

Java:
Dataset<Row> df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)");
df_tiles.cache();
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();

Since the DataFrame is cached, Spark is more likely to schedule the spatial operations on the same executor threads, so the cached rasters have a higher probability of being reused.

Ordering Out-DB Rasters by remote storage path

If Out-DB rasters referencing the same remote storage path are ordered adjacently, the cache hit rate will be higher. If you are processing a tiled Out-DB raster dataset, it is recommended to sort the rasters by RS_BandPath(rast) before processing the dataset. The overhead of sorting is usually outweighed by the improvement in overall performance.

Python:
df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()

Scala:
val df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()

Java:
Dataset<Row> df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache();
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show();
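The benefit of ordering can be quantified with a toy model: count the misses a small LRU cache takes over a sorted versus an interleaved sequence of band paths (a plain-Python sketch, not the real cache):

```python
def cache_misses(paths, pool_size=1):
    """Count misses for an LRU cache of `pool_size` entries over a path sequence."""
    cache, misses = [], 0
    for p in paths:
        if p in cache:
            cache.remove(p)        # hit: refresh its position
        else:
            misses += 1            # miss: would fetch from remote storage
        cache.append(p)            # most recently used at the end
        if len(cache) > pool_size:
            cache.pop(0)           # evict least recently used
    return misses

paths = ["a.tif", "b.tif", "a.tif", "b.tif", "a.tif", "b.tif"]
print(cache_misses(paths))          # interleaved paths: every access misses
print(cache_misses(sorted(paths)))  # sorted paths: one miss per distinct file
```

With the interleaved order every access evicts the entry the next access needs, while the sorted order misses only once per distinct file, which is exactly why sorting by RS_BandPath(rast) helps.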

Note

The size of the Out-DB object pool should be larger than the number of distinct Out-DB raster paths per partition. Otherwise, the cache will thrash and won't be effective.

Configurations

There are several configurations for the pooling of Out-DB rasters:

Configuration                            Default            Description
spark.wherobots.raster.outdb.pool.size   100                Size of the per-executor-core raster cache
spark.wherobots.raster.outdb.cache.dir   /data/outdb-cache  Directory for storing cached raster data
spark.wherobots.raster.outdb.readahead   64k                Minimum number of bytes to read on each access to remote storage. A larger readahead value may improve performance when reading large rasters, but may hurt performance when only a small portion of the raster file is needed.

There is one more configuration to consider when accessing a large number of Out-DB rasters stored on S3: spark.hadoop.fs.s3a.connection.maximum. It controls the maximum number of concurrent connections to S3. A Wherobots cluster may hold a large number of concurrent connections because of Out-DB raster pooling, so this option is set to a large value, 2000, by default.

Note

The options mentioned above can only be configured when the Spark session is created; they cannot be changed after the session exists.
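As a sketch, such options would be set on the session builder before the session is created (PySpark shown; the exact session bootstrap on Wherobots Cloud may differ, and the values here are illustrative):

```python
from pyspark.sql import SparkSession

# Out-DB raster cache settings must be in place before the session exists;
# calling spark.conf.set(...) afterwards has no effect for these keys.
spark = (
    SparkSession.builder
    .config("spark.wherobots.raster.outdb.pool.size", "200")
    .config("spark.wherobots.raster.outdb.cache.dir", "/data/outdb-cache")
    .config("spark.wherobots.raster.outdb.readahead", "4m")
    .config("spark.hadoop.fs.s3a.connection.maximum", "2000")
    .getOrCreate()
)
```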


Last update: February 9, 2024 03:04:11