Out-DB Rasters¶
Overview¶
Out-DB rasters only holds the geo-referencing information of the raster, as well as the path to the actual raster data. The actual raster data is stored in remote storage, such as S3. Out-DB rasters are useful when the raster data is too large to be stored in parquet files, or when the raster data is already stored in some remote storage.
Using Out-DB Rasters¶
Out-DB Rasters has the same data type with In-DB Rasters, they are all represented as raster
primitive type in Havasu and can be loaded as RasterUDT
objects in Sedona. If user has a table containing a raster
column, it can have both In-DB and Out-DB rasters
in that column.
Creating Out-DB Rasters¶
User can create an out-db raster using RS_FromPath
function:
df = sedona.sql("SELECT RS_FromPath('s3a://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast")
val df = sedona.sql("SELECT RS_FromPath('s3a://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast")
Dataset<Row> df = sedona.sql("SELECT RS_FromPath('s3a://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast");
The result of RS_FromPath
is an Out-db raster object, which can be stored in a raster column in a Havasu table.
Out-db rasters can be accessed in the same way as in-db rasters. For example, user can use RS_Metadata
function to retrieve the metadata of the raster, or use band data accessors to read the pixel data of the raster.
df.selectExpr("RS_Metadata(rast) meta", "RS_Value(rast, ST_Point(601215, 4326949)) value").show()
df.selectExpr("RS_Metadata(rast) meta", "RS_Value(rast, ST_Point(601215, 4326949)) value").show()
df.selectExpr("RS_Metadata(rast) meta", "RS_Value(rast, ST_Point(601215, 4326949)) value").show();
+--------------------+------+
| meta| value|
+--------------------+------+
|[600895.002182237...|1113.0|
+--------------------+------+
The rasters path of out-db rasters can be retrieved using RS_BandPath
function. It will give the path of the raster file in remote storage. If the argument is an in-db raster, RS_BandPath
will return NULL
.
df.selectExpr("RS_BandPath(rast)").show()
df.selectExpr("RS_BandPath(rast)").show()
df.selectExpr("RS_BandPath(rast)").show();
+--------------------+
| rs_bandpath(rast)|
+--------------------+
|s3a://wherobots-e...|
+--------------------+
Large Out-DB rasters can be subdivided into tiles using RS_TileExplode
function:
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show();
The rasters RS_TileExplode
produces are also Out-DB rasters as long as the input raster is an Out-DB raster. These tiles reference different parts of the same raster file in remote storage, so they are very cheap to create.
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show();
+---+---+--------------------+
| x| y| rast|
+---+---+--------------------+
| 0| 0|OutDbGridCoverage...|
| 1| 0|OutDbGridCoverage...|
| 0| 1|OutDbGridCoverage...|
| 1| 1|OutDbGridCoverage...|
+---+---+--------------------+
Supported Out-DB Storages¶
Havasu supports any storages that Hadoop supports, as long as the storage has an implementation of org.apache.hadoop.fs.FileSystem
on the classpath of the Wherobots Cloud. In the above examples, we accessed out-db rasters stored on S3 using the S3A file system provided by Hadoop AWS. Following is a list of storages that are tested on Wherobots Cloud:
Storage | Scheme | File System Class |
---|---|---|
Local | file:// |
org.apache.hadoop.fs.LocalFileSystem |
S3 | s3a:// |
org.apache.hadoop.fs.s3a.S3AFileSystem |
Performance Characteristics¶
Out-DB rasters keeps the geo-referencing information of the raster, but not the actual band data. If the application only needs to access the non-band portion of the raster, out-db rasters will be very efficient. For instance, functions such as RS_Metadata
, RS_GeoReference
and RS_SRID
are very fast to run on Out-DB rasters. However, if the application needs to access the band data of the raster, it needs to read the raster file from remote storage, which is much slower than reading the band data from In-DB rasters. For example, functions such as RS_Value
, RS_MapAlgebra
and RS_Resample
are slower to run on Out-DB rasters than In-DB rasters.
Caching of Out-DB Rasters¶
Havasu supports caching of out-db rasters, so that if a raster is read from the same path multiple times, it may hit the cache and does not need to fetch the raster from remote storage. This avoids the overhead of reading the raster from remote storage multiple times. The cache is per-executor core, and the caches for different cores in the same executor is not shared. This is because the cache is implemented as thread local variables, which is per-executor core. We'll change the implementation in the future to make the cache shared across all the threads in the same executor, so that the cache hit rate will be higher.
The size of each Out-DB raster cache can be configured by spark.wherobots.raster.outdb.pool.size
, the default value is 100. If the cache is full, the least recently used raster will be evicted from the cache. The cache is stored in local disk, the directory for storing the cache can be configured by spark.wherobots.raster.outdb.cache.dir
, the default value is /data/outdb-cache
. It is the same directory as the directory used by Spark block manager for storing shuffled data. If the disk space is not enough, user can tune
spark.wherobots.raster.outdb.pool.size
to a smaller value, or use a larger EBS volume. If your Out-DB rasters rarely needs to be read more than once, you can also set spark.wherobots.raster.outdb.pool.size
to 0
to disable the cache.
The raster caches does not download and cache the entire Out-DB raster at once, it only caches the parts of the raster file that are accessed by the application. For example, if the application calls RS_Value(rast, pt)
to retrieve the value of one pixel at point pt
, only the part of the raster file that contains the hit tile will be fetched from remote storaga and thus cached. The size of each read is controlled by spark.wherobots.raster.outdb.readahead
, the default value is 64k
. Larger read ahead value such as 4m
may improve the performance of reading the entire raster file, but may hurt the performance if only a small portion of the raster file is needed.
Improve Cache Hit Rate¶
Caching the DataFrame¶
The cache is per-executor core, so if the same raster is read from different executor threads, it will be cached multiple times. To make the cache hit rate higher, it is recommended to cache the dataframe containing out-db rasters before running any spatial operations on it. For example:
df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
df_tiles.cache()
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
val df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
df_tiles.cache()
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
Dataset<Row> df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)");
df_tiles.cache();
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();
Since the dataframe was cached, Spark will be more likely to schedule the spatial operations on the same executor threads, so that the cached rasters has a higher probability to be reused.
Ordering Out-DB Rasters by remote storage path¶
If the Out-DB rasters referencing the same remote storage path were ordered adjacently, the cache hit rate will be higher. If you are processing tiled Out-DB raster dataset, it is recommended to order the rasters by RS_BandPath(rast)
before processing the dataset.
The overhead of sorting the dataset brings great benefit to the overall performance.
df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()
val df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()
Dataset<Row> df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache();
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show();
Note
The size of the out-db object pool should be larger than the number of distinct out-db raster path per partition, Otherwise, the cache will be trashing and won't be effective.
Configurations¶
There are several configurations for pooling of out-db rasters:
Configuration | Default | Description |
---|---|---|
spark.wherobots.raster.outdb.pool.size |
100 |
Size of per-executor core raster cache |
spark.wherobots.raster.outdb.cache.dir |
/data/outdb-cache |
Directory for storing cached raster data |
spark.wherobots.raster.outdb.readahead |
64k |
The minimum number of bytes to read on each read of remote storage. Larger read ahead value may improve the performance of reading large rasters, but may hurt the performance if only a small portion of the raster file is needed. |
There is another configuration to take care of when accessing a large number of out-db rasters stored on S3: spark.hadoop.fs.s3a.connection.maximum
. This configuration controls the maximum number of concurrent connections to S3. Wherobots cluster may hold a large number of concurrent connections because of out-db raster pooling, so it is configured to a large number by default. The default configuration is 2000
.
Note
The options mentioned above can only be configured when creating the Spark session, they cannot be changed after the Spark session is created.