This document provides some tips for improving the performance of Havasu raster operations, especially out-db raster operations.
Only Store Small Rasters as In-DB Rasters¶
In-DB rasters are stored in the parquet data files, so storing small in-db rasters in data files can achieve higher read throughput. It is generally a good idea to store small preprocessed chips as in-db rasters, so that they can be read by downstream applications such as machine learning tasks powered by GeoTorchAI more efficiently. For instance, the EuroSAT dataset contains labeled chips of size 64x64. These small chips will be loaded faster if they are stored as in-db rasters.
Storing large rasters as in-db rasters will make the data processing tasks prone to out-of-memory errors, because the entire raster will be loaded into memory when the raster is read from the data files. It is recommended to store large rasters as out-db rasters so they can be loaded lazily, and it is faster to run complicated tasks requiring shuffles on out-db rasters, such as repartitioning and sorting.
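As a rough way to apply this rule, the uncompressed in-memory size of a raster can be estimated from its dimensions, band count, and pixel type. The sketch below is only illustrative: the helper names and the 1 MiB threshold are assumptions made for this example and are not values defined by Havasu. The 13-band, 16-bit layout for the EuroSAT chip and the dimensions of the large scene are also assumed.

```python
# Hypothetical threshold used only in this sketch; Havasu does not define it.
IN_DB_LIMIT_BYTES = 1024 * 1024


def raster_bytes(width, height, bands, bytes_per_pixel):
    """Uncompressed size of the pixel data once the raster is fully loaded."""
    return width * height * bands * bytes_per_pixel


def suggested_storage(width, height, bands, bytes_per_pixel,
                      limit=IN_DB_LIMIT_BYTES):
    """Suggest 'in-db' for small rasters and 'out-db' for large ones."""
    size = raster_bytes(width, height, bands, bytes_per_pixel)
    return "in-db" if size <= limit else "out-db"


# A 64x64 chip, assuming 13 bands of 16-bit pixels.
chip = suggested_storage(64, 64, 13, 2)

# A hypothetical large scene: 10980x10980 pixels, 4 bands of 16-bit pixels.
scene = suggested_storage(10980, 10980, 4, 2)

print(chip, scene)
```

The right threshold depends on executor memory and on how many rasters are held in a partition at once, so treat the constant as a starting point to tune rather than a recommendation.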
Tuning Read-Ahead Size for Out-DB Rasters¶
Out-DB rasters are stored in remote storage, such as S3. When reading an out-db raster, Havasu reads the external raster file from remote storage on demand. Reading larger blocks at a time raises the throughput of reading out-db rasters. The size of each read is controlled by
spark.wherobots.raster.outdb.readahead. The default value is
64k. A larger read-ahead value may improve the performance of reading the entire raster file, but may hurt performance if only a small portion of the raster file is needed. Here are some guidelines for tuning the read-ahead size:
- For operations that read the entire raster file, such as
RS_Count, a larger read-ahead size (for example,
4m) will improve the performance.
- For operations that read a small portion of the raster file, such as
RS_Value, a smaller read-ahead size will improve the performance, since the extra data fetched by read-ahead is not reused. However, there are exceptions. If we run many
RS_Value calls that hit most of the tiles of an out-db raster, it is beneficial to have a large read-ahead size, since the extra data fetched by read-ahead will be reused by the subsequent
RS_Value calls.
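The trade-off in these guidelines can be illustrated with a toy model of a block-aligned reader. This is a simplified sketch, not Havasu's actual implementation: it assumes every remote request fetches one aligned block of the read-ahead size and that fetched blocks are kept for later reads. The function name, the 16 MiB file size, and the offsets are all hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def simulate_reads(file_size, readahead, ranges):
    """Return (remote_requests, bytes_fetched) for a list of (offset, length) reads.

    Toy model: each request fetches one aligned block of `readahead` bytes,
    and a block that was already fetched is never fetched again.
    """
    fetched_blocks = set()
    bytes_fetched = 0
    for offset, length in ranges:
        first = offset // readahead
        last = (offset + length - 1) // readahead
        for block in range(first, last + 1):
            if block not in fetched_blocks:
                fetched_blocks.add(block)
                # The final block of the file may be shorter than readahead.
                bytes_fetched += min(readahead, file_size - block * readahead)
    return len(fetched_blocks), bytes_fetched


file_size = 16 * MIB

# Reading the whole file: a larger read-ahead needs far fewer requests.
full_scan = [(0, file_size)]
scan_small = simulate_reads(file_size, 64 * KIB, full_scan)
scan_large = simulate_reads(file_size, 4 * MIB, full_scan)

# Reading a single 4-byte pixel: a larger read-ahead fetches much more unused data.
point_read = [(5 * MIB + 123, 4)]
point_small = simulate_reads(file_size, 64 * KIB, point_read)
point_large = simulate_reads(file_size, 4 * MIB, point_read)

print(scan_small, scan_large, point_small, point_large)
```

In this model the full scan transfers the same number of bytes either way but issues many more requests with the small read-ahead, while the point read issues one request either way but transfers far more data with the large one.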
Being Aware of the Out-DB Raster Cache¶
Sedona will cache recently read out-db rasters in a per-executor object pool, so that subsequent reads of the same raster will be significantly faster. It is recommended to sort the out-db rasters by external storage paths before processing them, so that the rasters referencing the same external storage path will be read adjacently. This will improve the cache hit rate and thus improve the overall performance. Please refer to Out-DB Rasters - Improve Cache Hit Rate for details.
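The effect of sorting on the cache hit rate can be sketched with a small simulation. The document does not specify the eviction policy or the capacity of the object pool, so the least-recently-used policy, the capacity of 2, and the S3 paths below are assumptions made purely for illustration.

```python
from collections import OrderedDict


def count_cache_hits(paths, capacity):
    """Count hits when rasters are read in the given order through an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for path in paths:
        if path in cache:
            hits += 1
            cache.move_to_end(path)        # mark as most recently used
        else:
            cache[path] = True             # stands in for opening the remote raster
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits


# Hypothetical workload: 12 raster rows referencing 3 external files, interleaved.
interleaved = ["s3://bucket/a.tif", "s3://bucket/b.tif", "s3://bucket/c.tif"] * 4

hits_unsorted = count_cache_hits(interleaved, capacity=2)
hits_sorted = count_cache_hits(sorted(interleaved), capacity=2)

print(hits_unsorted, hits_sorted)
```

With the interleaved order, every read evicts a raster that is needed again shortly afterwards. After sorting, each file is opened once and all remaining reads of it hit the cache.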
Using Cloud Optimized GeoTIFF (COG) for Out-DB Rasters¶
COG is a format for storing geospatial raster data in a cloud-friendly way. It is optimized for reading from remote storage such as S3. It is recommended to use Cloud Optimized GeoTIFF files as out-db rasters. The reasons are as follows:
- COG files have their IFD (Image File Directory) in a contiguous block at the beginning of the file. This allows Havasu to read the entire IFD without seeking to discrete locations in the file. This is very important for reading from remote storage, because multiple discrete small reads are much slower than a single large read.
- COG files organize band data in properly sized tiles, such as 256x256. Square tiles are generally better than
stride tiles such as 10000x1, because square tiles are more likely to be reused across most raster operations, especially
RS_TileExplode. Havasu also supports reading GeoTIFF files containing stride tiles, but the performance may be worse than reading COG files.
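To make the tile-shape argument concrete, the sketch below counts how many stored tiles a rectangular window overlaps and how many pixels must be fetched to serve it. It is a simplified model with hypothetical function names. It assumes every tile is full size (edge tiles are not clipped) and that a tile is always fetched whole.

```python
def tiles_touched(x0, y0, width, height, tile_w, tile_h):
    """Number of tiles overlapped by the window at (x0, y0) of the given size."""
    cols = (x0 + width - 1) // tile_w - x0 // tile_w + 1
    rows = (y0 + height - 1) // tile_h - y0 // tile_h + 1
    return cols * rows


def pixels_fetched(x0, y0, width, height, tile_w, tile_h):
    """Pixels read from storage when every overlapped tile is fetched whole."""
    return tiles_touched(x0, y0, width, height, tile_w, tile_h) * tile_w * tile_h


# A 256x256 window at an unaligned offset in a raster that is 10000 pixels wide.
window = (1000, 1000, 256, 256)

square_tiles = tiles_touched(*window, tile_w=256, tile_h=256)
square_pixels = pixels_fetched(*window, tile_w=256, tile_h=256)

stride_tiles = tiles_touched(*window, tile_w=10000, tile_h=1)
stride_pixels = pixels_fetched(*window, tile_w=10000, tile_h=1)

print(square_tiles, square_pixels, stride_tiles, stride_pixels)
```

For the same window, the 10000x1 layout touches one tile per row of the window, and each of those tiles spans the full raster width, so most of the fetched pixels fall outside the window.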
Inspecting Tile Size of GeoTIFF Files¶
You can use
gdalinfo to inspect the tile size of GeoTIFF files:
    $ gdalinfo <file_name>
    Driver: GTiff/GeoTIFF
    Files: ...
    ...
    Band 1 Block=256x256 Type=UInt32, ColorInterp=Gray
      Min=0.000 Max=7014.000
      Minimum=0.000, Maximum=7014.000, Mean=103.824, StdDev=330.999
      NoData Value=4294967295
      Overviews: 180410x90000, 90205x45000, 45103x22500, 22552x11250, 11276x5625, 5638x2813, 2819x1407, 1410x704, 705x352, 353x176, 177x88
      Metadata:
        STATISTICS_APPROXIMATE=YES
        STATISTICS_MAXIMUM=7014
        STATISTICS_MEAN=103.82385085182
        STATISTICS_MINIMUM=0
        STATISTICS_STDDEV=330.99937191441
        STATISTICS_VALID_PERCENT=39.95
The tile size is
256x256 in this example. You can find this information in the
Band 1 section of the output.
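When many files need to be checked, the block size can be extracted from the text output programmatically. The following is a minimal sketch that assumes the band line has the same shape as in the example above. The function name and the sample text are made up for illustration.

```python
import re

# Matches lines such as "Band 1 Block=256x256 Type=UInt32, ColorInterp=Gray".
BLOCK_PATTERN = re.compile(r"^Band (\d+) Block=(\d+)x(\d+)", re.MULTILINE)


def block_sizes(gdalinfo_text):
    """Map each band number to its (block_width, block_height)."""
    return {
        int(band): (int(width), int(height))
        for band, width, height in BLOCK_PATTERN.findall(gdalinfo_text)
    }


# Sample text shaped like the gdalinfo output shown above.
sample = """Driver: GTiff/GeoTIFF
Band 1 Block=256x256 Type=UInt32, ColorInterp=Gray
  Min=0.000 Max=7014.000
"""

sizes = block_sizes(sample)
print(sizes)
```

In practice the text would come from running the tool, for example with `subprocess.run(["gdalinfo", path], capture_output=True, text=True, check=True).stdout`.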
Converting GeoTIFF to COG¶
If you have GeoTIFF files that are not COG, you can convert them to COG using
gdal_translate:
    # TILED=YES enables tiling; BLOCKXSIZE and BLOCKYSIZE set the tile width and height.
    gdal_translate \
      -co TILED=YES \
      -co BLOCKXSIZE=256 \
      -co BLOCKYSIZE=256 \
      <path/to/input/file> \
      <path/to/output/file>
You can also specify
-co COMPRESS=DEFLATE or
-co COMPRESS=LZW to enable compression. Please refer to GDAL COG documentation for more information.
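For converting many files, the command can be assembled in a script. The sketch below only builds the argument list from the options mentioned above and does not run GDAL. The function name and the file names are hypothetical.

```python
import shlex


def translate_command(src, dst, block_size=256, compress=None):
    """Build a gdal_translate argument list that writes a tiled GeoTIFF."""
    cmd = [
        "gdal_translate",
        "-co", "TILED=YES",                  # use tiling
        "-co", f"BLOCKXSIZE={block_size}",   # tile width
        "-co", f"BLOCKYSIZE={block_size}",   # tile height
    ]
    if compress is not None:
        cmd += ["-co", f"COMPRESS={compress}"]  # for example DEFLATE or LZW
    cmd += [src, dst]
    return cmd


cmd = translate_command("input.tif", "output.tif", compress="DEFLATE")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where GDAL is installed.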