Skip to content

Load geospatial data

Load data from txt files

Assume we have a WKT file, namely usa-county.tsv, at Path /Download/usa-county.tsv as follows:

POLYGON (..., ...)  Cuming County
POLYGON (..., ...)  Wahkiakum County
POLYGON (..., ...)  De Baca County
POLYGON (..., ...)  Lancaster County
The file may have many other columns.

Use the following code to load the data and create a raw DataFrame:

var rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
Dataset<Row> rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

The output will be like this:

|                 _c0|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|null| null|null|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|null| null|null|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|null| null|null|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06| H1|G4020| 339|30700|null|   A|2169240202|22877180|+40.7835474|-096.6886584|

Create a Geometry type column

All geometrical operations in SedonaSQL are on Geometry type objects. Therefore, before any kind of queries, you need to create a Geometry type column on a DataFrame.

SELECT ST_GeomFromWKT(_c0) AS countyshape, _c1, _c2

You can select many other attributes to compose this spatialdDf. The output will be something like this:

|                 countyshape|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|null| null|null|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|null| null|null|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|null| null|null|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06| H1|G4020| 339|30700|null|   A|2169240202|22877180|+40.7835474|-096.6886584|

Although it looks same with the input, but actually the type of column countyshape has been changed to Geometry type.

To verify this, use the following code to print the schema of the DataFrame:

spatialDf.printSchema()

The output will be like this:

root
 |-- countyshape: geometry (nullable = false)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)

Note

SedonaSQL provides lots of functions to create a Geometry column, please read SedonaSQL API.

Load MultiLine GeoJSON FeatureCollection

Support we have a GeoJSON FeatureCollection file as follows. This entire file is considered as a single GeoJSON FeatureCollection object.

{ "type": "FeatureCollection",
    "features": [
      { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}
        },
      { "type": "Feature",
        "geometry": {
          "type": "LineString",
          "coordinates": [
            [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
            ]
          },
        "properties": {
          "prop0": "value1",
          "prop1": 0.0
          }
        },
      { "type": "Feature",
         "geometry": {
           "type": "Polygon",
           "coordinates": [
             [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
               [100.0, 1.0], [100.0, 0.0] ]
             ]
         },
         "properties": {
           "prop0": "value2",
           "prop1": {"this": "that"}
           }
         }
       ]
}

Spark SQL's built-in JSON data source supports reading GeoJSON data. To ensure proper parsing of the geometry property, we can define a schema with the geometry property set to type 'string'. This prevents Spark from interpreting the property and allows us to use the ST_GeomFromGeoJSON function for accurate geometry parsing. We also need to set the multiLine option to true.

val schema = "type string, features array<struct<type string, geometry string, properties map<string, string>>>"

val df = sedona.read.schema(schema).option("multiLine", "true").json("PATH/TO/MYFILE.json")
val parsedDf = df.selectExpr("explode(features) as features").select("features.*")
        .withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)"))
        .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

parsedDf.show()
parsedDf.printSchema()
String schema = "type string, features array<struct<type string, geometry string, properties map<string, string>>>";

Dataset<Row> df = sedona.read.schema(schema).json(geojson_path)
 .selectExpr("explode(features) as features") // Explode the envelope to get one feature per row.
 .select("features.*") // Unpack the features struct.
 .withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)")) // Convert the geometry string.
 .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

df.show();
df.printSchema();
schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>";

df = sedona.read.json(geojson_path, schema=schema)
 .selectExpr("explode(features) as features") # Explode the envelope to get one feature per row.
 .select("features.*") # Unpack the features struct.
 .withColumn("geometry", f.expr("ST_GeomFromGeoJSON(geometry)")) # Convert the geometry string.
 .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

The output is as follows:

+--------------------+------+
|            geometry| prop0|
+--------------------+------+
|     POINT (102 0.5)|value0|
|LINESTRING (102 0...|value1|
|POLYGON ((100 0, ...|value2|
+--------------------+------+

root
 |-- geometry: geometry (nullable = false)
 |-- prop0: string (nullable = true)

Load Single Line GeoJSON Features

Suppose we have a single-line GeoJSON Features dataset as follows. Each line is a single GeoJSON Feature.

{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}

Spark SQL's built-in JSON data source supports reading GeoJSON data. To ensure proper parsing of the geometry property, we can define a schema with the geometry property set to type 'string'. This prevents Spark from interpreting the property and allows us to use the ST_GeomFromGeoJSON function for accurate geometry parsing.

val schema = "type string, geometry string, properties map<string, string>"

val df = sedona.read.schema(schema).json("testOutput.json")
   .withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)"))
   .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()
String schema = "type string, geometry string, properties map<string, string>"

Dataset<Row> df = sedona.read.schema(schema).json("PATH/TO/MYFILE.json")
   .withColumn("geometry", expr("ST_GeomFromGeoJSON(geometry)"))
   .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()
schema = "type string, geometry string, properties map<string, string>"

df = sedona.read.schema(schema).json("PATH/TO/MYFILE.json")
   .withColumn("geometry", f.expr("ST_GeomFromGeoJSON(geometry)"))
   .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

The output is as follows:

+--------------------+------+
|            geometry| prop0|
+--------------------+------+
|     POINT (102 0.5)|value0|
|LINESTRING (102 0...|value1|
|POLYGON ((100 0, ...|value2|
+--------------------+------+

root
 |-- geometry: geometry (nullable = false)
 |-- prop0: string (nullable = true)

Load Shapefile

val shapefileInputLocation="/Download/myshapefile"
val spatialRDD = ShapefileReader.readToGeometryRDD(sedona.sparkContext, shapefileInputLocation)
var spatialDf = Adapter.toDf(spatialRDD, sedona)
String shapefileInputLocation="/Download/myshapefile"
SpatialRDD spatialRDD = ShapefileReader.readToGeometryRDD(sedona.sparkContext, shapefileInputLocation)
Dataset<Row> spatialDf = Adapter.toDf(spatialRDD, sedona)
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter

ShapefileReader.readToGeometryRDD(sc, shape_file_location)
spatialDf = Adapter.toDf(spatialRDD, sedona)

Note

The file extensions of .shp, .shx, .dbf must be in lowercase. Assume you have a shape file called myShapefile, the file structure should be like this:

- shapefile1
- shapefile2
- myshapefile
 - myshapefile.shp
 - myshapefile.shx
 - myshapefile.dbf
 - myshapefile...
 - ...

If the file you are reading contains non-ASCII characters you'll need to explicitly set the encoding via sedona.global.charset system property before creating your Spark context.

Example:

System.setProperty("sedona.global.charset", "utf8")

Load GeoParquet

Sedona natively supports loading GeoParquet file. Sedona will infer geometry fields using the "geo" metadata in GeoParquet files.

val df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
df.printSchema()
Dataset<Row> df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
df.printSchema()
df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
df.printSchema()

The output will be as follows:

root
 |-- pop_est: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_a3: string (nullable = true)
 |-- gdp_md_est: double (nullable = true)
 |-- geometry: geometry (nullable = true)

Sedona supports spatial predicate push-down for GeoParquet files, please refer to the SedonaSQL API documentation for details.

Load data from JDBC data sources (PostGIS)

The 'query' option in Spark SQL's JDBC data source can be used to convert geometry columns to a format that Sedona can interpret. This should work for most spatial JDBC data sources. For Postgis there is no need to add a query to convert geometry types since it's already using EWKB as it's wire format.

// For any JDBC data source, including Postgis.
val df = sedona.read.format("jdbc")
    // Other options.
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))

// This is a simplified version that works for Postgis.
val df = sedona.read.format("jdbc")
    // Other options.
    .option("dbtable", "my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))
// For any JDBC data source, including Postgis.
Dataset<Row> df = sedona.read().format("jdbc")
    // Other options.
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))

// This is a simplified version that works for Postgis.
Dataset<Row> df = sedona.read().format("jdbc")
    // Other options.
    .option("dbtable", "my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))
# For any JDBC data source, including Postgis.
df = (sedona.read.format("jdbc")
    # Other options.
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))

# This is a simplified version that works for Postgis.
df = (sedona.read.format("jdbc")
    # Other options.
    .option("dbtable", "my_table")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))

Load data from GeoPandas file reader

Note

Sedona Python does not work with Shapely 2.0+ APIs. We have installed GeoPandas and Shapely (1.X) in Wherotbos managed environments for you.

Sedona Python has implemented serializers and deserializers which allows to convert Sedona Geometry objects into Shapely BaseGeometry objects. Based on that it is possible to load the data with geopandas from file and create Spark DataFrame based on GeoDataFrame object.

import geopandas as gpd

gdf = gpd.read_file("<some path>.shp")

df = spark.createDataFrame(gdf)

Load data from Python Shapely

Note

Sedona Python does not work with Shapely 2.0+ APIs. We have installed GeoPandas and Shapely (1.X) in Wherotbos managed environments for you.

The following Shapely geometry types are supported:

  • Point
  • MultiPoint
  • LineString
  • MultiLinestring
  • Polygon
  • MultiPolygon

Create a schema

To create Spark DataFrame from Shapely objects, you need to first create a schema. Please use GeometryType from sedona.sql.types module.

from pyspark.sql.types import IntegerType, StructField, StructType

from sedona.sql.types import GeometryType

schema = StructType(
    [
        StructField("id", IntegerType(), False),
        StructField("geom", GeometryType(), False)
    ]
)

Create Shapely geometries

Now you can create Shapely geometries and convert them to a DataFrame.

from shapely.geometry import Point

data = [
    [1, Point(21.0, 52.0)],
    [1, Point(23.0, 42.0)],
    [1, Point(26.0, 32.0)]
]


gdf = spark.createDataFrame(
    data,
    schema
)

gdf.show()
+---+-------------+
| id|         geom|
+---+-------------+
|  1|POINT (21 52)|
|  1|POINT (23 42)|
|  1|POINT (26 32)|
+---+-------------+
from shapely.geometry import MultiPoint

data = [
    [1, MultiPoint([[19.511463, 51.765158], [19.446408, 51.779752]])]
]

gdf = spark.createDataFrame(
    data,
    schema
).show(1, False)
+---+---------------------------------------------------------+
|id |geom                                                     |
+---+---------------------------------------------------------+
|1  |MULTIPOINT ((19.511463 51.765158), (19.446408 51.779752))|
+---+---------------------------------------------------------+
from shapely.geometry import LineString

line = [(40, 40), (30, 30), (40, 20), (30, 10)]

data = [
    [1, LineString(line)]
]

gdf = spark.createDataFrame(
    data,
    schema
)

gdf.show(1, False)
+---+--------------------------------+
|id |geom                            |
+---+--------------------------------+
|1  |LINESTRING (10 10, 20 20, 10 40)|
+---+--------------------------------+
from shapely.geometry import MultiLineString

line1 = [(10, 10), (20, 20), (10, 40)]
line2 = [(40, 40), (30, 30), (40, 20), (30, 10)]

data = [
    [1, MultiLineString([line1, line2])]
]

gdf = spark.createDataFrame(
    data,
    schema
)

gdf.show(1, False)
+---+---------------------------------------------------------------------+
|id |geom                                                                 |
+---+---------------------------------------------------------------------+
|1  |MULTILINESTRING ((10 10, 20 20, 10 40), (40 40, 30 30, 40 20, 30 10))|
+---+---------------------------------------------------------------------+
from shapely.geometry import Polygon

polygon = Polygon(
    [
         [19.51121, 51.76426],
         [19.51056, 51.76583],
         [19.51216, 51.76599],
         [19.51280, 51.76448],
         [19.51121, 51.76426]
    ]
)

data = [
    [1, polygon]
]

gdf = spark.createDataFrame(
    data,
    schema
)

gdf.show(1, False)
+---+--------------------------------------------------------------------------------------------------------+
|id |geom                                                                                                    |
+---+--------------------------------------------------------------------------------------------------------+
|1  |POLYGON ((19.51121 51.76426, 19.51056 51.76583, 19.51216 51.76599, 19.5128 51.76448, 19.51121 51.76426))|
+---+--------------------------------------------------------------------------------------------------------+
from shapely.geometry import MultiPolygon

exterior_p1 = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]
interior_p1 = [(1, 1), (1, 1.5), (1.5, 1.5), (1.5, 1), (1, 1)]

exterior_p2 = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

polygons = [
    Polygon(exterior_p1, [interior_p1]),
    Polygon(exterior_p2)
]

data = [
    [1, MultiPolygon(polygons)]
]

gdf = spark.createDataFrame(
    data,
    schema
)

gdf.show(1, False)
+---+----------------------------------------------------------------------------------------------------------+
|id |geom                                                                                                      |
+---+----------------------------------------------------------------------------------------------------------+
|1  |MULTIPOLYGON (((0 0, 0 2, 2 2, 2 0, 0 0), (1 1, 1.5 1, 1.5 1.5, 1 1.5, 1 1)), ((0 0, 0 1, 1 1, 1 0, 0 0)))|
+---+----------------------------------------------------------------------------------------------------------+

Confirm the DataFrame structure

gdf.printSchema()
root
 |-- id: integer (nullable = false)
 |-- geom: geometry (nullable = false)

Last update: August 30, 2023 18:56:20