Apache Sedona™ is a prime example of a distributed engine built on top of Spark, specifically designed for geographic data processing.
The home page describes Apache Sedona (https://sedona.apache.org/) as:
a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark, Apache Flink, and Snowflake, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines.
In this notebook we are going to execute a basic Sedona demonstration using PySpark. The Sedona notebook starts below at Apache Sedona Core demo.
To start with, we are going to install apache-sedona and PySpark, making sure that we have the desired Spark version. The required packages are specified in this Pipfile under [packages]:
[packages]
pandas = "<=1.5.3"
geopandas = "*"
numpy = "<2"
shapely = ">=1.7.0"
pyspark = ">=2.3.0"
attrs = "*"
pyarrow = "*"
keplergl = "==0.3.2"
pydeck = "===0.8.0"
rasterio = ">=1.2.10"
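If you manage the environment with pipenv, everything under [packages] resolves in one step; below, however, we install the key packages individually with pip:
pipenv install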
Install Apache Sedona without Spark. To install Spark as well you can use pip install apache-sedona[spark], but we chose to use the Spark engine that comes with PySpark.
!pip install apache-sedona
Collecting apache-sedona Downloading apache_sedona-1.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB) Requirement already satisfied: attrs in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from apache-sedona) (24.2.0) Collecting shapely>=1.7.0 (from apache-sedona) Downloading shapely-2.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB) Collecting rasterio>=1.2.10 (from apache-sedona) Downloading rasterio-1.3.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB) Collecting affine (from rasterio>=1.2.10->apache-sedona) Downloading affine-2.4.0-py3-none-any.whl.metadata (4.0 kB) Requirement already satisfied: certifi in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (2024.8.30) Collecting click>=4.0 (from rasterio>=1.2.10->apache-sedona) Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB) Collecting cligj>=0.5 (from rasterio>=1.2.10->apache-sedona) Downloading cligj-0.7.2-py3-none-any.whl.metadata (5.0 kB) Requirement already satisfied: numpy in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (1.24.4) Collecting snuggs>=1.4.1 (from rasterio>=1.2.10->apache-sedona) Downloading snuggs-1.4.7-py3-none-any.whl.metadata (3.4 kB) Collecting click-plugins (from rasterio>=1.2.10->apache-sedona) Downloading click_plugins-1.1.1-py2.py3-none-any.whl.metadata (6.4 kB) Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (56.0.0) Requirement already satisfied: importlib-metadata in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (8.5.0) Requirement already satisfied: pyparsing>=2.1.6 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from snuggs>=1.4.1->rasterio>=1.2.10->apache-sedona) (3.1.4) Requirement already satisfied: zipp>=3.20 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from importlib-metadata->rasterio>=1.2.10->apache-sedona) (3.20.2) Downloading apache_sedona-1.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB) Downloading rasterio-1.3.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.8 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.8/21.8 MB 136.6 MB/s eta 0:00:00 Downloading shapely-2.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 126.7 MB/s eta 0:00:00 Downloading click-8.1.7-py3-none-any.whl (97 kB) Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB) Downloading snuggs-1.4.7-py3-none-any.whl (5.4 kB) Downloading affine-2.4.0-py3-none-any.whl (15 kB) Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB) Installing collected packages: snuggs, shapely, click, affine, cligj, click-plugins, rasterio, apache-sedona Successfully installed affine-2.4.0 apache-sedona-1.6.1 click-8.1.7 click-plugins-1.1.1 cligj-0.7.2 rasterio-1.3.11 shapely-2.0.6 snuggs-1.4.7
For the sake of this tutorial we are going to use the Spark engine that is included in the PySpark distribution. Since Sedona needs Spark 3.4.0, we need to make sure that we choose the matching PySpark version.
!pip install pyspark==3.4.0
Collecting pyspark==3.4.0 Downloading pyspark-3.4.0.tar.gz (310.8 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 135.8 MB/s eta 0:00:00 Installing build dependencies ... - \ | / done Getting requirements to build wheel ... - done Preparing metadata (pyproject.toml) ... - done Collecting py4j==0.10.9.7 (from pyspark==3.4.0) Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB) Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB) Building wheels for collected packages: pyspark Building wheel for pyspark (pyproject.toml) ... - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - done Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317121 sha256=3fa98b3b6e594f997111ac6bda1a8c0447aad8149992847de7d91b2071ecbd91 Stored in directory: /home/runner/.cache/pip/wheels/27/3e/a7/888155c6a7f230b13a394f4999b90fdfaed00596c68d3de307 Successfully built pyspark Installing collected packages: py4j, pyspark Successfully installed py4j-0.10.9.7 pyspark-3.4.0
Verify that PySpark is using Spark version 3.4.0.
!pyspark --version
Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.4.0 /_/ Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 11.0.24 Branch HEAD Compiled by user xinrong.meng on 2023-04-07T02:18:01Z Revision 87a5442f7ed96b11051d8a9333476d080054e5a0 Url https://github.com/apache/spark Type --help for more information.
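The same check can be done from Python itself (a quick sanity check, not part of the original notebook):
import pyspark
print(pyspark.__version__)  # expect 3.4.0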
The libraries numpy, pandas, geopandas, and shapely are available by default on Google Colab. We install geopandas and shapely here anyway so the notebook also runs in environments where they are not preinstalled.
!pip install geopandas
Collecting geopandas Downloading geopandas-0.13.2-py3-none-any.whl.metadata (1.5 kB) Collecting fiona>=1.8.19 (from geopandas) Downloading fiona-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (56 kB) Requirement already satisfied: packaging in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from geopandas) (24.1) Requirement already satisfied: pandas>=1.1.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from geopandas) (2.0.3) Collecting pyproj>=3.0.1 (from geopandas) Downloading pyproj-3.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (28 kB) Requirement already satisfied: shapely>=1.7.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from geopandas) (2.0.6) Requirement already satisfied: attrs>=19.2.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (24.2.0) Requirement already satisfied: certifi in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (2024.8.30) Requirement already satisfied: click~=8.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (8.1.7) Requirement already satisfied: click-plugins>=1.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (1.1.1) Requirement already satisfied: cligj>=0.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (0.7.2) Requirement already satisfied: importlib-metadata in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (8.5.0) Requirement already satisfied: python-dateutil>=2.8.2 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (2024.2) Requirement already satisfied: tzdata>=2022.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (2024.2) Requirement already satisfied: numpy>=1.20.3 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (1.24.4) Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas>=1.1.0->geopandas) (1.16.0) Requirement already satisfied: zipp>=3.20 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from importlib-metadata->fiona>=1.8.19->geopandas) (3.20.2) Downloading geopandas-0.13.2-py3-none-any.whl (1.1 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 31.1 MB/s eta 0:00:00 Downloading fiona-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 143.0 MB/s eta 0:00:00 Downloading pyproj-3.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 137.2 MB/s eta 0:00:00 Installing collected packages: pyproj, fiona, geopandas Successfully installed fiona-1.10.1 geopandas-0.13.2 pyproj-3.5.0
!pip install shapely
Requirement already satisfied: shapely in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (2.0.6) Requirement already satisfied: numpy<3,>=1.14 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from shapely) (1.24.4)
We are going to download the data from Sedona's GitHub repository.
import json
import os
import urllib.request
import base64

def download_blob(url, path, localfile):
    print(f"Downloading blob to localfile: {localfile} from {url}")
    # Fetch the JSON data from the URL
    with urllib.request.urlopen(url) as response:
        json_data = response.read().decode('utf-8')
    # Load the JSON data into a dictionary
    data = json.loads(json_data)
    # Extract the Base64 content
    base64_content = data['content']
    # Decode the Base64 content
    decoded_content = base64.b64decode(base64_content)
    try:
        # Attempt to decode as UTF-8 text
        decoded_text = decoded_content.decode('utf-8')
        with open(localfile, 'w') as f:
            f.write(decoded_text)
    except UnicodeDecodeError:
        # If text decoding fails, save as binary
        with open(localfile, 'wb') as f:
            f.write(decoded_content)
def download_gitpath(url, path, localpath):
    """
    Recursively downloads a specific path (directory or file) from a GitHub repository using the GitHub API
    and saves it to a specified local directory, preserving the repository's structure.

    Args:
        url (str):
            The GitHub API URL that points to the tree structure of the repository.
            This URL should provide a JSON response containing the file and directory information for the tree.
        path (str):
            The path within the repository that you want to download.
            This path is used to filter the relevant files and directories within the tree structure.
            Example: "src/utils" would download everything under the `src/utils` directory in the repository.
        localpath (str):
            The local directory path where the downloaded files and directories will be saved.
            If the directory does not exist, it will be created. The repository's structure will be mirrored in this location.

    Returns:
        None

    Example:
        tree_url = "https://api.github.com/repos/{owner}/{repo_name}/git/trees/master?recursive=true"
        tree_url = "https://api.github.com/repos/apache/spark/git/trees/master?recursive=true"
        repo_path = "data/mllib/images"
        local_dir = "./downloaded_images"
        download_gitpath(tree_url, repo_path, local_dir)

    How it works:
        1. The function fetches the tree of files and directories from the GitHub API using the provided URL.
        2. It filters the tree data to only include items that fall under the specified `path`.
        3. Files (blobs) are downloaded and saved locally.
        4. Directories (trees) are handled recursively by creating local directories and downloading their contents.

    Notes:
        - Ensure that you have access to the repository's GitHub API, especially if it's private (you may need a token).
        - This function will handle both text and binary files appropriately.
        - Error handling for network and API issues is minimal and could be enhanced.
    """
    #print(f"Processing path: {path} into local directory: {localpath}")
    # Create the local directory if it doesn't exist
    os.makedirs(localpath, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        json_data = response.read().decode('utf-8')
    # Load the JSON data into a dictionary
    data = json.loads(json_data)['tree']
    # Filter and map the paths to their URLs
    items = {x['path']: (x['url'], x['type']) for x in data if x['path'].startswith(path)}
    for item_path, (item_url, item_type) in items.items():
        # Handle blobs (files)
        if item_type == 'blob':
            local_file_path = os.path.join(localpath, os.path.relpath(item_path, path))
            download_blob(item_url, item_path, local_file_path)
        # Handle trees (directories)
        elif item_type == 'tree':
            new_local_dir = os.path.join(localpath, os.path.relpath(item_path, path))
            os.makedirs(new_local_dir, exist_ok=True)
            download_gitpath(item_url, item_path, new_local_dir)
url = 'https://api.github.com/repos/jiayuasu/sedona/git/trees/master?recursive=true'
path = 'docs/usecases/data/'
# download only if data folder does not exist
if not os.path.exists('./data'):
    os.makedirs('./data')
    download_gitpath(url, path, './data')
Downloading blob to localfile: ./data/arealm-small.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/1f090b8189a075988ddc1816aa20de77bd9c2905 Downloading blob to localfile: ./data/county_small.tsv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6b90b565b6b30f5a88c203159b9b080f0f6520f6 Downloading blob to localfile: ./data/county_small_wkb.tsv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6b8760e24b51e89802c724c3d8616e8d99d5c0d0 Downloading blob to localfile: ./data/gis_osm_pois_free_1.cpg from https://api.github.com/repos/jiayuasu/sedona/git/blobs/7edc66b06a9d03549d75d2c9cbb89f83c611ddd3 Downloading blob to localfile: ./data/gis_osm_pois_free_1.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/61b1f3da529a27d9ba3560e8981f200e4f0977cb Downloading blob to localfile: ./data/gis_osm_pois_free_1.prj from https://api.github.com/repos/jiayuasu/sedona/git/blobs/8f73f480ffe4929be342f332ea5ee69fb0ef9fb4 Downloading blob to localfile: ./data/gis_osm_pois_free_1.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/fede38f6c34d2ec344658543dac87abb7ba84ee0 Downloading blob to localfile: ./data/gis_osm_pois_free_1.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/577b40830e9845c58f36e524792c598ac23cc835 Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.cpg from https://api.github.com/repos/jiayuasu/sedona/git/blobs/7edc66b06a9d03549d75d2c9cbb89f83c611ddd3 Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/03caa768d026ba1f61bf123340dc5a0600d43c09 Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.prj from https://api.github.com/repos/jiayuasu/sedona/git/blobs/8f73f480ffe4929be342f332ea5ee69fb0ef9fb4 Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/e630c6f44eb605f71764c15130517a7767ccf520 Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/76a04349bbf11e581e1aa8c41b7ccac0eb548959 Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/bdf30050021a27e1f5bc504b6454e2540fbe253c Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.prj from https://api.github.com/repos/jiayuasu/sedona/git/blobs/40dd8c6cdcbabeb4297cb137424805c3f3472ee2 Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/d1417b2ed8cf7ed53b816c07acac1fcf67300a6f Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/b99c0fc7d5578b015ee1ccfd60d66cc0914b9d14 Downloading blob to localfile: ./data/polygon/map.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/a71205ad62a1485d700625f73464bffc226b96be Downloading blob to localfile: ./data/polygon/map.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/ab10fd7037950e9d41a4dcd7e10fcba72ad00a8d Downloading blob to localfile: ./data/polygon/map.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/53a25eb312a5ae471370d248aa46698be33629d9 Downloading blob to localfile: ./data/primaryroads-linestring.csv from 
https://api.github.com/repos/jiayuasu/sedona/git/blobs/6e57b87527b6c9d23327905d0e8c565bce6b0a0a Downloading blob to localfile: ./data/primaryroads-polygon.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/eeb2f10678eb0c13d9ace21e5de9b597cc8355df Downloading blob to localfile: ./data/raster/T21HUB_4704_4736_8224_8256.tif from https://api.github.com/repos/jiayuasu/sedona/git/blobs/776d75342642fd994a7429c8145b651e9e45ed82 Downloading blob to localfile: ./data/raster/test1.tiff from https://api.github.com/repos/jiayuasu/sedona/git/blobs/bebd68232e85a348c1bc1d067ddc95d38ad66c0f Downloading blob to localfile: ./data/raster/test5.tiff from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6caabeadae31129afc2c4309348c5ff6a10d0d24 Downloading blob to localfile: ./data/raster/vya_T21HUB_992_1024_4352_4384.tif from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6df2b0a467b56e0fb49bd7844a7b8edcb0dcb355 Downloading blob to localfile: ./data/testPolygon.json from https://api.github.com/repos/jiayuasu/sedona/git/blobs/b5802075d13b1f1bcca169f8fdb9eca94b84e361 Downloading blob to localfile: ./data/testpoint.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/05ff16a1be4e774d5954e83ddeef23f931bf5f75 Downloading blob to localfile: ./data/zcta510-small.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/ced508092e7bebc389af9438b1ae2d5b5e1e2808
Verify the presence of data in the designated data folder.
!ls -lR data
data: total 16716 -rw-r--r-- 1 runner docker 200449 Oct 14 20:42 arealm-small.csv -rw-r--r-- 1 runner docker 4182112 Oct 14 20:42 county_small.tsv -rw-r--r-- 1 runner docker 6379960 Oct 14 20:42 county_small_wkb.tsv -rw-r--r-- 1 runner docker 6 Oct 14 20:42 gis_osm_pois_free_1.cpg -rw-r--r-- 1 runner docker 1841000 Oct 14 20:42 gis_osm_pois_free_1.dbf -rw-r--r-- 1 runner docker 144 Oct 14 20:42 gis_osm_pois_free_1.prj -rw-r--r-- 1 runner docker 360544 Oct 14 20:42 gis_osm_pois_free_1.shp -rw-r--r-- 1 runner docker 103084 Oct 14 20:42 gis_osm_pois_free_1.shx drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 ne_50m_admin_0_countries_lakes drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 ne_50m_airports drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 polygon -rw-r--r-- 1 runner docker 1132600 Oct 14 20:42 primaryroads-linestring.csv -rw-r--r-- 1 runner docker 1399092 Oct 14 20:42 primaryroads-polygon.csv drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 raster -rw-r--r-- 1 runner docker 1324354 Oct 14 20:42 testPolygon.json -rw-r--r-- 1 runner docker 12993 Oct 14 20:42 testpoint.csv -rw-r--r-- 1 runner docker 129482 Oct 14 20:42 zcta510-small.csv data/ne_50m_admin_0_countries_lakes: total 2164 -rw-r--r-- 1 runner docker 6 Oct 14 20:42 ne_50m_admin_0_countries_lakes.cpg -rw-r--r-- 1 runner docker 546979 Oct 14 20:42 ne_50m_admin_0_countries_lakes.dbf -rw-r--r-- 1 runner docker 144 Oct 14 20:42 ne_50m_admin_0_countries_lakes.prj -rw-r--r-- 1 runner docker 1652184 Oct 14 20:42 ne_50m_admin_0_countries_lakes.shp -rw-r--r-- 1 runner docker 2028 Oct 14 20:42 ne_50m_admin_0_countries_lakes.shx data/ne_50m_airports: total 336 -rw-r--r-- 1 runner docker 326032 Oct 14 20:42 ne_50m_airports.dbf -rw-r--r-- 1 runner docker 148 Oct 14 20:42 ne_50m_airports.prj -rw-r--r-- 1 runner docker 7968 Oct 14 20:42 ne_50m_airports.shp -rw-r--r-- 1 runner docker 2348 Oct 14 20:42 ne_50m_airports.shx data/polygon: total 7344 -rw-r--r-- 1 runner docker 10033 Oct 14 20:42 map.dbf -rw-r--r-- 1 runner docker 7424324 Oct 14 20:42 map.shp -rw-r--r-- 1 runner docker 80100 Oct 14 20:42 map.shx data/raster: total 396 -rw-r--r-- 1 runner docker 6619 Oct 14 20:42 T21HUB_4704_4736_8224_8256.tif -rw-r--r-- 1 runner docker 174803 Oct 14 20:42 test1.tiff -rw-r--r-- 1 runner docker 209199 Oct 14 20:42 test5.tiff -rw-r--r-- 1 runner docker 7689 Oct 14 20:42 vya_T21HUB_992_1024_4352_4384.tif
The notebook is available at the following link: https://github.com/apache/sedona/blob/master/docs/usecases/ApacheSedonaCore.ipynb.
Refer to https://mvnrepository.com/artifact/org.apache.sedona/sedona-spark-3.4 to make sense of package names and versions.
from pyspark.sql import SparkSession
from pyspark import StorageLevel
import geopandas as gpd
import pandas as pd
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType
from pyspark.sql.types import LongType
from shapely.geometry import Point
from shapely.geometry import Polygon
from sedona.spark import *
from sedona.core.geom.envelope import Envelope
Note: the next cell might take a while to execute. Stretch your legs and contemplate the mysteries of the universe in the meantime. Hang tight!
config = SedonaContext.builder(). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-3.4_2.12:1.6.0,'
           'org.datasyslab:geotools-wrapper:1.6.0-28.2,'
           'uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4'). \
    getOrCreate()
sedona = SedonaContext.create(config)
:: loading settings :: url = jar:file:/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/runner/.ivy2/cache The jars for the packages stored in: /home/runner/.ivy2/jars org.apache.sedona#sedona-spark-3.4_2.12 added as a dependency org.datasyslab#geotools-wrapper added as a dependency uk.co.gresearch.spark#spark-extension_2.12 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-c2fcb9ab-9ede-4abc-bad7-73533a42fa83;1.0 confs: [default] found org.apache.sedona#sedona-spark-3.4_2.12;1.6.0 in central found org.apache.sedona#sedona-common;1.6.0 in central found org.apache.commons#commons-math3;3.6.1 in central found org.locationtech.jts#jts-core;1.19.0 in central found org.wololo#jts2geojson;0.16.1 in central found org.locationtech.spatial4j#spatial4j;0.8 in central found com.google.geometry#s2-geometry;2.0.0 in central found com.google.guava#guava;25.1-jre in central found com.google.code.findbugs#jsr305;3.0.2 in central found org.checkerframework#checker-qual;2.0.0 in central found com.google.errorprone#error_prone_annotations;2.1.3 in central found com.google.j2objc#j2objc-annotations;1.1 in central found org.codehaus.mojo#animal-sniffer-annotations;1.14 in central found com.uber#h3;4.1.1 in central found net.sf.geographiclib#GeographicLib-Java;1.52 in central found com.github.ben-manes.caffeine#caffeine;2.9.2 in central found org.checkerframework#checker-qual;3.10.0 in central found com.google.errorprone#error_prone_annotations;2.5.1 in central found org.apache.sedona#sedona-spark-common-3.4_2.12;1.6.0 in central found commons-lang#commons-lang;2.6 in central found org.scala-lang.modules#scala-collection-compat_2.12;2.5.0 in central found org.beryx#awt-color-factory;1.0.0 in central found org.datasyslab#geotools-wrapper;1.6.0-28.2 in central found uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4 in central found com.github.scopt#scopt_2.12;4.1.0 in central downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-3.4_2.12/1.6.0/sedona-spark-3.4_2.12-1.6.0.jar ... [SUCCESSFUL ] org.apache.sedona#sedona-spark-3.4_2.12;1.6.0!sedona-spark-3.4_2.12.jar (11ms) downloading https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.6.0-28.2/geotools-wrapper-1.6.0-28.2.jar ... [SUCCESSFUL ] org.datasyslab#geotools-wrapper;1.6.0-28.2!geotools-wrapper.jar (272ms) downloading https://repo1.maven.org/maven2/uk/co/gresearch/spark/spark-extension_2.12/2.11.0-3.4/spark-extension_2.12-2.11.0-3.4.jar ... [SUCCESSFUL ] uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4!spark-extension_2.12.jar (11ms) downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-common/1.6.0/sedona-common-1.6.0.jar ... [SUCCESSFUL ] org.apache.sedona#sedona-common;1.6.0!sedona-common.jar (13ms) downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-common-3.4_2.12/1.6.0/sedona-spark-common-3.4_2.12-1.6.0.jar ... [SUCCESSFUL ] org.apache.sedona#sedona-spark-common-3.4_2.12;1.6.0!sedona-spark-common-3.4_2.12.jar (37ms) downloading https://repo1.maven.org/maven2/org/locationtech/jts/jts-core/1.19.0/jts-core-1.19.0.jar ... [SUCCESSFUL ] org.locationtech.jts#jts-core;1.19.0!jts-core.jar(bundle) (22ms) downloading https://repo1.maven.org/maven2/org/wololo/jts2geojson/0.16.1/jts2geojson-0.16.1.jar ... [SUCCESSFUL ] org.wololo#jts2geojson;0.16.1!jts2geojson.jar (7ms) downloading https://repo1.maven.org/maven2/org/scala-lang/modules/scala-collection-compat_2.12/2.5.0/scala-collection-compat_2.12-2.5.0.jar ... 
[SUCCESSFUL ] org.scala-lang.modules#scala-collection-compat_2.12;2.5.0!scala-collection-compat_2.12.jar (14ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.6.1/commons-math3-3.6.1.jar ... [SUCCESSFUL ] org.apache.commons#commons-math3;3.6.1!commons-math3.jar (29ms) downloading https://repo1.maven.org/maven2/org/locationtech/spatial4j/spatial4j/0.8/spatial4j-0.8.jar ... [SUCCESSFUL ] org.locationtech.spatial4j#spatial4j;0.8!spatial4j.jar(bundle) (9ms) downloading https://repo1.maven.org/maven2/com/google/geometry/s2-geometry/2.0.0/s2-geometry-2.0.0.jar ... [SUCCESSFUL ] com.google.geometry#s2-geometry;2.0.0!s2-geometry.jar (15ms) downloading https://repo1.maven.org/maven2/com/uber/h3/4.1.1/h3-4.1.1.jar ... [SUCCESSFUL ] com.uber#h3;4.1.1!h3.jar (21ms) downloading https://repo1.maven.org/maven2/net/sf/geographiclib/GeographicLib-Java/1.52/GeographicLib-Java-1.52.jar ... [SUCCESSFUL ] net.sf.geographiclib#GeographicLib-Java;1.52!GeographicLib-Java.jar (15ms) downloading https://repo1.maven.org/maven2/com/github/ben-manes/caffeine/caffeine/2.9.2/caffeine-2.9.2.jar ... [SUCCESSFUL ] com.github.ben-manes.caffeine#caffeine;2.9.2!caffeine.jar (17ms) downloading https://repo1.maven.org/maven2/com/google/guava/guava/25.1-jre/guava-25.1-jre.jar ... [SUCCESSFUL ] com.google.guava#guava;25.1-jre!guava.jar(bundle) (31ms) downloading https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar ... [SUCCESSFUL ] com.google.code.findbugs#jsr305;3.0.2!jsr305.jar (7ms) downloading https://repo1.maven.org/maven2/com/google/j2objc/j2objc-annotations/1.1/j2objc-annotations-1.1.jar ... [SUCCESSFUL ] com.google.j2objc#j2objc-annotations;1.1!j2objc-annotations.jar (6ms) downloading https://repo1.maven.org/maven2/org/codehaus/mojo/animal-sniffer-annotations/1.14/animal-sniffer-annotations-1.14.jar ... [SUCCESSFUL ] org.codehaus.mojo#animal-sniffer-annotations;1.14!animal-sniffer-annotations.jar (5ms) downloading https://repo1.maven.org/maven2/org/checkerframework/checker-qual/3.10.0/checker-qual-3.10.0.jar ... [SUCCESSFUL ] org.checkerframework#checker-qual;3.10.0!checker-qual.jar (12ms) downloading https://repo1.maven.org/maven2/com/google/errorprone/error_prone_annotations/2.5.1/error_prone_annotations-2.5.1.jar ... [SUCCESSFUL ] com.google.errorprone#error_prone_annotations;2.5.1!error_prone_annotations.jar (7ms) downloading https://repo1.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar ... [SUCCESSFUL ] commons-lang#commons-lang;2.6!commons-lang.jar (8ms) downloading https://repo1.maven.org/maven2/org/beryx/awt-color-factory/1.0.0/awt-color-factory-1.0.0.jar ... [SUCCESSFUL ] org.beryx#awt-color-factory;1.0.0!awt-color-factory.jar (7ms) downloading https://repo1.maven.org/maven2/com/github/scopt/scopt_2.12/4.1.0/scopt_2.12-4.1.0.jar ... 
[SUCCESSFUL ] com.github.scopt#scopt_2.12;4.1.0!scopt_2.12.jar (10ms) :: resolution report :: resolve 5088ms :: artifacts dl 598ms :: modules in use: com.github.ben-manes.caffeine#caffeine;2.9.2 from central in [default] com.github.scopt#scopt_2.12;4.1.0 from central in [default] com.google.code.findbugs#jsr305;3.0.2 from central in [default] com.google.errorprone#error_prone_annotations;2.5.1 from central in [default] com.google.geometry#s2-geometry;2.0.0 from central in [default] com.google.guava#guava;25.1-jre from central in [default] com.google.j2objc#j2objc-annotations;1.1 from central in [default] com.uber#h3;4.1.1 from central in [default] commons-lang#commons-lang;2.6 from central in [default] net.sf.geographiclib#GeographicLib-Java;1.52 from central in [default] org.apache.commons#commons-math3;3.6.1 from central in [default] org.apache.sedona#sedona-common;1.6.0 from central in [default] org.apache.sedona#sedona-spark-3.4_2.12;1.6.0 from central in [default] org.apache.sedona#sedona-spark-common-3.4_2.12;1.6.0 from central in [default] org.beryx#awt-color-factory;1.0.0 from central in [default] org.checkerframework#checker-qual;3.10.0 from central in [default] org.codehaus.mojo#animal-sniffer-annotations;1.14 from central in [default] org.datasyslab#geotools-wrapper;1.6.0-28.2 from central in [default] org.locationtech.jts#jts-core;1.19.0 from central in [default] org.locationtech.spatial4j#spatial4j;0.8 from central in [default] org.scala-lang.modules#scala-collection-compat_2.12;2.5.0 from central in [default] org.wololo#jts2geojson;0.16.1 from central in [default] uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4 from central in [default] :: evicted modules: org.checkerframework#checker-qual;2.0.0 by [org.checkerframework#checker-qual;3.10.0] in [default] com.google.errorprone#error_prone_annotations;2.1.3 by [com.google.errorprone#error_prone_annotations;2.5.1] in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 25 | 25 | 25 | 2 || 23 | 23 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent-c2fcb9ab-9ede-4abc-bad7-73533a42fa83 confs: [default] 23 artifacts copied, 0 already retrieved (42692kB/72ms) 24/10/14 20:42:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
sc = sedona.sparkContext
sc
config is the Spark session:
type(config)
pyspark.sql.session.SparkSession
We now want to load the CSV file into an Apache Sedona PointRDD. The first rows of data/arealm-small.csv look like this:
!head data/arealm-small.csv
testattribute0,-88.331492,32.324142,testattribute1,testattribute2
testattribute0,-88.175933,32.360763,testattribute1,testattribute2
testattribute0,-88.388954,32.357073,testattribute1,testattribute2
testattribute0,-88.221102,32.35078,testattribute1,testattribute2
testattribute0,-88.323995,32.950671,testattribute1,testattribute2
testattribute0,-88.231077,32.700812,testattribute1,testattribute2
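# PointRDD arguments, in order: the SparkContext, the input path, the offset
# of the first coordinate column (here 1), the input format, a flag to carry
# the remaining attributes along as userData (True), and the number of
# partitions (10).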
point_rdd = PointRDD(sc, "data/arealm-small.csv", 1, FileDataSplitter.CSV, True, 10)
## Getting approximate total count
point_rdd.approximateTotalCount
3000
# getting the boundary for PointRDD or any other SpatialRDD; it returns an Envelope object which inherits from
# shapely.geometry.Polygon
point_rdd.boundary()
# To run the analysis, use the analyze function
point_rdd.analyze()
True
# Finding the boundary envelope for PointRDD or any other SpatialRDD; it returns an Envelope object which inherits from
# shapely.geometry.Polygon
point_rdd.boundaryEnvelope
# Calculate number of records without duplicates
point_rdd.countWithoutDuplicates()
2996
# Getting source epsg code
point_rdd.getSourceEpsgCode()
''
# Getting target epsg code
point_rdd.getTargetEpsgCode()
''
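Both codes are empty because no CRS transformation was registered on the RDD. One can be applied with CRSTransform; a sketch, shown only for illustration since the rest of the demo assumes untransformed coordinates:
# Reproject from WGS84 lon/lat to Web Mercator; afterwards the source and
# target EPSG codes above would be populated.
point_rdd.CRSTransform("epsg:4326", "epsg:3857")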
# Spatial partitioning data
point_rdd.spatialPartitioning(GridType.KDBTREE)
True
The rawSpatialRDD property returns an RDD consisting of GeoData objects, each of which has two attributes: geom and userData. You can apply any operation to those objects, spread across machines.
# take the first element
point_rdd.rawSpatialRDD.take(1)
[Geometry: Point userData: testattribute0 testattribute1 testattribute2]
# collect to Python list
point_rdd.rawSpatialRDD.collect()[:5]
[Geometry: Point userData: testattribute0 testattribute1 testattribute2, Geometry: Point userData: testattribute0 testattribute1 testattribute2, Geometry: Point userData: testattribute0 testattribute1 testattribute2, Geometry: Point userData: testattribute0 testattribute1 testattribute2, Geometry: Point userData: testattribute0 testattribute1 testattribute2]
# apply map functions, for example distance to Point(21, 52)
point_rdd.rawSpatialRDD.map(lambda x: x.geom.distance(Point(21, 52))).take(5)
[111.08786851399313, 110.92828303170774, 111.1385974283527, 110.97450594034112, 110.97122518072091]
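Because this is an ordinary RDD, any other RDD action applies as well; for example (a small sketch, not in the original notebook), the minimum of those distances:
point_rdd.rawSpatialRDD.map(lambda x: x.geom.distance(Point(21, 52))).min()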
point_rdd_to_geo = point_rdd.rawSpatialRDD.map(lambda x: [x.geom, *x.getUserData().split("\t")])
point_gdf = gpd.GeoDataFrame(
point_rdd_to_geo.collect(), columns=["geom", "attr1", "attr2", "attr3"], geometry="geom"
)
point_gdf[:5]
| | geom | attr1 | attr2 | attr3 |
|---|---|---|---|---|
| 0 | POINT (-88.33149 32.32414) | testattribute0 | testattribute1 | testattribute2 |
| 1 | POINT (-88.17593 32.36076) | testattribute0 | testattribute1 | testattribute2 |
| 2 | POINT (-88.38895 32.35707) | testattribute0 | testattribute1 | testattribute2 |
| 3 | POINT (-88.22110 32.35078) | testattribute0 | testattribute1 | testattribute2 |
| 4 | POINT (-88.32399 32.95067) | testattribute0 | testattribute1 | testattribute2 |
# The Adapter allows you to convert geospatial data types introduced with Sedona to other ones
spatial_df = Adapter.toDf(point_rdd, ["attr1", "attr2", "attr3"], sedona)
spatial_df.createOrReplaceTempView("spatial_df")
spatial_gdf = sedona.sql("Select attr1, attr2, attr3, geometry as geom from spatial_df")
spatial_gdf.show(5, False)
+--------------+--------------+--------------+----------------------------+ |attr1 |attr2 |attr3 |geom | +--------------+--------------+--------------+----------------------------+ |testattribute0|testattribute1|testattribute2|POINT (-88.331492 32.324142)| |testattribute0|testattribute1|testattribute2|POINT (-88.175933 32.360763)| |testattribute0|testattribute1|testattribute2|POINT (-88.388954 32.357073)| |testattribute0|testattribute1|testattribute2|POINT (-88.221102 32.35078) | |testattribute0|testattribute1|testattribute2|POINT (-88.323995 32.950671)| +--------------+--------------+--------------+----------------------------+ only showing top 5 rows
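Any Sedona SQL function can now run against the registered view; for example (a sketch with an arbitrarily chosen reference point, not in the original notebook), the three points nearest to (-88.0, 32.5):
sedona.sql("""
    SELECT attr1, ST_Distance(geometry, ST_Point(-88.0, 32.5)) AS dist
    FROM spatial_df
    ORDER BY dist
    LIMIT 3
""").show()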
gpd.GeoDataFrame(spatial_gdf.toPandas(), geometry="geom")[:5]
| | attr1 | attr2 | attr3 | geom |
|---|---|---|---|---|
| 0 | testattribute0 | testattribute1 | testattribute2 | POINT (-88.33149 32.32414) |
| 1 | testattribute0 | testattribute1 | testattribute2 | POINT (-88.17593 32.36076) |
| 2 | testattribute0 | testattribute1 | testattribute2 | POINT (-88.38895 32.35707) |
| 3 | testattribute0 | testattribute1 | testattribute2 | POINT (-88.22110 32.35078) |
| 4 | testattribute0 | testattribute1 | testattribute2 | POINT (-88.32399 32.95067) |
schema = StructType(
[
StructField("geometry", GeometryType(), False),
StructField("attr1", StringType(), False),
StructField("attr2", StringType(), False),
StructField("attr3", StringType(), False),
]
)
geo_df = sedona.createDataFrame(point_rdd_to_geo, schema, verifySchema=False)
gpd.GeoDataFrame(geo_df.toPandas(), geometry="geometry")[:5]
| | geometry | attr1 | attr2 | attr3 |
|---|---|---|---|---|
| 0 | POINT (-88.33149 32.32414) | testattribute0 | testattribute1 | testattribute2 |
| 1 | POINT (-88.17593 32.36076) | testattribute0 | testattribute1 | testattribute2 |
| 2 | POINT (-88.38895 32.35707) | testattribute0 | testattribute1 | testattribute2 |
| 3 | POINT (-88.22110 32.35078) | testattribute0 | testattribute1 | testattribute2 |
| 4 | POINT (-88.32399 32.95067) | testattribute0 | testattribute1 | testattribute2 |
Currently the library supports five typed SpatialRDDs: PointRDD, PolygonRDD, LineStringRDD, RectangleRDD, and CircleRDD. Four of them are constructed below; a CircleRDD sketch follows.
rectangle_rdd = RectangleRDD(sc, "data/zcta510-small.csv", FileDataSplitter.CSV, True, 11)
point_rdd = PointRDD(sc, "data/arealm-small.csv", 1, FileDataSplitter.CSV, False, 11)
polygon_rdd = PolygonRDD(sc, "data/primaryroads-polygon.csv", FileDataSplitter.CSV, True, 11)
linestring_rdd = LineStringRDD(sc, "data/primaryroads-linestring.csv", FileDataSplitter.CSV, True)
rectangle_rdd.analyze()
point_rdd.analyze()
polygon_rdd.analyze()
linestring_rdd.analyze()
True
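The fifth type, CircleRDD, wraps each geometry of an existing SpatialRDD in a circle of a given radius. It is not constructed in the original demo; a minimal sketch, assuming a radius of 0.1 in the data's coordinate units:
# Wrap every point in a circle of radius 0.1; CircleRDDs are mainly used
# as the query side of distance joins (see the sketch further below).
circle_rdd = CircleRDD(point_rdd, 0.1)
circle_rdd.analyze()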
Apache Sedona's spatial partitioning can significantly speed up join queries. Three spatial partitioning methods are available: KDB-Tree, Quad-Tree, and R-Tree. The two SpatialRDDs must be partitioned in the same way.
point_rdd.spatialPartitioning(GridType.KDBTREE)
True
Apache Sedona provides two types of spatial indexes, Quad-Tree and R-Tree. Once you specify an index type, Apache Sedona will build a local tree index on each SpatialRDD partition.
point_rdd.buildIndex(IndexType.RTREE, True)
A spatial join is an operation which combines data based on spatial relations such as intersection or containment.
To use a spatial join, use the JoinQuery object, which implements the methods below:
SpatialJoinQuery(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD
DistanceJoinQuery(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD
spatialJoin(queryWindowRDD: SpatialRDD, objectRDD: SpatialRDD, joinParams: JoinParams) -> RDD
DistanceJoinQueryFlat(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD
SpatialJoinQueryFlat(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD
# partitioning the data
point_rdd.spatialPartitioning(GridType.KDBTREE)
rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())
# building an index
point_rdd.buildIndex(IndexType.RTREE, True)
# Perform Spatial Join Query
result = JoinQuery.SpatialJoinQueryFlat(point_rdd, rectangle_rdd, False, True)
As a result we get an RDD[GeoData, GeoData]. It can be used like any other Python RDD: you can use map, take, collect, and other functions.
result
MapPartitionsRDD[63] at map at FlatPairRddConverter.scala:30
result.take(2)
[[Geometry: Polygon userData: , Geometry: Point userData: ], [Geometry: Polygon userData: , Geometry: Point userData: ]]
result.collect()[:3]
[[Geometry: Polygon userData: , Geometry: Point userData: ], [Geometry: Polygon userData: , Geometry: Point userData: ], [Geometry: Polygon userData: , Geometry: Point userData: ]]
# getting distance using SpatialObjects
result.map(lambda x: x[0].geom.distance(x[1].geom)).take(5)
[0.0, 0.0, 0.0, 0.0, 0.0]
# getting area of polygon data
result.map(lambda x: x[0].geom.area).take(5)
[0.026651558685001447, 0.026651558685001447, 0.026651558685001447, 0.026651558685001447, 0.057069904940998895]
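The DistanceJoinQueryFlat method listed earlier follows the same pattern, with the distance expressed by wrapping one side in a CircleRDD. A minimal sketch, assuming the circle_rdd built above (not part of the original demo):
circle_rdd.spatialPartitioning(GridType.KDBTREE)
point_rdd.spatialPartitioning(circle_rdd.getPartitioner())
# pairs each circle (a 0.1-radius neighborhood) with the points it covers
distance_result = JoinQuery.DistanceJoinQueryFlat(point_rdd, circle_rdd, False, True)
distance_result.take(2)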
# Based on the result, you can create a DataFrame object by using a map function to build it from the RDD
schema = StructType(
[
StructField("geom_left", GeometryType(), False),
StructField("geom_right", GeometryType(), False)
]
)
# Set verifySchema to False
spatial_join_result = result.map(lambda x: [x[0].geom, x[1].geom])
sedona.createDataFrame(spatial_join_result, schema, verifySchema=False).show(5, True)
+--------------------+--------------------+ | geom_left| geom_right| +--------------------+--------------------+ |POLYGON ((-87.229...|POINT (-87.204033...| |POLYGON ((-87.229...|POINT (-87.204299...| |POLYGON ((-87.229...|POINT (-87.19351 ...| |POLYGON ((-87.229...|POINT (-87.18222 ...| |POLYGON ((-87.285...|POINT (-87.28468 ...| +--------------------+--------------------+ only showing top 5 rows
# The above code produces a DataFrame with the geometry data type
sedona.createDataFrame(spatial_join_result, schema, verifySchema=False).printSchema()
root |-- geom_left: geometry (nullable = false) |-- geom_right: geometry (nullable = false)
We can create a DataFrame object from a spatial pair RDD using the Adapter object as follows.
Adapter.toDf(result, ["attr1"], ["attr2"], sedona).show(5, True)
+--------------------+-----+--------------------+-----+ | geom_1|attr1| geom_2|attr2| +--------------------+-----+--------------------+-----+ |POLYGON ((-87.229...| |POINT (-87.204033...| | |POLYGON ((-87.229...| |POINT (-87.204299...| | |POLYGON ((-87.229...| |POINT (-87.19351 ...| | |POLYGON ((-87.229...| |POINT (-87.18222 ...| | |POLYGON ((-87.285...| |POINT (-87.28468 ...| | +--------------------+-----+--------------------+-----+ only showing top 5 rows
This also produces a DataFrame with the geometry data type.
Adapter.toDf(result, ["attr1"], ["attr2"], sedona).printSchema()
root |-- geom_1: geometry (nullable = true) |-- attr1: string (nullable = true) |-- geom_2: geometry (nullable = true) |-- attr2: string (nullable = true)
We can also create an RDD of type RDD[GeoData, List[GeoData]], which lets us, for example, calculate the number of points within some polygon data.
To do that we can use the code below.
point_rdd.spatialPartitioning(GridType.KDBTREE)
rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())
spatial_join_result_non_flat = JoinQuery.SpatialJoinQuery(point_rdd, rectangle_rdd, False, True)
# number of points for each polygon
number_of_points = spatial_join_result_non_flat.map(lambda x: [x[0].geom, len(x[1])])
schema = StructType([
StructField("geometry", GeometryType(), False),
StructField("number_of_points", LongType(), False)
])
sedona.createDataFrame(number_of_points, schema, verifySchema=False).show()
+--------------------+----------------+ | geometry|number_of_points| +--------------------+----------------+ |POLYGON ((-87.082...| 12| |POLYGON ((-87.229...| 7| |POLYGON ((-86.697...| 1| |POLYGON ((-87.105...| 15| |POLYGON ((-87.092...| 5| |POLYGON ((-86.860...| 12| |POLYGON ((-86.816...| 6| |POLYGON ((-87.114...| 15| |POLYGON ((-87.285...| 26| |POLYGON ((-86.749...| 4| +--------------------+----------------+
A spatial KNN query is an operation that finds the k geometries lying closest to another geometry.
For example: the 5 closest shops to your home. To run a spatial KNN query, use the KNNQuery object, which has one method:
SpatialKnnQuery(spatialRDD: SpatialRDD, originalQueryPoint: BaseGeometry, k: int, useIndex: bool)-> List[GeoData]
result = KNNQuery.SpatialKnnQuery(point_rdd, Point(-84.01, 34.01), 5, False)
result
[Geometry: Point userData: , Geometry: Point userData: , Geometry: Point userData: , Geometry: Point userData: , Geometry: Point userData: ]
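As a quick sanity check (a sketch, not in the original notebook), each returned neighbor's distance to the query point can be computed locally:
# result is a plain Python list of GeoData, so shapely operations apply directly
[gd.geom.distance(Point(-84.01, 34.01)) for gd in result]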
As the reference geometry you can also use a Polygon or LineString object; a Polygon example follows, with a LineString sketch after it.
polygon = Polygon(
[(-84.237756, 33.904859), (-84.237756, 34.090426),
(-83.833011, 34.090426), (-83.833011, 33.904859),
(-84.237756, 33.904859)
])
polygons_nearby = KNNQuery.SpatialKnnQuery(polygon_rdd, polygon, 5, False)
polygons_nearby
[Geometry: Polygon userData: , Geometry: Polygon userData: , Geometry: Polygon userData: , Geometry: Polygon userData: , Geometry: Polygon userData: ]
polygons_nearby[0].geom.wkt
'POLYGON ((-83.993559 34.087259, -83.993559 34.131247, -83.959903 34.131247, -83.959903 34.087259, -83.993559 34.087259))'
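The analogous LineString query might look like this (a hedged sketch with an arbitrarily chosen line, not part of the original demo):
from shapely.geometry import LineString

# find the 5 linestrings closest to this reference line
line = LineString([(-84.237756, 33.904859), (-83.833011, 34.090426)])
lines_nearby = KNNQuery.SpatialKnnQuery(linestring_rdd, line, 5, False)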
A spatial range query takes as input a range query window and a SpatialRDD and returns all geometries that intersect or are fully covered by the query window. RangeQuery has one method:
SpatialRangeQuery(self, spatialRDD: SpatialRDD, rangeQueryWindow: BaseGeometry, considerBoundaryIntersection: bool, usingIndex: bool) -> RDD
from sedona.core.geom.envelope import Envelope
query_envelope = Envelope(-85.01, -60.01, 34.01, 50.01)
result_range_query = RangeQuery.SpatialRangeQuery(linestring_rdd, query_envelope, False, False)
result_range_query
MapPartitionsRDD[127] at map at GeometryRddConverter.scala:30
result_range_query.take(6)
[Geometry: LineString userData: , Geometry: LineString userData: , Geometry: LineString userData: , Geometry: LineString userData: , Geometry: LineString userData: , Geometry: LineString userData: ]
# Creating DataFrame from result
schema = StructType([StructField("geometry", GeometryType(), False)])
sedona.createDataFrame(
result_range_query.map(lambda x: [x.geom]),
schema,
verifySchema=False
).show(5, True)
+--------------------+ | geometry| +--------------------+ |LINESTRING (-72.1...| |LINESTRING (-72.4...| |LINESTRING (-72.4...| |LINESTRING (-73.4...| |LINESTRING (-73.6...| +--------------------+ only showing top 5 rows
Sedona can also load data from several other formats: ESRI Shapefile, GeoJSON, WKT, and WKB.
shape_rdd = ShapefileReader.readToGeometryRDD(sc, "data/polygon")
shape_rdd
<sedona.core.SpatialRDD.spatial_rdd.SpatialRDD at 0x7fb7977c5eb0>
Adapter.toDf(shape_rdd, sedona).show(5, True)
+--------------------+ | geometry| +--------------------+ |MULTIPOLYGON (((1...| |MULTIPOLYGON (((-...| |MULTIPOLYGON (((1...| |POLYGON ((118.362...| |MULTIPOLYGON (((-...| +--------------------+ only showing top 5 rows
{ "type": "Feature", "properties": { "STATEFP": "01", "COUNTYFP": "077", "TRACTCE": "011501", "BLKGRPCE": "5", "AFFGEOID": "1500000US010770115015", "GEOID": "010770115015", "NAME": "5", "LSAD": "BG", "ALAND": 6844991, "AWATER": 32636 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -87.621765, 34.873444 ], [ -87.617535, 34.873369 ], [ -87.6123, 34.873337 ], [ -87.604049, 34.873303 ], [ -87.604033, 34.872316 ], [ -87.60415, 34.867502 ], [ -87.604218, 34.865687 ], [ -87.604409, 34.858537 ], [ -87.604018, 34.851336 ], [ -87.603716, 34.844829 ], [ -87.603696, 34.844307 ], [ -87.603673, 34.841884 ], [ -87.60372, 34.841003 ], [ -87.603879, 34.838423 ], [ -87.603888, 34.837682 ], [ -87.603889, 34.83763 ], [ -87.613127, 34.833938 ], [ -87.616451, 34.832699 ], [ -87.621041, 34.831431 ], [ -87.621056, 34.831526 ], [ -87.62112, 34.831925 ], [ -87.621603, 34.8352 ], [ -87.62158, 34.836087 ], [ -87.621383, 34.84329 ], [ -87.621359, 34.844438 ], [ -87.62129, 34.846387 ], [ -87.62119, 34.85053 ], [ -87.62144, 34.865379 ], [ -87.621765, 34.873444 ] ] ] } },
geo_json_rdd = GeoJsonReader.readToGeometryRDD(sc, "data/testPolygon.json")
geo_json_rdd
<sedona.core.SpatialRDD.spatial_rdd.SpatialRDD at 0x7fb7a5e2a4c0>
Adapter.toDf(geo_json_rdd, sedona).drop("AWATER").show(5, True)
+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+ | geometry|STATEFP|COUNTYFP|TRACTCE|BLKGRPCE| AFFGEOID| GEOID|NAME|LSAD| ALAND| +--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+ |POLYGON ((-87.621...| 01| 077| 011501| 5|1500000US01077011...|010770115015| 5| BG| 6844991| |POLYGON ((-85.719...| 01| 045| 021102| 4|1500000US01045021...|010450211024| 4| BG|11360854| |POLYGON ((-86.000...| 01| 055| 001300| 3|1500000US01055001...|010550013003| 3| BG| 1378742| |POLYGON ((-86.574...| 01| 089| 001700| 2|1500000US01089001...|010890017002| 2| BG| 1040641| |POLYGON ((-85.382...| 01| 069| 041400| 1|1500000US01069041...|010690414001| 1| BG| 8243574| +--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+ only showing top 5 rows
wkt_rdd = WktReader.readToGeometryRDD(sc, "data/county_small.tsv", 0, True, False)
wkt_rdd
<sedona.core.SpatialRDD.spatial_rdd.SpatialRDD at 0x7fb797a23430>
Adapter.toDf(wkt_rdd, sedona).printSchema()
root |-- geometry: geometry (nullable = true)
Adapter.toDf(wkt_rdd, sedona).show(5, True)
+--------------------+ | geometry| +--------------------+ |POLYGON ((-97.019...| |POLYGON ((-123.43...| |POLYGON ((-104.56...| |POLYGON ((-96.910...| |POLYGON ((-98.273...| +--------------------+ only showing top 5 rows
wkb_rdd = WkbReader.readToGeometryRDD(sc, "data/county_small_wkb.tsv", 0, True, False)
Adapter.toDf(wkb_rdd, sedona).show(5, True)
+--------------------+ | geometry| +--------------------+ |POLYGON ((-97.019...| |POLYGON ((-123.43...| |POLYGON ((-104.56...| |POLYGON ((-96.910...| |POLYGON ((-98.273...| +--------------------+ only showing top 5 rows
The JoinQueryRaw and RangeQueryRaw variants produce results that the Adapter can convert directly to DataFrames. We repeat the spatial join with JoinQueryRaw:
point_rdd.spatialPartitioning(GridType.KDBTREE)
rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())
# building an index
point_rdd.buildIndex(IndexType.RTREE, True)
# Perform Spatial Join Query
result = JoinQueryRaw.SpatialJoinQueryFlat(point_rdd, rectangle_rdd, False, True)
# without passing column names, the result will contain only two geometry columns
geometry_df = Adapter.toDf(result, sedona)
geometry_df.printSchema()
root |-- leftgeometry: geometry (nullable = true) |-- rightgeometry: geometry (nullable = true)
geometry_df.show(5)
+--------------------+--------------------+ | leftgeometry| rightgeometry| +--------------------+--------------------+ |POLYGON ((-87.092...|POINT (-86.94719 ...| |POLYGON ((-86.816...|POINT (-86.709657...| |POLYGON ((-86.816...|POINT (-86.717655...| |POLYGON ((-86.749...|POINT (-86.736302...| |POLYGON ((-86.749...|POINT (-86.735506...| +--------------------+--------------------+ only showing top 5 rows
geometry_df.collect()[0]
Row(leftgeometry=<POLYGON ((-87.093 34.264, -87.093 34.422, -86.764 34.422, -86.764 34.264, -...>, rightgeometry=<POINT (-86.947 34.29)>)
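Since the collected geometries are shapely objects, spatial predicates can be checked locally; for example (an illustrative sketch, not in the original), whether the left polygon contains the right point of the first row:
row = geometry_df.collect()[0]
row.leftgeometry.contains(row.rightgeometry)  # True when the point lies inside the rectangle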
geometry_df = Adapter.toDf(result, ["left_user_data"], ["right_user_data"], sedona)
geometry_df.show(5)
+--------------------+--------------+--------------------+---------------+ | leftgeometry|left_user_data| rightgeometry|right_user_data| +--------------------+--------------+--------------------+---------------+ |POLYGON ((-87.092...| |POINT (-86.94719 ...| null| |POLYGON ((-86.816...| |POINT (-86.709657...| null| |POLYGON ((-86.816...| |POINT (-86.717655...| null| |POLYGON ((-86.749...| |POINT (-86.736302...| null| |POLYGON ((-86.749...| |POINT (-86.735506...| null| +--------------------+--------------+--------------------+---------------+ only showing top 5 rows
query_envelope = Envelope(-85.01, -60.01, 34.01, 50.01)
result_range_query = RangeQueryRaw.SpatialRangeQuery(linestring_rdd, query_envelope, False, False)
# converting to df
gdf = Adapter.toDf(result_range_query, sedona)
gdf.show(5)
+--------------------+ | geometry| +--------------------+ |LINESTRING (-72.1...| |LINESTRING (-72.4...| |LINESTRING (-72.4...| |LINESTRING (-73.4...| |LINESTRING (-73.6...| +--------------------+ only showing top 5 rows
gdf.printSchema()
root |-- geometry: geometry (nullable = true)
# Passing column names
# converting to df
gdf_with_columns = Adapter.toDf(result_range_query, sedona, ["_c1"])
gdf_with_columns.show(5)
+--------------------+---+ | geometry|_c1| +--------------------+---+ |LINESTRING (-72.1...| | |LINESTRING (-72.4...| | |LINESTRING (-72.4...| | |LINESTRING (-73.4...| | |LINESTRING (-73.6...| | +--------------------+---+ only showing top 5 rows
gdf_with_columns.printSchema()
root |-- geometry: geometry (nullable = true) |-- _c1: string (nullable = true)
We have shown how to install Sedona with PySpark and run a basic example (source: https://github.com/apache/sedona/blob/master/docs/usecases/ApacheSedonaCore.ipynb) on Google Colab. This demo used the Spark engine provided by PySpark.