
Jupyter notebook management

After you create a Notebook instance and start a Jupyter Notebook, the Jupyter dashboard is presented as follows:

Jupyter Dashboard

Note

While you are working in Jupyter or the Spark UI, you might see a warning message indicating that the connection to the server has been lost. Wherobots uses cookie sessions to authenticate users, and the cookie session expires after an hour. You can refresh the page and the browser will redirect you to the login page; after you log in again, the browser will redirect you back to the previous page.

Force relogin

Executing Jupyter Notebook

There are two types of kernels available for your Jupyter Notebook: the Python kernel (ipykernel) and the Scala kernel (Scala). These kernels can be created using the Launcher, which can be accessed through File -> New Launcher.

Spark Web UI

A Spark web UI enables real-time monitoring, performance analysis, and resource optimization for efficient data processing, helping you identify bottlenecks and improve overall application efficiency. You can access it from the toolbar by clicking Sedona Spark and selecting the correct port number.

Spark UI

To obtain the port number, execute the provided code snippet:

# The Spark UI port is the last component of the driver's UI URL
spark_ui_port = sedona.sparkContext.uiWebUrl.split(":")[-1]
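
If you want to confirm the value, you can print the full UI URL along with the extracted port; this is only an optional sanity check:

print(sedona.sparkContext.uiWebUrl)  # for example, http://<driver-host>:4040
print(spark_ui_port)                 # the port number to select in the toolbar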

Refer to the Spark Web UI documentation for more information.

Executing All Cells

Open the desired Jupyter notebook on the Notebook instance that you created. Follow the steps below to execute all the code cells in the notebook.

  1. Locate the toolbar on the top-left of the notebook and select Run as shown in the screenshot.
  2. Click Run All Cells to execute each cell in the notebook.

    Execute all cells

Note

When you first execute a Sedona Spark code cell, you will see a warning:

<TIMESTAMP> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This behavior is normal: starting the executors takes somewhere between 1 and 5 minutes, depending on the number of executors provided.

The Spark configuration and the way Maven libraries are added differ slightly between Scala and Python notebooks.

Python Notebook Sedona Spark Configuration

To access data, you need to create a Sedona object with the appropriate configuration. For example, the configuration for accessing data from a public S3 bucket looks like this:

.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

If your bucket is not public, you can add your access key and secret key to access the private S3 bucket using the configuration shown below:

.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.access.key", "ACCESS-KEY")
.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.secret.key", "SECRET-KEY")
.config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

Sedona can also dynamically fetch any number of additional dependencies directly from Maven by defining the Maven coordinates for the package. Maven coordinates have three required fields: groupId:artifactId:version.

.config("spark.jars.packages", "<MAVEN_COORDINATE>,<MAVEN_COORDINATE>")

Putting all of this configuration together, accessing a public S3 bucket with an additional dependency looks like this:

from sedona.spark import *

config = SedonaContext.builder() .\
    config("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider") .\
    config("spark.jars.packages", "<MAVEN_COORDINATE>") .\
    getOrCreate()
sedona = SedonaContext.create(config)
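
With the Sedona object created, you can read data from the configured bucket. The path and format below are hypothetical examples; substitute the dataset you actually want to load:

# Hypothetical example: load a GeoParquet dataset from the configured public bucket
df = sedona.read.format("geoparquet").load("s3a://<YOUR_BUCKET_NAME>/path/to/data.parquet")
df.show(5)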

Scala Notebook Sedona Spark Configuration

Defining configuration in a Scala Jupyter notebook is a bit different from a Python Jupyter notebook: you have to define the configuration before you create the Sedona object. To configure a Scala Jupyter notebook to access a public S3 bucket, follow the instructions below:

launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

To access an S3 bucket that is not public, you will have to define the access key and secret key for that bucket:

launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.access.key", "ACCESS-KEY")
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.secret.key", "SECRET-KEY")
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

You can dynamically add any number of Maven dependencies by defining the Maven coordinates.

launcher.packages = ["<MAVEN_COORDINATE>", "<MAVEN_COORDINATE>"]

Putting all the Scala Jupyter configuration together for accessing a public S3 bucket with an additional dependency looks like this. You need to add this configuration at the very beginning of the Scala notebook:

%%init_spark
launcher.conf.set("spark.hadoop.fs.s3a.bucket.<YOUR_BUCKET_NAME>.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
launcher.packages = ["<MAVEN_COORDINATE>"]

Once you have defined the configuration for the Scala Jupyter notebook, creating the Sedona object is straightforward:

val config = SedonaContext.builder().getOrCreate()
val sedona = SedonaContext.create(config)
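
As in the Python notebook, you can then read data with the Sedona object. The path and format here are hypothetical examples:

// Hypothetical example: load a GeoParquet dataset from the configured bucket
val df = sedona.read.format("geoparquet").load("s3a://<YOUR_BUCKET_NAME>/path/to/data.parquet")
df.show(5)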

Note

You have the option to bypass configuring Spark settings in the Jupyter Notebook by specifying them directly in the advanced configuration's Spark configuration field.

Python Zip Library File Support

In Jupyter Notebook, you can import your own customised Python modules. Follow the steps below:

  1. On your local computer or in any environment, create a directory called zipmoudletest.

  2. In the zipmoudletest directory, create a file named hellosedona.py with the following contents:

def hello(input):
    return 'hello ' + str(input)

  3. In the same directory, add an empty file named __init__.py. Listing the directory (ls zipmoudletest) should now show the following:

__init__.py       hellosedona.py

  4. Use the zip command to place the two module files into a file called zipmoudletest.zip, or use any compression tool to zip these two files:

zip -r9 ../zipmoudletest.zip *

  5. Upload the zip file into your S3 folder from the file browser; refer to the file structure docs.

  6. Copy the S3 path of the zip file and use the code below to import it:

sedona.sparkContext.addPyFile('s3://<Your-Bucket>/path-to-file/zipmoudletest.zip')
from zipmoudletest.hellosedona import hello
hello_str = hello("Sedona")

The output will be:

hello Sedona

  7. If you want to import a customised Python module in a job submission, you can include the code above in your Python script as well.
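
For example, a standalone job script that imports the zipped module might look like the following sketch; the S3 path is a placeholder, and the SedonaContext is created the same way as in the notebook examples above:

from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Make the zipped module available to the driver and the Python workers
sedona.sparkContext.addPyFile('s3://<Your-Bucket>/path-to-file/zipmoudletest.zip')
from zipmoudletest.hellosedona import hello

print(hello("Sedona"))  # prints "hello Sedona"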

Python Notebook As a Job

Once the Jupyter notebook is ready, you can export it as an executable Python file. Exporting your Python Jupyter notebook allows you to create jobs. Follow the steps below to create a Python executable.

  1. Locate the toolbar on the top-left of the notebook and select File as shown in the screenshot.
  2. Hover over the Save and Export Notebook As....

    Create Python executable file

  3. Select Executable Script. A file with the name of the Jupyter notebook will be saved to your downloads folder.

    Create Python executable file continued

Once you have the Python executable file, refer to job submission docs to create a job.

Scala Notebook As a Job

Once you have made your changes to the Jupyter Scala notebook, you can export it as an executable Scala file. The steps are the same as creating a Python executable script from a Python notebook.

  1. Locate the toolbar on the top-left of the notebook and select File as shown in the screenshot.
  2. Hover over the Save and Export Notebook As....

    Create Scala executable file

  3. Select Executable Script. A file with the name of the Jupyter Scala notebook will be saved to your downloads folder.

    Create Scala executable file continued

Once you have the Scala file, import it into the sedona-maven-example/src/main/scala/com/wherobots/sedona/ directory; this ensures the Scala file is included in the jar package for job submission.

Note

The Scala executable file that you create won't have a main class. To add one, wrap your code (after the imports) in object <class-name-you-want> extends App { ... }, as sketched below; this ensures the job executes as expected.
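
A minimal sketch of such a wrapper, using a hypothetical object name and the same SedonaContext builder pattern shown above (adjust the import to your Sedona version if needed):

import org.apache.sedona.spark.SedonaContext

// Hypothetical object name; "extends App" supplies the main entry point for the job
object MySedonaJob extends App {
  val config = SedonaContext.builder().getOrCreate()
  val sedona = SedonaContext.create(config)

  // ... the rest of the code copied from the Scala notebook ...
}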

Executing a .scala file is not possible within the Jupyter environment as it functions solely as a text editor. To execute code, please utilize the Jupyter Scala notebook.

Alternatively, you can copy the code from the Jupyter Scala notebook and paste it into the existing file at the path mentioned above.

Creating Jar File

  1. Navigate to File on the toolbar.
  2. Click on New Launcher

    Open launcher

  3. Open a terminal and make sure you are in the sedona-maven-example directory.

    Open terminal

  4. Execute mvn clean package in the terminal at the specified location (see the terminal sketch after this list).

  5. Locate target folder.

    Target folder

  6. Right-click on the sedonadb-example-0.0.1.jar.

  7. Select Download to download the jar in your downloads folder.

    Download jar file
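
For reference, the terminal portion of these steps might look like the following sketch; the exact starting path depends on where the example project lives in your environment:

cd sedona-maven-example    # step 3: move into the Maven project directory
mvn clean package          # step 4: build the project; the jar is written to the target folder
ls target                  # steps 5-6: sedonadb-example-0.0.1.jar should appear here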

Note

You may add any dependency to the pom.xml located at notebook-example/scala/sedona-maven-example

Once you have the jar file, refer to job submission docs to create a job.

