AWS EMR on EC2
Set Up Permissions for Wherobots through IAM
To link Wherobots to AWS, you need an existing AWS account. If you don’t have one, you can sign up for an AWS Free Tier account at https://aws.amazon.com/free/.
Create an IAM user under your AWS account
- Navigate to your AWS IAM dashboard (What is an IAM user?)
- Click on the Users tab
- Click on “Add users”
- To grant the IAM user only the minimum permissions Wherobots needs to operate in your account, skip this step and refer to Grant IAM User permissions. Alternatively, to save time, choose Attach policies directly and select the following 2 policies:
  - AmazonEMRFullAccessPolicy_v2
  - AmazonS3FullAccess
- Review the new user and press “Create user”
- Find the new user you created and open its page
- Click on the “Security credentials” tab
- Scroll down to the “Access keys” section and click on “Create access key”
- Select Third-party service and click “Next”
- Set a description tag (optional) and click on “Create access key”
- Download or copy the access keys you have just created. You will need them in the coming steps
Grant IAM User permissions
For Wherobots to operate in your account, your IAM policies must grant at least the permissions listed below.
Permissions to manage EMR clusters
Wherobots requires permissions to list clusters and to describe, clone, and terminate an EMR cluster.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEMRAPI",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:RunJobFlow"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowEMRClusters",
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:GetManagedScalingPolicy",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListSteps",
        "elasticmapreduce:ListInstanceFleets",
        "elasticmapreduce:GetAutoTerminationPolicy",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:TerminateJobFlows",
        "elasticmapreduce:ListInstanceGroups"
      ],
      "Resource": "arn:aws:elasticmapreduce:*:{ACCOUNT_ID}:cluster/*"
    },
    {
      "Sid": "AllowPassRoles",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::{ACCOUNT_ID}:role/*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "*amazonaws.com*"
        }
      }
    }
  ]
}
You can copy these policies directly and create them under your account.
Permissions to upload Wherobots binaries
Wherobots creates a bucket in your account with:
- the tag author = wherobots
- a name matching wherobots-*
Wherobots uses this bucket to upload its binaries and config files, which requires the permissions below.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowInspectBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:GetBucketTagging"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowManageWherobotsBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:CreateBucket",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:DeleteBucket",
        "s3:PutBucketTagging"
      ],
      "Resource": "arn:aws:s3:::wherobots-*"
    }
  ]
}
Grant permissions to IAM User
- Find your 12-digit AWS Account ID. For instructions, see the steps on AWS.
- In the AWS Console, navigate to Policies under IAM. Then click Create Policy at the top right. Switch to the JSON view and paste in the policies above. The editor takes a single policy document, so either combine the statements of both policies into one Statement array or create one policy per document. Replace {ACCOUNT_ID} with your real account ID. Then create the new policy.
- Navigate to Users under IAM. Find the IAM user you just created. Under the Permissions tab -> Permissions policies, click Add permissions. In the edit panel, choose Attach policies directly, then search for the IAM policy you just created and attach it to the new IAM user.
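The policy step above can also be prepared offline. The following is a minimal sketch, using only the Python standard library, that fills in {ACCOUNT_ID} and combines several policy documents into one. The helper name `merge_policies`, the shortened policy excerpts, and the account ID 123456789012 are hypothetical and used for illustration only; this is not part of any Wherobots or AWS tooling.

```python
import json
import re


def merge_policies(policy_jsons, account_id):
    """Combine IAM policy documents into one and fill in {ACCOUNT_ID}.

    Hypothetical helper for illustration only.
    """
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("an AWS account ID is exactly 12 digits")

    statements = []
    for raw in policy_jsons:
        doc = json.loads(raw.replace("{ACCOUNT_ID}", account_id))
        found = doc["Statement"]
        # IAM accepts either a single statement object or a list of them.
        statements.extend(found if isinstance(found, list) else [found])

    # Statement IDs must be unique within one policy document.
    sids = [s["Sid"] for s in statements if "Sid" in s]
    if len(sids) != len(set(sids)):
        raise ValueError("duplicate Sid across the merged policies")

    return {"Version": "2012-10-17", "Statement": statements}


# Shortened excerpts of the two policies above (not the full policies).
EMR_EXCERPT = """{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowPassRoles",
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": "arn:aws:iam::{ACCOUNT_ID}:role/*",
    "Condition": {"StringLike": {"iam:PassedToService": "*amazonaws.com*"}}
  }]
}"""

S3_EXCERPT = """{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowInspectBuckets",
    "Effect": "Allow",
    "Action": ["s3:ListAllMyBuckets", "s3:GetBucketTagging"],
    "Resource": "*"
  }]
}"""

# 123456789012 is a placeholder account ID.
merged = merge_policies([EMR_EXCERPT, S3_EXCERPT], "123456789012")
print(json.dumps(merged, indent=2))
```

The printed document can then be pasted into the JSON view of the policy editor. Passing the full policy texts instead of the excerpts works the same way.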
Link your AWS IAM account to your Wherobots Account
- Navigate to the Wherobots AWS cloud provider page
- Enter your access key and secret key from your AWS IAM user into the corresponding fields
- Press “Save Changes”
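Before entering the keys, it can help to confirm that they belong to the IAM user and account you expect. The sketch below is one possible check, assuming the `boto3` package is available for the real call; the helper `describe_identity` and the `FakeSTS` stand-in (with made-up values) are hypothetical, and the stand-in exists only so the example runs without AWS access.

```python
def describe_identity(sts_client):
    """Return (account_id, arn) for the credentials behind sts_client.

    Expects an object with the STS GetCallerIdentity call, such as
    boto3.client("sts"). Hypothetical helper for illustration only.
    """
    identity = sts_client.get_caller_identity()
    return identity["Account"], identity["Arn"]


class FakeSTS:
    """Stand-in client so the sketch runs offline; all values are made up."""

    def get_caller_identity(self):
        return {
            "UserId": "AIDAEXAMPLE",
            "Account": "123456789012",
            "Arn": "arn:aws:iam::123456789012:user/example-user",
        }


# With real credentials, build the client with boto3 instead:
#   import boto3
#   sts = boto3.client(
#       "sts",
#       aws_access_key_id="<your access key>",
#       aws_secret_access_key="<your secret key>",
#   )
account_id, arn = describe_identity(FakeSTS())
print(account_id, arn)
```

The returned account ID should match the 12-digit ID used in the policies, and the ARN should name the IAM user created earlier.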
Create the EMR Cluster
Note
Wherobots currently does not support EMR on EKS. Please choose EMR on EC2.
Wherobots deploys to EMR by cloning an existing cluster chosen by the user and appending Sedona add-ons. This means the user needs to create a source cluster to clone first. Below are the steps to create the simplest cluster that Wherobots can clone.
- Navigate to the AWS EMR clusters list and click the Create Cluster button.
- Under the Application bundle section, choose Custom and pick at least the following apps:
  - spark
  - livy
  - JupyterEnterpriseGateway
- Choose the Amazon EMR Service Role and the EC2 instance profile for Amazon EMR. Pick the default roles suggested by AWS if you don't need any special permission control.
- Click the Create Cluster button, then navigate back to the clusters page, where you will find the cluster provisioning. Click into the cluster to see its details.
For more details on how to configure the EMR cluster, please refer to the official AWS documentation.
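The console steps above could also be scripted. The sketch below builds a request for the EMR RunJobFlow API with the three applications listed above. It rests on several assumptions that are not from this guide: the cluster name, the release label `emr-6.15.0`, the instance type and count, and the role names `EMR_DefaultRole` and `EMR_EC2_DefaultRole` (the names created by `aws emr create-default-roles`) are placeholders. A real cluster usually also needs networking and logging settings.

```python
def build_source_cluster_request(name, release_label, instance_type="m5.xlarge"):
    """Build keyword arguments for the EMR RunJobFlow API.

    Hypothetical helper: the defaults below are placeholders, not
    values recommended by Wherobots.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # The minimum set of applications named in the steps above.
        "Applications": [
            {"Name": "Spark"},
            {"Name": "Livy"},
            {"Name": "JupyterEnterpriseGateway"},
        ],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": 3,
            # Keep the cluster alive after startup instead of terminating
            # once there are no steps left to run.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
    }


request = build_source_cluster_request("example-clone-source", "emr-6.15.0")
print([app["Name"] for app in request["Applications"]])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   cluster_id = boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Keeping the request as a plain dictionary makes it easy to review before anything is created in the account.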
Set Up your EMR Cluster with Wherobots
- Navigate to the clusters page at https://www.wherobots.services/clusters. Click on the AWS Elastic Map Reduce(EMR) icon.
- Find the source cluster you created in the cluster list, click the check box on its left side, then choose clone to start. In the pop-up window, input the new cluster name and choose the Wherobots version. Then submit.
- The Wherobots service will clone your source cluster and install the Wherobots binaries into the new cluster. You can monitor the cluster spawn process on the same page. Once the status updates to Waiting, the EMR cluster is ready to use.
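The same readiness check could be done from a script by polling the EMR DescribeCluster API. The sketch below assumes that the Waiting status shown on the Wherobots page corresponds to the EMR API state `WAITING`; the helper `wait_until_waiting`, the `FakeEMR` stand-in, and the cluster ID `j-EXAMPLE` are hypothetical, and the stand-in exists only so the example runs without AWS access.

```python
import time

# Cluster states from which a cluster will not become ready.
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_waiting(emr_client, cluster_id, poll_seconds=30,
                       max_polls=120, sleep=time.sleep):
    """Poll DescribeCluster until the cluster state is WAITING.

    emr_client is expected to behave like boto3.client("emr").
    Raises RuntimeError if the cluster terminates, TimeoutError if
    max_polls is exhausted first.
    """
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == "WAITING":
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster {cluster_id} ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_id} not ready after {max_polls} polls")


class FakeEMR:
    """Stand-in client that replays a fixed sequence of states."""

    def __init__(self, states):
        self._states = iter(states)

    def describe_cluster(self, ClusterId):
        state = next(self._states)
        return {"Cluster": {"Id": ClusterId, "Status": {"State": state}}}


fake = FakeEMR(["STARTING", "BOOTSTRAPPING", "WAITING"])
# sleep is replaced with a no-op so the example finishes immediately.
print(wait_until_waiting(fake, "j-EXAMPLE", sleep=lambda seconds: None))
```

With a real client, pass `boto3.client("emr")` and the ID of the cloned cluster, and leave `sleep` at its default. boto3 also provides a built-in `cluster_running` waiter that may be sufficient for simpler cases.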
Spawn EMR Workspace (JupyterLab) and run notebooks
Tip
We provide many ready-to-use example Python Jupyter notebooks. Please try them out: Wherobots examples.
- Navigate to your AWS EMR cluster page, click Workspace on the sidebar, then choose to create a workspace. If there is no Studio yet, create a new one following the AWS guidance, or use any existing Studio available.
  - Make sure that the Studio is in the same VPC and subnet (multi-choice) as the EMR cluster.
- On the create page, under Advanced configuration, choose Attach Workspace to an EMR cluster, then pick the cluster spawned by Wherobots.
- Submit the creation, then navigate back to the workspace list and wait for the new workspace to reach the attached status. Then click the workspace, and AWS will open a new JupyterLab tab in your browser.
- Import a Jupyter notebook. You can get one from Wherobots examples.
- Now your notebook is imported. You can open it, choose the PySpark kernel, and execute the cells one by one.