AWS Big Data Blog
Simplify your Spark dependency management with Docker in EMR 6.0.0
Apache Spark is a powerful data processing engine that gives data analyst and engineering teams easy-to-use APIs and tools to analyze their data, but it can be challenging for teams to manage their Python and R library dependencies. Installing every dependency that a job may need before it runs, and dealing with library version conflicts, is time-consuming and complicated. Amazon EMR 6.0.0 simplifies this by allowing you to use Docker images from Docker Hub and Amazon ECR to package your dependencies. This lets you package and manage dependencies for individual Spark jobs or notebooks, without having to manage a spiderweb of dependencies across your cluster.
This post shows you how to use Docker to manage notebook dependencies with Amazon EMR 6.0.0 and EMR Notebooks. You will launch an EMR 6.0.0 cluster and use notebook-specific Docker images from Amazon ECR with your EMR Notebook.
Creating a Docker image
The first step is to create a Docker image that contains Python 3 and the latest version of the numpy Python package. You create Docker images by using a Dockerfile, which defines the packages and configuration to include in the image. Docker images used with Amazon EMR 6.0.0 must contain a Java Development Kit (JDK); the following Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8:
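A minimal sketch of such a Dockerfile, assuming the amazoncorretto:8 base image (which is built on Amazon Linux 2 and ships with JDK 8) and a pip installation of numpy:

```dockerfile
FROM amazoncorretto:8

# Install Python 3 and pip on the Amazon Linux 2 base
RUN yum -y update && \
    yum -y install python3 python3-pip && \
    yum clean all

# Install the latest numpy from PyPI
RUN pip3 install numpy

# Make Spark use the Python 3 interpreter inside the container
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
```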
You will use this Dockerfile to create a Docker image, and then tag and upload it to Amazon ECR. After you upload it, you will launch an EMR 6.0.0 cluster that is configured to use this Docker image as the default image for Spark jobs. Complete the following steps to build, tag, and upload your Docker image:
- Create a directory and a new file named Dockerfile using the following commands:
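For example, using pyspark-latest as the directory name (an illustrative choice):

```bash
mkdir pyspark-latest
vi pyspark-latest/Dockerfile
```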
- Copy, and then paste the contents of the Dockerfile, save it, and run the following command to build a Docker image:
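For example, tagging the image locally as local/pyspark-latest (again an illustrative name):

```bash
sudo docker build -t local/pyspark-latest pyspark-latest/
```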
- Create the emr-docker-examples Amazon ECR repository for this walkthrough using the following command:
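With the AWS CLI, for example:

```bash
aws ecr create-repository --repository-name emr-docker-examples
```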
- Tag the locally built image and replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint, using the following command:
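For example:

```bash
sudo docker tag local/pyspark-latest 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest
```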
Before you can push the Docker image to Amazon ECR, you need to log in.
- To get the login line for your Amazon ECR account, use the following command:
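With AWS CLI version 1, for example (adjust the Region to match your repository):

```bash
aws ecr get-login --region us-east-1 --no-include-email
```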
- Enter and run the output from the get-login command:
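The output is a docker login command that looks roughly like the following, with a long temporary token in place of <password>:

```bash
sudo docker login -u AWS -p <password> https://123456789123.dkr.ecr.us-east-1.amazonaws.com
```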
- Upload the locally built image to Amazon ECR and replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint. See the following command:
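For example:

```bash
sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest
```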
After this push is complete, the Docker image is available to use with your EMR cluster.
Launching an EMR 6.0.0 cluster with Docker enabled
To use Docker with Amazon EMR, you must launch your EMR cluster with Docker runtime support enabled and have the right configuration in place to connect to your Amazon ECR account. To allow your cluster to download images from Amazon ECR, make sure the instance profile for your cluster has the permissions from the AmazonEC2ContainerRegistryReadOnly managed policy associated with it. The configuration in the first step below configures your EMR 6.0.0 cluster to use Amazon ECR to download Docker images, and configures Apache Livy and Apache Spark to use the pyspark-latest Docker image as the default for all Spark jobs. Complete the following steps to launch your cluster:
- Create a file named emr-configuration.json in the local directory with the following configuration (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint). You will use this configuration to launch your EMR 6.0.0 cluster using the AWS CLI.
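A configuration along these lines registers your Amazon ECR endpoint as a trusted Docker registry for YARN and sets pyspark-latest as the default image for Spark and Livy; the classifications and properties follow the EMR documentation for Docker support, but treat the exact values as a sketch to adapt:

```json
[
  {
    "Classification": "container-executor",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "docker",
        "Properties": {
          "docker.trusted.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com",
          "docker.privileged-containers.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com"
        }
      }
    ]
  },
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.spark.master": "yarn",
      "livy.spark.deploy-mode": "cluster",
      "livy.server.session.timeout": "600"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest",
      "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
      "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest"
    }
  }
]
```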
- Enter the following commands (replace myKey with the name of the EC2 key pair you use to access the cluster using SSH, and subnet-1234567 with the subnet ID the cluster should be launched in):
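A create-cluster call along these lines works; the instance type and count are illustrative, and the instance profile used by --use-default-roles must have the AmazonEC2ContainerRegistryReadOnly policy attached, as noted above:

```bash
aws emr create-cluster \
  --name "EMR-6.0.0-with-Docker" \
  --release-label emr-6.0.0 \
  --applications Name=Livy Name=Spark \
  --ec2-attributes "KeyName=myKey,SubnetId=subnet-1234567" \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://./emr-configuration.json
```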
After the cluster launches and is in the Waiting state, make sure that the cluster hosts can authenticate themselves to Amazon ECR and download Docker images.
- Use your EC2 key pair to SSH into one of the core nodes of the cluster.
- To generate the Docker CLI command to create the credentials (valid for 12 hours) the cluster uses to download Docker images from Amazon ECR, enter the following command:
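This is the same get-login call as before, now run on the cluster node:

```bash
aws ecr get-login --region us-east-1 --no-include-email
```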
- Enter and run the output from the get-login command:
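As before, the output is a docker login command; run it with sudo so the credentials are written under /root/.docker (the placeholder stands in for the temporary token):

```bash
sudo docker login -u AWS -p <password> https://123456789123.dkr.ecr.us-east-1.amazonaws.com
```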
This command generates a config.json file in the /root/.docker folder.
- Place the generated config.json file in the HDFS location /user/hadoop/ using the following command:
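A minimal sketch, assuming your shell can read the file under /root/.docker (run as root or adjust permissions if it cannot):

```bash
hadoop fs -put /root/.docker/config.json /user/hadoop/
```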
Now that you have an EMR cluster that is configured with Docker and an image in Amazon ECR, you can use EMR Notebooks to create and run your notebook.
Creating an EMR Notebook
EMR Notebooks are serverless Jupyter notebooks available directly through the Amazon EMR console. They allow you to separate your notebook environment from your underlying cluster infrastructure, and access your notebook without spending time setting up SSH access or configuring your browser for port-forwarding. You can find EMR Notebooks in the left-hand navigation of the EMR console.
To create your notebook, complete the following steps:
- Click on Notebooks in the EMR Console
- Choose a name for your notebook
- Click Choose an existing cluster and select the cluster you just created
- Click Create notebook
- Once your notebook is in a Ready status, click the Open in JupyterLab button to open it in a new browser tab. A default notebook with the name of your EMR Notebook is created for you. When you click on that notebook, you’ll be asked to choose a kernel. Choose PySpark.
- Enter the following configuration into the first cell in your notebook and click ▸(Run):
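This is a sparkmagic %%configure cell; one likely configuration (treat the exact property as an assumption) disables the notebook-scoped Python virtualenv so the session uses the Python environment packaged in the Docker image:

```
%%configure -f
{"conf": { "spark.pyspark.virtualenv.enabled": "false" }}
```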
- Enter the following PySpark code into your notebook and click ▸(Run):
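For example, a short snippet that builds a numpy array and prints the numpy version in use:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
print("my array:")
print(a)
print("numpy version: " + np.__version__)
```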
The output should look like the following screenshot; the numpy version in use is the latest (at the time of this writing, 1.18.2).
This PySpark code was run on your EMR 6.0.0 cluster using YARN, Docker, and the pyspark-latest image that you created. EMR Notebooks connect to EMR clusters using Apache Livy. The configuration specified in emr-configuration.json configured your EMR cluster’s Spark and Livy instances to use Docker and the pyspark-latest Docker image as the default Docker image for all Spark jobs submitted to this cluster. This allows you to use numpy without having to install it on any cluster nodes. The following section looks at how you can create and use different Docker images for specific notebooks.
Using a custom Docker image for a specific notebook
Individual workloads often require specific versions of library dependencies. To allow individual notebooks to use their own Docker images, you first create a new Docker image and push it to Amazon ECR. You then configure your notebook to use this Docker image instead of the default pyspark-latest image.
Complete the following steps:
- Create a new Dockerfile with a specific version of numpy: 1.17.5.
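The Dockerfile is the same as the earlier sketch except that numpy is pinned to 1.17.5:

```dockerfile
FROM amazoncorretto:8

# Install Python 3 and pip on the Amazon Linux 2 base
RUN yum -y update && \
    yum -y install python3 python3-pip && \
    yum clean all

# Pin numpy to the specific version this notebook needs
RUN pip3 install 'numpy==1.17.5'

ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
```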
- Create a directory and a new file named Dockerfile using the following commands:
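For example, using numpy-1-17 as the directory name:

```bash
mkdir numpy-1-17
vi numpy-1-17/Dockerfile
```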
- Enter the contents of your new Dockerfile and run the following code to build a Docker image:
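For example:

```bash
sudo docker build -t local/numpy-1-17 numpy-1-17/
```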
- Tag and upload the locally built image to Amazon ECR and replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint using the following commands:
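For example:

```bash
sudo docker tag local/numpy-1-17 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17
sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17
```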
Now that the numpy-1-17 Docker image is available in Amazon ECR, you can use it with a new notebook.
- Create a new notebook by returning to your EMR Notebook, choosing File, New, Notebook, and choosing the PySpark kernel. To tell your EMR Notebook to use this specific Docker image instead of the default, you need to use the following configuration parameters.
- Enter the following code in your notebook (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint) and choose ▸(Run):
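A %%configure cell along these lines overrides the executor and application master images for just this session; the property names mirror the YARN Docker runtime settings used in emr-configuration.json:

```
%%configure -f
{"conf":
  {
    "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17",
    "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17"
  }
}
```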
To check that your PySpark code is using version 1.17.5, enter the same PySpark code as before to import numpy and output the version.
- Enter the following code into your notebook and choose Run:
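As before, a minimal check of the numpy version:

```python
import numpy as np

print("numpy version: " + np.__version__)
```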
The output should look like the following screenshot; the numpy version in use is the version you installed in your numpy-1-17 Docker image: 1.17.5.
Summary
This post showed you how to simplify your Spark dependency management using Amazon EMR 6.0.0 and Docker. You created a Docker image to package your Python dependencies, created a cluster configured to use Docker, and used that Docker image with an EMR Notebook to run PySpark jobs. To find out more about using Docker images with EMR, refer to the EMR documentation on how to Run Spark Applications with Docker Using Amazon EMR 6.0.0. Stay tuned for additional updates on new features and further improvements with Apache Spark on Amazon EMR.
About the Authors
Paul Codding is a senior product manager for EMR at Amazon Web Services.
Suthan Phillips is a big data architect at AWS. He works with customers to provide them architectural guidance and helps them achieve performance enhancements for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.