AWS Open Source Blog
Kubeflow on Amazon EKS
NOTE: Since this blog post was written, much about Kubeflow has changed. While we are leaving it up for historical reference, more accurate information about Kubeflow on AWS can be found here.
The Kubeflow project is designed to simplify the deployment of machine learning workloads such as TensorFlow on Kubernetes. There are also plans to add support for additional frameworks such as MXNet, PyTorch, Chainer, and more. These frameworks can leverage GPUs in the Kubernetes cluster for machine learning tasks.
Recently, we announced support for P2 and P3 GPU worker instances on Amazon EKS. While it’s possible to run machine learning workloads on CPU instances, GPU instances have thousands of CUDA cores, which significantly improve performance when training deep neural networks and processing large data sets. This post will demonstrate how to deploy Kubeflow on Amazon EKS clusters with P3 worker instances. We will then show how you can use Kubeflow to easily perform machine learning tasks like training and model serving on Kubernetes. We will be using a Jupyter notebook for our training, based on the TensorFlow framework. A Jupyter notebook is an open source web application that allows us to create and share documents containing live code and narrative in languages such as Python, Scala, and R. A Python notebook is used in our example.
Prerequisites
- AWS CLI
- An environment where you can build Docker images. We recommend using AWS Cloud9 IDE, as it has Docker and AWS CLI pre-installed. Having Docker and AWS CLI installed on your local machine will work, too.
- Subscribe to the EKS-optimized AMI with GPU Support from the AWS Marketplace.
- Ensure that you can launch at least two GPU instances (P2 or P3); you can raise this limit through the EC2 console.
- Have the kubectl command line tool installed.
Follow the instructions to create an EKS cluster with GPU instances. Alternatively, use the eksctl command line tool from Weaveworks to spin up an EKS cluster. For example, the following command will spin up a cluster with two worker nodes of p3.8xlarge instances in the us-west-2 region:
$ eksctl create cluster eks-kubeflow --node-type=p3.8xlarge --nodes 2 --region us-west-2 --timeout=40m
Amazon EKS Cluster Validation
Run this command to apply the NVIDIA Kubernetes device plugin as a DaemonSet on the worker nodes:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
You can issue these commands to check the status of the nvidia-device-plugin DaemonSet and its corresponding pods:
$ kubectl get daemonset -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
aws-node 2 2 2 2 2 <none> 2d
kube-proxy 2 2 2 2 2 <none> 2d
nvidia-device-plugin-daemonset 2 2 2 2 2 <none> 2d
$ kubectl get pods -n kube-system -owide |grep nvid
nvidia-device-plugin-daemonset-7842r 1/1 Running 0 2d 192.168.118.128 ip-192-168-111-8.us-west-2.compute.internal
nvidia-device-plugin-daemonset-7cnnd 1/1 Running 0 2d 192.168.179.50 ip-192-168-153-27.us-west-2.compute.internal
Once the nvidia-device-plugin pods are running, the next command confirms that there are four GPUs in each worker node:
$ kubectl get nodes \
"-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,EC2:.metadata.labels.beta\.kubernetes\.io/instance-type,AZ:.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone"
NAME GPU EC2 AZ
ip-192-168-177-96.us-west-2.compute.internal 4 p3.8xlarge us-west-2a
ip-192-168-246-95.us-west-2.compute.internal 4 p3.8xlarge us-west-2c
Storage Class for Persistent Volume
Kubeflow requires a default storage class to spawn Jupyter notebooks with attached persistent volumes. A StorageClass in Kubernetes provides a way to describe the type of storage (e.g., types of EBS volume: io1, gp2, sc1, st1) that an application can request for its persistent storage. The following command creates a Kubernetes default storage class for dynamic provisioning of persistent volumes backed by Amazon Elastic Block Store (EBS) with the general-purpose SSD volume type (gp2).
$ cat <<EOF | kubectl create -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Delete
mountOptions:
  - debug
EOF
Validate that the default StorageClass is created using the command below:
$ kubectl get storageclass
NAME PROVISIONER AGE
gp2 (default) kubernetes.io/aws-ebs 2d
Install Kubeflow
Kubeflow uses ksonnet, a command line tool that simplifies the configuration and deployment of applications in multiple Kubernetes environments. Ksonnet abstracts Kubernetes resources as Prototypes. Ksonnet uses these prototypes to generate Components as Kubernetes YAML files, which are tuned for specific implementations by filling in the parameters of the prototypes. A different set of parameters can be used for each Kubernetes environment.
Download the ksonnet CLI. On macOS, you can also use brew install ksonnet/tap/ks.
Validate that you have version 0.12.0 of ksonnet:
$ ks version
ksonnet version: 0.12.0
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4
Install Kubeflow on Amazon EKS
First, create a new Kubernetes namespace for the Kubeflow deployment:
$ export NAMESPACE=kubeflow
$ kubectl create namespace ${NAMESPACE}
Next, download the current version of the Kubeflow deployment script; it will clone the Kubeflow repository from GitHub.
$ export KUBEFLOW_VERSION=0.2.5
$ export KUBEFLOW_DEPLOY=false
$ curl https://raw.githubusercontent.com/kubeflow/kubeflow/v${KUBEFLOW_VERSION}/scripts/deploy.sh | bash
The following commands will set the namespace in the ksonnet default environment to kubeflow and deploy Kubeflow on Amazon EKS.
$ cd kubeflow_ks_app/
$ ks env set default --namespace ${NAMESPACE}
$ ks apply default
Take note of the following:
- The KUBEFLOW_DEPLOY flag prevents the deploy.sh script from automatically deploying Kubeflow before we configure our ksonnet environment.
- Kubeflow enables anonymous usage reporting by default. If you do not want to provide usage reporting, execute ks param set kubeflow-core reportUsage false before you run ks apply default.
- Ksonnet uses GitHub to pull Kubeflow scripts. If you encounter GitHub API rate limiting, you can fix it by creating a GitHub API token. Refer to the Kubeflow troubleshooting guide for more details.
To check the status of Kubeflow’s deployment, list out the pods created in the kubeflow namespace:
$ kubectl get pod -n ${NAMESPACE}
You should get output like this:
NAME READY STATUS RESTARTS AGE
ambassador-849fb9c8c5-dglsc 2/2 Running 0 1m
ambassador-849fb9c8c5-jh8vk 2/2 Running 0 1m
ambassador-849fb9c8c5-vxvkg 2/2 Running 0 1m
centraldashboard-7d7744cccb-97r4v 1/1 Running 0 1m
tf-hub-0 1/1 Running 0 1m
tf-job-dashboard-bfc9bc6bc-6zzns 1/1 Running 0 1m
tf-job-operator-v1alpha2-756cf9cb97-rdrjj 1/1 Running 0 1m
The roles of these pods in Kubeflow are as follows:
- tf-hub-0: JupyterHub web application that spawns and manages Jupyter notebooks.
- tf-job-operator, tf-job-dashboard: Run and monitor TensorFlow jobs in Kubeflow.
- ambassador: Ambassador API Gateway that routes services for Kubeflow.
- centraldashboard: Kubeflow central dashboard UI.
A Data Scientist’s Workflow Using Kubeflow
Let’s walk through a simple tutorial provided in the Kubeflow examples repository.
We will use the github_issue_summarization example, which applies a sequence-to-sequence model to summarize text found in GitHub issues. Sequence-to-sequence (seq2seq) is a supervised learning model where an input is a sequence of tokens (in this example, a long string of words in a GitHub issue), and the output generated is another sequence of tokens (a predicted shorter string that is a summary of the GitHub issue). Other use cases of seq2seq include machine translation of languages and speech-to-text.
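To make the encoder-decoder idea more concrete, here is a minimal, illustrative seq2seq model sketched in Keras. The vocabulary sizes, layer dimensions, and GRU-based layout below are assumptions chosen for illustration only; the model actually trained in this tutorial is defined by the notebook and its train.py/seq2seq_utils.py helpers.
# A minimal encoder-decoder (seq2seq) sketch in Keras, for illustration only.
# Vocabulary sizes and layer dimensions are assumed values, not the tutorial's settings.
from tensorflow.keras.layers import Dense, Embedding, GRU, Input
from tensorflow.keras.models import Model

body_vocab_size = 8000    # distinct tokens in GitHub issue bodies (assumed)
title_vocab_size = 4500   # distinct tokens in issue titles (assumed)
latent_dim = 300          # size of the state passed from encoder to decoder

# Encoder: read the tokenized issue body and compress it into a fixed-size state.
encoder_inputs = Input(shape=(None,), name="issue_body_tokens")
x = Embedding(body_vocab_size, latent_dim)(encoder_inputs)
_, encoder_state = GRU(latent_dim, return_state=True)(x)

# Decoder: generate the title one token at a time, conditioned on the encoder state.
decoder_inputs = Input(shape=(None,), name="issue_title_tokens")
y = Embedding(title_vocab_size, latent_dim)(decoder_inputs)
y, _ = GRU(latent_dim, return_sequences=True, return_state=True)(y, initial_state=encoder_state)
decoder_outputs = Dense(title_vocab_size, activation="softmax")(y)

# Trained with teacher forcing: the decoder sees the previous ground-truth title token
# and learns to predict the next one.
seq2seq = Model([encoder_inputs, decoder_inputs], decoder_outputs)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
seq2seq.summary()
At inference time, the decoder is run step by step, feeding each predicted token back in until an end-of-sequence token is produced.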
First, we will use a Jupyter notebook to download the GitHub issues dataset and train the seq2seq model. Our Jupyter notebook will run as a Kubernetes pod with a GPU attached to speed up the training process. Once we have our trained model, we will serve it with a simple Python microservice using Seldon Core. Seldon Core allows us to deploy our machine learning models on Kubernetes and expose them via REST and gRPC automatically.
The detailed steps are depicted in the following diagram:
The steps we’ll be following are:
- Build a Docker image for a Jupyter notebook with GPU support, and push that image to the Amazon Elastic Container Registry (Amazon ECR).
- Launch the Jupyter notebook through JupyterHub.
- Perform machine learning and generate a trained model in the Jupyter notebook.
- Build a Docker image for model serving microservices using the Seldon Core Python wrapper and our trained model.
- Launch the prediction microservice using Seldon Core behind an Ambassador API gateway.
- Use the curl CLI to request a predicted summary for a given GitHub issue.
Build the Docker Image for a Jupyter Notebook
Execute the following commands to build a Docker image of the Jupyter notebook with GPU support. It will also include the necessary files for sequence-to-sequence training. The Docker image will be hosted on the Amazon Elastic Container Registry (Amazon ECR). We will use this image to perform our model training.
# Login to ECR, create an image repository
$ ACCOUNTID=`aws iam get-user|grep Arn|cut -f6 -d:`
$ `aws ecr get-login --no-include-email --region us-west-2`
$ aws ecr create-repository --repository-name tensorflow-notebook-gpu --region us-west-2
$ curl -o train.py https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/notebooks/train.py
$ curl -o seq2seq_utils.py https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/notebooks/seq2seq_utils.py
# Build, tag and push Jupyter notebook docker image to ECR
$ docker build -t $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/tensorflow-notebook-gpu:0.1 . -f-<<EOF
FROM gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu
RUN pip install ktext annoy sklearn h5py nltk pydot
COPY train.py /workdir/train.py
COPY seq2seq_utils.py /workdir/seq2seq_utils.py
EOF
$ docker push $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/tensorflow-notebook-gpu:0.1
Launch the Jupyter Notebook
Connect to the JupyterHub and spin up a new Jupyter notebook. JupyterHub can be accessed at http://localhost:8080 with a browser by port-forwarding the tf-hub-lb service.
$ kubectl port-forward svc/tf-hub-lb -n ${NAMESPACE} 8080:80
Sign in with any username; there is no password needed.
Enter the following in the Spawner Options:
Image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/tensorflow-notebook-gpu:0.1 (This is the Docker image that was built in the previous step. Replace 123456789012 with your $ACCOUNTID value.)
Extra Resource Limits: {"nvidia.com/gpu": "1"} (This setting assigns one GPU to the Jupyter notebook.)
A Jupyter notebook with a pod name jupyter-${username} is spawned with a Persistent Volume and one GPU resource. Run the following command to confirm that the pod is running:
$ kubectl get pod -n ${NAMESPACE}
NAME READY STATUS RESTARTS AGE
ambassador-585dd7b87-4fz2l 2/2 Running 0 26m
ambassador-585dd7b87-dlh9j 2/2 Running 0 26m
ambassador-585dd7b87-v45t2 2/2 Running 0 26m
centraldashboard-7d7744cccb-q5mzl 1/1 Running 0 26m
jupyter-enghwa 1/1 Running 1 4m
tf-hub-0 1/1 Running 0 26m
tf-job-dashboard-bfc9bc6bc-xhpn2 1/1 Running 0 26m
tf-job-operator-v1alpha2-756cf9cb97-9pp4z 1/1 Running 0 26m
Perform Machine Learning to Train Our Model
Once the Jupyter notebook is ready, launch a terminal inside the Jupyter notebook (Files → New → Terminal) and clone the Kubeflow examples repository:
git clone https://github.com/kubeflow/examples
The examples folder will now show up in the Jupyter notebook. Launch the Training.ipynb notebook in the examples/github_issue_summarization/notebooks folder.
This notebook will download the GitHub issues dataset and perform sequence-to-sequence training. At the end of the training, a Keras model seq2seq_model_tutorial.h5 will be produced. The GPU will be used to speed up the training (training one million rows takes about 15 minutes instead of a few hours, as it would on a standard CPU).
Before we run the notebook, make the following two changes:
- Cell 3: Change the DATA_DIR to /home/jovyan/github-issues-data
- Cell 7: Change training_data_size from 2000 to 1000000. This increased training data size will improve the prediction result. You can also use the full dataset (~4.8M rows), which will take about 1 hour to train.
Start the training in the Jupyter notebook with Cell -> Run All.
Once the training is completed, the model is saved in the Jupyter notebook’s pod. (Note: You can safely ignore the error in the BLEU Score evaluation). To serve this model as a microservice over a REST API, the following steps are needed:
- Create a model-serving microservice image called “github-issue-summarization” with the Python code IssueSummarization.py using Seldon Core’s Python wrapper (a simplified sketch of such a class follows this list).
- Copy the model files from the Jupyter notebook’s pod to this model-serving microservice image.
- Run this model-serving microservice image with Seldon Core.
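For context, Seldon Core’s Python wrapper expects a class whose name matches its file name and that exposes a predict() method; the wrapper turns that class into a microservice with REST and gRPC endpoints. The following is a simplified, hypothetical sketch of such a class, not the exact IssueSummarization.py from the Kubeflow examples repository: the real file also loads the body/title text preprocessors copied in the next section, and the generate_summary() helper below is only a stand-in for its inference code.
# IssueSummarization.py -- simplified, illustrative sketch of a Seldon Core
# Python-wrapper model class; not the exact file from kubeflow/examples.
from keras.models import load_model

class IssueSummarization(object):
    def __init__(self):
        # Load the trained Keras model copied into the Docker build directory.
        self.model = load_model("seq2seq_model_tutorial.h5")

    def predict(self, X, features_names):
        # X is a 2-D array whose first column holds GitHub issue bodies;
        # return one predicted title per input row.
        return [[self.generate_summary(body)] for body in X[:, 0]]

    def generate_summary(self, body_text):
        # Hypothetical placeholder: tokenize body_text with the fitted preprocessors,
        # run the encoder-decoder greedily, and detokenize the predicted title.
        raise NotImplementedError("See IssueSummarization.py in kubeflow/examples.")
The Seldon Core Python wrapper step below packages a class like this, together with the copied model files, into the Docker build directory for the microservice image.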
Build the Seldon Core Microservice Image
To build the model-serving microservice image, we will clone the github_issue_summarization example from the Kubeflow examples repository. The steps are as follows:
- Clone the Kubeflow examples repository for the necessary Python files in the “github_issue_summarization/notebooks” directory to serve the model.
- Execute Seldon Core’s Python wrapper script to prepare a Docker build directory for the microservice image.
- Copy the trained model’s files from the Jupyter notebook’s pod to the build directory so that the Docker build can package these files into the microservice image.
- Build the microservice image and push it to Amazon ECR.
The following commands accomplish these steps.
$ git clone https://github.com/kubeflow/examples serve/
$ cd serve/github_issue_summarization/notebooks
$ docker run -v $(pwd):/my_model seldonio/core-python-wrapper:0.7 /my_model IssueSummarization 0.1 gcr.io --base-image=python:3.6 --image-name=gcr-repository-name/issue-summarization
$ cd build/
# fix directory permission
$ sudo chown `id -u` .
$ PODNAME=`kubectl get pods --namespace=${NAMESPACE} --selector="app=jupyterhub" --output=template --template="{{with index .items 0}}{{.metadata.name}}{{end}}"`
$ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/seq2seq_model_tutorial.h5 .
$ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/body_pp.dpkl .
$ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/title_pp.dpkl .
# build and push microservice image to Amazon ECR
$ aws ecr create-repository --repository-name github-issue-summarization --region us-west-2
$ docker build --force-rm=true -t $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1 .
$ docker push $ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1
Serve Our Prediction as a Seldon Core Microservice
Install Seldon Core
A Seldon Core prototype is shipped with Kubeflow. Execute the following ksonnet commands inside the kubeflow_ks_app directory to generate the Seldon Core component and deploy Seldon Core:
$ ks generate seldon seldon --name=seldon
$ ks apply default -c seldon
Verify that Seldon Core is running by running kubectl get pods -n ${NAMESPACE}. You should see a pod named seldon-cluster-manager-*.
Kubeflow includes a component to serve Seldon Core microservices. Using these ksonnet commands, the github-issue-summarization microservice image created previously will be deployed as a Kubernetes deployment with two replicas.
$ ks generate seldon-serve-simple issue-summarization-model-serving \
--name=issue-summarization \
--image=$ACCOUNTID.dkr.ecr.us-west-2.amazonaws.com/github-issue-summarization:0.1 \
--replicas=2
$ ks apply default -c issue-summarization-model-serving
Testing the Prediction REST API
Seldon Core uses the ambassador API gateway to route requests to the microservice. Run these commands to port-forward the ambassador service to localhost:8081 and test the summary prediction REST API.
$ kubectl port-forward svc/ambassador -n ${NAMESPACE} 8081:80
Let’s generate a summary prediction of a sample GitHub issue using curl to POST to the REST API. As shown below, the summary of our long GitHub issue text is predicted by our model as “example of how to use it”.
$ curl -X POST -H 'Content-Type: application/json' -d '{"data":{"ndarray":[["There is lots of interest in serving with GPUs but we do not have a good example showing how to do this. I think it would be nice to have one. A simple example might be inception with a simple front end that allows people to upload images for classification."]]}}' http://localhost:8081/seldon/issue-summarization/api/v0.1/predictions
{
"meta": {
"puid": "2f9qdrbkro67lh93audeve9p60",
"tags": {
},
"routing": {
}
},
"data": {
"names": ["t:0"],
"ndarray": [["example of how to use it"]]
}
}
Summary
In this post, we first deployed Kubeflow on Amazon EKS with GPU worker nodes. We then walked through a typical data scientist’s workflow of training a machine learning model using a Jupyter notebook and then serving it as a microservice on Kubernetes.
To clean up, run kubectl delete namespace ${NAMESPACE} to delete all the resources created under the kubeflow namespace.
You can continue your exploration of Kubeflow on EKS in our open source Kubernetes and Machine Learning workshop.