Introducing AWS Step Functions integration with Amazon EKS

This is my first post on the AWS Container Blog since I joined AWS, and I could not be more excited to talk about two technologies now working together: serverless and Kubernetes, or more specifically AWS Step Functions and Amazon Elastic Kubernetes Service (Amazon EKS).

In my previous role, I set out to build a web application that would offer on-demand product demos to customers. Since my team was small and we had other responsibilities and commitments, I decided to go serverless first. This allowed us to focus on designing our business logic with minimal operational overhead, rather than on deploying infrastructure. AWS Step Functions proved really handy for building that business logic into workflows. The workflows would call various AWS Lambda functions, read and write data in DynamoDB, and leverage SNS, S3, and other serverless services. At one point, we identified some complex tasks in our workflows that needed more compute capacity than Lambda could offer, so we decided to run those tasks in containers. Step Functions was able to run containers with Amazon ECS/AWS Fargate, which was a perfect match for our needs. Step Functions would start a container, passing in parameters from previous steps, and wait for the code to run and return its result; the container would then be terminated and the workflow would continue.

This use case is not uncommon. Customers often turn to containers in cases where:

  • long processing tasks might run for more than 15 minutes
  • there is a need for GPU for machine learning or video rendering
  • the code is designed to be executed in a Microsoft Windows environment
  • support for ARM is required

In addition to Amazon ECS, customers now have the ability to execute jobs in containers orchestrated by Amazon Elastic Kubernetes Service (EKS). Amazon EKS is a managed Kubernetes service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane.

For more details on how to choose between Amazon ECS and Amazon EKS to run containers, read this blog post.

AWS Step Functions and Amazon EKS are now working better together with the following capabilities:

  • you can create workflows that interact with Amazon EKS thanks to the Service Integration APIs
  • you can create AWS Step Functions state machines from Kubernetes with the AWS Controllers for Kubernetes (ACK)

Let’s start by having a closer look at the API integration.

Service integration APIs

AWS Step Functions provides two types of service APIs for integrating with Amazon EKS. One lets you use the Amazon EKS API to create and delete an Amazon EKS cluster. The other lets you interact with a cluster (the one you created via the Amazon EKS API or an existing one) using the Kubernetes API and run jobs as part of your application’s workflow.

The sample project on the AWS Management Console shows an example of a workflow that creates an Amazon EKS cluster with a node group, then runs a job, and deletes the resources when completed.

EKS Cluster lifecycle workflow

With this Amazon EKS API integration, we can manage the creation and deletion of an Amazon EKS cluster, and the execution of jobs on it, from Step Functions. You can now build workflows that create ephemeral Amazon EKS clusters for end-to-end testing or Kubernetes conformance tests, for example. More details are available in the “Manage an Amazon EKS cluster” documentation.

Note: AWS Step Functions Standard Workflows executions are billed according to the number of state transitions processed and not for the wait time of each state (see the pricing page). This means that you would not be charged based on the time it takes for the Amazon EKS cluster to be created and ready. Amazon EKS is billed on a per hour basis once the cluster is ready as described on the Amazon EKS pricing page.

Let’s now consider another scenario. Your company has standardized on Kubernetes and leverages Amazon EKS to run containers in the cloud. You are already operating Amazon EKS clusters and want to create workflows that interact with this environment. You can now leverage the Step Functions integration with the Kubernetes API: RunJob and Call.
The eks:runJob service integration allows you to run a job on your Amazon EKS cluster. The eks:runJob.sync variant allows you to wait for the job to complete, and optionally retrieve logs. The eks:call service integration allows you to use the Kubernetes API to read and write Kubernetes resource objects via a Kubernetes API endpoint. More details in the Step Functions documentation.
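
As a rough illustration of what an eks:call task looks like, the Python sketch below assembles a Task state that reads a job with a GET request. The helper name eks_call_state and the job name are hypothetical; the Resource ARN and the parameter names are the ones used in the workflow example in this post.

```python
import json


def eks_call_state(method, path, next_state=None):
    """Build a Task state that calls the Kubernetes API through eks:call.

    The cluster details are read from the state input under $.cluster,
    the same convention the workflow example in this post uses.
    """
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::eks:call",
        "Parameters": {
            "ClusterName.$": "$.cluster.name",
            "CertificateAuthority.$": "$.cluster.ca",
            "Endpoint.$": "$.cluster.endpoint",
            "Method": method,
            "Path": path,
        },
    }
    # A Task state either transitions to another state or ends the workflow.
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# Hypothetical example: read the Job object named "my-eks-job".
get_job = eks_call_state("GET", "/apis/batch/v1/namespaces/default/jobs/my-eks-job")
print(json.dumps(get_job, indent=2))
```

Generating the state from a helper like this keeps the cluster parameters identical across every Kubernetes API call in a workflow; only the HTTP method and path change.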

Workflow example

In this example, we are going to run a job on an existing Amazon EKS cluster using a Step Functions Standard Workflow.
Our cluster is named ‘EKSCluster’ and has the following details:

EKSCluster details

Note the API server endpoint, the certificate authority, as well as the Cluster ARN. We will need them later on.

We are now going to create a simple Step Functions state machine that takes a few parameters as input, launches a job on EKSCluster, does something with the result of the job, and finally deletes the job. In this particular case, the “Do Something” step is just a Pass state that runs once the job has completed.

{
  "StartAt": "Run a job on EKS",
  "States": {
    "Run a job on EKS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::eks:runJob.sync",
      "Parameters": {
        "ClusterName.$": "$.cluster.name",
        "CertificateAuthority.$": "$.cluster.ca",
        "Endpoint.$": "$.cluster.endpoint",
        "LogOptions": {
          "RetrieveLogs": true
        },
        "Job": {
          "apiVersion": "batch/v1",
          "kind": "Job",
          "metadata": {
            "name": "my-eks-job"
          },
          "spec": {
            "backoffLimit": 0,
            "template": {
              "metadata": {
                "name": "pi2000-on-eks"
              },
              "spec": {
                "containers": [
                  {
                    "name": "pi-2000",
                    "image": "perl",
                    "command": [
                      "perl"
                    ],
                    "args": [
                      "-Mbignum=bpi",
                      "-wle",
                      "print '{ ' . '\"pi\": '. bpi(2000) . ' }';"
                    ]
                  }
                ],
                "restartPolicy": "Never"
              }
            }
          }
        }
      },
      "ResultSelector": {
        "status.$": "$.status",
        "logs.$": "$.logs..pi"
      },
      "ResultPath": "$.RunJobResult",
      "Next" : "Do Something"
    },
    "Do Something":{
      "Type" : "Pass",
      "Next": "Delete Node"
    },
    "Delete Node":{
      "Type": "Task",
      "Resource": "arn:aws:states:::eks:call",
      "Parameters": {
        "ClusterName.$": "$.cluster.name",
        "CertificateAuthority.$": "$.cluster.ca",
        "Endpoint.$": "$.cluster.endpoint",
        "Method": "DELETE",
        "Path": "/apis/batch/v1/namespaces/default/jobs/my-eks-job"
      },
      "End": true
    }
  }
}
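
The ResultSelector above uses the JSONPath expression $.logs..pi, where .. is the recursive descent operator: it collects every value stored under a pi key at any depth below logs. The Python sketch below mimics that lookup on a made-up result. The nested shape of sample_result, including the pod and container names, is only an assumption for illustration and not necessarily the exact payload Step Functions returns.

```python
def find_key(node, key):
    """Return every value stored under `key` at any depth of a JSON-like value."""
    matches = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                matches.append(value)
            # Keep descending: matches may also be nested deeper.
            matches.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            matches.extend(find_key(item, key))
    return matches


# Hypothetical task result; the real payload may be shaped differently.
sample_result = {
    "status": {"succeeded": 1},
    "logs": {
        "pods": {
            "my-eks-job-abcde": {
                "containers": {"pi-2000": {"log": {"pi": 3.14159}}}
            }
        }
    },
}

print(find_key(sample_result["logs"], "pi"))  # prints [3.14159]
```

Because the lookup does not depend on the pod name, which Kubernetes generates for each run, the selector keeps working from one execution to the next.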

If you were to execute the Step Functions state machine at this stage, you would probably get a 401 error, as the role is not authorized to connect to the Kubernetes API. Amazon EKS relies on Kubernetes Role-Based Access Control (RBAC), so we need to map the IAM role your Step Functions state machine assumes (in our example, EKS_StepFunctions_Integration) to a user and a role in Kubernetes RBAC. More details in the Amazon EKS documentation.

The authentication flow is as follows:

Note: make sure you have the AWS CLI and kubectl installed to execute the commands below. More details in the AWS CLI documentation.

We need to modify the aws-auth ConfigMap, so the first step is to copy it from the cluster with the following command:

kubectl get configmap -n kube-system aws-auth -o yaml > aws-auth.yaml

Use your favorite text editor to update the newly created aws-auth.yaml file. You will see an existing mapping for the node group of your cluster; we won’t modify it. We need to add a mapping between a user (here eks-stepfunctions) and the IAM role previously created, then apply the updated file to the cluster with kubectl apply -f aws-auth.yaml.
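
For reference, an entry under mapRoles pairs an IAM role ARN with a Kubernetes username. The Python sketch below only renders such an entry as text so that it can be pasted into aws-auth.yaml; the helper name map_role_entry is hypothetical and the account ID 123456789012 is a placeholder.

```python
def map_role_entry(role_arn, username, indent=4):
    """Render one mapRoles list item for the aws-auth ConfigMap as YAML text."""
    pad = " " * indent
    return (
        f"{pad}- rolearn: {role_arn}\n"
        f"{pad}  username: {username}\n"
    )


# Placeholder account ID; the role name is the one used in this post.
entry = map_role_entry(
    "arn:aws:iam::123456789012:role/EKS_StepFunctions_Integration",
    "eks-stepfunctions",
)
print(entry, end="")
```

The new entry goes below the existing node group mapping, at the same indentation, inside the mapRoles block.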

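Next, we create a Kubernetes Role that grants permissions on jobs in the default namespace. Let’s put it in a file named job_role.yaml:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1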
metadata:
  namespace: default
  name: run_job
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

We have the user and we have the role; now we bind them together with a RoleBinding resource. Let’s create another file, job_role_binding.yaml:

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: bind-run-job
subjects:
- kind: User
  name: eks-stepfunctions
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: run_job
  apiGroup: rbac.authorization.k8s.io

Next, we apply the Role and RoleBinding we created:

kubectl apply -f job_role.yaml
kubectl apply -f job_role_binding.yaml

Our Step Functions state machine is now ready to execute jobs on the Amazon EKS cluster.
It will require the cluster details (cluster name, certificate authority, and API server endpoint) we noted at the beginning as inputs for the state machine:

{
    "cluster" : {
      "name" : "EKSCluster",
      "ca" : "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJd01URXlOREUzTkRjd00xb1hEVE13TVRFeU1qRTNORGN3TTFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTUJYClpGcDNJTk9KdVVWQy8wTzdxaTkxQ3V5Qk04N0FkYUxHMDhBOEhZa0FDK2hIUk8zM00vVk9mTjZoSHdGL0EvZm8KSjZyWERyT041K29OVEZOdzNSbEFQODVDODdGdFkzZDdsN0VobHZtRjNEQVUybExDUWRjbGtQR0ZqdVB6ZFdLWQpPYUdoSm9EcktJalNBTlBvdjR6UDJGYWFYUWNiV2ZRSktYRkJ0YndBS1R0WXA4WW00YXFrZWx2WS9CNTVRMklQCjhxYWNMS2UxeGNuNVRUVHA5bTdrRkFxSWdqeS80ZUdQM3VvdVFwZkc0a0t4cXNUelllN3lxV0MrY3lxWXF5ZDQKWTVnK3RIanRlL2EzU1FINUFjd0tJUGJXQUlxNUVFMXVmRDBqZGlGYkQxbzJOUko2ZFR0bzlWQnllbjdGdWgzcQozbVV4Z2dCYzVjK2hpUlJONDRrQ0F3RUFBYU1qTUNFd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFHN1FjcGRuaDVtK0Q5SjhwKzBNNElsQ2xmMmYKYlhZVytRaFdpd2lqQ0JuclR4Uy92Z1NaanhqTDBIbkcwR2dWVkJwcldVb1dvdnlpejRocWtJVUtTVkNXY0l3YwpsM1ExM2VMY25ra0E3QVdUcUw1dWF4S3p1c0xPbDRwdG5yZ2F0ZHRVRnc3WE9OZFp6cTFBeWRpbElmckNUbDZBCitkeTlIRGI3WnNUUm0rU201b0pLOWZyU3luc1gwNWVJNk9Od21ZWXdXMXFHbmR0eEN3VkQxZUJQajBWOEFGTVAKMHRibWpmU0crbENOd0J3QmhVbGhsNDhpWW9tR2dTOWR4ZGhaZy9sNnNmMzI5U0srUkRJM01TOEhQNVl5V1kvcQpKYTlST2RMTG5oYjV2S3hUTlZOakVUelp2R05rdThleWxpYUVMTTNUU2VKVlRFc2phUFNLZnFoNERUND0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=",
      "endpoint" : "https://171CE4AB2E76E2F1ABEDD0E37DD9A5D6.sk1.us-west-1.eks.amazonaws.com"
    }
}
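
Rather than copying these values by hand, the input can be derived from the cluster description. The Python sketch below assumes a dictionary shaped like the response of the Amazon EKS DescribeCluster API (for example, the JSON printed by aws eks describe-cluster --name EKSCluster). The helper name build_execution_input and the sample values are illustrative placeholders.

```python
import json


def build_execution_input(describe_cluster_response):
    """Map a DescribeCluster-style response to the state machine input."""
    cluster = describe_cluster_response["cluster"]
    return {
        "cluster": {
            "name": cluster["name"],
            # Base64-encoded certificate, passed through unchanged.
            "ca": cluster["certificateAuthority"]["data"],
            "endpoint": cluster["endpoint"],
        }
    }


# Made-up placeholder values standing in for a real response.
sample_response = {
    "cluster": {
        "name": "EKSCluster",
        "endpoint": "https://EXAMPLE.sk1.us-west-1.eks.amazonaws.com",
        "certificateAuthority": {"data": "BASE64_CA_DATA"},
    }
}

execution_input = build_execution_input(sample_response)
print(json.dumps(execution_input, indent=2))
```

The resulting JSON can then be supplied as the execution input, for instance in the console or through the --input option of aws stepfunctions start-execution.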

Before the job gets deleted, we can quickly double-check via the CLI that it was successfully executed on our cluster.

% kubectl describe job/my-eks-job 
Name:           my-eks-job
Namespace:      default
Selector:       controller-uid=8e2fa4c0-7cb3-488f-8626-d9c737399b12
Labels:         controller-uid=8e2fa4c0-7cb3-488f-8626-d9c737399b12
                job-name=my-eks-job
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Wed, 25 Nov 2020 11:28:49 -0800
Completed At:   Wed, 25 Nov 2020 11:28:56 -0800
Duration:       7s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=8e2fa4c0-7cb3-488f-8626-d9c737399b12
           job-name=my-eks-job
  Containers:
   pi-2000:
    Image:      perl
    Port:       <none>
    Host Port:  <none>
    Command:
      perl
    Args:
      -Mbignum=bpi
      -wle
      print '{ ' . '"pi": '. bpi(2000) . ' }';
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  48s   job-controller  Created pod: my-eks-job-67lnb
  Normal  Completed         41s   job-controller  Job completed

Finally, we can see that the workflow was successfully executed.

Run Job on Amazon EKS with AWS Step Functions

AWS Controllers for Kubernetes (ACK) Step Functions Controller

There are a variety of ways to create a Step Functions state machine, such as using the AWS Console, AWS SDK, AWS CloudFormation, or AWS Cloud Development Kit (CDK). If you prefer to stay within Kubernetes-native tooling, you can now use the Step Functions controller from the AWS Controllers for Kubernetes (ACK) project to create and manage Step Functions state machines and activities directly from Kubernetes. It’s in Developer Preview on GitHub, so you can already give it a test drive, keeping in mind that the installation process for ACK controllers is still being simplified.

It works like you’d expect for a Kubernetes resource. We first create a statemachine.yaml file containing a StateMachine custom resource:

apiVersion: sfn.services.k8s.aws/v1alpha1
kind: StateMachine
metadata:
  name: my-ack-machine
spec:
  name: MyAckMachine
  roleARN: "arn:aws:iam::123456789012:role/service-role/MySFNRole"
  definition: "{ \"StartAt\": \"HelloWorld\", \"States\": { \"HelloWorld\": { \"Type\": \"Pass\", \"Result\": \"Hello World!\", \"End\": true }}}"

Then we give that definition to Kubernetes using kubectl, and moments later the state machine appears in Step Functions:

kubectl apply -f ~/statemachine.yaml
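
Hand-escaping the definition string in the manifest is error-prone. One option, sketched below in Python, is to keep the definition as a regular data structure and let json.dumps produce the escaped string. A JSON string literal is also a valid YAML double-quoted scalar, so the output can be pasted after definition: in the manifest. The variable names are illustrative.

```python
import json

# The same Hello World definition as in the StateMachine manifest.
definition = {
    "StartAt": "HelloWorld",
    "States": {
        "HelloWorld": {"Type": "Pass", "Result": "Hello World!", "End": True}
    },
}

# First dumps: serialize the definition to compact JSON text.
definition_json = json.dumps(definition, separators=(",", ":"))
# Second dumps: wrap that text in quotes and escape the inner quotes.
escaped = json.dumps(definition_json)

print(f"  definition: {escaped}")
```

Decoding the printed value twice with json.loads gives back the original structure, which is a quick way to check that nothing was lost in the escaping.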

Conclusion

In this article, we reviewed the new AWS Step Functions integration with Amazon EKS, which opens up a new world of possibilities. It lets us integrate AWS Step Functions with existing clusters to run jobs and execute other operations on those clusters. It also allows us to create serverless workflows that instantiate short-lived Amazon EKS clusters to execute jobs. Finally, the new ACK Step Functions Controller offers developers a way to create workflows with AWS Step Functions directly from Kubernetes.

Have fun building awesome workflows!