Feb 2024: This blog has been updated for Karpenter version v0.33.1 and v1beta1 specification.
About Karpenter
Karpenter is an open-source node lifecycle management project built for Kubernetes. It observes the aggregate resource requests of unschedulable pods and decides when to launch new nodes and when to terminate them, sending commands to the underlying cloud provider, in order to reduce scheduling latency and infrastructure cost. Karpenter launches nodes with just enough compute resources to fit the unschedulable pods, which makes for efficient bin-packing, and it works in tandem with the Kubernetes scheduler to bind those pods to the newly provisioned nodes.
Why Karpenter
Before the launch of Karpenter, Kubernetes users had to rely on Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler to dynamically adjust the compute capacity of their clusters. One challenge with Cluster Autoscaler is significant deployment latency, because many pods must wait for a node to scale up before they can be scheduled. Nodes can take multiple minutes to become available, and because Cluster Autoscaler does not bind pods to nodes (scheduling decisions are made by the kube-scheduler), pods wait longer, which can increase scheduling latency for critical workloads.
One of the main objectives of Karpenter is to simplify the management of capacity. If you are familiar with other autoscalers, you will notice that Karpenter takes a different approach, referred to as group-less autoscaling. Traditionally we have used the concept of a node group as the element of control that defines the characteristics of the capacity provided (e.g., On-Demand, EC2 Spot, GPU nodes) and that controls the desired scale of the group in the cluster. In AWS, a node group is implemented as an Auto Scaling group. Over time, clusters using this paradigm that run different types of applications requiring different capacity types end up with a complex configuration and operational model in which node groups must be defined and provided in advance.
Configuring Nodepools
Karpenter’s job is to add nodes to handle unschedulable pods (pods with the status condition Unschedulable=True set by the kube-scheduler), schedule pods on those nodes, and remove the nodes when they are not needed. To configure Karpenter, you create nodepools that define how Karpenter manages unschedulable pods and expires nodes.
A NodePool sets constraints on the nodes that Karpenter can create and on the pods that can run on those nodes.
It also lets pods request nodes based on instance type, architecture, OS, or other attributes by adding specifications to the Kubernetes pod deployment. The pod scheduling constraints (resource requests, node selection, node affinity, and topology spread) must fall within the NodePool constraints for the pods to be deployed on Karpenter-provisioned nodes; otherwise the pods will not be deployed.
In many scenarios a single NodePool can satisfy all requirements: combining scheduling constraints on the NodePool and on the pods covers use cases such as different teams having different constraints for running their workloads (for example, one team can use only nodes in a specific AZ while another uses Arm64 hardware nodes), separation for billing purposes, or different deprovisioning requirements.
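If the teams are instead given separate NodePools, the split might look roughly like the sketch below. It only writes two manifests to a local file and applies nothing. The NodePool names (team-a, team-b), the zone, and the file name are hypothetical; the requirement keys are the well-known labels used elsewhere in this post, and both pools assume an EC2NodeClass named default, like the one created in the walkthrough.

```shell
# Hypothetical file name; this sketch does not touch the cluster.
MANIFEST="${MANIFEST:-team-nodepools.yaml}"

cat > "${MANIFEST}" <<'EOF'
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: team-a            # hypothetical: team limited to one Availability Zone
spec:
  template:
    spec:
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a"]
      nodeClassRef:
        name: default
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: team-b            # hypothetical: team limited to Arm64 hardware
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
      nodeClassRef:
        name: default
EOF

echo "wrote ${MANIFEST}; review it, then apply with: kubectl apply -f ${MANIFEST}"
```

In practice each team would also need a way to target its own pool, for example a taint or a label set in the NodePool template; that part is left out of this sketch.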
Use cases for Nodepool Constraints
With Karpenter layered constraints, you can be sure that the precise type and amount of resources needed are available to your pods.
However, for specific requirements such as choosing an instance type or Availability Zone, we can tighten the constraints defined in a NodePool by adding scheduling constraints to the pod spec.
Below are some use cases for layering pod scheduling constraints on top of NodePool requirements so that Karpenter launches nodes that match the unschedulable pods:
- Needing to run on a specific instance type in zones where dependent applications or storage are available
- Requiring certain kinds of processors or other hardware
Upgrading nodes
A straightforward way to upgrade nodes is to set spec.disruption.expireAfter: nodes are terminated after the set period of time and replaced with newer nodes. The recommended method to patch your Kubernetes worker nodes is Drift; please refer to the blog How to upgrade Amazon EKS worker nodes with Karpenter Drift. You can also read Karpenter Disruption for more details.
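As a minimal sketch of the expiry approach, the helper below converts a node lifetime in days into the hour-based duration string that expireAfter takes, and a second function patches that value onto a NodePool. Both function names are illustrative, and the patch assumes kubectl access to a cluster where the NodePool already exists.

```shell
# Convert a number of days into the hour-based duration string used by expireAfter.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

# Patch spec.disruption.expireAfter on a NodePool (needs kubectl access to the cluster).
# Arguments: NodePool name, node lifetime in days.
set_node_expiry() {
  kubectl patch nodepool "$1" --type merge \
    -p "{\"spec\":{\"disruption\":{\"expireAfter\":\"$(days_to_hours "$2")\"}}}"
}

days_to_hours 30   # prints 720h
```

For example, `set_node_expiry default 7` would ask Karpenter to recycle nodes in the default NodePool after 168 hours.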
Walkthrough
In this section, you will provision an EKS cluster, deploy Karpenter, deploy a sample application, and demonstrate node scaling with Karpenter. You will also see how to add pod-level constraints that stay within the NodePool requirements, for application workloads or teams that need different instance capacity.
Prerequisites
Karpenter Deployment Tasks
1) Set the following environment variables:
export KARPENTER_NAMESPACE=kube-system
export KARPENTER_VERSION=v0.33.1
export K8S_VERSION=1.27
export AWS_PARTITION="aws" # if you are not using standard partitions, you may need to configure to aws-cn / aws-us-gov
export CLUSTER_NAME="karpenter-demo"
export AWS_DEFAULT_REGION="us-west-2"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export TEMPOUT=$(mktemp)
echo $KARPENTER_NAMESPACE $KARPENTER_VERSION $K8S_VERSION $CLUSTER_NAME $AWS_DEFAULT_REGION $AWS_ACCOUNT_ID $TEMPOUT
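Because the later commands silently misbehave if one of these variables is empty (for example when the AWS CLI call fails), a small check like the following may help. It is only a sketch; the function name is illustrative and not part of Karpenter or eksctl.

```shell
# Report any required variable that is unset or empty.
# Returns non-zero if at least one is missing.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

check_required_vars KARPENTER_NAMESPACE KARPENTER_VERSION K8S_VERSION \
  CLUSTER_NAME AWS_DEFAULT_REGION AWS_ACCOUNT_ID TEMPOUT \
  && echo "all variables set" \
  || echo "set the missing variables before continuing"
```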
2) Create an Amazon EKS Cluster and IAM Role for KarpenterController
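- Create a cluster with eksctl. The commands below first deploy the CloudFormation template with the supporting AWS resources that Karpenter needs (such as the KarpenterNodeRole and KarpenterControllerPolicy referenced in the cluster configuration), then create a basic cluster with an initial managed node group and set up an IAM OIDC provider for the cluster to enable IAM roles for pods.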
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > $TEMPOUT \
&& aws cloudformation deploy \
--stack-name "Karpenter-${CLUSTER_NAME}" \
--template-file "${TEMPOUT}" \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "ClusterName=${CLUSTER_NAME}"
eksctl create cluster -f - <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
  version: "${K8S_VERSION}"
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
iam:
  withOIDC: true
  podIdentityAssociations:
  - namespace: "${KARPENTER_NAMESPACE}"
    serviceAccountName: karpenter
    roleName: ${CLUSTER_NAME}-karpenter
    permissionPolicyARNs:
    - arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
## Optionally run on fargate or on k8s 1.23
# Pod Identity is not available on fargate
# https://docs.aws.amazon.com/eks/latest/userguide/pod-identities.html
# iam:
#   withOIDC: true
#   serviceAccounts:
#   - metadata:
#       name: karpenter
#       namespace: "${KARPENTER_NAMESPACE}"
#     roleName: ${CLUSTER_NAME}-karpenter
#     attachPolicyARNs:
#     - arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
#     roleOnly: true
iamIdentityMappings:
- arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
  username: system:node:{{EC2PrivateDNSName}}
  groups:
  - system:bootstrappers
  - system:nodes
  ## If you intend to run Windows workloads, the kube-proxy group should be specified.
  # For more information, see https://github.com/aws/karpenter/issues/5099.
  # - eks:kube-proxy-windows
managedNodeGroups:
- instanceType: m5.large
  amiFamily: AmazonLinux2
  name: ${CLUSTER_NAME}-ng
  desiredCapacity: 2
  minSize: 1
  maxSize: 10
addons:
- name: eks-pod-identity-agent
## Optionally run on fargate
# fargateProfiles:
# - name: karpenter
#   selectors:
#   - namespace: "${KARPENTER_NAMESPACE}"
EOF
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
echo $CLUSTER_ENDPOINT $KARPENTER_IAM_ROLE_ARN
- Install Karpenter Helm Chart
# Logout of helm registry to perform an unauthenticated pull against the public ECR
helm registry logout public.ecr.aws
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi \
--wait
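Before moving on, it can be worth confirming that the controller actually came up. The sketch below wraps two standard kubectl calls in a function; the function name is illustrative, and it assumes the Helm release name karpenter used above, which should result in a Deployment also named karpenter.

```shell
# Wait for the Karpenter controller Deployment to become available, then list its pods.
# Argument: namespace Karpenter was installed into (defaults to kube-system).
verify_karpenter() {
  ns="${1:-kube-system}"
  kubectl rollout status deployment/karpenter -n "${ns}" --timeout=120s \
    && kubectl get pods -n "${ns}" -l app.kubernetes.io/name=karpenter
}
```

It would be called as `verify_karpenter "${KARPENTER_NAMESPACE}"` once the helm command returns.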
Deploy the nodepool and application pods with layered constraints
Deploy the below Karpenter nodepool spec that has the following requirements:
- Architecture type (arm64 & amd64)
- Capacity type (Spot & On-demand)
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "KarpenterNodeRole-${CLUSTER_NAME}" # replace with your cluster name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}" # replace with your cluster name
EOF
Run the application deployment on a specific capacity, instance type, hardware and availability zone using Pod scheduling constraints.
- The sample deployment below defines a nodeSelector that uses topology.kubernetes.io/zone to choose a specific Availability Zone, karpenter.sh/capacity-type and kubernetes.io/arch: arm64 to request an on-demand arm64 instance, and node.kubernetes.io/instance-type to request a specific instance type, so that Karpenter launches new nodes that satisfy these pod scheduling constraints.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: r6gd.xlarge
        karpenter.sh/capacity-type: on-demand
        topology.kubernetes.io/zone: us-west-2a
        kubernetes.io/arch: arm64
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.5
          resources:
            requests:
              cpu: 1
EOF
- Scale the above deployment to see Karpenter scale out nodes. Karpenter selects capacity that matches the above configuration through the EC2 CreateFleet API for the application pods.
kubectl scale deployment inflate --replicas 3
- Review the Karpenter pod logs for events and more details.
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
- Example snippet of the logs.
eksadmin:~/environment $ kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
...
...
{"level":"INFO","time":"2024-01-25T07:20:01.654Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"a70b39e","pods":"default/inflate-79c97d78f9-84hsw, default/inflate-79c97d78f9-fjhz6, default/inflate-79c97d78f9-jt4hj","duration":"12.0951ms"}
{"level":"INFO","time":"2024-01-25T07:20:01.654Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"a70b39e","nodeclaims":1,"pods":3}
{"level":"INFO","time":"2024-01-25T07:20:01.668Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"a70b39e","nodepool":"default","nodeclaim":"default-xqjhj","requests":{"cpu":"3150m","pods":"6"},"instance-types":"r6gd.xlarge"}
{"level":"INFO","time":"2024-01-25T07:20:04.328Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"a70b39e","nodeclaim":"default-xqjhj","provider-id":"aws:///us-west-2a/i-04a71190d4888e0e3","instance-type":"r6gd.xlarge","zone":"us-west-2a","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"29258Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2024-01-25T07:20:39.532Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"a70b39e","nodeclaim":"default-xqjhj","provider-id":"aws:///us-west-2a/i-04a71190d4888e0e3","node":"ip-192-168-187-48.us-west-2.compute.internal"}
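The JSON log lines are dense, so when following a scale-up it can help to reduce them to just the message field. The filter below is a rough sketch using grep and cut; the function name is illustrative, and because it is text-based rather than a real JSON parser it assumes messages contain no escaped double quotes.

```shell
# Print only the "message" field of each JSON log line read from stdin.
# Text-based extraction: assumes the message contains no escaped double quotes.
karpenter_messages() {
  grep -o '"message":"[^"]*"' | cut -d '"' -f 4
}
```

A possible use is `kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -c controller | karpenter_messages`, which for the snippet above would list messages such as "created nodeclaim" and "launched nodeclaim" in order.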
Validate the nodes and application pods with the commands below; the pods should be in the Running state.
kubectl get node -L node.kubernetes.io/instance-type,kubernetes.io/arch,karpenter.sh/capacity-type
kubectl get pods -o wide
- Example snippet of the Node output and Pods output.
eksadmin:~/environment $ kubectl get node -L node.kubernetes.io/instance-type,kubernetes.io/arch,karpenter.sh/capacity-type
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE ARCH CAPACITY-TYPE
ip-192-168-187-48.us-west-2.compute.internal Ready <none> 97s v1.27.9-eks-5e0fdde r6gd.xlarge arm64 on-demand
ip-192-168-30-153.us-west-2.compute.internal Ready <none> 102m v1.27.9-eks-5e0fdde m5.large amd64
ip-192-168-71-234.us-west-2.compute.internal Ready <none> 102m v1.27.9-eks-5e0fdde m5.large amd64
eksadmin:~/environment $
eksadmin:~/environment $ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
inflate-79c97d78f9-84hsw 1/1 Running 0 2m15s 192.168.168.238 ip-192-168-187-48.us-west-2.compute.internal <none> <none>
inflate-79c97d78f9-fjhz6 1/1 Running 0 2m15s 192.168.161.63 ip-192-168-187-48.us-west-2.compute.internal <none> <none>
inflate-79c97d78f9-jt4hj 1/1 Running 0 2m15s 192.168.185.20 ip-192-168-187-48.us-west-2.compute.internal <none> <none>
eksadmin:~/environment $
From the above demonstration we can see that Karpenter applies layered constraints to launch nodes that satisfy multiple scheduling constraints of a workload, such as instance type, specific AZ, and hardware architecture.
Group-less node upgrades
As mentioned in an earlier section, when using node groups (self-managed or managed) with an EKS cluster, upgrading the worker nodes to a newer Kubernetes version means either migrating to a new node group (self-managed) or launching a new Auto Scaling group of worker nodes (managed), as described in Managed nodegroup upgrade behaviour. With Karpenter's group-less autoscaling, by contrast, node upgrades are handled through Drift.
Drift handles changes to the NodePool/EC2NodeClass. For Drift, values in the NodePool/EC2NodeClass are reflected in the NodeClaimTemplateSpec/EC2NodeClassSpec in the same way that they’re set. Karpenter uses Drift to upgrade Kubernetes nodes and replaces them in a rolling fashion. With Karpenter v0.33.x the Drift feature gate is enabled by default, so node upgrades follow the Drift workflow.
Note: Karpenter supports custom AMIs. You can specify amiSelectorTerms in the EC2NodeClass, which fully overrides the default AMIs selected by the EC2NodeClass amiFamily.
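To make the note concrete, the sketch below writes an EC2NodeClass that pins a custom AMI by ID through amiSelectorTerms. The AMI ID is a placeholder rather than a real image, the resource name custom-ami and the file name are hypothetical, and the remaining fields simply mirror the default EC2NodeClass from this walkthrough. Nothing is applied to the cluster.

```shell
# Placeholder values; replace the AMI ID with one that exists in your account and region.
CUSTOM_AMI_ID="${CUSTOM_AMI_ID:-ami-0123456789abcdef0}"
CLUSTER_NAME="${CLUSTER_NAME:-karpenter-demo}"

cat > custom-ami-nodeclass.yaml <<EOF
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: custom-ami            # hypothetical name
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  amiSelectorTerms:
    - id: "${CUSTOM_AMI_ID}"  # overrides the default AMIs for the amiFamily
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
EOF

echo "wrote custom-ami-nodeclass.yaml"
```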
- Validate the current EKS Cluster Kubernetes version with below command.
aws eks describe-cluster --name ${CLUSTER_NAME} | grep -i version
- Example snippet of the above command.
eksadmin:~/environment $ aws eks describe-cluster --name ${CLUSTER_NAME} | grep -i version
"version": "1.27",
"platformVersion": "eks.11",
"alpha.eksctl.io/eksctl-version": "0.167.0",
eksadmin:~/environment $
- Deploy PodDisruptionBudget for your Application deployment. PodDisruptionBudget (PDB) limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inflate-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inflate
EOF
Note: With a PDB you can set minAvailable or maxUnavailable as an integer or as a percentage. Please refer to the Kubernetes documentation on pod disruptions and how to configure them for more details.
Example output for the above PDB and the sample application deployment configured in the earlier section:
eksadmin:~/environment $ kubectl get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
inflate-pdb 2 N/A 1 9s
eksadmin:~/environment $
eksadmin:~/environment $ kubectl get deploy inflate
NAME READY UP-TO-DATE AVAILABLE AGE
inflate 3/3 3 3 12m
eksadmin:~/environment $
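The ALLOWED DISRUPTIONS value of 1 follows from simple arithmetic: with an integer minAvailable, the budget allows the number of healthy pods minus minAvailable, and never less than zero. The illustrative helper below reproduces that calculation for the 3 running inflate pods and minAvailable of 2.

```shell
# Disruptions a PDB with an integer minAvailable currently allows:
# healthy pods minus minAvailable, never below zero.
allowed_disruptions() {
  healthy="$1"
  min_available="$2"
  allowed=$(( healthy - min_available ))
  if [ "${allowed}" -lt 0 ]; then
    allowed=0
  fi
  echo "${allowed}"
}

allowed_disruptions 3 2   # prints 1
```

This is why Karpenter can evict only one inflate pod at a time while it replaces a node.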
- Upgrade the EKS cluster to a newer Kubernetes version via the console or eksctl, as described in the EKS documentation.
- We can see that the cluster was upgraded successfully to 1.28.
eksadmin:~/environment $ aws eks describe-cluster --name ${CLUSTER_NAME} | grep -i version
"version": "1.28",
"platformVersion": "eks.7",
"alpha.eksctl.io/eksctl-version": "0.167.0",
eksadmin:~/environment $
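For readers who use eksctl for the control plane upgrade, a minimal sketch follows. It assumes a one-minor-version jump such as 1.27 to 1.28, as in this walkthrough, and the function name is illustrative.

```shell
# Upgrade the EKS control plane to the given Kubernetes version with eksctl.
# Without --approve, eksctl only prints the changes it would make.
# Arguments: cluster name, target Kubernetes version.
upgrade_control_plane() {
  cluster="$1"
  target_version="$2"
  eksctl upgrade cluster --name "${cluster}" --version "${target_version}" --approve
}
```

It would be invoked as `upgrade_control_plane "${CLUSTER_NAME}" 1.28`.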
- Validate the nodes and application pods with the commands below. We can see that the Karpenter-launched nodes are upgraded to 1.28, the same as the EKS cluster Kubernetes version.
kubectl get node -L node.kubernetes.io/instance-type,kubernetes.io/arch,karpenter.sh/capacity-type
kubectl get pods -o wide
kubectl get nodeclaims.karpenter.sh
Checking our workload and the node drifted by Karpenter, we can see that the new node runs version 1.28, because Karpenter used the latest EKS optimized AMI for the new EKS cluster version, 1.28. We can observe the Drift events in the Karpenter controller logs.
eksadmin:~/environment $ kubectl get node -L node.kubernetes.io/instance-type,kubernetes.io/arch,karpenter.sh/capacity-type
NAME STATUS ROLES AGE VERSION INSTANCE-TYPE ARCH CAPACITY-TYPE
ip-192-168-30-153.us-west-2.compute.internal Ready <none> 159m v1.27.9-eks-5e0fdde m5.large amd64
ip-192-168-71-234.us-west-2.compute.internal Ready <none> 159m v1.27.9-eks-5e0fdde m5.large amd64
ip-192-168-79-137.us-west-2.compute.internal Ready <none> 44m v1.28.5-eks-5e0fdde r6gd.xlarge arm64 on-demand
eksadmin:~/environment $
eksadmin:~/environment $ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
inflate-79c97d78f9-6csvw 1/1 Running 0 124m 192.168.95.216 ip-192-168-79-137.us-west-2.compute.internal <none> <none>
inflate-79c97d78f9-9pjvc 1/1 Running 0 124m 192.168.89.224 ip-192-168-79-137.us-west-2.compute.internal <none> <none>
inflate-79c97d78f9-w9fg2 1/1 Running 0 124m 192.168.84.169 ip-192-168-79-137.us-west-2.compute.internal <none> <none>
eksadmin:~/environment $
eksadmin:~/environment $ kubectl get nodeclaims.karpenter.sh
NAME TYPE ZONE NODE READY AGE
default-lmxcb r6gd.xlarge us-west-2a ip-192-168-79-137.us-west-2.compute.internal True 47m
eksadmin:~/environment $
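On a larger cluster it is easier to follow a drift rollout by counting nodes per kubelet version than by reading the full table. The filter below is a sketch with an illustrative function name; it assumes the default `kubectl get node` column layout, where VERSION is the fifth column.

```shell
# Count nodes per kubelet version from `kubectl get node` table output on stdin.
# Skips the header row; VERSION is the fifth column in the default output.
count_node_versions() {
  awk 'NR > 1 { count[$5]++ } END { for (v in count) print count[v], v }' | sort -k 2
}
```

Piping `kubectl get node` into `count_node_versions` for the output above would report two nodes on v1.27.9-eks-5e0fdde (the managed node group, which Karpenter does not manage) and one node on v1.28.5-eks-5e0fdde.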
- Review the Karpenter controller pod logs for events and more details.
kubectl logs -f -n kube-system -l app.kubernetes.io/name=karpenter -c controller
- Example snippet of the logs.
{"level":"INFO","time":"2024-01-25T07:34:23.935Z","logger":"controller.disruption","message":"disrupting via drift replace, terminating 1 candidates ip-192-168-187-48.us-west-2.compute.internal/r6gd.xlarge/on-demand and replacing with on-demand node from types r6gd.xlarge","commit":"a70b39e"}
{"level":"INFO","time":"2024-01-25T07:34:24.003Z","logger":"controller.disruption","message":"created nodeclaim","commit":"a70b39e","nodepool":"default","nodeclaim":"default-lmxcb","requests":{"cpu":"3150m","pods":"6"},"instance-types":"r6gd.xlarge"}
{"level":"INFO","time":"2024-01-25T07:34:25.893Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"a70b39e","nodeclaim":"default-lmxcb","provider-id":"aws:///us-west-2a/i-07c00e23630c2bf58","instance-type":"r6gd.xlarge","zone":"us-west-2a","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"29258Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2024-01-25T07:35:00.487Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"a70b39e","nodeclaim":"default-lmxcb","provider-id":"aws:///us-west-2a/i-07c00e23630c2bf58","node":"ip-192-168-79-137.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-01-25T07:35:10.361Z","logger":"controller.node.termination","message":"tainted node","commit":"a70b39e","node":"ip-192-168-187-48.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-01-25T07:35:20.387Z","logger":"controller.node.termination","message":"deleted node","commit":"a70b39e","node":"ip-192-168-187-48.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-01-25T07:35:20.744Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"a70b39e","nodeclaim":"default-xqjhj","node":"ip-192-168-187-48.us-west-2.compute.internal","provider-id":"aws:///us-west-2a/i-04a71190d4888e0e3"}
Note: In the above logs, we can see that Karpenter detected the node as drifted, launched a new node for the workload from the latest EKS optimized AMI for 1.28, and then cordoned, drained, and deleted the old node.
From the above demonstration we can see that Karpenter respected the PDB and used the Drift disruption workflow to upgrade the nodes it launched, giving group-less management of worker node upgrades.
In general, you can configure Karpenter to disrupt nodes through your NodePool in multiple ways by using spec.disruption.consolidationPolicy, spec.disruption.consolidateAfter, or spec.disruption.expireAfter. You can use node expiry to periodically recycle nodes for security reasons, and Drift to upgrade the nodes. Please refer to Karpenter Disruption for more details.
Cleanup
Delete all the NodePools (custom resources) that were created.
kubectl delete nodepool default
Remove Karpenter and delete the infrastructure from your AWS account.
helm uninstall karpenter --namespace kube-system
eksctl delete cluster --name ${CLUSTER_NAME}
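The CloudFormation stack deployed at the start of the walkthrough is separate from the eksctl-managed cluster, so it likely needs to be removed as well. A minimal sketch, assuming the stack name pattern Karpenter-${CLUSTER_NAME} used earlier (the function name is illustrative):

```shell
# Delete the CloudFormation stack that holds the Karpenter supporting resources.
# Argument: cluster name, which the stack name is derived from.
delete_karpenter_stack() {
  aws cloudformation delete-stack --stack-name "Karpenter-$1"
}
```

It would be called as `delete_karpenter_stack "${CLUSTER_NAME}"` after the cluster deletion completes.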
Conclusion
In this blog, we demonstrated how nodes can be scaled with different options for each use case by combining NodePool requirements, well-known Kubernetes labels and taints, and pod scheduling constraints in the deployment, so that pods are deployed on Karpenter-provisioned nodes. This shows that we can run different types of workloads on different capacity, according to the requirements of each use case. We also saw how nodes launched by Karpenter are upgraded by using Drift with the NodePool.