Containers

Automating custom networking to solve IPv4 exhaustion in Amazon EKS

Introduction

When the Amazon VPC Container Network Interface (CNI) plugin assigns IPv4 addresses to Pods, it allocates them from the VPC CIDR range assigned to the cluster. While this makes Pods first-class citizens within the VPC network, it often leads to exhaustion of the limited number of IPv4 addresses available in the VPC. The long-term solution to this problem is the adoption of IPv6; however, many customers aren't yet ready to make that kind of decision at the organization level.

Custom Pod networking using VPC CNI is one approach to alleviate IPv4 address exhaustion when deploying large-scale workloads to an Amazon EKS cluster.

VPC CNI is available today as an Amazon Elastic Kubernetes Service (Amazon EKS) core add-on that provides native networking for your Amazon EKS cluster. Its custom networking feature allows worker nodes to allocate secondary network interfaces outside of the primary subnet of the host node, which customers can use to mitigate IPv4 address exhaustion. Unfortunately, solving IP exhaustion through VPC CNI custom networking isn't straightforward and requires effort from customers, especially when applied at scale across a heterogeneous enterprise landscape. Enabling it on an existing cluster requires reconfiguring the VPC CNI add-on to support custom networking, followed by cordoning and draining the nodes to gracefully terminate the Pods and nodes. Only new nodes matching the ENIConfig label or annotations are configured to take advantage of custom networking, which isn't a great experience.

Custom networking on Amazon EKS

The Amazon VPC CNI plugin is the default CNI implementation on Amazon EKS. By default, VPC CNI assigns Pods an IP address selected from the primary subnet of the VPC. The primary subnet is the subnet CIDR that the primary Elastic Network Interface (ENI) is attached to, usually the subnet of the worker node in the Amazon EKS cluster. If the primary subnet CIDR is too small, the CNI may not be able to allocate enough IP addresses to assign to the Pods running in the cluster.
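
One quick way to see whether a cluster's subnets are running low on IPv4 addresses is to check how many free addresses they have left. Here is a sketch using the AWS CLI; the VPC ID is a placeholder:

aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=<your-vpc-id>" \
    --query 'Subnets[].{Subnet:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' \
    --output table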

Custom networking provides a solution to the IP exhaustion issue by assigning Pod IPs from secondary VPC address space (CIDR). When custom networking is enabled, VPC CNI creates secondary ENIs in the subnet defined under a custom resource named ENIConfig, which references an alternate subnet carved from a secondary VPC CIDR. The VPC CNI then assigns Pods IP addresses from the CIDR range defined in the ENIConfig custom resource.
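
For illustration, here is a minimal sketch of an ENIConfig custom resource; the subnet and security group IDs are placeholders, and the blueprint pattern described later in this post creates the equivalent resources for you:

cat <<EOF | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-2a                      # conventionally named after the Availability Zone
spec:
  subnet: subnet-0123456789abcdef0      # secondary-CIDR subnet in that AZ (placeholder)
  securityGroups:
    - sg-0123456789abcdef0              # security group for the secondary ENIs (placeholder)
EOF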

Solution overview

In this post, we'll show how to use CDK EKS Blueprints and the VPC CNI Amazon EKS Blueprints pattern to provision and manage Amazon EKS clusters with the VPC CNI add-on configured to use custom networking out of the box.

An Amazon EKS Blueprints pattern is a codified reference architecture that provides a ready-to-use solution that customers can apply in their environments. With Amazon EKS Blueprints Pipelines, the blueprint implementation supplied in the pattern can be easily replicated at scale within the enterprise, across multiple environments, regions, and accounts.

Within the Amazon EKS cluster, custom networking is controlled by the ENIConfig custom resource. The ENIConfig references alternate secondary subnets carved from the secondary VPC CIDR. When custom networking is enabled, the VPC CNI creates the secondary ENIs (whose IP addresses are assigned to Pods) in the subnets defined under the ENIConfig. The worker nodes' IP addresses are still allocated from the primary subnets (primary VPC CIDR). This setup is technically complex and involves multiple steps, such as allocating secondary CIDRs, creating secondary subnets, and creating the ENIConfig custom resource, which makes it a great candidate for blueprinting.
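
For context, here is a rough sketch of the manual AWS CLI steps that the blueprint automates; the VPC ID, Availability Zone, and CIDRs are placeholders:

# Associate a secondary CIDR with the VPC
aws ec2 associate-vpc-cidr-block \
    --vpc-id <your-vpc-id> \
    --cidr-block 100.64.0.0/16

# Create a secondary subnet in one Availability Zone (repeat per AZ)
aws ec2 create-subnet \
    --vpc-id <your-vpc-id> \
    --availability-zone us-east-2a \
    --cidr-block 100.64.0.0/24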

The Amazon EKS Blueprints pattern provides a configurable approach to creating the secondary subnets, the VPC CNI add-on, and the ENIConfig, along with configurable parameters for advanced VPC CNI settings such as prefix delegation, warm IP target, warm ENI target, and more, all fully integrated with Amazon EKS cluster provisioning and maintenance.

The diagram below shows how primary and secondary CIDR subnets are assigned to worker nodes and Pods in the EKS cluster after the custom networking feature is applied.


Figure 1: Distribution of VPC primary and secondary CIDRs in Amazon EKS

Walkthrough

Using the Custom Networking with IPv4 pattern, you can set up an Amazon EKS cluster with VPC CNI installed and configured with custom networking enabled.

This pattern does the following:

  • Creates an Amazon EKS Cluster control plane and worker nodes with a managed node group
  • Deploys supporting add-ons: VpcCni, CoreDns, KubeProxy, AWSLoadBalancerController
  • Enables Custom Networking configuration with VpcCni add-on

Prerequisites

You need the following to complete the steps in this post (a quick way to check that these tools are installed follows the list):

  1. An AWS account and AWS Command Line Interface (AWS CLI) version 2 to interact with AWS services using CLI commands
  2. kubectl to communicate with the Kubernetes API server
  3. AWS CDK version 2.83.0 or later
  4. npm version 9.6.0 or later
  5. yq for command-line YAML processing
  6. make to generate shell executables
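
A quick way to confirm these tools are installed (the exact versions to match are listed in the pattern's README):

aws --version
kubectl version --client
npx cdk --version
npm -v
yq --version
make --version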

Steps to deploy the Amazon EKS Blueprints pattern for custom networking

Step 1. Check Versions

Make sure that the right versions of Node.js and npm are installed.

The Node.js version should be the current stable version, such as 18.x at the time of writing. Please validate the versions against the pattern's README. Example:

$ node -v
v18.12.1
$ npm -v
8.19.2

Step 2. Clone the cdk-blueprints-patterns GitHub repository

git clone https://github.com/aws-samples/cdk-eks-blueprints-patterns.git

Step 3. Install project dependencies

Once you have cloned the above repository, open it in your favorite integrated development environment (IDE) and run the following command to install the dependencies and build the existing patterns:

Please note that the make commands listed below are optimized for macOS and use Homebrew (brew). They may produce warnings or errors on Ubuntu and other Linux distributions, which don't affect the execution flow needed for this post.

make deps

Step 4. View available patterns

To view patterns available to be deployed, execute the following:

export NODE_OPTIONS=--max-old-space-size=4096
# The NODE_OPTIONS export above is only needed if you are using AWS Cloud9
make build

To list the existing CDK EKS Blueprints patterns, run:

make list

Step 5. Bootstrap your CDK environment

Run the following command to bootstrap your CDK environment.

npx cdk bootstrap

You can now proceed with the deployment of the custom-networking-ipv4 pattern.

Step 6. Deploy the custom-networking-ipv4 pattern

To deploy the custom-networking-ipv4 pattern, run:

make pattern custom-networking-ipv4 deploy

Once the deployment is successful, run the update-kubeconfig command to update the kubeconfig file with the required access. You can copy the exact command from the CDK output.

aws eks update-kubeconfig --name custom-networking-ipv4-blueprint --region $AWS_REGION --role-arn arn:aws:iam::$AWS_ACCOUNT_ID:role/<YOUR_ROLE_NAME_SIMILAR_TO_custom-networking-ipv4-bl-customnetworkingipv4blue-2SR7PW3UBLIH>

You can verify the resources created by executing:

kubectl get node -o wide

Output:

NAME                                        STATUS   ROLES    AGE   VERSION                INTERNAL-IP   EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-18-208.us-east-2.compute.internal   Ready    <none>   70m   v1.24.11-eks-a59e1f0   10.0.18.208   18.116.23.237   Amazon Linux 2   5.10.173-154.642.amzn2.x86_64   containerd://1.6.19
ip-10-0-61-228.us-east-2.compute.internal   Ready    <none>   70m   v1.24.11-eks-a59e1f0   10.0
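
As an additional check, you can confirm that Pods that don't use hostNetwork (such as CoreDNS) receive IP addresses from the secondary 100.64.0.0/16 range, while the nodes keep addresses from the primary subnets:

kubectl get pods -n kube-system -o wide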

Under the hood

This pattern creates a VPC with a secondary CIDR and secondary subnets within the specified CIDR range, as shown in the following resourceProvider part of the blueprint. You can learn more about using resource providers in blueprints in the official documentation.

resourceProvider(GlobalResources.Vpc, new VpcProvider(undefined, { 
  primaryCidr: "10.2.0.0/16", // primary VPC CIDR
  secondaryCidr: "100.64.0.0/16", // secondary CIDR
  secondarySubnetCidrs: ["100.64.0.0/24","100.64.1.0/24","100.64.2.0/24"] // Subnets in the secondary CIDR
}))

When the secondary CIDRs are passed to the VPC resource provider, the secondary subnets are created and registered as resources under the names secondary-cidr-subnet-${order}.

The VPC CNI add-on then sets up custom networking for your Amazon EKS cluster workloads with the secondary subnet ranges, based on the parameters awsVpcK8sCniCustomNetworkCfg and eniConfigLabelDef: "topology.kubernetes.io/zone".

We enable the CNI plugin with custom Pod networking through the following VPC CNI add-on parameters (a quick verification command follows the list):

  • AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = true
  • ENI_CONFIG_LABEL_DEF = topology.kubernetes.io/zone
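
To verify these settings on the running cluster, you can inspect the environment of the aws-node DaemonSet; this is an extra check, not part of the pattern itself:

kubectl describe daemonset aws-node -n kube-system | grep -E 'AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG|ENI_CONFIG_LABEL_DEF'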

This deploys an ENIConfig custom resource for the Pod subnets (one per Availability Zone [AZ]). Here is the full view of the blueprint, with the VPC creation and the use of the secondary subnets in the VpcCniAddOn:

import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

const addOn = new blueprints.addons.VpcCniAddOn({
  customNetworkingConfig: {
      subnets: [
          blueprints.getNamedResource("secondary-cidr-subnet-0"),
          blueprints.getNamedResource("secondary-cidr-subnet-1"),
          blueprints.getNamedResource("secondary-cidr-subnet-2"),
      ]   
  },
  awsVpcK8sCniCustomNetworkCfg: true,
  eniConfigLabelDef: 'topology.kubernetes.io/zone'
});

const blueprint = blueprints.EksBlueprint.builder()
  .addOns(addOn)
  .resourceProvider(blueprints.GlobalResources.Vpc, new blueprints.VpcProvider(undefined, {
                primaryCidr: "10.2.0.0/16", 
                secondaryCidr: "100.64.0.0/16",
                secondarySubnetCidrs: ["100.64.0.0/24","100.64.1.0/24","100.64.2.0/24"]
            }))
  .build(app, 'my-stack-name');

In the diagram shown in the Solution overview section, a secondary CIDR of 100.64.0.0/16 provides the ranges assigned to the secondary subnets, each of which is created in a distinct AZ within a Region. When an Amazon EKS cluster is deployed in the VPC, the worker nodes in the cluster still get an IP address from the primary CIDR range (10.0.0.0/16), whereas the Pods get an IP address from the secondary CIDR range. When AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true, the CNI assigns Pod IP addresses from a subnet defined in an ENIConfig. The ENIConfig custom resource defines the subnet from which Pod IP addresses are allocated.

This can be verified by issuing the following command:

kubectl get eniconfig

Output:

NAME         AGE
us-east-2a   47m
us-east-2b   47m
us-east-2c   47m
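
To see which subnet and security groups a given ENIConfig references, describe one of them; the names will match the AZs of your Region:

kubectl describe eniconfig us-east-2a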

An ENIConfig custom resource is created in each AZ. The number of secondary ENIs associated with a worker node varies by instance type.


Figure 2: ENI Association with worker nodes.
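
You can check how many ENIs and IPv4 addresses per ENI a given instance type supports with the AWS CLI; m5.large is shown here because it is also used in the prefix delegation example below:

aws ec2 describe-instance-types \
    --instance-types m5.large \
    --query 'InstanceTypes[].NetworkInfo.{MaxENIs:MaximumNetworkInterfaces,IPv4PerENI:Ipv4AddressesPerInterface}' \
    --output table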

Additional configuration options

The VPC CNI add-on provides additional knobs for advanced configuration on top of custom networking.

Prefix delegation

When using custom networking mode, the node's primary ENI is no longer used to assign Pod IP addresses, which means that there's a decrease in the number of Pods that can run on a given Amazon Elastic Compute Cloud (Amazon EC2) instance type. To work around this limitation, you can use prefix delegation with custom networking. This is an important capability because when you use custom networking, only Pods that are configured to use hostNetwork are bound to the host's primary ENI; all other Pods are bound to secondary ENIs. With prefix delegation enabled, each secondary IP slot is replaced with a /28 prefix (16 IPv4 addresses), which offsets the IP addresses lost to custom networking.

By default, Prefix Delegation is turned off in VPC CNI. To check this, run the following command:

kubectl get ds aws-node -o yaml -n kube-system | yq '.spec.template.spec.containers[].env'

Output:

[...]
- name: ENABLE_PREFIX_DELEGATION
  value: "false"
[...]

Consider the maximum number of Pods for an m5.large instance with custom networking. With custom networking enabled but prefix delegation disabled, the maximum number of Pods you can run on this instance type is 20.

Download and run the max-pods-calculator.sh script to calculate the maximum number of Pods:

curl -o max-pods-calculator.sh https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh
./max-pods-calculator.sh \
    --instance-type m5.large \
    --cni-version 1.12.5-eksbuild.2 \
    --cni-custom-networking-enabled

Output:

20

To turn on Prefix Delegation, use the following command:

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

Run the script again, this time with the --cni-prefix-delegation-enabled flag:

./max-pods-calculator.sh \
    --instance-type m5.large \
    --cni-version 1.12.5-eksbuild.2 \
    --cni-custom-networking-enabled \
    --cni-prefix-delegation-enabled

Output:

110


Figure 3: Max number of Pods with Prefix Delegation on and off.

With prefix delegation enabled, the calculator reports a max-pods value of 110, compared to 20 before. The value is capped at 110 because the instance has a relatively low number of vCPUs; the Kubernetes community recommends setting the maximum number of Pods (max-pods) to no more than 10 times the number of cores, up to a limit of 110. Since VPC CNI runs as a DaemonSet, you need to create new nodes for this change to take effect.

The number of ENIs and IP addresses kept in a node's warm pool is configured through options such as WARM_ENI_TARGET and WARM_IP_TARGET. For more details on these options, please refer to the EKS Best Practices Networking Guide. When using the CDK Blueprints patterns, you can also set VPC CNI attributes such as WARM_ENI_TARGET and ENABLE_PREFIX_DELEGATION through the warmEniTarget and enablePrefixDelegation parameters when adding the add-on to an Amazon EKS cluster, as shown in the sketch that follows.
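
Here is a hedged sketch of what that add-on configuration could look like, extending the blueprint shown earlier; the values for warmEniTarget and enablePrefixDelegation are illustrative, not recommendations:

import * as blueprints from '@aws-quickstart/eks-blueprints';

// VPC CNI add-on with custom networking plus advanced IP management settings.
// warmEniTarget and enablePrefixDelegation map to the WARM_ENI_TARGET and
// ENABLE_PREFIX_DELEGATION environment variables on the aws-node DaemonSet.
const vpcCniAddOn = new blueprints.addons.VpcCniAddOn({
  customNetworkingConfig: {
      subnets: [
          blueprints.getNamedResource("secondary-cidr-subnet-0"),
          blueprints.getNamedResource("secondary-cidr-subnet-1"),
          blueprints.getNamedResource("secondary-cidr-subnet-2"),
      ]
  },
  awsVpcK8sCniCustomNetworkCfg: true,
  eniConfigLabelDef: 'topology.kubernetes.io/zone',
  enablePrefixDelegation: true,   // sets ENABLE_PREFIX_DELEGATION=true
  warmEniTarget: 1                // keeps one spare ENI's worth of IP addresses warm
});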

Cleaning up

You will need to delete the provisioned resources to avoid unintended costs. To clean up the provisioned blueprint resources, run the following command:

make pattern custom-networking-ipv4 destroy

Conclusion

In this post, we showed you how to use the custom networking pattern, which defines an Amazon EKS blueprint with the core VPC CNI add-on, to solve IP exhaustion out of the box at cluster provisioning time. This approach allows workloads and operational software to allocate IPs on secondary interfaces in secondary subnets. We recommend that you try this solution to create and manage Amazon EKS clusters. Solving IP exhaustion is just one of the benefits of using Amazon EKS Blueprints and patterns, among many other features and patterns that are available in our open source repository.

This and many other patterns that we have built are intended to address common customer and partner use cases, so we would love to get your feedback on how we can improve our patterns portfolio. You can use GitHub issues to share your feedback.

While there are many ways to approach the problem of IP exhaustion, the advantage of using Amazon EKS Blueprints becomes more apparent when applied at scale, providing platform engineers with an Infrastructure as Code (IaC) solution with integrated pipeline support to roll out enterprise-wide changes to the Amazon EKS environments through the blueprints.

For more information, see the following references: