Amazon EKS enhances Kubernetes control plane observability
Introduction
Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service to run Kubernetes in the AWS cloud and on-premises data centers. In the cloud, Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks. However, maintaining the health and performance of the workloads running in the cluster is a collaborative effort between Amazon EKS and users. Certain workload behaviors or configurations can add load on the control plane, leading to performance degradation. In these scenarios, access to key control plane metrics and dashboards enables cluster administrators to quickly detect and troubleshoot issues with workloads running on the clusters. For example, inadequately resourced worker nodes could hinder the scheduler’s ability to schedule new pods. To promptly detect these emerging scheduling issues, a cluster administrator needs access to scheduler metrics to view pending pods and receive timely notifications. Furthermore, the sheer volume of control plane metrics presents a challenge, even for experienced cluster administrators, in selecting the optimal metrics for monitoring and creating dashboards that effectively track the performance of the control plane.
Today we are announcing enhancements to Amazon EKS that give cluster administrators visibility into the performance of the Kubernetes control plane so that they can quickly detect, troubleshoot, and remediate issues. Amazon EKS clusters now automatically display a curated set of dashboards visualizing key control plane metrics within the Amazon EKS console, so you can directly observe the health and performance of the Kubernetes control plane. Furthermore, a broader set of control plane metrics is now available to users in addition to the existing Kubernetes API server metrics. These metrics are available from a Prometheus endpoint and are also accessible in Amazon CloudWatch, providing users with the flexibility to use their preferred monitoring solution, be it Prometheus (Amazon Managed Service for Prometheus or self-managed Prometheus), CloudWatch, or a third-party monitoring tool. Cluster administrators can use the newly introduced metrics to configure alarms that alert them to delays in workload scheduling and job completion, enabling proactive monitoring and timely resolution of issues.
New and existing EKS clusters running Kubernetes version 1.28 or later on a supported platform version automatically receive the new set of dashboards, and their control plane metrics appear in CloudWatch under the AWS/EKS namespace. These metrics are available at no extra charge. You can find the complete list of available metrics in the Amazon EKS documentation.
Walkthrough
In this section, we create a new EKS cluster and walk through the new monitoring dashboards and Kubernetes control plane metrics. We start with the prerequisites.
Prerequisites
The following prerequisites are necessary to complete this solution:
- An AWS account
- AWS Command Line Interface (AWS CLI) configured on your device or AWS CloudShell
- eksctl, a CLI tool for creating and managing EKS clusters
- kubectl, a CLI tool to interact with the Kubernetes API server
Setup
Start by creating a new EKS cluster with Kubernetes version 1.31, for example with the eksctl create cluster command and its --version flag.
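As a rough illustration of this step, the following Python sketch assembles an eksctl command line for a two-node cluster on Kubernetes 1.31. The cluster name and Region are hypothetical placeholders, and the node count of two is only chosen to match the cluster size mentioned later in this walkthrough. The sketch prints the command instead of running it, so you can review it first.

```python
import shlex


def eksctl_create_cluster_args(name, region, version="1.31", nodes=2):
    """Return the argument list for an `eksctl create cluster` invocation."""
    return [
        "eksctl", "create", "cluster",
        "--name", name,
        "--region", region,
        "--version", version,
        "--nodes", str(nodes),
    ]


# Hypothetical cluster name and Region, used only for illustration.
args = eksctl_create_cluster_args("eks-cp-observability-demo", "us-west-2")

# Print a shell-safe version of the command for review.
print(shlex.join(args))
```

To actually create the cluster, you could pass the same list to subprocess.run(args, check=True), or copy the printed command into your terminal.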
When the cluster is successfully created, log in to the AWS console and go to the EKS clusters page. Select your EKS cluster from the list to go to the cluster detail page. Choose View dashboard on the top right corner to navigate to the new observability dashboard page, as shown in the following figure.
On the dashboard page, you should observe the relocated Cluster health issues and Cluster insights sections under their respective tabs. Choose the Control plane monitoring tab or the View control plane monitoring button to navigate to the new cluster performance metrics dashboard, as shown in the following figure.
On the new cluster performance metrics dashboard, you should observe a set of curated dashboards displaying key Kubernetes control plane metrics, such as API server request types (total, HTTP 4XX, HTTP 5XX, and HTTP 429), etcd database size, and kube-scheduler scheduling attempts, as shown in the following figure.
Scroll down to view the CloudWatch Logs Insights section, which consists of a curated set of queries for analyzing Amazon EKS control plane logs. When control plane logs are enabled on the cluster, these queries offer valuable insights into top talkers, slow requests, throttled clients, and broken webhooks, as shown in the following figures.
The last section of the Control plane dashboard allows you to manage the control plane logging and displays the links to view those logs in the CloudWatch console, as shown in the following figure.
That covers the changes to the Amazon EKS console. Next, we explore the new Kubernetes control plane metrics available in CloudWatch.
CloudWatch metrics
After the cluster is created successfully, the core Kubernetes control plane metrics are also ingested into CloudWatch metrics under the AWS/EKS namespace. Go to the CloudWatch console and choose All metrics from the left navigation menu to go to the Metrics selection page. Choose the AWS/EKS namespace and a metrics dimension (By Cluster Name) to view the Kubernetes control plane metrics, as shown in the following figures.
You can use these newly vended metrics to build monitoring dashboards or to create CloudWatch alarms that notify you of control plane performance issues such as scheduler latency and API server throttling errors. CloudWatch Container Insights will be enhanced in the future to use these new metrics and provide key insights into the health of the control plane.
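To make the alarm idea concrete, here is a minimal Python sketch that builds the parameters for a CloudWatch PutMetricAlarm call on pending pods. The metric name scheduler_pending_pods and the ClusterName dimension key are assumptions, so confirm both against the metric list in the Amazon EKS documentation or in the CloudWatch console. The threshold and evaluation window are arbitrary illustrative values.

```python
def pending_pods_alarm(cluster_name, threshold=10.0):
    """Build keyword arguments for a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"{cluster_name}-scheduler-pending-pods",
        "Namespace": "AWS/EKS",
        # Assumed metric name; verify it in the AWS/EKS namespace.
        "MetricName": "scheduler_pending_pods",
        # Assumed dimension key for the "By Cluster Name" dimension.
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Maximum",
        "Period": 60,              # seconds per datapoint
        "EvaluationPeriods": 5,    # alarm after 5 consecutive breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Hypothetical cluster name, used only for illustration.
params = pending_pods_alarm("eks-cp-observability-demo")
print(params)
```

With boto3 installed and credentials configured, the dictionary can be passed directly as boto3.client("cloudwatch").put_metric_alarm(**params). You would typically also add AlarmActions, for example an Amazon SNS topic ARN, to receive notifications.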
Prometheus metrics
Amazon EKS also exposes a new Prometheus-compatible endpoint that your observability agent can scrape to fetch Kubernetes control plane metrics. The endpoint serves metrics from control plane components such as kube-scheduler and controller-manager under the new metrics API group metrics.eks.amazonaws.com. You can retrieve those metrics with kubectl get --raw, passing /apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics for kube-scheduler or /apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics for controller-manager.
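The endpoint returns metrics in the Prometheus text exposition format. As a quick way to inspect that output without a full monitoring stack, the following Python sketch parses such text and totals the scheduler_pending_pods gauge per scheduling queue. The sample payload is made up for illustration and is not output from a real cluster. The parser is deliberately minimal: it ignores comments and timestamps.

```python
import re
from collections import defaultdict

# Made-up sample in Prometheus text exposition format (not real cluster output).
SAMPLE = """\
# HELP scheduler_pending_pods Number of pending pods, by the queue type.
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 0
scheduler_pending_pods{queue="backoff"} 2
scheduler_pending_pods{queue="gated"} 0
scheduler_pending_pods{queue="unschedulable"} 18
"""

# metric_name{label="value",...} sample_value
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for every sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip anything that does not look like a sample
        name, raw_labels, value = match.groups()
        yield name, dict(LABEL.findall(raw_labels or "")), float(value)


def pending_pods_by_queue(text):
    """Sum scheduler_pending_pods samples per value of the queue label."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "scheduler_pending_pods":
            totals[labels.get("queue", "")] += value
    return dict(totals)


print(pending_pods_by_queue(SAMPLE))
```

In practice you would replace SAMPLE with the text returned by the kube-scheduler metrics path, for example by saving the kubectl get --raw output to a file and reading it.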
In this walkthrough, we use Amazon Managed Service for Prometheus and an AWS managed collector to demonstrate how to scrape the Amazon EKS control plane Prometheus metrics. The same approach can be adapted to other Prometheus-based solutions such as self-managed Prometheus. Alternatively, you can use AWS Observability Accelerator (available in Terraform and CDK formats), which comes with opinionated modules, curated metrics, logs and traces collection, alerting rules, and Grafana dashboards for your AWS infrastructure and custom applications.
Start by creating an Amazon Managed Service for Prometheus workspace.
Create the Prometheus agent scraper configuration and include the new Amazon EKS control plane metrics endpoints, /apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics and /apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics, as scraping targets.
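One way to produce such a configuration is sketched below in Python. It renders a Prometheus scrape configuration with one job per control plane endpoint and base64-encodes it, which is the form the AWS CLI expects for a scraper configuration blob. The job names, the 30s interval, and the token and CA file paths follow common in-cluster Prometheus conventions and are assumptions here; the managed collector may handle credentials differently, so compare the result with the Amazon Managed Service for Prometheus documentation before using it.

```python
import base64

KSH_PATH = "/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics"
KCM_PATH = "/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics"

# One scrape job that targets the Kubernetes API endpoint (default/kubernetes).
JOB_TEMPLATE = """\
  - job_name: {job}
    honor_labels: true
    scheme: https
    metrics_path: {path}
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
"""


def scrape_config(interval="30s"):
    """Render a scrape configuration with kube-scheduler and controller-manager jobs."""
    header = f"global:\n  scrape_interval: {interval}\nscrape_configs:\n"
    jobs = [("ksh-metrics", KSH_PATH), ("kcm-metrics", KCM_PATH)]
    return header + "".join(JOB_TEMPLATE.format(job=j, path=p) for j, p in jobs)


config_text = scrape_config()
# Base64 form, suitable for passing as a configuration blob on the command line.
config_blob = base64.b64encode(config_text.encode("utf-8")).decode("ascii")

print(config_text)
```

The relabel rule keeps only the kubernetes service in the default namespace, so each job scrapes the API server address with a different metrics path.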
An Amazon Managed Service for Prometheus collector consists of a scraper that discovers and collects metrics from an EKS cluster. Amazon Managed Service for Prometheus manages the scraper for you, giving you the scalability, security, and reliability that you need, without having to manage instances, agents, or scrapers yourself. To set up a scraper, we need to provide the VPC subnets and an EC2 security group, so we first retrieve these values from our EKS cluster setup.
Create the scraper with the AWS CLI (aws amp create-scraper), passing the cluster, the scraper configuration, and the workspace.
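If you prefer an SDK over the CLI, the same request can be assembled in Python. The sketch below takes the cluster object from an EKS DescribeCluster response and builds the arguments for the Amazon Managed Service for Prometheus CreateScraper operation. Using the cluster security group is an assumption; any security group that allows the scraper to reach the cluster should work. All ARNs and IDs shown are placeholders.

```python
def scraper_request(cluster, workspace_arn, config_bytes, alias="eks-control-plane"):
    """Build CreateScraper arguments from an EKS DescribeCluster 'cluster' object."""
    vpc = cluster["resourcesVpcConfig"]
    return {
        "alias": alias,
        "scrapeConfiguration": {"configurationBlob": config_bytes},
        "source": {
            "eksConfiguration": {
                "clusterArn": cluster["arn"],
                "subnetIds": list(vpc["subnetIds"]),
                # Assumption: reuse the cluster security group for the scraper.
                "securityGroupIds": [vpc["clusterSecurityGroupId"]],
            }
        },
        "destination": {"ampConfiguration": {"workspaceArn": workspace_arn}},
    }


# Placeholder values shaped like the relevant part of a DescribeCluster response.
example_cluster = {
    "arn": "arn:aws:eks:us-west-2:111122223333:cluster/demo",
    "resourcesVpcConfig": {
        "subnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "clusterSecurityGroupId": "sg-cccc3333",
    },
}

request = scraper_request(
    example_cluster,
    "arn:aws:aps:us-west-2:111122223333:workspace/ws-example",
    b"global:\n  scrape_interval: 30s\n",  # stand-in for the real scrape configuration
)
print(request["source"])
```

With boto3, the cluster object would come from boto3.client("eks").describe_cluster(name=...)["cluster"], and the request could be sent with boto3.client("amp").create_scraper(**request).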
The scraper takes a few minutes to become active. You can verify the status in the Amazon EKS console under the Observability tab, as shown in the following figure:
As the metrics are ingested into the Amazon Managed Service for Prometheus workspace, we can use visualization tools such as Amazon Managed Grafana to instantly query, correlate, and visualize metrics. Follow the Amazon Managed Grafana getting started guide to set up a workspace, and use the Prometheus data source to connect to the Amazon Managed Service for Prometheus workspace.
Create an example Kubernetes deployment that requests more pods than the worker nodes can host to simulate scheduling delays.
The cluster has only two worker nodes, so the kube-scheduler marks some of the pods as pending. We can then use various kube-scheduler metrics to get visibility into scheduling latency trends, scheduling retries, and pending pods by scheduling queue, as depicted in the following figure.
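A deployment for this purpose only needs to request more CPU in total than the worker nodes can offer. The Python sketch below generates such a manifest as JSON, which kubectl accepts in the same way as YAML. The deployment name, replica count, CPU request, and the pause image are illustrative assumptions; size them relative to your own nodes.

```python
import json


def pending_pods_deployment(replicas=50, cpu_request="1"):
    """Build a Deployment whose total CPU requests exceed a small cluster's capacity."""
    labels = {"app": "scheduling-delay-demo"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "scheduling-delay-demo", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "pause",
                            # The pause image does no work; only the request matters.
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {"requests": {"cpu": cpu_request}},
                        }
                    ]
                },
            },
        },
    }


manifest = pending_pods_deployment()
print(json.dumps(manifest, indent=2))
```

Saving the script as, say, gen_deployment.py (a hypothetical file name) lets you apply it with python gen_deployment.py | kubectl apply -f -. Delete the deployment afterwards so that the pending pods do not linger.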
Self-managed Prometheus setup
If you are running a self-managed Prometheus in the EKS cluster using kube-prometheus-stack, you can configure it by adding the necessary RBAC permissions and creating a ServiceMonitor resource. Because the Amazon EKS control plane metrics are exposed under the new API group, update the Prometheus cluster role so that it is permitted to get the new metrics.
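As a sketch of the permission change, the following Python snippet builds a JSON Patch that appends one rule to an existing ClusterRole. The resource names kcm/metrics and ksh/metrics are assumptions inferred from the endpoint paths, so verify them against the Amazon EKS documentation. The name of the cluster role depends on your Helm release and is not shown here.

```python
import json

# Rule granting read access to the control plane metrics API group.
# Resource names are assumed from the endpoint paths; verify before use.
METRICS_RULE = {
    "apiGroups": ["metrics.eks.amazonaws.com"],
    "resources": ["kcm/metrics", "ksh/metrics"],
    "verbs": ["get"],
}


def add_rule_patch(rule):
    """Return a JSON Patch (RFC 6902) that appends a rule to a ClusterRole."""
    return [{"op": "add", "path": "/rules/-", "value": rule}]


patch = json.dumps(add_rule_patch(METRICS_RULE))
print(patch)
```

The printed document can be supplied to kubectl patch clusterrole with --type=json and -p, using the name of the cluster role that your Prometheus service account is bound to.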
In the kube-prometheus-stack, a ServiceMonitor is a custom resource that defines how Prometheus should monitor Kubernetes services. It specifies which services to scrape for metrics, how often to scrape them, and any additional configuration needed for the scraping process. This allows for dynamic and declarative configuration of Prometheus monitoring targets. Create the ServiceMonitor objects to configure the Prometheus agent to scrape both kube-scheduler (/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics) and controller-manager (/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics) metrics. You can verify the status of the scrape targets in the Prometheus console, as shown in the following figure:
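The ServiceMonitor objects described above could be generated along the following lines. This Python sketch emits both objects as a JSON list that kubectl can apply. The monitoring namespace, the release label, the object names, and the 30s interval are assumptions that depend on how kube-prometheus-stack was installed; the selector targets the kubernetes service in the default namespace, which fronts the API server.

```python
import json

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def control_plane_service_monitor(name, path, interval="30s"):
    """Build a ServiceMonitor that scrapes one control plane metrics path."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": "monitoring",  # assumed Prometheus namespace
            "labels": {"release": "kube-prometheus-stack"},  # assumed Helm release label
        },
        "spec": {
            "namespaceSelector": {"matchNames": ["default"]},
            "selector": {
                "matchLabels": {"component": "apiserver", "provider": "kubernetes"}
            },
            "endpoints": [
                {
                    "port": "https",
                    "scheme": "https",
                    "path": path,
                    "interval": interval,
                    "bearerTokenFile": f"{SA_DIR}/token",
                    "tlsConfig": {"caFile": f"{SA_DIR}/ca.crt", "serverName": "kubernetes"},
                }
            ],
        },
    }


monitors = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        control_plane_service_monitor(
            "eks-ksh-metrics", "/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics"
        ),
        control_plane_service_monitor(
            "eks-kcm-metrics", "/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics"
        ),
    ],
}
print(json.dumps(monitors, indent=2))
```

The release label matters because kube-prometheus-stack typically selects only ServiceMonitor objects carrying the label of its Helm release. Adjust it to match your installation.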
Key considerations
- All new EKS clusters created on Kubernetes version 1.28 or later have access to these metrics and dashboards. For existing clusters, refer to the Amazon EKS documentation for supported Kubernetes and platform versions.
- This feature is available in all commercial, China, and AWS GovCloud (US) Regions.
- Amazon EKS returns only a curated set of control plane metrics that users can act on. We continue to evaluate and introduce more metrics as needed. To propose additional control plane metrics that you would like Amazon EKS to expose, create a GitHub issue on the AWS Containers roadmap and include the rationale for how you would use those metrics.
- To prevent potential performance issues, we strongly advise against short Prometheus scrape intervals. We recommend that you do not use a scrape interval of less than one second.
Conclusion
In this post, we discussed the new control plane observability enhancements to Amazon EKS clusters. You can use the new dashboards in the Amazon EKS console, the Prometheus-compatible Kubernetes control plane metrics endpoint, and Amazon CloudWatch metrics to quickly detect, troubleshoot, and remediate workload-related issues. We showed you how to configure Amazon Managed Service for Prometheus to scrape these new metrics and visualize them in Amazon Managed Grafana. Further, we explained how you can configure your existing observability agents to scrape these metrics and create dashboards in your favorite observability platform. You can provide feedback on this feature and propose new features by leaving a comment or opening an issue on the AWS Containers roadmap, which is hosted on GitHub.