Cost savings by customizing metrics sent by Container Insights in Amazon EKS
AWS Distro for OpenTelemetry (ADOT) is an AWS-provided distribution of the OpenTelemetry project. The ADOT Collector receives data from multiple sources and exports it to multiple destinations. Amazon CloudWatch Container Insights now supports ADOT for Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS). This enables customers to perform advanced configurations, such as customizing the metrics that are sent to CloudWatch. The following diagram shows the ADOT Collector architecture for Amazon EKS:
As shown in the preceding diagram, the ADOT Collector pipeline starts with a receiver that collects metrics, in this case the Container Insights receiver. It then uses processors to transform or filter the collected metrics. Finally, the ADOT Collector uses exporters to send metrics to different destinations. In this example, we are using the AWS EMF exporter, which converts processed metrics to CloudWatch embedded metric format (EMF) logs. In this blog post, we will show you how to reduce Container Insights-associated costs by customizing the metrics collected by the Container Insights receiver in the ADOT Collector for Amazon EKS clusters.
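The receiver, processor, and exporter stages are wired together in the service section of the Collector configuration. The following minimal skeleton (a simplified sketch, not the full configuration used later in this post) shows that shape:

```yaml
# Minimal shape of an ADOT Collector pipeline for Container Insights:
# receiver -> processors -> exporter
receivers:
  awscontainerinsightreceiver:   # collects infrastructure metrics
processors:
  batch/metrics:                 # buffers metrics before export
    timeout: 60s
exporters:
  awsemf:                        # writes CloudWatch EMF logs
    namespace: ContainerInsights
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]
```

The full configurations shown later in this post add filter and resource processors and a metric_declarations section to this same skeleton.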
With the default configuration, the Container Insights receiver collects the complete set of metrics as defined by the receiver documentation. The number of metrics and dimensions collected is high, and for large clusters this will significantly increase the costs for metric ingestion and storage. We are going to demonstrate two different approaches that you can use to configure the ADOT Collector to send only metrics that bring value to you.
In this blog post, we explain how to configure the ADOT Collector for an Amazon EKS cluster, but note that metric customization using the approaches shown here is also applicable when using the ADOT Collector in Amazon ECS. Be aware that the metric names for Amazon ECS are different, as shown in this documentation.
Installing ADOT Collector in EKS
To have the Container Insights receiver collect infrastructure data, you need to install the ADOT Collector as a DaemonSet in your Amazon EKS cluster. In this section, we describe the steps required to configure the ADOT Collector in Amazon EKS.
Set up IAM role for service account
To improve security for the ADOT Collector agent, enable IAM roles for service accounts (IRSA) so that you can assign IAM permissions to the ADOT Collector pod.
- Set environment variables
export CLUSTER_NAME=<eks-cluster-name>
export AWS_REGION=<aws-region, e.g. us-east-1>
export AWS_ACCOUNT_ID=<AWS account ID>
- Enable IAM OIDC provider
eksctl utils associate-iam-oidc-provider --region=$AWS_REGION \
--cluster=$CLUSTER_NAME \
--approve
- Create IAM policy for the Collector
cat << EOF > AWSDistroOpenTelemetryPolicy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords",
        "xray:GetSamplingRules",
        "xray:GetSamplingTargets",
        "xray:GetSamplingStatisticSummaries",
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags",
        "ssm:GetParameters"
      ],
      "Resource": "*"
    }
  ]
}
EOF
aws iam create-policy \
--policy-name AWSDistroOpenTelemetryPolicy \
--policy-document file://AWSDistroOpenTelemetryPolicy.json
- Download ADOT Collector Kubernetes manifest
curl -s -O https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml
- Install Kubernetes manifest
kubectl apply -f otel-container-insights-infra.yaml
- Create IAM role and configure IRSA for the Collector
eksctl create iamserviceaccount \
--name aws-otel-sa \
--namespace aws-otel-eks \
--cluster ${CLUSTER_NAME} \
--attach-policy-arn arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AWSDistroOpenTelemetryPolicy \
--approve \
--override-existing-serviceaccounts
- Restart ADOT Collector pods to start using IRSA
kubectl delete pods -n aws-otel-eks -l name=aws-otel-eks-ci
Now you have the ADOT Collector installed in your Amazon EKS cluster. Let's review the two approaches available to customize the metrics sent by the Collector.
Option 1: Filter metrics using processors
This approach involves the introduction of OpenTelemetry processors to filter out metrics or attributes to reduce the size of EMF logs. In this section, we will demonstrate the basic usage of two processors. For more detailed information about these processors, you can refer to this documentation.
Filter processor
The filter processor is part of AWS Distro for OpenTelemetry. It can be used in the metrics collection pipeline to filter out unwanted metrics. For example, suppose that you want Container Insights to collect only pod-level metrics (with the name prefix pod_), excluding those for networking (with the name prefix pod_network). You can add the filter processor to the pipeline by editing the Kubernetes manifest otel-container-insights-infra.yaml downloaded in the Installing ADOT Collector in EKS section above, modifying the ConfigMap named otel-agent-conf to include filter processors as follows:
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  # filter processors example
  filter/include:
    # any names NOT matching filters are excluded from remainder of pipeline
    metrics:
      include:
        match_type: regexp
        metric_names:
          # re2 regexp patterns
          - ^pod_.*
  filter/exclude:
    # any names matching filters are excluded from remainder of pipeline
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^pod_network.*
  batch/metrics:
    timeout: 60s
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # node metrics
      - dimensions: [[NodeName, InstanceId, ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
      - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_number_of_container_restarts
          - container_cpu_limit
          - container_cpu_request
          - container_cpu_utilization
          - container_memory_limit
          - container_memory_request
          - container_memory_utilization
          - container_memory_working_set
      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count
      # service metrics
      - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - service_number_of_running_pods
      # node fs metrics
      - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
        metric_name_selectors:
          - node_filesystem_utilization
      # namespace metrics
      - dimensions: [[Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - namespace_number_of_running_pods
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # Add filter processors to the pipeline
      processors: [filter/include, filter/exclude, batch/metrics]
      exporters: [awsemf]
  extensions: [health_check]
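To see which metric names survive this pair of filters, you can reproduce the include/exclude logic outside the Collector. The following sketch uses Python's re module as a stand-in for the RE2 patterns (close enough for these simple prefixes); the metric names are examples from the Container Insights set, and this is an illustration of the filter semantics, not the Collector's actual implementation:

```python
import re

# filter/include keeps names matching ^pod_.*; filter/exclude then
# drops names matching ^pod_network.* (sketch of the filter processor
# semantics from the configuration above).
include = re.compile(r"^pod_.*")
exclude = re.compile(r"^pod_network.*")

metrics = [
    "pod_cpu_utilization",
    "pod_network_rx_bytes",
    "node_cpu_utilization",
    "pod_memory_utilization",
]

kept = [m for m in metrics if include.match(m) and not exclude.match(m)]
print(kept)  # ['pod_cpu_utilization', 'pod_memory_utilization']
```

Note that node_cpu_utilization is dropped by the include filter and pod_network_rx_bytes by the exclude filter, so only the two non-network pod metrics remain in the pipeline.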
Resource processor
The resource processor is also built into AWS Distro for OpenTelemetry and can be used to remove unwanted metric attributes. For example, to remove the kubernetes and Sources fields from the EMF logs, add the resource processor to the pipeline:
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  filter/include:
    # any names NOT matching filters are excluded from remainder of pipeline
    metrics:
      include:
        match_type: regexp
        metric_names:
          # re2 regexp patterns
          - ^pod_.*
  filter/exclude:
    # any names matching filters are excluded from remainder of pipeline
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^pod_network.*
  # resource processors example
  resource:
    attributes:
      - key: Sources
        action: delete
      - key: kubernetes
        action: delete
  batch/metrics:
    timeout: 60s
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # node metrics
      - dimensions: [[NodeName, InstanceId, ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
      - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_number_of_container_restarts
          - container_cpu_limit
          - container_cpu_request
          - container_cpu_utilization
          - container_memory_limit
          - container_memory_request
          - container_memory_utilization
          - container_memory_working_set
      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count
      # service metrics
      - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - service_number_of_running_pods
      # node fs metrics
      - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
        metric_name_selectors:
          - node_filesystem_utilization
      # namespace metrics
      - dimensions: [[Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - namespace_number_of_running_pods
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # Add resource processor to the pipeline
      processors: [filter/include, filter/exclude, resource, batch/metrics]
      exporters: [awsemf]
  extensions: [health_check]
The processor approach is the more generic one for the ADOT Collector, and it is the only way to customize metrics that are sent to destinations other than CloudWatch. Also, because customization and filtering happen at an early stage of the pipeline, it is efficient and can handle high volumes of metrics with minimal impact on performance.
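For context on why dropping metrics and attributes early shrinks what you pay for, recall that each record the awsemf exporter writes to CloudWatch Logs is an embedded metric format (EMF) JSON document. The following is a hand-written, simplified sketch of what such a record can look like; all field values are illustrative, not output captured from the exporter:

```python
import json

# Simplified sketch of a CloudWatch EMF record for one pod metric.
# The _aws.CloudWatchMetrics block tells CloudWatch which fields to
# extract as metrics and under which dimension sets; every extra
# metric, dimension, or attribute enlarges this document.
emf_record = {
    "_aws": {
        "Timestamp": 1661990400000,  # epoch milliseconds (illustrative)
        "CloudWatchMetrics": [
            {
                "Namespace": "ContainerInsights",
                "Dimensions": [["PodName", "Namespace", "ClusterName"]],
                "Metrics": [{"Name": "pod_cpu_utilization", "Unit": "Percent"}],
            }
        ],
    },
    # Dimension values and the metric value live at the top level:
    "PodName": "nginx",
    "Namespace": "default",
    "ClusterName": "my-cluster",
    "pod_cpu_utilization": 1.5,
}

print(json.dumps(emf_record, indent=2))
```

Filtering a metric out with a processor means no such record (or metric field) is produced for it at all, which reduces both log ingestion volume and the number of custom metrics CloudWatch extracts.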
Option 2: Customize metrics and dimensions
In this approach, instead of using OpenTelemetry processors, you configure the CloudWatch EMF exporter to generate only the set of metrics that you want to send to CloudWatch Logs. The metric_declarations section of the CloudWatch EMF exporter configuration defines the set of metrics and dimensions to export. For example, if you keep only the pod metrics from the default configuration, the metric_declarations section will look like the following:
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  batch/metrics:
    timeout: 60s
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    # Customized metric declaration section
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]
  extensions: [health_check]
To reduce the number of metrics further, you can keep only the dimension set [PodName, Namespace, ClusterName] if you do not need the other dimension sets:
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  batch/metrics:
    timeout: 60s
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName]] # Reduce exported dimensions
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]
  extensions: [health_check]
Additionally, if you want to ignore the pod network metrics, you can delete the metrics pod_network_rx_bytes and pod_network_tx_bytes. Suppose that you are still interested in per-pod granularity: you keep the dimension set [PodName, Namespace, ClusterName]. With these customizations, the final metric_declarations section will look like the following:
extensions:
  health_check:
receivers:
  awscontainerinsightreceiver:
processors:
  batch/metrics:
    timeout: 60s
exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # reduce pod metrics by removing network metrics
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]
  extensions: [health_check]
This configuration produces and streams the following four metrics with the single dimension set [PodName, Namespace, ClusterName], rather than the 55 different metrics across multiple dimension sets in the default configuration:
- pod_cpu_utilization
- pod_memory_utilization
- pod_cpu_utilization_over_pod_limit
- pod_memory_utilization_over_pod_limit
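To get a feel for the cost impact, you can estimate how many CloudWatch custom metrics the pod-level metric_declarations create per pod: one metric per (metric name, dimension set) pair, for each unique combination of dimension values. The sketch below compares the default pod declarations from earlier in this post with the final customized configuration; it is a rough illustration, since dimension sets that do not include PodName (such as [ClusterName]) are shared across pods and so the default per-pod count is somewhat overstated:

```python
# Rough count of CloudWatch custom metrics produced per pod by a
# metric_declarations block: one metric per (name, dimension set) pair.
def metrics_per_pod(metric_names: int, dimension_sets: int) -> int:
    return metric_names * dimension_sets

# Default pod declarations in this post: 6 metrics x 4 dimension sets,
# plus 2 metrics x 2 sets, plus 8 metrics x 1 set.
default_pod = metrics_per_pod(6, 4) + metrics_per_pod(2, 2) + metrics_per_pod(8, 1)

# Final customized configuration: 4 metrics x 1 dimension set.
customized = metrics_per_pod(4, 1)

print(default_pod, customized)  # 36 4
```

Multiplied across hundreds of pods, this order-of-magnitude reduction in metric streams is what drives the ingestion savings described below (actual billing depends on CloudWatch pricing tiers, which this sketch does not model).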
With this configuration, you send only the metrics that you are interested in rather than all of the metrics configured by default. As a result, you can decrease the metric ingestion cost for Container Insights considerably. This flexibility gives Container Insights customers a high level of control over the metrics being exported.
Customizing metrics by modifying the awsemf exporter configuration is also highly flexible: you can customize both the metrics that you want to send and their dimensions. Note that this approach applies only to the EMF logs sent to CloudWatch.
Conclusions
The two approaches demonstrated in this blog post are not mutually exclusive. In fact, they can be combined for a high degree of flexibility in customizing the metrics that we want ingested into our monitoring system. We used this approach to decrease the costs associated with metric storage and processing, as shown in the following graph.
The preceding Cost Explorer graph shows the daily CloudWatch cost with different ADOT Collector configurations on a small Amazon EKS cluster (20 worker nodes, 220 pods). August 15 shows the CloudWatch cost with the default configuration. On August 16, we used the customized EMF exporter approach and saw about 30% cost savings. On August 17, we used the processors approach, which achieved about 45% cost savings.
Consider the trade-offs of customizing the metrics sent by Container Insights: you decrease monitoring costs by sacrificing visibility into the monitored cluster. Also note that the built-in dashboards provided by Container Insights in the AWS console can be affected, because you may choose not to send metrics and dimensions that the dashboards use.
AWS Distro for OpenTelemetry support for Container Insights metrics on Amazon EKS and Amazon ECS is available now and you can start using it today. To learn more about ADOT, read the official documentation and check out the Container Insights for EKS Support AWS Distro for OpenTelemetry Collector blog post.
This is an open source project and welcomes your pull requests! We will be tracking the upstream repository and plan to release a fresh version of the toolkit monthly. If you need feedback or review for AWS-related components, feel free to tag us on GitHub PRs and issues. You can also open issues in the ADOT repo directly if you have questions.