Introducing Amazon CloudWatch Container Insights for Amazon EKS Fargate using AWS Distro for OpenTelemetry

This article is a translation of "Introducing Amazon CloudWatch Container Insights for Amazon EKS Fargate using AWS Distro for OpenTelemetry" (originally published February 17, 2022).

Introduction

Amazon CloudWatch Container Insights helps customers collect, aggregate, and summarize metrics and logs from containerized applications and microservices. Metric data is collected as performance log events that use the Embedded Metric Format (EMF). These performance log events use a structured JSON schema, which makes it possible to ingest and store high-cardinality data at scale. From this data, CloudWatch creates aggregated metrics at the cluster, node, Pod, task, and Service level as CloudWatch metrics. The metrics that Container Insights collects are available in CloudWatch automatic dashboards.

AWS Distro for OpenTelemetry (ADOT) is a secure, AWS-supported distribution of the OpenTelemetry project. With ADOT, users can instrument their applications just once (that is, add measurement code to the application) and send correlated metrics and traces to multiple monitoring solutions. With the recent launch of CloudWatch Container Insights support in ADOT (preview), customers can use ADOT to collect system metrics such as CPU, memory, disk, and network usage from Amazon EKS and from Kubernetes clusters running on Amazon Elastic Compute Cloud (Amazon EC2), with an experience similar to that of the Amazon CloudWatch agent. Today, the ADOT Collector adds EKS Fargate support for CloudWatch Container Insights, available in preview. Customers can now collect container and Pod metrics, such as CPU and memory utilization, for Pods deployed to an Amazon EKS cluster and running on AWS Fargate, and view them in CloudWatch dashboards without any change to the existing CloudWatch Container Insights experience. This also helps customers decide whether to scale up or down in order to respond to traffic or reduce cost.

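This blog post describes the design of the pipeline components in the ADOT Collector that enable Container Insights metrics to be collected from EKS Fargate workloads. It then shows how to configure and deploy the ADOT Collector so that it collects system metrics from workloads deployed to an EKS Fargate cluster and sends them to CloudWatch.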

Design of Container Insights support for EKS Fargate using the ADOT Collector

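The ADOT Collector is built around the concept of a pipeline, which consists of three main types of components: Receivers, Processors, and Exporters. A Receiver is where data enters the Collector. It accepts data in a specific format, converts it to an internal format, and passes it to the Processors and Exporters defined in the pipeline. Receivers can be pull-based or push-based. A Processor is an optional component that performs tasks such as batching, filtering, and transforming data between the time it is received and the time it is exported. An Exporter determines where metrics, logs, and traces are sent. The Collector architecture allows multiple instances of such pipelines to be defined through YAML configuration. The following diagram shows the pipeline components of an ADOT Collector instance deployed to EKS Fargate.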

Figure: Pipeline components of an ADOT Collector instance deployed to EKS Fargate
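The Receiver, Processor, and Exporter roles can be illustrated with a minimal Python sketch. This is not how the Collector is implemented (the Collector is configured through YAML, not through Python callables). The function names `run_pipeline`, `keep_pod_metrics`, and `add_type_label`, and the dictionary shape used for a metric, are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, Iterable, List

Metric = Dict[str, object]  # hypothetical shape: {"name": ..., "value": ..., "labels": {...}}
Processor = Callable[[List[Metric]], List[Metric]]


def run_pipeline(
    receive: Callable[[], Iterable[Metric]],
    processors: List[Processor],
    export: Callable[[List[Metric]], None],
) -> None:
    """Pull a batch from the receiver, apply each processor in order, then export."""
    batch = list(receive())
    for process in processors:
        batch = process(batch)
    export(batch)


def keep_pod_metrics(batch: List[Metric]) -> List[Metric]:
    """Filter-style processor: keep only metrics whose name starts with 'pod_'."""
    return [m for m in batch if str(m["name"]).startswith("pod_")]


def add_type_label(batch: List[Metric]) -> List[Metric]:
    """Transform-style processor: attach a Type label without mutating the input."""
    return [{**m, "labels": {**m["labels"], "Type": "Pod"}} for m in batch]


exported: List[Metric] = []
run_pipeline(
    receive=lambda: [
        {"name": "pod_memory_rss", "value": 10.0, "labels": {}},
        {"name": "machine_cpu_cores", "value": 2.0, "labels": {}},
    ],
    processors=[keep_pod_metrics, add_type_label],
    export=exported.extend,
)
print(exported)
```

The order of the `processors` list matters in the same way that the order of the processor list in a Collector pipeline matters: each step sees the output of the previous one.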

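The kubelet on each worker node in a Kubernetes cluster exposes resource metrics such as CPU, memory, disk, and network usage at the /metrics/cadvisor endpoint. In the EKS Fargate networking architecture, however, a Pod is not allowed to reach the kubelet on its worker node directly. The ADOT Collector therefore calls the Kubernetes API server, which proxies the connection to the kubelet on the worker node, and collects the kubelet's cAdvisor metrics for the workloads on that node. These metrics are available in Prometheus format, so the Collector uses an instance of the Prometheus Receiver as a drop-in replacement for a Prometheus server and scrapes them from the Kubernetes API server endpoint. Using Kubernetes service discovery, the Receiver can discover all worker nodes in the EKS cluster. As a result, a single ADOT Collector instance is sufficient to collect resource metrics from every node in the cluster.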
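To make the proxying concrete, the sketch below builds the URL that a scrape of one node resolves to, using the API server address and the proxy path that appear in the relabeling rules of the collector configuration later in this post. The helper name `cadvisor_proxy_url` and the node name in the example are hypothetical.

```python
from urllib.parse import quote

# In-cluster API server address, as used in the scrape configuration in this post.
API_SERVER = "kubernetes.default.svc:443"


def cadvisor_proxy_url(node_name: str, api_server: str = API_SERVER) -> str:
    """Return the API server URL that proxies to a node's kubelet cAdvisor endpoint."""
    if not node_name:
        raise ValueError("node_name must not be empty")
    # Percent-encode the node name so it is safe to embed as a single path segment.
    path = f"/api/v1/nodes/{quote(node_name, safe='')}/proxy/metrics/cadvisor"
    return f"https://{api_server}{path}"


# Hypothetical Fargate node name, for illustration only.
url = cadvisor_proxy_url("fargate-ip-10-0-1-23.us-west-2.compute.internal")
print(url)
```

Because every target shares the same API server address and differs only in the path, one Collector instance can reach every node that service discovery returns.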

The metrics then pass through a series of Processors that filter, rename, aggregate, and transform the data. The following Processors are used in the pipeline of the ADOT Collector instance for EKS Fargate shown above.

Filter Processor: includes or excludes metrics based on their name.
Metrics Transform Processor: renames metrics and aggregates metrics across labels.
Cumulative to Delta Processor: converts cumulative sum metrics to delta sum metrics.
Delta to Rate Processor: converts delta sum metrics to rate metrics. The resulting rate is a Gauge.
Metrics Generation Processor: creates new metrics from existing metrics.
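The cumulative-to-delta and delta-to-rate steps can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, not the processors' actual code. Skipping a sample whose value decreased (treated here as a counter reset) is an assumption made for this example, and the sample values are invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]        # (timestamp_seconds, value)
Delta = Tuple[float, float, float]  # (start_seconds, end_seconds, increase)


def cumulative_to_delta(samples: List[Sample]) -> List[Delta]:
    """Convert a cumulative counter into per-interval increases.

    The first sample has no predecessor, so it yields no delta. Intervals with
    a decreasing value or a non-increasing timestamp are skipped (assumption).
    """
    deltas: List[Delta] = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0 or v1 < v0:
            continue
        deltas.append((t0, t1, v1 - v0))
    return deltas


def delta_to_rate(deltas: List[Delta]) -> List[Sample]:
    """Divide each increase by the interval length to obtain a per-second rate."""
    return [(t1, increase / (t1 - t0)) for t0, t1, increase in deltas]


# Cumulative CPU seconds sampled once per minute (invented values).
cpu_seconds = [(0.0, 100.0), (60.0, 130.0), (120.0, 145.0)]
rates = delta_to_rate(cumulative_to_delta(cpu_seconds))
print(rates)  # [(60.0, 0.5), (120.0, 0.25)]
```

For a CPU-seconds counter, the resulting rate is CPU seconds consumed per second, which is the number of cores in use during the interval.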

The last component in the pipeline is the AWS CloudWatch EMF Exporter, which converts the metrics to the Embedded Metric Format (EMF) and sends them directly to CloudWatch Logs using the PutLogEvents API. The ADOT Collector sends the following metrics to CloudWatch for each workload running on EKS Fargate.

pod_cpu_utilization_over_pod_limit
pod_cpu_usage_total
pod_cpu_limit
pod_memory_utilization_over_pod_limit
pod_memory_working_set
pod_memory_limit
pod_network_rx_bytes
pod_network_tx_bytes
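Several of these metrics are derived rather than scraped directly. The sketch below reproduces the arithmetic of the metric generation rules in the collector configuration later in this post: multiply by 1000, divide by 100, and percent. Two points are assumptions of this example rather than statements from the post: that the raw CPU quota is expressed in microseconds per 100,000-microsecond scheduling period, and that the percent operation means metric1 / metric2 * 100. The function names and input values are hypothetical.

```python
def cpu_usage_millicores(cpu_cores_rate: float) -> float:
    """pod_cpu_usage_total: convert a usage rate in cores to millicores (x 1000)."""
    return cpu_cores_rate * 1000


def cpu_limit_millicores(cpu_quota_raw: float) -> float:
    """pod_cpu_limit: divide the raw quota by 100.

    Assumption: the quota is in microseconds per 100,000-microsecond period,
    so quota / 100_000 * 1000 millicores simplifies to quota / 100.
    """
    return cpu_quota_raw / 100


def percent(metric1: float, metric2: float) -> float:
    """Percent operation, read here as metric1 / metric2 * 100 (assumption)."""
    if metric2 == 0:
        raise ZeroDivisionError("metric2 must be non-zero")
    return metric1 / metric2 * 100


# Invented inputs: 0.125 cores in use, a quota of 50,000 (half a core),
# a 256 MiB working set, and a 1 GiB memory limit.
usage = cpu_usage_millicores(0.125)                          # 125.0 millicores
limit = cpu_limit_millicores(50_000)                         # 500.0 millicores
cpu_utilization = percent(usage, limit)                      # 25.0
memory_utilization = percent(268_435_456, 1_073_741_824)     # 25.0
print(usage, limit, cpu_utilization, memory_utilization)
```

The CPU utilization metric depends on two generated metrics, which is why the configuration uses two metric generation stages in sequence.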

Each metric is associated with the following dimension sets and is collected in the CloudWatch namespace named ContainerInsights.

ClusterName, LaunchType
ClusterName, Namespace, LaunchType
ClusterName, Namespace, PodName, LaunchType
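For readers unfamiliar with EMF, the sketch below builds a single log event that declares the namespace and the three dimension sets above, following the public Embedded Metric Format layout: an `_aws` metadata object plus top-level fields for dimension and metric values. It is an illustration, not the exact payload the EMF Exporter emits. The helper name `build_emf_event`, the cluster and Pod names, the timestamp, and the metric values are hypothetical.

```python
import json
from typing import Dict, Tuple

DIMENSION_SETS = [
    ["ClusterName", "LaunchType"],
    ["ClusterName", "Namespace", "LaunchType"],
    ["ClusterName", "Namespace", "PodName", "LaunchType"],
]


def build_emf_event(
    metrics: Dict[str, Tuple[float, str]],  # name -> (value, CloudWatch unit)
    dimensions: Dict[str, str],
    timestamp_ms: int,
) -> dict:
    """Build one EMF log event as a plain dictionary."""
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": "ContainerInsights",
                    "Dimensions": DIMENSION_SETS,
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        },
        # Dimension values and metric values are top-level members of the event.
        **dimensions,
        **{name: value for name, (value, _) in metrics.items()},
    }


event = build_emf_event(
    metrics={
        "pod_cpu_utilization_over_pod_limit": (25.0, "Percent"),
        "pod_memory_working_set": (268435456, "Bytes"),
    },
    dimensions={
        "ClusterName": "my-cluster",    # hypothetical
        "Namespace": "golang",
        "PodName": "webapp-example",    # hypothetical
        "LaunchType": "fargate",
    },
    timestamp_ms=1645056000000,         # arbitrary example timestamp
)
print(json.dumps(event, indent=2))
```

Because each of the three dimension sets is listed, CloudWatch can extract the same metric value at the cluster, Namespace, and Pod level from one log event.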

Deploying the ADOT Collector to EKS Fargate

Let's look in detail at how to install the ADOT Collector in an EKS Fargate cluster and collect metric data from workloads. The prerequisites for installing the ADOT Collector are as follows.

An EKS cluster that supports Kubernetes version 1.18 or later. You can create the EKS cluster using any of the methods described in the Amazon EKS documentation.
When the cluster creates Pods on AWS Fargate, the components running on the Fargate infrastructure must make calls to AWS APIs on your behalf, for example to pull container images from Amazon ECR. The EKS Fargate Pod execution role provides the IAM permissions to do this. Create the EKS Fargate Pod execution role by following the steps in the Amazon EKS documentation.
Before you can schedule Pods to run on Fargate, you must define a Fargate profile that specifies which Pods run on Fargate when they are launched. The implementation in this post uses two Fargate profiles, created by following the steps in the Amazon EKS documentation. The first Fargate profile is named fargate-container-insights and specifies the Namespace fargate-container-insights. The second Fargate profile is named applications and specifies the Namespace golang.
The ADOT Collector needs IAM permissions to send performance log events to CloudWatch. This is achieved by associating a Kubernetes service account with an IAM role using the IAM Roles for Service Accounts (IRSA) feature supported by EKS. The AWS managed policy CloudWatchAgentServerPolicy is attached to the IAM role. After replacing the CLUSTER_NAME and REGION variables, you can use the helper script below to create an IAM role named EKS-Fargate-ADOT-ServiceAccount-Role, grant it the permissions, and associate it with the Kubernetes service account adot-collector.

#!/bin/bash
set -euo pipefail

# Replace these two values with your cluster name and AWS Region.
CLUSTER_NAME=YOUR-EKS-CLUSTER-NAME
REGION=YOUR-EKS-CLUSTER-REGION
SERVICE_ACCOUNT_NAMESPACE=fargate-container-insights
SERVICE_ACCOUNT_NAME=adot-collector
SERVICE_ACCOUNT_IAM_ROLE=EKS-Fargate-ADOT-ServiceAccount-Role
SERVICE_ACCOUNT_IAM_POLICY=arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

# Associate an IAM OIDC provider with the cluster (required for IRSA).
eksctl utils associate-iam-oidc-provider \
  --cluster="$CLUSTER_NAME" \
  --region="$REGION" \
  --approve

# Create the IAM role and the Kubernetes service account associated with it.
eksctl create iamserviceaccount \
  --cluster="$CLUSTER_NAME" \
  --region="$REGION" \
  --name="$SERVICE_ACCOUNT_NAME" \
  --namespace="$SERVICE_ACCOUNT_NAMESPACE" \
  --role-name="$SERVICE_ACCOUNT_IAM_ROLE" \
  --attach-policy-arn="$SERVICE_ACCOUNT_IAM_POLICY" \
  --approve

Next, replace the placeholder variables YOUR-EKS-CLUSTER-NAME and YOUR-AWS-REGION in the following manifest with your EKS cluster name and AWS Region name, and deploy the ADOT Collector as a Kubernetes StatefulSet.

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
      - pods/proxy
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: [ "/metrics/cadvisor"]
    verbs: ["get", "list", "watch"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: fargate-container-insights
roleRef:
  kind: ClusterRole
  name: adotcol-admin-role
  apiGroup: rbac.authorization.k8s.io

# collector configuration section
# update `ClusterName=YOUR-EKS-CLUSTER-NAME` in the env variable OTEL_RESOURCE_ATTRIBUTES
# update `region=YOUR-AWS-REGION` in the emfexporter with the name of the AWS Region where you want to collect Container Insights metrics.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector-config
data:
  adot-collector-config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 40s

          scrape_configs:
          - job_name: 'kubelets-cadvisor-metrics'
            sample_limit: 10000
            scheme: https

            kubernetes_sd_configs:
            - role: node
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

            relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_node_label_(.+)
                # Only for Kubernetes ^1.7.3.
                # See: https://github.com/prometheus/prometheus/issues/2916
              - target_label: __address__
                # Changes the address to Kube API server's default address and port
                replacement: kubernetes.default.svc:443
              - source_labels: [__meta_kubernetes_node_name]
                regex: (.+)
                target_label: __metrics_path__
                # Changes the default metrics path to kubelet's proxy cadvisor metrics endpoint
                replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
            metric_relabel_configs:
              # extract readable container/pod name from id field
              - action: replace
                source_labels: [id]
                regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
                target_label: rkt_container_name
                replacement: '$${2}-$${1}'
              - action: replace
                source_labels: [id]
                regex: '^/system\.slice/(.+)\.service$'
                target_label: systemd_service_name
                replacement: '$${1}'
    processors:
      # rename labels which apply to all metrics and are used in metricstransform/rename processor
      metricstransform/label_1:
        transforms:
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: name
                new_label: container_id
              - action: update_label
                label: kubernetes_io_hostname
                new_label: NodeName
              - action: update_label
                label: eks_amazonaws_com_compute_type
                new_label: LaunchType

      # rename container and pod metrics which we care about.
      # container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
      metricstransform/rename:
        transforms:
          - include: container_spec_cpu_quota
            new_name: new_container_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: new_container_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: new_container_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: new_container_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: new_container_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: new_container_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: new_container_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: new_container_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: new_container_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: new_container_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: new_container_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_fs_limit_bytes
            new_name: new_container_filesystem_capacity
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_fs_usage_bytes
            new_name: new_container_filesystem_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          # POD LEVEL METRICS
          - include: container_spec_cpu_quota
            new_name: pod_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: pod_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: pod_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: pod_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: pod_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: pod_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: pod_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: pod_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: pod_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: pod_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: pod_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_network_receive_bytes_total
            new_name: pod_network_rx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_dropped_total
            new_name: pod_network_rx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_errors_total
            new_name: pod_network_rx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_total
            new_name: pod_network_rx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_bytes_total
            new_name: pod_network_tx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_dropped_total
            new_name: pod_network_tx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_errors_total
            new_name: pod_network_tx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_total
            new_name: pod_network_tx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}

      # filter out only renamed metrics which we care about
      filter:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - new_container_.*
              - pod_.*

      # convert cumulative sum datapoints to delta
      cumulativetodelta:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault

      # convert delta to rate
      deltatorate:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault

      experimental_metricsgeneration/1:
        rules:
          - name: pod_network_total_bytes
            unit: Bytes/Second
            type: calculate
            metric1: pod_network_rx_bytes
            metric2: pod_network_tx_bytes
            operation: add
          - name: pod_memory_utilization_over_pod_limit
            unit: Percent
            type: calculate
            metric1: pod_memory_working_set
            metric2: pod_memory_limit
            operation: percent
          - name: pod_cpu_usage_total
            unit: Millicore
            type: scale
            metric1: pod_cpu_usage_seconds_total
            operation: multiply
            # after deltatorate, pod_cpu_usage_seconds_total is a rate in cores
            # core to millicore: multiply by 1000
            scale_by: 1000
          - name: pod_cpu_limit
            unit: Millicore
            type: scale
            metric1: pod_cpu_limit_raw
            operation: divide
            scale_by: 100

      experimental_metricsgeneration/2:
        rules:
          - name: pod_cpu_utilization_over_pod_limit
            type: calculate
            unit: Percent
            metric1: pod_cpu_usage_total
            metric2: pod_cpu_limit
            operation: percent

      # add `Type` and rename metrics and labels
      metricstransform/label_2:
        transforms:
          - include: pod_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: "Pod"
          - include: new_container_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: Container
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: namespace
                new_label: Namespace
              - action: update_label
                label: pod
                new_label: PodName
          - include: ^new_container_(.*)$$
            match_type: regexp
            action: update
            new_name: container_$$1

      # add cluster name from env variable and EKS metadata
      resourcedetection:
        detectors: [env, eks]

      batch:
        timeout: 60s

    # only pod level metrics in metrics format, details in https://aws-otel.github.io/docs/getting-started/container-insights/eks-fargate
    exporters:
      awsemf:
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{PodName}'
        namespace: 'ContainerInsights'
        region: YOUR-AWS-REGION
        resource_to_telemetry_conversion:
          enabled: true
        eks_fargate_container_insights_enabled: true
        parse_json_encoded_attr_values: ["kubernetes"]
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName, LaunchType]]
            metric_name_selectors:
              - pod_cpu_utilization_over_pod_limit
              - pod_cpu_usage_total
              - pod_cpu_limit
              - pod_memory_utilization_over_pod_limit
              - pod_memory_working_set
              - pod_memory_limit
              - pod_network_rx_bytes
              - pod_network_tx_bytes

    extensions:
      health_check:

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [metricstransform/label_1, resourcedetection, metricstransform/rename, filter, cumulativetodelta, deltatorate, experimental_metricsgeneration/1, experimental_metricsgeneration/2, metricstransform/label_2, batch]
          exporters: [awsemf]
      extensions: [health_check]

# configure the service and the collector as a StatefulSet
---
apiVersion: v1
kind: Service
metadata:
  name: adot-collector-service
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  ports:
    - name: metrics # default endpoint for querying metrics.
      port: 8888
  selector:
    component: adot-collector
  type: ClusterIP

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: adot-collector
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  selector:
    matchLabels:
      app: aws-adot
      component: adot-collector
  serviceName: adot-collector-service
  template:
    metadata:
      labels:
        app: aws-adot
        component: adot-collector
    spec:
      serviceAccountName: adot-collector
      securityContext:
        fsGroup: 65534
      containers:
        - image: amazon/aws-otel-collector:v0.15.1
          name: adot-collector
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/adot-collector-config.yaml"
          env:
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "ClusterName=YOUR-EKS-CLUSTER-NAME"
          resources:
            limits:
              cpu: 2
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
          volumeMounts:
            - name: adot-collector-config-volume
              mountPath: /conf
      volumes:
        - configMap:
            name: adot-collector-config
            items:
              - key: adot-collector-config
                path: adot-collector-config.yaml
          name: adot-collector-config-volume
---
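Before the manifest above is applied, its two placeholders need to be filled in. A minimal sketch of that substitution is shown below. It assumes the manifest has been read into a string, and the helper name `fill_placeholders` and the example cluster name and Region are hypothetical. The helper raises an error when a placeholder is missing, so that a manifest that has already been edited by hand is not silently accepted.

```python
def fill_placeholders(manifest: str, cluster_name: str, region: str) -> str:
    """Replace the two placeholders used in the collector manifest."""
    replacements = {
        "YOUR-EKS-CLUSTER-NAME": cluster_name,
        "YOUR-AWS-REGION": region,
    }
    for placeholder, value in replacements.items():
        if not value:
            raise ValueError(f"no value supplied for {placeholder}")
        if placeholder not in manifest:
            raise ValueError(f"placeholder {placeholder} not found in manifest")
        manifest = manifest.replace(placeholder, value)
    return manifest


# A two-line fragment standing in for the full manifest text.
fragment = (
    'value: "ClusterName=YOUR-EKS-CLUSTER-NAME"\n'
    "region: YOUR-AWS-REGION\n"
)
rendered = fill_placeholders(fragment, "my-cluster", "us-west-2")
print(rendered)
```

The rendered text can then be written to a file and applied with kubectl in the usual way.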

Deploy a sample stateless workload to the cluster using the following Deployment manifest.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: golang
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
      role: webapp-service
  template:
    metadata:
      labels:
        app: webapp
        role: webapp-service
    spec:
      containers:
        - name: go
          image: public.ecr.aws/awsvijisarathy/prometheus-webapp:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: "256m"
              memory: "512Mi"
            limits:
              cpu: "512m"
              memory: "1024Mi"

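Both of the deployments above target Namespaces that are associated with a Fargate profile, so the workloads are scheduled to run on Fargate. A Fargate worker node is provisioned for each of these workloads, and it may take a few minutes for the Pods to reach the Ready status. Running the command kubectl get nodes -l eks.amazonaws.com/compute-type=fargate should list the Fargate worker nodes, whose names begin with the prefix fargate. Use the following commands to confirm that the ADOT Collector Pod and the sample workload Pods are all running.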

kubectl get pods -n fargate-container-insights
kubectl get pods -n golang
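The same check can be scripted. The sketch below parses the JSON that kubectl prints when `-o json` is added to the commands above and reports any Pod whose phase is not Running. The helper names and the sample Pod names are hypothetical, and `fetch_pods_json` is defined but not called here, because it needs kubectl and access to a cluster.

```python
import json
import subprocess
from typing import List


def fetch_pods_json(namespace: str) -> str:
    """Return the JSON output of `kubectl get pods` for a namespace."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def pods_not_running(pods_json: str) -> List[str]:
    """Return the names of Pods whose status.phase is not 'Running'."""
    items = json.loads(pods_json).get("items", [])
    return [
        pod["metadata"]["name"]
        for pod in items
        if pod.get("status", {}).get("phase") != "Running"
    ]


# Sample input standing in for real kubectl output (invented Pod names).
sample = json.dumps(
    {
        "items": [
            {"metadata": {"name": "adot-collector-0"}, "status": {"phase": "Running"}},
            {"metadata": {"name": "webapp-example"}, "status": {"phase": "Pending"}},
        ]
    }
)
waiting = pods_not_running(sample)
print(waiting)  # ['webapp-example']
```

Note that a Running phase is not the same as the Ready condition. A stricter check would also inspect each Pod's conditions.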

Visualizing EKS Fargate resource metrics using CloudWatch Container Insights

The performance log events for the workloads appear in the log group named /aws/containerinsights/CLUSTER_NAME/performance, as shown below. A separate log stream is created for each Pod running on Fargate.

Figure: User interface showing the performance log events for the workloads

Shown below is a representative example of the Embedded Metric Format JSON data contained in one of the log events. It contains the data for the metrics named pod_cpu_usage_total and pod_cpu_utilization_over_pod_limit.

Figure: User interface showing a log event

The following graph shows the same pod_cpu_utilization_over_pod_limit metric in the CloudWatch metrics dashboard.

Figure: Graph of Pod metrics in CloudWatch

The metrics can also be visualized using the prebuilt Container Insights dashboards, which display data at the cluster, node, Namespace, Service, and Pod level. The following dashboard view shows EKS Fargate metrics at the cluster level.

Figure: Visualization with CloudWatch Container Insights

Conclusion

This blog post gave an overview of the design of the ADOT Collector that supports EKS Fargate in CloudWatch Container Insights, and demonstrated how to deploy it and collect metrics from workloads on an EKS Fargate cluster. A single Collector instance can discover all worker nodes in an EKS cluster using Kubernetes service discovery and collect metrics from them by using the Kubernetes API server as a proxy to the kubelet on each worker node. EKS customers can now collect system metrics such as CPU, memory, disk, and network usage from workloads deployed to an EKS Fargate cluster and visualize them in CloudWatch dashboards, with an experience similar to that of the CloudWatch agent.

Translated by Sugita of Professional Services from the original English article.
