
Monitoring and automating recovery from AZ impairments in Amazon EKS with Istio and ARC Zonal Shift

Introduction

Running microservice-style architectures in the cloud can quickly become a complex operation. Teams must account for a growing number of moving pieces such as multiple instances of independent workloads, along with their infrastructure dependencies. These components can then be distributed across different topology domains, such as multiple Amazon Elastic Compute Cloud (Amazon EC2) instances, Availability Zones (AZs), or AWS Regions. Kubernetes alleviates some of the operational burden of setting up and managing these environments by automatically deploying container workloads and infrastructure based on the previously mentioned topologies. However, because of the size, complexity, and network dependency of these environments, a sub-section of the overall architecture inevitably runs in a degraded or failing state at some point in time. As such, teams should design for failure by making sure that they can understand the health of their Kubernetes applications and infrastructure at runtime, and build them to quickly adapt and recover from unexpected events that risk the availability of their workloads.

A best practice and common approach for both applications and infrastructure is to have redundancy and to avoid single points of failure. To eliminate single points of failure, customers are increasingly deploying highly available applications in Amazon Elastic Kubernetes Service (Amazon EKS) across multiple AZs. However, application and infrastructure redundancy is just one of multiple strategies that should be applied for the best results. Whether or not you apply certain resiliency or failure design mechanisms, and how you apply them, depends on your application’s targeted Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Some applications may be of a critical nature with minimal to zero downtime requirements, whereas others may not fit into this mold.

The scope of a failure or degradation can vary. In your Amazon EKS environment, an issue may affect a single worker node, a subset of worker nodes, or even an entire AZ. In the case of AZ impairments, you can use Amazon Application Recovery Controller (ARC) zonal shift as part of your resiliency and recovery strategy. With ARC zonal shift, you can temporarily redirect in-cluster network traffic away from the impacted AZ. However, to optimally manage the zonal shift process, you must make sure you have sufficient monitoring in place to detect zonal impairments.

Alternatively, you can allow AWS to manage this for you using zonal autoshift. With zonal autoshift, AWS monitors the overall health of your AZs and responds to a potential impairment by automatically shifting traffic away from the impaired AZ in your cluster environment on your behalf. AWS also manages restoring in-cluster traffic to that AZ when it returns to a healthy state.

Today, a number of customers use a service mesh implementation such as Istio to manage the network infrastructure of their application environment. Istio can help improve the network observability of your system because its data-plane proxies expose key metrics related to network requests and microservice interactions that can signal various issues, such as a problem with a particular AZ. This post focuses on how you can monitor and automate quick application recovery in the event of an unhealthy or degraded AZ using metrics from Istio as signals to manage your zonal shifts. The solution in this post is ideal for teams who have already adopted Istio into their Amazon EKS environments. As an alternative, teams can consider using Amazon CloudWatch and its embedded metric format (EMF) that enables you to embed custom metrics with log data.

Solution overview

In this walkthrough, you deploy a sample application across multiple AZs in Amazon EKS. This application is part of an Istio service mesh running in sidecar mode, which means each Pod contains both the application container and the Istio sidecar proxy (Envoy). The sidecar proxy mediates inbound and outbound communication to the application. To determine the health of each AZ, you continuously evaluate the network responses (such as 2xx and 5xx) for requests captured by these application sidecar proxies at the AZ level. You do this by monitoring the Envoy clusters of these sidecar proxies using Prometheus. The metric you use for this is envoy_cluster_zone_<availability_zone>__upstream_rq (for example, envoy_cluster_zone_af_south_1a__upstream_rq). For clarity, an Envoy cluster (not to be confused with an EKS cluster) refers to a group of similar upstream hosts that accept traffic for a particular application. Furthermore, Grafana is used to visualize this data for the respective AZs, and to send alerts to a Slack channel in the event that things go awry. Lastly, you trigger an ARC zonal shift in Amazon EKS to test application recovery. Although this example focuses on one signal for AZ health monitoring, when you’re observing multi-AZ environments in production, you should also account for different types of issues and failures, such as latency, silent failures (not receiving traffic), gray failures, and regular failures, to get a better perspective on the health and state of your overall environment and its AZs. You can view a list of the cluster statistics exposed by Envoy.
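To make that signal concrete, the following is a sketch of the kind of per-AZ Prometheus query this post builds on, grouping the upstream requests seen by the sample application’s sidecars in one zone by response class (the zone name assumes the af-south-1 Region used throughout this example):

sum by (response_code_class) (envoy_cluster_zone_af_south_1a__upstream_rq)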

The following diagrams depict the application environment that is demonstrated and discussed in this post.

Solution overview in EKS with Istio and monitoring stack (Prometheus and Grafana)

Event notification path to the Platform team

As mentioned previously, this solution uses the collective network responses from Pods in an AZ, evaluated at regular intervals, to indicate whether or not an AZ is healthy. The following diagram depicts a cluster where the AZs are considered healthy.

All AZs considered healthy based on network responses

The following diagram shows an AZ (af-south-1c) that’s considered to be unhealthy based on the number of server-side errors encountered in the Pod network requests at the AZ level over a certain period of time.

One AZ is considered unhealthy based on isolated network failures for the same workload

The following diagram shows how the impacted AZ can be temporarily isolated and prevented from receiving any east-west or north-south network traffic by starting an ARC zonal shift to quickly recover and adapt to the change in the Amazon EKS environment. When you start an ARC zonal shift in Amazon EKS, the EndpointSlice controller removes the Pod endpoints in unhealthy AZs from the EndpointSlices. But how does this translate to service discovery in Istio?

Istio’s sidecar proxies use a set of service discovery APIs called the xDS APIs. One of these is the Endpoint Discovery Service (EDS) API, which allows for the automated discovery of the members (endpoints) of an upstream Envoy cluster. During a zonal shift, the list of discoverable endpoints (or members of an upstream Envoy cluster) includes only the endpoints running in the healthy AZs of the cluster. You can still use Destination Rules to apply Istio network policies during a zonal shift.
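If you want to see what EDS is serving at any point, istioctl can dump the endpoints known to the ingress gateway’s Envoy proxy. The following is a sketch (the exact output columns vary by Istio version) that uses the standard istio=ingressgateway label to find a gateway Pod:

INGRESS_POD=$(kubectl get pod -n istio-system -l istio=ingressgateway -o jsonpath='{.items[0].metadata.name}')
istioctl proxy-config endpoints $INGRESS_POD.istio-system | grep payments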

Shifting traffic away from the unhealthy AZ using zonal shift

Prerequisites

To follow this example, you need the following prerequisites:

  • An Amazon EKS cluster with worker nodes spread across at least three AZs (this example uses the af-south-1 Region)
  • Istio installed in the cluster in sidecar mode, including the Istio ingress gateway
  • The kube-prometheus stack (Prometheus, Alertmanager, and Grafana) installed in the cluster
  • The AWS Command Line Interface (AWS CLI), kubectl, istioctl, Helm, and jq installed locally
  • A Slack channel (or another contact point supported by Grafana) to receive alert notifications

Walkthrough

The following steps walk you through this solution.

Configure Istio networking for sample application

First, you configure and deploy the relevant Istio resources to make sure that the sample application can receive external network requests. For this, you must first configure the Istio ingress gateway. The ingress gateway is the entry point into the service mesh and is responsible for protecting the workloads in the mesh, and controlling any traffic that flows into it. You can configure the Istio ingress gateway by creating a Gateway resource that specifies which port to open and the virtual hosts associated with it.

Gateway

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: ecommerce-gateway
  namespace: ecommerce
spec:
  selector:
    istio: ingressgateway # use istio default controller
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"

After network traffic enters the service mesh through the ingress gateway, it needs to be routed to the correct destination. VirtualService resources are responsible for managing this routing process.

Virtual Service

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-virtualservice
  namespace: ecommerce
spec:
  hosts:
  - "*"
  gateways:
  - ecommerce-gateway
  http:
  - match:
    - uri:
        prefix: /v1/payments
    route:
    - destination:
        host: payments-service
        port:
          number: 3005
    retries:
      attempts: 3
      perTryTimeout: 2s

In the next step, you deploy the sample application.

Run and spread multiple pod replicas across AZs

Running multiple instances of an application and spreading them across multiple AZs increases both its fault tolerance and availability. With topology spread constraints, you can set up your applications to have pre-existing, static stability so that, in the case of an AZ impairment, you have enough replicas in the healthy AZs to immediately absorb any spikes or surges in traffic that they may experience.

As a first step, you should create the ecommerce namespace where the application instances can reside. After that, you should label the namespace appropriately so that Istio is aware that it should inject sidecar proxies for any application running inside that namespace.

kubectl create ns ecommerce
kubectl label namespace ecommerce istio-injection=enabled
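You can confirm that the injection label is in place before deploying the application:

kubectl get namespace ecommerce --show-labels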

When you complete these steps, you can use the following code to deploy the sample payments application. For the best spread (or replica distribution) results, you should progressively scale the application after the initial replicas are already up and running. This allows the Kubernetes Scheduler to take into account the replicas already running on worker nodes in each AZ and then respect the scheduling constraints that you define.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-service-account
  namespace: ecommerce
---
apiVersion: v1
kind: Service
metadata:
  name: payments-service
  namespace: ecommerce
spec:
  selector:
    app: payments
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 3005
    targetPort: 3005
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
  namespace: ecommerce
  labels:
    app.kubernetes.io/version: "0.0.1" 
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments
      workload: ecommerce
  template:
    metadata:
      labels:
        app: payments
        workload: ecommerce
        version: "0.0.1"
    spec:
      serviceAccountName: payments-service-account
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: payments
      containers:
      - name: payments-container
        image: "lukondefmwila/ecommerce-payments:0.0.1"
        readinessProbe:
          httpGet:
            path: /v1/payments
            port: 3005
          initialDelaySeconds: 5
          periodSeconds: 10
        ports:
        - containerPort: 3005
        resources:
          requests:
            cpu: "1"
            memory: "64Mi"

After applying the preceding resources, you can verify that the Pods are running in your cluster as expected. The following screenshot shows a view using K9s.

Deployed sample application running in the cluster
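If you prefer plain kubectl over K9s, the following commands show which node each replica landed on and which AZ each node belongs to:

kubectl get pods -n ecommerce -o wide
kubectl get nodes -L topology.kubernetes.io/zone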

After deploying the application, you can review its distribution across the different AZs by listing the payments upstream clusters that the Istio ingress gateway knows about by entering the following command. The results display the Pod endpoints and the respective AZs in which they reside.

kubectl exec -it deploy/istio-ingressgateway -n istio-system -c istio-proxy \
-- curl localhost:15000/clusters | grep payments | grep zone

List of Pod endpoints per AZ

Next, you can test that the application is running as expected.

First, get the hostname for the Istio ingress gateway, append the path /v1/payments to the hostname, and then run a GET request in your terminal, the browser, or an API client tool.

ISTIO_IGW_HOST=$(kubectl get svc --namespace istio-system istio-ingressgateway -o json | jq -r ".status.loadBalancer.ingress | .[].hostname")
curl "http://$ISTIO_IGW_HOST/v1/payments"

You should see results similar to the following screenshot.

Sample application API response

Setting up Prometheus to monitor Istio sidecar proxies

As detailed in the Solution overview section, you gauge the health of an AZ by evaluating the responses to upstream requests handled by the application sidecar proxies in each zone. To do this, you must first configure Prometheus to scrape the metrics from the Pods that have the Istio sidecar proxy.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: prometheus
  labels:
    monitoring: istio-proxies
    release: prom
spec:
  selector:
    matchExpressions:
    - {key: istio-prometheus-ignore, operator: DoesNotExist}
  namespaceSelector:
    any: true
  jobLabel: envoy-stats
  podMetricsEndpoints:
  - path: /stats/prometheus
    interval: 15s
    relabelings:
    - action: keep
      sourceLabels: [__meta_kubernetes_pod_container_name]
      regex: "istio-proxy"
    - action: keep
      sourceLabels: [
        __meta_kubernetes_pod_annotationpresent_prometheus_io_scrape]
    - sourceLabels: [
      __address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      targetLabel: __address__
    - action: labeldrop
      regex: "__meta_kubernetes_pod_label_(.+)"
    - sourceLabels: [__meta_kubernetes_namespace]
      action: replace
      targetLabel: namespace
    - sourceLabels: [__meta_kubernetes_pod_name]
      action: replace
      targetLabel: pod_name
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-component-monitor
  namespace: prometheus
  labels:
    monitoring: istio-components
    release: prom
spec:
  jobLabel: istio
  targetLabels: [app]
  selector:
    matchExpressions:
    - {key: istio, operator: In, values: [pilot]}
  namespaceSelector:
    any: true
  endpoints:
  - port: http-monitoring
    interval: 15s

After applying the preceding resources, you can run the following command to access the Prometheus dashboard:

kubectl -n prometheus port-forward statefulset/prometheus-prom-kube-prometheus-stack-prometheus 9090

Next, you must generate some traffic for the sample application so that you can see the metrics in Prometheus. To do this, you can run the following command in your terminal. It sends 150 GET requests to the payments application, reusing the ISTIO_IGW_HOST variable you set earlier. You can adjust the total number of requests as you see fit:

for i in {1..150}; do curl "http://$ISTIO_IGW_HOST/v1/payments"; sleep 0.5; done

After this, you can re-open the Prometheus dashboard and search for the metric results for the Envoy clusters in a particular AZ as shown in the following screenshots.

af-south-1a – envoy_cluster_zone_af_south_1a__upstream_rq

af-south-1b – envoy_cluster_zone_af_south_1b__upstream_rq

af-south-1c – envoy_cluster_zone_af_south_1c__upstream_rq

Metric query executor in Prometheus dashboard

Metric query executor for AZ network responses in Prometheus dashboard

Metric query executor for AZ network responses in Prometheus dashboard (zoomed in)

Creating Grafana dashboards

Next, you create a dashboard in Grafana to visualize the relevant data in a more organized form. You use Prometheus as the data source for the Grafana panels that get created.

To access Grafana in your browser, you can run the following command:

kubectl -n prometheus port-forward svc/prom-grafana 3000:80

If you’re launching Grafana for the first time since installing the observability stack with kube-prometheus, then you can use the username `admin` and get the password using the following command:

kubectl get secret prom-grafana -n prometheus -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

After getting the credentials, you log in to Grafana. By default, kube-prometheus comes pre-configured with Prometheus and Alertmanager set up as data sources in your Grafana installation. However, if you need to set up Prometheus as a data source, then you can follow these steps:

  1. Choose Connections in the left-side menu.
  2. Under Connections, choose Add new connection.
  3. Enter “Prometheus” in the search bar.
  4. Choose Prometheus from the search results.
  5. Choose Add new data source in the top right corner.

To configure the data source, you must provide a name for the connection, and then provide the URL of your Prometheus server, as shown in the following screenshot.

The Prometheus server URL for kube-prometheus is http://prom-kube-prometheus-stack-prometheus.prometheus:9090/.

Data source configuration in Grafana

After your Prometheus data source successfully connects, you can create a dashboard in Grafana. Each panel you add to the dashboard should represent an AZ for your EKS cluster using the same metrics from the previous section (such as envoy_cluster_zone_af_south_1a__upstream_rq). Repeat this step for each AZ in your cluster, as shown in the following screenshot.

Selecting metrics for Visualizations in Grafana
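For reference, a panel for af-south-1a could use a query along the lines of the following sketch (the rate window and grouping are illustrative), with an equivalent query for each of the other AZs:

sum by (response_code_class) (rate(envoy_cluster_zone_af_south_1a__upstream_rq[5m]))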

After completing this process, and depending on your panel configurations, you should see something similar to the following screenshot.

Grafana visualizations for monitoring AZs

Configuring Grafana alerts for the EKS cluster AZs

In this section, you focus on configuring Grafana alerts. At this point, your EKS cluster environment is set up so you can monitor the health of each AZ based on the network responses from Pods in a given AZ. However, when defining alert rules, you must consider the following:

  • What qualifies as a health issue within an AZ?
  • Who should be notified in the event of an issue?
  • How are they notified?

For this example, consider a spike in server-side errors within a particular AZ to be a signal that something is wrong with it. You can also use other indicators, such as latency, request timeouts, or connection failures. As you did for the dashboard panels, you should create an alerting rule for each AZ.

First, in the Alert Rules section, configure your alerting rules to use data from the most recent 30 minutes (now-30m to now), as shown in the following screenshot. Next, you can set up a rule that allows you to track server-side errors (5xx) for the respective AZs.

envoy_cluster_zone_af_south_1a__upstream_rq{response_code_class="5xx"}
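If you prefer an expression that captures a burst of errors rather than the raw counter, a sketch such as the following (the window and threshold are illustrative) counts the 5xx responses seen in that AZ over the last five minutes:

sum(increase(envoy_cluster_zone_af_south_1a__upstream_rq{response_code_class="5xx"}[5m])) > 50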

Setting up alert rules in Grafana

After that, configure how often you want Grafana to evaluate the condition you just set up, and then assign labels to the alert for the notification policies.

Next, set up a contact point for the Slack channel that receives the alert notifications in the Contact points section, and test that it’s working as expected, as shown in the following screenshot.

Test alert contact point

Lastly, in the Notification Policies section, create policies so that Grafana can match alerts with the appropriate contact points based on the labels you added to each rule, as shown in the following screenshot.

Configure notification policies in Grafana

To test that your alert notification system is working as expected, you can modify the alert condition for one of the AZs (af-south-1c in this example) so that Grafana can instead notify you when there are successful requests (response_code_class="2xx") being sent. Furthermore, you can reduce the evaluation intervals so that you don’t have to wait an extended period of time to see the results.

To do this, update the metric and label filter for the alert rule to have the following condition for the AZ on which you want to test alerting:

envoy_cluster_zone_af_south_1c__upstream_rq{response_code_class="2xx"}

Then, you can reduce the count threshold for testing purposes, and update the alert rule evaluation intervals. Choose Preview in Grafana to see which values result in firing alerts. The following screenshot shows an example of an alert rule preview.

Preview of firing alerts from Grafana

After you save the alert rule configurations, you can test the notification system by sending a high number of requests to the application to cross the alert rule threshold, as shown in the following screenshot.

Grafana alert notifications in Slack

Triggering a zonal shift in Amazon EKS

If you get an alert that an AZ in your EKS cluster is unhealthy or impaired, then you can respond by shifting the network traffic away from the impacted zone using ARC zonal shift. When you trigger a zonal shift in Amazon EKS, the following steps are automatically applied:

  • The nodes in the impacted AZ are cordoned. This prevents the Kubernetes Scheduler from scheduling new Pods onto the nodes in the unhealthy AZ.
  • If you’re using Managed Node Groups, AZ rebalancing is suspended, and your Auto Scaling Group (ASG) is updated to make sure that new EKS Data Plane nodes are only launched in the healthy AZs.
  • The nodes in the unhealthy AZ aren’t terminated and the Pods aren’t evicted from these nodes. This is to make sure that when a zonal shift expires or gets cancelled your traffic can be safely returned to the AZ that still has full capacity.
  • The EndpointSlice controller finds the Pod endpoints in the impaired AZ and removes them from the relevant EndpointSlices. This makes sure that only Pod endpoints in healthy AZs are targeted to receive network traffic. When a zonal shift is cancelled or expires, the EndpointSlice controller updates the EndpointSlices to include the endpoints in the restored AZ.

You can start a zonal shift in your cluster using the following command:

export AWS_REGION="af-south-1"
export CLUSTER_NAME="beta"
export ACCOUNT_ID="your-account-id"
AVAILABILITY_ZONES=("afs1-az1" "afs1-az2" "afs1-az3")

aws arc-zonal-shift start-zonal-shift \
--resource-identifier arn:aws:eks:$AWS_REGION:$ACCOUNT_ID:cluster/$CLUSTER_NAME \
--endpoint-url https://arc-zonal-shift.$AWS_REGION.amazonaws.com \
--region $AWS_REGION \
--away-from ${AVAILABILITY_ZONES[2]} --comment "af-south-1c is unhealthy" --expires-in 1h

Remember to update the preceding script with the AZ details based on the Region where your resources are deployed, as shown in the following screenshot.

ARC zonal shift console showing a shift in progress
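If you need to look up the zone IDs (such as afs1-az3) that map to the AZ names in your Region, or check on and cancel an active shift from the command line, you can use commands along these lines:

aws ec2 describe-availability-zones --region $AWS_REGION \
--query "AvailabilityZones[].[ZoneName,ZoneId]" --output text

aws arc-zonal-shift list-zonal-shifts --region $AWS_REGION

aws arc-zonal-shift cancel-zonal-shift --zonal-shift-id <zonal-shift-id> --region $AWS_REGION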

To verify that the affected endpoints are no longer available to receive traffic, you can run the following command to get the Envoy clusters associated with the payments application:

kubectl exec -it deploy/istio-ingressgateway -n istio-system -c istio-proxy \
-- curl localhost:15000/clusters | grep payments | grep zone

List of Pod endpoints for the sample application has been updated based on the zonal shift in progress

As you can see in the preceding screenshot, the only available upstream clusters for the payments application are the ones in af-south-1a and af-south-1b.

Then, you can run queries to the payments application to make sure that it’s still available and functioning as expected based on network responses from Pods running in the other AZs (af-south-1a and af-south-1b).

Cleaning up

To avoid incurring further costs, make sure to destroy the infrastructure that you provisioned for the examples detailed in this post. The following commands assume the file names that you used when saving your Kubernetes manifest files during the walkthrough.

Delete application resources:

kubectl delete -f sample-application-manifest.yaml
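The Istio Gateway and VirtualService resources and the ecommerce namespace aren’t part of the application manifest. Assuming you saved them in their own file (the file name here is illustrative), remove them as well:

kubectl delete -f istio-networking-manifest.yaml
kubectl delete namespace ecommerce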

Delete Prometheus custom resources for monitoring Envoy sidecars:

kubectl delete -f pod-monitoring-manifest.yaml,service-monitoring-manifest.yaml

Uninstall Prometheus:

helm uninstall <release-name> -n <namespace>

Uninstall Istio:

istioctl uninstall --purge

Conclusion

In this post, we covered a hands-on approach to monitoring and automating recovery from AZ impairments in your Amazon EKS cluster using Istio, Prometheus, Grafana, and ARC zonal shift. To learn more about ARC zonal shift and zonal autoshift in Amazon EKS, you can read the documentation.