AWS Cloud Operations Blog
Unlocking Insights: Turning Application Logs into Actionable Metrics
Modern software development teams recognize observability as a critical aspect of building reliable and resilient applications. By implementing observability practices, software teams can proactively identify issues, uncover performance bottlenecks, and enhance system reliability. However, observability is a fairly recent trend and still lacks industry-wide adoption.
As organizations standardize on containers, they often lift and shift legacy applications into containers. Such legacy applications provide insights mostly through logs. Unlike modern applications, they don't generate metrics; as a result, teams find it challenging to observe them and to extract meaningful insights from their logs.
In this post, we demonstrate how to improve the observability of applications that don't generate metrics. We further demonstrate how you can use the same capability with control plane logs, such as audit logs, and receive timely notifications when errors arise. We use Amazon CloudWatch to create metrics from log events using filters. This post uses Amazon EKS to run the application. Customers can use this approach for applications running in Amazon ECS, Amazon EC2, or AWS Lambda, as well as applications running in their data centers using IAM Roles Anywhere. Finally, we dive into enhancing EKS security with automated anomaly detection and alerting using Amazon CloudWatch Anomaly Detection.
Solution Overview
For this demonstration, we use Amazon EKS Blueprints to create an Amazon EKS cluster with the AWS for Fluent Bit agent, which aggregates application logs produced in Common Log Format and ingests them into Amazon CloudWatch. The application is designed to inject failures at random. Using CloudWatch metric filters, we match required terms in the application's logs to convert log data into metrics. Next, we create alarms in CloudWatch to detect an increased error rate. Whenever the sample application's error rate breaches the set threshold, a notification is sent to a Slack channel.
Figure 1: Solution Architecture for Turning Application Logs into Actionable Metrics
Here is a functional flow of this solution:
- AWS for Fluent Bit agent collects and processes application logs
- The agent forwards logs to Amazon CloudWatch Logs, where they are stored in log groups
- When activated, Amazon EKS control plane logs are available as vended logs in Amazon CloudWatch Logs
- Amazon CloudWatch Patterns surfaces emerging trends and identifies frequently occurring or high-cost log lines
- Amazon CloudWatch custom metric filters extract metric data points based on filter expressions and create metric time series
- A CloudWatch alarm publishes a notification to Amazon SNS when a threshold is breached
- Amazon SNS invokes an AWS Lambda function, which in turn sends CloudWatch alarm notifications to Slack
Prerequisites
Install the following utilities on a Linux-based host machine, which can be an Amazon EC2 instance, an AWS Cloud9 instance, or a local machine with access to your AWS account:
- AWS CLI version 2 or later to interact with AWS services using CLI commands
- Node.js (v16.0.0 or later) and npm (8.10.0 or later)
- AWS CDK v2.114.1 or later to build and deploy cloud infrastructure and Kubernetes resources programmatically
- AWS SAM CLI to deploy AWS Lambda function
- Kubectl to communicate with the Kubernetes API server
- Git to clone required source repository from GitHub
Let’s start by setting environment variables:
export CAP_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export CAP_CLUSTER_REGION="us-west-2"
export AWS_REGION=$CAP_CLUSTER_REGION
export CAP_CLUSTER_NAME="demo-cluster"
export CAP_FUNCTION_NAME="cloudwatch-to-slack"
Clone the sample repository, which contains the code for our solution:
git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd ./containers-blog-maelstrom/aws-cdk-eks-app-alarms-to-slack
Bootstrap the Environment
As the solution uses Amazon EKS CDK Blueprints to provision an Amazon EKS cluster, you must bootstrap your environment in the required AWS Region of your AWS account.
Bootstrap your environment and install all Node.js dependencies:
bash ./bootstrap-env.sh
Create EKS cluster
Once you’ve bootstrapped the environment, create the cluster:
cdk deploy "*" --require-approval never
Deployment will take approximately 20-30 minutes to complete. Upon completion, you will have a fully functioning EKS cluster deployed in your account.
Figure 2: Snapshot of output from cdk deployment
Copy and run the aws eks update-kubeconfig command shown in the screenshot to gain access to your Amazon EKS cluster using kubectl.
Create an Incoming Webhook in Slack
Slack allows you to send messages from other applications using incoming webhooks. Refer to sending messages using incoming webhooks for more details. We use this incoming webhook to send notifications to a Slack channel whenever an alarm is triggered.
Follow these steps to configure incoming webhook in Slack:
- Create or pick a Slack channel to which CloudWatch alarm notifications will be sent.
- Go to https://<your-team-domain>.slack.com/services/new and search for Incoming WebHooks, then select it and click Add to Slack.
- Under Post to Channel, choose the Slack channel where you want to send messages and click Add Incoming WebHooks Integration.
- Copy the webhook URL from the setup instructions and save it. You'll use this URL in the Lambda function.
Create a KMS Key
To strengthen the security posture of the incoming webhook URL, we encrypt it using an AWS KMS key. The deploy-sam-app.sh script creates a KMS key with the key alias alias/${CAP_FUNCTION_NAME}-key.
Create an AWS Lambda function
As the next step, create a Lambda function to send CloudWatch alarm notifications to Slack. The script uses the AWS Serverless Application Model (SAM) to create:
- An Amazon SNS topic to which the CloudWatch alarm publishes notifications
- A Lambda execution role that grants the function basic access and permission to decrypt using the KMS key
- A Lambda function that sends notifications to Slack using the incoming webhook URL
- A Lambda permission that allows SNS to trigger the function
The script deploy-sam-app.sh takes the following two inputs to deploy the SAM template:
- The Slack incoming webhook URL that you created previously.
- The Slack channel name (selected previously) to which notifications should be sent.
The deploy-sam-app.sh script encrypts the Slack incoming webhook URL client-side using the KMS key with a specific encryption context. The Lambda function decrypts it using the same encryption context, and the Lambda execution role is granted fine-grained access to use the KMS key only with that encryption context.
Run the following command to deploy the SAM template:
bash ./deploy-sam-app.sh
Test the Lambda function
Let's validate the Lambda function by pushing a test event using the payload available at templates/test-event.json:
aws lambda invoke --region ${CAP_CLUSTER_REGION} \
--function-name ${CAP_FUNCTION_NAME} \
--log-type Tail \
--query LogResult --output text \
--payload $(cat templates/test-event.json | base64 | tr -d '\n') - \
| base64 -d
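The --payload flag expects a single-line Base64 token, which is why the command pipes the file through base64 and tr -d '\n'. Here is a minimal local sketch of that encoding step, using a hypothetical alarm payload rather than the actual templates/test-event.json:

```shell
# Hypothetical payload for illustration (not the repo's test-event.json).
payload='{"AlarmName":"400 errors from sample app","NewStateValue":"ALARM"}'

# GNU base64 wraps its output at 76 columns; those wrapped newlines
# would break the single --payload argument, so we strip them.
encoded=$(printf '%s' "$payload" | base64 | tr -d '\n')

# The newline-free token still decodes to the original JSON.
printf '%s' "$encoded" | base64 -d
```

The same round trip happens in the invoke command: the payload is Base64-encoded on the way in, and the returned LogResult is Base64-decoded on the way out.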
A successful execution posts a test message to the Slack channel and produces command output like the following:
Figure 3: Incoming webhook message on Slack
[INFO] 2023-04-12T01:04:49.880Z 81699331-10e9-416f-b8ae-4fb7f44f1d29 Message posted to httphandler-cloudwatch-alarms
END RequestId: 81699331-10e9-416f-b8ae-4fb7f44f1d29
REPORT RequestId: 81699331-10e9-416f-b8ae-4fb7f44f1d29 Duration: 587.33 ms Billed Duration: 588 ms Memory Size: 128 MB Max Memory Used: 68 MB Init Duration: 448.31 ms
AWS for Fluent Bit Agent
The AWS for Fluent Bit agent aggregates application logs and forwards them to CloudWatch. To forward logs to CloudWatch Logs, you need to provide an IAM role that grants the required permissions. Amazon EKS CDK Blueprints provides options to configure the Fluent Bit add-on as well as create IAM policies with the required permissions.
Verify that Fluent-Bit is running in your cluster:
kubectl get po -n kube-system \
-l app.kubernetes.io/name=aws-for-fluent-bit
NAME READY STATUS RESTARTS AGE
blueprints-addon-aws-for-fluent-bit-9db6l 1/1 Running 0 30m
Deploy Sample Application
Next, let us deploy a sample application HTTPHandler, which is an HTTP server that injects errors at random. We'll also deploy a curl container to generate traffic.
Deploy the sample application httphandler:
kubectl apply -f ./templates/sample-app.yaml
Let's send a few requests to the sample application to see the logs it produces:
kubectl exec -n sample-app -it \
curl -- sh -c 'for i in $(seq 1 15); do curl http://httphandler.sample-app.svc.cluster.local; sleep 1; echo $i; done'
Then, check the logs and observe the response codes and counts. With every HTTP request, the sample application randomly injects errors and logs the respective response code and size:
kubectl -n sample-app logs -l app=httphandler
2023/04/14 21:40:35 Listening on :8080...
192.168.99.44 - - [14/Apr/2023:21:40:44 +0000] "GET / HTTP/1.1" 500 22
192.168.99.44 - - [14/Apr/2023:21:40:45 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:46 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:47 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:48 +0000] "GET / HTTP/1.1" 500 22
192.168.99.44 - - [14/Apr/2023:21:40:49 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:50 +0000] "GET / HTTP/1.1" 200 13
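To see how a space-delimited filter pattern maps onto these lines, here is a small local sketch, plain awk with no CloudWatch dependency, that extracts the same field the metric filter in the next section matches on:

```shell
# Save a few Common Log Format lines like those above.
cat <<'EOF' > /tmp/sample.log
192.168.99.44 - - [14/Apr/2023:21:40:44 +0000] "GET / HTTP/1.1" 500 22
192.168.99.44 - - [14/Apr/2023:21:40:45 +0000] "GET / HTTP/1.1" 200 13
192.168.99.44 - - [14/Apr/2023:21:40:48 +0000] "GET / HTTP/1.1" 500 22
EOF

# Splitting on whitespace, the status code is field 9; counting lines
# where it exceeds 200 mirrors the statusCode>200 term of the filter.
awk '$9 > 200 { n++ } END { print n }' /tmp/sample.log
```

This prints 2, matching the two 500 responses above; CloudWatch applies the same positional match, but emits a metric data point instead of a count.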
Create CloudWatch Metrics
The AWS for Fluent Bit agent is configured to forward application logs with the log key "log" to a CloudWatch log group named like /aws/eks/fluentbit-cloudwatch/<CAP_CLUSTER_NAME>/workload/<NAMESPACE>. Hence, the sample application (httphandler) logs are forwarded to the log group /aws/eks/fluentbit-cloudwatch/demo-cluster/workload/sample-app. Each pod has its own log stream.
You can convert log data into numerical CloudWatch metrics using metric filters. Metric filters let you configure rules that extract metric data from log messages. Logs from httphandler are parsed using the filter pattern [host, logName, user, timestamp, request, statusCode>200, size], which the put-metric-filter command uses to create a metric filter with a dimension on the response status code.
Create a metric filter:
aws logs put-metric-filter --region ${CAP_CLUSTER_REGION} \
--log-group-name /aws/eks/fluentbit-cloudwatch/${CAP_CLUSTER_NAME}/workload/sample-app \
--cli-input-json file://templates/sample-app-metric-filter.json
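The actual filter definition lives in templates/sample-app-metric-filter.json in the sample repository. As a hypothetical sketch only, a put-metric-filter input with a status-code dimension could resemble the following (the filter name is invented; the metric name and namespace are assumptions based on the metrics shown later in the console):

```json
{
  "filterName": "sample-app-response-filter",
  "filterPattern": "[host, logName, user, timestamp, request, statusCode>200, size]",
  "metricTransformations": [
    {
      "metricName": "response_count",
      "metricNamespace": "SampleAppMetrics",
      "metricValue": "1",
      "dimensions": {
        "statusCode": "$statusCode"
      }
    }
  ]
}
```

Each matching log event publishes a value of 1, so summing the metric over a period yields the error count; the $statusCode reference carries the matched field through as a dimension.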
Send CloudWatch alarms to Slack
At this point, application logs are being forwarded to CloudWatch, and CloudWatch is generating metrics from those logs. Now we'd like to send a notification to our SRE Slack channel so that when errors exceed a set threshold, the team is notified immediately.
Create CloudWatch Alarms
Next, we'll create a CloudWatch alarm on a metric that sends notifications to an SNS topic. The following command creates a CloudWatch alarm that monitors for statusCode>200; when the number of errors exceeds 10 in the last 5 minutes, a notification is sent to the SNS topic.
SNS_TOPIC=$(aws cloudformation describe-stacks --region ${CAP_CLUSTER_REGION} --stack-name ${CAP_FUNCTION_NAME}-app --query 'Stacks[0].Outputs[?OutputKey==`CloudwatchToSlackTopicArn`].OutputValue' --output text)
aws cloudwatch put-metric-alarm --region ${CAP_CLUSTER_REGION} \
--alarm-actions ${SNS_TOPIC} \
--cli-input-json file://templates/sample-app-400-alarm.json
aws cloudwatch describe-alarms --region ${CAP_CLUSTER_REGION} \
--alarm-names "400 errors from sample app"
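The alarm definition itself lives in templates/sample-app-400-alarm.json. A hypothetical sketch consistent with the behavior described above, more than 10 errors over 5 minutes, might look like this (the metric name and namespace are assumptions):

```json
{
  "AlarmName": "400 errors from sample app",
  "AlarmDescription": "Error responses from sample app exceeded the threshold",
  "MetricName": "response_count",
  "Namespace": "SampleAppMetrics",
  "Statistic": "Sum",
  "Period": 300,
  "EvaluationPeriods": 1,
  "Threshold": 10,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "notBreaching"
}
```

TreatMissingData set to notBreaching keeps the alarm in the OK state when no traffic, and therefore no metric data, is present.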
Generate traffic to the sample application httphandler using the following command, which in turn generates metrics. Run this script in a separate terminal:
kubectl exec -n sample-app -it curl -- sh -c 'for i in $(seq 1 5000); do curl http://httphandler.sample-app.svc.cluster.local; sleep 1; echo $i; done'
Let this run for 10 minutes and check the CloudWatch alarm status. If the threshold is breached, you will get a notification in your Slack channel.
Open Amazon CloudWatch in the AWS Management Console and navigate to Metrics → All metrics → SampleAppMetrics → Metrics with no dimensions → response_count to visualize metrics CloudWatch creates from application logs.
Figure 4: CloudWatch console showing sample app metrics
You can select Alarms from the Metrics page, or view the alarm we created for 400 errors by going to CloudWatch → Alarms → All alarms → 400 errors from sample app
Figure 5: CloudWatch alarms page
Whenever the threshold is breached, you’ll get a notification in Slack:
Figure 6: Notifications on Slack
With this setup, we are now notified whenever our application experiences issues.
Enhancing Amazon EKS Security with automated anomaly detection and alerting
As cyber threats surge, customers are looking for quick ways to identify anonymous requests targeting their infrastructure and applications. Amazon CloudWatch Anomaly Detection can help by detecting anomalies in a metric using statistical and machine learning algorithms. These algorithms continuously analyze system and application metrics, determine normal baselines, and surface anomalies with minimal user intervention. CloudWatch anomaly detection is available for any AWS service metric or custom CloudWatch metric that has a discernible trend or pattern.
We recommend enabling EKS control plane logging to collect and analyze audit logs, which are essential for root cause analysis and attribution, such as ascribing a change to a particular user. Once enabled, EKS audit logs are sent to Amazon CloudWatch Logs under the log group /aws/eks/<CAP_CLUSTER_NAME>/cluster, where they can be used to detect anomalous behavior. To activate CloudWatch anomaly detection for the control plane logs, select the control plane log group, and on the Anomaly detection tab, choose Create anomaly detector. Review the options, such as frequency and filter patterns, then choose Activate anomaly detection. Note that it can take up to 24 hours to train the model and detect anomalies.
Figure 7: CloudWatch Anomaly Detection Configuration
We can create a filter pattern for unauthorized access errors and a custom metric filter to detect anonymous requests. Using CloudWatch alarms, alerts can also be sent to the same Slack channel we created earlier.
aws logs put-metric-filter --region ${CAP_CLUSTER_REGION} \
--log-group-name /aws/eks/${CAP_CLUSTER_NAME}/cluster \
--cli-input-json file://templates/cluster-403-metric-filter.json
aws cloudwatch put-metric-alarm --region ${CAP_CLUSTER_REGION} \
--alarm-actions ${SNS_TOPIC} \
--cli-input-json file://templates/cluster-403-alarm.json
aws cloudwatch describe-alarms --region ${CAP_CLUSTER_REGION} \
--alarm-names "403 errors from Cluster API Server"
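The filter definition is in templates/cluster-403-metric-filter.json in the repository. Because EKS audit log events are JSON, a metric filter for them uses CloudWatch's JSON filter syntax rather than the space-delimited pattern used earlier. A hypothetical sketch (the filter name, metric name, and namespace are assumptions) might be:

```json
{
  "filterName": "cluster-403-filter",
  "filterPattern": "{ $.responseStatus.code = 403 }",
  "metricTransformations": [
    {
      "metricName": "cluster_403_count",
      "metricNamespace": "ClusterMetrics",
      "metricValue": "1"
    }
  ]
}
```

The pattern matches any audit event whose responseStatus.code field equals 403, which is what the anonymous requests below produce.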
We can generate some anonymous requests to the cluster using the EKS cluster endpoint. Get the cluster endpoint from kubectl config view, or get the API server endpoint from the AWS console. From the terminal, run the following command to generate unauthorized errors against a nonexistent endpoint:
export CAP_CLUSTER_API_ENDPOINT=$(kubectl config view --minify | grep server | cut -f 2- -d":" | tr -d " ")
for i in `seq 1 10`; do curl -k $CAP_CLUSTER_API_ENDPOINT/anomaly; done
Amazon CloudWatch Logs Insights can be used to easily extract information from logs, identify patterns, and gain deeper insights into your applications and infrastructure. We can identify common patterns in the EKS control plane logs using Logs Insights, and the same filter pattern can be used to detect anomalies or suspicious events for which alarms can be created. For the unauthorized anonymous requests, we can find the pattern in Logs Insights.
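As an illustrative sketch, a Logs Insights query against the control plane log group that narrows to audit events and surfaces recurring message patterns might look like the following (the log stream prefix is an assumption based on how EKS names its audit streams):

```
fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code = 403
| pattern @message
```

The pattern command clusters similar messages, so repeated anonymous-access denials collapse into a single row whose filter expression can be reused in a metric filter.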
Figure 8: CloudWatch Log Insights
When the threshold for anonymous requests is breached, you'll get a notification in Slack from the alarm we created based on the filter pattern.
Figure 9: Anonymous request notifications on Slack
AWS Systems Manager Incident Manager engages the right responders promptly, tracks incident updates, and automates remediation actions. It reduces Mean Time To Recover (MTTR) through response plans, which define who responds, which mitigation actions run automatically, and which collaboration tools responders use for communication and notifications. See our blog posts on Creating contacts, escalation plans, and response plans in AWS Systems Manager Incident Manager and AWS Systems Manager Incident Manager integration with Amazon CloudWatch to take full advantage of the CloudWatch alarms generated here and trigger incident-specific response plans for streamlined, automated incident response.
Figure 10: CloudWatch alarm configurations for 400 errors
Figure 11: AWS Systems Manager action for CloudWatch alarms
Cleanup
Run the cleanup.sh script to clean up all resources deployed as part of this post:
bash ./cleanup.sh
Conclusion
This post demonstrated a solution that converts errors captured in application logs into metrics you can track to improve the reliability of your systems. Using CloudWatch, you can create metrics from log events and monitor applications without changing their code. You can use this technique to improve the reliability and observability of applications that don't generate metrics. We also showed how to streamline monitoring by sending alarm notifications to Slack, and how to monitor control plane logs so that errors are captured and reported promptly.
For more information, see the following references:
- AWS Observability Best Practices Guide
- One Observability Workshop
- Terraform AWS Observability Accelerator
- CDK AWS Observability Accelerator