AWS Cloud Operations Blog
Category: Monitoring and observability
Monitoring Windows services with Amazon CloudWatch
If you run Windows workloads on Amazon Elastic Compute Cloud (Amazon EC2), monitoring the health and performance of your Windows Services is essential for reliable systems administration. It’s not just about ensuring uptime; it’s about having a pulse on your system’s health and performance. With a variety of services operating in the background, each playing […]
How Unitary achieved automatic metric collection with Amazon Managed Service for Prometheus collector
This post was co-authored with Nicolas Fournier, Platform Engineer at Unitary. Every day, over 80 years’ worth of video content is uploaded online. Some of this content can also be harmful. Unitary knows that human moderators are the current gold standard for moderation, but this manual approach does not scale. While automated systems can scale, […]
Multi-tenant monitoring across accounts and regions using Amazon Managed Service for Prometheus
In this guest blog post, Nauman Noor (Managing Director), Fabio Dias (Cloud Developer), and Dylan Alibay (Cloud Developer) from the platform engineering team at State Street discuss their use of Amazon Managed Prometheus and AWS Distro for OpenTelemetry to enable monitoring in a multi-tenant, multi-account, and multi-region environment. In the ever-evolving financial services landscape, State […]
Introducing Amazon CloudWatch Alarm Recommendations
Amazon CloudWatch is a foundational AWS service that provides you with actionable insights into your cloud resources and applications. With Amazon CloudWatch Metrics, you can gain better visibility into your infrastructure and large-scale application performance. You can set up alarms using Amazon CloudWatch Alarms for metrics emitted by AWS services or your applications. Identifying which metrics […]
How to monitor application health using SLOs with Amazon CloudWatch Application Signals
Today, customers operate tens, hundreds, or even thousands of applications arranged in complex distributed systems composed of many interdependent services. These applications need to be continuously available and performant to maintain end-user satisfaction and business growth. Amazon CloudWatch Application Signals (now in Preview) makes it easy to automatically instrument and operate applications on AWS to […]
Four APM features to elevate your observability experience
Application performance monitoring (or APM) is the practice of taking key application performance indicators to ensure system availability, improve system performance, and improve the end-user experience. This week we announced Amazon CloudWatch Application Signals, a new set of features built-in to Amazon CloudWatch to help you speed up troubleshooting, reduce application disruptions, and operational costs, […]
Monitoring GPU workloads on Amazon EKS using AWS managed open-source services
As machine learning (ML) workloads continue to grow in popularity, many customers are looking to run them on Kubernetes with graphics processing unit (GPU) support. Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast ML training and cost-effective ML inference. Monitoring GPU utilization gives valuable information for researchers working […]
Amazon Connect real-time monitoring using Amazon Managed Grafana and Amazon Timestream
Amazon Connect is an easy-to-use cloud contact center solution that helps companies of any size deliver superior customer service at a lower cost. Connect has many real-time monitoring capabilities. For requirements that go beyond those supported out of the box, Amazon Connect also provides you with data and APIs you can use to implement your […]
Lowering MTTR with Amazon CloudWatch and AWS X-Ray
Customers running microservice-based workloads in a serverless environment frequently have issues with troubleshooting incidents as the data they need can be distributed across hundreds or thousands of components. In this blog post, I will demonstrate how you can reduce the mean time to resolution (MTTR, or the average time it takes to repair or mitigate […]
Using Tag-Based Filtering to Manage AWS Health Monitoring and Alerting at Scale
AWS provides customers regular updates of service notifications and planned activities via e-mail to the root account owners or the operational, security and billing contacts. AWS also provides granular notifications to customers via AWS Health allowing them to fine-tune their alerts on issues relating directly to them. Alongside Health Dashboard’s monitoring capabilities, customers can also […]