AWS Cloud Operations & Migrations Blog

Category: Amazon EC2 Container Service

Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights

As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and to optimize infrastructure utilization. Without visibility […]

Distributed Tracing using AWS Distro for OpenTelemetry

More and more applications are being developed using serverless architectures with multiple microservices. Customers run their code on managed AWS services, including AWS Lambda as well as Amazon ECS and Amazon EKS on Amazon Elastic Compute Cloud (EC2) and AWS Fargate, alongside services such as Amazon API Gateway, Amazon SNS, Amazon SQS, Amazon DynamoDB, Amazon S3, and others. Developers use multiple […]