Network observability for modern applications
In today’s highly distributed and cloud-based IT environments, network monitoring has become crucial for organizations to maintain the health, performance, and security of their applications and infrastructure. However, as modern application architectures evolve, with multiple layers of abstraction and cloud-native services, many teams look for better ways to collect and use the high-quality network data required to inform critical business insights and decision-making.
This post explores how you can design effective network monitoring in the cloud, using AWS services to collect, monitor, analyze, and act on network-level data that delivers valuable metrics and key performance indicators (KPIs). To help illustrate the concepts in this post, we describe an example scenario.
Sample scenario
Figure 1 shows an enterprise-scale organization with multiple applications that require guaranteed access to a central database and an on-premises mainframe, each managed by different teams across separate AWS accounts. A central networking team oversees the connectivity between these accounts.
The database and mainframe need to support a 99.9 percent service level agreement (SLA) to the application teams, with a strict 50 millisecond (ms) latency requirement. The sensitive data in these systems also requires comprehensive access tracking and logging for 1 year, as mandated by the InfoSec team.
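To make the availability target concrete, it can help to translate the percentage into a downtime budget. The following is a minimal sketch of that arithmetic; the function name and the 30-day and 365-day measurement windows are our own assumptions, because the scenario does not state the window over which the SLA is measured.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Return the most downtime a period can absorb while still meeting the SLA."""
    return period * (1 - sla_percent / 100)


# Assumed measurement windows; the scenario does not specify one.
monthly = downtime_budget(99.9, timedelta(days=30))
yearly = downtime_budget(99.9, timedelta(days=365))

print(f"30-day budget: {monthly.total_seconds() / 60:.1f} minutes")
print(f"365-day budget: {yearly.total_seconds() / 3600:.2f} hours")
```

Under these assumptions, a 99.9 percent target leaves roughly 43 minutes of downtime in a 30-day month, which is why fast detection and diagnosis matter so much in the sections that follow.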
With this scenario in place, let us explore the concepts that drive network observability and a model that is helpful in designing a solution. Later sections in this post illustrate how all the concepts and techniques can be applied.
The role of network observability
As cloud computing technologies continue to evolve, the role of observability has expanded beyond traditional network monitoring. Monitoring focuses on collecting and reporting data points about the health of a system, which is a passive process. Observability expands on this. According to the AWS observability guide, observability is the “capability to continuously generate and discover actionable insights based on signals from the system under observation.” The shift to actionable insights moves observability into an active part of any architecture. For these insights to be valuable, they need to provide information that meets the needs of an organization. It is a best practice for organizations to build any observability goal around a KPI, SLA, or key business metric.
In cloud-based applications, there are often many logical layers that separate the application or business need from the underlying network infrastructure. Network observability paints a full picture of the performance, security, and cost-efficiency of these complex, distributed systems.
Concretely, a robust observability solution benefits the following areas:
- Troubleshooting and resiliency: Well-designed observability allows for fast issue resolution and self-healing of applications.
- Performance tuning: Network metrics are valuable for understanding performance bottlenecks and optimizing workloads.
- Security and governance: Comprehensive network controls are often required to meet compliance and security requirements. These controls must be monitored.
- Cost management: Observing network data transfer costs is important for optimizing cloud spend.
It’s common for well-architected network observability designs to meet more than one of these goals. Crucially, you need to maintain a clear focus on the business purpose and desired outcomes when implementing any observability solution. This ensures that a solution will correctly prioritize how these points get addressed. The metrics and signals you prioritize should be directly tied to supporting your customers and delivering value, not just monitoring the infrastructure.
AWS design principles for network observability
Consider the following design principles when planning a network observability solution on AWS.
End-to-end visibility – Achieve comprehensive visibility into your cloud network to enable effective monitoring and troubleshooting. AWS provides services that capture network-level telemetry across your entire cloud environment, such as VPC Flow Logs and Amazon CloudWatch. These services give you a holistic view of your network’s health and performance.
Correlated insights – Network data is most valuable when analyzed in the context of broader system performance and business metrics. AWS makes it easy to correlate network telemetry with other observability data sources using services such as Amazon Managed Grafana and Amazon OpenSearch Service. As a result, your teams can quickly identify root causes, optimize resource utilization, and make data-driven decisions.
Seamless scalability – As your cloud environment grows, your network observability solutions must scale seamlessly. AWS Lambda and Amazon Kinesis provide serverless, event-driven capabilities that automatically scale to meet your increasing data processing and analytics demands, allowing you to focus on deriving insights from your network data.
Unified observability – Effective network monitoring in the cloud requires a holistic view that combines network-level data with application, security, and business intelligence. With AWS services such as OpsCenter, a capability of AWS Systems Manager; AWS X-Ray; and Amazon Athena, you can unify observability across your entire environment, helping your teams make data-driven decisions that optimize network operations and business outcomes.
Event-driven insight – In the fast-paced world of cloud computing, network issues and optimization opportunities require rapid response. Amazon EventBridge allows you to create rules that automatically trigger actions based on network-related events, enabling quick detection, investigation, and resolution of problems, as well as proactive optimization of your cloud environment.
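As an illustration of the event-driven principle, the sketch below shows an EventBridge event pattern that matches CloudWatch alarms entering the ALARM state. The small `matches` function is our own simplified stand-in for pattern evaluation so the example can run locally. It handles only exact-value lists and nested objects, not the full EventBridge pattern syntax, and the alarm name is hypothetical.

```python
import json

# Event pattern that matches CloudWatch alarms transitioning into ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: the event value must equal one of the listed values.
            return False
    return True


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "db-path-latency-high",  # hypothetical alarm name
        "state": {"value": "ALARM"},
    },
}

print(json.dumps(ALARM_PATTERN, indent=2))
print(matches(ALARM_PATTERN, sample_event))
```

In a real deployment, EventBridge evaluates the pattern for you. A rule built on a pattern like this one could route matching events to a notification or remediation target.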
The collect, monitor, analyze and act model
Figure 2 shows how the components of collect, monitor, and analyze and act fit together with many AWS services. When designing a solution, it is helpful to think about your design in these phases.
Collect
Collecting metrics and logs is the first phase of network observability. Metrics and logs are the raw data sources you use to build an observability system. Without reliable and robust data, any observability system is unlikely to succeed.
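To show what this raw data can look like, here is a minimal sketch that parses a VPC flow log record in the default (version 2) format into a typed structure. The `FlowRecord` class and `parse_flow_record` function are illustrative names of our own, and the sketch assumes the default field order. Flow logs created with a custom format need a different field list.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) VPC flow log record format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


@dataclass(frozen=True)
class FlowRecord:
    interface_id: str
    srcaddr: str
    dstaddr: str
    dstport: int
    protocol: int
    byte_count: int
    start: int  # epoch seconds
    end: int    # epoch seconds
    action: str


def parse_flow_record(line: str) -> Optional[FlowRecord]:
    """Parse one space-separated record; return None when it carries no traffic data."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(parts)}")
    raw = dict(zip(DEFAULT_FIELDS, parts))
    if raw["log_status"] != "OK":
        # NODATA and SKIPDATA records use "-" placeholders for traffic fields.
        return None
    return FlowRecord(
        interface_id=raw["interface_id"],
        srcaddr=raw["srcaddr"],
        dstaddr=raw["dstaddr"],
        dstport=int(raw["dstport"]),
        protocol=int(raw["protocol"]),
        byte_count=int(raw["bytes"]),
        start=int(raw["start"]),
        end=int(raw["end"]),
        action=raw["action"],
    )


SAMPLE = (
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
record = parse_flow_record(SAMPLE)
print(record)
```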
Monitor
The data observed through monitoring allows you to understand the current state of your network and application, identify performance bottlenecks or security vulnerabilities, and detect issues early before they impact your end users. CloudWatch dashboards at the figure’s center provide this monitoring and data aggregation capability. These dashboards are built from the data collected in the collect phase.
Effective monitoring is essential for network observability. The alarms and triggers you create during the monitoring phase feed into the next stage of the model. With these tools in place, you can identify and resolve problems in your network.
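The alarm logic described above can be sketched in a few lines. The example below imitates an "M out of N" evaluation, where the alarm fires when enough of the most recent datapoints breach a threshold. It is a simplified local model for illustration, not how CloudWatch is implemented. It ignores missing data, and the class name, window sizes, and sample values are assumptions; only the 50 ms threshold comes from the scenario.

```python
from collections import deque


class LatencyAlarm:
    """Reports ALARM when at least `datapoints_to_alarm` of the last
    `evaluation_periods` datapoints exceed the threshold."""

    def __init__(self, threshold_ms: float = 50.0,
                 evaluation_periods: int = 5,
                 datapoints_to_alarm: int = 3) -> None:
        if datapoints_to_alarm > evaluation_periods:
            raise ValueError("datapoints_to_alarm cannot exceed evaluation_periods")
        self.threshold_ms = threshold_ms
        self.datapoints_to_alarm = datapoints_to_alarm
        # Sliding window of booleans: True means the datapoint breached.
        self._window = deque(maxlen=evaluation_periods)

    def observe(self, latency_ms: float) -> str:
        """Record one datapoint and return the resulting alarm state."""
        self._window.append(latency_ms > self.threshold_ms)
        breaching = sum(self._window)
        return "ALARM" if breaching >= self.datapoints_to_alarm else "OK"


alarm = LatencyAlarm()
samples = [12.0, 48.0, 61.0, 55.0, 49.0, 72.0, 20.0, 18.0, 22.0, 19.0]
states = [alarm.observe(s) for s in samples]
print(states)
```

Requiring several breaching datapoints rather than one trades a little detection speed for fewer false alarms from a single slow measurement.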
Analyze and act
Analysis and diagnosis are where customers spend the most time during an operational event or root cause analysis, which makes this phase the largest contributor to extended downtime. Understanding the right things to focus on is critical but remains difficult for many customers.
As shown in the diagram, AWS provides multiple tools for the analysis phase that help you focus on the right information for diagnosis and reduce mean time to repair (MTTR). For example, features such as Network Access Analyzer and Reachability Analyzer can assist in determining the impact of changes on your workload before deploying to production.
When an issue is detected, focusing on the right metrics and logs as quickly as possible enables quicker response to failures. AWS services like CloudWatch can be used for detecting functionality problems.
Once you identify the cause of a failure, you need to act, which may involve a short-term fix or patch, a rollback, or an architectural change. It’s best practice to automate your deployments and changes as much as possible to test them upfront and reduce configuration errors.
Performing post-event analysis for shared learning, identifying design gaps, and determining how to prevent the failure from recurring is also a best practice. Your goal should be to ensure the same issue does not re-occur, and if it does, to identify and remediate it automatically.
By incorporating the key elements illustrated in the collect, monitor, and analyze and act model, you can establish a comprehensive approach to collecting and monitoring your network data. Then, you can analyze or act on any events, using AWS services to optimize visibility and reduce MTTR.
Bringing it all together
As described in the sample scenario at the beginning of this post, the key observability goals are:
- Provide end-to-end visibility to deliver near real-time data on network availability and latency to all teams.
- Use event-driven insights to quickly report when network performance drops below acceptable levels and trigger automated remediation.
- Collect network-level data that can be correlated with application logs to satisfy the InfoSec security requirements.
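The third goal, correlating network-level data with application logs, can be sketched as a join on client IP address and time window. The record shapes, field names, and the five-second slack below are illustrative assumptions. Real application logs and flow records carry more fields and would usually be joined in a query service rather than in memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Flow:
    srcaddr: str
    dstaddr: str
    start: int  # epoch seconds
    end: int    # epoch seconds
    action: str


@dataclass(frozen=True)
class AppLogEntry:
    timestamp: int  # epoch seconds
    client_ip: str
    user: str
    operation: str


def correlate(flows: List[Flow], app_logs: List[AppLogEntry],
              slack_seconds: int = 5) -> List[Tuple[AppLogEntry, Flow]]:
    """Pair each application log entry with flows from the same client IP
    whose capture window (plus some slack) contains the log timestamp."""
    by_src: Dict[str, List[Flow]] = {}
    for flow in flows:
        by_src.setdefault(flow.srcaddr, []).append(flow)

    pairs = []
    for entry in app_logs:
        for flow in by_src.get(entry.client_ip, []):
            if flow.start - slack_seconds <= entry.timestamp <= flow.end + slack_seconds:
                pairs.append((entry, flow))
    return pairs


# Hypothetical sample data for illustration only.
flows = [
    Flow("10.0.1.15", "10.0.9.20", 1700000000, 1700000060, "ACCEPT"),
    Flow("10.0.2.44", "10.0.9.20", 1700000000, 1700000060, "REJECT"),
]
app_logs = [
    AppLogEntry(1700000030, "10.0.1.15", "alice", "SELECT"),
    AppLogEntry(1700000500, "10.0.1.15", "alice", "UPDATE"),
]

pairs = correlate(flows, app_logs)
for entry, flow in pairs:
    print(entry.user, entry.operation, "->", flow.dstaddr, flow.action)
```

Log entries that pair with no flow, such as the second entry here, are worth surfacing as well, because they may point to gaps in collection.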
Collect
Figure 3 shows where collection takes place in the sample scenario. Implement Amazon Virtual Private Cloud (Amazon VPC) Flow Logs at key points to capture network traffic data, along with CloudWatch Logs and metrics on critical network components.
Monitor
In Figure 4, we use both reactive data sources (logs, metrics) and active monitoring tools (CloudWatch Synthetics, CloudWatch Internet Monitor, CloudWatch Network Monitor) to create dashboards, alerts, and events that track network performance against the defined SLAs and thresholds.
- VPC Flow Logs
  - Help you understand and track traffic to and from your VPC, a subnet, or a network interface. This data is also stored in Amazon CloudWatch for analysis at a later time. Review the Flow Log limitations to determine if they’ll work for your use case. If so, create an AWS Identity and Access Management (IAM) role for your flow log, and then create a flow log.
  - You can use this Systems Manager Automation documentation to create flow logs to CloudWatch Logs or Amazon Simple Storage Service (Amazon S3). Be sure to have the required input parameters. For more information, see Publish flow logs to CloudWatch Logs and Publish flow logs to Amazon S3.
  - Another option is to deliver VPC flow logs to Amazon Data Firehose. For more information, see Publish flow logs to Amazon Data Firehose.
- CloudWatch Logs and metrics
  - In the example, data is collected from CloudWatch in four places. The details on this are beyond the scope of this post. For more information on metrics available for each service, visit the following links.
    - Amazon CloudFront
    - Amazon Route 53
    - Elastic Load Balancing
    - AWS Transit Gateway
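Tracking network performance against the scenario's targets comes down to two calculations over probe results: availability and latency. The sketch below is one way to express them. The `Probe` shape and the sample data are assumptions for illustration, and so is the choice to judge latency at the 99th percentile, because the scenario states a 50 ms requirement without naming a statistic.

```python
import math
from typing import List, NamedTuple, Optional


class Probe(NamedTuple):
    succeeded: bool
    latency_ms: Optional[float]  # None when the probe failed


def availability_percent(probes: List[Probe]) -> float:
    """Share of probes that succeeded, as a percentage."""
    return 100 * sum(p.succeeded for p in probes) / len(probes)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_targets(probes: List[Probe], sla_percent: float = 99.9,
                  latency_limit_ms: float = 50.0, pct: float = 99.0) -> bool:
    """Check both the availability target and the latency limit."""
    latencies = [p.latency_ms for p in probes
                 if p.succeeded and p.latency_ms is not None]
    if not latencies:
        return False
    return (availability_percent(probes) >= sla_percent
            and percentile(latencies, pct) <= latency_limit_ms)


# Hypothetical probe results: 1,999 successes and 1 failure.
healthy = [Probe(True, 20.0)] * 1990 + [Probe(True, 45.0)] * 9 + [Probe(False, None)]
# Hypothetical degraded path: 5 percent of probes take 80 ms.
slow = [Probe(True, 20.0)] * 1900 + [Probe(True, 80.0)] * 100

print(meets_targets(healthy))
print(meets_targets(slow))
```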
Analyze and act
Analyze and act on the collected data and monitoring capabilities to meet organizational goals. Here are some examples:
- For information on providing real-time visibility to all teams on network availability and latency, see Monitor hybrid connectivity with Amazon CloudWatch Network Monitor.
- To trigger automated notifications and remediations when performance issues are detected, see Using Amazon CloudWatch with Amazon EventBridge for cross-account event monitoring.
- To archive network data for 1 year to support the InfoSec compliance requirements, see Publish flow logs to Amazon S3.
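For the archival requirement, one option is an S3 lifecycle configuration on the bucket that receives the flow logs. The sketch below builds such a configuration as a plain dictionary in the shape S3 expects. The rule ID, the 90-day archive transition, and expiring objects at exactly 365 days are our assumptions. Confirm the precise retention period with your InfoSec team, and note that the `AWSLogs/` prefix applies only if you publish flow logs without a custom prefix.

```python
import json

RETENTION_DAYS = 365  # assumed reading of the 1-year InfoSec requirement


def flow_log_lifecycle(prefix: str = "AWSLogs/",
                       archive_after_days: int = 90,
                       retention_days: int = RETENTION_DAYS) -> dict:
    """Build a lifecycle configuration that archives, then expires, flow logs."""
    if archive_after_days >= retention_days:
        raise ValueError("archive transition must happen before expiration")
    return {
        "Rules": [
            {
                "ID": "flow-logs-retention",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move older logs to a lower-cost archive storage class.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete objects once the retention period has passed.
                "Expiration": {"Days": retention_days},
            }
        ]
    }


config = flow_log_lifecycle()
print(json.dumps(config, indent=2))
```

A dictionary in this shape can be passed as the `LifecycleConfiguration` argument of the boto3 S3 client's `put_bucket_lifecycle_configuration` call, or expressed in your infrastructure as code tool of choice.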
By aligning the network observability solution to your specific use case, your organization can implement comprehensive visibility, rapid issue detection and resolution, and compliance with security mandates—all while optimizing the performance and reliability of your cloud-based applications.
Aligning to your use cases
When establishing your network observability strategy, it’s important to consider the specific use cases that will shape your implementation. The architecture of your network will impact the tools and methods you use to achieve your observability goals.
Observability for connectivity within your AWS environment
This use case focuses on observing connectivity within your AWS environment. Because you have a greater degree of administrative control over the entire network, you can place monitoring closer to the source and destination of traffic, enabling complete visibility. Key areas to monitor include:
- Host elastic network interfaces, VPC endpoints, and AWS PrivateLink.
- VPC-to-VPC connectivity services such as Transit Gateway and VPC peering.
- Transit between AWS Regions and global network observability with AWS Cloud WAN.
Observability for hybrid and internet connectivity
For hybrid cloud environments, you need to observe connectivity between your AWS resources and on-premises data centers and branch offices. Observing your internet-facing workloads is crucial for understanding end user experience. These use cases often include network paths you cannot directly monitor or manage. Active monitoring tools such as CloudWatch Network Monitor and CloudWatch Internet Monitor, described earlier in this post, help close this gap. Key areas to monitor include:
- AWS Direct Connect and AWS Site-to-Site VPN for hybrid connectivity
- SD-WAN solutions integrated with AWS Transit Gateway Connect
- User-to-application connectivity with services such as Amazon CloudWatch Internet Monitor
Observability for application networking
Newer application patterns like serverless or containerized architectures have created a new use case for your network observability strategy. Because much of this network traffic is being generated by short-lived devices, you must integrate your network monitoring with the control plane that manages these workloads. Key areas to monitor include:
- Endpoints for serverless offerings such as Amazon API Gateway, AWS Lambda, or AWS Fargate
- Insights that can be gathered from Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) control planes
- Services used to pass events between applications such as Amazon Simple Queue Service (Amazon SQS) or EventBridge
Observability for network security
Observing your network security posture is essential for protecting your workloads. The need to monitor network security is often an additional use case with its own requirements. It may be helpful to meet this need on a parallel path. Key areas to monitor include:
- Workload segmentation, ingress, egress, and east-west traffic using AWS Network Firewall, AWS WAF, and Gateway Load Balancer
- Security control points, including both AWS services and partner firewall appliances
Conclusion
Effective network observability is critical for cloud-based applications. Using AWS observability services, organizations can gain comprehensive visibility, correlate network-level data, and use scalable, unified observability. By aligning the observability strategy to use cases like connectivity, networking, and security, organizations can achieve faster troubleshooting, optimize performance, enhance security, and make data-driven decisions to support their evolving cloud architectures.
When you are ready to implement network observability in your environment and want to explore how your use case can be addressed, reach out to your AWS account team.