Reduce code duplication in load testing and synthetic monitoring using Amazon CloudWatch Synthetics

Load testing is an integral step in the quality assurance phase of the software development lifecycle, giving you confidence in the performance of your workload before you deploy it to production. Once that workload is in production, you monitor its health using synthetic monitoring. Load testing and synthetic monitoring usually exercise the same application flow with different load characteristics. You typically run load tests with services and solutions designed specifically for this purpose, which requires you to write the tests in programming languages, domain-specific languages, and frameworks specific to that tool. Synthetic monitoring requires you to write the same tests again, this time using constructs specific to the synthetic monitoring tool. The result is duplicated effort, as well as ongoing inefficiency, because both test suites must be maintained as the application evolves.

CloudWatch Synthetics lets you create and run synthetic monitoring of the availability and responsiveness of your applications and APIs. You can create a CloudWatch Synthetics ‘canary’, a configurable script that you can write in Node.js or Python, to monitor your production workload. AWS Step Functions is a serverless orchestration service that supports running large-scale parallel tasks using the distributed map state.

This post walks through a solution that creates a canary for synthetic monitoring of your production workload. It then shows a way to invoke this canary at scale, using the distributed map state, to simulate load on your workload.

Prerequisites

Amazon CloudWatch Synthetics
AWS Step Functions
AWS CloudFormation
An HTTP endpoint to load test – this could be an existing application that is reachable by CloudWatch Synthetics

Solution Overview

You can create a CloudWatch Synthetics canary by starting from one of the blueprints available in the CloudWatch console. When you create the canary, CloudWatch Synthetics creates an AWS Lambda function in the background that contains the boilerplate code needed to run your canary script. This code also performs additional tasks, such as recording metrics in CloudWatch and storing browser screenshots and HTTP Archive (HAR) files in an Amazon Simple Storage Service (Amazon S3) bucket. CloudWatch Synthetics periodically invokes this Lambda function to run the tests against your workload and records the findings.

You can also invoke this Lambda function outside the context of CloudWatch Synthetics, just like any other Lambda function. You can point the canary at the load test instance of your workload and invoke the Lambda function in parallel. By controlling the number of parallel invocations, you can generate as much traffic as your load test requires.
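To make the parallel invocation concrete, the following is a minimal Node.js sketch of a single round of load. It is not the Step Functions implementation this post deploys. The names runParallelRound and invokeCanary are hypothetical: invokeCanary stands in for whatever actually calls the canary's Lambda function, for example the InvokeCommand from the AWS SDK for JavaScript (@aws-sdk/client-lambda).

```javascript
// Hypothetical sketch of one round of load: start `concurrency` invocations
// at the same time and wait for all of them, successful or not.
// `invokeCanary` stands in for a real call to the canary's Lambda function.
async function runParallelRound(invokeCanary, concurrency) {
  // Create all promises up front so the invocations overlap in time
  const calls = Array.from({ length: concurrency }, (_, index) => invokeCanary(index));
  // allSettled never rejects; it reports each outcome individually
  const results = await Promise.allSettled(calls);
  const succeeded = results.filter((result) => result.status === 'fulfilled').length;
  return { succeeded, failed: results.length - succeeded };
}
```

Waiting on Promise.allSettled loosely mirrors how a map iteration waits for all of its invocations, including failed ones, before the next round of load starts.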

This solution uses an AWS Step Functions state machine to orchestrate the parallel invocations of the Lambda function. You start the state machine with the target number of concurrent requests, the ramp-up time, and the duration of the load test as input. The state machine invokes the Lambda function in parallel, gradually increasing the concurrency over the ramp-up time until it reaches the target concurrency. It then keeps invoking the Lambda function until the load test duration has elapsed. For more details on how the state machine is implemented, see the pattern described in Serverless Load Generator.
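One way to picture the ramp-up is as a schedule of concurrency values, one per map iteration. The sketch below assumes a linear ramp and a fixed, hypothetical iteration interval (iterationSeconds). The actual Serverless Load Generator state machine may calculate its rounds differently, so treat this as an illustration of the idea rather than its logic.

```javascript
// Hypothetical linear ramp: the concurrency to use for each map iteration
// during ramp-up, assuming one iteration starts every `iterationSeconds`.
function rampSchedule(targetConcurrency, rampUpMinutes, iterationSeconds) {
  // At least one step, so a zero-minute ramp jumps straight to the target
  const steps = Math.max(1, Math.floor((rampUpMinutes * 60) / iterationSeconds));
  const schedule = [];
  for (let step = 1; step <= steps; step++) {
    // Grow linearly toward the target, starting at no less than 1
    schedule.push(Math.max(1, Math.round((targetConcurrency * step) / steps)));
  }
  return schedule;
}
```

For example, rampSchedule(5, 1, 15) yields [1, 3, 4, 5]: four rounds over one minute, ending at the target concurrency of 5, after which the hold phase keeps running rounds at that level.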

A load generator state machine invokes the canary Lambda function in parallel. The invoked Lambda functions drive load to your workload. The Lambda function also stores metrics in CloudWatch Metrics and results in an S3 bucket.

Figure 1: Solution Architecture of load testing using the canary Lambda function

To implement this solution, complete the following steps:

Create a canary that invokes an HTTP endpoint, and modify it to gather only the metrics and data relevant to load testing
Create a Step Functions state machine that invokes the Lambda function in parallel to simulate the load test
Run the load test

Implementing the solution

Step 1: Create the canary

Follow these steps to create a canary that checks whether an HTTP endpoint loads correctly:

Navigate to the CloudWatch Synthetics page
Use the “Heartbeat Monitoring” blueprint
Enter health_check_canary as the “Name”
Enter the URL of the load test environment of your application as the “Application or endpoint URL”
Under the “Script editor” section, select syn-nodejs-puppeteer-8.0 as the runtime version
In the “Environment variables” section, add two variables:
1. Key: ENV Value: lt
2. Key: URL Value: <URL of the load test environment>
To reuse the canary code across different environments, make these changes to the code in the “Script editor”:
1. The production and load test environments have different URLs. Replace the hardcoded URL in the canary script:
```
const urls = [ 'https://…' ]
```
  to use the URL environment variable:
```
const urls = [ process.env.URL ];
```
2. During load testing, you typically don’t need a screenshot of the workload for every request, because the goal is to generate load and analyze the results in aggregate. Taking screenshots also adds to the canary’s execution time and can increase costs. Disable screenshots for load tests by replacing the line
```
const takeScreenshot = true;
```
  with
```
let takeScreenshot = true;
// Skip screenshots when the canary runs as part of a load test
if (process.env.ENV === 'lt') {
	takeScreenshot = false;
}
```
3. Similarly, a load test does not need HAR files or a per-step report for every run. To collect these details only when you are not running a load test, add the following to the script before the let page = await synthetics.getPage(); line:
```
// Skip HAR files and step reports when running as part of a load test
if (process.env.ENV === 'lt') {
	syntheticsConfiguration.withHarFile(false);
	syntheticsConfiguration.withStepsReport(false);
}
```
  After you make these changes, the script should look like Figure 2.


Figure 2: Suggested modifications to the canary code

Select “Create Canary”

After the canary is created, it checks the health of the application periodically. This simple canary checks whether the endpoint loads and returns a successful HTTP response. The canary uses the Puppeteer Node.js library to run this test, and you can also use Puppeteer to build canaries that perform complex website interactions. Alternatively, you can use Python with the Selenium WebDriver framework to build your canaries.

Step 2: Create the load testing stack

Follow these steps to create a Step Functions based load testing stack:

Navigate to the CloudFormation create stack page, select the “Upload a template file” option in the “Specify template” section, and use this template. Choose “Next”
Enter “LoadTestCloudWatchSynthetics” as the name of the stack
Enter the name of the Lambda function that CloudWatch Synthetics created in the “CanaryLambdaFunctionName” parameter. You can find the name on the Lambda functions page by searching for the function with the prefix cwsyn-health_check_canary
Enter the name of the S3 bucket and the S3 prefix that the canary uses to store its results. To find these, navigate to the canary details page and select the “Configuration” tab. Under the “Data Storage” section, you will find the full S3 path where the canary stores its data. The first part of the path is the S3 bucket, for example cw-syn-results-1234567890-us-east-1, and the rest is the prefix, for example canary/us-east-1/health_check_canary-a12-a1234aa1234a

Figure 3: Example parameters for the CloudFormation template

Choose “Next” until you reach the “Review” page
Check the “I acknowledge that AWS CloudFormation might create IAM resources.” box and choose “Submit”
Wait until the resources are created and the stack reaches the CREATE_COMPLETE state
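If you script the stack creation instead of using the console, a small helper can split the full S3 path into the two parameter values. The following splitS3Location function is hypothetical and assumes the path may or may not start with s3:// and may end with a slash.

```javascript
// Hypothetical helper: split the S3 location shown under "Data Storage"
// into the bucket name and the prefix used as CloudFormation parameters.
function splitS3Location(location) {
  // Drop an optional "s3://" scheme and any trailing slashes
  const path = location.replace(/^s3:\/\//, '').replace(/\/+$/, '');
  const slash = path.indexOf('/');
  if (slash === -1) {
    // Only a bucket name was given
    return { bucket: path, prefix: '' };
  }
  return { bucket: path.slice(0, slash), prefix: path.slice(slash + 1) };
}
```

With the example path from this step, the helper returns cw-syn-results-1234567890-us-east-1 as the bucket and canary/us-east-1/health_check_canary-a12-a1234aa1234a as the prefix.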

Step 3: Run the load test

Navigate to the Step Functions state machine page
Choose the state machine that the CloudFormation stack created. Its name has the prefix StateMachineCanaryLoadTester-
Start the test by selecting “Start Execution”
Paste the following into the dialog box that opens and select “Start Execution”
```
{
	"rampUpDuration": 1,
	"targetConcurrency": 5,
	"duration": 5
}
```
- rampUpDuration is the time, in minutes, over which the load test gradually ramps up before reaching targetConcurrency
- targetConcurrency is the number of concurrent users you want to simulate
- duration is the time, in minutes, that the test runs after reaching targetConcurrency
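If you start load tests from a script rather than the console, you may want to validate these three parameters before starting an execution. The following sketch uses a hypothetical buildExecutionInput helper, and its validation rules are assumptions rather than constraints enforced by the template. The resulting JSON string is what you would pass as the input of a Step Functions StartExecution call.

```javascript
// Hypothetical helper: validate the load test parameters and serialize them
// as execution input. Field names match this post; the rules are assumptions.
function buildExecutionInput({ rampUpDuration, targetConcurrency, duration }) {
  const fields = { rampUpDuration, targetConcurrency, duration };
  for (const [name, value] of Object.entries(fields)) {
    // Reject missing, non-numeric, or negative values
    if (!Number.isFinite(value) || value < 0) {
      throw new RangeError(`${name} must be a non-negative number`);
    }
  }
  if (!Number.isInteger(targetConcurrency) || targetConcurrency < 1) {
    throw new RangeError('targetConcurrency must be a positive integer');
  }
  return JSON.stringify(fields);
}
```

With the example values, the test runs for roughly rampUpDuration + duration, six minutes, plus the time the last map iteration takes to finish.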

Adjust these values to match your load requirements. You can monitor the progress of the load test on the Step Functions execution page, in the “Graph View” and “Events” sections.

Monitoring

The canary sends metrics to CloudWatch that you can view, monitor, and analyze. You can access these metrics in the “Monitoring” section of the canary details page. There you will find metrics for the duration and failures of the canary as a whole and of the individual steps it ran. The graphs are interactive, so you can select the specific timeframe you are interested in. The example in this post has a single step, and you can find its average duration in the “Canary steps duration” graph.


Figure 4: Metrics of duration and success count of the load test

Considerations for effective load testing

By default, Lambda limits concurrent executions to 1,000 per Region across all functions in an AWS account. You can calculate the concurrency your load test requires by multiplying the average number of requests per second by the average request duration in seconds. If you need higher concurrency, you can request a quota increase. You should also follow the Lambda best practices, where applicable, to optimize performance and cost.
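The concurrency calculation can be sketched as follows. requiredConcurrency and fitsQuota are hypothetical helpers, and the 1,000 default reflects the per-Region quota mentioned above, which is shared with every other function in the account; check your account's actual quota. For this solution, use the canary's average execution duration rather than the endpoint's response time, because each canary invocation likely also spends time starting a browser before it sends the request.

```javascript
// Concurrency needed = average requests (canary invocations) per second
// multiplied by average duration in seconds, rounded up.
function requiredConcurrency(requestsPerSecond, avgDurationSeconds) {
  return Math.ceil(requestsPerSecond * avgDurationSeconds);
}

// Compare against the account's concurrency quota (1,000 by default).
function fitsQuota(concurrency, accountQuota = 1000) {
  return concurrency <= accountQuota;
}
```

For example, 200 invocations per second at an average of 2.5 seconds needs 500 concurrent executions, which fits the default quota, while 600 per second at 2 seconds needs 1,200 and would require a quota increase.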

This solution modifies the canary code to use the load test environment’s endpoint and to skip collecting screenshots, HAR files, and step reports during the load test. However, you may still want to collect these for synthetic monitoring of the production endpoint. For the production canary, set the URL environment variable to the production URL and set the ENV variable to prod, so that the canary collects screenshots, HAR files, and other data for your production environment. You can follow the approach explained in Using environment variables with Amazon CloudWatch Synthetics to use the same canary to test multiple endpoints. You may also have separate AWS accounts for your load test and production environments. In that case, consider deploying a separate canary in each account with CloudFormation, and reuse the test script by passing it in the Code section.
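To keep the environment-specific choices in one place, you could derive them from the environment variables with a single helper. canarySettings is a hypothetical name; 'lt' matches the ENV value used in this post, and any other value, such as prod, leaves full data collection enabled.

```javascript
// Hypothetical helper: derive all environment-specific canary settings from
// the environment variables. Only ENV === 'lt' turns data collection off.
function canarySettings(env = process.env) {
  const isLoadTest = env.ENV === 'lt';
  return {
    url: env.URL,
    takeScreenshot: !isLoadTest,
    harFile: !isLoadTest,
    stepsReport: !isLoadTest,
  };
}
```

Inside the canary script, you could then use settings.takeScreenshot in place of the takeScreenshot variable, and pass settings.harFile and settings.stepsReport to syntheticsConfiguration.withHarFile() and syntheticsConfiguration.withStepsReport().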

If your workload requires secrets, such as a user name and password, to authenticate, your production and load testing environments will likely require different credentials. In that case, store the secrets in AWS Secrets Manager and use the ENV environment variable to load the credentials for the specific environment.
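The following sketch shows one way to select credentials by environment. The secret naming scheme health-check/ENV/credentials, the JSON shape with username and password fields, and the getSecretString function are all assumptions. In practice, getSecretString could wrap GetSecretValueCommand from the AWS SDK for JavaScript (@aws-sdk/client-secrets-manager) and return its SecretString, and the canary's IAM role would need permission to call secretsmanager:GetSecretValue on these secrets.

```javascript
// Hypothetical sketch: load credentials for the current environment.
// `getSecretString` is an injected async function (secretId) => string,
// and the secret naming scheme below is an assumption.
async function loadCredentials(getSecretString, env = process.env.ENV) {
  if (!env) {
    throw new Error('ENV environment variable is not set');
  }
  const secretId = `health-check/${env}/credentials`;
  // Assumes the secret is stored as JSON with username and password fields
  const secret = JSON.parse(await getSecretString(secretId));
  return { username: secret.username, password: secret.password };
}
```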

The targetConcurrency you use for the load generation state machine and the execution duration of your canary Lambda function both affect the overall cost of running the load test. You can monitor your spend on these services in AWS Cost Explorer.
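To get a rough feel for the Lambda compute portion of the cost before running a test, you can estimate usage in GB-seconds. The sketch below is an upper bound under stated assumptions: a linear ramp, so average concurrency during ramp-up is half the target, and every concurrent slot stays busy for the whole test. estimateGbSeconds is a hypothetical helper; multiply its result by the per-GB-second price for your Region from the Lambda pricing page, and remember that Lambda requests, Step Functions, CloudWatch, and S3 are billed separately.

```javascript
// Rough upper bound on Lambda compute for one load test, in GB-seconds.
// Assumes a linear ramp (average concurrency of half the target) and that
// every concurrent slot is busy for the entire ramp-up and hold phases.
function estimateGbSeconds({ targetConcurrency, rampUpDuration, duration, memoryMb }) {
  const memoryGb = memoryMb / 1024;
  const rampComputeSeconds = rampUpDuration * 60 * (targetConcurrency / 2);
  const holdComputeSeconds = duration * 60 * targetConcurrency;
  return (rampComputeSeconds + holdComputeSeconds) * memoryGb;
}
```

With the example input from Step 3 and an assumed 1,024 MB canary function, the estimate is 1,650 GB-seconds. Actual usage is lower when the map state waits on slow invocations.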

Troubleshooting

The state machine runs the distributed map state and the timer function state in sequence. It therefore waits for each iteration of the distributed map state to finish, then checks whether the ramp-up or hold time has elapsed, and only then triggers the next round of load. If a few canaries take much longer than the others in a map iteration, you might see the load on your workload drop for a while before picking up again. This is usually a good indication that some condition in your workload is causing these stragglers, and it is worth investigating. If this behavior is expected, you can smooth out the load by starting multiple executions of the state machine spaced a few seconds apart.
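The effect of stragglers can be quantified. Because each map iteration lasts as long as its slowest invocation, the average number of requests in flight during that iteration is the total busy time divided by the iteration length. The helper below is a hypothetical illustration of that arithmetic, not something the state machine computes.

```javascript
// Average number of in-flight requests during one map iteration, given the
// duration of each invocation in that iteration. The iteration ends only
// when the slowest invocation finishes.
function effectiveConcurrency(durationsSeconds) {
  const iterationSeconds = Math.max(...durationsSeconds);
  const busySeconds = durationsSeconds.reduce((sum, d) => sum + d, 0);
  return busySeconds / iterationSeconds;
}
```

For example, five invocations that take 2, 2, 2, 2, and 8 seconds keep only 2 requests in flight on average, even though the target concurrency is 5.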

The CloudFormation template also sets a maximum execution time of 30 seconds for each canary invocation in the state machine. If your canary takes longer than 30 seconds, the map state marks that invocation as failed and moves on. If your workload needs more than 30 seconds, or you do not want the state machine to wait a full 30 seconds, change "TimeoutSeconds": 30 to an appropriate value in seconds. This setting occurs twice in the CloudFormation template, once for the ramp-up phase and once for the hold phase, so make sure you modify both occurrences.
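If you edit the template with a script, a helper like the following hypothetical setMapTimeout can change every occurrence at once and report how many it changed. It assumes the template contains the setting exactly as quoted above, "TimeoutSeconds": 30. If the count is not 2, inspect the template manually rather than trusting the replacement.

```javascript
// Hypothetical helper: replace every `"TimeoutSeconds": 30` in the template
// text with a new value and count the replacements. The lookahead keeps
// values such as 300 from matching.
function setMapTimeout(templateText, newSeconds) {
  let count = 0;
  const updated = templateText.replace(/"TimeoutSeconds":\s*30(?!\d)/g, () => {
    count += 1;
    return `"TimeoutSeconds": ${newSeconds}`;
  });
  return { updated, count };
}
```

For example, you could read the template with fs.readFileSync, call setMapTimeout(text, 60), confirm that count is 2, and write updated back with fs.writeFileSync.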

Clean up

To avoid incurring ongoing charges, complete the following cleanup steps:

To delete the CloudFormation stack, navigate to the stacks page. Select the “LoadTestCloudWatchSynthetics” stack, and choose the “Delete” button.
To delete the artifacts that the load test generated, navigate to the S3 bucket page and find the bucket matching the pattern cw-syn-results-<account id>-<region>. CloudWatch Synthetics stores the results of all the canaries it runs in this bucket, so delete only the artifacts created by the canary you created in this post. The exact prefix to delete is the one you found in step 4 of “Step 2: Create the load testing stack”. Select the checkbox corresponding to this prefix and choose “Delete”. On the confirmation page, confirm the deletion and choose “Delete objects”.
To delete the canary, navigate to the Synthetics Canaries page. Select “health_check_canary”, choose “Actions”, and then choose “Delete”. On the confirmation page, also select the canary role and policy, and then choose “Delete”.

Conclusion

This post shows how you can use CloudWatch Synthetics to run both synthetic monitoring and load testing of an HTTP endpoint with a single canary. It shows how an AWS Step Functions state machine can gradually ramp up the load and hold it at the concurrency and for the duration you choose. It also covers load testing practices such as load generation, metrics collection, and analysis of the results of a load test run with the state machine and the Synthetics canary. You can extend this solution to test the more complex interactions that your workload requires.

For more information, see the following resources:

AWS Cloud Operations Blog