AWS Big Data Blog

Build a serverless data quality pipeline using Deequ on AWS Lambda

Poor data quality can lead to a variety of problems, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the source systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into data pipelines to assess the accuracy and reliability of the data. These checks send alerts if the data quality standards are not met, enabling data engineers and data stewards to take appropriate action. Examples of these checks include counting records, detecting duplicate data, and checking for null values.

To address these issues, Amazon built an open source framework called Deequ, which performs data quality checks at scale. In 2023, AWS launched AWS Glue Data Quality, which offers a complete solution to measure and monitor data quality. AWS Glue uses the power of Deequ to run data quality checks, identify bad records, provide a data quality score, and detect anomalies using machine learning (ML). However, you may have very small datasets and require faster startup times. In such instances, an effective solution is running Deequ on AWS Lambda.

In this post, we show how to run Deequ on Lambda. Using a sample application as a reference, we demonstrate how to build a data pipeline to check and improve the quality of data using AWS Step Functions. The pipeline uses PyDeequ, a Python API for Deequ that is built on top of Apache Spark, to perform data quality checks. We show how to implement data quality checks using the PyDeequ library, deploy an example that showcases how to run PyDeequ in Lambda, and discuss the considerations for running PyDeequ in Lambda.

To help you get started, we’ve set up a GitHub repository with a sample application that you can use to practice running and deploying the application.

Solution overview

In this use case, the data pipeline checks the quality of Airbnb accommodation data, which includes ratings, reviews, and prices, by neighborhood. Your objective is to perform a data quality check on the input file. If the data quality check passes, then you aggregate the price and reviews by neighborhood. If the data quality check fails, then you fail the pipeline and send a notification to the user. The pipeline is built using Step Functions and comprises three primary steps:

  • Data quality check – This step uses a Lambda function to verify the accuracy and reliability of the data. The Lambda function uses PyDeequ, a library for data quality checks. Because PyDeequ runs on Spark, the example employs the Spark on AWS Lambda (SoAL) framework, which makes it straightforward to run a standalone installation of Spark in Lambda. The Lambda function performs data quality checks and stores the results in an Amazon Simple Storage Service (Amazon S3) bucket.
  • Data aggregation – If the data quality check passes, the pipeline moves to the data aggregation step. This step performs some calculations on the data using a Lambda function that uses Polars, a DataFrames library (see the sketch after this list). The aggregated results are stored in Amazon S3 for further processing.
  • Notification – After the data quality check or data aggregation, the pipeline sends a notification to the user using Amazon Simple Notification Service (Amazon SNS). The notification includes a link to the data quality validation results or the aggregated data.
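
The following is a minimal sketch of what the aggregation step could look like with Polars, using the input columns shown later in this post (neighbourhood, price, number_of_reviews); the function name and aggregations are illustrative, and the logic in the sample application may differ.

import polars as pl

def aggregate_by_neighbourhood(input_path: str, output_path: str) -> None:
    # Read the accommodations CSV and aggregate price and reviews by neighbourhood
    df = pl.read_csv(input_path)
    aggregated = df.group_by("neighbourhood").agg(
        pl.col("price").mean().alias("avg_price"),
        pl.col("number_of_reviews").sum().alias("total_reviews"),
    )
    aggregated.write_csv(output_path)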

The following diagram illustrates the solution architecture.

Implement quality checks

The following is an example of data from the sample accommodations CSV file.

id | name | host_name | neighbourhood_group | neighbourhood | room_type | price | minimum_nights | number_of_reviews
7071 | BrightRoom with sunny greenview! | Bright | Pankow | Helmholtzplatz | Private room | 42 | 2 | 197
28268 | Cozy Berlin Friedrichshain for1/6 p | Elena | Friedrichshain-Kreuzberg | Frankfurter Allee Sued FK | Entire home/apt | 90 | 5 | 30
42742 | Spacious 35m2 in Central Apartment | Desiree | Friedrichshain-Kreuzberg | suedliche Luisenstadt | Private room | 36 | 1 | 25
57792 | Bungalow mit Garten in Berlin Zehlendorf | Jo | Steglitz - Zehlendorf | Ostpreußendamm | Entire home/apt | 49 | 2 | 3
81081 | Beautiful Prenzlauer Berg Apt | Bernd+Katja :-) | Pankow | Prenzlauer Berg Nord | Entire home/apt | 66 | 3 | 238
114763 | In the heart of Berlin! | Julia | Tempelhof - Schoeneberg | Schoeneberg-Sued | Entire home/apt | 130 | 3 | 53
153015 | Central Artist Appartement Prenzlauer Berg | Marc | Pankow | Helmholtzplatz | Private room | 52 | 3 | 127

In a semi-structured data format such as CSV, there are no inherent data validation and integrity checks. You need to verify the data against accuracy, completeness, consistency, uniqueness, timeliness, and validity, which are commonly referred to as the six data quality dimensions. For instance, if you want to display the name of the host for a particular property on a dashboard, but the host’s name is missing in the CSV file, this would be an issue of incomplete data. Completeness checks can include looking for missing records, missing attributes, or truncated data, among other things.

As part of the GitHub repository sample application, we provide a PyDeequ script that will perform the quality validation checks on the input file.
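
The snippets that follow are adapted from that script. They assume a SparkSession named spark, the input file loaded into a Spark DataFrame named dataset, and a PyDeequ Check object that groups the constraints, roughly the following setup (the check label is illustrative and matches the check name shown in the results later in this post):

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# Severity level and label for the group of constraints evaluated below
check = Check(spark, CheckLevel.Error, "Accomodations")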

The following code is an example of performing the completeness check from the validation script:

checkCompleteness = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.isComplete("host_name")) \
    .run()

The following is an example of checking for uniqueness of data:

checkUniqueness = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.isUnique("id")) \
    .run()

You can also chain multiple validation checks as follows:

checkResult = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(
        check.isComplete("name")
        .isUnique("id")
        .isComplete("host_name")
        .isComplete("neighbourhood")
        .isComplete("price")
        .isNonNegative("price")) \
    .run()

The following is an example of making sure 99% or more of the records in the file include host_name:

checkCompleteness = VerificationSuite(spark) \
    .onData(dataset) \
    .addCheck(check.hasCompleteness("host_name", lambda x: x >= 0.99)) \
    .run()
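
After run() completes, you can materialize both the per-constraint results and the computed metrics as DataFrames and persist them to Amazon S3. The following is a minimal sketch, assuming the checkResult from the chained example above and a placeholder bucket name:

from pydeequ.verification import VerificationResult

# Per-constraint pass/fail results (the verification-results output reviewed later in this post)
checkResultDataFrame = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResultDataFrame.write.mode("overwrite").csv("s3://<your-bucket>/OUTPUT/verification-results/", header=True)

# Computed data quality metrics (the verification-results-metrics output reviewed later in this post)
metricsDataFrame = VerificationResult.successMetricsAsDataFrame(spark, checkResult)
metricsDataFrame.write.mode("overwrite").csv("s3://<your-bucket>/OUTPUT/verification-results-metrics/", header=True)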

Prerequisites

Before you get started, make sure you complete the following prerequisites:

  1. You should have an AWS account.
  2. Install and configure the AWS Command Line Interface (AWS CLI).
  3. Install the AWS SAM CLI.
  4. Install Docker Community Edition.
  5. Install Python 3.

Run Deequ on Lambda

To deploy the sample application, complete the following steps:

  1. Clone the GitHub repository.
  2. Use the provided AWS CloudFormation template to create the Amazon Elastic Container Registry (Amazon ECR) image that will be used to run Deequ on Lambda.
  3. Use the AWS SAM CLI to build and deploy the rest of the data pipeline to your AWS account.

For detailed deployment steps, refer to the GitHub repository Readme.md.

When you deploy the sample application, you’ll find that the DataQuality function uses the container image packaging format. This is because the SoAL library required for this function is larger than the 250 MB limit for .zip archive packaging. During the AWS Serverless Application Model (AWS SAM) deployment process, a Step Functions workflow is also created, along with the data required to run the pipeline.

Run the workflow

After the application has been successfully deployed to your AWS account, complete the following steps to run the workflow:

  1. Go to the S3 bucket that was created earlier.

You will notice a new bucket whose name starts with your stack name.

  2. Follow the instructions in the GitHub repository to upload the Spark script to this S3 bucket. This script is used to perform data quality checks.
  3. Subscribe to the SNS topic created to receive success or failure email notifications as explained in the GitHub repository.
  4. Open the Step Functions console and run the workflow prefixed DataQualityUsingLambdaStateMachine with default inputs (or start it programmatically, as shown in the sketch after this list).
  5. You can test both success and failure scenarios as explained in the instructions in the GitHub repository.
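
If you prefer to start the workflow programmatically instead of using the console, the following is a minimal sketch with boto3; looking up the state machine by the DataQualityUsingLambdaStateMachine name prefix and passing an empty input are assumptions for illustration:

import boto3

sfn = boto3.client("stepfunctions")

# Find the state machine created by the AWS SAM deployment by its name prefix
state_machine_arn = next(
    sm["stateMachineArn"]
    for sm in sfn.list_state_machines()["stateMachines"]
    if sm["name"].startswith("DataQualityUsingLambdaStateMachine")
)

# Start an execution; replace the input with the default input documented in the GitHub repository if needed
response = sfn.start_execution(stateMachineArn=state_machine_arn, input="{}")
print(response["executionArn"])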

The following figure illustrates the workflow of the Step Functions state machine.

Review the quality check results and metrics

To review the quality check results, go to the same S3 bucket and navigate to the OUTPUT/verification-results folder. Open the file whose name starts with the prefix part-. The following table is a snapshot of the file.

check | check_level | check_status | constraint | constraint_status
Accomodations | Error | Success | SizeConstraint(Size(None)) | Success
Accomodations | Error | Success | CompletenessConstraint(Completeness(name,None)) | Success
Accomodations | Error | Success | UniquenessConstraint(Uniqueness(List(id),None)) | Success
Accomodations | Error | Success | CompletenessConstraint(Completeness(host_name,None)) | Success
Accomodations | Error | Success | CompletenessConstraint(Completeness(neighbourhood,None)) | Success
Accomodations | Error | Success | CompletenessConstraint(Completeness(price,None)) | Success

The check_status column indicates whether the overall quality check was successful or failed. The constraint column lists the different quality checks that were performed by the Deequ engine. The constraint_status column indicates the success or failure of each individual constraint.
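
If you want to derive an overall pass/fail signal programmatically (for example, to decide whether the workflow should branch to the failure notification), one approach is to filter the results DataFrame for failed constraints. The following is a minimal sketch, assuming the checkResultDataFrame from the earlier example:

failed_constraints = checkResultDataFrame.filter(
    checkResultDataFrame.constraint_status != "Success"
)
if failed_constraints.count() > 0:
    # Raising an error causes the Lambda invocation, and therefore the Step Functions
    # task, to fail, which routes the workflow to the failure notification
    raise Exception(f"{failed_constraints.count()} data quality constraint(s) failed")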

You can also review the quality check metrics generated by Deequ by navigating to the folder OUTPUT/verification-results-metrics. Open the file whose name starts with the prefix part-. The following table is a snapshot of the file.

entity | instance | name | value
Column | price is non-negative | Compliance | 1
Column | neighbourhood | Completeness | 1
Column | price | Completeness | 1
Column | id | Uniqueness | 1
Column | host_name | Completeness | 0.998831356
Column | name | Completeness | 0.997348076

For the columns with a value of 1, all the records of the input file satisfy the specific constraint. For the columns with a value below 1, such as 0.998831356 for host_name, only that fraction of the records (roughly 99.9% in this case) satisfies the specific constraint.
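
For reference, the Completeness metric is the fraction of non-null values in a column. A rough equivalent computed directly with Spark might look like the following (illustrative only):

total_records = dataset.count()
records_with_host_name = dataset.filter(dataset.host_name.isNotNull()).count()
completeness = records_with_host_name / total_records  # comparable to the Completeness value reported by Deequ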

Considerations for running PyDeequ in Lambda

Consider the following when deploying this solution:

  • Running SoAL on Lambda is a single-node deployment, but it is not limited to a single core; a node can have multiple cores in Lambda, which allows for distributed data processing. Adding more memory in Lambda proportionally increases the amount of CPU, increasing the overall computational power available. Multiple cores on a single node combined with the quick startup time of Lambda result in faster processing of Spark jobs. Additionally, the consolidation of cores within a single node enables faster shuffle operations, enhanced communication between cores, and improved I/O performance.
  • For Spark jobs that run longer than 15 minutes, process larger files (more than 1 GB), or involve complex joins that require more memory and compute resources, we recommend AWS Glue Data Quality. SoAL can also be deployed in Amazon ECS.
  • Choosing the right memory setting for Lambda functions can help balance speed and cost. You can automate the process of selecting different memory allocations and measuring the time taken using Lambda power tuning.
  • Workloads using multi-threading and multi-processing can benefit from Lambda functions powered by AWS Graviton processors, which offer better price-performance. You can use Lambda power tuning to run with both x86 and Arm architectures and compare the results to choose the optimal architecture for your workload.

Clean up

Complete the following steps to clean up the solution resources:

  1. On the Amazon S3 console, empty the contents of your S3 bucket.

Because this S3 bucket was created as part of the AWS SAM deployment, the next step will delete the S3 bucket.

  2. To delete the sample application that you created, use the AWS CLI. Assuming you used your project name for the stack name, you can run the following code:
sam delete --stack-name "<your stack name>"
  3. To delete the ECR image you created using CloudFormation, delete the stack from the AWS CloudFormation console.

For detailed instructions, refer to the GitHub repository Readme.md file.

Conclusion

Data is crucial for modern enterprises, influencing decision-making, demand forecasting, delivery scheduling, and overall business processes. Poor-quality data can negatively impact business decisions and the efficiency of the organization.

In this post, we demonstrated how to implement data quality checks and incorporate them in the data pipeline. In the process, we discussed how to use the PyDeequ library, how to deploy it in Lambda, and considerations when running it in Lambda.

You can refer to the Data quality prescriptive guidance to learn about best practices for implementing data quality checks, and to the Spark on AWS Lambda blog post to learn about running analytics workloads using AWS Lambda.


About the Authors

Vivek Mittal is a Solution Architect at Amazon Web Services. He is passionate about serverless and machine learning technologies. Vivek takes great joy in assisting customers with building innovative solutions on the AWS cloud platform.

John Cherian is a Senior Solutions Architect at Amazon Web Services who helps customers with strategy and architecture for building solutions on AWS.

Uma Ramadoss is a Principal Solutions Architect at Amazon Web Services, focused on serverless and integration services. She is responsible for helping customers design and operate event-driven cloud-native applications using services like Lambda, API Gateway, EventBridge, Step Functions, and SQS. Uma has hands-on experience leading enterprise-scale serverless delivery projects and possesses strong working knowledge of event-driven, microservice, and cloud architecture.