AWS Big Data Blog
Clean up your Excel and CSV files without writing code using AWS Glue DataBrew
Managing data within an organization is complex. Handling data from outside the organization adds even more complexity. As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. In this post, we explore a solution that streamlines this process using AWS Glue DataBrew.
DataBrew is an excellent tool for data quality and preprocessing. You can use its built-in transformations and recipes, along with its integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3), to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing.
In this post, we demonstrate the following:
- Extracting non-transactional metadata from the top rows of a file and merging it with transactional data
- Combining multi-line rows into single-line rows
- Extracting unique identifiers from within strings or text
Solution overview
For this use case, imagine you’re a data analyst working at your organization. The sales leadership has requested a consolidated view of the net sales the organization makes from each of its suppliers. Unfortunately, this information is not available in a database. The sales data comes from each supplier in layouts like the following example.
However, with hundreds of resellers, manually extracting the information at the top is not feasible. Your goal is to clean up and flatten the data into the following output layout.
To achieve this, you can use pre-built transformations in DataBrew to quickly get the data in the layout you want.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- An AWS Identity and Access Management (IAM) role with permissions for Amazon S3 and DataBrew. For more information, refer to Setting up AWS Identity and Access Management (IAM) permissions.
Connect to the dataset
The first thing we need to do is upload the input dataset to Amazon S3. Create an S3 bucket for the project and create a folder to upload the raw input data. The output data will be stored in another folder in a later step.
Next, we need to connect DataBrew to our CSV file. We do this by creating a dataset, which is an artifact that points to the data source we will be using. Navigate to Datasets in the navigation pane.
Ensure the Column header values field is set to Add default header. The input CSV has an irregular format, so the first row will not have the needed column values.
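The console is all you need for this walkthrough, but if you prefer to script the setup, the same dataset can be registered with the AWS SDK. The following is a minimal boto3 sketch; the bucket name and object key are placeholders for your own values.

```python
import boto3

databrew = boto3.client("databrew")

# Register the raw CSV in S3 as a DataBrew dataset. HeaderRow=False
# mirrors the "Add default header" setting, because the first rows of
# the file hold metadata rather than column names.
databrew.create_dataset(
    Name="FoodMartSales-AllUp",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": False}},
    Input={
        "S3InputDefinition": {
            "Bucket": "your-project-bucket",       # placeholder
            "Key": "raw/FoodMartSales-AllUp.csv",  # placeholder
        }
    },
)
```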
Create a project
To create a new project, complete the following steps:
- On the DataBrew console, choose Projects in the navigation pane.
- Choose Create project.
- For Project name, enter FoodMartSales-AllUpProject.
- For Attached recipe, choose Create new recipe.
- For Recipe name, enter FoodMartSales-AllUpProject-recipe.
- For Select a dataset, select My datasets.
- Select the FoodMartSales-AllUp dataset.
- Under Permissions, for Role name, choose the IAM role you created as a prerequisite or create a new role.
- Choose Create project.
After the project is opened, an interactive session is created where you can author transformations on a sample of the data.
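As with the dataset, project creation can be scripted. Here is a minimal boto3 sketch, assuming the recipe named in the console steps already exists; the role ARN is a placeholder for the IAM role from the prerequisites.

```python
import boto3

databrew = boto3.client("databrew")

# Attach the dataset and recipe to a new project. The RoleArn below is
# a placeholder; use the IAM role you set up as a prerequisite.
databrew.create_project(
    Name="FoodMartSales-AllUpProject",
    DatasetName="FoodMartSales-AllUp",
    RecipeName="FoodMartSales-AllUpProject-recipe",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewProjectRole",
)
```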
Extract non-transactional metadata from within the contents of the file and merge it with transactional data
In this section, we consider data that has metadata on the first few rows of the file, followed by transactional data. We walk through how to extract data relevant to the whole file from the top of the document and combine it with the transactional data into one flat table.
Extract metadata from the header and remove invalid rows
Complete the following steps to extract metadata from the header:
- Choose Conditions and then choose IF.
- For Matching conditions, choose Match all conditions.
- For Source, choose Value of and Column_1.
- For Logical condition, choose Is exactly.
- For Enter a value, choose Enter custom value and enter RESELLER NAME.
- For Flag result value as, choose Custom value.
- For Value if true, choose Select source column and set Value of to Column_2.
- For Value if false, choose Enter custom value and enter INVALID.
- Choose Apply.
Your dataset should now look like the following screenshot, with the Reseller Name value extracted to a column by itself.
Next, you remove invalid rows and fill the rows with the Reseller Name value.
- Choose Clean and then choose Custom values.
- For Source column, choose ResellerName.
- For Specify values to remove, choose Custom value.
- For Values to remove, enter INVALID.
- For Apply transform to, choose All rows.
- Choose Apply.
- Choose Missing and then choose Fill with most frequent value.
- For Source column, choose ResellerName.
- For Missing value action, choose Fill with most frequent value.
- For Apply transform to, choose All rows.
- Choose Apply.
Your dataset should now look like the following screenshot, with the Reseller Name value filled in for every row.
Repeat the same steps in this section for the rest of the metadata, including Reseller Email Address, Reseller ID, and First Transaction Date.
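If it helps to see the logic outside the console, the following is a hypothetical pandas equivalent of the condition and fill-down applied above. It assumes the raw file has seven unnamed columns, as in the sample layout; it is a sketch of the idea, not part of the DataBrew flow.

```python
import pandas as pd

# Load the raw file with default headers, as the DataBrew dataset does.
df = pd.read_csv("FoodMartSales-AllUp.csv", header=None,
                 names=[f"Column_{i}" for i in range(1, 8)])

# IF Column_1 is exactly "RESELLER NAME", take Column_2; otherwise flag INVALID.
df["ResellerName"] = df["Column_2"].where(
    df["Column_1"] == "RESELLER NAME", "INVALID")

# Remove the INVALID placeholder, then fill every row with the most
# frequent (here, the only) value, mirroring the two cleanup steps.
df["ResellerName"] = df["ResellerName"].replace("INVALID", pd.NA)
df["ResellerName"] = df["ResellerName"].fillna(df["ResellerName"].mode().iloc[0])
```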
Promote column headers and clean up data
To promote column headers, complete the following steps:
- Reorder the columns to put the metadata columns to the left of the dataset by choosing Column, Move column, and Start of the table.
- Rename the columns with the appropriate names.
Now you can clean up some columns and rows.
- Delete unnecessary columns, such as Column_7.
You can also delete invalid rows by filtering out records that don’t have a transaction date value.
- Choose the ABC icon on the menu of the Transaction_Date column and choose date.
- For Handle invalid values, select Delete rows, then choose Apply.
The dataset should now have the metadata extracted and the column headers promoted.
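Continuing the earlier pandas sketch, the date conversion and invalid-row deletion map to a common two-line idiom; this is illustrative only, with df holding the flattened table after renaming.

```python
import pandas as pd

# Coerce the column to datetime; values that fail to parse become NaT.
df["Transaction_Date"] = pd.to_datetime(df["Transaction_Date"], errors="coerce")

# Dropping NaT rows mirrors "Handle invalid values: Delete rows".
df = df.dropna(subset=["Transaction_Date"])
```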
Combine multi-line rows into single-line rows
The next issue to address is transaction records that are split across multiple lines. In the following steps, we extract the needed data from those rows and merge them into single-line transactions. In this example specifically, the Reseller Margin data is split across two lines.
Complete the following steps to get the Reseller Margin value on the same line as the corresponding transaction. First, we identify the Reseller Margin rows and store them in a temporary column.
- Choose Conditions and then choose IF.
- For Matching conditions, choose Match all conditions.
- For Source, choose Value of and Transaction_ID.
- For Logical condition, choose Contains.
- For Enter a value, choose Enter custom value and enter Reseller Margin.
- For Flag result value as, choose Custom value.
- For Value if true, choose Select source column and set Value of to TransactionAmount.
- For Value if false, choose Enter custom value and enter Invalid.
- For Destination column, choose ResellerMargin_Temp.
- Choose Apply.
Next, you shift the Reseller Margin value up one row.
- Choose Functions and then choose NEXT.
- For Source column, choose ResellerMargin_Temp.
- For Number of rows, enter 1.
- For Destination column, choose ResellerMargin.
- For Apply transform to, choose All rows.
- Choose Apply.
Next, delete the invalid rows.
- Choose Missing and then choose Remove missing rows.
- For Source column, choose TransactionDate.
- For Missing value action, choose Delete rows with missing values.
- For Apply transform to, choose All rows.
- Choose Apply.
Your dataset should now look like the following screenshot, with the Reseller Margin value extracted to a column by itself.
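Under the hood, the NEXT function behaves like a negative shift: each row reads a value from the row below it. A hypothetical pandas sketch of the whole merge, continuing from the earlier sketches and using the column names from this section:

```python
# Flag the Reseller Margin rows and capture their amount in a temp column.
is_margin = df["Transaction_ID"].str.contains("Reseller Margin", na=False)
df["ResellerMargin_Temp"] = df["TransactionAmount"].where(is_margin)

# NEXT with 1 row: pull the value up from the row below.
df["ResellerMargin"] = df["ResellerMargin_Temp"].shift(-1)

# The margin-only rows have no transaction date; drop them and the temp column.
df = df.dropna(subset=["TransactionDate"]).drop(columns=["ResellerMargin_Temp"])
```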
With the data structured properly, we can move on to mining the cleaned data.
Extract unique identifiers from within strings and text
Many types of data contain important information stored as unstructured text in a cell. In this section, we look at how to extract this data. Within the sample dataset, the BankTransferText column holds valuable information about our resellers’ registered bank accounts and the currency of the transaction, namely the IBAN, SWIFT code, and currency.
Complete the following steps to extract IBAN, SWIFT code, and Currency into separate columns. First, you extract the IBAN number from the text using a regular expression (regex).
- Choose Extract and then choose Custom value or pattern.
- For Create column options, choose Extract values.
- For Source column, choose BankTransferText.
- For Extract options, choose Custom value or pattern.
- For Values to extract, enter [a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}.
- For Destination column, choose IBAN.
- For Apply transform to, choose All rows.
- Choose Apply.
- Extract the SWIFT code from the text using a regex, following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+).
Next, remove the SWIFT Code: label from the extracted text.
- Choose Remove and then choose Custom values.
- For Source column, choose SWIFT Code.
- For Specify values to remove, choose Custom value.
- For Values to remove, enter SWIFT Code:.
- For Apply transform to, choose All rows.
- Choose Apply.
- Extract the currency from the text using a regex, following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(Currency: )([A-Z]{3}).
- Remove the Currency: label from the extracted text, following the same steps used to remove the SWIFT Code: label.
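These patterns are plain regular expressions, so you can test them outside DataBrew before adding them to the recipe. A quick Python check against a made-up sample cell value (the IBAN is a standard documentation example):

```python
import re

# Made-up sample cell value for testing the three patterns.
text = "Paid via IBAN: GB29NWBK60161331926819 SWIFT Code: NWBKGB2L Currency: GBP"

iban = re.search(r"[a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}", text)
swift = re.search(r"(?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+)", text)
currency = re.search(r"(?!^)(Currency: )([A-Z]{3})", text)

print(iban.group(0))      # GB29NWBK60161331926819
print(swift.group(2))     # NWBKGB2L (group 2 excludes the label)
print(currency.group(2))  # GBP
```

Note that DataBrew extracts the whole match, label included, which is why the recipe needs the follow-up steps that remove the SWIFT Code: and Currency: labels.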
You can clean up by deleting any unnecessary columns.
- Choose Column and then choose Delete.
- For Source columns, choose BankTransferText.
- Choose Apply.
- Repeat for any remaining columns.
Your dataset should now look like the following screenshot, with IBAN, SWIFT Code, and Currency extracted to separate columns.
Write the transformed data to Amazon S3
With all the steps captured in the recipe, the last step is to write the transformed data to Amazon S3.
- In the project view, choose Create job.
- For Job name, enter FoodMartSalesToDataLake.
- For Output to, choose Amazon S3.
- For File type, choose CSV.
- For Delimiter, choose Comma (,).
- For Compression, choose None.
- For S3 bucket owner’s account, select Current AWS account.
- For S3 location, enter s3://{name of S3 bucket}/clean/.
- For Role name, choose the IAM role created as a prerequisite or create a new role.
- Choose Create and run job.
- Go to the Jobs tab and wait for the job to complete.
- Navigate to the job output folder on the Amazon S3 console.
- Download the CSV file and view the transformed output.
Your dataset should look similar to the following screenshot.
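This step can also be automated. The following is a minimal boto3 sketch that creates and starts the same job, with placeholder bucket and role values:

```python
import boto3

databrew = boto3.client("databrew")

# Create a job that applies the project's recipe and writes CSV output to
# the clean/ prefix. Omitting CompressionFormat leaves output uncompressed.
databrew.create_recipe_job(
    Name="FoodMartSalesToDataLake",
    ProjectName="FoodMartSales-AllUpProject",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewProjectRole",  # placeholder
    Outputs=[{
        "Location": {"Bucket": "your-project-bucket", "Key": "clean/"},  # placeholder
        "Format": "CSV",
        "FormatOptions": {"Csv": {"Delimiter": ","}},
    }],
)

# Start a run and capture its ID for tracking on the Jobs page.
run = databrew.start_job_run(Name="FoodMartSalesToDataLake")
print(run["RunId"])
```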
Clean up
To optimize cost, make sure to clean up the resources deployed for this project by completing the following steps:
- Delete every DataBrew project along with their linked recipes.
- Delete all the DataBrew datasets.
- Delete the contents in your S3 bucket.
- Delete the S3 bucket.
Conclusion
The reality of exchanging data with suppliers is that we can’t always control the shape of the input data. With DataBrew, we can use a list of pre-built transformations and repeatable steps to transform incoming data into a desired layout and extract relevant data and insights from Excel or CSV files. Start using DataBrew today and transform third-party files into structured datasets ready for consumption by your business.
About the Author
Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily works with direct-to-consumer platform companies in the ecommerce, FinTech, PropTech, and HealthTech space to achieve their business objectives with well-architected data platforms.