AWS Big Data Blog
Clean up your Excel and CSV files without writing code using AWS Glue DataBrew
Managing data within an organization is complex. Handling data from outside the organization adds even more complexity. As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. In this post, we explore a solution that streamlines this process using AWS Glue DataBrew.
DataBrew is an excellent tool for data quality and preprocessing. You can use its built-in transformations and recipes, along with its integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3), to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing.
In this post, we demonstrate the following:
- Extracting non-transactional metadata from the top rows of a file and merging it with transactional data
- Combining multi-line rows into single-line rows
- Extracting unique identifiers from within strings or text
Solution overview
For this use case, imagine you’re a data analyst working at your organization. The sales leadership has requested a consolidated view of the net sales the organization makes from each of its suppliers. Unfortunately, this information is not available in a database. The sales data comes from each supplier in layouts like the following example.
However, with hundreds of resellers, manually extracting the information at the top is not feasible. Your goal is to clean up and flatten the data into the following output layout.
To achieve this, you can use pre-built transformations in DataBrew to quickly get the data in the layout you want.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- An AWS Identity and Access Management (IAM) role with permissions for Amazon S3 and DataBrew. For more information, refer to Setting up AWS Identity and Access Management (IAM) permissions.
Connect to the dataset
The first thing we need to do is upload the input dataset to Amazon S3. Create an S3 bucket for the project and create a folder to upload the raw input data. The output data will be stored in another folder in a later step.
Next, we need to connect DataBrew to our CSV file. We do this by creating a dataset, which is an artifact that points to the data source we will be using. Navigate to Datasets in the navigation pane.
Ensure the Column header values field is set to Add default header. The input CSV has an irregular format, so the first row will not have the needed column values.
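The console is all you need for this walkthrough, but if you prefer to script the setup, the same dataset can be registered with the AWS SDK. The following is a minimal boto3 sketch; the bucket name and object key are placeholders for your own values.

```python
import boto3

databrew = boto3.client("databrew")

# Register the raw CSV in S3 as a DataBrew dataset. HeaderRow=False
# mirrors the "Add default header" setting, because the first rows of
# the file hold metadata rather than column names.
databrew.create_dataset(
    Name="FoodMartSales-AllUp",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": False}},
    Input={
        "S3InputDefinition": {
            "Bucket": "your-project-bucket",       # placeholder
            "Key": "raw/FoodMartSales-AllUp.csv",  # placeholder
        }
    },
)
```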
Create a project
To create a new project, complete the following steps:
- On the DataBrew console, choose Projects in the navigation pane.
- Choose Create project.
- For Project name, enter FoodMartSales-AllUpProject.
- For Attached recipe, choose Create new recipe.
- For Recipe name, enter FoodMartSales-AllUpProject-recipe.
- For Select a dataset, select My datasets.
- Select the FoodMartSales-AllUp dataset.
- Under Permissions, for Role name, choose the IAM role you created as a prerequisite or create a new role.
- Choose Create project.
After the project is opened, an interactive session is created where you can author transformations on a sample of the data.
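As with the dataset, project creation can be scripted. Here is a minimal boto3 sketch, assuming the recipe named in the console steps already exists; the role ARN is a placeholder for the IAM role from the prerequisites.

```python
import boto3

databrew = boto3.client("databrew")

# Attach the dataset and recipe to a new project. The RoleArn below is
# a placeholder; use the IAM role you set up as a prerequisite.
databrew.create_project(
    Name="FoodMartSales-AllUpProject",
    DatasetName="FoodMartSales-AllUp",
    RecipeName="FoodMartSales-AllUpProject-recipe",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewProjectRole",
)
```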
Extract non-transactional metadata from within the contents of the file and merge it with transactional data
In this section, we consider data that has metadata on the first few rows of the file, followed by transactional data. We walk through how to extract data relevant to the whole file from the top of the document and combine it with the transactional data into one flat table.
Extract metadata from the header and remove invalid rows
Complete the following steps to extract metadata from the header:
- Choose Conditions and then choose IF.
- For Matching conditions, choose Match all conditions.
- For Source, choose Value of and Column_1.
- For Logical condition, choose Is exactly.
- For Enter a value, choose Enter custom value and enter RESELLER NAME.
- For Flag result value as, choose Custom value.
- For Value if true, choose Select source column and set Value of to Column_2.
- For Value if false, choose Enter custom value and enter INVALID.
- Choose Apply.
Your dataset should now look like the following screenshot, with the Reseller Name value extracted to a column by itself.
Next, you remove invalid rows and fill the rows with the Reseller Name value.
- Choose Clean and then choose Custom values.
- For Source column, choose ResellerName.
- For Specify values to remove, choose Custom value.
- For Values to remove, enter INVALID.
- For Apply transform to, choose All rows.
- Choose Apply.
- Choose Missing and then choose Fill with most frequent value.
- For Source column, choose ResellerName.
- For Missing value action, choose Fill with most frequent value.
- For Apply transform to, choose All rows.
- Choose Apply.
Your dataset should now look like the following screenshot, with the Reseller Name value filled in for every row.
Repeat the same steps in this section for the rest of the metadata, including Reseller Email Address, Reseller ID, and First Transaction Date.
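If it helps to see the logic outside the console, the following is a hypothetical pandas equivalent of the condition and fill-down applied above. It assumes the raw file has seven unnamed columns, as in the sample layout; it is a sketch of the idea, not part of the DataBrew flow.

```python
import pandas as pd

# Load the raw file with default headers, as the DataBrew dataset does.
df = pd.read_csv("FoodMartSales-AllUp.csv", header=None,
                 names=[f"Column_{i}" for i in range(1, 8)])

# IF Column_1 is exactly "RESELLER NAME", take Column_2; otherwise flag INVALID.
df["ResellerName"] = df["Column_2"].where(
    df["Column_1"] == "RESELLER NAME", "INVALID")

# Remove the INVALID placeholder, then fill every row with the most
# frequent (here, the only) value, mirroring the two cleanup steps.
df["ResellerName"] = df["ResellerName"].replace("INVALID", pd.NA)
df["ResellerName"] = df["ResellerName"].fillna(df["ResellerName"].mode().iloc[0])
```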
Promote column headers and clean up data
To promote column headers, complete the following steps:
- Reorder the columns to put the metadata columns to the left of the dataset by choosing Column, Move column, and Start of the table.
- Rename the columns with the appropriate names.
Now you can clean up some columns and rows.
- Delete unnecessary columns, such as Column_7.
You can also delete invalid rows by filtering out records that don’t have a transaction date value.
- Choose the ABC icon on the menu of the Transaction_Date column and choose date.
- For Handle invalid values, select Delete rows, then choose Apply.
The dataset should now have the metadata extracted and the column headers promoted.
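Continuing the earlier pandas sketch, the date conversion and invalid-row deletion map to a common two-line idiom; this is illustrative only, with df holding the flattened table after renaming.

```python
import pandas as pd

# Coerce the column to datetime; values that fail to parse become NaT.
df["Transaction_Date"] = pd.to_datetime(df["Transaction_Date"], errors="coerce")

# Dropping NaT rows mirrors "Handle invalid values: Delete rows".
df = df.dropna(subset=["Transaction_Date"])
```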
Combine multi-line rows into single-line rows
The next issue to address is transaction records that are split across multiple lines. In the following steps, we extract the needed data from those rows and merge them into single-line transactions. In this example specifically, the Reseller Margin data is split across two lines.
Complete the following steps to get the Reseller Margin value on the same line as the corresponding transaction. First, we identify the Reseller Margin rows and store them in a temporary column.
- Choose Conditions and then choose IF.
- For Matching conditions, choose Match all conditions.
- For Source, choose Value of and Transaction_ID.
- For Logical condition, choose Contains.
- For Enter a value, choose Enter custom value and enter Reseller Margin.
- For Flag result value as, choose Custom value.
- For Value if true, choose Select source column and set Value of to TransactionAmount.
- For Value if false, choose Enter custom value and enter Invalid.
- For Destination column, choose ResellerMargin_Temp.
- Choose Apply.
Next, you shift the Reseller Margin value up one row.
- Choose Functions and then choose NEXT.
- For Source column, choose ResellerMargin_Temp.
- For Number of rows, enter 1.
- For Destination column, choose ResellerMargin.
- For Apply transform to, choose All rows.
- Choose Apply.
Next, delete the invalid rows.
- Choose Missing and then choose Remove missing rows.
- For Source column, choose TransactionDate.
- For Missing value action, choose Delete rows with missing values.
- For Apply transform to, choose All rows.
- Choose Apply.
Your dataset should now look like the following screenshot, with the Reseller Margin value extracted to a column by itself.
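Under the hood, the NEXT function behaves like a negative shift: each row reads a value from the row below it. A hypothetical pandas sketch of the whole merge, continuing from the earlier sketches and using the column names from this section:

```python
# Flag the Reseller Margin rows and capture their amount in a temp column.
is_margin = df["Transaction_ID"].str.contains("Reseller Margin", na=False)
df["ResellerMargin_Temp"] = df["TransactionAmount"].where(is_margin)

# NEXT with 1 row: pull the value up from the row below.
df["ResellerMargin"] = df["ResellerMargin_Temp"].shift(-1)

# The margin-only rows have no transaction date; drop them and the temp column.
df = df.dropna(subset=["TransactionDate"]).drop(columns=["ResellerMargin_Temp"])
```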
With the data structured properly, we can move on to mining the cleaned data.
Extract unique identifiers from within strings and text
Many types of data contain important information stored as unstructured text in a cell. In this section, we look at how to extract this data. Within the sample dataset, the BankTransferText column holds valuable information about our resellers’ registered bank accounts and the currency of the transaction, namely the IBAN, SWIFT code, and currency.
Complete the following steps to extract IBAN, SWIFT code, and Currency into separate columns. First, you extract the IBAN number from the text using a regular expression (regex).
- Choose Extract and then choose Custom value or pattern.
- For Create column options, choose Extract values.
- For Source column, choose BankTransferText.
- For Extract options, choose Custom value or pattern.
- For Values to extract, enter [a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}.
- For Destination column, choose IBAN.
- For Apply transform to, choose All rows.
- Choose Apply.
- Extract the SWIFT code from the text using a regex, following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+).
Next, remove the SWIFT Code: label from the extracted text.
- Choose Remove and then choose Custom values.
- For Source column, choose SWIFT Code.
- For Specify values to remove, choose Custom value.
- For Values to remove, enter SWIFT Code:.
- For Apply transform to, choose All rows.
- Choose Apply.
- Extract the currency from the text using a regex, following the same steps used to extract the IBAN number, but using the following regex instead: (?!^)(Currency: )([A-Z]{3}).
- Remove the Currency: label from the extracted text, following the same steps used to remove the SWIFT Code: label.
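These patterns are plain regular expressions, so you can test them outside DataBrew before adding them to the recipe. A quick Python check against a made-up sample cell value (the IBAN is a standard documentation example):

```python
import re

# Made-up sample cell value for testing the three patterns.
text = "Paid via IBAN: GB29NWBK60161331926819 SWIFT Code: NWBKGB2L Currency: GBP"

iban = re.search(r"[a-zA-Z][a-zA-Z][0-9]{2}[A-Z0-9]{1,30}", text)
swift = re.search(r"(?!^)(SWIFT Code: )([A-Z]{2}[A-Z0-9]+)", text)
currency = re.search(r"(?!^)(Currency: )([A-Z]{3})", text)

print(iban.group(0))      # GB29NWBK60161331926819
print(swift.group(2))     # NWBKGB2L (group 2 excludes the label)
print(currency.group(2))  # GBP
```

Note that DataBrew extracts the whole match, label included, which is why the recipe needs the follow-up steps that remove the SWIFT Code: and Currency: labels.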
You can clean up by deleting any unnecessary columns.
- Choose Column and then choose Delete.
- For Source columns, choose BankTransferText.
- Choose Apply.
- Repeat for any remaining columns.
Your dataset should now look like the following screenshot, with IBAN, SWIFT Code, and Currency extracted to separate columns.
Write the transformed data to Amazon S3
With all the steps captured in the recipe, the last step is to write the transformed data to Amazon S3.
- In the project view, choose Create job.
- For Job name, enter FoodMartSalesToDataLake.
- For Output to, choose Amazon S3.
- For File type, choose CSV.
- For Delimiter, choose Comma (,).
- For Compression, choose None.
- For S3 bucket owner’s account, select Current AWS account.
- For S3 location, enter s3://{name of S3 bucket}/clean/.
- For Role name, choose the IAM role created as a prerequisite or create a new role.
- Choose Create and run job.
- Go to the Jobs tab and wait for the job to complete.
- Navigate to the job output folder on the Amazon S3 console.
- Download the CSV file and view the transformed output.
Your dataset should look similar to the following screenshot.
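This step can also be automated. The following is a minimal boto3 sketch that creates and starts the same job, with placeholder bucket and role values:

```python
import boto3

databrew = boto3.client("databrew")

# Create a job that applies the project's recipe and writes CSV output to
# the clean/ prefix. Omitting CompressionFormat leaves output uncompressed.
databrew.create_recipe_job(
    Name="FoodMartSalesToDataLake",
    ProjectName="FoodMartSales-AllUpProject",
    RoleArn="arn:aws:iam::111122223333:role/DataBrewProjectRole",  # placeholder
    Outputs=[{
        "Location": {"Bucket": "your-project-bucket", "Key": "clean/"},  # placeholder
        "Format": "CSV",
        "FormatOptions": {"Csv": {"Delimiter": ","}},
    }],
)

# Start a run and capture its ID for tracking on the Jobs page.
run = databrew.start_job_run(Name="FoodMartSalesToDataLake")
print(run["RunId"])
```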
Clean up
To optimize cost, make sure to clean up the resources deployed for this project by completing the following steps:
- Delete every DataBrew project along with their linked recipes.
- Delete all the DataBrew datasets.
- Delete the contents in your S3 bucket.
- Delete the S3 bucket.
Conclusion
The reality of exchanging data with suppliers is that we can’t always control the shape of the input data. With DataBrew, we can use a list of pre-built transformations and repeatable steps to transform incoming data into a desired layout and extract relevant data and insights from Excel or CSV files. Start using DataBrew today and transform third-party files into structured datasets ready for consumption by your business.
About the Author
Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily works with direct-to-consumer platform companies in the ecommerce, FinTech, PropTech, and HealthTech space to achieve their business objectives with well-architected data platforms.