AWS Big Data Blog

Author visual ETL flows on Amazon SageMaker Unified Studio (preview)

Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker. From the Unified Studio, you can collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. This experience includes visual ETL, a new visual interface that makes it simple for data engineers to author, run, and monitor extract, transform, and load (ETL) data integration flows. You can use a simple visual interface to compose flows that move and transform data and run them on serverless compute. Additionally, you can choose to author your visual flows in English using generative AI prompts powered by Amazon Q. Visual ETL also automatically converts your visual flow directed acyclic graph (DAG) into Spark native scripts so you can continue authoring in a notebook, providing a quick-start experience for developers who prefer to author with code.

This post shows how you can build a low-code and no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources. We demonstrate how to build a visual ETL flow, run it, and query the resulting output table with Amazon Athena.

Additionally, we explore how generative AI can enhance your LCNC visual ETL development process, creating an intuitive and powerful workflow that streamlines the entire development experience.

Use case walkthrough

In this example, we use Amazon SageMaker Unified Studio to develop a visual ETL flow. This pipeline reads data from an Amazon S3-based file location, performs transformations on the data, and subsequently writes the transformed data back into an Amazon S3-based AWS Glue Data Catalog table. We use the allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability.

The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts. Analysts can use this dataset to track how ticket sales change over time, evaluate the performance of sellers, and determine the most successful events, venues, and seasons in terms of ticket sales.

The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. The data is then aggregated to calculate the number of events by venue name. Finally, the transformed output data is saved to Amazon S3, and a new AWS Glue Data Catalog table is created.

The following diagram illustrates the architecture:

Prerequisites

To follow the instructions in this post, you must have the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project created with the Data analytics and machine learning project profile

Build a visual ETL flow

Complete the following steps to build a new visual ETL flow with the sample dataset:

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows, as shown in the following screenshot.

  3. Select your project and choose Continue.

  4. Choose Create visual ETL flow.

In this section, you define the ETL flow manually. Later in this post, you generate a flow from a prompt using Amazon Q.

  5. On the top left, choose the + icon in the circle. Under Data sources, choose Amazon S3, as shown in the following screenshot. Place the node anywhere on the canvas.

  6. Choose the Amazon S3 source node and enter the following values (a rough PySpark equivalent of these settings follows):
    • S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
    • Format: CSV
    • Delimiter: ,
    • Multiline: Enabled
    • Header: Disabled

Leave the rest as default.
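These source node settings map closely to Spark's CSV reader options. The following is a minimal PySpark sketch of roughly what the node configures, assuming a Spark environment with access to the S3 path; it is not the script that Visual ETL generates.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venue-source-sketch").getOrCreate()

# Approximate equivalent of the Amazon S3 source node settings above
venue_df = (
    spark.read.format("csv")
    .option("sep", ",")         # Delimiter
    .option("multiLine", True)  # Multiline: Enabled
    .option("header", False)    # Header: Disabled
    .load("s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv")
)
venue_df.show(5)  # columns arrive as _c0, _c1, ... because there is no header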

  7. Wait for the data preview to be available at the bottom of the screen.

  8. Choose the + icon in the circle to the right of the Amazon S3 node. Under Transforms, choose Rename Columns.

  9. Choose the Rename Columns node and choose Add new rename pair. For Current name and New name, enter the following pairs:
    • _c0: venueid
    • _c1: venuename
    • _c2: venuecity
    • _c3: venuestate
    • _c4: venueseats

  10. Choose the + icon to the right of the Rename Columns node. Under Transforms, choose Filter.
  11. Choose Add new filter condition.
  12. For Key, choose venuestate. For Operation, choose ==. For Value, enter DC, as shown in the following screenshot.
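The rename and filter steps above roughly correspond to the following PySpark operations, continuing the sketch from the source node and assuming the venue file has exactly these five columns:

# Approximate equivalent of the Rename Columns node (positional rename)
venue_df = venue_df.toDF("venueid", "venuename", "venuecity", "venuestate", "venueseats")

# Approximate equivalent of the Filter node (venuestate == 'DC')
venue_dc_df = venue_df.filter(venue_df["venuestate"] == "DC")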

  13. Repeat steps 5 and 6 to add the Amazon S3 source node for the events data:
    • S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/events.csv
    • Format: CSV
    • Delimiter: ,
    • Multiline: Enabled
    • Header: Disabled

Leave the rest as default.

  14. Repeat steps 7 and 8 for the new Amazon S3 source node. On its Rename Columns node, choose Add new rename pair. For Current name and New name, enter the following pairs:
    • _c0: eventid
    • _c1: e_venueid
    • _c2: catid
    • _c3: dateid
    • _c4: eventname
    • _c5: starttime

  15. Choose the + icon to the right of the Rename Columns node. Under Transforms, choose Join.
  16. Drag the + icon at the right of the Filter node and drop it at the left of the Join node.
  17. For Join type, choose Inner. For Left data source, choose e_venueid. For Right data source, choose venueid.
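In PySpark terms, this join corresponds to an inner equi-join on the venue ID columns, as in the following hedged sketch (events_df is assumed to be the events data after its own rename step):

# Approximate equivalent of the Join node: inner join events to the filtered venues
joined_df = events_df.join(
    venue_dc_df,
    events_df["e_venueid"] == venue_dc_df["venueid"],
    "inner",
)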

  18. Choose the + icon to the right of the Join node. Under Transforms, choose SQL Query.
  19. Enter the following query statement:
select 
  venuename,
  count(distinct eventid) as eventid_count 
from {myDataSource} 
group by venuename
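In the SQL Query node, {myDataSource} is a placeholder that refers to the output of the upstream node connected to it (here, the Join node). Conceptually, this is similar to registering the joined data as a temporary view and running the same query with spark.sql, as in this sketch:

# Rough PySpark analogue of the SQL Query node
joined_df.createOrReplaceTempView("myDataSource")

agg_df = spark.sql("""
    SELECT venuename,
           COUNT(DISTINCT eventid) AS eventid_count
    FROM myDataSource
    GROUP BY venuename
""")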

  20. Choose the + icon to the right of the SQL Query node. Under Data target, choose Amazon S3.
  21. Choose the Amazon S3 target node and enter the following values (a rough Spark equivalent of the write follows the list):
    • S3 URI: <choose s3 location from project overview page and add suffix “/output/venue_event/”> (for example, s3://<bucket-name>/dzd_bd693kieeb65yf/52d3z1nutb42w7/dev/output/venue_event/)
    • Format: Parquet
    • Compression: Snappy
    • Mode: Overwrite
    • Update catalog: True
    • Database: Choose your database
    • Table: venue_event_agg

At this point, you have built the complete end-to-end visual flow. Now you can save it.

  22. On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description. Choose Save to project, as shown in the following screenshot.

The visual ETL flow has been successfully saved.

Run flow

This section shows you how to run the visual ETL flow you authored.

  1. On the top right, choose Run.

The run status is shown at the bottom of the screen. The status transitions from Starting to Running, and finally to Finished.

  2. Wait for the run to be Finished.

Query using Amazon Athena

The output data has been written to the target S3 bucket. This section shows you how to query the output table.

  1. On the top left menu, under DATA ANALYSIS & INTEGRATION, choose Query Editor.

  2. In the data explorer, under Lakehouse, choose AwsDataCatalog. Navigate to the table venue_event_agg.
  3. From the three dots icon, choose Query with Athena.

Four records are returned, as shown in the following screenshot. This confirms that you successfully queried the output table written by the visual ETL flow.
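If you prefer to verify the output programmatically instead of through the query editor, the following is a minimal boto3 sketch. The database name and the S3 location for Athena query results are assumptions you would replace with your own values.

import time
import boto3

athena = boto3.client("athena")

# Run a simple query against the table created by the flow
query = athena.start_query_execution(
    QueryString="SELECT * FROM venue_event_agg",
    QueryExecutionContext={"Database": "<your_database>"},
    ResultConfiguration={"OutputLocation": "s3://<your-athena-results-bucket>/queries/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to finish
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the returned rows (the first row is the header)
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])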

Use generative AI to generate a visual ETL flow

The preceding instructions walked through the authoring steps manually on the visual interface. Alternatively, SageMaker Unified Studio can automate flow authoring by using generative AI powered by Amazon Q.

  1. On the top left menu, choose Visual ETL flows.
  2. Choose Create visual ETL flow.
  3. Enter the following text and choose Submit.

Create a flow to connect 2 Glue catalog tables venue and event in database glue_db, join on event id , filter on venue state with condition as venuestate=='DC' and write output to a S3 location

This creates the following boilerplate flow that you can edit to quickly author the visual ETL flow.

The generated flow keeps the context of the prompt at the node level.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

  1. From the SQL querybook, enter the following SQL to drop the table:
drop table venue_event_agg
  2. To delete the flow, under Actions, choose Delete flow.

Conclusion

This post demonstrated how you can use Amazon SageMaker Unified Studio to build a low-code no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources.

To learn more, refer to the Amazon SageMaker Unified Studio documentation and the AWS News Blog.


About the Authors

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Analytics services in New York City. She is a passionate advocate for usability and accessibility. In her free time, she’s an espresso enthusiast and enjoys building mechanical keyboards.

Ranu Shah is a Software Development Manager with AWS Analytics services. She loves building data analytics features for customers. Outside work, she enjoys reading books or listening to music.

Gal Heyne is a Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.