AWS Big Data Blog

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

Amazon Q data integration, introduced in January 2024, allows you to use natural language to author extract, transform, and load (ETL) jobs and operations on the AWS Glue-specific data abstraction, DynamicFrame. This post introduces exciting new capabilities for Amazon Q data integration that work together to make ETL development more efficient and intuitive. We’ve added support for DataFrame-based code generation that works across any Spark environment. We’ve also introduced in-prompt context-aware development that applies details from your conversations, working seamlessly with a new iterative development experience. This means you can refine your ETL jobs through natural follow-up questions, starting with a basic data pipeline and progressively adding transformations, filters, and business logic through conversation. These improvements are available through the Amazon Q chat experience on the AWS Management Console and the Amazon SageMaker Unified Studio (preview) visual ETL and notebook interfaces.

The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. You can now generate data integration jobs for various data sources and destinations, including Amazon Simple Storage Service (Amazon S3) data lakes with popular file formats like CSV, JSON, and Parquet, as well as modern table formats such as Apache Hudi, Delta, and Apache Iceberg. Amazon Q can generate ETL jobs for connecting to over 20 different data sources, including relational databases like PostgreSQL, MySQL, and Oracle; data warehouses like Amazon Redshift, Snowflake, and Google BigQuery; NoSQL databases like Amazon DynamoDB, MongoDB, and OpenSearch; tables defined in the AWS Glue Data Catalog; and custom user-supplied JDBC and Spark connectors. Your generated jobs can use a variety of data transformations, including filters, projections, unions, joins, and aggregations, giving you the flexibility to handle complex data processing requirements.
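Because the generated code is standard DataFrame-based PySpark, it runs in any Spark environment. The following minimal, hand-written sketch illustrates the style of code involved; it is not output from Amazon Q, and the bucket, prefix, and column names are hypothetical. It reads CSV files from Amazon S3, applies a filter, projection, and aggregation, and writes the result back to Amazon S3 as Parquet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Standard Spark session; DataFrame-based code like this runs in any Spark environment
spark = SparkSession.builder.appName("q-dataframe-style-example").getOrCreate()

# Read CSV files from an S3 data lake (hypothetical bucket and prefix)
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-bucket/raw/orders/")
)

# Filter, project, and aggregate with standard DataFrame transformations
daily_totals = (
    orders
    .filter(F.col("order_status") == "COMPLETED")
    .select("order_date", "order_amount")
    .groupBy("order_date")
    .agg(F.sum("order_amount").alias("total_amount"))
)

# Write the result back to Amazon S3 in Parquet format
daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")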

In this post, we discuss how Amazon Q data integration transforms ETL workflow development.

Improved capabilities of Amazon Q data integration

Previously, Amazon Q data integration only generated code with template values, requiring you to manually fill in configurations such as the connection properties for the data source and data sink and the settings for each transform. With in-prompt context awareness, you can now include this information in your natural language query, and Amazon Q data integration will automatically extract it and incorporate it into the workflow. In addition, generative visual ETL in the SageMaker Unified Studio (preview) visual editor allows you to iterate on and refine your ETL workflow with new requirements, enabling incremental development.

Solution overview

This post describes the end-to-end user experience to demonstrate how Amazon Q data integration and SageMaker Unified Studio (preview) simplify your data integration and data engineering tasks with the new enhancements. We build a low-code no-code (LCNC) ETL workflow that enables seamless data ingestion and transformation across multiple data sources.

We demonstrate how to do the following:

  • Connect to diverse data sources
  • Perform table joins
  • Apply custom filters
  • Export processed data to Amazon S3

The following diagram illustrates the architecture.

Using Amazon Q data integration with Amazon SageMaker Unified Studio (preview)

In the first example, we use Amazon SageMaker Unified Studio (preview) to develop a visual ETL workflow incrementally. This pipeline reads data from different Amazon S3-based Data Catalog tables, performs transformations on the data, and writes the transformed data back to Amazon S3. We use the allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability. The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts.

The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. Then the transformed output data is saved to Amazon S3 for further processing in the future.

Data preparation

The two datasets are hosted as two Data Catalog tables, venue and event, in a project in Amazon SageMaker Unified Studio (preview), as shown in the following screenshots.

Data processing

To process the data, complete the following steps:

  1. On the Amazon SageMaker Unified Studio console, on the Build menu, choose Visual ETL flow.

An Amazon Q chat window will help you provide a description for the ETL flow to be built.

  2. For this post, enter the following text:
    Create a Glue ETL flow to connect to 2 Glue catalog tables venue and event in my database glue_db_4fthqih3vvk1if, join the results on the venue’s venueid and event’s e_venueid, and write output to an S3 location.
    (The database name is generated automatically, with the project ID suffixed to the database name you provided.)
  3. Choose Submit.

An initial data integration flow is generated, as shown in the following screenshot, that reads from the two Data Catalog tables, joins the results, and writes to Amazon S3. The join node configuration shows that the join conditions were correctly inferred from our request.

Let’s add a filter transform based on the venue state DC.

  4. Choose the plus sign and choose the Amazon Q icon to ask a follow-up question.
  5. Enter the following instruction to modify the workflow: filter on venue state with condition as venuestate=='DC' after joining the results.

The workflow is updated with a new filter transform.

Upon checking the S3 data target, we can see that the S3 path is a placeholder <s3-path> and the output format is Parquet.

  6. In the same way, we can ask the following question in Amazon Q to update the Amazon S3 data target:
    update the s3 sink node to write to s3://xxx-testing-in-356769412531/output/ in CSV format
  7. Choose Show script to see that the generated code is DataFrame based, with all the context from our conversation in place. (A sketch of what such a script might look like follows this list.)
  8. Finally, we can preview the data to be written to the target S3 path. Note that the data is the joined result with only the venue state DC included.
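The generated script itself appears in the Show script view; as a reference, a minimal DataFrame-based sketch of the flow we built through this conversation might look like the following. This is an illustrative approximation, not the exact code Amazon Q produces, and it assumes the two Data Catalog tables can be read through the Spark session as database.table names.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("venue-event-join").getOrCreate()

# Read the two Data Catalog tables (the database name carries the project ID suffix)
venue = spark.read.table("glue_db_4fthqih3vvk1if.venue")
event = spark.read.table("glue_db_4fthqih3vvk1if.event")

# Join on the venue ID, as described in the first prompt
joined = venue.join(event, venue["venueid"] == event["e_venueid"], "inner")

# Keep only venues in DC, as requested in the follow-up prompt
filtered = joined.filter(F.col("venuestate") == "DC")

# Write the result to the S3 target in CSV format, per the final prompt
(
    filtered.write
    .mode("overwrite")
    .option("header", "true")
    .csv("s3://xxx-testing-in-356769412531/output/")
)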

With Amazon Q data integration in Amazon SageMaker Unified Studio (preview), an LCNC user can create the visual ETL workflow by providing prompts to Amazon Q, and the context for data sources and transformations is preserved. Amazon Q also generates the DataFrame-based code, so data engineers and more experienced users can use the automatically generated ETL code for scripting purposes.

Amazon Q data integration with Amazon SageMaker Unified Studio (preview) notebook

Amazon Q data integration is also available in the Amazon SageMaker Unified Studio (preview) notebook experience. You can add a new cell and enter a comment describing what you want to achieve. After you press Tab and Enter, the recommended code is shown.

For example, we provide the same initial question:

Create a Glue ETL flow to connect to 2 Glue catalog tables venue and event in my database glue_db_4fthqih3vvk1if, join the results on the venue’s venueid and event’s e_venueid, and write output to an S3 location.

Similar to the Amazon Q chat experience, the code is recommended. If you press Tab, the recommended code is accepted.

The following video provides a full demonstration of these two experiences in Amazon SageMaker Unified Studio (preview).

Using Amazon Q data integration with AWS Glue Studio

In this section, we walk through the steps to use Amazon Q data integration with AWS Glue Studio.

Data preparation

The two datasets are hosted in two Amazon S3 based Data Catalog tables, event and venue, in the database glue_db, which we can query from Amazon Athena. The following screenshot shows an example of the venue table.

Data processing

To start using the AWS Glue code generation capability, choose the Amazon Q icon on the AWS Glue Studio console. You can start authoring a new job and ask Amazon Q to create the same workflow:

Create a Glue ETL flow to connect to 2 Glue catalog tables venue and event in my database glue_db, join the results on the venue’s venueid and event’s e_venueid, and then filter on venue state with condition as venuestate=='DC' and write to s3://<s3-bucket>/<folder>/output/ in CSV format.

You can see the same code is generated with all configurations in place. From this response, you can learn how to author AWS Glue code for your needs. You can copy and paste the generated code into the script editor. After you configure an AWS Identity and Access Management (IAM) role on the job, save and run the job. When the job is complete, you can begin querying the data exported to Amazon S3.
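The generated script isn't reproduced here, but a minimal sketch of a DataFrame-based AWS Glue job for this prompt could resemble the following. It adds the standard Glue job setup around the same join, filter, and write logic, and it assumes the job is configured to use the AWS Glue Data Catalog as the Spark metastore so the tables can be read by name; the S3 bucket and folder placeholders are carried over from the prompt.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard AWS Glue job initialization
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the two Data Catalog tables (assumes the Data Catalog is enabled as the Spark metastore)
venue = spark.read.table("glue_db.venue")
event = spark.read.table("glue_db.event")

# Join on the venue ID and keep only venues in DC
result = (
    venue.join(event, venue["venueid"] == event["e_venueid"], "inner")
    .filter(F.col("venuestate") == "DC")
)

# Write the output to Amazon S3 in CSV format (replace the placeholders with your bucket and folder)
result.write.mode("overwrite").option("header", "true").csv("s3://<s3-bucket>/<folder>/output/")

job.commit()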

After the job is complete, you can verify the joined data by checking the specified S3 path. The data is filtered on venue state DC and is now ready for downstream workloads to process.

The following video provides a full demonstration of the experience with AWS Glue Studio.

Conclusion

In this post, we explored how Amazon Q data integration transforms ETL workflow development, making it more intuitive and time-efficient. The latest enhancements include in-prompt context awareness, which accurately generates a data integration flow with reduced hallucinations, and multi-turn chat capabilities, which let you incrementally update the data integration flow, add new transforms, and update DAG nodes. Whether you’re working in the console or in other Spark environments with SageMaker Unified Studio (preview), these new capabilities can significantly reduce your development time and complexity.

To learn more, refer to Amazon Q data integration in AWS Glue.


About the Authors

Bo Li is a Senior Software Development Engineer on the AWS Glue team. He is devoted to designing and building end-to-end solutions to address customers’ data analytic and processing needs with cloud-based, data-intensive technologies.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features and distributed systems for data integration.

Shubham Mehta is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.