Amazon SageMaker Data Processing FAQs

General

Amazon SageMaker Data Processing analyzes, prepares, integrates, and orchestrates your data with processing capabilities from Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow (MWAA). You can leverage open-source data processing frameworks such as Apache Spark, analyze data at scale with Trino, and build real-time analytics with Apache Flink and Apache Spark.

Amazon SageMaker Data Processing brings together Amazon EMR, Amazon Athena, AWS Glue, and Amazon Managed Workflows for Apache Airflow.

SageMaker Data Processing helps you explore data, build data transformation jobs, and orchestrate and deploy data pipelines at scale. It delivers faster insights than traditional open-source systems through cost-effective, open-source API-compatible versions of Apache Spark, Apache Airflow, Apache Flink, Trino, and more. Data Processing provides access to your data sources in Amazon SageMaker Lakehouse through zero-ETL integrations, federated querying capabilities, and connectors.

Migration and Access

No, you do not need to migrate to Amazon SageMaker. You can continue to use Amazon EMR, Amazon Athena, AWS Glue, and Amazon Managed Workflows for Apache Airflow as you do today. However, we recommend that you get started with Amazon SageMaker to leverage unified tooling, built-in data governance, and simplified Amazon SageMaker Lakehouse architectures.

There is no impact on the code, queries, jobs, and other resources that you’ve created and used with Amazon EMR, Amazon Athena, or AWS Glue. You can continue to use these services for new workloads, if you prefer. Resources created in these services, such as Amazon EMR on EC2 clusters, are visible in Amazon SageMaker to simplify the development of analytics and AI applications. The existing development experiences built into Amazon EMR, AWS Glue, and Amazon Athena will continue to exist alongside the new development experience within Amazon SageMaker.

The latest version of AWS Glue, Glue 5.0, is available in Amazon SageMaker. Glue 5.0 accelerates data processing workloads and delivers the latest performance-optimized Apache Spark 3.5.2 runtime so you can develop, run, and scale for faster insights. To learn more, visit AWS Glue.
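As an illustration of targeting the Glue 5.0 runtime, the following is a minimal sketch that builds the arguments for the AWS Glue CreateJob API. It assumes you define the job through the Glue API directly rather than through the Amazon SageMaker experience. The job name, IAM role ARN, S3 script path, worker type, and worker count are placeholder assumptions, not values from this FAQ.

```python
def glue5_job_definition(name, role_arn, script_s3_uri, workers=2):
    """Build keyword arguments for the AWS Glue CreateJob API targeting Glue 5.0."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "5.0",  # selects the Glue 5.0 runtime
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        "WorkerType": "G.1X",  # assumed worker size for this sketch
        "NumberOfWorkers": workers,
    }


# Placeholder values for illustration only.
job = glue5_job_definition(
    "example-job",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "s3://example-bucket/scripts/job.py",
)
print(job["GlueVersion"])

# To submit the job (requires boto3 and AWS credentials):
# import boto3
# boto3.client("glue").create_job(**job)
```

Keeping the definition as a plain dictionary lets you review or version it before any call is made to AWS.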

Pricing

Each AWS service that you use through Amazon SageMaker is subject to its own individual pricing. For more details, please consult the AWS pricing pages for Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow.