Amazon SageMaker Data Processing

Analyze, prepare, and integrate data for analytics and AI at any scale

Why SageMaker Data Processing?

Prepare, integrate, and orchestrate your data with the data processing capabilities from Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow (MWAA). Process and integrate your data, wherever it lives, with fast and easy connectivity to hundreds of data sources.

Leverage open-source data processing frameworks such as Apache Spark, Trino, and Apache Flink. Analyze data at scale with Trino without managing infrastructure, and build real-time analytics with Apache Flink and Apache Spark.

Trust that your data is accurate and secure by automating data-quality checks, sensitive data identification, and lineage tracking, and by enforcing fine-grained access controls through native integration with Amazon SageMaker Lakehouse.

Benefits

SageMaker Data Processing provides comprehensive access to data and stream processing frameworks, open-source distributed SQL query engines, and the most popular tools such as notebooks, query editors, and visual ETL.

You can access the most popular frameworks such as Apache Spark to prepare and integrate your data at any scale. Respond to real-time business needs using stream processing with Apache Flink and Spark Streaming, and analyze data with leading open-source SQL frameworks like Trino. Simplify workflow orchestration, without managing infrastructure, through native integration with Amazon Managed Workflows for Apache Airflow (MWAA).

Amazon SageMaker Data Processing natively integrates with SageMaker Lakehouse, allowing you to process and integrate using one copy of your data for all of your use cases including analytics, ad-hoc querying, machine learning, and generative AI.

SageMaker Lakehouse unifies data across Amazon S3 data lakes and Amazon Redshift data warehouses, providing unified access to your data. You can discover and analyze data unified in the Lakehouse with hundreds of connectors, zero-ETL integrations, and federated data sources, giving you a complete picture of your business. SageMaker Lakehouse works out-of-the-box with your existing data architecture, without being constrained by specific storage format or query engine choices.

Improve efficiency with fast query performance over Apache Iceberg tables. Get insights up to 2x faster than traditional open-source systems with high-performance versions of Apache Spark, Apache Airflow, Apache Flink, Trino, and more that remain compatible with the open-source APIs.

SageMaker Data Processing allows you to focus on transforming and analyzing your data without managing compute capacity or open source applications, saving you time and reducing costs. You can automatically provision your capacity on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS). Scaling rules manage changes to your compute demand to optimize performance and runtimes.

Gain trust and transparency with automated data-quality reporting, detection of sensitive data, and lineage tracking for data and AI models through integration with Amazon SageMaker Catalog. Increase confidence in the quality of your data with automatic measuring, monitoring, and recommendations for data-quality rules.
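As a rough illustration of what a data-quality rule measures, the sketch below computes column completeness (the share of non-null values) and compares it with a minimum threshold. It is plain Python with hypothetical column names and thresholds, not the rule syntax or API of any AWS service.

```python
def completeness(rows, column):
    """Return the fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def evaluate_rules(rows, rules):
    """Check each column against its minimum completeness.

    `rules` maps a column name to the minimum acceptable completeness.
    Returns a dict of column name -> True (pass) or False (fail).
    """
    return {
        column: completeness(rows, column) >= minimum
        for column, minimum in rules.items()
    }


# Hypothetical sample data, for illustration only.
rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 3, "email": "c@example.com"},
    {"order_id": 4},
]
results = evaluate_rules(rows, {"order_id": 1.0, "email": 0.75})
print(results)
```

Here `order_id` is fully populated and passes, while `email` is filled in only half of the rows and fails its 0.75 threshold.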

Process and analyze your data securely by adhering to and enforcing fine-grained access controls defined on data sets in SageMaker Lakehouse, enabling you to define permissions once and make your data accessible to authorized users across your organization.

AWS services

Simplified data integration

AWS Glue provides serverless data integration, simplifying data exploration, preparation, and integration from multiple sources. Connect to diverse data sources, manage your data in a centralized data catalog, and visually create, run, and monitor ETL pipelines to load data into your lakehouse. AWS Glue automatically scales on demand, so you can focus on gaining insights from your data without managing infrastructure.

Run and scale Apache Spark, Apache Hive, Trino, and other workloads

Amazon EMR makes it easier and more cost-effective to run data-processing workloads like Apache Spark, Apache Airflow, Apache Flink, Trino, and more. Build and run data-processing pipelines and automatically scale faster than on-premises solutions.

Track costs

Amazon Athena provides a simplified and flexible way to analyze your data at any scale. Athena is an interactive query service that simplifies data analysis in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can choose to pay based on the queries you run or the compute resources your queries need. Use Athena to process logs, perform data analytics, and run interactive queries. Athena scales automatically, running queries in parallel, so results are fast, even with large datasets and complex queries.
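As a minimal sketch of running a standard SQL query through Athena from Python, the code below builds the request for the boto3 Athena client's `start_query_execution` call. The database name, table name, and S3 output location are hypothetical placeholders. boto3 is assumed to be installed and is imported only when a query is actually submitted, so the request-building step can be inspected without AWS credentials.

```python
def build_query_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID.

    Requires boto3 plus valid AWS credentials and permissions.
    """
    import boto3  # imported lazily; assumed to be installed and configured

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_query_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_logs_db",
    output_location="s3://example-athena-results/queries/",
)
print(request)
```

Calling `run_query(request)` would start the query asynchronously; the returned execution ID can then be used to poll for status and fetch results.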

Security-focused and highly available managed workflow orchestration for Apache Airflow

Amazon MWAA is a managed service for Apache Airflow that lets you use your current, familiar Apache Airflow platform to orchestrate your workflows. You gain improved scalability, availability, and security without the operational burden of managing underlying infrastructure. Amazon MWAA orchestrates your workflows using directed acyclic graphs (DAGs) written in Python. You provide Amazon MWAA with an S3 bucket where your DAGs, plugins, and Python requirements reside.
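To illustrate what a DAG expresses, here is a minimal sketch that uses only Python's standard-library `graphlib` module, not the Airflow API. The task names are hypothetical. Each task lists the tasks it depends on, and a topological sort yields an order in which an orchestrator could run them.

```python
import graphlib

# Hypothetical pipeline: each key maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
}


def execution_order(dag):
    """Return one valid run order for the tasks.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why the graph must be acyclic.
    """
    return list(graphlib.TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

In a real Airflow DAG file, the same dependencies would be declared between operators, and the scheduler would decide when each task runs.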

Use cases

Quickly identify and access unified data across AWS, on premises, and other clouds, and then make it instantly available for querying and transforming.

Process data using frameworks like Apache Spark, Apache Flink, and Trino across a range of workloads, including batch, micro-batch, and streaming.
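The difference between these workload styles can be sketched in plain Python. This illustrates micro-batching only and is not Spark or Flink code; the batch size and records are made up. A batch job processes a whole dataset at once, a streaming job handles each record as it arrives, and a micro-batch job groups an incoming stream into small chunks and processes each chunk as a unit.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of up to `batch_size` records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


events = range(7)  # stand-in for an incoming event stream
totals = [sum(batch) for batch in micro_batches(events, batch_size=3)]
print(totals)  # [3, 12, 6]
```

Because `micro_batches` is a generator, it pulls records only as each batch is requested, so it also works with sources too large to hold in memory.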

Run large-scale data processing and what-if analysis using statistical algorithms and predictive models to uncover hidden patterns, correlations, market trends, and customer preferences.