Amazon SageMaker Data Processing
Analyze, prepare, and integrate data for analytics and AI at any scale
Why SageMaker Data Processing?
Prepare, integrate, and orchestrate your data with the data processing capabilities from Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow (MWAA). Process and integrate your data, wherever it lives, with fast and easy connectivity to hundreds of data sources.
Leverage open-source data processing frameworks such as Apache Spark, Trino, and Apache Flink. Analyze data at scale with Trino without managing infrastructure, and seamlessly build real-time analytics with Apache Flink and Apache Spark.
Trust that your data is accurate and secure by automating data quality checks, sensitive data identification, and lineage tracking, and by enforcing fine-grained access controls through native integration with Amazon SageMaker Lakehouse.
Benefits
AWS services
Simplified data integration
AWS Glue provides serverless data integration, simplifying data exploration, preparation, and integration from multiple sources. Connect to diverse data sources, manage your data in a centralized data catalog, and visually create, run, and monitor ETL pipelines to load data into your lakehouse. AWS Glue automatically scales on demand, so you can focus on gaining insights from your data without managing infrastructure.
Run and scale Apache Spark, Apache Hive, Trino, and other workloads
Amazon EMR makes it easier and more cost-effective to run data-processing workloads like Apache Spark, Apache Airflow, Apache Flink, Trino, and more. Build and run data-processing pipelines and automatically scale faster than on-premises solutions.
Track costs
Amazon Athena provides a simplified and flexible way to analyze your data at any scale. Athena is an interactive query service that simplifies data analysis in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can choose to pay based on the queries you run or the compute resources needed by your queries. Use Athena to process logs, perform data analytics, and run interactive queries. Athena automatically scales and completes queries in parallel, so results are fast, even with large datasets and complex queries.
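As a rough illustration of running a standard SQL query with Athena, the sketch below assembles the parameters for the StartQueryExecution API and submits them through the AWS SDK for Python (boto3). The table name access_logs, the database example_logs_db, and the results bucket s3://example-athena-results/ are hypothetical placeholders, and the run_query helper assumes boto3 is installed and AWS credentials are configured.

```python
def build_query_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID.

    Requires boto3 and valid AWS credentials; not called in this sketch.
    """
    import boto3  # imported lazily so the request builder works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical log-analysis query; every name below is a placeholder.
request = build_query_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_logs_db",
    output_location="s3://example-athena-results/queries/",
)
print(request)
```

Calling run_query(request) would start the query asynchronously; results are written to the configured S3 output location once the query completes.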
Security-focused and highly available managed workflow orchestration for Apache Airflow
Amazon MWAA is a managed service for Apache Airflow that lets you use your current, familiar Apache Airflow platform to orchestrate your workflows. You gain improved scalability, availability, and security without the operational burden of managing underlying infrastructure. Amazon MWAA orchestrates your workflows using directed acyclic graphs (DAGs) written in Python. You provide Amazon MWAA with an Amazon S3 bucket where your DAGs, plugins, and Python requirements reside, and Amazon MWAA deploys Apache Airflow at scale for you.
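To make the idea of a directed acyclic graph concrete, the sketch below models a small pipeline as task dependencies and derives a valid run order. It uses only Python's standard graphlib module, not the Apache Airflow API, and the task names are hypothetical; an actual Amazon MWAA workflow would declare the same dependencies in an Airflow DAG file.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_datasets": {"extract_orders", "extract_customers"},
    "load_to_lakehouse": {"join_datasets"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is what "acyclic" rules out.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)

# A cyclic definition is rejected rather than scheduled.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Because the two extract tasks do not depend on each other, a scheduler is free to run them in parallel; only the join and load steps must wait.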