SageMaker Lakehouse pricing
Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data with all Apache Iceberg–compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and machine learning (ML) tools and engines. In addition, you can access data from operational databases and applications through zero-ETL integrations, and data from third-party sources through federated query capabilities in the lakehouse.
SageMaker Lakehouse is directly accessible from Amazon SageMaker Unified Studio (preview). Data from different sources is organized in logical containers called catalogs in SageMaker Lakehouse. Each catalog represents data either from existing data sources such as data warehouses and third-party databases or created directly in the lakehouse to store data in Amazon S3 or Amazon Redshift Managed Storage (RMS). Query engines can connect to these catalogs and access data in place with Apache Iceberg APIs. You can use any Apache Iceberg–compatible engine, first- or third-party, such as Apache Spark, Trino, Amazon Athena, or Amazon EMR, to access and query the data as Apache Iceberg tables. Similarly, the catalogs are mounted as databases in first-party query engines such as Amazon Redshift clusters and workgroups. You can connect to these databases from query tools through Java Database Connectivity (JDBC) or Amazon Redshift Query Editor V2 and query them using SQL.
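As a rough illustration, the sketch below shows how an Apache Iceberg–compatible engine such as Apache Spark might be pointed at a lakehouse catalog backed by the AWS Glue Data Catalog. It assumes the Iceberg Spark runtime and AWS bundle JARs are on the classpath, and the catalog name, database, table, and S3 warehouse path (lakehouse, sales_db, orders, s3://example-bucket/warehouse/) are placeholders, not names defined by SageMaker Lakehouse.

```python
# Minimal PySpark sketch: querying a SageMaker Lakehouse catalog as Apache Iceberg tables.
# Catalog, database, table names, and the S3 warehouse path below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-iceberg-query")
    # Register an Iceberg catalog named "lakehouse" backed by the AWS Glue Data Catalog.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse/")  # placeholder
    .config("spark.sql.catalog.lakehouse.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Query a table in the lakehouse catalog in place, in Apache Iceberg table format.
spark.sql("SELECT order_id, amount FROM lakehouse.sales_db.orders LIMIT 10").show()
```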
SageMaker Lakehouse has the following underlying components. You pay only for the components you use in the lakehouse.
SageMaker Lakehouse metadata: Data definitions are organized in a logical hierarchy of catalogs, databases, and tables using AWS Glue Data Catalog.
- Catalog: A logical container that holds objects from a data store, such as schemas, tables, views, or materialized views from Amazon Redshift. You can nest catalogs under a catalog to match the hierarchy levels of the data source you are bringing to the lakehouse.
- Database: Databases organize data objects such as tables and views in the lakehouse.
- Tables and views: Tables and views are data objects in a database that describe how to access the underlying data, including its schema, partitions, storage location, storage format, and the SQL query used to access it.
SageMaker Lakehouse metadata can be accessed through AWS Glue APIs. For metadata storage and API requests, AWS Glue Data Catalog metadata pricing applies, including the AWS Free Tier. For more information, visit AWS Glue pricing.
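As a minimal sketch, the metadata hierarchy can be walked with the AWS Glue Data Catalog APIs; the calls below list databases and their tables in the account's default catalog, and pagination is omitted for brevity.

```python
# Sketch: listing lakehouse metadata through AWS Glue Data Catalog APIs.
# Each of these API requests is billed under AWS Glue Data Catalog pricing.
import boto3

glue = boto3.client("glue")

# List databases in the account's default catalog.
for database in glue.get_databases()["DatabaseList"]:
    print("database:", database["Name"])
    # List the tables and views registered in each database.
    for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
        print("  table:", table["Name"])
```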
Data storage and access: Using SageMaker Lakehouse, you can read and write data in Amazon S3 or RMS. Depending on the storage type you choose for your data in the lakehouse, you will incur additional storage and compute costs to access the underlying storage. Visit AWS Glue pricing for more details on storage and compute pricing for each storage type.
Statistics and Apache Iceberg table maintenance: In SageMaker Lakehouse, you can automate statistics collection on data lake tables in Amazon S3 for faster query execution, and Apache Iceberg table maintenance, such as compaction, to optimize the storage layout of your Apache Iceberg tables. You will incur additional charges when you enable these features. For more information, visit AWS Glue pricing.
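As a hedged sketch of how such maintenance is typically enabled, the example below turns on automatic compaction for an Apache Iceberg table through the AWS Glue table optimizer API; the catalog ID, database, table, and IAM role ARN are placeholders.

```python
# Sketch: enabling automatic compaction on an Apache Iceberg table via AWS Glue.
# Enabling this optimizer incurs additional charges per AWS Glue pricing.
# The catalog ID, database, table, and IAM role ARN below are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_table_optimizer(
    CatalogId="123456789012",   # AWS account ID that owns the catalog (placeholder)
    DatabaseName="sales_db",    # placeholder database
    TableName="orders",         # placeholder Iceberg table
    Type="compaction",          # optimize the table's storage layout
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizerRole",  # placeholder role
        "enabled": True,
    },
)
```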
Permissions: Fine-grained permissions in SageMaker Lakehouse are powered by AWS Lake Formation. Permissions on SageMaker Lakehouse are free. For more details, visit Lake Formation pricing.
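For illustration only, the sketch below grants column-level SELECT access on a lakehouse table through the Lake Formation API; the principal ARN, database, table, and column names are placeholders.

```python
# Sketch: granting fine-grained (column-level) access on a lakehouse table
# through AWS Lake Formation. All identifiers below are placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "amount"],  # restrict access to specific columns
        }
    },
    Permissions=["SELECT"],
)
```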
Zero-ETL integration costs
Amazon SageMaker has zero-ETL integrations with applications, removing the need to build and manage extract, transform, and load (ETL) pipelines. Supported applications include Salesforce, ServiceNow, Zendesk, and more.
These integrations provide you with flexibility, so you can choose specific data tables in an application to automatically replicate to Amazon Redshift. This flexibility allows you to run unified analytics across multiple applications and data sources. AWS does not charge an additional fee for the zero-ETL integration. You pay for existing resources used to create and process the change data created as part of a zero-ETL integration. This includes additional Amazon Redshift storage for storing replicated data, compute resources for processing data replication (or RPUs on Amazon Redshift Serverless), and cross-AZ data transfer costs for moving data from source to target. Ongoing processing of data changes by zero-ETL integration is offered at no additional charge. For more information, visit Amazon Aurora pricing, Amazon Relational Database Service (Amazon RDS) for MySQL pricing, Amazon DynamoDB pricing, and AWS Glue pricing.