AWS Machine Learning Blog
Amazon SageMaker now integrates with Amazon DataZone to streamline machine learning governance
Amazon SageMaker is a fully managed machine learning (ML) service that provides a range of tools and features for building, training, and deploying ML models. Amazon DataZone is a data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources.
Today, we are excited to announce an integration between Amazon SageMaker and Amazon DataZone to help you set up infrastructure with security controls, collaborate on machine learning (ML) projects, and govern access to data and ML assets.
When solving a business problem with ML, you create ML models from training data and integrate those models with business applications to make predictive decisions. For example, you could use an ML model for loan application processing to make decisions such as approving or denying a loan. When deploying such ML models, effective ML governance helps build trust in ML-powered applications, minimize risks, and promote responsible AI practices.
A comprehensive governance strategy spans across infrastructure, data, and ML. ML governance requires implementing policies, procedures, and tools to identify and mitigate various risks associated with ML use cases. Applying governance practices at every stage of the ML lifecycle is essential for successfully maximizing the value for the organization. For example, when building a ML model for a loan application processing use case, you can align the model development and deployment with your organization’s overall governance policies and controls to create effective loan approval workflows.
However, it might be challenging and time-consuming to apply governance across an ML lifecycle because it typically requires custom workflows and integration of several tools. With the new built-in integration between SageMaker and Amazon DataZone, you can streamline setting up ML governance across infrastructure, collaborate on business initiatives, and govern data and ML assets in just a few clicks.
For governing ML use cases, this new integration offers the following capabilities:
- Business project management – You can create, edit, and view projects, as well as add users to start collaborating on the shared business objective
- Infrastructure management – You can create multiple project environments and deploy infrastructure resources with embedded security controls to meet the enterprise needs
- Asset governance – Users can search, discover, request access, and publish data and ML assets along with business metadata to the enterprise business catalog
In this post, we dive deep into how to set up and govern ML use cases. We discuss the end-to-end journey for setup and configuration of the SageMaker and Amazon DataZone integration. We also discuss how you can use self-service capabilities to discover, subscribe, consume, and publish data and ML assets as you work through your ML lifecycle.
Solution overview
With Amazon DataZone, administrators and data stewards who oversee an organization’s data assets can manage and govern access to data. These controls are designed to enforce access with the right level of privileges and context. Amazon DataZone makes it effortless for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization so that they can discover, use, and collaborate to derive data-driven insights. The following diagram illustrates a sample architecture of Amazon DataZone and Amazon SageMaker integration.
With this integration, you can deploy SageMaker infrastructure using blueprints. The new SageMaker blueprint provides a well-architected infrastructure template. With this template, ML administrators can build a SageMaker environment profile with appropriate controls from services such as Amazon Virtual Private Cloud (VPC), Amazon Key Management Service (KMS Keys), and AWS Identity and Access Management (IAM), and enable ML builders to use this environment profile to deploy a SageMaker domain in minutes. When you create a SageMaker environment using the SageMaker environment profile, Amazon DataZone provisions a data and ML asset catalog, Amazon SageMaker Studio, and (IAM) roles for managing Amazon DataZone project permissions. The following diagram shows how the SageMaker environment fits in with the existing environments in Amazon DataZone projects.
To facilitate data and ML asset governance from SageMaker Studio, we extended SageMaker Studio to incorporate the following component:
- Asset – A data or ML resource that can be published to a catalog or project inventory, discovered, and shared. Amazon Redshift tables and AWS Glue tables are original Amazon DataZone assets. With this integration, we introduce two more asset types: SageMaker Feature Groups and Model Package Groups.
- Owned assets – A collection of project inventory assets discoverable only by project members. These are the staging assets in the project inventory that are not available to Amazon DataZone domain users until they are explicitly published to the Amazon DataZone business catalog.
- Asset catalog – A collection of published assets in the Amazon DataZone business catalog discoverable across your organization with business context, thereby enabling everyone in your organization to find assets quickly for their use case.
- Subscribed assets – A collection of assets the subscriber has been approved from the Amazon DataZone business catalog. Owners of those assets have to approve the request for access before the subscriber can consume them.
The following diagram shows an example of an ML asset like Customer-Churn-Model
lifecycle with the described components.
In the following sections, we show you the user experience of the SageMaker and Amazon DataZone integration with an example. We demonstrate how to set up Amazon DataZone, including a domain, project, and SageMaker environment, and how to perform asset management using SageMaker Studio. The following diagram illustrates our workflow.
Set up an Amazon DataZone domain, project, and SageMaker environment
On the Amazon DataZone console, administrators create an Amazon DataZone domain, get access to the Amazon DataZone data portal, and provision a new project with access to specific data and users.
Administrators use the SageMaker blueprint that has enterprise level security controls to setup the SageMaker environment profile. Then, the SageMaker infrastructure with appropriate organizational boundaries will deploy in minutes so that ML builders can start using it for their ML use cases.
In the Amazon DataZone data portal, ML builders can create or join a project to collaborate on the business problem being solved. To start their ML use case in SageMaker, they use the SageMaker environment profile made by the administrators to create a SageMaker environment or use an existing one.
ML builders can then seamlessly federate into SageMaker Studio from the Amazon DataZone data portal with just a few clicks. The following actions can happen in SageMaker Studio:
- Subscribe – SageMaker allows you to find, access, and consume the assets in the Amazon DataZone business catalog. When you find an asset in the catalog that you want to access, you need to subscribe to the asset, which creates a subscription request to the asset owner.
- Publish – SageMaker allows you to publish your assets and their metadata as an owner of the asset to the Amazon DataZone business catalog so that others in the organization can subscribe and consume in their ML use cases.
Perform asset management using SageMaker Studio
In SageMaker Studio, ML builders can search, discover, and subscribe to data and ML assets in their business catalog. They can consume these assets for ML workflows such as data preparation, model training, and feature engineering in SageMaker Studio and SageMaker Canvas. Upon completing the ML tasks, ML builders can publish data, models, and feature groups to the business catalog for governance and discoverability.
Search and discover assets
After ML builders are federated into SageMaker Studio, they can view the Assets option in the navigation pane.
On the Assets page, ML builders can search and discover data assets and ML assets without additional administrator overhead.
The search result displays all the assets corresponding to the search criteria, including a name and description. ML builders can further filter by the type of asset to narrow down their results. The following screenshot is an example of available assets from a search result.
Subscribe to assets
After ML builders discover the asset from their search results, they can choose the asset to see details such as schema or metadata to understand whether the asset is useful for their use case.
To gain access to the asset, choose Subscribe to initiate the request for access from the asset owner. This action allows data governance for the asset owners to determine which members of the organization can access their assets.
The owner of the asset will be able to see the request in the Incoming subscription requests section on the Assets page. The asset owners can approve or reject the request with justifications. ML builders will also be able to see the corresponding action on the Assets page in the Outgoing subscription requests section. The following screenshot shows an example of managing asset requests and the Subscribed assets tab. In the next steps, we demonstrate how a subscribed data asset like mkt_sls_table
and an ML asset like Customer-Churn-Model
are used within SageMaker.
Consume subscribed assets
After ML builders are approved to access the subscribed assets, they can choose to use Amazon SageMaker Canvas or JupyterLab within SageMaker Studio. In this section, we explore the scenarios in which ML builders can consume the subscribed assets.
Consume a subscribed Model Package Group in SageMaker Studio
ML builders can see all the subscribed Model Package Groups in SageMaker Studio by choosing Open in Model Registry on the asset details page. ML builders are also able to consume the subscribed model by deploying the model to an endpoint for prediction. The following screenshot shows an example of opening a subscribed model asset.
Consume a subscribed data asset in SageMaker Canvas
When ML builders open the SageMaker Canvas app from SageMaker Studio, they are able to use Amazon SageMaker Data Wrangler and datasets. ML builders can view their subscribed data asset to perform experimentation and build models. As part of this integration, ML builders can view their subscribed assets under sub_db
, and publish their assets via pub_db
.The created models can then be registered in the Amazon SageMaker Model Registry from SageMaker Canvas. The following screenshot is an example of the subscribed asset mkt_sls_table
for data preparation in SageMaker Canvas.
Consume a subscribed data asset in JupyterLab notebooks
ML builders can navigate to JupyterLab in SageMaker Studio to open a notebook and start their data experimentation. In JupyterLab notebooks, ML builders are able to see the subscribed data assets to query in their notebook and consume for experimentation and model building. The following screenshot is an example of the subscribed asset mkt_sls_table
for data preparation in SageMaker Studio.
Publish assets
After experimentation and analysis, ML builders are able to share the assets with the rest of the organization by publishing them to the Amazon DataZone business catalog. They can also make their assets only available to the project members by just publishing to the project inventory. ML builders can achieve these tasks by using the SageMaker SDK or publishing directly from SageMaker Studio.
You can publish ML assets by navigating to the specific asset tab and choosing Publish to asset catalog or Publish to inventory. The following screenshot show how you can publish feature group to asset catalog.
The following screenshot show how you can also publish model group to asset catalog or project inventory.
On the Assets page, you can use the data source feature to publish data assets like an AWS Glue table or Redshift table.
Conclusion
Governance is a multi-faceted discipline that encompasses controls across infrastructure management, data management, model management, access management, policy management, and more. ML governance plays a key role for organizations to successfully scale their ML usage across a wide range of use cases and also mitigate technical and operational risks.
The new SageMaker and Amazon DataZone integration enables your organization to streamline infrastructure controls and permissions, in addition to data and ML asset governance in ML projects. The provisioned ML environment is secure, scalable, and reliable for your teams to access data and ML assets, and build and train ML models.
We would like to hear from you on how this new capability is helping your ML governance use cases. Be on the lookout for more data and ML governance blog posts. Try out this new SageMaker integration for ML governance capability and leave your comments in the comments section.
About the authors
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, digital transformation, and enabling automation to improve overall organizational efficiency and productivity. He has over 7 years of automation experience deploying various technologies. In his spare time, Siamak enjoys exploring the outdoors, long-distance running, and playing sports.
Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on ML Observability and ML Governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.
Dr. Sokratis Kartakis is a Principal Machine Learning and Operations Specialist Solutions Architect at AWS. Sokratis focuses on enabling enterprise customers to industrialize their Machine Learning (ML) and generative AI solutions by exploiting AWS services and shaping their operating model, i.e. MLOps/FMOps/LLMOps foundations, and transformation roadmap leveraging best development practices. He has spent 15+ years on inventing, designing, leading, and implementing innovative end-to-end production-level ML and AI solutions in the domains of energy, retail, health, finance, motorsports etc.
Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 3-year-old Sheepadoodle.