AWS Machine Learning Blog
Build Custom SageMaker Project Templates – Best Practices
July 2023: This post was reviewed for accuracy.
SageMaker Projects give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD systems for MLOps engineers. With SageMaker Projects, MLOps engineers or organization admins can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With Projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker Projects are provisioned using AWS Service Catalog products. Organizations use project templates to provision Projects for each of their users.
This post describes how SageMaker Project templates can be customized to fit any organization’s use case. This GitHub repository contains examples of custom templates.
SageMaker Projects
Every organization has its own set of standards and practices that provide security and governance for their AWS environment. SageMaker provides a set of 1st party templates for organizations that want to quickly get started with ML workflows and CI/CD. Included in the templates are projects which use AWS native services for CI/CD such as AWS CodeBuild, AWS CodePipeline, and AWS CodeCommit and also projects that use third party tools such as Jenkins and GitHub.
Organizations often need tight control over the MLOps resources that are provisioned and managed. This includes configuring IAM roles and policies, enforcing resource tags, enforcing encryption, and decoupling resources across multiple accounts. To give organizations the flexibility to do this, SageMaker Projects support custom templates, where organizations use AWS CloudFormation scripts to define the resources needed for an ML workflow. These custom templates are created as AWS Service Catalog products and provisioned as Organization Templates in the SageMaker Studio UI, where data scientists can choose a template and have their ML workflow bootstrapped and pre-configured. AWS Service Catalog is an AWS service that enables organizations to create and manage catalogs of products that are approved for use on AWS. These products are created using CloudFormation templates.
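To make the idea of a custom template concrete, here is a minimal sketch of a CloudFormation template built as a Python dictionary. It assumes the template only needs a single S3 bucket; the function name `build_project_template` and the bucket naming scheme are hypothetical. The two parameters and the visibility tag reflect how SageMaker passes project details to a template and how a Service Catalog product is surfaced in Studio.

```python
import json


def build_project_template(bucket_suffix: str = "artifacts") -> dict:
    """Return a minimal CloudFormation template for a custom SageMaker Project.

    The bucket resource and its naming scheme are illustrative assumptions.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal custom SageMaker Project template (illustrative)",
        "Parameters": {
            # SageMaker supplies these two parameters when a project is created.
            "SageMakerProjectName": {"Type": "String"},
            "SageMakerProjectId": {"Type": "String"},
        },
        "Resources": {
            "ProjectBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketName": {
                        "Fn::Sub": "sagemaker-project-${SageMakerProjectId}-" + bucket_suffix
                    }
                },
            }
        },
    }


# Tag placed on the Service Catalog product so that it is listed under
# Organization Templates in SageMaker Studio.
STUDIO_VISIBILITY_TAG = {"Key": "sagemaker:studio-visibility", "Value": "true"}

if __name__ == "__main__":
    print(json.dumps(build_project_template(), indent=2))
```

A real template would add the repositories, pipelines, and roles the organization requires; the dictionary form simply makes it easy to generate variants per team.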
Understanding SageMaker 1P templates
To help our customers get started with common model building and deployment paradigms, SageMaker Projects offers a set of 1P templates. The 1P templates generally focus on creating resources for model building and model training.
To understand SageMaker Projects in detail, it helps to break it up into two components – AWS resources and project seed code.
The following sections refer to the 1P template named “MLOps template for model building, training, and deployment”.
AWS Resources
- Two CodeCommit repositories populated with seed code. One repository for model building and another for model deployment.
- Two CodeBuild stages that are triggered by CodePipeline –
- Model Building – this builds the code in the model building repository.
- Model Deployment – this builds the code in the deployment repository.
- Two CodePipeline Pipelines –
- Model Building Pipeline – This Pipeline will orchestrate the steps for Model Building. This includes pulling the latest code from CodeCommit and starting the CodeBuild execution.
- Model Deployment Pipeline – This Pipeline will orchestrate the steps for Model Deployment. This includes pulling code from CodeCommit, starting the CodeBuild execution, and any other steps defined in the model template.
- Three Amazon EventBridge rules –
- Trigger the Model Building Pipeline based on changes to the “modelbuild” repository
- Trigger the Model Deployment Pipeline based on changes to the “modeldeploy” repository
- Trigger the Model Deployment Pipeline when a model is “Approved” in the model registry associated with this project. If an endpoint is already deployed and the status of its model is changed from “Approved” to “Rejected”, the endpoint is torn down and the most recently “Approved” model before it is deployed.
- An Amazon S3 bucket for the project.
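The third EventBridge rule above reacts to status changes in the model registry. A minimal sketch of such a rule follows, assuming a boto3 EventBridge client is passed in. The function names are hypothetical, and filtering on `ModelApprovalStatus` is an assumption of this sketch rather than a statement of what the 1P template does.

```python
import json


def model_approval_event_pattern(model_package_group: str) -> dict:
    """Event pattern matching approval status changes for one model package group."""
    return {
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {
            "ModelPackageGroupName": [model_package_group],
            # Assumption: react to both approvals and rejections.
            "ModelApprovalStatus": ["Approved", "Rejected"],
        },
    }


def put_model_approval_rule(events_client, rule_name: str, model_package_group: str) -> str:
    """Create or update the rule and return its ARN.

    `events_client` is expected to behave like boto3.client("events").
    """
    response = events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(model_approval_event_pattern(model_package_group)),
        State="ENABLED",
    )
    return response["RuleArn"]
```

The rule still needs a target (for example, the deployment pipeline), which is added separately.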
Seed Code
The CodeCommit repositories created by the template are pre-populated with seed code for model building and model deployment.
- Model Building Seed Code
The seed code for the model building pipeline includes a SageMaker Pipeline that preprocesses the Abalone dataset, trains an XGBoost model, evaluates the performance of a model using a processing step, and registers the model into a model registry based on the model performance. A file named codebuild-buildspec.yml is part of the seed code in the repository. This file describes the steps in CodeBuild for triggering the SageMaker Pipeline. Data Scientists would make changes to the code in this repository to reflect their use cases. The seed code is provided as a starting point to help quickly onboard new ML use cases.
- Model Deployment Seed Code
The seed code for model deployment includes a CodeBuild step to find the latest model that has been approved in the model registry and create configuration files to deploy the CloudFormation templates. The CloudFormation template deploys the model into either the “staging” or “prod” stage. A file named buildspec.yml is part of this repository, which describes the steps taken to find the latest model to deploy and create the CloudFormation scripts to create the SageMaker Endpoint. MLOps Engineers would update the code in this repository to change the specific deployment logic. Note that changes to the deployment stages (staging, prod, etc.) would require a custom template, since it is a change in CodePipeline, not the seed code.
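The deployment seed code's first task, finding the latest approved model, can be sketched with the SageMaker `ListModelPackages` API. This is an illustration under the assumption that a boto3 SageMaker client is passed in; the function name is hypothetical and this is not the seed code itself.

```python
def latest_approved_model_package_arn(sm_client, model_package_group: str):
    """Return the ARN of the newest approved model package, or None if there is none.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    return summaries[0]["ModelPackageArn"] if summaries else None
```

Returning `None` when nothing is approved lets the calling build step decide whether to fail or skip the deployment.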
Triggering the Pipeline
Model building process
- Data scientist(s) update the model building logic (i.e., preprocessing, training scripts etc.) and push a commit to Repository #1 in CodeCommit.
- An EventBridge rule triggers the modelbuild CodePipeline pipeline.
- In the pipeline, a CodeBuild project builds a SageMaker pipeline (this can be replaced with any model training code) and executes the workflow, i.e., trains a model and registers the model to SageMaker model registry.
The model can then be approved or rejected in the registry. Organizations should determine how their model approval takes place and have processes that govern which users are allowed to approve models in the registry.
Model deployment process
- MLOps engineers update model deployment code and push a commit to Repository #2 on CodeCommit, or a model is approved in the model registry.
- EventBridge rules trigger the model deploy CodePipeline pipeline.
- CodeBuild project creates configurations for Staging and Production to be used by CloudFormation templates.
- In the “DeployStaging” stage –
- A SageMaker endpoint is created in “Staging” through CloudFormation
- A CodeBuild project tests the endpoint for an ‘InService’ status.
- If the test passes, a manual approval step is triggered.
- The designated user in the organization will manually approve the model in CodePipeline
- In the “DeployProd” stage, the SageMaker endpoint is deployed to “Production”. Note that for the 1P template, both endpoints are deployed in the same region, with different suffixes (Staging or Prod).
If the model is rejected, no action is taken.
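The staging test that checks the endpoint for an ‘InService’ status can be approximated by polling `DescribeEndpoint`. The sketch below assumes a boto3 SageMaker client is passed in; the function name, polling interval, and retry count are hypothetical choices.

```python
import time


def wait_until_in_service(sm_client, endpoint_name: str,
                          poll_seconds: int = 30, max_polls: int = 40,
                          sleep=time.sleep) -> bool:
    """Poll the endpoint until it is InService; return False on failure or timeout.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return True
        if status == "Failed":
            return False
        sleep(poll_seconds)  # still creating or updating; wait and retry
    return False
```

A CodeBuild test step could call this function and exit with a non-zero status when it returns `False`, which blocks the manual approval step.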
When this 1P SageMaker project is first created, each pipeline runs automatically as the seed code is pushed into these repositories for the first time. As users subsequently update the code in either of the repositories to meet their use case, the pipelines are re-triggered. Some use cases fit within the paradigm offered by the 1st party templates, making these templates a great way to seed ML projects. For example, adding steps to the pipeline, changing the dataset, updating the properties of the deployed endpoint (such as adding DataCaptureConfig), and using GitHub or Jenkins instead of the AWS native counterparts are all achievable using the 1st party templates. For more information on the 1st party SageMaker Project templates, visit the SageMaker Projects documentation.
Why Create Custom Templates
Organizations may want to extend the 1st party templates to support use cases beyond simply training and deploying a model. Custom project templates are a way for organizations to create a standard workflow for machine learning projects. Organizations can create several templates and use IAM policies to manage access to those templates in SageMaker Studio, ensuring that each of their users accesses projects dedicated to their use cases.
Here are some common scenarios in which an organization would need to create a custom project template:
Scenario | Current 1P offering | Organization’s use case |
Using SVC systems not supported in the 1st party templates | 1st party templates rely on AWS CodeStar Connections to authenticate with the repository, limiting the 1st party templates to CodeCommit, GitHub, BitBucket, and GitHub Enterprise. | Organizations may have version control systems other than the ones currently offered by 1P templates. |
Using a multi-account strategy for model training and deployment. | Currently 1P templates do model training and deployment all in the same account. | Organizations may want to use a multi-account strategy as a best practice with dedicated training accounts and staging/production accounts for deployments. |
Custom approval workflows for model deployment. | 1P templates have 2 approval steps – Approval in the Model Registry, and manual approval in the CI/CD Pipeline | Organizations may have multiple approval steps that need to take place before a model can be deployed. |
Multiple deployment stages | 1P templates deploy models in 2 stages – staging and production. Both endpoints are deployed in the same account. | Organizations may have more than 2 stages (staging, pre-production, production) for deployment. |
Multiple code branches for experimentation | 1P templates assume only a single branch in the repository used | Organizations may have multiple users working on the same repository where each of them works in individual branches for experimentation with the main branch having the best version of the training pipeline |
Custom hosting options | 1P templates use SageMaker hosted endpoints only | Organizations may want to leverage a variety of SageMaker hosting options – MME, Edge etc. |
Single pipeline for multiple use cases | 1P templates use a single SageMaker Pipeline that trains a model | Organizations may have to use a SageMaker Pipeline to train and register multiple models each with its own evaluation criteria to limit the number of pipelines to manage |
Development and production pipelines | 1P templates use a single pipeline for development and production | Organizations may want to first test their pipeline in a development environment and then use a CI/CD process to create and run that pipeline in a production environment |
Using custom seed code | 1P templates have standard seed code for model building and deployment | Organizations may want to provide their developers a set of custom seed code particular to use cases they work on |
Using SageMaker Studio in a VPC | 1P templates use SageMaker in internet-mode | Organizations may be using SageMaker in vpc-only mode |
Best Practices for Designing a Custom Project Template
In this section, organizations will see how they can build their own custom templates and the considerations to account for when designing templates of their own.
Source Version Control Integration
The 1P templates use AWS CodeStar Connections to manage authentication between the Project and the repository so that the seed code can be pushed to it. This method supports GitHub, GitHub Enterprise, and BitBucket in addition to the AWS native repository, CodeCommit. If organizations want to use different repositories, a different authentication mechanism needs to be provided in the Project. A recommended approach is to use an AWS Lambda function with AWS Secrets Manager to authenticate with the repository and push the seed code. Once the seed code has been pushed, authentication with the repository from SageMaker Studio happens via the repository username and password. The Lambda and Secrets Manager method is meant only for pushing the seed code to the repository when the project is created. Alternative strategies to push seed code into the repository can be explored based on the organization’s repository, authentication mechanism, use case, etc.
The seed code pushed to the repository should be customized to support the use case for the project.
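As an illustration of the Lambda and Secrets Manager approach, the sketch below reads repository credentials from a secret and builds an authenticated HTTPS URL that a Lambda function could use to push seed code. It assumes the secret is a JSON string with hypothetical keys `username` and `token`; the function name and repository URL are also hypothetical, and the actual push mechanism depends on the organization's version control system.

```python
import json
from urllib.parse import quote, urlsplit, urlunsplit


def authenticated_repo_url(secrets_client, secret_id: str, repo_url: str) -> str:
    """Embed credentials stored in Secrets Manager into an HTTPS repository URL.

    `secrets_client` is expected to behave like boto3.client("secretsmanager").
    Assumes the secret is JSON with "username" and "token" keys (hypothetical).
    """
    secret = json.loads(secrets_client.get_secret_value(SecretId=secret_id)["SecretString"])
    parts = urlsplit(repo_url)
    # Percent-encode credentials so special characters do not break the URL.
    credentials = f"{quote(secret['username'], safe='')}:{quote(secret['token'], safe='')}"
    return urlunsplit(parts._replace(netloc=f"{credentials}@{parts.netloc}"))
```

The resulting URL should be kept in memory only and never logged, since it contains the credentials.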
Enabling CI/CD
In the SageMaker Project, the CI/CD tool used will be responsible for triggering the model training and deployment process. When the status of a model is changed in the Model Registry, an EventBridge notification is emitted which can alert the CI/CD tool to begin deployment. Similarly, the CI/CD tool will need to use SageMaker’s API to start the SageMaker Pipeline execution when a change is made to the model building repository. In the 1P template that uses Jenkins for CI/CD, the Jenkins Pipeline is triggered by pushes to the SVC repository. The CI/CD Pipeline uses AWS CLI commands to start a SageMaker Pipeline for model training and run CloudFormation scripts for deploying the model to the endpoint.
In cases where organizations want to use CI/CD tools other than those supported in the 1P templates (Jenkins, CodePipeline), they should make sure their repository can trigger their CI/CD Pipeline and that AWS CLI commands can be invoked from the CI/CD pipeline so the relevant AWS services can be called (SageMaker Pipelines, CloudFormation, etc.). In the 1P template, a Lambda function is used to trigger the Jenkins/CodePipeline pipeline when a model in the Model Registry is approved; the same can be done when using other CI/CD tools.
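A CI/CD job in any tool can start model training by calling the SageMaker `StartPipelineExecution` API. The sketch below uses a boto3-style client instead of the AWS CLI mentioned above; the function name and the idea of labelling the execution with the commit ID are assumptions for illustration.

```python
def start_training_pipeline(sm_client, pipeline_name: str, commit_id: str) -> str:
    """Start a SageMaker Pipeline execution and return its ARN.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    response = sm_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        # Label the execution with the short commit ID for traceability.
        PipelineExecutionDisplayName=f"commit-{commit_id[:7]}",
    )
    return response["PipelineExecutionArn"]
```

The CI/CD job needs credentials for an IAM role that allows `sagemaker:StartPipelineExecution` on the pipeline.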
Identify IAM Roles and Actors
SageMaker Projects require a set of IAM roles that fall under two categories:
- Launch Roles – Used to define a constraint in Service Catalog that forces the underlying product to be provisioned using the designated LaunchRole. This allows developers to create projects using templates without needing their SageMaker Execution Role to have all the policies needed to launch the Project. Service Catalog assumes the launch constraint role while creating the project so that the developers using the project can have their roles limited to the specific policies they need.
- Use Roles – Used within the template by each resource for the required operations. For each operation in the product template, the Use Role is assumed by the respective AWS Service Principal.
Roles in the 1P templates:
The 1P templates use the following AWS managed roles.
- Launch Roles – The Amazon Managed Policy AmazonSageMakerServiceCatalogProductsLaunchRole is used by the 1st party templates for the Service Catalog launch constraint.
- Use Role – AmazonSageMakerServiceCatalogProductsUseRole is used by the resources created in the 1st party templates.
Roles in the custom template:
In case of custom templates, the LaunchRole needs to be updated to have enough permissions to deploy all resources in the CloudFormation template; and the UseRole needs to have all the associated services in its trust policy so the services can assume the right role. Customers can define a UseRole for each service instead of a single role for all services.
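The trust policy requirement for the UseRole can be sketched as follows. The function name is hypothetical, and the list of service principals in the usage example is an assumption about which services a given template involves.

```python
def use_role_trust_policy(service_principals) -> dict:
    """Build an IAM trust policy that lets the given AWS services assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": sorted(service_principals)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Example: one shared UseRole for the services used by a hypothetical template.
SHARED_USE_ROLE_TRUST = use_role_trust_policy([
    "sagemaker.amazonaws.com",
    "codebuild.amazonaws.com",
    "codepipeline.amazonaws.com",
    "cloudformation.amazonaws.com",
])
```

To define one UseRole per service instead, call the function once per principal and attach only the permissions that service needs to each role.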
To get started, identify the user personas associated with the application in addition to the administrator (for example, data scientists and the MLOps team), and design IAM policies with least-privilege access for each user and for services such as SageMaker Studio and Service Catalog. See Actions, resources, and condition keys for AWS services for an exhaustive list of IAM policies. A sample set of IAM roles is:
- Administrator: This user will be responsible for creating the custom template and provisioning it for the data scientist users. The required privileges include access to Service Catalog, IAM (to create roles and policies), and SageMaker at the minimum.
- Data Scientist: This user will be launching SageMaker Studio, creating notebooks and deploying project templates, running processing, training jobs etc. For isolation of permissions, this user role only needs access to open a SageMaker Studio user profile, and the remaining permissions can be handled through the notebook’s execution role.
- Lead Data Scientist: This user will be in charge of approving deployment of resources in Production. This role could typically be the team lead, who will validate the staging resources, verify all conditions are met and approve the changes to production. This additional user provides auditability and adds a manual check before making any changes in the live environment.
- Studio notebook execution role: This is the service role assumed by the SageMaker Studio user. It typically includes higher-level permissions for SageMaker, with additional access to Elastic Container Registry, specific S3 buckets, CodePipeline, CodeBuild, CloudWatch logs, etc. as necessary. To be able to list and create a SageMaker Project, the role also needs access to Service Catalog.
- Custom template launch role and use role: Roles assumed by SageMaker when launching the custom project template, and to create the resources deployed by the template. Adding a distinction between the notebook’s role and the template’s launch role allows the administrators to limit the end users’ permission to the minimum that they require for each product. See AWS Service Catalog Launch Constraints for detailed documentation.
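Attaching a launch role to a custom template is done with a Service Catalog launch constraint. The sketch below assumes a boto3 Service Catalog client is passed in and that the portfolio, product, and role already exist; the function name is hypothetical.

```python
import json


def add_launch_constraint(sc_client, portfolio_id: str, product_id: str,
                          launch_role_arn: str) -> str:
    """Create a LAUNCH constraint for a product and return the constraint ID.

    `sc_client` is expected to behave like boto3.client("servicecatalog").
    """
    response = sc_client.create_constraint(
        PortfolioId=portfolio_id,
        ProductId=product_id,
        Type="LAUNCH",
        # The constraint parameters are passed as a JSON string.
        Parameters=json.dumps({"RoleArn": launch_role_arn}),
    )
    return response["ConstraintDetail"]["ConstraintId"]
```

With the constraint in place, Service Catalog provisions the product with the launch role instead of the permissions of the user who creates the project.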
Model deployment strategies
Based on the hosting option that’s right for the use case, the deployment components of the template should be updated to use that hosting option. For example, if the model needs to be deployed to a multi-model endpoint, the CloudFormation scripts should be updated to reflect that. If a Serial Inference Pipeline is used, a PipelineModel should be registered to the Model Registry by the training pipeline, and CloudFormation is used to deploy the PipelineModel to a SageMaker Endpoint. Similarly, the template can be modified to support compiling a model for Edge deployment using SageMaker Neo.
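For the multi-model endpoint case, the change to the CloudFormation scripts centers on the model resource. A minimal sketch of that resource as a Python dictionary follows; the function name and argument values are hypothetical, and the endpoint configuration and endpoint resources are omitted.

```python
def multi_model_resource(image_uri: str, model_prefix_s3_uri: str,
                         execution_role_arn: str) -> dict:
    """Build an AWS::SageMaker::Model resource configured for multi-model hosting."""
    return {
        "Type": "AWS::SageMaker::Model",
        "Properties": {
            "ExecutionRoleArn": execution_role_arn,
            "PrimaryContainer": {
                "Image": image_uri,
                # MultiModel mode serves many model artifacts from one container.
                "Mode": "MultiModel",
                # For multi-model hosting this is an S3 prefix, not a single archive.
                "ModelDataUrl": model_prefix_s3_uri,
            },
        },
    }
```

The training pipeline would then write each model artifact under the S3 prefix so that the endpoint can load it on demand.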
In addition to the hosting option, the approval strategy should be coded into the custom template. In the 1P templates as described above, the approval process for deployment happens in 2 steps. The first is approving the model in the Model Registry, the second is a manual approval step in CodePipeline or Jenkins. This may not fit into the approval governance mechanisms in place for organizations when they deploy models. An example of a different approval mechanism could be to restrict the users that can update the Model Registry model status so only MLOps Engineers can update the status to “Approved”. Once approved, the CI/CD pipeline can have a step that checks for certain integration tests to be completed along with manual approval from an account admin before deploying the model. Such approval workflows can be designed in the custom template to define a standard practice for deployment across the organization.
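One way to restrict who can change a model's approval status is an IAM policy that denies `sagemaker:UpdateModelPackage` to roles other than the MLOps role. The sketch below builds such a statement for a single model package group; the function name is hypothetical, and attaching it to the data scientist role is an assumption about how the organization separates duties.

```python
def deny_model_approval_policy(region: str, account_id: str,
                               model_package_group: str) -> dict:
    """Build an IAM policy denying status updates for one model package group."""
    resource = (
        f"arn:aws:sagemaker:{region}:{account_id}:"
        f"model-package/{model_package_group}/*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyModelStatusUpdates",
                "Effect": "Deny",
                # Approval status is changed through UpdateModelPackage.
                "Action": "sagemaker:UpdateModelPackage",
                "Resource": resource,
            }
        ],
    }
```

Roles that carry this policy can still register and view models but cannot approve or reject them.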
Lastly, the 1P templates operate within a single account. In accordance with CI/CD best practices, organizations may have a multi-account strategy with dedicated dev, staging, and prod accounts. In this case, customers can use AWS Organizations to manage those accounts and cross-account CloudFormation stacks to handle the deployment. The following diagram illustrates how this can be set up.
For detailed instruction on how to set up multi-account deployment using SageMaker Projects, refer to this blog.
Security, Encryption, & Tagging
- Security and Encryption
Security is the highest priority at AWS. While creating CI/CD workflows for machine learning, it is imperative to understand how to secure confidential data used for training. AWS recommends AWS KMS-CMK encryption of storage, i.e., S3 buckets, objects, SageMaker Studio volume etc. S3 versioning enables recovery of objects in case of an accidental delete, and S3 logging delivers access logs for the bucket to a target bucket. Create the SageMaker Studio domain in a secure VPC and use VPC endpoints to avoid data transfer through public internet. See the whitepaper Build a Secure Enterprise Machine Learning Platform on AWS for best practices.
- Tagging
Tags are used to organize resources by users, teams, projects or departments, and track your AWS costs on a detailed level. In addition, tags can be used to enforce IAM policies, for example, isolating resources between two teams as described in the blog post Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation. For such reasons, you can enforce tagging on any user created SageMaker Project through IAM policies. To the data scientist role, add the following IAM policy statement:
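A minimal sketch of a tag-enforcing statement is shown here as a Python dictionary. It assumes a required tag key named `team` (hypothetical) and denies `sagemaker:CreateProject` whenever that tag is missing from the request; the exact statement an organization uses may differ.

```python
def enforce_project_tag_statement(required_tag_key: str = "team") -> dict:
    """Build an IAM statement that denies project creation without a given tag.

    The default tag key "team" is a hypothetical example.
    """
    return {
        "Sid": "DenyProjectCreationWithoutTag",
        "Effect": "Deny",
        "Action": "sagemaker:CreateProject",
        "Resource": "*",
        "Condition": {
            # The Null operator evaluates to true when the tag is absent from the request.
            "Null": {f"aws:RequestTag/{required_tag_key}": "true"}
        },
    }
```

The statement would be placed in the `Statement` list of a policy attached to the data scientist role.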
When you create a custom product, you can also use the TagOption Library to enforce the values for each tag. When a tag is specified for a Project, SageMaker propagates the tags to all its resources.
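Restricting tag values through the TagOption Library can be sketched with two Service Catalog API calls. The example assumes a boto3 Service Catalog client is passed in and that the TagOptions do not exist yet (creating a duplicate key and value pair raises an error); the function name and tag values are hypothetical.

```python
def enforce_tag_values(sc_client, product_id: str, key: str, allowed_values) -> list:
    """Create one TagOption per allowed value and associate each with the product.

    `sc_client` is expected to behave like boto3.client("servicecatalog").
    Returns the IDs of the TagOptions that were created.
    """
    tag_option_ids = []
    for value in allowed_values:
        detail = sc_client.create_tag_option(Key=key, Value=value)["TagOptionDetail"]
        sc_client.associate_tag_option_with_resource(
            ResourceId=product_id,
            TagOptionId=detail["Id"],
        )
        tag_option_ids.append(detail["Id"])
    return tag_option_ids
```

For example, calling `enforce_tag_values(client, product_id, "team", ["forecasting", "nlp"])` would limit the `team` tag on that product to those two values.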
In an organizational setting, you can also create multiple Project templates (Service Catalog Products) for different teams, and restrict access for each team to their corresponding template using IAM policies.
Applying these best practices to a use case
This table describes how the best practices described above can help solve a variety of use cases where custom project templates are created.
Scenario | Proposed solution |
Using SVC systems not supported in the 1st party templates | The 1st party templates can be customized to use custom authentication mechanisms like Lambda functions with AWS Secrets Manager or any other way to access code in the repository. Refer to Source Version Control in the section Best Practices for Designing a Custom Project Template. Here is an example of this. |
Using a multi-account strategy for model training and deployment. | Use AWS Organizations to manage multiple accounts and cross-account CloudFormation stacks to manage deployment of models in multiple accounts. Refer to “Model Deployment Strategies” in the section Best Practices for Designing a Custom Project Template. |
Custom approval workflows for model deployment. | Add unit tests, integration tests, additional manual approval steps, multiple evaluation steps in the training pipeline, etc. to have a robust model approval strategy. Refer to “Model Deployment Strategies” in the section Best Practices for Designing a Custom Project Template. |
Multiple deployment stages | The CI/CD Pipeline (CodePipeline, Jenkins etc) needs to be updated with all the deployment steps required. One step for each stage of deployment. Refer to “Model Deployment Strategies” in the section Best Practices for Designing a Custom Project Template. |
Multiple code branches for experimentation | A custom template could be created where each time a new branch in SVC is pushed for experimentation, a SageMaker Pipeline for that branch is created and executed. Here is an example of a custom template to enable this strategy. |
Custom hosting options | A custom template can be created that changes the CloudFormation scripts used for endpoint deployment and the deployment stages can be updated to suit the hosting option selected. Refer to “Model Deployment Strategies” in the section Best Practices for Designing a Custom Project Template. |
Single pipeline for multiple use cases | The project seed code can be updated to have a single SageMaker Pipeline train multiple models serving multiple use cases. This is useful, for example, when a single dataset is used to train multiple models, each on a different subset of the data. It avoids the need to manage multiple pipelines and reduces the number of data preparation steps needed. |
Development and production pipelines | A custom template can be created that creates a SageMaker Pipeline from a Pipeline definition in a production environment when the definition is pushed to the SVC repository. This way, data scientists can test their pipeline in a dev environment, iterate over it until the pipeline reaches the desired state, push the pipeline definition file to the repository, have the CI/CD pipeline create a new pipeline using the same definition in the production environment, and start its execution. |
Using custom seed code | A custom template can be created that pulls code from a location that hosts the custom seed code for the project. This can be an organization-managed S3 bucket or repository. Refer to Source Version Control in the section Best Practices for Designing a Custom Project Template. |
Using SageMaker Studio in a VPC | A custom template can be created that has access to the bucket with seed code through a VPC endpoint. Without this, when the project is created, the seed code will not be available to populate the repository. Refer to Security, Encryption, & Tagging in the section Best Practices for Designing a Custom Project Template. |
Conclusion
Using the best practices and guidance described here, organizations can provide their users with standardized workflows for ML that help boost productivity and ensure compliance with organization standards.
Visit this GitHub repository for an example on building your own template and contribute to the repository with custom templates of your own!
About the Authors
Kirit Thadaka is an ML Solutions Architect working in the SageMaker Service SA team. Prior to joining AWS, Kirit spent time working in early-stage AI startups followed by some time in consulting in various roles in AI research, MLOps, and technical leadership.
Durga Sury is a Data Scientist in the Energy Delivery team in Professional Services. Before AWS, she helped non-profit and government agencies derive insights from their data to improve education outcomes. At AWS, she focuses on Natural Language Processing and MLOps.