AWS Open Source Blog
Using Kedro pipelines to train Amazon SageMaker models
Machine learning (ML) and artificial intelligence (AI) adoption is growing at nearly 25 percent per year in a variety of businesses, which results in data scientists and engineers building more analytical models per person with similar levels of resources as last year. To keep up with such high demand, builders need to remove manual and repetitive tasks from the model development workflow through automation and standardization.
It’s no surprise that tools like Kedro, which help builders apply software engineering best practices to data and machine learning pipelines, have rapidly gained popularity. Pipelines are a software concept where code is used to create a repeatable sequence of steps so that the pipeline builder no longer needs to manually figure out what’s next. In this article, we’ll show how building a Kedro pipeline that trains your model on Amazon SageMaker will help you create reproducible models and save time in the process.
What is Kedro?
Kedro is an Apache 2.0 licensed open source Python framework that applies software engineering best practices to data and machine learning pipelines. You can use it, for example, to optimize the process of taking a machine learning model into a production environment. You can use Kedro to organize a single-user project running on a local environment, or collaborate in a team on an enterprise-level project.
Kedro focuses on the development workflow to make organizing, building, and running data projects stress-free. Kedro aims to solve several common challenges:
- Maintainability–Project standardization makes projects easier to understand and update because the structure is the same between projects.
- Reusability–Pipeline abstraction allows builders to reuse parts of their pipeline in the future, thereby shortening the build cycle.
- Collaboration–Best practices built into the workflow let builders with varied levels of experience work together more effectively by guiding the development process.
- Efficient use of time–When the above challenges are addressed, builders can move more quickly from experimentation to production.
Kedro gets our project organized and consistent in a few different ways. For example, using customizable templates we can generate a standard project skeleton with logical folders for assets including datasets, configuration, notebooks, and application source code. Once the skeleton setup is complete, we can use Kedro to construct pipelines of executable nodes that form the basis of our project. When building nodes and pipelines, we can use features such as the data catalog, pipeline journal, and others to construct production-ready data pipelines from day one.
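To make the idea of "pipelines of executable nodes" concrete, here is a minimal, standard-library-only sketch of the concept. It is not Kedro's API: the `Node` tuple, the `run_pipeline` helper, and the dataset names are all hypothetical and exist only to show how declaring each step's inputs and outputs lets a runner work out the execution order.

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    """One pipeline step: a function plus the dataset names it reads and writes."""
    func: Callable
    inputs: List[str]
    outputs: List[str]


def run_pipeline(nodes: List[Node], catalog: Dict[str, object]) -> Dict[str, object]:
    """Repeatedly run every node whose inputs are available until none remain."""
    data = dict(catalog)
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("unresolvable inputs; check how the nodes are wired")
        for n in ready:
            result = n.func(*(data[name] for name in n.inputs))
            if len(n.outputs) == 1:
                result = (result,)  # normalise single outputs to a tuple
            data.update(zip(n.outputs, result))
            pending.remove(n)
    return data


def clean(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]


def mean(values):
    """Average the cleaned values."""
    return sum(values) / len(values)


# Nodes are declared out of order on purpose; the runner resolves the order.
nodes = [
    Node(mean, ["clean_prices"], ["average_price"]),
    Node(clean, ["raw_prices"], ["clean_prices"]),
]
results = run_pipeline(nodes, {"raw_prices": [10.0, None, 30.0]})
```

Because the order is derived from the declared inputs and outputs rather than from the order of the list, adding or replacing a step (as we do later with the training node) only requires wiring its names correctly.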
Prerequisites
To complete this tutorial, you will need:
- An AWS account
- The AWS Command Line Interface v2 (AWS CLI v2) installed and configured
- Python 3.7 installed
- Git installed
The code examples shown in this article are from macOS. If you’re a Windows user, you’ll need to translate commands and path references used on the terminal to Windows format; the YAML and Python code should remain the same.
Installing Kedro
Kedro can be installed using Pip or Conda, which are package managers for Python. But, first, we’ll create a virtual environment to keep our project isolated and clean from other dependencies. The following steps will create a new directory for our project, create the Python virtual environment, activate it, and then finally install Kedro in that virtual environment:
If you’re not familiar with Kedro and would like to start from scratch, the documentation offers a great full tutorial that walks through the development workflow of a Kedro project, including pipeline creation, nodes, data catalog specification, testing, packaging, and more. We’ll be working through the completed version of that tutorial, so if this is your first exposure to Kedro, you may want to start there.
Starting our Kedro project
To get up and running quickly, we’ll use a “Kedro Starter,” which is a template that contains both configuration and code that can be run as-is or extended. The starter includes a sample dataset and a fully configured pipeline and code, so we can start exploring immediately. Using a starter project will help us jump to step three in the Kedro workflow and begin to make some modifications to the starter project.
Running a sample pipeline
Once Kedro is installed in our virtual environment, we’ll create a new project using the kedro new command and pass the path of the starter template. We’ll use the “Space Flights” starter for our examples, which will create a linear regression model that predicts the cost of space flights from the included sample data sets. When you run kedro new, you’ll be prompted for a project name, repository name, and Python package name. Enter spaceflights for all three and then change to the directory specified at the end of the output.
Looking inside the spaceflights directory at the content of our new project, we can see that the organizational structure that forms the base of all Kedro projects automatically includes best practices such as data, configuration, and source code separation.
The Space Flights project we’re using contains two pre-built pipelines: data_engineering and data_science, which process the raw input data from the data folder and build a SKLearn linear regression model, respectively. The linear regression model attempts to predict the cost of taking a space flight using the input data.
Because this is our first time running the project, we need to perform a few one-time project setup tasks. To begin, we’ll initialize the spaceflights folder as a Git repository so we can track our changes. Thankfully, a .gitignore file is automatically created for us by Kedro. Next, we’ll install the requirements.txt from the src directory to our virtual environment:
With our prerequisite setup finished, we can now run our Kedro pipeline as-is via the kedro run command, which will execute all the nodes in the pipeline:
The pipeline output will look similar to the following; you can ignore any warnings:
From the output above, we can see that our pipeline completed successfully. The last three lines show that the default model had a coefficient of determination (R^2) of 0.456 (your number might differ slightly), that “6 out of 6 tasks” were completed, and that the “pipeline execution completed successfully.”
This output tells us that every node in the default pipeline ran including the training and testing of the SKLearn model provided from the Space Flights starter. But, what happens if we have more data than our laptop can reasonably handle? What if we need to retrain our model when we receive new data or when a specific event occurs? Let’s explore how training models on SageMaker through our Kedro pipeline helps address these issues.
Updating the Space Flights pipeline with SageMaker
We can visualize our current pipeline using Kedro-Viz, a plugin for Kedro already installed in our environment. Kedro-Viz uses the configuration and code of our Kedro data pipeline to produce an interactive rendering of the data pipeline, allowing us to explore the nodes, dependencies, outputs, and data sets defined in the pipeline. Generating the visualization is done by running kedro viz from the root of the project:
kedro viz
Kedro-Viz will then start a local webserver that hosts the interactive visualization of our pipeline. We can access the webserver at the URL listed in the output of the kedro viz command, typically http://127.0.0.1:4141/:
Exploring the visualization, we can see the node Train Model is currently outputting Regressor, which is the SKLearn model, so we know Train Model is the node we’ll need to update if we want to use SageMaker to train our model. Once you are done exploring, you can exit Kedro-Viz and cancel the process in the terminal.
Before we make the change to our node, we need to create an Amazon Simple Storage Service (Amazon S3) bucket for our data and objects and an IAM role to allow SageMaker to perform operations on our behalf.
Note: This guide assumes your AWS user has Administrator permission in your account. To limit permissions in your live environment, you can use IAM last accessed information to help you grant least-privilege access, a critical security best practice.
To begin, create an Amazon S3 bucket with sagemaker in the name and take note of the full name; be sure you do not allow public access to the bucket. The bucket name must contain sagemaker because the managed policy that we attach to our IAM role, AmazonSageMakerFullAccess, looks for that exact string in the bucket name to allow access to the bucket. An example S3 bucket name could be xyz-123-sagemaker. Be sure the bucket you create is in the same Region where you plan to use SageMaker, for example us-east-1.
Next, create an IAM Role that will allow SageMaker to perform operations on your behalf and note its name. As a managed service, SageMaker can perform only the operations that the user permits. In this example, we’ll use the managed policy AmazonSageMakerFullAccess. In your Production environment, however, you should restrict the permissions to only what is needed for your use case. Refer to the best practices documentation to understand granting least privilege in your account.
To create a new IAM role:
1. Sign in to the AWS Management Console and open the IAM console.
2. In the left navigation pane, choose Roles.
3. Choose Create role.
4. For role type, choose AWS Service, find and choose SageMaker from the use case list and then choose the SageMaker – Execution use case. Then, choose Next: Permissions.
5. On the Attach permissions policy page, confirm that the policy AmazonSageMakerFullAccess is listed. Choose Next: Tags, then Next: Review.
6. For Role name, enter a name for your role, for example: AmazonSageMakerFullAccessRole.
Once we’ve created the SageMaker IAM role, we can begin updating our Space Flights project to train the model on SageMaker.
We’ll be using the Amazon SageMaker Python SDK and S3FS in our pipeline, so we need to add them to the kedro-sagemaker/spaceflights/src/requirements.txt file so they are available to our nodes.
From the kedro-sagemaker/spaceflights directory, run kedro install --build-reqs to update your Python virtual environment with the SageMaker SDK and S3FS added to the requirements.txt file. This step may take a few minutes as it downloads, installs, and builds all the requirements for our project.
Configuring the pipeline
Once we have S3FS installed, we can use Amazon S3 to store the data that we’ll use for training models on SageMaker. For this example, we’ll focus primarily on the data_science pipeline because that’s where model training takes place; however, we can use S3 as part of the data_engineering pipeline as well.
Looking at kedro-sagemaker/spaceflights/src/spaceflights/pipelines/data_science/pipeline.py, we see that the split_data node takes in master_table and parameters as input and then outputs four lists: X_train, y_train, X_test, and y_test:
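As a rough illustration of what a node like split_data does, here is a standard-library-only sketch. It is an assumption-laden stand-in, not the starter's code: the starter's own implementation may differ (for example by using pandas and scikit-learn), and the `target` parameter, the row-of-dicts table format, and the return order (taken from the sentence above) are choices made for this example.

```python
import random
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_data(
    master_table: List[Row], parameters: Dict[str, Any]
) -> Tuple[List[Row], List[Any], List[Row], List[Any]]:
    """Shuffle rows reproducibly, then split them into train and test parts."""
    rows = list(master_table)  # copy so the caller's table is left untouched
    random.Random(parameters["random_state"]).shuffle(rows)

    n_test = round(len(rows) * parameters["test_size"])
    test_rows, train_rows = rows[:n_test], rows[n_test:]
    target = parameters["target"]  # hypothetical parameter naming the label column

    def features(row: Row) -> Row:
        """Return every column except the target."""
        return {key: value for key, value in row.items() if key != target}

    X_train = [features(r) for r in train_rows]
    y_train = [r[target] for r in train_rows]
    X_test = [features(r) for r in test_rows]
    y_test = [r[target] for r in test_rows]
    return X_train, y_train, X_test, y_test


# Hypothetical ten-row table with one feature column and a "price" target.
table = [{"passengers": i, "price": 100.0 * i} for i in range(10)]
params = {"test_size": 0.2, "random_state": 3, "target": "price"}
X_train, y_train, X_test, y_test = split_data(table, params)
```

Seeding the shuffle from the parameters is what makes the split repeatable from run to run, which matters once the training data is written to S3 and consumed by a separate training job.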
The current inputs and outputs are defined in kedro-sagemaker/spaceflights/conf/base/catalog.yml, where they are all specified to be stored locally on your computer. We’ll create a new directory named sagemaker under the kedro-sagemaker/spaceflights/conf directory to house a new configuration set, which will specify that we want to store our test and train data on S3. Our conf directory structure should now look like this:
conf
├── base
├── local
└── sagemaker
Create the following files under the new conf/sagemaker directory:
catalog.yml
The catalog file will specify that the X_train and y_train objects will now be stored as PickleDataSet objects on S3. You can find more information on the data storage options in the Kedro Data Catalog documentation.
parameters.yml
The parameters file allows us to create configuration values that can be passed as variables when executing pipelines. We can change or override these parameters at runtime to fit different environmental needs. In this example, we’ve specified that SageMaker should use Managed Spot Instances for training, which can save up to 90 percent of costs over on-demand instances. If you want to change the parameters for the Spot training, look at the options in the SageMaker SDK documentation.
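To show how such parameters might be consumed, the sketch below checks a set of Spot training settings before they are handed to a SageMaker Estimator. The helper name `spot_training_kwargs` and all example values are hypothetical. The keys mirror Estimator keyword arguments in the SageMaker Python SDK (`use_spot_instances`, `max_run`, `max_wait`), and the check reflects the SageMaker requirement that the maximum wait time be at least the maximum run time. Requiring both values explicitly is a choice made for this sketch.

```python
from typing import Any, Dict


def spot_training_kwargs(params: Dict[str, Any]) -> Dict[str, Any]:
    """Validate Spot settings and return keyword arguments for an Estimator."""
    kwargs = dict(params)
    if kwargs.get("use_spot_instances"):
        max_run = kwargs.get("max_run")
        max_wait = kwargs.get("max_wait")
        if max_run is None or max_wait is None:
            raise ValueError("Spot training needs both max_run and max_wait")
        if max_wait < max_run:
            raise ValueError("max_wait must be greater than or equal to max_run")
    return kwargs


# Example values only; replace them with the settings for your own account.
example_params = {
    "role": "AmazonSageMakerFullAccessRole",  # example role name from this article
    "instance_type": "ml.m4.xlarge",          # assumed instance type
    "instance_count": 1,
    "framework_version": "0.23-1",            # assumed scikit-learn container version
    "use_spot_instances": True,
    "max_run": 3600,                          # seconds the job may run
    "max_wait": 3600,                         # seconds to wait for Spot capacity plus run time
}
estimator_kwargs = spot_training_kwargs(example_params)
```

Validating locally means a misconfigured parameters file fails immediately instead of after the training job request has been sent.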
NOTE: Make sure you replace the role above with the name of the role you created earlier.
globals.yml
The global variables become available to the configuration files above, so we have a central location to change configuration from environment to environment if needed.
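The following standard-library sketch illustrates the general idea of filling `${...}` placeholders in configuration from a set of global values. It is not Kedro's implementation; the `render` and `lookup` helpers, the dotted-key convention, and the example keys and paths are assumptions made for illustration.

```python
import re
from typing import Any, Dict

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(global_values: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a key such as 's3.train_path' in a nested dictionary."""
    value: Any = global_values
    for part in dotted_key.split("."):
        value = value[part]
    return value


def render(config: Any, global_values: Dict[str, Any]) -> Any:
    """Return a copy of config with every ${key} replaced by its global value."""
    if isinstance(config, dict):
        return {key: render(value, global_values) for key, value in config.items()}
    if isinstance(config, list):
        return [render(value, global_values) for value in config]
    if isinstance(config, str):
        return _PLACEHOLDER.sub(
            lambda match: str(lookup(global_values, match.group(1))), config
        )
    return config  # numbers, booleans, and None pass through unchanged


# Hypothetical globals and catalog entry, using the example bucket name from above.
global_values = {"s3": {"train_path": "s3://xyz-123-sagemaker/train"}}
catalog = {
    "X_train": {
        "type": "pickle.PickleDataSet",
        "filepath": "${s3.train_path}/X_train.pickle",
    }
}
rendered = render(catalog, global_values)
```

With this pattern, moving the project to a different bucket only requires changing the value in one place.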
Finally, we update kedro-sagemaker/spaceflights/src/spaceflights/hooks.py to register the globals.yml variables by using TemplatedConfigLoader, which allows us to replace variables in our pipeline code with the contents of a configuration file. You can replace the src/spaceflights/hooks.py file with the following:
Updating the pipeline
Now that we’ve configured the pipeline to store our data on S3, where SageMaker can access it, and have created parameter files to maintain our configuration, we’ll update the data_science pipeline to use SageMaker for training by making three updates:
- Adding methods to nodes.py under the data_science pipeline to create a SageMaker training job.
- Replacing pipeline.py in the data_science pipeline to send the training data to the SageMaker training job.
- Creating a SageMaker training script that loads data from S3, configures training with hyperparameters, trains, and saves the model.
The changes to the pipeline listed above include the tasks needed to train a model using the SageMaker Python SDK, which are:
- Preparing a training script to load data, configure hyperparameters, train and save the model.
- Creating an Estimator to encapsulate the training job.
- Calling the fit method on the Estimator to begin training.
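The last two tasks can be sketched as a single function. This is a minimal sketch rather than the article's node: it assumes the SageMaker Python SDK (v2) is installed, that `estimator_kwargs` contains the arguments the SKLearn estimator needs (such as `entry_point`, `role`, `instance_type`, `instance_count`, and `framework_version`), and that the training data sits under one S3 prefix. The channel name "train" is also an assumption.

```python
from typing import Any, Dict


def train_model_sagemaker(train_data_s3_uri: str, estimator_kwargs: Dict[str, Any]) -> str:
    """Start a SageMaker training job and return the S3 URI of the model artifact."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from sagemaker.sklearn.estimator import SKLearn

    # The Estimator encapsulates the training job configuration.
    estimator = SKLearn(**estimator_kwargs)

    # fit() uploads the entry-point script, starts the job, and waits for it.
    # The dictionary key names the input channel the training script reads from.
    estimator.fit({"train": train_data_s3_uri})

    # After training, model_data holds the S3 location of model.tar.gz.
    return estimator.model_data
```

Calling this function launches a real training job in your AWS account, so it needs valid credentials and incurs charges.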
Update nodes.py
To create the SageMaker training job, which will train our linear regression model, we first update the methods in kedro-sagemaker/spaceflights/src/spaceflights/pipelines/data_science/nodes.py. The added methods set up and trigger the training job, return the path to the trained model, and unzip the model so we can review it locally if needed. We can replace the data_science/nodes.py file with the following:
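As one small piece of that file, the unzip step could look roughly like the sketch below. It covers only the extraction, uses only the standard library, and the function name and arguments are hypothetical. It assumes the model archive (a gzip-compressed tar file) has already been downloaded to a local path.

```python
import tarfile
from pathlib import Path
from typing import List


def untar_model(archive_path: str, target_dir: str) -> List[str]:
    """Extract a model archive into target_dir and return the member names."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives you trust, such as your own training output.
        archive.extractall(target_dir)
    return names
```

Returning the member names makes it easy for a later node to locate the serialized model file inside the extracted directory.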
Replace pipeline.py
With the new methods added to nodes.py, we’ll replace the contents of src/spaceflights/pipelines/data_science/pipeline.py to call the new train_model_sagemaker node and pass in the S3 path to the training data along with our training parameters.
from kedro.pipeline import Pipeline, node
Creating the SageMaker Training Script
Finally, we can create the SageMaker Training Script, which defines the actual work that the SageMaker training job will execute when run. The training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be deployed for inference later.
To create the training script, create the file src/spaceflights/sagemaker_entry_point.py with the content below.
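A minimal sketch of such a script's skeleton, not the complete file, is shown here. It assumes the SageMaker scikit-learn container conventions: hyperparameters arrive as command-line arguments, while the model and input directories are exposed through the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables (the latter matching a channel named "train"). The `fit_intercept` hyperparameter and the pickle file names are assumptions made for illustration.

```python
import argparse
import os
from typing import List, Optional


def str_to_bool(value: str) -> bool:
    """Hyperparameters arrive as strings, so convert 'True'/'False' explicitly."""
    return value.strip().lower() in ("true", "1", "yes")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Read hyperparameters from the command line and directories from the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fit_intercept", type=str_to_bool, default=True)
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    return parser.parse_args(argv)


def train_and_save(args: argparse.Namespace) -> str:
    """Load the pickled training data, fit the model, and save it to model_dir."""
    # Imported lazily; these packages are assumed to be present in the training container.
    import pickle

    import joblib
    from sklearn.linear_model import LinearRegression

    with open(os.path.join(args.train, "X_train.pickle"), "rb") as f:
        X_train = pickle.load(f)
    with open(os.path.join(args.train, "y_train.pickle"), "rb") as f:
        y_train = pickle.load(f)

    model = LinearRegression(fit_intercept=args.fit_intercept)
    model.fit(X_train, y_train)

    model_path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, model_path)
    return model_path
```

A real entry point would call `train_and_save(parse_args())` under the usual `if __name__ == "__main__":` guard so that the training job runs it on start.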
Running the pipeline
With the pipeline changes complete, we’re now ready to execute the pipeline:
kedro run --env sagemaker
Specifying the env option will use the configuration variables from the conf/sagemaker folder and will execute the pipeline.
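Conceptually, selecting an environment layers its settings over the base configuration. The sketch below shows that idea with plain dictionaries. It is a simplification, not Kedro's loader, and the `merge_config` helper, the dataset entries, and the paths are hypothetical.

```python
from typing import Any, Dict


def merge_config(base: Dict[str, Any], env: Dict[str, Any]) -> Dict[str, Any]:
    """Return base settings with same-named top-level entries replaced by env."""
    merged = dict(base)
    merged.update(env)
    return merged


# Hypothetical catalog entries standing in for conf/base and conf/sagemaker.
base_catalog = {
    "X_train": {"type": "pickle.PickleDataSet", "filepath": "data/05_model_input/X_train.pickle"},
    "regressor": {"type": "pickle.PickleDataSet", "filepath": "data/06_models/regressor.pickle"},
}
sagemaker_catalog = {
    "X_train": {"type": "pickle.PickleDataSet", "filepath": "s3://xyz-123-sagemaker/train/X_train.pickle"},
}
effective = merge_config(base_catalog, sagemaker_catalog)
```

Entries that the environment does not mention keep their base definitions, which is why the conf/sagemaker folder only needs to describe what changes.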
Note: You’ll need to have your AWS CLI configured because you’ll now be calling AWS services from your computer.
On this run, the Kedro pipeline will:
- Store the engineered dataset to S3.
- Create a Managed Spot Training job on SageMaker to train the model and show the savings over on-demand.
- Download the trained model locally to the data folder.
- Test the model and report its score.
Your output should look similar to that below:
Visualizing the final pipeline
To view the changes you made to your pipeline, run the kedro viz command again from the spaceflights directory, and you’ll see the updated pipeline referencing the new Train Model SageMaker node.
kedro viz
Your pipeline should now look like this:
Note that y_train isn’t connected to Train Model SageMaker because we are only passing the path of the datasets to SageMaker for training. Because we specified in catalog.yml that we want to store both X_train and y_train files in the same location on S3, we only need one reference to them to extract the path.
Clean up
To clean up your account, make sure that you delete all the objects in the S3 buckets you’ve created. The SageMaker training jobs you created are automatically terminated once they complete training, so you don’t have any SageMaker resources that need to be removed.
Conclusion
In this tutorial, we showed how to use Amazon SageMaker to train models through a Kedro pipeline. With this information, you can begin building models that are ready for production and quick to train from day one.
Get involved
You can join the open source Kedro community on GitHub and Discourse, where you can ask questions, collaborate, and contribute to the project.