AWS Open Source Blog

Using Kedro pipelines to train Amazon SageMaker models

Machine learning (ML) and artificial intelligence (AI) adoption is growing at nearly 25 percent per year in a variety of businesses, which results in data scientists and engineers building more analytical models per person with similar levels of resources as last year. To keep up with such high demand, builders need to remove manual and repetitive tasks from the model development workflow through automation and standardization.

It’s no surprise that tools like Kedro, which help builders apply software engineering best practices to data and machine learning pipelines, have rapidly gained popularity. Pipelines are a software concept where code is used to create a repeatable sequence of steps so that the pipeline builder no longer needs to manually figure out what’s next. In this article, we’ll show how building a Kedro pipeline that trains your model on Amazon SageMaker will help you create reproducible models and save time in the process.

What is Kedro?

Kedro is an Apache 2.0 licensed open source Python framework that applies software engineering best practices to data and machine learning pipelines. You can use it, for example, to optimize the process of taking a machine learning model into a production environment. You can use Kedro to organize a single-user project running on a local environment, or collaborate in a team on an enterprise-level project.

Kedro focuses on the development workflow to make organizing, building, and running data projects stress-free. Kedro aims to solve several common challenges:

  • Maintainability–Project standardization makes projects easier to understand and update because the structure is the same between projects.
  • Reusability–Pipeline abstraction allows builders to reuse parts of their pipeline in the future, thereby shortening the build cycle.
  • Collaboration–Best practices built into the workflow let builders with varied levels of experience work together more effectively by guiding the development process.
  • Efficient use of time–When the above challenges are addressed, builders can move more quickly from experimentation to production.

Kedro gets our project organized and consistent in a few different ways. For example, using customizable templates we can generate a standard project skeleton with logical folders for assets including datasets, configuration, notebooks, and application source code. Once the skeleton setup is complete, we can use Kedro to construct pipelines of executable nodes that form the basis of our project. When building nodes and pipelines, we can use features such as the data catalog, pipeline journal, and others to construct production-ready data pipelines from day one.

Prerequisites

To complete this tutorial, you will need:

The code examples shown in this article are from macOS. If you’re a Windows user, you’ll need to translate commands and path references used on the terminal to Windows format; the YAML and Python code should remain the same.

Installing Kedro

Kedro can be installed using Pip or Conda, which are package managers for Python. But, first, we’ll create a virtual environment to keep our project isolated and clean from other dependencies. The following steps will create a new directory for our project, create the Python virtual environment, activate it, and then finally install Kedro in that virtual environment:

mkdir kedro-sagemaker && cd kedro-sagemaker
python -m venv kedro-env
source kedro-env/bin/activate
pip install kedro==0.16.5

If you’re not familiar with Kedro and would like to start from scratch, the documentation offers a great full tutorial that walks through the development workflow of a Kedro project, including pipeline creation, nodes, data catalog specification, testing, packaging, and more. We’ll be working through the completed version of that tutorial, so if this is your first exposure to Kedro, you may want to start there.

Starting our Kedro project

To get up and running quickly, we’ll use a “Kedro Starter,” which is a template that contains both configuration and code that can be run as-is or extended. The starter includes a sample dataset and a fully configured pipeline and code, so we can start exploring immediately. Using a starter project will help us jump to step three in the Kedro workflow and begin to make some modifications to the starter project.

Kedro logo

Running a sample pipeline

Once Kedro is installed in our virtual environment, we’ll create a new project using the kedro new command and pass the path of the starter template. We’ll use the “Space Flights” starter for our examples, which will create a linear regression model that predicts the cost of space flights from the included sample data sets. When you run kedro new, you’ll be prompted for a project name, repository name, and Python package name. Enter spaceflights for all three and then change to the directory specified at the end of the output.

kedro new --starter git+https://github.com/quantumblacklabs/kedro-starter-spaceflights.git

Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
 [New Kedro Project]: spaceflights

Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
 [spaceflights]: spaceflights

Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter or underscore.
 [spaceflights]: spaceflights

cd spaceflights

Looking inside the spaceflights directory at the content of our new project, we can see that the organizational structure that forms the base of all Kedro projects automatically includes best practices such as data, configuration, and source code separation.

├── conf
│   ├── base
│   └── local
├── data
│   ├── 01_raw
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_features
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── docs
│   └── source
├── logs
│   └── journals
├── notebooks
└── src
    ├── spaceflights
    │   ├── nodes
    │   └── pipelines
    │       ├── data_engineering
    │       └── data_science
    └── tests
        └── pipelines

The Space Flights project we’re using contains two pre-built pipelines: data_engineering and data_science, which process the raw input data from the data folder and build a SKLearn linear regression model, respectively. The linear regression model attempts to predict the cost of taking a space flight using the input data.

Because this is our first time running the project, we need to perform a few one-time project setup tasks. To begin, we’ll initialize the spaceflights folder as a Git repository so we can track our changes. Thankfully, a .gitignore file is automatically created for us by Kedro. Next, we’ll install the requirements.txt from the src directory to our virtual environment:

git init
pip install -r src/requirements.txt

With our prerequisite setup finished, we can now run our Kedro pipeline as-is via the kedro run command, which will execute all the nodes in the pipeline:

kedro run

The pipeline output will look similar to the following; you can ignore any warnings:

2020-10-21 18:22:42,896 - root - INFO - ** Kedro project spaceflights
2020-10-21 18:22:42,959 - kedro.io.data_catalog - INFO - Loading data from `shuttles` (ExcelDataSet)...
2020-10-21 18:22:53,524 kedro.pipeline.node - INFO - Running node: preprocessing_shuttles: preprocess_shuttles([shuttles]) -> [preprocessed_shuttles]
2020-10-21 18:22:53,538 kedro.io.data_catalog - INFO - Saving data to `preprocessed_shuttles` (CSVDataSet)...
2020-10-21 18:22:54,186 kedro.runner.sequential_runner - INFO - Completed 1 out of 6 tasks
2020-10-21 18:22:54,166 kedro.io.data_catalog - INFO - Loading data from `companies` (CSVDataSet)...
2020-10-21 18:22:54,238 kedro.pipeline.node - INFO - Running node: preprocessing_companies: preprocess_companies([companies]) -> [preprocessed_companies]
2020-10-21 18:22:54,216 kedro.io.data_catalog - INFO - Saving data to `preprocessed_companies` (CSVDataSet)...
2020-10-21 18:22:54,563 kedro.runner.sequential_runner - INFO - Completed 2 out of 6 tasks
...
2020-10-21 18:23:12,065 kedro.pipeline.node - INFO - Running node: evaluate_model([X_test,regressor,y_test]) -> None
2020-10-21 18:23:12,082 - spaceflights.pipelines.data_science.nodes - INFO - Model has a coefficient R^2 of 0.456.
2020-10-21 18:23:12,108 kedro.runner.sequential_runner - INFO - Completed 6 out of 6 tasks
2020-10-21 18:23:12,118 kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.

From the output above, we can see that our pipeline completed successfully. We can see from the last three lines that the default model had a coefficient R^2 of 0.456 (your number might differ slightly), that “6 out of 6 tasks” were completed and that the “pipeline execution completed successfully,” respectively.
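For readers unfamiliar with the metric, the coefficient of determination that evaluate_model logs can be sketched in plain Python. This is only an illustration with made-up numbers; the project itself relies on sklearn.metrics.r2_score, and the function name r_squared is ours.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return 1 - SS_res / SS_tot, the coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, not taken from the Space Flights data
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
score = r_squared(actual, predicted)
print(f"R^2 = {score:.3f}")
```

A score of 1.0 would mean the predictions match the test data exactly, so a value around 0.456 suggests the default model explains a little under half of the variance in price.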

This output tells us that every node in the default pipeline ran including the training and testing of the SKLearn model provided from the Space Flights starter. But, what happens if we have more data than our laptop can reasonably handle? What if we need to retrain our model when we receive new data or when a specific event occurs? Let’s explore how training models on SageMaker through our Kedro pipeline helps address these issues.

Updating the Space Flights pipeline with SageMaker

We can visualize our current pipeline using Kedro-Viz, a plugin for Kedro already installed in our environment. Kedro-Viz uses the configuration and code of our Kedro data pipeline to produce an interactive rendering of the data pipeline, allowing us to explore the nodes, dependencies, outputs, and data sets defined in the pipeline. Generating the visualization is done by running kedro viz from the root of the project:

kedro viz

Kedro-Viz will then start a local webserver that hosts the interactive visualization of our pipeline. We can access the webserver at the URL listed in the output of the kedro viz command, typically http://127.0.0.1:4141/:

Visualization of the data pipeline that kedro viz produces from the root of the project.

Exploring the visualization, we can see the node Train Model is currently outputting Regressor, which is the SKLearn model, so we know Train Model is the node we’ll need to update if we want to use SageMaker to train our model. Once you are done exploring, you can exit Kedro-Viz and cancel the process in the terminal.

Before we make the change to our node, we need to create an Amazon Simple Storage Service (Amazon S3) bucket for our data and objects and an IAM role to allow SageMaker to perform operations on our behalf.

Note: This guide assumes your AWS user has Administrator permission in your account. To limit permissions in your live environment, you can use IAM last accessed information to help implement granting least privilege access, a critical security best practice.

To begin, create an Amazon S3 bucket with sagemaker in the name and take note of the full name; be sure you do not allow public access to the bucket. The bucket name must contain sagemaker because the managed policy that we attach to our IAM role, AmazonSageMakerFullAccess, looks for that string in the bucket name to allow access to the bucket. An example S3 bucket name could be xyz-123-sagemaker. Be sure the bucket you create is in the same Region you plan to use SageMaker in, for example us-east-1.
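Before creating the bucket, a candidate name can be checked with a few lines of Python. The sketch below covers only a simplified form of the S3 naming rules plus the sagemaker requirement described above; is_suitable_bucket_name is a hypothetical helper, not part of any AWS SDK.

```python
import re

# Simplified S3 naming rule: 3-63 characters, lowercase letters, digits,
# dots, and hyphens, starting and ending with a letter or digit
BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_suitable_bucket_name(name: str) -> bool:
    """Check the basic S3 naming rule and the 'sagemaker' requirement."""
    return bool(BUCKET_NAME.match(name)) and "sagemaker" in name


print(is_suitable_bucket_name("xyz-123-sagemaker"))
print(is_suitable_bucket_name("xyz-123-training"))
```

The first name passes both checks, while the second is a valid S3 name that lacks the sagemaker string.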

Next, create an IAM Role that will allow SageMaker to perform operations on your behalf and note its name. As a managed service, SageMaker can perform only the operations that the user permits. In this example, we’ll use the managed policy AmazonSageMakerFullAccess. In your Production environment, however, you should restrict the permissions to only what is needed for your use case. Refer to the best practices documentation to understand granting least privilege in your account.

To create a new IAM role:

1.     Sign in to the AWS Management Console and open the IAM console.

2.     In the left navigation pane, choose Roles.

3.     Choose Create role.

4.     For role type, choose AWS Service, find and choose SageMaker from the use case list and then choose the SageMaker – Execution use case. Then, choose Next: Permissions.

5.     On the Attach permissions policy page, confirm that the policy AmazonSageMakerFullAccess is listed. Choose Next: Tags, then Next: Review.

6.      For Role name, enter a name for your role, for example: AmazonSageMakerFullAccessRole.

Once we’ve created the SageMaker IAM role, we can begin updating our Space Flights project to train the model on SageMaker.

We’ll be using the Amazon SageMaker Python SDK and S3FS in our pipeline, so we need to add them to the kedro-sagemaker/spaceflights/src/requirements.txt file so they are available to our nodes.

black==v19.10b0
flake8>=3.7.9, <4.0
ipython~=7.0
isort>=4.3.21, <5.0
jupyter~=1.0
jupyter_client>=5.1,<7.0
jupyterlab==0.31.1
kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.16.5
kedro-viz~=3.1
nbstripout==0.3.3
pytest-cov~=2.5
pytest-mock>=1.7.1, <2.0
pytest~=5.0
wheel==0.32.2
scikit-learn==0.23.2
sagemaker>=2.13.0
s3fs>=0.3.0, <0.4.1

From the kedro-sagemaker/spaceflights directory, run kedro install --build-reqs to update your Python virtual environment with the SageMaker SDK and S3FS added to the requirements.txt file. This step may take a few minutes as it downloads, installs, and builds all the requirements for our project.

Configuring the pipeline

Once we have S3FS installed, we can use Amazon S3 to store the data that we’ll use for training models on SageMaker. For this example, we’ll focus primarily on the data_science pipeline because that’s where model training takes place; however, we can use S3 as part of the data_engineering pipeline as well.

Looking at kedro-sagemaker/spaceflights/src/spaceflights/pipelines/data_science/pipeline.py, we see that the split_data node takes in master_table and parameters as input and outputs four datasets, X_train, X_test, y_train, and y_test:

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["master_table", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
            ),
            node(func=train_model, inputs=["X_train", "y_train"], outputs="regressor"),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
            ),
        ]
    )
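Kedro works out the execution order from the dataset names that nodes consume and produce, not from the order in which the nodes are listed. The sketch below mimics that idea in plain Python for the three data_science nodes. It is a simplified illustration rather than Kedro's actual implementation, and execution_order is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# (inputs, outputs) per node, copied from the data_science pipeline above
# and deliberately listed out of order
NODES: Dict[str, Tuple[List[str], List[str]]] = {
    "evaluate_model": (["regressor", "X_test", "y_test"], []),
    "train_model": (["X_train", "y_train"], ["regressor"]),
    "split_data": (
        ["master_table", "parameters"],
        ["X_train", "X_test", "y_train", "y_test"],
    ),
}


def execution_order(nodes: Dict[str, Tuple[List[str], List[str]]]) -> List[str]:
    """Order nodes so that each one runs after the producers of its inputs."""
    produced_by = {out: name for name, (_, outs) in nodes.items() for out in outs}
    ordered: List[str] = []
    remaining = dict(nodes)
    while remaining:
        # A node is ready when none of its inputs is produced by a pending node
        ready = sorted(
            name
            for name, (ins, _) in remaining.items()
            if all(produced_by.get(i) not in remaining for i in ins)
        )
        if not ready:
            raise ValueError("Circular dependency between nodes")
        ordered.extend(ready)
        for name in ready:
            del remaining[name]
    return ordered


print(execution_order(NODES))
```

Even though evaluate_model is listed first in the dictionary, it ends up last because it depends on regressor, which train_model produces.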

The current inputs and outputs are defined in kedro-sagemaker/spaceflights/conf/base/catalog.yml, where they are all specified to be stored locally on your computer. To store our training data on S3 instead, we’ll add a new configuration set alongside the existing ones.

We’ll create a new directory named sagemaker under the kedro-sagemaker/spaceflights/conf directory to house our configuration files. Our conf directory structure should now look like this:

conf
├── base
├── local
└── sagemaker

Create the following files under the new conf/sagemaker directory:

catalog.yml

X_train@pickle:
    type: pickle.PickleDataSet
    filepath: ${s3.train_path}/X_train.pickle
  
X_train@path:
    type: MemoryDataSet
    data: ${s3.train_path}/X_train.pickle
  
y_train:
    type: pickle.PickleDataSet
    filepath: ${s3.train_path}/y_train.pickle

The catalog file will specify that the X_train and y_train objects will now be stored as PickleDataSet on S3.  You can find more information on the data storage options in the Kedro Data Catalog documentation.
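The @ in X_train@pickle and X_train@path is Kedro's transcoding syntax: both entries describe the same underlying dataset, X_train, accessed in two ways. One node saves the pickled data to S3, and a later node receives only the S3 path as a string. The sketch below illustrates how the shared name ties the two entries together; base_dataset_name is a hypothetical helper, not a Kedro API.

```python
def base_dataset_name(catalog_entry: str) -> str:
    """Return the dataset name without its transcoding suffix, if any."""
    return catalog_entry.split("@", 1)[0]


entries = ["X_train@pickle", "X_train@path", "y_train"]
for entry in entries:
    print(f"{entry} -> {base_dataset_name(entry)}")

# Both transcoded entries point at the same dataset
same = base_dataset_name("X_train@pickle") == base_dataset_name("X_train@path")
print(same)
```

Because the two entries share a base name, Kedro can still order the node that reads the path after the node that writes the pickle.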

parameters.yml

sklearn_estimator_kwargs:
  entry_point: src/spaceflights/sagemaker_entry_point.py  # you will create this file later
  role: kedro-sagemaker-role  # put the name of the role you've created earlier
  instance_type: ml.m4.xlarge
  instance_count: 1
  framework_version: 0.23-1
  output_path: ${s3.output_path}
  use_spot_instances: True
  max_run: 3600
  max_wait: 7200
  checkpoint_s3_uri: ${s3.checkpoint_path}

The parameters file allows us to create configuration values that can be passed as variables when executing pipelines. We can change or override these parameters at runtime to fit different environmental needs. In this example, we’ve specified that SageMaker should use Managed Spot Instances for training, which can save up to 90 percent of costs over on-demand instances. If you want to change the parameters for the Spot training, look at the options in the SageMaker SDK documentation.

NOTE: Make sure you replace the role above with the name of the role you created earlier.
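When Managed Spot Training is enabled, max_wait needs to be at least as large as max_run, because the wait time includes the time spent training. A quick sanity check of the values before launching a job might look like the sketch below; check_spot_settings is a hypothetical helper, not part of Kedro or the SageMaker SDK.

```python
from typing import Any, Dict, List


def check_spot_settings(kwargs: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the Spot-related estimator settings."""
    if not kwargs.get("use_spot_instances", False):
        return []  # nothing to check for on-demand training
    max_run = kwargs.get("max_run")
    max_wait = kwargs.get("max_wait")
    if max_run is not None and max_wait is not None and max_wait < max_run:
        return ["max_wait must be greater than or equal to max_run"]
    return []


# Values from the parameters.yml shown above
settings = {"use_spot_instances": True, "max_run": 3600, "max_wait": 7200}
print(check_spot_settings(settings))
```

With the values from parameters.yml the list of problems is empty; swapping the two numbers would produce a warning.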

globals.yml

s3:
  train_path: s3://<your_s3_bucket_name>/train
  output_path: s3://<your_s3_bucket_name>/output
  checkpoint_path: s3://<your_s3_bucket_name>/checkpoints

The global variables become available to the configuration files above, so we have a central location to change configuration from environment to environment if needed.
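To make the substitution concrete, here is a small, self-contained sketch of how a ${s3.train_path} placeholder could be resolved against the values in globals.yml. It only illustrates the idea; Kedro's TemplatedConfigLoader has its own implementation, and the resolve function and bucket name below are hypothetical.

```python
import re
from typing import Any, Dict

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, global_values: Dict[str, Any]) -> str:
    """Replace each dotted placeholder with the matching nested value."""

    def lookup(match: "re.Match[str]") -> str:
        value: Any = global_values
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the placeholder is unknown
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Mirrors globals.yml with a made-up bucket name
globals_yml = {"s3": {"train_path": "s3://xyz-123-sagemaker/train"}}
print(resolve("${s3.train_path}/X_train.pickle", globals_yml))
```

Changing the bucket in one place then updates every catalog and parameter entry that refers to it.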

Finally, we update kedro-sagemaker/spaceflights/src/spaceflights/hooks.py to register the globals.yml variables by using TemplatedConfigLoader, which replaces the ${...} placeholders in our configuration files with the values from globals.yml. You can replace the src/spaceflights/hooks.py file with the following:

from typing import Any, Dict, Iterable, Optional

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.versioning import Journal

from spaceflights.pipelines import data_engineering as de
from spaceflights.pipelines import data_science as ds

class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        """Register the project's pipeline.

        Returns:
            A mapping from a pipeline name to a ``Pipeline`` object.

        """
        data_engineering_pipeline = de.create_pipeline()
        data_science_pipeline = ds.create_pipeline()

        return {
            "__default__": data_engineering_pipeline + data_science_pipeline,
            "de": data_engineering_pipeline,
            "ds": data_science_pipeline,
        }

    @hook_impl
    def register_config_loader(
        self, conf_paths: Iterable[str]
    ) -> TemplatedConfigLoader:
        # Fill ${...} placeholders in the config files from any *globals.yml file
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")

    @hook_impl
    def register_catalog(
        self,
        catalog: Optional[Dict[str, Dict[str, Any]]],
        credentials: Dict[str, Dict[str, Any]],
        load_versions: Dict[str, str],
        save_version: str,
        journal: Journal,
    ) -> DataCatalog:
        return DataCatalog.from_config(
            catalog, credentials, load_versions, save_version, journal
        )

project_hooks = ProjectHooks()

Updating the pipeline

Now that we’ve updated the pipeline to store our data on S3 where SageMaker can access it and have configured parameter files to maintain our configuration, we’ll update the data_science pipeline to use SageMaker for training by making three updates to the pipeline:

  • Adding methods to nodes.py under the data_science pipeline to create a SageMaker training job.
  • Replacing pipeline.py in the data_science pipeline to send the training data to the SageMaker training job.
  • Creating a SageMaker training script that loads data from S3, configures training with hyperparameters, trains, and saves the model.

The changes to the pipeline listed above include the tasks needed to train a model using the SageMaker Python SDK, which are:

  • Preparing a training script to load data, configure hyperparameters, train and save the model.
  • Creating an Estimator to encapsulate the training job.
  • Calling the fit method on the Estimator to begin training.

Update nodes.py

To create the SageMaker training job, which will train our linear regression model, we first update the functions in kedro-sagemaker/spaceflights/src/spaceflights/pipelines/data_science/nodes.py. The functions we add will set up and trigger the training job, return the path to the trained model, and unpack the model archive so we can review it locally if needed. We can replace the data_science/nodes.py file with the following:

import logging
import pickle
import tarfile
import uuid
from typing import Any, Dict, List

import fsspec
import numpy as np
import pandas as pd
from sagemaker.sklearn.estimator import SKLearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def split_data(data: pd.DataFrame, parameters: Dict) -> List:
    """Splits data into training and test sets.

        Args:
            data: Source data.
            parameters: Parameters defined in parameters.yml.
        Returns:
            A list containing split data.

    """
    X = data[
        [
            "engines",
            "passenger_capacity",
            "crew",
            "d_check_complete",
            "moon_clearance_complete",
        ]
    ].values
    y = data["price"].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )

    return [X_train, X_test, y_train, y_test]

def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    """Train the linear regression model.

        Args:
            X_train: Training data of independent features.
            y_train: Training data for price.

        Returns:
            Trained model.

    """
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor

def evaluate_model(regressor: LinearRegression, X_test: np.ndarray, y_test: np.ndarray):
    """Calculate the coefficient of determination and log the result.

        Args:
            regressor: Trained model.
            X_test: Testing data of independent features.
            y_test: Testing data for price.

    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f.", score)

def train_model_sagemaker(
    X_train_path: str, sklearn_estimator_kwargs: Dict[str, Any]
) -> str:
    """Train the linear regression model on SageMaker.

        Args:
            X_train_path: Full S3 path to `X_train` dataset.
            sklearn_estimator_kwargs: Keyword arguments that will be used
                to instantiate SKLearn estimator.

        Returns:
            Full S3 path to `model.tar.gz` file containing the model artifact.

    """
    checkpoint_s3_uri = sklearn_estimator_kwargs.pop('checkpoint_s3_uri', '')
    checkpoint_suffix = str(uuid.uuid4())[:8]
    checkpoint_s3_uri += checkpoint_suffix
    
    sklearn_estimator = SKLearn(**sklearn_estimator_kwargs,
                                checkpoint_s3_uri=checkpoint_s3_uri)

    # we need a path to the directory containing both
    # X_train (feature table) and y_train (target variable)
    inputs_dir = X_train_path.rsplit("/", 1)[0]
    inputs = {"train": inputs_dir}

    # wait=True ensures that the execution is blocked
    # until the job finishes on SageMaker
    sklearn_estimator.fit(inputs=inputs, wait=True)

    training_job = sklearn_estimator.latest_training_job
    job_description = training_job.describe()
    model_path = job_description["ModelArtifacts"]["S3ModelArtifacts"]
    return model_path

def untar_model(model_path: str) -> LinearRegression:
    """Unarchive the linear regression model artifact produced
    by the training job on SageMaker.

        Args:
            model_path: Full S3 path to `model.tar.gz` file containing
                the model artifact.

        Returns:
            Trained model.

    """
    with fsspec.open(model_path) as s3_file, tarfile.open(
        fileobj=s3_file, mode="r:gz"
    ) as tar:
        # we expect to have only one file inside the `model.tar.gz` archive
        filename = tar.getnames()[0]
        model_obj = tar.extractfile(filename)
        return pickle.load(model_obj)
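The unpacking logic in untar_model can be tried without an AWS account. The sketch below builds a model.tar.gz archive in memory, with a plain dictionary standing in for the trained regressor, and then reads it back the same way untar_model does. The in-memory buffer replaces the fsspec S3 file, and both helper names are hypothetical.

```python
import io
import pickle
import tarfile
from typing import Any


def make_model_archive(model: Any, member_name: str = "regressor.pickle") -> bytes:
    """Pickle the object and wrap it in a gzip-compressed tar archive."""
    payload = pickle.dumps(model)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def load_model_archive(archive: bytes) -> Any:
    """Read the single pickled object back out of the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        # As in untar_model, we expect exactly one file in the archive
        filename = tar.getnames()[0]
        return pickle.load(tar.extractfile(filename))


# A stand-in for the trained model, so the sketch needs no scikit-learn
stand_in_model = {"coef": [1.5, -0.2], "intercept": 0.7}
restored = load_model_archive(make_model_archive(stand_in_model))
print(restored == stand_in_model)
```

As with any pickle file, only unpack archives that come from a source you trust, such as your own training job.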

Replace pipeline.py

With the new methods added to nodes.py, we’ll replace the contents of src/spaceflights/pipelines/data_science/pipeline.py to call the new train_model_sagemaker node and pass in the S3 path to the training data along with our training parameters.

from kedro.pipeline import Pipeline, node

from .nodes import (
    evaluate_model,
    split_data,
    train_model_sagemaker,
    untar_model,
)

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["master_table", "parameters"],
                outputs=["X_train@pickle", "X_test", "y_train", "y_test"],
            ),
            node(
                func=train_model_sagemaker,
                inputs=["X_train@path", "params:sklearn_estimator_kwargs"],
                outputs="model_path",
            ),
            node(untar_model, inputs="model_path", outputs="regressor"),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
            ),
        ]
    )

Creating the SageMaker Training Script

Finally, we can create the SageMaker Training Script, which defines the actual work that the SageMaker training job will execute when run. The training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be deployed for inference later.

To create the training script, create the file src/spaceflights/sagemaker_entry_point.py with the content below.

import argparse
import pickle
from os import getenv
from pathlib import Path
from typing import Any

from sklearn.linear_model import LinearRegression

def _pickle(path: Path, data: Any) -> None:
    """Pickle the object and save it to disk"""
    with path.open("wb") as f:
        pickle.dump(data, f)

def _unpickle(path: Path) -> Any:
    """Unpickle the object from a given file"""
    with path.open("rb") as f:
        return pickle.load(f)

def _get_arg_parser() -> argparse.ArgumentParser:
    """Instantiate the command line argument parser"""
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--output-data-dir", type=str, default=getenv("SM_OUTPUT_DATA_DIR")
    )
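    parser.add_argument("--model-dir", type=str, default=getenv("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=getenv("SM_CHANNEL_TRAIN"))
    return parser

# NOTE: the original listing is cut off at this point. The remainder is a
# reconstruction that assumes the file names defined in conf/sagemaker/catalog.yml.
def main() -> None:
    """Entry point that SageMaker calls when running the training job"""
    args = _get_arg_parser().parse_args()

    # SageMaker copies the "train" channel from S3 to this local directory
    data_path = Path(args.train)
    X_train = _unpickle(data_path / "X_train.pickle")
    y_train = _unpickle(data_path / "y_train.pickle")

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    # Everything in model_dir is packed into model.tar.gz; untar_model
    # expects a single pickled model inside that archive
    _pickle(Path(args.model_dir) / "regressor.pickle", regressor)

if __name__ == "__main__":
    main()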

Running the pipeline

With the pipeline changes complete, we’re now ready to execute the pipeline:

kedro run --env sagemaker

Specifying the env option will use the configuration variables from the conf/sagemaker folder and will execute the pipeline.
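Conceptually, Kedro loads conf/base first and then the environment passed with --env, and an entry from the environment takes precedence when the same top-level key appears in both. The sketch below shows that behavior with plain dictionaries. It is not Kedro's actual loader, the catalog entries are abbreviated and illustrative, and overlay is a hypothetical helper.

```python
from typing import Any, Dict


def overlay(base: Dict[str, Any], env: Dict[str, Any]) -> Dict[str, Any]:
    """Combine two config layers; top-level keys from env win over base."""
    merged = dict(base)
    merged.update(env)
    return merged


# Abbreviated, illustrative entries rather than the full project catalog
base_catalog = {
    "companies": {"type": "pandas.CSVDataSet"},
    "y_train": {"type": "MemoryDataSet"},
}
sagemaker_catalog = {
    "y_train": {"type": "pickle.PickleDataSet"},
}

catalog = overlay(base_catalog, sagemaker_catalog)
print(catalog["y_train"]["type"])
print(catalog["companies"]["type"])
```

Entries that exist only in conf/base, such as the raw input datasets, remain available when running with the sagemaker environment.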

Note: You’ll need to have your AWS CLI configured because you’ll now be calling AWS services from your computer.

On this run, the Kedro pipeline will:

  • Store the engineered dataset to S3.
  • Create a Managed Spot Training job on SageMaker to train the model and show the savings over on-demand.
  • Download the trained model locally to the data folder.
  • Test the model and report its score.

Your output should look similar to that below:

2020-10-27 21:02:36,002 - root - INFO - ** Kedro project spaceflights
2020-10-27 21:02:36,089 - botocore.utils - INFO - IMDS ENDPOINT: http://169.254.169.254/
2020-10-27 21:02:36,098 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials

    <Data engineering pipeline output>

2020-10-27 21:04:15,007 - kedro.pipeline.node - INFO - Running node: train_model_sagemaker([X_train@path,params:sklearn_estimator_kwargs]) -> [model_path]
2020-10-27 21:04:15,023 - botocore.utils - INFO - IMDS ENDPOINT: http://169.254.169.254/
2020-10-27 21:04:15,028 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2020-10-27 21:04:15,080 - sagemaker.image_uris - INFO - Same images used for training and inference. Defaulting to image scope: inference.
2020-10-27 21:04:16,318 - sagemaker - INFO - Creating training-job with name: sagemaker-scikit-learn-2020-10-27-21-04-15-089
2020-10-27 21:04:16 Starting - Starting the training job...
2020-10-27 21:04:19 Starting - Launching requested ML instances.........
2020-10-27 21:06:05 Starting - Preparing the instances for training......

    <Amazon SageMaker training output>

2020-10-27 21:08:35,238 sagemaker-training-toolkit INFO     Reporting training SUCCESS
2020-10-27 21:08:45 Uploading - Uploading generated training model
2020-10-27 21:08:45 Completed - Training job completed
Training seconds: 101
Billable seconds: 62
Managed Spot Training savings: 38.6%

    <Model training output>

2020-10-27 21:09:18,226 - kedro.pipeline.node - INFO - Running node: evaluate_model([X_test,regressor,y_test]) -> None
2020-10-27 21:09:18,296 - spaceflights.pipelines.data_science.nodes - INFO - Model has a coefficient R^2 of 0.456.
2020-10-27 21:09:18,325 - kedro.runner.sequential_runner - INFO - Completed 7 out of 7 tasks
2020-10-27 21:09:18,326 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
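The savings figure in the log follows directly from the two durations reported above it: it is the share of training seconds that were not billed. A quick check of the arithmetic, where spot_savings_percent is a hypothetical helper name:

```python
def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Percentage of training time that was not billed."""
    return (1 - billable_seconds / training_seconds) * 100


# Durations taken from the sample output above
savings = spot_savings_percent(training_seconds=101, billable_seconds=62)
print(f"{savings:.1f}%")
```

With 101 training seconds and 62 billable seconds, this reproduces the 38.6% shown in the output.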

Visualizing the final pipeline

To view the changes you made to your pipeline, run the kedro viz command again from the spaceflights directory, and you’ll see the updated pipeline referencing the new Train Model SageMaker node.

kedro viz

Your pipeline should now look like this:

Visualization of the data pipeline after changes are made to the pipeline.

Note that y_train isn’t connected to Train Model SageMaker because we are only passing the path of the datasets to SageMaker for training. Because we specified in catalog.yml that we want to store both X_train and y_train files in the same location on S3, we only need one reference to them to extract the path.
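The path handling behind this can be sketched in a few lines. train_model_sagemaker strips the file name from the X_train path and hands the remaining directory to SageMaker as the train channel, so y_train is found alongside it. The bucket name below is made up, and parent_prefix is a hypothetical helper.

```python
def parent_prefix(s3_path: str) -> str:
    """Drop the last path segment, as train_model_sagemaker does with rsplit."""
    return s3_path.rsplit("/", 1)[0]


x_train_path = "s3://xyz-123-sagemaker/train/X_train.pickle"
y_train_path = "s3://xyz-123-sagemaker/train/y_train.pickle"

# Both files resolve to the same directory, so one reference is enough
print(parent_prefix(x_train_path))
print(parent_prefix(x_train_path) == parent_prefix(y_train_path))
```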

Clean up

To clean up your account, make sure that you delete all the objects in the S3 buckets you’ve created. The SageMaker training jobs you created are automatically terminated once they complete training, so you don’t have any SageMaker resources that need to be removed.

Conclusion

In this tutorial, we showed how to use Amazon SageMaker to train models through a Kedro pipeline. With this information, you can begin building models that are ready for production and quick to train from day one.

Get involved

You can join the open source Kedro community on GitHub and Discourse, where you can ask questions, collaborate, and contribute to the project.

 

 

 

 

 

 

Jeremy Warrick

Jeremy is a Principal Solutions Architect at AWS working with Business Consulting and Advisory Partners. In his role, he helps customers translate business-led initiatives into successful technical delivery processes. With a background in software engineering, Jeremy is always looking for better ways to solve customer problems. You can find Jeremy on Twitter @ComputerBoxen