AWS Machine Learning Blog

Migrate Amazon SageMaker Data Wrangler flows to Amazon SageMaker Canvas for faster data preparation

Amazon SageMaker Data Wrangler provides a visual interface to streamline and accelerate data preparation for machine learning (ML), which is often the most time-consuming and tedious task in ML projects. Amazon SageMaker Canvas is a low-code/no-code visual interface for building and deploying ML models without writing code. Based on customer feedback, we have integrated the advanced, ML-specific data preparation capabilities of SageMaker Data Wrangler into SageMaker Canvas, giving users an end-to-end, no-code workspace for preparing data and for building and deploying ML models.

By abstracting away much of the complexity of the ML workflow, SageMaker Canvas enables you to prepare data, then build or use a model to generate highly accurate business insights, all without writing code. Additionally, preparing data in SageMaker Canvas offers many enhancements, such as page loads up to 10 times faster, a natural language interface for data preparation, the ability to view the data size and shape at every step, and improved replace and reorder transforms for iterating on a data flow. Finally, you can create a model with a single click in the same interface, or create a SageMaker Canvas dataset to fine-tune foundation models (FMs).

This post demonstrates how you can bring your existing SageMaker Data Wrangler flows—the instructions created when building data transformations—from SageMaker Studio Classic to SageMaker Canvas. We provide an example of moving files from SageMaker Studio Classic to Amazon Simple Storage Service (Amazon S3) as an intermediate step before importing them into SageMaker Canvas.

Solution overview

The high-level steps are as follows:

  1. Open a terminal in SageMaker Studio Classic and copy the flow files to Amazon S3.
  2. Import the flow files into SageMaker Canvas from Amazon S3.

Prerequisites

In this example, we use a folder called data-wrangler-classic-flows as a staging folder for migrating flow files to Amazon S3. It is not necessary to create a migration folder, but in this example, the folder was created using the file browser in SageMaker Studio Classic. After you create the folder, take care to move and consolidate the relevant SageMaker Data Wrangler flow files into it. In the following screenshot, the three flow files necessary for migration have been moved into the folder data-wrangler-classic-flows, as seen in the left pane. One of these files, titanic.flow, is opened and visible in the right pane.
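You can also consolidate scattered flow files from a SageMaker Studio Classic terminal. The following commands are a minimal sketch, assuming your flow files live under your home directory; adjust the search root to match your environment:

    # Create the staging folder, then move every .flow file under the home
    # directory into it (skipping files already in the staging folder)
    mkdir -p ~/data-wrangler-classic-flows
    find ~ -name "*.flow" -not -path "*/data-wrangler-classic-flows/*" \
        -exec mv {} ~/data-wrangler-classic-flows/ \;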

Copy flow files to Amazon S3

To copy the flow files to Amazon S3, complete the following steps:

  1. To open a new terminal in SageMaker Studio Classic, on the File menu, choose New, then Terminal.
  2. With a new terminal open, run the following commands to copy your flow files to the Amazon S3 location of your choosing (replacing NNNNNNNNNNNN with your AWS account number):
    # Copy only the .flow files from the staging folder to Amazon S3
    cd data-wrangler-classic-flows
    target="s3://sagemaker-us-west-2-NNNNNNNNNNNN/data-wrangler-classic-flows/"
    aws s3 sync . "$target" --exclude "*" --include "*.flow"

The following screenshot shows an example of what the Amazon S3 sync process should look like; you will get a confirmation after all files are uploaded. You can adjust the preceding code to match your input folder and Amazon S3 location. If you don’t want to create a staging folder, skip the change directory (cd) command, and all flow files across your entire SageMaker Studio Classic file system will be copied to Amazon S3, regardless of their original folder.
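For example, the following variant (a sketch, reusing the $target variable from the previous step) syncs every flow file under your home directory without a staging folder:

    # Sync all .flow files under the home directory, preserving subfolder paths
    aws s3 sync ~ "$target" --exclude "*" --include "*.flow"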

After you upload the files to Amazon S3, you can validate that they have been copied using the Amazon S3 console. In the following screenshot, we see the original three flow files, now in an S3 bucket.
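If you prefer the terminal to the console, you can also validate the upload with the AWS CLI, reusing the $target variable from the earlier step:

    # List the flow files now stored in the target S3 location
    aws s3 ls "$target"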

Import Data Wrangler flow files into SageMaker Canvas

To import the flow files into SageMaker Canvas, complete the following steps:

  1. In the SageMaker Canvas application, choose Data Wrangler in the navigation pane.
  2. Choose Import data flows.
  3. For Select a data source, choose Amazon S3.
  4. For Input S3 endpoint, enter the Amazon S3 location you used earlier to copy files from SageMaker Studio to Amazon S3, then choose Go. You can also navigate to the Amazon S3 location using the browser below.
  5. Select the flow files to import, then choose Import.

After you import the files, the SageMaker Data Wrangler page will refresh to show the newly imported files, as shown in the following screenshot.

Use SageMaker Canvas for data transformation with SageMaker Data Wrangler

Choose one of the flows (for this example, we choose titanic.flow) to launch the SageMaker Data Wrangler transformation.

Now you can add analyses and transformations to the data flow using a visual interface (Accelerate data preparation for ML in Amazon SageMaker Canvas) or natural language interface (Use natural language to explore and prepare data with a new capability of Amazon SageMaker Canvas).

When you’re happy with the data, choose the plus sign and choose Create model, or choose Export to export the dataset to build and use ML models.

Alternate migration method

This post has provided guidance on using Amazon S3 to migrate SageMaker Data Wrangler flow files from a SageMaker Studio Classic environment. The AWS documentation describes other methods for importing Data Wrangler flow files. For example, if your Studio Classic and Canvas applications share the same Amazon Elastic File System (Amazon EFS) storage volume, you will see a one-click import option for migrating your data flows from Data Wrangler in Studio Classic to Data Wrangler in SageMaker Canvas.

Alternatively, you can use your local machine to transfer the flow files: download individual flow files from the SageMaker Studio file browser to your local machine, then import them manually into SageMaker Canvas. There is no single right method; choose whichever you are most comfortable with.

Clean up

When you’re done, shut down any running SageMaker Data Wrangler applications in SageMaker Studio Classic. To save costs, you can also remove the flow files from the SageMaker Studio Classic file browser, which is backed by an Amazon EFS volume. You can also delete the intermediate files in Amazon S3; after the flow files are imported into SageMaker Canvas, the copies in Amazon S3 are no longer needed.
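As a sketch, assuming the same $target location used when copying the files, you can delete the intermediate flow files from Amazon S3 with the AWS CLI:

    # Remove only the migrated .flow files; leave other objects in the bucket intact
    aws s3 rm "$target" --recursive --exclude "*" --include "*.flow"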

You can log out of SageMaker Canvas when you’re done, then relaunch it when you’re ready to use it again.

Conclusion

Migrating your existing SageMaker Data Wrangler flows to SageMaker Canvas is a straightforward process that lets you reuse the data preparation flows you’ve already developed while taking advantage of the end-to-end, low-code/no-code ML workflow of SageMaker Canvas. By following the steps outlined in this post, you can seamlessly transition your data wrangling artifacts to the SageMaker Canvas environment, streamlining your ML projects and enabling business analysts and non-technical users to build and deploy models more efficiently.

Start exploring SageMaker Canvas today and experience the power of a unified platform for data preparation, model building, and deployment!


About the Authors

Charles Laughlin is a Principal AI Specialist at Amazon Web Services (AWS). Charles holds an MS in Supply Chain Management and a PhD in Data Science. Charles works in the Amazon SageMaker service team, where he brings research and the voice of the customer to inform the service roadmap. In his work, he collaborates daily with diverse AWS customers to help transform their businesses with cutting-edge AWS technologies and thought leadership.

Dan Sinnreich is a Sr. Product Manager for Amazon SageMaker, focused on expanding no-code / low-code services. He is dedicated to making ML and generative AI more accessible and applying them to solve challenging problems. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.

Huong Nguyen is a Sr. Product Manager at AWS. She is leading the ML data preparation for SageMaker Canvas and SageMaker Data Wrangler, with 15 years of experience building customer-centric and data-driven products.

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He began learning AI/ML during his later years of university and has been in love with it ever since.