Posted On: Oct 21, 2022
Today, we are excited to announce support for dimensionality reduction using principal component analysis (PCA) in Amazon SageMaker Data Wrangler. Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With Data Wrangler, you can simplify data preparation and feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. PCA is a popular statistical technique for analyzing datasets with a large number of dimensions per observation and for reducing that dimensionality for use with popular ML algorithms such as XGBoost and random forest. Previously, to perform PCA on a dataset, data scientists had to find appropriate libraries and write code to reduce high-dimensional data.
With support for PCA in Data Wrangler, you can now reduce the dimensionality of a high-dimensional dataset in only a few clicks. You can access PCA by selecting Dimensionality Reduction from the “Add step” workflow. The built-in column selector lets you auto-select all numeric columns, and you can then specify the number of principal components to retain. Alternatively, you can specify a variance threshold percentage, and Data Wrangler will automatically determine the number of components to retain in your transformed dataset.
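For readers who want to see what the two options amount to, here is a minimal sketch of PCA written by hand with NumPy. It is not Data Wrangler's implementation: the function name pca_reduce, its parameters, and the synthetic data are hypothetical and chosen only for illustration. The sketch centers the data, takes a singular value decomposition, and either keeps a fixed number of components or keeps the smallest number whose cumulative explained variance reaches a threshold.

```python
import numpy as np


def pca_reduce(X, n_components=None, variance_threshold=None):
    """Project X (rows = observations) onto its leading principal components.

    Hypothetical helper for illustration. Pass exactly one of:
      n_components        -- number of components to keep (int)
      variance_threshold  -- fraction of variance to retain, in (0, 1]
    """
    if (n_components is None) == (variance_threshold is None):
        raise ValueError("pass exactly one of n_components or variance_threshold")

    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)

    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

    # Fraction of the total variance explained by each component.
    variances = singular_values ** 2
    explained_ratio = variances / variances.sum()

    if n_components is None:
        cumulative = np.cumsum(explained_ratio)
        # Smallest k whose cumulative explained variance reaches the threshold
        # (small tolerance guards against floating-point round-off).
        k = int(np.searchsorted(cumulative, variance_threshold - 1e-12)) + 1
        n_components = min(k, len(explained_ratio))

    reduced = centered @ vt[:n_components].T
    return reduced, explained_ratio[:n_components]


# Synthetic example (assumed data): 5 numeric columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

fixed, _ = pca_reduce(X, n_components=2)               # fixed component count
auto, ratio = pca_reduce(X, variance_threshold=0.95)   # variance threshold
print(fixed.shape, auto.shape, float(ratio.sum()))
```

The variance-threshold path is convenient when you do not know in advance how many components are meaningful; the fixed-count path is useful when a downstream step expects a specific number of input features.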
This feature is generally available at no additional charge in all AWS Regions that Data Wrangler currently supports. To get started with SageMaker Data Wrangler, read the AWS documentation.