Why Amazon SageMaker Feature Store?
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used repeatedly by multiple teams, and feature quality is critical to model accuracy. In addition, when features used to train models offline in batch are also made available for real-time inference, it is hard to keep the offline and online copies synchronized. SageMaker Feature Store provides a secure and unified store to process, standardize, and use features at scale across the ML lifecycle.
Benefits of SageMaker Feature Store
Feature Management
Feature processing and ingestion
You can ingest data into SageMaker Feature Store from a variety of sources, such as application and service logs, clickstreams, sensors, and tabular data from Amazon S3, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake. Using feature processing, you can specify your batch data source and a feature transformation function (for example, a count of product views or time window aggregates), and SageMaker Feature Store transforms the data into ML features at ingestion time. With Amazon SageMaker Data Wrangler, you can publish features directly into SageMaker Feature Store. With the Apache Spark connector, you can batch ingest a high volume of data with a single line of code.
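As a rough illustration of what a feature transformation function might look like, the sketch below computes a 24-hour count of product views per product in plain Python. It does not use the SageMaker SDK; the function name `product_view_counts`, the event fields, and the output field `views_24h` are all hypothetical and chosen only for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def product_view_counts(events, window_end, window=timedelta(hours=24)):
    """Count 'view' events per product in the interval (window_end - window, window_end].

    Hypothetical transformation: raw clickstream events in, one feature record per product out.
    """
    window_start = window_end - window
    counts = defaultdict(int)
    for event in events:
        in_window = window_start < event["time"] <= window_end
        if event["type"] == "view" and in_window:
            counts[event["product_id"]] += 1
    # Stamp each feature record with the end of the window as its event time.
    return [
        {"product_id": pid, "views_24h": n, "event_time": window_end.isoformat()}
        for pid, n in sorted(counts.items())
    ]


# Illustrative raw events (made-up data).
events = [
    {"product_id": "p1", "type": "view", "time": datetime(2024, 1, 2, 10, 0)},
    {"product_id": "p1", "type": "view", "time": datetime(2024, 1, 2, 20, 0)},
    {"product_id": "p2", "type": "view", "time": datetime(2024, 1, 1, 9, 0)},       # outside the window
    {"product_id": "p2", "type": "purchase", "time": datetime(2024, 1, 2, 12, 0)},  # not a view
    {"product_id": "p2", "type": "view", "time": datetime(2024, 1, 2, 23, 0)},
]

records = product_view_counts(events, window_end=datetime(2024, 1, 3))
for record in records:
    print(record)
```

Each returned dictionary carries a record identifier and an event time alongside the computed value, which is the general shape a feature record needs before it is written to a feature group.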
Feature storage, catalog, search, and reuse
SageMaker Feature Store tags and indexes feature groups so they are easily discoverable through the visual interface of Amazon SageMaker Studio. Browsing the feature catalog allows teams to discover existing features they can confidently reuse and avoid duplication of pipelines. SageMaker Feature Store uses the AWS Glue Data Catalog by default, but allows you to use a different catalog if desired. You can also query features using familiar SQL with Amazon Athena or another query tool of your choice.
Feature consistency
SageMaker Feature Store supports offline storage for training and online storage for real-time inference. Training and inference are very different use cases, and the storage requirements are different for each. During training, models often use the complete data set and can take hours to complete, while inference needs to happen in milliseconds and usually uses a subset of the data. When the offline and online stores are used together, SageMaker Feature Store keeps the two datasets in sync. This is critical because divergence between them can negatively impact model accuracy.
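To make the offline/online distinction concrete, here is a toy in-memory model, assuming the online side keeps only the most recent record per identifier while the offline side keeps the full history. The class `ToyFeatureGroup` and its methods are hypothetical names for this sketch and are not the service's implementation or API.

```python
class ToyFeatureGroup:
    """Toy model of one write path feeding an online and an offline store."""

    def __init__(self):
        self.online = {}   # record_id -> latest record, for low-latency lookups
        self.offline = []  # append-only history, for training

    def put_record(self, record_id, event_time, features):
        record = {"record_id": record_id, "event_time": event_time, **features}
        # Every write lands in the offline history.
        self.offline.append(record)
        # The online store only advances when the record is not older than the current one.
        current = self.online.get(record_id)
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = record

    def get_record(self, record_id):
        return self.online.get(record_id)

    def latest_from_offline(self):
        """Recompute the latest record per id from history, to check the two views agree."""
        latest = {}
        for record in self.offline:
            current = latest.get(record["record_id"])
            if current is None or record["event_time"] >= current["event_time"]:
                latest[record["record_id"]] = record
        return latest


fg = ToyFeatureGroup()
fg.put_record("song-1", 2, {"rating": 4.5})
fg.put_record("song-1", 1, {"rating": 3.0})  # late-arriving, older record
fg.put_record("song-2", 1, {"rating": 5.0})

print(fg.get_record("song-1"))
print(len(fg.offline))
```

Because both views are fed by the same `put_record` call, the latest values derived from the history match what the online lookup returns, which is the property the paragraph above describes.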
Time travel
Data scientists may need to train models with the exact set of feature values from a specific time in the past without the risk of including data from beyond that time (also referred to as feature leakage), such as patient medical data before a diagnosis. SageMaker Feature Store Offline API supports point-in-time queries to retrieve the state of each feature at the historical time of interest.
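The idea behind a point-in-time query can be sketched in a few lines of plain Python: for a given record, return the most recent feature values whose event time is at or before a cutoff, so nothing from after the cutoff leaks into training. The helper `as_of`, the field names, and the sample data are hypothetical and only illustrate the concept; they are not the offline API.

```python
from datetime import date


def as_of(history, record_id, cutoff):
    """Return the latest record for record_id with event_time <= cutoff, or None."""
    candidates = [
        r for r in history
        if r["record_id"] == record_id and r["event_time"] <= cutoff
    ]
    return max(candidates, key=lambda r: r["event_time"], default=None)


# Made-up feature history; the last patient-1 row is recorded after the cutoff.
history = [
    {"record_id": "patient-1", "event_time": date(2023, 1, 10), "systolic_bp": 120},
    {"record_id": "patient-1", "event_time": date(2023, 3, 5), "systolic_bp": 135},
    {"record_id": "patient-1", "event_time": date(2023, 6, 1), "systolic_bp": 150},
    {"record_id": "patient-2", "event_time": date(2023, 2, 1), "systolic_bp": 110},
]

diagnosis_date = date(2023, 4, 1)
snapshot = as_of(history, "patient-1", diagnosis_date)
print(snapshot)
```

Here the June reading is excluded because it was recorded after the cutoff, and a record with no history before the cutoff yields `None` rather than a later value.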
Security and Governance
Lineage tracking
To enable feature reuse with confidence, data scientists need to know how features were built and which models and endpoints are using them. SageMaker Feature Store allows data scientists to track their features in Amazon SageMaker Studio with SageMaker Lineage. SageMaker Lineage lets you track scheduled pipeline executions, visualize upstream lineage to trace features back to their data sources, and view feature processing code, all in one environment.
ML operations
Feature stores are a key component in the MLOps lifecycle. They manage datasets and feature pipelines, speeding up data science tasks and eliminating the duplicate work of creating the same features multiple times. SageMaker Feature Store can be used as a standalone service or together with other SageMaker services in an integrated manner across the MLOps lifecycle.
Security and compliance
To support security and compliance needs, you may need granular control over how shared ML features are accessed. These needs often go beyond table and column-level access control to individual row-level access control. For example, you may want to let account representatives see rows from a sales table for only their accounts and mask the prefix of sensitive data like credit card numbers. SageMaker Feature Store together with AWS Lake Formation can be used to implement fine-grained access controls to protect feature store data and grant access based on role.
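The effect of row-level filtering combined with masking can be pictured with the small sketch below. It is only a conceptual illustration in plain Python: in practice such rules would be configured in AWS Lake Formation rather than written in application code, and the function names, field names, and sample rows here are hypothetical.

```python
def mask_card(number, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    digits = str(number)
    return "*" * max(len(digits) - visible, 0) + digits[-visible:]


def rows_for(user, rows):
    """Return only the rows owned by `user`, with the card number prefix masked."""
    return [
        {**row, "card_number": mask_card(row["card_number"])}
        for row in rows
        if row["account_rep"] == user
    ]


# Made-up sales rows.
sales = [
    {"account": "acme", "account_rep": "alice", "card_number": "4111111111111111"},
    {"account": "globex", "account_rep": "bob", "card_number": "5500000000000004"},
]

visible_rows = rows_for("alice", sales)
print(visible_rows)
```

The source rows are left unmodified; each caller receives a filtered, masked copy based on who they are.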