This Guidance shows how large enterprise customers can identify and manage duplicate datasets distributed across multiple AWS accounts. Users can search for identical or highly similar data tables, which makes redundant data assets easy to spot. Procurement teams get a comprehensive, searchable data inventory and can avoid purchasing the same dataset more than once. Together, these capabilities help organizations optimize their data management practices and reduce costs by eliminating data duplication.
Architecture Diagram
Step 1
Each customer account that contributes data (Accounts A, B, and C) must have at least one table in the AWS Glue Data Catalog, created either by an AWS Glue crawler or manually.
Step 2
A scheduled AWS Lambda function, the DataLoader, runs in each external account at a fixed time every day. It extracts table information from the Data Catalog, transforms it into a metadata document, and sends it to Amazon OpenSearch Service (OpenSearch) in the reporting account, located in the us-east-1 AWS Region.
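As a rough illustration of the transform in this step, the sketch below flattens one table description, in the shape returned by the AWS Glue `get_tables` API, into a metadata document. The document field names and the `doc_id` scheme are assumptions made for this example, not the schema used by the Guidance.

```python
from datetime import datetime, timezone


def table_to_metadata_document(table, account_id, region):
    """Flatten one Glue table description into a metadata document.

    `table` is one item of the `TableList` returned by Glue `get_tables`.
    The output field names are hypothetical.
    """
    storage = table.get("StorageDescriptor", {})
    columns = storage.get("Columns", [])
    return {
        # Hypothetical ID scheme: unique across accounts and Regions.
        "doc_id": f"{account_id}:{region}:{table['DatabaseName']}:{table['Name']}",
        "account_id": account_id,
        "region": region,
        "database": table["DatabaseName"],
        "table": table["Name"],
        "column_names": [column["Name"] for column in columns],
        "column_types": [column.get("Type", "") for column in columns],
        "location": storage.get("Location", ""),
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the account ID and Region in every document lets the reporting account tell apart tables that share a name across accounts.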
Step 3
Raw data collected from all accounts and Regions is read from OpenSearch.
Step 4
Vector embeddings for each table are generated from its column names using the Amazon Titan Text Embeddings G1 model in Amazon Bedrock.
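A minimal sketch of this step follows, assuming the column names are joined into a single input string before embedding. The normalization (lowercasing, deduplicating, sorting) is an assumption for illustration. `bedrock_runtime` is expected to be a boto3 `bedrock-runtime` client, passed in so that the function can be exercised without AWS access.

```python
import io
import json

# Model ID of Titan Text Embeddings G1 in Amazon Bedrock.
TITAN_G1_MODEL_ID = "amazon.titan-embed-text-v1"


def column_text(column_names):
    """Build the text to embed (assumed normalization, not from the Guidance)."""
    # Sorting makes the text independent of column order in the table.
    return ", ".join(sorted({name.strip().lower() for name in column_names}))


def embed_columns(bedrock_runtime, column_names):
    """Return the embedding vector for a table's column names."""
    response = bedrock_runtime.invoke_model(
        modelId=TITAN_G1_MODEL_ID,
        body=json.dumps({"inputText": column_text(column_names)}),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream holding a JSON document.
    return json.loads(response["body"].read())["embedding"]
```

With real credentials, the client would typically be created with `boto3.client("bedrock-runtime")` in a Region where the model is enabled.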
Step 5
A K-Means model is trained using Amazon SageMaker and then deployed to an Amazon SageMaker Serverless Inference endpoint.
Step 6
Each table will be labeled using the deployed K-Means model, and the vector-enriched dataset will be stored back in OpenSearch.
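Conceptually, labeling a table with a trained K-Means model means assigning its embedding to the nearest cluster centroid. The plain-Python sketch below shows that assignment locally. In the Guidance the SageMaker endpoint performs this step, and the `embedding`, `cluster`, and `distance_to_cluster` field names are assumptions for this example.

```python
import math


def nearest_cluster(vector, centroids):
    """Return (cluster index, Euclidean distance) of the closest centroid."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    label = min(range(len(centroids)), key=distances.__getitem__)
    return label, distances[label]


def label_documents(documents, centroids):
    """Return copies of the documents enriched with their cluster label."""
    enriched = []
    for document in documents:
        label, distance = nearest_cluster(document["embedding"], centroids)
        enriched.append(
            {**document, "cluster": label, "distance_to_cluster": distance}
        )
    return enriched
```

Tables that land in the same cluster with a small distance to the centroid are candidates for closer comparison as possible duplicates.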
Step 7
A Lambda function responsible for incremental updates will execute daily to query the raw metadata documents and identify the changes (delta) since the last data enrichment.
Step 8
The delta identified in the raw metadata documents will be sent to Amazon Simple Queue Service (Amazon SQS).
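One possible way to compute the delta is to compare a fingerprint of each table's column names against the fingerprint recorded at the last enrichment. The sketch below assumes that approach. The Guidance does not specify the comparison method, and `doc_id` and `column_names` are hypothetical field names. The helper also groups messages into batches of at most 10 entries, the limit of the Amazon SQS `send_message_batch` API.

```python
import hashlib
import json


def fingerprint(column_names):
    """Order-independent hash of a table's column names."""
    joined = "\n".join(sorted(column_names))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_delta(raw_docs, enriched_fingerprints):
    """Return raw documents that are new or whose columns changed.

    `enriched_fingerprints` maps doc_id to the fingerprint stored at the
    last enrichment (hypothetical bookkeeping).
    """
    return [
        doc
        for doc in raw_docs
        if enriched_fingerprints.get(doc["doc_id"]) != fingerprint(doc["column_names"])
    ]


def to_sqs_batches(docs, batch_size=10):
    """Group documents into lists of SQS batch entries (max 10 per call)."""
    entries = [
        {"Id": str(index), "MessageBody": json.dumps(doc)}
        for index, doc in enumerate(docs)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
```

Each returned batch could then be passed as the `Entries` argument of `send_message_batch` on a boto3 SQS client, together with the queue URL.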
Step 9
Amazon SQS triggers the Data Enrichment Lambda function.
Step 10
Vector embeddings will be generated for each table based on the column names using the Titan Text Embeddings G1 model.
Step 11
Each table is labeled using the deployed K-Means model.
Step 12
The vector-enriched dataset will be stored back in OpenSearch.
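Steps 9 through 12 can be sketched as a Lambda handler that receives an SQS event and enriches each message. The `enrich` function here is a hypothetical placeholder for embedding, labeling, and writing back to OpenSearch. The return value uses Lambda's partial batch response format, which only takes effect if `ReportBatchItemFailures` is enabled on the event source mapping. That setting is an assumption and is not stated in the Guidance.

```python
import json


def enrich(document):
    """Hypothetical placeholder for steps 10 to 12: embed the column
    names, label the table, and store the result in OpenSearch."""
    raise NotImplementedError


def handler(event, context, enrich_fn=enrich):
    """Process SQS records and report only the failed ones for retry."""
    failures = []
    for record in event.get("Records", []):
        try:
            enrich_fn(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Reporting failures per message keeps one malformed document from forcing the whole batch to be reprocessed.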
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
This Guidance is fully serverless, which reduces the operational overhead and complexity of maintaining infrastructure. Lambda, OpenSearch, and the other managed services scale automatically without manual intervention. This Guidance also outlines a systematic approach to handling data updates and changes: a data enrichment building block facilitates automated, scheduled, and event-driven updates.
Security
This Guidance restricts access to the data stored in OpenSearch by granting permissions only to allowlisted users, roles, or principals. Furthermore, this Guidance defines the access control and authorization mechanisms for both administrative and remote users. This includes specifying the appropriate permissions and privileges required for different user roles to access and interact with the various components of the system. For example, administrative users may be granted full control over the configuration and management of the Guidance, while remote users could be limited to read-only access or specific data querying capabilities.
Reliability
The serverless architecture of this Guidance, with its inherent capability to automatically scale resources and self-heal, promotes the overall reliability of the system. Additionally, the use of Amazon SQS to manage data updates and changes helps ensure message durability and delivery. Furthermore, this Guidance provides the ability to incrementally add new AWS accounts and Regions, which supports the scalability and fault tolerance of the overall system.
Performance Efficiency
OpenSearch, a fully managed service, together with its vector database capabilities, helps ensure efficient query performance and data retrieval within this Guidance. Furthermore, this Guidance uses K-Means clustering to group similar data tables, which can enhance the performance of similarity searches.
The serverless architecture of this Guidance, combined with the use of managed services such as Lambda and Amazon SageMaker, helps optimize resource utilization and reduce the need for manual performance tuning.
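To illustrate how a similarity search over the enriched documents might look, the sketch below builds an OpenSearch k-NN query body, optionally restricted to one K-Means cluster. The `embedding` and `cluster` field names are assumptions, and the index is assumed to map `embedding` as a `knn_vector` field.

```python
def similar_tables_query(embedding, cluster=None, k=5):
    """Build an OpenSearch k-NN query body for tables similar to `embedding`.

    Field names are hypothetical. When `cluster` is given, results are
    limited to tables carrying that K-Means label.
    """
    knn = {"knn": {"embedding": {"vector": list(embedding), "k": k}}}
    if cluster is None:
        query = knn
    else:
        query = {
            "bool": {
                "must": [knn],
                "filter": [{"term": {"cluster": cluster}}],
            }
        }
    return {"size": k, "query": query}
```

In a Boolean query like this, the filter is applied to the nearest neighbors that were found, so a filtered search may return fewer than `k` hits.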
Cost Optimization
The serverless architecture of this Guidance, with its pay-as-you-go pricing model, can help reduce the overall cost of running the system, as resources are only consumed when needed. Additionally, the use of managed services, such as OpenSearch and SageMaker, can help organizations avoid the overhead associated with managing and maintaining the underlying infrastructure.
Sustainability
By using right-sized, transient resources that avoid excess idling, this Guidance minimizes energy consumption and hardware waste. For example, rather than pre-provisioning servers that run continually even when unused, Lambda functions are invoked on demand, only when needed. Each function is individually configured with the memory and CPU capacity required to complete its designated task, which avoids over-provisioning. By allocating just the right compute power when workloads arrive and releasing those resources after use, Lambda eliminates the waste of idle servers.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.