Guidance for Deduplicating Syndicated Data on AWS

This Guidance shows how large enterprise customers can efficiently identify and manage duplicate datasets distributed across multiple AWS accounts. It helps these users to search and locate identical or highly similar data tables, allowing for the identification of redundant data assets. This enables procurement teams to easily access a comprehensive, searchable data inventory, thereby avoiding the unnecessary purchase of the same datasets multiple times. Through these capabilities, this Guidance helps organizations optimize their data management practices and drive cost savings through the elimination of data duplication.

Please note: see the Disclaimer at the end of this Guidance.

Architecture Diagram

Guidance Architecture Diagram for Deduplicating Syndicated Data on AWS

Step 1
Each customer account that contributes data (Accounts A, B, and C) must have at least one table registered in the AWS Glue Data Catalog, created either by an AWS Glue crawler or through a manual process.

Step 2
A scheduled AWS Lambda function, the DataLoader, runs in each contributing account at a fixed time every day. It extracts table information from the Data Catalog, transforms it into a metadata document, and sends it to Amazon OpenSearch Service in the reporting account, which is located in the us-east-1 AWS Region.
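As a minimal sketch of the transform in this step, the function below flattens one table entry (in the shape returned by the AWS Glue `get_tables` API) into a metadata document. The document field names (`id`, `account_id`, `columns`, and so on) are assumptions made for illustration, not the schema used by the sample code.

```python
from typing import Any


def to_metadata_document(table: dict[str, Any], account_id: str, region: str) -> dict[str, Any]:
    """Flatten one Glue table entry into a metadata document (hypothetical schema)."""
    storage = table.get("StorageDescriptor", {})
    # Partition keys are columns too, so include them after the regular columns.
    columns = storage.get("Columns", []) + table.get("PartitionKeys", [])
    database = table.get("DatabaseName", "")
    return {
        # Assumed ID format: unique per account, Region, database, and table.
        "id": f"{account_id}:{region}:{database}.{table['Name']}",
        "account_id": account_id,
        "region": region,
        "database": database,
        "table": table["Name"],
        "location": storage.get("Location", ""),
        # Lowercase the names so that casing differences do not hide duplicates.
        "columns": [column["Name"].lower() for column in columns],
    }


example_table = {
    "Name": "weekly_sales",
    "DatabaseName": "syndicated",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/weekly_sales/",
        "Columns": [{"Name": "UPC", "Type": "string"}, {"Name": "store_id", "Type": "int"}],
    },
    "PartitionKeys": [{"Name": "week", "Type": "string"}],
}
example_document = to_metadata_document(example_table, "111122223333", "us-west-2")
```

In a deployed DataLoader, the table entries would come from paginating the Glue `get_tables` call, and each document would then be indexed into the reporting account's OpenSearch Service domain.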

Step 3
The raw metadata documents collected from all accounts and Regions are read from OpenSearch.

Step 4
A vector embedding is generated for each table from its column names using the Amazon Bedrock Titan Text Embeddings G1 model.
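The embedding call can be sketched as follows, assuming the Titan Text Embeddings G1 model ID `amazon.titan-embed-text-v1` and its JSON request body with a single `inputText` field. Joining the sorted, lowercased column names into one string is an assumption about how the input text might be built; the sample code may construct it differently. The Bedrock runtime client is passed in so that the function can be exercised without AWS credentials.

```python
import json
from typing import Any

# Model ID for Titan Text Embeddings G1; confirm it is enabled in your Region.
MODEL_ID = "amazon.titan-embed-text-v1"


def columns_to_text(columns: list[str]) -> str:
    """Build a canonical input string (assumed format) from column names."""
    # Deduplicate and sort so column order does not change the embedding input.
    return ", ".join(sorted({c.strip().lower() for c in columns if c.strip()}))


def embed_columns(bedrock_runtime: Any, columns: list[str]) -> list[float]:
    """Request one embedding vector for a table's column names."""
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": columns_to_text(columns)}),
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

With real credentials, the client would be created with `boto3.client("bedrock-runtime")` and passed as the first argument.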

Step 5
A K-Means model is trained using Amazon SageMaker and then deployed to an Amazon SageMaker Serverless Inference endpoint.

Step 6
Each table is labeled with a cluster using the deployed K-Means model, and the vector-enriched dataset is stored back in OpenSearch.
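To illustrate what labeling means here, the sketch below assigns a vector to its nearest centroid, which is the prediction rule of a trained K-Means model. It runs locally on hypothetical two-dimensional centroids and does not call the SageMaker endpoint; in this Guidance the equivalent computation happens inside the deployed model.

```python
import math


def nearest_cluster(vector: list[float], centroids: list[list[float]]) -> tuple[int, float]:
    """Return (cluster label, Euclidean distance) for the closest centroid."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    label = min(range(len(centroids)), key=distances.__getitem__)
    return label, distances[label]


# Hypothetical centroids; a trained model would supply one per cluster.
example_centroids = [[0.0, 0.0], [10.0, 10.0]]
label_a, distance_a = nearest_cluster([1.0, 1.0], example_centroids)
label_b, distance_b = nearest_cluster([9.0, 8.0], example_centroids)
```

Tables whose embeddings fall in the same cluster become candidates for duplicate review, so the label acts as a coarse grouping key stored alongside the vector.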

Step 7
A Lambda function responsible for incremental updates runs daily. It queries the raw metadata documents and identifies the changes (the delta) since the last data enrichment.

Step 8
The delta identified in the raw metadata documents is sent to Amazon Simple Queue Service (Amazon SQS).
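One possible way to compute and queue the delta is sketched below. It assumes that each enriched document stores a fingerprint of the raw document it was built from, which is a hypothetical design and not something this Guidance specifies. The batching reflects the Amazon SQS limit of 10 entries per `send_message_batch` call.

```python
import hashlib
import json


def fingerprint(doc: dict) -> str:
    """Hash the parts of a raw document that affect enrichment (assumed: name and columns)."""
    canonical = json.dumps({"table": doc["table"], "columns": sorted(doc["columns"])}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_delta(raw_docs: dict[str, dict], enriched_fingerprints: dict[str, str]) -> list[str]:
    """Return IDs of raw documents that are new or changed since the last enrichment."""
    return [
        doc_id
        for doc_id, doc in raw_docs.items()
        if enriched_fingerprints.get(doc_id) != fingerprint(doc)
    ]


def to_sqs_batches(doc_ids: list[str], batch_size: int = 10) -> list[list[dict]]:
    """Group document IDs into SQS batch entries; entry Ids must be unique within a batch."""
    entries = [
        {"Id": str(index), "MessageBody": json.dumps({"doc_id": doc_id})}
        for index, doc_id in enumerate(doc_ids)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
```

Each batch could then be sent with the SQS client's `send_message_batch(QueueUrl=..., Entries=batch)`, where the queue URL is specific to your deployment.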

Step 9
Messages arriving in the Amazon SQS queue trigger the Data Enrichment Lambda function.
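A minimal sketch of an SQS-triggered handler follows. It assumes each message body is JSON with a `doc_id` field (a hypothetical message format) and that the event source mapping has `ReportBatchItemFailures` enabled, so that only failed messages are retried. The `enrich_fn` callable stands in for the embedding and labeling work and is not part of the original sample code.

```python
import json
from typing import Any, Callable


def process_records(event: dict[str, Any], enrich_fn: Callable[[str], None]) -> dict[str, Any]:
    """Enrich every document named in an SQS event and report per-message failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            message = json.loads(record["body"])
            enrich_fn(message["doc_id"])
        except Exception:
            # Returning the message ID asks Lambda to make only this message visible again.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Lambda entry point; replace the placeholder with the real enrichment routine."""
    def enrich_placeholder(doc_id: str) -> None:
        print(f"would enrich {doc_id}")

    return process_records(event, enrich_placeholder)
```

Reporting failures per message avoids reprocessing an entire batch when a single document cannot be enriched.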

Step 10
Vector embeddings are generated for each table based on the column names using the Titan Text Embeddings G1 model.

Step 11
Each table is labeled using the deployed K-Means model.

Step 12
The vector-enriched dataset is stored back in OpenSearch.

Get Started

Deploy this Guidance

Sample code

Use sample code to deploy this Guidance in your AWS account

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

This Guidance is designed to be fully serverless, reducing the operational overhead and complexity associated with maintaining infrastructure. In addition, the use of Lambda, OpenSearch, and other managed services allows the system to scale automatically without manual intervention. Furthermore, this Guidance outlines a systematic approach to handling data updates and changes, with a data enrichment building block that supports automated, scheduled, and event-driven updates.

Read the Operational Excellence whitepaper

Security

This Guidance restricts access to the data stored in OpenSearch by granting permissions only to allowlisted users, roles, or principals. Furthermore, this Guidance defines the access control and authorization mechanisms for both administrative and remote users. This includes specifying the appropriate permissions and privileges required for different user roles to access and interact with the various components of the system. For example, administrative users may be granted full control over the configuration and management of the Guidance, while remote users could be limited to read-only access or specific data querying capabilities.

Read the Security whitepaper

Reliability

The serverless architecture of this Guidance, with its inherent capability to automatically scale resources and self-heal, promotes the overall reliability of the system. Additionally, the use of Amazon SQS to manage data updates and changes helps ensure message durability and delivery. Furthermore, this Guidance provides the ability to incrementally add new AWS accounts and Regions, which supports the scalability and fault tolerance of the overall system.

Read the Reliability whitepaper

Performance Efficiency

The use of OpenSearch, a fully managed service, as well as the adoption of vector databases, helps ensure efficient query performance and data retrieval capabilities within this Guidance. Furthermore, this Guidance uses K-Means clustering to group similar data tables, which can enhance the performance of similarity searches.
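The following sketch illustrates why cluster labels can speed up similarity search: candidates are first narrowed to the query table's cluster, and only those are scored by cosine similarity. The record fields (`name`, `cluster`, `vector`) and the 0.9 threshold are assumptions for illustration; in this Guidance the vector search itself is performed by OpenSearch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_tables(query: dict, tables: list[dict], threshold: float = 0.9) -> list[tuple[str, float]]:
    """Rank tables in the query's cluster whose similarity meets the threshold."""
    # Filtering on the cluster label first avoids scoring every table in the inventory.
    candidates = [
        t for t in tables
        if t["cluster"] == query["cluster"] and t["name"] != query["name"]
    ]
    scored = [(t["name"], cosine_similarity(query["vector"], t["vector"])) for t in candidates]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

The trade-off is that near-duplicates assigned to different clusters are missed by the filter, so the cluster count chosen at training time affects recall.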

The serverless architecture of this Guidance, combined with the use of managed services such as Lambda and Amazon SageMaker, helps optimize resource utilization and reduce the need for manual performance tuning.

Read the Performance Efficiency whitepaper

Cost Optimization

The serverless architecture of this Guidance, with its pay-as-you-go pricing model, can help reduce the overall cost of running the system, as resources are only consumed when needed. Additionally, the use of managed services, such as OpenSearch and SageMaker, can help organizations avoid the overhead associated with managing and maintaining the underlying infrastructure.

Read the Cost Optimization whitepaper

Sustainability

Through right-sized, transient resources that avoid excess idling, this Guidance minimizes energy consumption and hardware waste. For example, rather than pre-provisioning servers that run continually even when unused, Lambda functions are invoked on demand, only when needed. Each function is individually configured with the amount of memory its task requires (Lambda allocates CPU in proportion to the configured memory), which avoids over-provisioning. By allocating compute when workloads arrive and releasing it after use, Lambda eliminates the resource waste of idle servers.

Read the Sustainability whitepaper

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
