This Guidance shows how large enterprises can identify and manage duplicate datasets distributed across multiple AWS accounts. It helps users search for identical or highly similar data tables so that redundant data assets can be identified. Procurement teams gain a comprehensive, searchable data inventory, which helps them avoid purchasing the same dataset more than once. Together, these capabilities help organizations optimize their data management practices and reduce costs by eliminating duplicate data.
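As a rough illustration of what "identical" and "highly similar" tables can mean in practice, the following minimal sketch fingerprints a table to catch exact copies and scores schema overlap to flag near-duplicates. It is not the method this Guidance implements; the function names `table_fingerprint` and `schema_similarity`, the normalization rules, and the use of Jaccard similarity are all assumptions made for illustration.

```python
import hashlib


def table_fingerprint(columns, rows):
    """Hash a table's normalized column names and rows.

    Two tables with the same columns (in the same order) and the same
    rows (in any order) produce the same fingerprint.
    """
    digest = hashlib.sha256()
    # Normalize column names so casing and stray whitespace do not matter.
    normalized = [c.strip().lower() for c in columns]
    digest.update("|".join(normalized).encode("utf-8"))
    # Sort the stringified rows so that row order does not matter.
    for row in sorted(tuple(str(v) for v in r) for r in rows):
        digest.update(b"\n" + "|".join(row).encode("utf-8"))
    return digest.hexdigest()


def schema_similarity(columns_a, columns_b):
    """Jaccard similarity of two column-name sets, from 0.0 to 1.0."""
    a = {c.strip().lower() for c in columns_a}
    b = {c.strip().lower() for c in columns_b}
    if not a and not b:
        return 1.0  # Assumption: two empty schemas count as identical.
    return len(a & b) / len(a | b)
```

Matching fingerprints would indicate an exact copy, while a high `schema_similarity` score would only suggest that two tables deserve a closer look.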

Please note: [Disclaimer]

Architecture Diagram

[Architecture diagram description]

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

  • This Guidance is designed to be fully serverless, reducing the operational overhead and complexity associated with maintaining infrastructure. In addition, the use of Lambda, OpenSearch, and other managed services allows the system to scale automatically without the need for manual intervention. Furthermore, this Guidance outlines a systematic approach to handling data updates and changes, with a systemic user enrichment building block facilitating automated, scheduled, and event-driven updates.

    Read the Operational Excellence whitepaper 
  • This Guidance restricts access to the data stored in OpenSearch by granting permissions only to allowlisted users, roles, or principals. Furthermore, this Guidance defines the access control and authorization mechanisms for both administrative and remote users. This includes specifying the appropriate permissions and privileges required for different user roles to access and interact with the various components of the system. For example, administrative users may be granted full control over the configuration and management of the Guidance, while remote users could be limited to read-only access or specific data querying capabilities.

    Read the Security whitepaper 
  • The serverless architecture of this Guidance, with its inherent capability to automatically scale resources and self-heal, promotes the overall reliability of the system. Additionally, the use of Amazon SQS to manage data updates and changes helps ensure message durability and delivery. Furthermore, this Guidance provides the ability to incrementally add new AWS accounts and Regions, which supports the scalability and fault tolerance of the overall system.

    Read the Reliability whitepaper 
  • The use of Amazon OpenSearch Service, a fully managed service, together with vector search, helps ensure efficient query performance and data retrieval within this Guidance. Furthermore, this Guidance uses K-Means clustering to group similar data tables, which can improve the performance of similarity searches.

    The serverless architecture of this Guidance, combined with the use of managed services such as Lambda and Amazon SageMaker, helps optimize resource utilization and reduce the need for manual performance tuning.

    Read the Performance Efficiency whitepaper 
  • The serverless architecture of this Guidance, with its pay-as-you-go pricing model, can help reduce the overall cost of running the system, as resources are only consumed when needed. Additionally, the use of managed services, such as OpenSearch and SageMaker, can help organizations avoid the overhead associated with managing and maintaining the underlying infrastructure.

    Read the Cost Optimization whitepaper 
  • By using right-sized, transient resources that avoid excess idling, this Guidance minimizes energy consumption and hardware waste. For example, rather than pre-provisioning servers that run continually even when unused, Lambda functions are invoked on demand only when needed. Each function is individually configured with the amount of memory required to complete its designated task (Lambda allocates CPU capacity in proportion to the configured memory), avoiding over-provisioning of resources. By allocating just the right compute power when workloads arrive and releasing those resources after use, Lambda avoids the resource waste of idle servers.

    Read the Sustainability whitepaper 
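The Performance Efficiency point above mentions grouping similar data tables with K-Means clustering to speed up similarity searches. The following is a minimal, standard-library sketch of that idea, assuming each table has already been converted to a numeric vector. The function names, the deterministic "first k points" initialization, and the 0.95 cosine threshold are assumptions made for illustration; the Guidance itself relies on managed services, and this is not their API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain K-Means. Assumption: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # Keep the old centroid if a cluster is empty.
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    # Recompute labels so they match the final centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def find_similar(query, names, points, centroids, labels, threshold=0.95):
    """Compare the query only against tables in its nearest cluster."""
    cluster = nearest(query, centroids)
    return [name for name, p, label in zip(names, points, labels)
            if label == cluster and cosine(query, p) >= threshold]
```

Because `find_similar` scores only the members of one cluster, a query avoids a full scan of every table vector; the trade-off is that a near-duplicate assigned to a neighboring cluster can be missed.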

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
