AWS Big Data Blog

Category: Best Practices

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

ANZ Institutional Division has transformed its data management approach by implementing a federated data platform based on data mesh principles. This shift aims to unlock untapped data potential, improve operational efficiency, and increase agility. The new strategy empowers domain teams to create and manage their own data products, treating data as a valuable asset rather than a byproduct. This post explores how the shift to a data product mindset is being implemented, the challenges faced, and the early wins that are shaping the future of data management in the Institutional Division.

Unlocking near real-time analytics with petabytes of transaction data using Amazon Aurora Zero-ETL integration with Amazon Redshift and dbt Cloud

In this post, we explore how to use Aurora MySQL-Compatible Edition Zero-ETL integration with Amazon Redshift and dbt Cloud to enable near real-time analytics. By using dbt Cloud for data transformation, data teams can focus on writing business rules to drive insights from their transaction data to respond effectively to critical, time sensitive events.

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of query engine such as rewrite, planning, scan execution and consuming AWS Glue Data Catalog column statistics. In this post, we highlight the performance improvements we observed using industry standard TPC-DS benchmarks. Overall execution time of TPC-DS 3 TB benchmark improved by 3x. Some of the queries in our benchmark experienced up to 12x speed up.

Differentiate generative AI applications with your data using AWS analytics and managed databases

While the potential of generative artificial intelligence (AI) is increasingly under evaluation, organizations are at different stages in defining their generative AI vision. In many organizations, the focus is on large language models (LLMs), and foundation models (FMs) more broadly. This is just the tip of the iceberg, because what enables you to obtain differential […]

Integrate sparse and dense vectors to enhance knowledge retrieval in RAG using Amazon OpenSearch Service

In this post, instead of using the BM25 algorithm, we introduce sparse vector retrieval. This approach offers improved term expansion while maintaining interpretability. We walk through the steps of integrating sparse and dense vectors for knowledge retrieval using Amazon OpenSearch Service and run some experiments on some public datasets to show its advantages.

Optimize cost and performance for Amazon MWAA

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With Amazon MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without managing the operational burden of scaling the infrastructure. In this post, we provide guidance […]

Reducing long-term logging expenses by 4,800% with Amazon OpenSearch Service

When you use Amazon OpenSearch Service for time-bound data like server logs, service logs, application logs, clickstreams, or event streams, storage cost is one of the primary drivers for the overall cost of your solution. Over the last year, OpenSearch Service has released features that have opened up new possibilities for storing your log data […]

OpenSearch optimized instance (OR1) is game changing for indexing performance and cost

Amazon OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search. In this post, we examine the OR1 instance type, an OpenSearch optimized instance introduced on November 29, 2023. OR1 is an instance type for Amazon OpenSearch Service that […]

Improve Apache Kafka scalability and resiliency using Amazon MSK tiered storage

Since the launch of tiered storage for Amazon Managed Streaming for Apache Kafka (Amazon MSK), customers have embraced this feature for its ability to optimize storage costs and improve performance. In previous posts, we explored the inner workings of Kafka, maximized the potential of Amazon MSK, and delved into the intricacies of Amazon MSK tiered […]

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and […]