Advanced (300) | AWS Big Data Blog

Build a data lakehouse in a hybrid Environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

This post discusses a decoupled approach of building a serverless data lakehouse using AWS Cloud-centered services, including Amazon EMR Serverless, Amazon Athena, Amazon Simple Storage Service (Amazon S3), Apache DolphinScheduler (an open source data job scheduler) as well as PingCAP TiDB, a third-party data warehouse product that can be deployed either on premises or on the cloud or through a software as a service (SaaS).

Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator

Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express based cluster. In this post, we discuss how you should plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so using them on an existing cluster is not possible. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising of Express brokers.

Generate vector embeddings for your data using AWS Lambda as a processor for Amazon OpenSearch Ingestion

In this post, we demonstrate how to use the OpenSearch Ingestion’s Lambda processor to generate embeddings for your source data and ingest them to an OpenSearch Serverless vector collection. This solution uses the flexibility of OpenSearch Ingestion pipelines with a Lambda processor to dynamically generate embeddings.

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

In this post, we explore new features of the AWS Glue Data Catalog, which now supports improved automatic compaction of Iceberg tables for streaming data, making it straightforward for you to keep your transactional data lakes consistently performant. Enabling automatic compaction on Iceberg tables reduces metadata overhead on your Iceberg tables and improves query performance

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

In this post, we demonstrate how to implement a custom subscription workflow using Amazon DataZone, Amazon EventBridge, and AWS Lambda to automate the fulfillment process for unmanaged data assets, such as unstructured data stored in Amazon S3. This solution enhances governance and simplifies access to unstructured data assets across the organization.

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

In this post, we explore the process of federating into AWS using Microsoft Entra ID and AWS Identity and Access Management (IAM), and how to restrict access to datasets based on permissions linked to AD groups. We guide you through the setup process, and demonstrate how to seamlessly connect to the Redshift Query Editor while making sure data access permissions are accurately enforced based on your Microsoft Entra ID groups.

Introducing the HubSpot connector for AWS Glue

This post introduces the new HubSpot managed connector for AWS Glue, and demonstrates how you can integrate HubSpot data into your existing data lake on AWS. By consolidating HubSpot data with data from your AWS accounts and from other SaaS services, you can enhance, analyze, and optionally write the data back to HubSpot, creating a seamless and integrated data experience.

Develop a business chargeback model within your organization using Amazon Redshift multi-warehouse writes

Now, we are announcing general availability (GA) of Amazon Redshift multi-data warehouse writes through data sharing. This new capability allows you to scale your write workloads and achieve better performance for extract, transform, and load (ETL) workloads by using different warehouses of different types and sizes based on your workload needs.

Run Apache XTable in AWS Lambda for background conversion of open table formats

In this post, we explore how Apache XTable, combined with the AWS Glue Data Catalog, enables background conversions between open table formats residing on Amazon S3-based data lakes, with minimal to no changes to existing pipelines, in a scalable and cost-effective way.

Manage access controls in generative AI-powered search applications using Amazon OpenSearch Service and Amazon Cognito

In this post, we show you how to manage user access to enterprise documents in generative AI-powered tools according to the access you assign to each persona. This post illustrates how to build a document search RAG solution that makes sure only authorized users can access and interact with specific documents based on their roles, departments, and other relevant attributes. It combines OpenSearch Service and Amazon Cognito custom attributes to make a tag-based access control mechanism that makes it straightforward to manage at scale.

Select your cookie preferences

AWS Big Data Blog

Category: Advanced (300)