AWS Big Data Blog

Power data ingestion into Splunk using Amazon Data Firehose

With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.

Best practices for querying Apache Iceberg data with Amazon Redshift

In this post, we discuss the best practices that you can follow while querying Apache Iceberg data with Amazon Redshift

IPv6 addressing with Amazon Redshift

As we witness the gradual transition from IPv4 to IPv6, AWS continues to expand its support for dual-stack networking across its service portfolio. In this post, we show how you can migrate your Amazon Redshift Serverless workgroup from IPv4-only to dual-stack mode, so you can make your data warehouse future ready.

Reference guide for building a self-service analytics solution with Amazon SageMaker

In this post, we show how to use Amazon SageMaker Catalog to publish data from multiple sources, including Amazon S3, Amazon Redshift, and Snowflake. This approach enables self-service access while ensuring robust data governance and metadata management.

Introducing the Apache Spark troubleshooting agent for Amazon EMR and AWS Glue

In this post, we show you how the Apache Spark troubleshooting agent helps analyze Apache Spark issues by providing detailed root causes and actionable recommendations. You’ll learn how to streamline your troubleshooting workflow by integrating this agent with your existing monitoring solutions across Amazon EMR and AWS Glue.

Introducing Apache Spark upgrade agent for Amazon EMR

In this post, you learn how to assess your existing Amazon EMR Spark applications, use the Spark upgrade agent directly from the Kiro IDE, upgrade a sample e-commerce order analytics Spark application project (including build configs, source code, tests, and data quality validation), and review code changes before rolling them out through your CI/CD pipeline.

Accelerate Apache Hive read and write on Amazon EMR using enhanced S3A

In this post, we demonstrate how Apache Hive on Amazon EMR 7.10 delivers significant performance improvements for both read and write operations on Amazon S3.

Amazon EMR HBase on Amazon S3 transitioning to EMR S3A with comparable EMRFS performance

Starting with version 7.10, Amazon EMR is transitioning from EMR File System (EMRFS) to EMR S3A as the default file system connector for Amazon S3 access. This transition brings HBase on Amazon S3 to a new level, offering performance parity with EMRFS while delivering substantial improvements, including better standardization, improved portability, stronger community support, improved performance through non-blocking I/O, asynchronous clients, and better credential management with AWS SDK V2 integration. In this post, we discuss this transition and its benefits.

How Socure achieved 50% cost reduction by migrating from self-managed Spark to Amazon EMR Serverless

Socure is one of the leading providers of digital identity verification and fraud solutions. Socure’s data science environment includes a streaming pipeline called Transaction ETL (TETL), built on OSS Apache Spark running on Amazon EKS. TETL ingests and processes data volumes ranging from small to large datasets while maintaining high-throughput performance. In this post, we show how Socure was able to achieve 50% cost reduction by migrating the TETL streaming pipeline from self-managed spark to Amazon EMR serverless.

How Bayer transforms Pharma R&D with a cloud-based data science ecosystem using Amazon SageMaker

In this post, we discuss how Bayer AG used the next generation of Amazon SageMaker to build a cloud-based Pharma R&D Data Science Ecosystem (DSE) that unified data ingestion, storage, analytics, and AI/ML workflows.