Intermediate (200) | AWS Big Data Blog

Accelerate Spark on EMR Serverless with larger workers and shuffle-optimized disks

Amazon EMR Serverless now supports a 32 vCPU / 244 GB worker configuration for the most demanding Spark jobs. Across 126 TPC-DS and TPC-H queries, larger workers delivered an average 29% faster query execution and 29% lower cost, with the biggest gains on shuffle-heavy, multi-table join queries.

Build a contract compliance search system with Amazon OpenSearch

In this post, you build a contract compliance search system that combines semantic search with semantic highlighting in Amazon OpenSearch Service. You deploy the solution using two AWS CloudFormation stacks, test it with synthetic contract documents, and see how a single query surfaces both the right contracts and the right clauses within them.

High-performance Remote Shuffle Service on Amazon EMR with Apache Celeborn

In this post, we show how Apache Celeborn resolves this trade-off for Amazon EMR on EKS and Amazon EMR on EC2, improving job reliability while unlocking additional cost savings.

Introducing Apache Spark Connect support in AWS Glue interactive sessions

Apache Spark Connect bridges the gap between these two worlds: you develop in local Python, but execute on AWS Glue against actual data. Today, AWS Glue interactive sessions support Spark Connect natively. You can connect from any environment that supports the PySpark remote() API, including VS Code, PyCharm, Amazon SageMaker Unified Studio notebooks, and standalone Python applications. You don’t need to install specialized kernels or manage cluster infrastructure.

Deploy modern data platforms in minutes with MDAA

In this post, we explore how MDAA transforms data architecture development from months of manual coding to production-ready deployment through configuration-driven infrastructure and embedded governance, examine a real customer transformation, and provide a clear implementation pathway for your own data modernization journey.

Amazon Redshift RG: Faster and lower cost, Graviton-powered

In this post, we describe the innovations that make RG instances so much faster. We also share benchmark results showing that RG delivers up to 4.2x better price-performance than other leading data warehouses.

Scale analytics with Amazon Redshift multi-warehouse enhancements

In this post, we introduce new Amazon Redshift capabilities that improve multi-warehouse operation and scaling: remote materialized view (MV) operations, remote table DDL support, and concurrency scaling enhancements for zero-ETL and S3 event integration. These features help you build more scalable, performant decentralized analytics architectures on Amazon Redshift.

Optimize your Tableau integration with Amazon Redshift Serverless

In this post, we show how to use Tableau’s Relationships and the Amazon Redshift Serverless architecture to deliver sub-second insights while getting the most out of every Redshift Processing Unit (RPU). We cover five key areas: data model architecture for optimal query performance, security configuration and access control, performance optimization through smart configuration, cost management strategies, and query and join optimization techniques.

Detecting fraud patterns across Snowflake and AWS using SageMaker Data Agent

Amazon SageMaker Data Agent launches three new capabilities in Amazon SageMaker Unified Studio notebooks: SQL analytics on Snowflake data sources, materialized view management, and interactive charting. Practitioners can use them together to query Snowflake alongside AWS data, pre-compute and schedule repeated aggregations, and create interactive visualizations from natural language prompts in a single notebook, without writing boilerplate code or switching tools. In this post, we describe the challenges these capabilities address, introduce each one, and walk through a fraud analytics scenario that demonstrates them working together in an end-to-end investigation workflow.

Automating IT support with AI: How Nexthink uses OpenSearch Service to power self-service issue resolution

In this post, we explore how Nexthink combined Amazon OpenSearch Service vector search, Amazon Bedrock, and infrastructure as code to power the Spark agent’s retrieval layer.
