General and Streaming ETL Concepts
Q: What is Streaming ETL?
Streaming ETL is the processing and movement of real-time data from one place to another. ETL is short for the database functions extract, transform, and load. Extract refers to collecting data from some source. Transform refers to any processes performed on that data. Load refers to sending the processed data to a destination, such as a warehouse, a data lake, or an analytical tool.
Q: What is Amazon Data Firehose?
Data Firehose is a streaming ETL solution. It is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Snowflake, Apache Iceberg tables and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.
Q: What is a source in Firehose?
A source is where your streaming data is continuously generated and captured. For example, a source can be a logging server running on Amazon EC2 instances, an application running on mobile devices, or a sensor on an IoT device. You can connect your sources to Firehose using any of the following:
- Amazon Data Firehose API, which is available through the AWS SDK for Java, .NET, Node.js, Python, or Ruby.
- Kinesis Data Streams, where Firehose reads data from an existing Kinesis data stream and loads it into Firehose destinations.
- Amazon MSK, where Firehose reads data from an existing Amazon MSK cluster and loads it into Amazon S3 buckets.
- Natively supported AWS services such as Amazon CloudWatch, AWS EventBridge, AWS IoT, or AWS Pinpoint. For the complete list, see the Amazon Data Firehose developer guide.
- Kinesis Agent, a stand-alone Java software application that continuously monitors a set of files and sends new data to your stream.
- Fluent Bit, an open source log processor and forwarder.
- AWS Lambda, a serverless compute service that lets you run code without provisioning or managing servers. You can write a Lambda function that sends traffic from S3 or DynamoDB to Firehose based on a triggered event.
Q: What is a destination in Firehose?
A destination is the data store where your data will be delivered. Firehose currently supports Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Snowflake, Apache Iceberg tables, Splunk, Datadog, New Relic, Dynatrace, Sumo Logic, LogicMonitor, MongoDB, and HTTP endpoint as destinations.
Q: What does Firehose manage on my behalf?
Data Firehose manages all underlying infrastructure, storage, networking, and configuration needed to capture and load your data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Snowflake, Apache Iceberg tables, or Splunk. You do not have to worry about provisioning, deployment, or ongoing maintenance of the hardware and software, or write any other application to manage this process. Data Firehose also scales elastically without requiring any intervention or associated developer overhead. Moreover, Data Firehose synchronously replicates data across three facilities in an AWS Region, providing high availability and durability for the data as it is transported to the destinations.
Q: How do I use Firehose?
After you sign up for Amazon Web Services, you can start using Firehose with the following steps:
- Create a Firehose stream through the Firehose Console or the CreateDeliveryStream operation. You can optionally configure an AWS Lambda function in your Firehose stream to prepare and transform the raw data before loading the data.
- Configure your data producers to continuously send data to your Firehose stream using the Amazon Kinesis Agent or the Firehose API.
- Firehose automatically and continuously loads your data to the destinations you specify.
Q: What is a Firehose stream in Firehose?
A Firehose stream is the underlying entity of Firehose. You use Firehose by creating a Firehose stream and then sending data to it. You can create a Firehose stream through the Firehose Console or the CreateDeliveryStream operation. For more information, see Creating a Firehose stream.
Q: What is a record in Firehose?
A record is the data of interest your data producer sends to a Firehose stream. The maximum size of a record (before Base64-encoding) is 1024 KB if your data source is Direct PUT or Kinesis Data Streams. The maximum size of a record (before Base64-encoding) is 10 MB if your data source is Amazon MSK.
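As a rough illustration of these limits, a producer could check the payload size before sending. This is only a sketch: the source labels used as dictionary keys are hypothetical, and it assumes 1 KB = 1024 bytes and 1 MB = 1024 KB. The size is measured on the raw bytes, because Base64 encoding makes the payload larger.

```python
import base64

# Limits quoted in the answer above, measured before Base64 encoding.
# The key names are hypothetical labels, not Firehose identifiers.
MAX_RECORD_BYTES = {
    "DirectPut": 1024 * 1024,           # 1024 KB
    "KinesisDataStreams": 1024 * 1024,  # 1024 KB
    "MSK": 10 * 1024 * 1024,            # 10 MB
}


def fits_record_limit(payload: bytes, source: str) -> bool:
    """Return True if the raw (pre-Base64) payload is within the limit for the source."""
    return len(payload) <= MAX_RECORD_BYTES[source]


payload = b"x" * (1024 * 1024)
print(fits_record_limit(payload, "DirectPut"))        # exactly at the limit
print(len(base64.b64encode(payload)) > len(payload))  # Base64 output is larger than the input
```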
Q: What are the limits of Firehose?
For information about limits, see Amazon Data Firehose Limits in the developer guide.
Data Sources
Q: What programming languages or platforms can I use to access Firehose API?
Firehose API is available in Amazon Web Services SDKs. For a list of programming languages or platforms for Amazon Web Services SDKs, see Tools for Amazon Web Services.
Q: What is Amazon Kinesis Agent?
Kinesis Agent is a pre-built Java application that offers an easy way to collect and send data to your Firehose stream. You can install the agent on Linux-based server environments such as web servers, log servers, and database servers. The agent monitors certain files and continuously sends data to your Firehose stream. Amazon Kinesis Agent currently supports Amazon Linux, Red Hat Enterprise Linux, and Microsoft Windows. For more information, see Writing with Agents.
Q: Where do I get Amazon Kinesis Agent?
You can download and install Kinesis Agent using the following commands and links:
- On Amazon Linux: sudo yum install -y aws-kinesis-agent
- On Red Hat Enterprise Linux: sudo yum install -y https://s3.amazonaws.com/streaming-data-agent/aws-kinesis-agent-latest.amzn1.noarch.rpm
- From GitHub: awslabs/amazon-kinesis-agent
- On Windows: https://docs.aws.amazon.com/kinesis-agent-windows/latest/userguide/getting-started.html#getting-started-installation
Q: What is the difference between PutRecord and PutRecordBatch operations?
You can add data to a Firehose stream through Kinesis Agent or Firehose’s PutRecord and PutRecordBatch operations. The PutRecord operation sends a single data record in an API call, and the PutRecordBatch operation sends multiple data records in one API call. For more information, see PutRecord and PutRecordBatch.
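When many records are sent with PutRecordBatch, the producer typically has to split them into batches first. The sketch below only groups payloads locally and makes no API call. The per-call caps used here (500 records and 4 MB) and the {"Data": ...} entry shape are assumptions for illustration; check the PutRecordBatch reference for the actual values.

```python
from typing import Iterable, Iterator, List


def chunk_records(records: Iterable[bytes],
                  max_records: int = 500,              # assumed per-call record cap
                  max_bytes: int = 4 * 1024 * 1024     # assumed per-call size cap
                  ) -> Iterator[List[dict]]:
    """Group payloads into lists that could be passed as one batch request."""
    batch: List[dict] = []
    batch_bytes = 0
    for data in records:
        # Start a new batch when adding this record would exceed either cap.
        if batch and (len(batch) >= max_records or batch_bytes + len(data) > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"Data": data})
        batch_bytes += len(data)
    if batch:
        yield batch


batches = list(chunk_records(b"event-%d\n" % i for i in range(1200)))
print([len(b) for b in batches])  # [500, 500, 200]
```

A single record would go through PutRecord instead; batching mainly reduces the number of API calls for the same data.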
Q: How do I add data to my Firehose stream from my Amazon MSK?
When you create or update your Firehose stream through the AWS console or Firehose APIs, you can configure an Amazon MSK cluster/topic as the source of your Firehose stream. Once configured, Firehose will automatically read data from your MSK topic and load the data to the specified S3 bucket(s).
Q: What are the key benefits of Amazon MSK and Firehose Integration?
You can reduce your application operation complexity and overhead by transforming and loading streaming data sourced from your Amazon MSK topics into Amazon S3 with no code required. For example, with Amazon MSK and Firehose, you get built-in, no-code data conversion and transformation features such as Parquet/ORC format conversion, data buffering, and service-side data validation. You also get automatic delivery retries, data retention, auto scaling, and redundancy, so data is delivered reliably.
Q: What types of Amazon MSK endpoints are supported with Firehose?
To use this feature, MSK clusters must have public endpoints or private links enabled.
Q: Can you connect Firehose to Amazon MSK cluster in a different AWS account?
Yes, Firehose can connect to Amazon MSK clusters that are available in different AWS accounts. Firehose can also deliver to S3 buckets that belong to different accounts.
Q: What is the checkpoint time to start consuming data from Amazon MSK topic?
The checkpoint time to start consuming data from the Amazon MSK topic is the creation time of the Firehose stream. Firehose does not read from custom offset values.
Q: How do I add data to my Firehose stream from my Kinesis Data Stream?
When you create or update your Firehose stream through the AWS console or Firehose APIs, you can configure a Kinesis Data Stream as the source of your Firehose stream. Once configured, Firehose will automatically read data from your Kinesis Data Stream and load the data to the specified destinations.
Q: How often does Firehose read data from my Kinesis stream?
Firehose calls Kinesis Data Streams GetRecords() once every second for each Kinesis shard.
Q: From where does Firehose read data when my Kinesis Data Stream is configured as the source of my Firehose stream?
Firehose starts reading data from the LATEST position of your Kinesis Data Stream when it’s configured as the source of a Firehose stream. For more information about Kinesis Data Stream position, see GetShardIterator in the Kinesis Data Streams Service API Reference.
Q: Can I configure my Kinesis Data Stream to be the source of multiple Firehose streams?
Yes, you can. However, note that the GetRecords() call from Firehose is counted against the overall throttling limit of your Kinesis shard, so you need to plan your Firehose stream along with your other Kinesis applications to make sure you won’t get throttled. For more information, see Kinesis Data Streams Limits in the Kinesis Data Streams developer guide.
Q: Can I still add data to Firehose stream through Kinesis Agent or Firehose’s PutRecord and PutRecordBatch operations when my Kinesis Data Stream is configured as source?
No, you cannot. When a Kinesis Data Stream is configured as the source of a Firehose stream, Firehose’s PutRecord and PutRecordBatch operations will be disabled. You should add data to your Kinesis Data Stream through the Kinesis Data Streams PutRecord and PutRecords operations instead.
Q: How do I add data to my Firehose stream from AWS IoT?
You add data to your Firehose stream from AWS IoT by creating an AWS IoT action that sends events to your Firehose stream. For more information, see Writing to Amazon Data Firehose Using AWS IoT in the Firehose developer guide.
Q: How can I stream my VPC flow logs to Firehose?
When you create or update your Firehose stream through the AWS console or Firehose APIs, you can configure Direct PUT as the source of your Firehose stream. Once the stream is created, you can select it as your Firehose stream in the Vended Logs section of the VPC Flow Logs console.
Q: How do I add data to my Firehose stream from CloudWatch Logs?
You add data to your Firehose stream from CloudWatch Logs by creating a CloudWatch Logs subscription filter that sends events to your Firehose stream. For more information, see Using CloudWatch Logs Subscription Filters in the Amazon CloudWatch user guide.
Q: How do I add data to my Firehose stream from CloudWatch Events?
You add data to your Firehose stream from CloudWatch Events by creating a CloudWatch Events rule with your Firehose stream as the target. For more information, see Writing to Amazon Data Firehose Using CloudWatch Events in the Firehose developer guide.
Q: How do I add data to my Amazon Data Firehose stream from AWS EventBridge?
You add data to your Firehose stream from the AWS EventBridge console. For more information, see the AWS EventBridge documentation.
Q: What kind of encryption can I use?
Firehose allows you to encrypt your data after it’s delivered to your Amazon S3 bucket. While creating your Firehose stream, you can choose to encrypt your data with an AWS Key Management Service (KMS) key that you own. For more information about KMS, see AWS Key Management Service.
Q: What is the IAM role that I need to specify while creating a Firehose stream?
Firehose assumes the IAM role you specify to access resources such as your Amazon S3 bucket and Amazon OpenSearch domain. For more information, see Controlling Access with Firehose in the Firehose developer guide.
Data Transformation and Format Conversion
Q: How do I prepare and transform raw data in Firehose?
Firehose supports built-in data format conversion from raw or JSON data into formats like Apache Parquet and Apache ORC required by your destination data stores, without having to build your own data processing pipelines. Firehose also allows you to dynamically partition your streaming data before delivery to S3 using static or dynamically defined keys like “customer_id” or “transaction_id”. Firehose groups data by these keys and delivers it into key-unique S3 prefixes, making it easier for you to perform high-performance, cost-efficient analytics in S3 using Athena, EMR, and Redshift Spectrum.
In addition to the built-in format conversion option in Amazon Data Firehose, you can also use an AWS Lambda function to prepare and transform incoming raw data in your Firehose stream before loading it to destinations. You can configure an AWS Lambda function for data transformation when you create a new Firehose stream or when you edit an existing Firehose stream. Amazon has created multiple Lambda blueprints that you can choose from for a quick start. For the complete list, see the Amazon Data Firehose developer guide.
Q: What compression format can I use?
Amazon Data Firehose allows you to compress your data before delivering it to Amazon S3. The service currently supports GZIP, ZIP, and SNAPPY compression formats. Only GZIP is supported if the data is further loaded to Amazon Redshift.
Q: How does compression work when I use the CloudWatch Logs subscription feature?
You can use CloudWatch Logs subscription feature to stream data from CloudWatch Logs to Firehose. All log events from CloudWatch Logs are already compressed in gzip format, so you should keep Firehose’s compression configuration as uncompressed to avoid double-compression. For more information about CloudWatch Logs subscription feature, see Subscription Filters with Amazon Data Firehose in the Amazon CloudWatch Logs user guide.
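Because the log events arrive already gzip-compressed, any code that inspects them (for example, a Lambda transformation) has to decompress them first. The following is a minimal sketch, assuming the payload is handed over as a Base64 string wrapping gzip-compressed JSON; the sample field names are illustrative only and are built locally to exercise the decoder.

```python
import base64
import gzip
import json


def decode_cloudwatch_record(data_b64: str) -> dict:
    """Base64-decode, gunzip, and JSON-parse one subscription payload."""
    return json.loads(gzip.decompress(base64.b64decode(data_b64)))


# Build a sample payload locally (field names are illustrative assumptions).
sample = {"messageType": "DATA_MESSAGE", "logEvents": [{"id": "1", "message": "hello"}]}
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")

print(decode_cloudwatch_record(encoded)["logEvents"][0]["message"])  # hello
```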
Q: How do I return prepared and transformed data from my AWS Lambda function back to Amazon Data Firehose?
All transformed records from Lambda must be returned to Firehose with the following three parameters; otherwise, Firehose will reject the records and treat them as data transformation failure.
- recordId: Firehose passes a recordId along with each record to Lambda during the invocation. Each transformed record should be returned with the exact same recordId. Any mismatch between the original recordId and returned recordId will be treated as data transformation failure.
- result: The status of transformation result of each record. The following values are allowed for this parameter: “Ok” if the record is transformed successfully as expected. “Dropped” if your processing logic intentionally drops the record as expected. “ProcessingFailed” if the record is not able to be transformed as expected. Firehose treats returned records with “Ok” and “Dropped” statuses as successfully processed records, and the ones with “ProcessingFailed” status as unsuccessfully processed records when it generates SucceedProcessing.Records and SucceedProcessing.Bytes metrics.
- data: The transformed data payload after Base64 encoding.
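The three parameters above can be sketched as a small Python Lambda handler. This is an illustrative example, not a blueprint: it assumes the invocation event carries a "records" list whose items have "recordId" and Base64-encoded "data" fields, and the transformation itself (uppercasing a hypothetical "message" field) is made up for the example.

```python
import base64
import json


def _result(record, status, data=None):
    # Always echo the original recordId; reuse the original data unless new data is given.
    return {"recordId": record["recordId"], "result": status,
            "data": record["data"] if data is None else data}


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        try:
            doc = json.loads(base64.b64decode(record["data"]))
        except ValueError:  # invalid Base64, invalid UTF-8, or invalid JSON
            doc = None
        if not isinstance(doc, dict):
            output.append(_result(record, "ProcessingFailed"))
        elif not doc:
            # Intentionally drop empty objects.
            output.append(_result(record, "Dropped"))
        else:
            doc["message"] = str(doc.get("message", "")).upper()
            payload = (json.dumps(doc) + "\n").encode("utf-8")
            output.append(_result(record, "Ok", base64.b64encode(payload).decode("ascii")))
    return {"records": output}
```

Every input record gets exactly one output entry with the same recordId, which is what keeps a batch from being treated as a data transformation failure.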
Q: What is error logging?
If you enable data transformation with Lambda, Firehose can log any Lambda invocation and data delivery errors to Amazon CloudWatch Logs so that you can view the specific error logs if Lambda invocation or data delivery fails. For more information, see Monitoring with Amazon CloudWatch Logs.
Q: What is source record backup?
If you use data transformation with Lambda, you can enable source record backup, and Amazon Data Firehose will deliver the un-transformed incoming data to a separate S3 bucket. You can specify an extra prefix to be added in front of the “YYYY/MM/DD/HH” UTC time prefix generated by Firehose.
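To see what the resulting key prefix looks like, the time-based part can be reproduced locally. This sketch only mimics the “YYYY/MM/DD/HH” UTC layout described above; the function name and the "raw/" prefix are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_prefix(extra_prefix: str, when: datetime) -> str:
    """Prepend extra_prefix to the UTC hour of `when`, formatted as YYYY/MM/DD/HH."""
    return extra_prefix + when.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")


# A timestamp given in UTC-5 lands in the next UTC day.
local = datetime(2024, 3, 5, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(backup_prefix("raw/", local))  # raw/2024/03/06/04
```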
Built-in Data Transformation for Amazon S3
Q: When should I use Firehose dynamic partitioning?
Firehose dynamic partitioning eliminates the complexities and delays of manual partitioning at the source or after storing the data, and enables faster analytics for querying optimized data sets. This makes the data sets immediately available for analytics tools to run their queries efficiently and enhances fine-grained access control for data. For example, marketing automation customers can partition data on-the-fly by customer id, allowing customer-specific queries to query optimized data sets and deliver results faster. IT operations or security monitoring customers can create groupings based on event timestamp embedded in logs so they can query optimized data sets and get results faster. This feature combined with Amazon Data Firehose's existing JSON-to-parquet format conversion feature makes Amazon Data Firehose an ideal streaming ETL option for S3.
Q: How do I set up dynamic partitioning with Firehose?
You can set up the Firehose data partitioning capability through the AWS Management Console, CLIs, or SDKs. When you create or update a Firehose stream, select Amazon S3 as the delivery destination for the Firehose stream and enable dynamic partitioning. You can specify keys or create an expression that will be evaluated at runtime to define the keys used for partitioning. For example, you can select a data field in the incoming stream such as customer id and define an S3 prefix expression such as customer_id=!{partitionKey:customer_id}/, which is evaluated at runtime against the ingested records to determine the S3 prefix each record is delivered to.
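The effect of such a prefix expression can be mimicked locally. The sketch below is not how Firehose evaluates expressions internally; it simply substitutes each !{partitionKey:<field>} placeholder, written as in the example above, with the matching field of a sample JSON record.

```python
import json
import re

PREFIX_TEMPLATE = "customer_id=!{partitionKey:customer_id}/"
_PLACEHOLDER = re.compile(r"!\{partitionKey:([A-Za-z0-9_]+)\}")


def resolve_prefix(template: str, record: dict) -> str:
    """Replace every !{partitionKey:<field>} placeholder with the record's value."""
    return _PLACEHOLDER.sub(lambda m: str(record[m.group(1)]), template)


record = json.loads('{"customer_id": "c-42", "amount": 19.99}')
print(resolve_prefix(PREFIX_TEMPLATE, record))  # customer_id=c-42/
```

Records with the same customer_id resolve to the same prefix, which is what groups them together in S3.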
Q: What kind of transformations and data processing can I do with dynamic partitioning and with partitioning keys?
Firehose supports Parquet/ORC conversion out of the box when you write your data to Amazon S3. Firehose also integrates with AWS Lambda, so you can write your own transformation code. Firehose also has built-in support for extracting the key data fields from records that are in JSON format, and it supports the jq parsing language to enable transformations on those partition keys. To learn more, see the Firehose developer guide.
Data Delivery and Destinations
Q: Can I keep a copy of all the raw data in my S3 bucket?
Yes, Firehose can back up all un-transformed records to your S3 bucket concurrently while delivering transformed records to destination. Source record backup can be enabled when you create or update your Firehose stream.
Q: How often does Firehose deliver data to my Amazon S3 bucket?
The frequency of data delivery to Amazon S3 is determined by the S3 buffer size and buffer interval values you configured for your Firehose stream. Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3. If you have Apache Parquet or dynamic partitioning enabled, the buffer size ranges from 64 MB to 128 MB for the Amazon S3 destination, with 128 MB being the default value. Note that in circumstances where data delivery to the destination is falling behind data ingestion into the Firehose stream, Firehose raises the buffer size automatically to catch up and make sure that all data is delivered to the destination.
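The "whichever condition is satisfied first" rule can be illustrated with a toy buffer. This is a simplified local model, not the service's implementation: the class name is hypothetical, time is passed in explicitly, and the interval is only checked when a record arrives, whereas a real service would also flush on a timer.

```python
from typing import List, Optional


class DeliveryBuffer:
    """Flush when buffered bytes reach size_limit or interval seconds have elapsed."""

    def __init__(self, size_limit: int, interval: float, now: float = 0.0):
        self.size_limit = size_limit
        self.interval = interval
        self.started = now
        self.chunks: List[bytes] = []
        self.size = 0

    def add(self, data: bytes, now: float) -> Optional[bytes]:
        """Buffer one record; return the delivered object if a flush was triggered."""
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.size_limit or now - self.started >= self.interval:
            blob = b"".join(self.chunks)
            self.chunks, self.size, self.started = [], 0, now
            return blob
        return None


buf = DeliveryBuffer(size_limit=10, interval=60)
print(buf.add(b"aaaa", now=1))      # None: neither condition met
print(buf.add(b"bbbbbbb", now=2))   # size condition met first
print(buf.add(b"c", now=30))        # None
print(buf.add(b"d", now=70))        # interval condition met first
```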
Q: How is buffer size applied if I choose to compress my data?
Buffer size is applied before compression. As a result, if you choose to compress your data, the size of the objects within your Amazon S3 bucket can be smaller than the buffer size you specify.
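A quick local experiment shows why the delivered object can end up smaller than the buffer size. The sample data below is hypothetical, repetitive log-like content; actual compression ratios depend entirely on your data.

```python
import gzip

# Hypothetical buffer contents: roughly 1 MB of repetitive, log-like lines.
line = b'{"level": "INFO", "message": "request served"}\n'
buffered = line * (1024 * 1024 // len(line))

# Compression happens after buffering, so the object is smaller than the buffer.
compressed = gzip.compress(buffered)
print(len(buffered), len(compressed))
```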
Q: What privilege is required for the Amazon Redshift user that I need to specify while creating a Firehose stream?
The Redshift user needs to have Redshift INSERT privilege for copying data from your Amazon S3 bucket to your Redshift instance.
Q: What do I need to do if my Amazon Redshift instance is within a VPC?
If your Redshift instance is within a VPC, you need to grant Amazon Data Firehose access to your Redshift instance by unblocking Firehose IP addresses from your VPC. For information about how to unblock IPs to your VPC, see Grant Firehose Access to an Amazon Redshift Destination in the Amazon Data Firehose developer guide.
Q: Why do I need to provide an Amazon S3 bucket while choosing Amazon Redshift as destination?
For Redshift destinations, Amazon Data Firehose delivers data to your Amazon S3 bucket first and then issues the Redshift COPY command to load data from your S3 bucket to your Redshift instance.
Q: Is it possible for a single Firehose stream to deliver data to multiple Snowflake tables?
Currently, a single Firehose stream can only deliver data to one Snowflake table. To deliver data to multiple Snowflake tables, you need to create multiple Firehose streams.
Q: What delivery model does Firehose use when delivering data to Snowflake streaming?
Firehose uses exactly-once delivery semantics for Snowflake. This means that each record is delivered to Snowflake exactly once, even if there are errors or retries. However, exactly-once delivery does not guarantee that there will be no duplicates in the data end to end, as data may be duplicated by the producer or by other parts of the ETL pipeline.
Q: What is the minimum latency for delivering to Snowflake streaming using Firehose?
We expect most data streams to be delivered within 5 seconds.
Q: What is Amazon OpenSearch Service?
Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). For more information, see Amazon OpenSearch Service.
Q: What is index rotation for Amazon OpenSearch Service destination?
Firehose can rotate your Amazon OpenSearch Service index based on a time duration. You can configure this time duration while creating your Firehose stream. For more information, see Index Rotation for the Amazon OpenSearch Destination in the Amazon Data Firehose developer guide.
Q: Why do I need to provide an Amazon S3 bucket when choosing Amazon OpenSearch Service as destination?
When loading data into Amazon OpenSearch Service, Firehose can back up all of the data or only the data that failed to deliver. To take advantage of this feature and prevent any data loss, you need to provide a backup Amazon S3 bucket.
Q: Can I change the configurations of my Firehose stream after it’s created?
You can change the configuration of your Firehose stream at any time after it’s created. You can do so by using the Firehose Console or the UpdateDestination operation. Your Firehose stream remains in ACTIVE state while your configurations are updated and you can continue to send data to your Firehose stream. The updated configurations normally take effect within a few minutes.
When delivering to a VPC destination, you can change the destination endpoint URL, as long as the new destination is accessible within the same VPC, subnets, and security groups. To change the VPC, subnets, or security groups, you need to re-create the Firehose stream.
Q: Can I use a Firehose stream in one account to deliver my data into an Amazon OpenSearch Service domain VPC destination in a different account?
Firehose can deliver to an Amazon OpenSearch Service domain in a different account only when Firehose and Amazon OpenSearch Service are connected through a public endpoint.
If Firehose and Amazon OpenSearch Service are connected through a private VPC, then the Firehose stream and the destination Amazon OpenSearch Service domain VPC need to be in the same account.
Q: Can I use a Firehose stream in one region to deliver my data into an Amazon OpenSearch Service domain VPC destination in a different region?
No, your Firehose stream and destination Amazon OpenSearch Service domain need to be in the same region.
Q: How often does Firehose deliver data to my Amazon OpenSearch domain?
The frequency of data delivery to Amazon OpenSearch Service is determined by the OpenSearch buffer size and buffer interval values that you configured for your Firehose stream. Firehose buffers incoming data before delivering it to Amazon OpenSearch Service. You can configure the values for OpenSearch buffer size (1 MB to 100 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon OpenSearch Service. Note that in circumstances where data delivery to the destination is falling behind data ingestion into the Firehose stream, Amazon Data Firehose raises the buffer size automatically to catch up and make sure that all data is delivered to the destination.
Q: What is the manifests folder in my Amazon S3 bucket?
For Redshift destinations, Amazon Data Firehose generates manifest files to load Amazon S3 objects to Redshift instances in batch. The manifests folder stores the manifest files generated by Firehose.
Q: What do backed-up OpenSearch documents look like in my Amazon S3 bucket?
If “all documents” mode is used, Amazon Data Firehose concatenates multiple incoming records based on buffering configuration of your Firehose stream, and then delivers them to your S3 bucket as an S3 object. Regardless of which backup mode is configured, the failed documents are delivered to your S3 bucket using a certain JSON format that provides additional information such as error code and time of delivery attempt. For more information, see Amazon S3 Backup for the Amazon OpenSearch Destination in the Amazon Data Firehose developer guide.
Q: Can a single Firehose stream deliver data to multiple Amazon S3 buckets?
A single Firehose stream can currently only deliver data to one Amazon S3 bucket. If you want to have data delivered to multiple S3 buckets, you can create multiple Firehose streams.
Q: Can a single Firehose stream deliver data to multiple Amazon Redshift instances or tables?
A single Firehose stream can currently only deliver data to one Redshift instance and one table. If you want to have data delivered to multiple Redshift instances or tables, you can create multiple Firehose streams.
Q: Can a single Firehose stream deliver data to multiple Amazon OpenSearch Service domains or indexes?
A single Firehose stream can currently only deliver data to one Amazon OpenSearch Service domain and one index. If you want to have data delivered to multiple Amazon OpenSearch domains or indexes, you can create multiple Firehose streams.
Q: How does Amazon Data Firehose deliver data to my Amazon OpenSearch Service domain into a VPC?
When you enable Firehose to deliver data to an Amazon OpenSearch Service destination in a VPC, Amazon Data Firehose creates one or more cross-account elastic network interfaces (ENIs) in your VPC for each subnet that you choose. Amazon Data Firehose uses these ENIs to deliver the data into your VPC. The number of ENIs scales automatically to meet the service requirements.
Q: Is it possible for a single Firehose stream to deliver data to multiple Apache Iceberg tables?
Yes, one Firehose stream can deliver data to multiple Apache Iceberg tables.
Q: Does Firehose support connecting to the AWS Glue Data Catalog in a different account, or in a different AWS region?
Yes, Firehose supports connecting to the AWS Glue Data Catalog in a different account, or in a different AWS Region.
Q: Can I use Data Transformation feature using Lambda when delivering to Apache Iceberg tables?
Yes, you can use Data Transformation using Lambda when delivering to Apache Iceberg tables.
Troubleshooting and managing Firehose streams
Q: Why do I get throttled when sending data to my Amazon Data Firehose stream?
By default, each Firehose stream can intake up to 2,000 transactions/second, 5,000 records/second, and 5 MB/second. You can have this limit increased easily by submitting a service limit increase form.
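To estimate whether a planned workload stays within these default limits, the three rates can be checked together. This is a back-of-the-envelope sketch: the function and parameter names are hypothetical, and it assumes 5 MB means 5 * 1024 * 1024 bytes and that every API call carries the same number of equally sized records.

```python
# Default per-stream limits quoted above (per second).
DEFAULT_LIMITS = {
    "transactions": 2000,
    "records": 5000,
    "bytes": 5 * 1024 * 1024,  # assumes 1 MB = 1024 * 1024 bytes
}


def exceeded_limits(calls_per_second: int, records_per_call: int, avg_record_bytes: int) -> list:
    """Return the names of the default limits the workload would exceed."""
    rates = {
        "transactions": calls_per_second,
        "records": calls_per_second * records_per_call,
        "bytes": calls_per_second * records_per_call * avg_record_bytes,
    }
    return [name for name, rate in rates.items() if rate > DEFAULT_LIMITS[name]]


print(exceeded_limits(100, 100, 200))   # ['records']: 10,000 records/second
print(exceeded_limits(20, 100, 1024))   # []: within all three limits
```

Note that a workload can be well under the transaction limit and still be throttled on records or bytes, so all three need checking.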
Q: Why do I see duplicated records in my Amazon S3 bucket, Amazon Redshift table, Amazon OpenSearch index, or Splunk clusters?
Amazon Data Firehose uses at least once semantics for data delivery. In rare circumstances such as request timeout upon data delivery attempt, delivery retry by Firehose could introduce duplicates if the previous request eventually goes through.
Q: What happens if data delivery to my Amazon S3 bucket fails?
If your data source is Direct PUT and the data delivery to your Amazon S3 bucket fails, then Amazon Data Firehose will retry to deliver data every 5 seconds for up to a maximum period of 24 hours. If the issue continues beyond the 24-hour maximum retention period, then Amazon Data Firehose discards the data.
If your data source is Kinesis Data Streams and the data delivery to your Amazon S3 bucket fails, then Amazon Data Firehose will retry to deliver data every 5 seconds for up to the maximum retention period configured on your Kinesis data stream.
Q: What happens if data delivery to my Amazon Redshift instance fails?
If data delivery to your Redshift instance fails, Amazon Data Firehose retries data delivery every 5 minutes for up to a maximum period of 120 minutes. After 120 minutes, Amazon Data Firehose skips the current batch of S3 objects that are ready for COPY and moves on to the next batch. The information about the skipped objects is delivered to your S3 bucket as a manifest file in the errors folder, which you can use for manual backfill. For information about how to COPY data manually with manifest files, see Using a Manifest to Specify Data Files.
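The fixed-interval retry pattern described here can be modeled with a simulated clock. This is a simplified sketch of the pattern only, not Firehose's actual retry logic: the function name is hypothetical, the attempt callable stands in for a delivery attempt, and no real waiting takes place.

```python
from typing import Callable


def deliver_with_retry(attempt: Callable[[], bool], interval_s: int, max_duration_s: int) -> int:
    """Return the simulated seconds elapsed at success, or -1 if the batch is skipped."""
    elapsed = 0
    while True:
        if attempt():
            return elapsed
        # Give up once the next retry would fall outside the retry window.
        if elapsed + interval_s > max_duration_s:
            return -1
        elapsed += interval_s


outcomes = iter([False, False, False, True])  # fails three times, then succeeds
print(deliver_with_retry(lambda: next(outcomes), interval_s=300, max_duration_s=7200))  # 900
print(deliver_with_retry(lambda: False, interval_s=300, max_duration_s=7200))           # -1
```

With a 5-minute interval and a 120-minute window as quoted above, a delivery that keeps failing is skipped in this model, which corresponds to the batch being recorded in the errors folder for manual backfill.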
Q: What happens if data delivery to my Amazon OpenSearch domain fails?
For Amazon OpenSearch Service destination, you can specify a retry duration between 0 and 7200 seconds when creating the Firehose stream. If data delivery to your Amazon OpenSearch domain fails, Amazon Data Firehose retries data delivery for the specified time duration. After the retrial period, Amazon Data Firehose skips the current batch of data and moves on to the next batch. Details on skipped documents are delivered to your S3 bucket in the opensearch_failed folder, which you can use for manual backfill.
Q: What happens if there is a data transformation failure?
There are two types of failure scenarios when Firehose attempts to invoke your Lambda function for data transformation:
- The first type is when the function invocation fails for reasons such as reaching a network timeout or hitting Lambda invocation limits. Under these failure scenarios, Firehose retries the invocation three times by default and then skips that particular batch of records. The skipped records are treated as unsuccessfully processed records. You can configure the number of invocation retries between 0 and 300 using the CreateDeliveryStream and UpdateDestination APIs. For this type of failure, you can also use Firehose’s error logging feature to emit invocation errors to CloudWatch Logs. For more information, see Monitoring with Amazon CloudWatch Logs.
- The second type of failure scenario occurs when a record’s transformation result is set to “ProcessingFailed” when it is returned from your Lambda function. Firehose treats these records as unsuccessfully processed records. For this type of failure, you can use Lambda’s logging feature to emit error logs to CloudWatch Logs. For more information, see Accessing Amazon CloudWatch Logs for AWS Lambda.
For both types of failure scenarios, the unsuccessfully processed records are delivered to your S3 bucket in the processing_failed folder.
Q: Why is the size of delivered S3 objects larger than the buffer size I specified in my Firehose stream configuration?
The size of delivered S3 objects should reflect the specified buffer size most of the time if buffer size condition is satisfied before buffer interval condition. However, when data delivery to destination is falling behind data writing to Firehose stream, Firehose raises buffer size dynamically to catch up and make sure that all data is delivered to the destination. In these circumstances, the size of delivered S3 objects might be larger than the specified buffer size.
Q: What is the errors folder in my Amazon S3 bucket?
The errors folder stores manifest files that contain information about S3 objects that failed to load to your Redshift instance. You can reload these objects manually through the Redshift COPY command. For information about how to COPY data manually with manifest files, see Using a Manifest to Specify Data Files.
Q: What is the opensearch_failed folder in my Amazon S3 bucket?
The opensearch_failed folder stores the documents that failed to load to your Amazon OpenSearch domain. You can re-index these documents manually for backfill.
Q: What is the processing_failed folder in my Amazon S3 bucket?
The processing_failed folder stores the records that failed to transform in your AWS Lambda function. You can re-process these records manually.
Q: How do I monitor the operations and performance of my Amazon Data Firehose stream?
The Firehose console displays key operational and performance metrics such as incoming data volume and delivered data volume. Amazon Data Firehose also integrates with Amazon CloudWatch Metrics so that you can collect, view, and analyze metrics for your Firehose streams. For more information about Amazon Data Firehose metrics, see Monitoring with Amazon CloudWatch Metrics in the Amazon Data Firehose developer guide.
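As a rough sketch of pulling one of these metrics programmatically, the Python below builds a CloudWatch GetMetricStatistics request for incoming data volume. It assumes the AWS/Firehose namespace, the IncomingBytes metric, and the DeliveryStreamName dimension; the function names, the five-minute period, and the stream name in the usage note are illustrative choices, and the fetch step needs boto3 and AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def incoming_bytes_query(stream_name, hours=1, now=None):
    """Build GetMetricStatistics arguments for a stream's incoming data volume."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Firehose",
        "MetricName": "IncomingBytes",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # one datapoint per five minutes (illustrative choice)
        "Statistics": ["Sum"],
    }


def fetch_incoming_bytes(stream_name, hours=1):
    """Call CloudWatch and return datapoints ordered by time."""
    import boto3  # imported lazily; only needed when actually calling AWS

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**incoming_bytes_query(stream_name, hours))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

For example, fetch_incoming_bytes("my-firehose-stream") would return the last hour of five-minute sums for a stream with that (hypothetical) name.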
Q: How do I monitor data transformation and delivery failures of my Amazon Data Firehose stream?
Amazon Data Firehose integrates with Amazon CloudWatch Logs so that you can view the specific error logs if data transformation or delivery fails. You can enable error logging when creating your Firehose stream. For more information, see Monitoring with Amazon CloudWatch Logs in the Amazon Data Firehose developer guide.
Q: How do I manage and control access to my Amazon Data Firehose stream?
Amazon Data Firehose integrates with AWS Identity and Access Management, a service that enables you to securely control access to your AWS services and resources for your users. For example, you can create a policy that only allows a specific user or group to add data to your Firehose stream. For more information about access management and control of your stream, see Controlling Access with Amazon Data Firehose.
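The policy mentioned above might look roughly like the document built by this Python sketch, which allows only the PutRecord and PutRecordBatch actions on a single stream. The function name is hypothetical, and the Region, account ID, and stream name in the usage note are placeholders; attaching the policy to a user or group is a separate IAM step not shown here.

```python
import json


def put_only_policy(region: str, account_id: str, stream_name: str) -> dict:
    """Build an IAM policy document that only allows adding data to one stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
                "Resource": (
                    f"arn:aws:firehose:{region}:{account_id}:"
                    f"deliverystream/{stream_name}"
                ),
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(put_only_policy("us-east-1", "111122223333", "my-stream"), indent=2))
```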
Q: How do I log API calls made to my Amazon Data Firehose stream for security analysis and operational troubleshooting?
Amazon Data Firehose integrates with AWS CloudTrail, a service that records AWS API calls for your account and delivers log files to you. For more information about API call logging and a list of supported Amazon Data Firehose API operations, see Logging Amazon Data Firehose API calls Using AWS CloudTrail.
Pricing and billing
Q: Is Firehose available in the AWS Free Tier?
No. Firehose is not currently available in the AWS Free Tier. AWS Free Tier is a program that offers a free trial for a group of AWS services. For more details, see AWS Free Tier.
Q: How much does Firehose cost?
Firehose uses simple pay-as-you-go pricing. There are no upfront costs or minimum fees, and you only pay for the resources you use. Amazon Data Firehose pricing is based on the data volume (GB) ingested by Firehose, with each record rounded up to the nearest 5KB for Direct PUT and Kinesis Data Streams as sources. For Vended Logs as a source, pricing is based on the data volume (GB) ingested by Firehose. For more information about Amazon Data Firehose cost, see Amazon Data Firehose Pricing.
Q: When I use PutRecordBatch operation to send data to Amazon Data Firehose, how is the 5KB roundup calculated?
The 5KB roundup is calculated at the record level rather than the API operation level. For example, if your PutRecordBatch call contains two 1KB records, the data volume from that call is metered as 10KB (5KB per record).
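The per-record rounding can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not a billing calculator; it assumes 1KB means 1,024 bytes, and the function and constant names are hypothetical.

```python
BILLING_INCREMENT_BYTES = 5 * 1024  # assumption: 1KB = 1,024 bytes


def metered_bytes(record_sizes):
    """Round each record up to the next 5KB multiple, then sum the results."""
    total = 0
    for size in record_sizes:
        # Integer ceiling division avoids floating-point rounding.
        increments = -(-size // BILLING_INCREMENT_BYTES)
        total += increments * BILLING_INCREMENT_BYTES
    return total


# Two 1KB records in one PutRecordBatch call are metered as 10KB, not 5KB.
print(metered_bytes([1024, 1024]) // 1024)  # prints 10
```

Because the rounding applies to each record separately, batching many small records into one API call does not reduce the metered volume; only larger individual records do.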
Q: Does Firehose cost include Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and AWS Lambda costs?
No, you will be billed separately for charges associated with Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and AWS Lambda usage, including storage and request costs. For more information, see Amazon S3 Pricing, Amazon Redshift Pricing, Amazon OpenSearch Service Pricing, and AWS Lambda Pricing.
Service Level Agreement
Q: What does the Amazon Data Firehose SLA guarantee?
Our Amazon Data Firehose SLA guarantees a Monthly Uptime Percentage of at least 99.9% for Amazon Data Firehose.
Q: How do I know if I qualify for a SLA Service Credit?
You are eligible for an SLA credit for Amazon Data Firehose under the Amazon Data Firehose SLA if more than one Availability Zone in which you are running a task, within the same region, has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle.
For full details on all of the terms and conditions of the SLA, as well as details on how to submit a claim, please see the Amazon Data Firehose SLA details page.