AWS Database Blog

Cost-effective bulk processing with Amazon DynamoDB

Your Amazon DynamoDB table might store millions, billions, or even trillions of items. If you ever need to perform a bulk update action against items in a large table, it’s important to consider the cost. In this post, I show you three techniques for cost-effective in-place bulk processing with DynamoDB.

Characteristics of bulk processing

You might have several reasons for bulk processing:

  • To delete outdated items for compliance reasons or to save ongoing storage costs. DynamoDB Time to Live (TTL) provides a zero-cost deletion capability targeted at use cases like this, but the TTL attribute must already exist on each item to be deleted. Perhaps your outdated items were loaded without a TTL attribute.
  • To backfill an appropriate TTL attribute value on more recent items so that they will be automatically deleted according to the proper schedule going forward.
  • To backfill a new global secondary index (GSI) partition key or sort key with a net-new attribute. For example, if the GSI sort key needs to be a combination of state and city (<state>#<city>), that requires placing a new attribute onto every item.
  • To apply any bulk changes where it’s not necessary for the work to be completed immediately.

A common characteristic of bulk processing is that you don’t necessarily need it done right away. Often, bulk requests can be delayed to the next second, the next hour, or even the next day. This allows for innovative designs to accomplish the work at lower cost than if everything had to be performed immediately.

For this post, I’m going to assume you want to do in-place no-downtime bulk updates on a table that’s also accepting organic (customer-driven) traffic. I’m also going to assume you have a truly large bulk update task, counted in the billions of items. If your bulk update task is smaller than this, the cost will be so low (no matter how you perform the work) that it doesn’t require the more careful approaches laid out here.

Large-scale pricing example

Imagine you have a truly large table: 500 billion items, each 500 bytes in size, for a total table size of 250 TB. This data has accumulated over the last 36 months and now you’d like to delete everything with a timestamp attribute older than 13 months. That’s 319 billion items you need to delete. Imagine furthermore you’d like to add a TTL attribute to the other 181 billion items so they’ll also be automatically deleted at 13 months of age going forward.

| Metric | Quantity |
| --- | --- |
| Item count | 500 billion |
| Item size | 500 bytes |
| Table size | 250 TB |
| Items to delete | 319 billion |
| Items to update | 181 billion |

Cost to modify all items using on-demand mode

The cost of writing (either deleting or updating) 500 billion small items depends on the table mode. In on-demand mode, it’s going to take exactly 500 billion write request units (WRUs). There’s no advantage with on-demand in controlling the timing of the requests (except to spread the work across the table to avoid hot partitions). Here is how that cost calculation breaks down for the us-east-1 AWS Region for a table in on-demand mode:

500 billion WRUs priced at $1.25 per million = $625,000

There are also read costs to consider. A Scan can pull multiple items at a time. Using eventually consistent reads, it can pull sixteen 500-byte items per consumed read unit. That means scanning all 500 billion items will require about 31.25 billion read request units (RRUs).

31.25 billion RRUs priced at $0.25 per million = $7,810

The read costs are inconsequential compared to writes, so this post will focus on optimizing the timing of the writes.
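To make the arithmetic easy to check, here's a minimal Python sketch of the on-demand estimate. The per-unit prices are the us-east-1 example rates quoted above and may differ in your Region:

```python
# Back-of-the-envelope on-demand cost for the example table.
ITEMS = 500_000_000_000          # 500 billion items, each under 1 KB

# Writes: one WRU per item under 1 KB.
write_cost = ITEMS / 1_000_000 * 1.25        # $1.25 per million WRUs

# Reads: an eventually consistent scan pulls about sixteen 500-byte items per RRU.
read_cost = (ITEMS / 16) / 1_000_000 * 0.25  # $0.25 per million RRUs

print(f"Write cost: ${write_cost:,.0f}")     # $625,000
print(f"Scan cost:  ${read_cost:,.0f}")      # roughly $7,800
```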

| Action | Quantity needed | Cost per | Total cost |
| --- | --- | --- | --- |
| Write 500 billion 500-byte items | 500 billion WRUs | $1.25 per million | $625,000 |
| Scan 500 billion 500-byte items | 31.25 billion RRUs | $0.25 per million | $7,810 |

That’s about 790,000 item updates for every $1 of cost. Because the workload is flat and predictable, we can do much better using provisioned mode.

Cost to modify all items using provisioned mode

In provisioned mode, the pricing calculation is a little more complicated. This is because the bulk writes and the organic writes have to coexist in a way that doesn’t cause the table to be throttled. It’s in that coexisting relationship that we’ll find savings later in this post.

Provisioned tables that handle organic traffic tend to have autoscaling enabled. Autoscaling moves the amount being provisioned in response to the amount being consumed, aiming to keep the provisioned amount sufficiently above the consumed amount that temporary spikes stay within the provisioned amount.

Autoscaling settings include a minimum value (don’t ever scale below this), maximum value (don’t ever scale above this), and a target utilization percentage (aim to consume equal to this much of provisioned, with a default value of 70 percent). The target utilization effectively controls how much padding goes above the consumed amount to determine the provisioned amount. Higher numbers provide less padding (increasing the risk of throttling).

A steady bulk job consuming 1,000,000 WCUs will run for 500,000 seconds (139 hours) to process all 500 billion items. During that time, autoscaling will detect the increased traffic and provision an additional 1,428,500 WCUs (1,000,000 WCU divided by the 70 percent target) on top of whatever was needed for the organic traffic.

Here is what the cost calculation looks like in us-east-1 for a table in provisioned capacity mode:

1,428,500 WCUs * 139 hours * $0.00065 per WCU-hour = $129,065
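Here's the same calculation as a small sketch, showing how the autoscaling target inflates the provisioned amount (again using the us-east-1 example prices):

```python
# Provisioned-mode estimate: autoscaling provisions roughly consumed / target.
BULK_WCU = 1_000_000        # steady bulk consumption per second
TARGET_UTILIZATION = 0.70   # autoscaling default target
HOURS = 139                 # 500 billion items at 1,000,000 items per second
WCU_HOUR_PRICE = 0.00065    # us-east-1 example price

provisioned_wcu = BULK_WCU / TARGET_UTILIZATION        # roughly 1.43 million WCUs
write_cost = provisioned_wcu * HOURS * WCU_HOUR_PRICE  # about $129,000
print(f"Provisioned write cost: ${write_cost:,.0f}")
```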

The relatively minor cost of the scan is included in the following table.

| Action | Quantity consumed | Quantity provisioned | Cost per | Total cost |
| --- | --- | --- | --- | --- |
| Write 500 billion 500-byte items | 1,000,000 WCUs for 139 hours | 1,428,500 WCUs for 139 hours | $0.00065 per WCU-hour | $129,065 |
| Scan 500 billion 500-byte items | 62,500 RCUs for 139 hours | 89,285 RCUs for 139 hours | $0.00013 per RCU-hour | $1,614 |

That’s about 3,826,000 item updates for every $1 of cost, a significant cost reduction from on-demand, as you’d expect for workloads that don’t need the instant reactiveness of on-demand mode. On-demand mode can actually be lower cost if the traffic is spiky, but perfectly flat work prices well with provisioned, as this shows.

Let’s now look at three different techniques that use timing to control when and at what rate the bulk write work is performed in order to achieve free or reduced-cost in-place updates.

Use unused reserved capacity

If you’ve purchased reserved capacity, you can potentially use it to provide free bulk writes by timing the writes during the periods when you have unused reserved capacity.

Reserved capacity (RC) provides a discount in exchange for a one- or three-year commitment to provision a minimum amount of throughput. If you have daily traffic varying from 300,000 WCUs to 700,000 WCUs, the right amount of one-year RC to purchase is somewhere around 500,000 WCUs. Buying only at the low point misses out on savings for all the workload above baseline; buying at peak over-provisions for much of the day; buying somewhere around the midpoint tends to be best.

This means that any time organic traffic is between 300,000 and 500,000 WCUs, a carefully constructed bulk job executor could fill the gap and perform the bulk work at zero extra cost. If you’ve already committed to a 500,000 WCU minimum, you might as well fill the slow periods with bulk work.

The following chart visualizes this. The spiky orange line is the organic consumed capacity, fluctuating up and down throughout the day. The horizontal red line is the RC purchased amount. The gap below the horizontal purchased line and above the spiky consumed line is the opportunity for free writes.

Figure 1: Time writes to match periods with unused reserved capacity

Imagine a bulk executor parameterized to consume a certain amount of self-throttled capacity during particular hours of the day, based on historic consumption norms. For example, if there are usually 200,000 reserved-but-unused WCUs during the middle of the night, the bulk executor knows it can consume those WCUs for itself. Processing at a rate of 200,000 items per second would touch all 500 billion items in 694 hours. Assuming an average of 30 hours per week with this level of unused RC, it would take about 23 weeks to process the backlog.
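As a rough illustration, here's a minimal sketch of what such a parameterized executor loop could look like. The HOURLY_BULK_WCU schedule and the process_batch helper are hypothetical placeholders you'd replace with your own historic norms and write logic:

```python
import time
from datetime import datetime, timezone

# Hypothetical schedule: UTC hour -> WCUs/sec the bulk job may consume during
# that hour, derived from your historic unused-RC norms (missing hours = idle).
HOURLY_BULK_WCU = {0: 200_000, 1: 200_000, 2: 200_000, 3: 150_000}

def run_bulk_executor(process_batch, batch_size=1_000):
    """process_batch(n) is a hypothetical helper that writes up to n items and
    returns the WCUs it consumed, or None when the backlog is finished."""
    while True:
        budget = HOURLY_BULK_WCU.get(datetime.now(timezone.utc).hour, 0)
        if budget == 0:
            time.sleep(60)           # outside the quiet window; check back later
            continue
        start = time.monotonic()
        consumed = process_batch(batch_size)
        if consumed is None:
            return                   # backlog complete
        # Self-throttle: pace so consumption stays within this hour's budget.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, consumed / budget - elapsed))
```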

Calculating the historic norms is straightforward if you have only a single table but takes extra work in a larger organization. That’s because reserved capacity isn’t assigned to a single table but instead is shared among all tables for a given Region in all accounts connected to the same payer account. It’s not the low periods on just one table that matter, it’s the low periods across all tables, and then calculating what RC is unused during those periods. That’s the run rate to give to the bulk executor.

Autoscaling introduces complexity to this design. If you leave autoscaling at the 70 percent default target, you can only actually consume 140,000 WCUs because that’s the write level that will result in all 200,000 spare WCUs being provisioned to fill the utilization gap.

What you can do then is, during the hours of bulk execution, tighten the autoscaling settings to the most aggressive possible value of 90 percent. This greatly cuts down the padding and lets you consume up to 180,000 WCUs to reach the 200,000 WCUs provisioned.

Normally, if you set an aggressive target utilization, it increases the risk of throttling during any organic traffic spikes. You can mitigate this by having the bulk executor quickly and harshly self-throttle when it receives a ProvisionedThroughputExceededException so that it stops its own throughput consumption for a short while to make way for organic traffic. The padding is still effectively there, just being opportunistically consumed by a bulk executor job that converts itself to padding as needed. Picture a safety circular saw encountering a test hot dog and abruptly stopping itself to minimize any damage. There’s less need for conservatively autoscaling padding if the bulk job can shed its consumption quickly. Make sure to adjust the SDK retry settings to avoid automatic retries when implementing a harsh self-throttle.
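Here's a hedged sketch of that pattern with boto3: automatic SDK retries are disabled so the executor sees throttles immediately, and any ProvisionedThroughputExceededException triggers a long pause. The table name, item iterable, and backoff duration are placeholders:

```python
import time
import boto3
from botocore.config import Config

# Disable automatic SDK retries so the bulk job reacts to throttles itself
# ("standard" mode counts the initial call, so max_attempts=1 means no retries).
dynamodb = boto3.client(
    "dynamodb",
    config=Config(retries={"max_attempts": 1, "mode": "standard"}),
)

BACKOFF_SECONDS = 30  # hypothetical: how long to step aside for organic traffic

def write_items_with_harsh_self_throttle(items, table_name):
    """items: an iterable of DynamoDB items in attribute-value format."""
    for item in items:
        while True:
            try:
                dynamodb.put_item(TableName=table_name, Item=item)
                break
            except dynamodb.exceptions.ProvisionedThroughputExceededException:
                # Throttled: stop consuming write throughput for a while so
                # organic traffic gets the provisioned capacity back.
                time.sleep(BACKOFF_SECONDS)
```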

This approach, when available, can perform all the updates over time without additional service charges, but it does require a fixed amount of engineering effort up front.

Tighten autoscaling

If you don’t happen to have unused RC but are using provisioned mode with autoscaling, tightening autoscaling provides another way to achieve nearly free bulk writes.

As discussed at the end of the previous section, the goal of autoscaling target utilization is to keep the provisioned capacity line sufficiently above the consumed capacity line that short bursts of consumption don’t exceed the provisioned amount. A higher target leaves less padding, while a lower target leaves more padding. The following figure shows how autoscaling works. The spiky orange line is the consumed WCU and the flat blue lines are the provisioned WCU.

Figure 2: Consumed and provisioned WCU with autoscaling

The technique then is to raise the target (which lowers the flat provisioned lines), add in a steady proportional percentage of bulk work on top of the organic traffic (raising the spiky line a little bit and putting the provisioned line back where it was originally), and have the bulk work self-throttle if encountering table-level throttles (making it acceptable to have the more aggressive target). The bulk work gets done at a rate proportional to some percentage of organic traffic without additional cost.

The following chart visualizes this. The lower spiky orange line is the same organic traffic, straight blue lines are the same provisioned amount, and the upper spiky orange line is the consumed capacity with the bit of bulk work added on. Raising the target utilization keeps the straight lines in place even with the higher spiky line.

Figure 3: Tighten target utilization and add throttle-sensitive bulk work

Having a higher target utilization value allows less room for spikes in traffic. That’s why you wouldn’t naturally have the target any higher than is safe. The overly high value is acceptable here because the bulk work can severely self-throttle (as covered previously) whenever it receives a throughput exceeded exception, freeing up the throughput immediately for the organic traffic and making the full distance between the orange and blue lines available for organic spikes.

The bulk executor operates at a fixed percentage of the observed organic traffic to fill the gap opened by the higher target utilization. When changing the target from 70 percent to 80 percent, it opens the bulk job to consume approximately 15 percent of organic traffic as it rises and falls, which mathematically keeps the provisioned line at roughly the same levels as would have been consumed originally.

Here’s the math confirming that changing 70 percent to 80 percent allows for a 15 percent increase:

  • Assume 500,000 WCUs of fluctuating organic traffic.
  • A 70 percent target results in 500,000/0.7 = 714,285 WCUs provisioned by autoscaling.
  • Change it to an 80 percent target and it’s 500,000/0.8 = 625,000 WCUs.
  • Add 15 percent of bulk work changing 500,000 to 575,000 and it’s now 575,000/0.8 = 718,750 WCUs provisioned. That’s right about where we started.

The exact math is (80/70) – 1 = 14.3%. If you were to move a 50 percent target up to 70 percent you can add (70/50) – 1 = 40% of bulk work.
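The same relationship, captured as a small helper function:

```python
def allowed_bulk_fraction(old_target: float, new_target: float) -> float:
    """Fraction of organic traffic the bulk job can add after raising the
    autoscaling target utilization, keeping provisioned capacity unchanged."""
    return new_target / old_target - 1

print(allowed_bulk_fraction(0.70, 0.80))  # ~0.143, about 14.3 percent
print(allowed_bulk_fraction(0.50, 0.70))  # ~0.40, or 40 percent
```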

Adding bulk work at a fluctuating rate that averages 75,000 WCUs on top of the organic traffic will complete the 500 billion item backlog in 11 weeks.

The challenge of this design is in keeping the bulk work running at a steady percentage of the organic traffic, which can vary. The bulk executor must observe the Amazon CloudWatch metrics so as to detect the (few minutes delayed) overall consumed capacity, subtract its own recent history, mathematically determine the organic usage level, and adjust its self-throttle to run as a fixed percentage of that organic traffic. It’s hard to be precise, which is why it’s best to not be too aggressive in the new utilization target. Prefer 80 percent over 90 percent, for example.
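Here's a minimal sketch of that feedback loop, assuming the bulk executor tracks its own recent rate; the table name and the bulk fraction are placeholders. ConsumedWriteCapacityUnits is published to CloudWatch as a Sum per period, so dividing by the period length gives an average WCU-per-second rate:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

def consumed_wcu_per_second(table_name, minutes=5):
    """Average consumed WCU/sec for the table over the last few minutes.
    Note that CloudWatch data for DynamoDB lags by a few minutes."""
    period = minutes * 60
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName="ConsumedWriteCapacityUnits",
        Dimensions=[{"Name": "TableName", "Value": table_name}],
        StartTime=end - timedelta(seconds=period),
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )
    points = response["Datapoints"]
    if not points:
        return 0.0
    latest = max(points, key=lambda p: p["Timestamp"])
    return latest["Sum"] / period

# Hypothetical pacing decision: run the bulk job at a fixed fraction of the
# organic traffic (overall consumption minus the bulk job's own recent rate).
BULK_FRACTION = 0.143  # from the 70 percent -> 80 percent target change above

def next_bulk_rate(table_name, bulk_recent_wcu_per_sec):
    organic = max(0.0, consumed_wcu_per_second(table_name) - bulk_recent_wcu_per_sec)
    return organic * BULK_FRACTION
```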

One extra challenge: it’s not possible to differentiate between a table-level throughput exceeded exception and a single hot partition causing a throughput exceeded exception. The bulk job should err on the side of harshly self-throttling to avoid throttling organic traffic in either case. It’s important that the bulk job evenly spread its writes across partitions, as we’ll discuss later, to minimize how often the bulk executor has to pause due to one hot partition.

This approach, like the previous, can perform all the updates over time without additional service charges, but with a fixed amount of engineering effort up front.

Remove autoscaling

If you’re using provisioned capacity and want to control how long the bulk executor should take, or if you prefer a design with fewer moving parts, removing autoscaling provides cost-effective but not quite free bulk processing.

What you do is turn off autoscaling, set your table to a fixed amount of provisioned write capacity (somewhere above peak organic traffic), and have the bulk executor fill in the gap between the organic traffic and the provisioned level, being sure to set the bulk executor to self-throttle itself to ensure organic traffic isn’t throttled.

The cost savings comes from the fact that, during the duration of the bulk processing time, the normal padding required of autoscaling isn’t needed. If the bulk processing runs for 10 weeks, any padding that would have been required during those weeks can be subtracted from the cost of the bulk processing. Unused padding becomes useful bulk processing time.

The bulk executor needs to track its own consumption, observe in CloudWatch the overall consumption, and self-throttle to keep the overall consumption close to the provisioned level. It also needs to severely self-throttle should it encounter a throughput exceeded exception.

The consumption goal target should be a little below 100 percent to keep the benefits of burst capacity to handle unanticipated organic write spikes. Burst capacity allows a table to run above its provisioned capacity for short durations. Burst capacity is given on a best-effort basis, but when available helps your table avoid throttles if there’s an organic spike; instead, you’ll see throttles only after the burst capacity has run out.

The amount available to burst is 300 times the per-second provisioned capacity. As an example of what that means, a table provisioned at 200,000 WCUs can potentially burst at 300,000 for 10 minutes or 250,000 for 20 minutes before encountering throttles at the table level (individual partitions have no burst capacity). The burst capacity is replenished as your traffic spends time below provisioned capacity.

Providing a gap of 5 percent below full consumption means that after 100 minutes at that level, the burst capacity allowance would be fully restored after any total depletion, so aiming for 95 percent consumption is a good compromise between speed, cost effectiveness, and keeping burst availability. If you anticipate more spikes in the organic workload, you can leave a larger percentage of gap.

The following chart visualizes this. Spiky orange is the organic consumed capacity, flat blue the typical provisioned amount, and dotted red the new fixed provisioned capacity. The bulk work should run a bit below the flat red line.

Figure 4: Turn off auto-scaling and use bulk work to achieve flat consumption

This approach subtracts the cost inherent in the overhead of auto-scaling (for example, with a 70 percent target utilization, 30 percent of the cost is devoted just to padding) and achieves around 95 percent utilization for organic writes as well as bulk writes.

For the 500 billion item table, let’s assume you want to provision at a fixed 800,000 WCUs because that’s above any expected peak in the usual 300,000 to 700,000 WCUs of fluctuating organic traffic. The bulk traffic will fill in the gap between the organic traffic and the target of 760,000 WCUs (respecting that 5 percent gap). The rate of bulk traffic will vary but since the organic traffic averages 500,000 WCUs the bulk work will average 260,000 WCUs. At that write rate, it will process all 500 billion items in three weeks.
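A minimal sketch of the pacing calculation under those assumptions (the fixed provisioned amount and the 95 percent target are the example values from this section):

```python
# Pacing for the fixed-provisioned approach: fill the gap between organic
# consumption and 95 percent of the fixed provisioned write capacity.
FIXED_PROVISIONED_WCU = 800_000
CONSUMPTION_TARGET = 0.95  # leave ~5 percent headroom so burst capacity can refill

def bulk_write_budget(observed_total_wcu_per_sec, bulk_recent_wcu_per_sec):
    """WCUs/sec the bulk executor should aim to consume right now.

    observed_total_wcu_per_sec: overall consumption observed in CloudWatch.
    bulk_recent_wcu_per_sec: the bulk executor's own recent consumption rate.
    """
    organic = max(0.0, observed_total_wcu_per_sec - bulk_recent_wcu_per_sec)
    return max(0.0, FIXED_PROVISIONED_WCU * CONSUMPTION_TARGET - organic)
```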

Let’s consider costs. During those three weeks, if there was no bulk activity, we would expect an average of 500,000 WCUs of organic traffic and, with a 70 percent target utilization, that would produce an average of 714,285 WCUs provisioned. Instead, we opted to provision a fixed 800,000 WCUs. For three weeks we ran the bulk executor at an average rate of 260,000 WCUs while only adding an incremental cost of about 85,000 WCUs. That means the bulk work was done at a 67 percent cost reduction compared to fully utilized provisioned writes and an even greater cost reduction compared to an autoscaling table.

At the beginning of this post, we calculated a bulk job against a provisioned table would cost $129,065. This approach consumes an average of 85,000 WCUs across three weeks for a total cost of:

85,000 WCUs * 21 days * 24 hours * $0.00065 per WCU-hour = $27,846

That’s about 18,000,000 item updates per $1 of cost.

Comparing approaches

The first approach—use unused reserved capacity—is straightforward and provides potentially free bulk writes, but only works to the extent that you have repeating and predictable periods of underutilized reserved capacity. The longer and deeper the periods, the faster the bulk work can complete.

The second approach—tighten autoscaling—is more complex. It achieves potentially free bulk writes, with some risk of increased throttling or of the bulk work getting in the way of autoscaling based on the organic write traffic. The risk can be controlled by parameterizing the percentage of bulk work added to the organic traffic.

The third approach—remove autoscaling—is straightforward, should execute the bulk work faster than the second option, but will generally incur some cost for the bulk work as the single provisioned capacity line will probably be above the average of the autoscaling line. However, the cost is in your control as you choose the fixed provisioned amount and you can save the cost of the padded throughput during the time of the bulk work.

| | Use unused RC | Tighten autoscaling | Remove autoscaling |
| --- | --- | --- | --- |
| Summary | Convert unused RC into used bulk work during natural lulls in RC usage | Raise the autoscaling target and add bulk work (that temporarily halts itself to make way for organic traffic spikes) to make the high target acceptable | Provision a high fixed amount and add bulk work (that temporarily halts itself to make way for organic traffic spikes) to achieve near full utilization |
| Cost | Free | Free | Low |
| Advantages | Simplest design | Works even without RC | You control the execution speed; straightforward design |
| Disadvantages | Must be using RC and have predictable lulls | Most complex design | Incurs a cost |

Spread the write load using a shuffled scan

Whichever approach you choose, you’ll want to spread the write load across the table’s partitions, which requires some extra care if you’re driving writes from a table scan.

When doing repeated Scan calls, you pull a page at a time. Each call returns up to the number of items you specify as a Limit, capped at 1 MB of data per call. If your call hasn’t reached the end, its response includes a LastEvaluatedKey, which you pass into the next Scan call to pull the next page. Internally, the Scan call pulls items from one partition, then the next.

If you do what seems natural and drive your writes directly from the items returned by the Scan, you’ll be sending your writes toward whatever partition the Scan is working against at that point, creating a rolling hot partition and limiting your write rate to around 1,000 write units per second (the fixed write limit of a single partition).

To avoid this, you want to do a shuffled scan:

  1. Initiate a parallel scan.
  2. Specify that you want the scan to use, for example, 10,000 segments.
  3. Pull 100 items from a segment (using limit), then pull 100 from another segment chosen at random.
  4. Keep track of the LastEvaluatedKey for each segment number so you can reuse it when you come back to that segment again to read the next block.
  5. When a segment doesn’t return a LastEvaluatedKey anymore, you know that segment is fully read and you don’t read from it again.
  6. When all segments are done, the full table will have been scanned in shuffled order.

This pulls 100 items at a time from 10,000 different segments of the table, effectively jumping here and there to create a work list that’s well distributed across the table.

If you employ parallel workers, you can give each worker a random set of segments for it to shuffle against. That allows each worker to spread its activity across the table.
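Here's a minimal sketch of a shuffled scan using boto3. The table name is a placeholder, and the self-throttling and error handling discussed earlier are omitted for brevity:

```python
import random
import boto3

dynamodb = boto3.client("dynamodb")

def shuffled_scan(table_name, total_segments=10_000, page_size=100):
    """Yield items from a parallel scan in shuffled segment order so that
    writes driven from the results are spread across the table's partitions."""
    # None = segment not yet started; otherwise the key to resume from.
    cursors = {segment: None for segment in range(total_segments)}
    active = list(cursors)

    while active:
        segment = random.choice(active)
        kwargs = {
            "TableName": table_name,
            "Segment": segment,
            "TotalSegments": total_segments,
            "Limit": page_size,
        }
        if cursors[segment] is not None:
            kwargs["ExclusiveStartKey"] = cursors[segment]

        page = dynamodb.scan(**kwargs)
        yield from page.get("Items", [])

        if "LastEvaluatedKey" in page:
            cursors[segment] = page["LastEvaluatedKey"]  # revisit this segment later
        else:
            active.remove(segment)                       # segment fully read

# Usage sketch: drive the throttle-aware bulk writes from the shuffled work list.
# for item in shuffled_scan("MyTable"):
#     ...update or delete the item, subject to the self-throttling shown earlier...
```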

Summary

Bulk writes—be they deletes, inserts, or updates—have the unique characteristic that they can be performed at a time and at a rate that’s chosen for cost-effectiveness.

If you have reserved capacity, it’s possible to obtain writes at no extra charge by making those writes when the organic traffic levels are below the RC minimums.

If you’re using autoscaling, it’s possible to obtain free or nearly-free writes by raising the target utilization percentage and filling the gap with bulk writes, preventing throttling issues by having the bulk job harshly self-throttle whenever it reaches a throughput limit.

If you want a simple design or faster processing, you can obtain low-cost writes by setting a fixed provisioned WCU amount (no autoscaling) and have the bulk work consume capacity almost to that fixed amount. The cost savings come from removing the usual cost of the autoscaling overhead during the time of the bulk work.

Remember that when driving any bulk writes from a table scan, it’s important to spread the scan calls across the table so as to spread the write calls as well, thereby avoiding hot partitions. You can do that by using a shuffled scan (built on a parallel scan) that pulls a small number of items repeatedly from random segments.

If you have any comments or questions, leave them in the comments section. You can find more DynamoDB posts and other posts written by Jason Hunter in the AWS Database Blog.


About the Author

Jason Hunter is a California-based Principal Solutions Architect specializing in DynamoDB. He’s been working with NoSQL Databases since 2003. He’s known for his contributions to Java, open source, and XML.