AWS Database Blog
Effective data sorting with Amazon DynamoDB
Amazon DynamoDB offers high scalability and performance for applications with varying workloads. While DynamoDB excels at efficiently distributing data across multiple partitions, it inherently follows a specific sorting order based on the schema selected. In this post, we show two example data models, one designed to store e-commerce order information and one to store game scores. We use these data models to explore how DynamoDB naturally arranges items and delve into effective strategies for establishing customized ordering.
Before we examine the details, it’s essential to understand the significance of partition and sort keys in DynamoDB and how we can harness their strengths to create an efficient and scalable data model.
Partition Key and Sort Key characteristics
A composite primary key in DynamoDB consists of two attributes: the partition key and the sort key. The partition key value is utilized as input for an internal hash function, which then determines the specific partition (internal physical storage within DynamoDB) where the item is stored. Items with the same partition key value are stored together and sorted based on their sort key values.
In tables with both a partition key and a sort key, it’s possible for multiple items to share the same partition key value. However, items with the same partition key must have distinct sort key values.
Ordering
The sort key, also known as the range key, determines the order in which items are stored within an item collection. When you query a DynamoDB table, the items for a given partition key are returned in sort key order, ascending by default. A Scan reads the whole table, so it does not return a single order across different partition keys.
Items that share the same partition key value are organized based on the sort key. The sorting mechanism varies depending on the data type of the sort key:
- If the sort key’s data type is Number, DynamoDB arranges items in numeric order, ensuring that numerical comparisons are straightforward and efficient.
- When the sort key is of type String, DynamoDB sequences items in accordance with the UTF-8 byte order, making it ideal for lexicographical sorting.
- For Binary data types, DynamoDB treats each byte of the binary data as unsigned, facilitating precise byte-level ordering.
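The three ordering rules above can be imitated locally to see how they differ. This is a minimal sketch in plain Python, not DynamoDB code: the function names and sample values are made up for illustration, and numbers are passed as strings because that is how DynamoDB represents Number values on the wire.

```python
from decimal import Decimal


def sort_numbers(values):
    """Number sort keys compare by numeric value."""
    return sorted(values, key=Decimal)


def sort_strings(values):
    """String sort keys compare by their UTF-8 encoded bytes."""
    return sorted(values, key=lambda s: s.encode("utf-8"))


def sort_binary(values):
    """Binary sort keys compare byte by byte, each byte treated as unsigned."""
    # Python bytes objects already compare as sequences of unsigned bytes.
    return sorted(values)


numbers = sort_numbers(["10", "9", "-1", "2.5"])
strings = sort_strings(["b", "a", "B", "10", "9"])
binary = sort_binary([b"\x80", b"\x7f", b"\x00\xff"])

print(numbers)  # numeric order: "9" comes before "10"
print(strings)  # byte order: "10" comes before "9", "B" before "a"
print(binary)
```

The String case is the one that most often surprises: numeric values stored as strings need zero-padding to a fixed width if they are expected to sort numerically.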
Conditions
The sort key in a DynamoDB table is a powerful tool for optimizing query efficiency. By combining the sort key with conditions, you can perform precise and efficient queries that retrieve only the data you need. For example, you can use conditions to fetch items with a sortable attribute range, such as a date. This enables targeted retrieval, reducing the amount of data scanned and improving query performance. By strategically designing your data model and leveraging the sort key effectively, you can tailor your queries to match various access patterns and efficiently access the data that matters most to your application.
E-commerce data model example
To gain a clearer understanding of how sorting works in relation to the partition key, let’s visualize the concept. DynamoDB stores data in items (analogous to rows). Each item is uniquely identified by its primary key, and the partition key portion of that key is what DynamoDB uses to distribute data across partitions. This model also uses a sort key, which determines the order of items that share a partition key. Our DynamoDB table contains user orders, with the userID as the partition key and the order date as the sort key. DynamoDB does not have a native Date data type, so our sort key uses an ISO 8601 string format, which sorts chronologically as long as every value uses the same format and time zone.
For the partition key userID, DynamoDB distributes users’ data across partitions based on their ID. Within each partition, DynamoDB sorts the data by the sort key which is the order date. Visualizing this, we can think of the data organized somewhat like a filing cabinet:
- Each drawer in the cabinet represents a partition, identified by a unique userID.
- Inside each drawer (partition), you’ll find files (items) for each user, sorted by their order date.
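The filing cabinet picture can be modelled in a few lines of Python. This is only a local illustration of the idea, not how DynamoDB is implemented. The attribute name `PK` for the userID and all of the sample orders are assumptions made for this sketch.

```python
from collections import defaultdict

# Made-up order items: "PK" holds the userID, "SK" the ISO 8601 order date.
items = [
    {"PK": "user456", "SK": "2023-10-03T09:15:00.000Z"},
    {"PK": "user123", "SK": "2023-10-06T18:30:00.000Z"},
    {"PK": "user123", "SK": "2023-08-21T07:05:00.000Z"},
    {"PK": "user123", "SK": "2023-10-03T12:00:00.000Z"},
]


def build_item_collections(items):
    """Group items by partition key, then sort each group by sort key."""
    collections = defaultdict(list)
    for item in items:
        collections[item["PK"]].append(item)
    for collection in collections.values():
        # String sort keys are ordered by their UTF-8 bytes.
        collection.sort(key=lambda item: item["SK"].encode("utf-8"))
    return dict(collections)


cabinet = build_item_collections(items)
user123_dates = [item["SK"] for item in cabinet["user123"]]
print(user123_dates)
```

Each drawer is ordered internally, but nothing orders the drawers relative to one another, which is the limitation the rest of this post works around.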
The following table illustrates our example use-case. We can refer to a group of items which share the same partition key (but different sort keys) as an item collection.
This data model enables us to query a user’s orders within any specified time frame. For instance, we can efficiently retrieve all the orders placed by “user123” within a three month timeframe. Below is an example of how to execute that request using the AWS Command Line Interface (CLI):
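As a rough Python counterpart to that request, the sketch below builds the parameters for a Query call that combines the partition key with a `BETWEEN` condition on the sort key. The table name `Orders`, the partition key attribute name `PK`, and the date bounds are assumptions chosen for illustration. The dict follows the shape accepted by the low-level boto3 DynamoDB client.

```python
def build_user_orders_query(user_id, start, end, table_name="Orders"):
    """Build Query parameters for one user's orders between two timestamps.

    "Orders" and the attribute name "PK" are hypothetical; "SK" holds the
    ISO 8601 order timestamp.
    """
    return {
        "TableName": table_name,
        # BETWEEN is inclusive on both bounds.
        "KeyConditionExpression": "#pk = :user AND #sk BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#pk": "PK", "#sk": "SK"},
        "ExpressionAttributeValues": {
            ":user": {"S": user_id},
            ":start": {"S": start},
            ":end": {"S": end},
        },
    }


params = build_user_orders_query(
    "user123",
    "2023-07-01T00:00:00.000Z",
    "2023-09-30T23:59:59.999Z",
)
print(params)

# To run it against a real table (needs boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("dynamodb").query(**params)
```

Because both conditions are on key attributes, DynamoDB reads only the matching slice of the item collection rather than filtering after the fact.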
Now imagine that the business requires the following additional access patterns, each spanning all users:
- Get all orders for the last 24 hours
- Get all orders for the last 7 days
- Get all orders for the last 1 month
- Get all orders for the last 3 months
We have seen how DynamoDB maintains ordering within an item collection according to the sort key value. These new access patterns require a sorted order that spans item collections (that is, all partition keys).
Solution overview
To establish a sorted order that spans all partition keys, one critical observation is that we lack an attribute that allows us to group data into a unified item-collection.
If retrieving all orders within a recent time window is not a commonly requested access pattern, we can use a Scan operation and filter the results to match the desired timeframe. However, a Scan reads every item in the table before the filter is applied, so this method is inefficient in terms of both performance and cost. Therefore, if this access pattern is frequently requested, we require an alternative approach.
Leveraging a Global Secondary Index
A Global Secondary Index (GSI) is a DynamoDB feature that maintains an eventually consistent copy of some or all of the base table’s data, organized under a different key schema. A GSI allows you to efficiently query a table based on attributes other than the base table’s primary key, which makes it essential for accommodating additional access patterns without resorting to scans.
Now that we understand how DynamoDB maintains order within an item collection, we can design an alternative schema to support our additional access patterns using a GSI.
Approach 1 (non-optimal)
Recognizing the ability of item collections to organize data effectively, we have implemented a Global Secondary Index (GSI) that uses a date attribute with a one-day granularity. This allows us to efficiently group orders for each specific day. To facilitate this, we have introduced an additional attribute in our data structure named gsi1_pk which stores the necessary date values.
If you need to enhance your current data model by adding an additional attribute to each item, you’ll need to carry out a backfill operation. For a comprehensive guide on how to do this, refer to our detailed blog post.
Now, we have the capability to efficiently query data based on specific days, like retrieving all orders for 2023-10-03. While this approach is effective for single-day queries, our use-case demands handling broader date ranges. For instance, obtaining data for an entire week would necessitate making seven parallel requests, one for each day of the week. While manageable for a week, it’s important to note that as the date range expands, the number of required requests increases linearly, which can impact scalability.
An example to obtain all of the orders from 2023-10-03 to 2023-10-06 would look like the following:
Number of requests: 4
Returned Items: 4
Consumed Capacity: 2
While simple to implement, this approach scales poorly as the number of dates queried increases. One notable drawback is the need to make requests even for dates like 2023-10-05 where no data exists, incurring unnecessary costs without yielding any relevant information.
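The per-day fan-out can be sketched as follows. The code only builds the request parameters, one Query per calendar day, to make the linear growth visible. The table name `Orders` and the index name `gsi1` are assumptions for this sketch. `gsi1_pk` is the attribute introduced above.

```python
from datetime import date, timedelta


def days_in_range(start, end):
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=offset) for offset in range((end - start).days + 1)]


def build_daily_queries(start, end, table_name="Orders", index_name="gsi1"):
    """Build one Query per day, because each day is its own GSI partition key.

    "Orders" and "gsi1" are hypothetical names.
    """
    return [
        {
            "TableName": table_name,
            "IndexName": index_name,
            "KeyConditionExpression": "gsi1_pk = :day",
            "ExpressionAttributeValues": {":day": {"S": day.isoformat()}},
        }
        for day in days_in_range(start, end)
    ]


requests = build_daily_queries(date(2023, 10, 3), date(2023, 10, 6))
print(len(requests))
```

A request is generated for every day in the range, including days that hold no orders, and each of those requests still has to be sent and paid for.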
Approach 2 (optimal)
An improved strategy involves harnessing the sort key, enabling us to utilize conditions effectively. In this approach, we select a fixed value for our GSI partition key gsi1_pk, effectively consolidating all data into a single item collection. The sort key is defined as an ISO 8601 timestamp (string) down to millisecond granularity. These timestamps are already stored within our items under the SK attribute.
Observe that we’ve created a unified item collection stored under the gsi1_pk partition key attribute, with a constant value of 1. Consequently, all items from our table are now sorted in lexicographical order based on the order creation timestamps.
Now let’s repeat our example to obtain all of the orders from 2023-10-03 to 2023-10-06 which looks like the following:
Number of requests: 1
Returned Items: 4
Consumed Capacity: 0.5
This approach not only enhances efficiency but also offers heightened flexibility. Should our business needs evolve to encompass alternative access patterns, such as retrieving all orders in the last 30 minutes, obtaining the most recent 100 orders, or accessing the oldest 100 orders, our data model equips us with the versatility to execute these queries efficiently.
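With a single item collection, the same date range becomes one Query with a range condition on the sort key. The sketch below builds those parameters. The table name `Orders` and index name `gsi1` are assumptions, as is storing the constant `gsi1_pk` value as the string `"1"`.

```python
def build_range_query(start, end, table_name="Orders", index_name="gsi1"):
    """Build a single Query covering a timestamp range in the GSI.

    "Orders", "gsi1", and the String type of gsi1_pk are hypothetical.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "gsi1_pk = :pk AND #sk BETWEEN :start AND :end",
        "ExpressionAttributeNames": {"#sk": "SK"},
        "ExpressionAttributeValues": {
            ":pk": {"S": "1"},
            ":start": {"S": start},
            ":end": {"S": end},
        },
    }


# "2023-10-07" sorts before every timestamp on that day (for example
# "2023-10-07T00:00:00.000Z"), so this upper bound includes all of
# 2023-10-06 and nothing from 2023-10-07.
params = build_range_query("2023-10-03", "2023-10-07")
print(params)
```

Narrower or wider windows only change the two bound strings; the number of requests stays at one, apart from pagination on large result sets.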
Using a single fixed value as the partition key can introduce performance bottlenecks, which we will address later in this post.
Gaming data model
DynamoDB is frequently used to store game information such as scores and player profiles. Its scalability and performance make it a good fit for gaming applications. The flexible schema allows for adjustments to game mechanics without complicated database modifications, while its low-latency operations support real-time updates, making it a dependable option for managing leaderboards, player profiles, and achievements. An example data model may look like the following:
In this data model example, we observe a simple primary key defined as a partition key representing the user’s unique identifier, referred to as the userId. This design proves highly effective for uncomplicated key-value queries centered around the userId, like retrieving the score for user0011 or updating the score for user30046.
Imagine a new use-case that requires leaderboards showcasing the top 10 and top 50 users in our game. While it may seem intuitive to introduce a “score” attribute as the sort key to facilitate this access pattern, this approach encounters two significant challenges that render it impractical. First, DynamoDB does not let you modify the primary key attributes of an existing item, which impedes efficient updates to the score value. For instance, you can’t use UpdateItem to change primary key attributes. Instead, you must delete the item and then use PutItem to write a new item with the desired attributes. Second, and more crucially, DynamoDB limits sorting to the item collection. In this context, each user’s item would constitute an item collection comprising just a single item, so the table would provide no ordering across users and could not serve the leaderboard.
We can use a solution similar to the one for our e-commerce orders table: create a GSI with a static partition key value, so that all user items are held within a single item collection, and use the score attribute as the GSI sort key. To do so, we include gsi1_pk as an attribute in our data model:
Now our GSI stores the data grouped together and ordered by player score in ascending order:
Given our current use-case of delivering the highest-scoring users, we must read from the index in descending order. To accomplish this, we can make use of the ScanIndexForward property within the Query API and set it to False.
This solution allows us to retrieve the N highest scores from our index efficiently, minimizing any unnecessary consumption of read capacity.
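A leaderboard read along these lines can be sketched as Query parameters. The table name `GameScores`, the index name `gsi1`, and the String value `"1"` for `gsi1_pk` are assumptions for illustration. `ScanIndexForward` and `Limit` are the Query parameters that control direction and result count.

```python
def build_leaderboard_query(top_n, table_name="GameScores", index_name="gsi1"):
    """Build Query parameters that return the top_n highest scores.

    "GameScores", "gsi1", and the gsi1_pk value "1" are hypothetical.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "gsi1_pk = :pk",
        "ExpressionAttributeValues": {":pk": {"S": "1"}},
        "ScanIndexForward": False,  # read the index in descending score order
        "Limit": top_n,  # stop after the first top_n items
    }


top_10 = build_leaderboard_query(10)
top_50 = build_leaderboard_query(50)
print(top_10)
```

Because `Limit` stops the read after N items, the request consumes capacity only for the items it returns rather than for the whole index.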
Cost considerations
When utilizing a global secondary index (GSI) in DynamoDB, factors like attribute projection, storage, and throughput affect costs. The choice of attribute projection is a crucial consideration. Using INCLUDE projection allows for selecting a subset of attributes to be included in the index, thereby reducing storage costs. Conversely, ALL projection includes all attributes in the index, simplifying querying but significantly increasing both storage and throughput costs. Striking a balance between cost optimization and functionality is vital, and it requires careful evaluation of the projected attributes’ importance, the frequency of their access, and the available budget. By weighing these factors, one can make informed decisions to optimize costs while maintaining the desired level of performance and functionality.
Performance considerations
While adopting a single value as a partition key in the DynamoDB data model is convenient, it involves trade-offs that can affect performance. When a static value is used as the partition key for a GSI, all data items are concentrated within a single partition. As a result, the workload won’t be distributed evenly across multiple partitions, limiting the benefits of horizontal scaling in DynamoDB.
This concentration of data on a single partition can lead to performance bottlenecks. In scenarios where high read or write rates are expected, this approach can result in hot partitions, where a single partition becomes overwhelmed with excessive read or write requests. This can lead to throttling, increased latency, and reduced overall throughput. Therefore, considering DynamoDB’s partition limits, this strategy is best suited for tables where the write traffic doesn’t surpass 1000 WCU and index read traffic remains below 3000 RCU.
Careful consideration should be given to selecting a suitable partition key strategy to ensure optimal performance and scalability in DynamoDB.
Optimizing performance through partition key sharding
To address the scalability concerns associated with a single value as a partition key, you can employ GSI partition key sharding. This approach allows you to evenly distribute the workload across multiple partitions, resulting in improved performance and scalability.
Partition key sharding involves dividing the data across multiple partition keys instead of relying on a single static value as the partition key. This approach distributes the workload across multiple partitions, taking advantage of horizontal scaling in DynamoDB.
Partition key sharding offers several benefits:
- Improved scalability – By distributing the workload across multiple partitions, partition key sharding allows for better scalability as the application grows
- Enhanced performance – With the workload evenly distributed, partition key sharding helps avoid hot partitions, preventing performance bottlenecks and reducing the likelihood of throttling
- Flexibility in data distribution – Partition key sharding provides the flexibility to adjust and optimize data distribution based on changing access patterns and evolving application requirements
Calculated range sharding can be the simplest solution for partitioning your data. In this approach, you assign values ranging from 1 to N as the partition keys. With each DynamoDB partition capable of sustaining up to 1,000 WCUs, you can determine the appropriate value for N from your expected peak write throughput (in WCUs) using the following formula:
N = expected_peak_throughput / 1000
Round the result up to the next whole number. Dividing your expected peak throughput by 1,000 gives the number of partition key values (N) needed so that each one receives a manageable share of the writes and no single partition is overloaded.
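The formula can be worked through in a short sketch. Rounding up and the lower bound of one shard are additions made here, since a fractional or zero shard count is not usable. Choosing the shard at random for each write is likewise an assumption, one of several ways to spread items across the values 1 to N.

```python
import math
import random

PARTITION_WRITE_LIMIT_WCU = 1000  # per-partition write limit quoted in the post


def shard_count(expected_peak_wcu):
    """N = expected peak write throughput / 1000, rounded up, at least 1."""
    return max(1, math.ceil(expected_peak_wcu / PARTITION_WRITE_LIMIT_WCU))


def pick_shard(n_shards, rng=random):
    """Pick a gsi1_pk value between 1 and N for a new item.

    Random assignment is an assumption of this sketch.
    """
    return rng.randint(1, n_shards)


n = shard_count(4500)
shard = pick_shard(n)
print(n, shard)
```

For an expected peak of 4,500 WCUs this yields five shards, each expected to absorb about 900 WCUs if writes are spread evenly.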
With GSI write sharding, retrieving data becomes more complex. Items are spread across multiple GSI partition key values (the chosen partition key including the shard suffix), and each one holds only a subset of the data. To retrieve a complete result set, the application must query every shard, typically in parallel, and then merge the per-shard results, each of which is sorted by sort key, into a single ordered view. This parallelism keeps reads fast, but the merge step is the application’s responsibility; DynamoDB does not perform it.
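The merge step can be sketched with the standard library. The example assumes each shard has already been queried and has returned its items sorted by `SK` in ascending order. The sample items are made up, and the parallel Query calls themselves are left out.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit=None):
    """Merge per-shard lists, each sorted ascending by SK, into one ordered list."""
    merged = heapq.merge(*shard_results, key=lambda item: item["SK"])
    # islice stops early, so only the first `limit` items are ever pulled.
    return list(islice(merged, limit))


# Made-up results from two shards, each already in sort key order.
shard_1 = [
    {"SK": "2023-10-03T08:00:00.000Z"},
    {"SK": "2023-10-06T10:00:00.000Z"},
]
shard_2 = [
    {"SK": "2023-10-04T09:30:00.000Z"},
]

merged = merge_shard_results([shard_1, shard_2])
first_two = merge_shard_results([shard_1, shard_2], limit=2)
print([item["SK"] for item in merged])
```

To serve a "first N" request across shards, each shard would need to return up to N items before merging, because any single shard could hold all of the leading items. For descending reads, `heapq.merge` accepts `reverse=True`, provided the inputs are themselves sorted in descending order.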
Conclusion
In this post, we have demonstrated the utilization of a Global Secondary Index (GSI) in combination with sort key ordering to attain a globally sorted order across all items. This approach empowers us to efficiently organize and access our data, ensuring a robust foundation for various querying needs and improved performance within DynamoDB.
Additionally, partition key sharding addresses the concentration of requests that results from using a single value as a partition key. By distributing the workload across multiple partitions, this technique improves scalability and performance, at the cost of more complex reads.
Take the next step in mastering DynamoDB’s data modeling capabilities and optimizing your database performance. Explore the documentation’s data modeling section for in-depth insights, and learn how to create and manage Global Secondary Indexes (GSIs) to further enhance your DynamoDB expertise.
About the authors
Lee Hannigan is a Sr. DynamoDB Specialist Solutions Architect based in Donegal, Ireland. He brings a wealth of expertise in distributed systems, backed by a strong foundation in big data and analytics technologies. In his role as a DynamoDB Specialist Solutions Architect, Lee excels in assisting customers with the design, evaluation, and optimization of their workloads leveraging DynamoDB’s capabilities.