AWS Database Blog
Best practices for running Apache Cassandra with Amazon EBS
This is a guest post written by Jon Haddad, an Apache Cassandra committer specializing in performance tuning, fixing broken clusters, and cost optimization. He’s an independent consultant and trainer who works with companies of all sizes to help them be successful. He previously worked on some of the biggest Cassandra deployments in the world.
AWS has long been a popular place to run Apache Cassandra, and Cassandra operators often ask what storage they should use to get the best price-performance. For the most part, the Cassandra community has opted for instance store: ephemeral SSDs on older instance types or AWS Nitro SSDs (NVMe) on newer ones. These drives provide low latency for small reads as well as high throughput for bulk operations such as compaction.
However, using instance store volumes with Cassandra has operational challenges. Operations such as exporting and importing backups require copying data files to an external location like Amazon Simple Storage Service (Amazon S3) with tools such as Medusa. Replacing a Cassandra node can be done from backups, but the time it takes is proportional to the amount of data you need to restore. Cassandra nodes generally store up to 1–2 TB of data, but this isn’t a hard limit, and a well-tuned cluster can store significantly more data per node.
Developers want to use Amazon Elastic Block Store (Amazon EBS) instead of instance storage with Cassandra because it helps simplify operations. Amazon EBS offers several advantages over an instance store that make it a solid choice:
- Replacing a failed node is almost instant
- It’s straightforward to quickly change instance types by detaching EBS volumes from one instance and reattaching them to a new one
- EBS snapshots are straightforward to take
- Many Cassandra nodes don’t need tens of thousands of IOPS and can perform well on Amazon EBS with only a few thousand provisioned IOPS
In this post, we discuss the basics of improving the performance of Amazon EBS with Cassandra to take advantage of these operational benefits. We explore some basic tools Cassandra operators use to gain insight into key performance metrics. You can then apply these metrics to modify key operating system (OS) tunables and Cassandra configuration. Finally, we review benchmarks of the performance gains from implementing best practices for Amazon EBS.
Amazon EBS options
Amazon EBS gp3 volumes are the latest generation of General Purpose SSD volumes. They offer more predictable performance scaling and prices that are up to 20% lower than gp2 volumes. Additionally, gp2 volumes required increasing volume size to achieve higher IOPS. For example, if you needed 10,000 IOPS per volume, you were required to have a 3.334 TB volume. With gp3, volume size is independent of provisioned IOPS. If you’re currently using gp2 volumes, you can upgrade them to gp3 using Amazon EBS Elastic Volumes operations. EBS gp3 volumes have a baseline of 3,000 IOPS, with additional IOPS up to a maximum of 16,000. If you need higher IOPS, you can use io2 Block Express, which supports up to 256,000 IOPS.
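If you want to migrate an existing gp2 volume, the following AWS CLI sketch shows one way to do it with Elastic Volumes; the volume ID, IOPS, and throughput values are placeholders to replace with your own:
# Change an existing volume to gp3 and provision IOPS and throughput (example values)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3 --iops 10000 --throughput 500
# Track the progress of the modification
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0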
The following is a high-level comparison of gp3 and io2 Block Express. EBS gp3 offers the following benefits:
- Single-digit millisecond latency
- Storage capacity up to 16 TiB per volume
- Provisioned IOPS up to 16,000 per volume
- Throughput up to 1,000 MiBps per volume
With io2 Block Express, you get the highest level of performance both in terms of throughput and low latency:
- Sub-millisecond average latency
- Storage capacity up to 64 TiB per volume
- Provisioned IOPS up to 256,000 per volume
- Throughput up to 4,000 MBps per volume
Several basic tunings are available to have your Cassandra cluster run optimally on Amazon EBS while avoiding costly overprovisioning. Skipping these steps can increase storage cost, increase latency, and limit throughput, so it’s important to take the time to implement these optimizations.
How Cassandra reads data from persistent storage
Before we explore possible optimizations, it’s critical to first understand how I/O is implemented in Cassandra. When Cassandra wants to read data off persistent storage, it doesn’t access the disk directly. Instead, Cassandra queries the file system, which is managed by the OS. Most of the time, it can be convenient to ignore that complexity, but when you need to tune performance, you need to understand it intimately. For instance, before reading directly from disk, the file system will check the OS page cache. Optimizing the page cache can improve latency and lower I/O usage. There are two types of reads you can see on a node:
- Random reads – These come from CQL queries and can access any location in any data file. We expect these queries to respond as quickly as possible, so we want to avoid reading any extra, unnecessary data, because that increases response times.
- Sequential reads – These are performed during Cassandra compaction and aren’t sensitive to latency. Compaction is a process that merges data files to boost read performance and remove deleted data. Here we want to achieve high throughput, and we’re going to read this data from the beginning of a file to the end of the file, in order. Consuming too many IOPS in compaction can impact IOPS available for CQL queries.
Performance tools
To optimize your system to get the best performance possible, it’s helpful to have the right tooling. In this section, we discuss helpful tools to understand the performance of different layers of Cassandra disk access.
When you’re trying to understand database performance, it’s important to first determine if there are any hardware bottlenecks. It can be difficult, if not impossible, to build an accurate mental model of the root cause without going through this process.
Because we’re tuning Cassandra for Amazon EBS, we focus on tools that can help you understand I/O. Several excellent ones are available in the sysstat and bcc-tools projects.
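As a hedged example, on an Ubuntu-based AMI you can install both packages as follows (package names vary by distribution; on Ubuntu, the bcc tools are packaged as bpfcc-tools and the binaries carry a -bpfcc suffix, as seen with biolatency-bpfcc later in this post):
# Install iostat (from sysstat) and the bcc tracing tools on Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y sysstat bpfcc-tools linux-headers-$(uname -r)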
Let’s explore five common tools for understanding I/O performance when working with Cassandra.
Get an overview using iostat
Linux iostat is a general-purpose tool that helps you understand your I/O usage at a high level. It has many display options, but for this example, we care most about transactions per second (TPS) and throughput (MBps), which help identify micro-bursting. Micro-bursting occurs when an EBS volume bursts high IOPS or throughput for significantly shorter periods than the collection interval. Because the burst lasts only a short time, Amazon CloudWatch may not reflect it. Exceeding provisioned IOPS can result in throttling and higher latency.
The tps column tells you approximately how close you are to reaching the provisioned IOPS of the device, and the MB/s column shows its read and write throughput. After each stage of optimization, use iostat to see the impact of your changes on throughput and IOPS. Note that EBS volumes are labeled as nvme devices, even though they’re Amazon EBS.
The following output is an example of running iostat while Cassandra is under load:
root@cassandra0:~$ iostat -dmxs 2 10
Linux 6.5.0-1016-aws (cassandra0) 03/27/24 _x86_64_ (8 CPU)
Device tps MB/s rqm/s await areq-sz aqu-sz %util
loop0 0.01 0.00 0.00 2.41 33.27 0.00 0.01
loop1 0.00 0.00 0.00 8.71 1.71 0.00 0.00
loop2 0.02 0.00 0.00 2.73 8.97 0.00 0.01
loop3 0.00 0.00 0.00 8.46 1.92 0.00 0.00
loop4 0.05 0.00 0.00 2.61 37.72 0.00 0.02
loop5 0.00 0.00 0.00 0.00 1.27 0.00 0.00
nvme0n1 5.17 0.08 2.15 2.88 15.39 0.01 0.97
nvme1n1 17.34 28.04 0.16 1.31 1655.44 0.02 0.39
View a histogram of file system performance using xfsdist
When you’re investigating performance, it’s often useful to understand each component of an I/O request. When performing a read, you make a request to the file system. Fortunately, bcc-tools makes it straightforward to get a breakdown of the distribution of response times at this level.
xfsdist shows how the file system, page cache, and storage device perform together. For example, unthrottled compaction without readahead enabled can often use up all the available IOPS on a disk, which can cause read latencies to spike. When a disk uses all of its provisioned IOPS, its performance is throttled, leading to latency spikes or throttled compaction. In the following example, you can observe outlier I/O requests:
$ xfsdist 1 10 -m
Tracing XFS operation latency... Hit Ctrl-C to end.
operation = write
msecs : count distribution
0 -> 1 : 31942 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 79 | |
The preceding output shows a small number of requests (79) that fall outside the desired response time. This indicates that you’ve gone over either your IOPS threshold or your throughput threshold. You can use the iostat command from the previous section to discover which limit you’re reaching. To address it, you can reduce resource usage by throttling compaction or increasing readahead, switch to an instance with more memory, or increase your provisioned IOPS or throughput. In the next section, we look at how to evaluate which is the best option.
The following is an example from a healthy volume. The response times all fall in the 0–1 millisecond bucket, which is exactly what you want from an EBS volume:
$ xfsdist 1 10 -m
Tracing XFS operation latency... Hit Ctrl-C to end.
operation = read
msecs : count distribution
0 -> 1 : 9735 |****************************************|
Get detailed information about slow reads using xfsslower
Now that you understand the high-level performance, it can be helpful to determine which file operations are slow, using xfsslower. If you use a different file system, equivalent tools such as ext4slower and zfsslower are available.
Each line in the output is one operation, which might have been served from the page cache or from the underlying block device. The COMM column shows the name of the thread that initiated the request, followed by the type of operation (R for read, W for write, S for sync), how many bytes were read, and the offset into the file in KB.
The first parameter tells xfsslower to show operations that took longer than N milliseconds. You can pass 0 to see every file system operation. This can generate a lot of data on a busy volume and should be used with care.
The following code provides an example:
$ xfsslower 1
Tracing XFS operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
19:20:16 HintsDispatche 19371 R 4096 44952 2.10 f4067d52-5f01-4ed6-9c7f-7d2503b5
19:20:16 CompactionExec 19371 R 16376 14760 0.03 nb-531-big-Data.db
19:20:16 CompactionExec 19371 R 16329 14631 0.04 nb-872-big-Data.db
19:20:16 CompactionExec 19371 R 16412 14598 0.01 nb-303-big-Data.db
19:20:16 CompactionExec 19371 R 16357 14565 0.01 nb-742-big-Data.db
19:20:16 CompactionExec 19371 R 16388 14743 0.02 nb-352-big-Data.db
19:20:16 CompactionExec 19371 R 16392 115279 2.34 nb-1094-big-Data.db
19:20:16 CompactionExec 19371 W 16387 1454049 0.03 nb-1204-big-Data.db
Learn about your page cache with cachestat
Because there’s a fast path (page cache) and a slow path (disk), it’s helpful to know what percentage of Cassandra’s requests go to each path. Any memory not used by Cassandra or another process on the machine is used for the page cache automatically. cachestat can tell you how many requests the page cache receives and what the hit rate is.
The following command prints the hit/miss rate every second, up to 10 readings:
$ cachestat 1 10
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
102989 0 0 100.0% 0.0% 10 1589
101581 0 0 100.0% 0.0% 10 1589
89928 1031 39 98.8% 1.1% 1 562
19243 18420 0 51.1% 48.9% 1 635
20840 18417 0 53.1% 46.9% 1 707
22867 18430 2 55.4% 44.6% 1 778
24927 18315 0 57.6% 42.4% 1 850
27529 18282 0 60.1% 39.9% 1 921
30444 18258 11 62.5% 37.5% 1 993
33881 18073 0 65.2% 34.8% 1 1063
The more RAM a machine has, the more data can be cached in memory. Access patterns generally follow the 80/20 rule, where 20% of the data is accessed 80% of the time. The more data you can keep in the page cache, the fewer IOPS are consumed, reducing utilization and avoiding throttling.
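As a quick check, the buff/cache column of free shows how much memory the kernel is currently using for the page cache and buffers:
# Report memory usage in human-readable units; buff/cache includes the page cache
free -h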
Drill down to the block level with biolatency
Finally, we get to the block device, or in this case the EBS volume. We want our requests to be as fast as possible, and to avoid impacting the device unnecessarily. Here you can see the distribution of response times at the device level, without the benefit of page cache. biolatency can tell you how many requests Cassandra performs as well as the distribution of response times:
$ biolatency-bpfcc 1 10
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 2 |*** |
32 -> 63 : 0 | |
64 -> 127 : 1 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 4 |****** |
1024 -> 2047 : 23 |****************************************|
2048 -> 4095 : 7 |************ |
Observations
Running the preceding commands on a running cluster and analyzing the results, then cross-referencing with the Amazon EBS documentation, can tell you several things:
- A single I/O operation can read or write up to 256 KiB with gp3
- Cassandra performs multiple small, random reads when reading data files
- Compaction is a sequential operation that can be optimized by using readahead, which reduces the number of disk operations needed for sequential reads
- Random I/O doesn’t benefit from large reads unless a lot of data is being read
- Returning data from the page cache is significantly faster than disk
- More RAM gives you more space for the page cache
Finding the ideal setting for readahead can help reduce the IOPS impact of compaction, at a slight cost to read latency. Using instances with more memory helps further by keeping more data in the page cache, reducing IOPS even more.
Best practices for Cassandra
In this section, we discuss several best practices for Cassandra.
Readahead
When you perform a read off persistent storage, the readahead setting causes the OS to pre-fetch the disk’s data and put it in the page cache. The value for readahead determines how much data is pre-fetched. This works well for sequential workloads like compaction, but isn’t as helpful for the small, random reads that happen when a user issues a query.
You can find your current readahead setting by running the following command and looking at the RA column (measured in 512-byte sectors) for each device:
root@cassandra0:~# blockdev --report
RO RA SSZ BSZ StartSec Size Device
ro 256 512 1024 0 26456064 /dev/loop0
ro 256 512 1024 0 58363904 /dev/loop1
ro 256 512 1024 0 67010560 /dev/loop2
ro 256 512 1024 0 91230208 /dev/loop3
ro 256 512 1024 0 40996864 /dev/loop4
rw 8 512 512 0 300000000000 /dev/nvme1n1
rw 256 512 4096 0 17179869184 /dev/nvme0n1
rw 256 512 4096 227328 17063460352 /dev/nvme0n1p1
rw 256 512 4096 2048 4194304 /dev/nvme0n1p14
rw 256 512 512 10240 111149056 /dev/nvme0n1p15
Minimizing readahead on ephemeral drives makes sense because it optimizes for read performance. It does require more IOPS for compaction, but with ephemeral NVMe drives, that cost is minimal. With Amazon EBS, the small reads performed during compaction can consume a higher percentage of your total provisioned IOPS. You can improve this by changing the readahead setting at the OS level. Using 16 KB as the minimum readahead value, which matches the Amazon EBS native block size, reads more data per I/O and therefore consumes fewer IOPS. Restart the Cassandra process for this setting to take effect. The blockdev command takes the value in 512-byte sectors, so double the desired size in KB. For example, to achieve 16 KB readahead, use a value of 32:
$ blockdev --setra 32 /dev/nvme1n1
The optimal setting depends on the workload. Readahead introduces a trade-off between compaction and read latency, so it’s most beneficial for write-heavy workloads that perform a lot of compaction and have more flexibility on read latency. With a drive provisioned at 3,000 IOPS, compaction is limited to about 12–15 MBps without readahead, and 50–60 MBps with readahead set to 16 KB.
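Note that blockdev --setra doesn’t persist across reboots. As a minimal sketch, assuming your data volume is /dev/nvme1n1, you can write the readahead value in KB to the device’s read_ahead_kb attribute, for example from a boot-time script or systemd unit:
# Equivalent to blockdev --setra 32: set readahead to 16 KB on the data volume
echo 16 > /sys/block/nvme1n1/queue/read_ahead_kb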
Tune compression
One of the compression settings is chunk_length_in_kb. Although it’s not obvious, this is one of the most impactful changes you can make to improve read performance. The chunk length is the size of the buffer used when compressing data. Larger buffers can compress slightly better, but they’re more expensive to read off disk: a bigger buffer means reading more data for a single read, even if you only need a few bytes. This can cause you to reach your provisioned throughput limit even though you’re issuing very few requests.
It’s a good idea to use nodetool tablehistograms to look at your partition sizes and nodetool tablestats to check your compression ratio when deciding whether to decrease the chunk length.
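For example, a hedged way to check the compression ratio for the keyspace and table used below (the ratio is compressed size divided by uncompressed size, so values close to 1.0 mean compression is providing little benefit):
# Show only the compression ratio line for this table
nodetool tablestats easy_cass_stress.keyvalue | grep -i "compression ratio"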
The output of nodetool tablehistograms tells you the distribution of how big your partition sizes are. This gives you an idea of what the maximum size of a read could be if you were to read from a single partition:
root@cassandra0:~$ nt tablehistograms easy_cass_stress keyvalue
easy_cass_stress/keyvalue histograms
Percentile Read Latency Write Latency SSTables Partition Size Cell Count
(micros) (micros) (bytes)
50% 14.00 10.00 0.00 215 1
75% 17.00 12.00 0.00 215 1
95% 35.00 17.00 0.00 258 1
98% 42.00 24.00 0.00 258 1
99% 50.00 29.00 0.00 258 1
Min 3.00 2.00 0.00 125 0
Max 310.00 215.00 0.00 258 1
In the preceding example, the largest partition is only 258 bytes. A best practice is to use a smaller chunk length to reduce read amplification and CPU overhead on tables with small partitions, or when using lightweight transactions or counters, because they perform a read before write. You have to use a power of two for the chunk length, and the smallest possible value is 4 KB, so we use that here:
cqlsh> alter table easy_cass_stress.keyvalue with compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};
If you’re running Cassandra 3.x, changing this from the old default of 64 KB to 16 KB will improve performance. In Cassandra 4.0 and later, the default is 16 KB. Tables created prior to 4.0 should be altered to use the new chunk length value.
You’ll need to recompress all your SSTables to see the benefit. If you’re using 4.0 or below, you can use the following code:
nodetool upgradesstables -a
If you’re using 4.1 and above, you can use recompress_sstables, which is more efficient:
nodetool recompress_sstables
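Either way, the SSTable rewrite shows up as a compaction, so you can monitor its progress with nodetool:
# List active compactions, including SSTable rewrites, in human-readable units
nodetool compactionstats -H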
See CASSANDRA-13241 for additional information.
Compaction throughput
It can be tempting to increase compaction throughput to a very high setting or make it unlimited, but this can cause spikes in disk usage that harm latency. Instead, set it low enough that compaction just barely keeps up, amortizing its cost over a long period of time. A good starting point is to set compaction throughput to 32 MBps in cassandra.yaml.
In versions before 4.1, you can use the following setting:
compaction_throughput_mb_per_sec: 32
In versions 4.1 and up, you can use the new setting:
compaction_throughput: 32MiB/s
Using a higher compaction throughput setting consumes more IOPS, so be sure to provision your volumes with enough IOPS and throughput to handle compaction, plus enough headroom for reads.
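You can also adjust compaction throughput at runtime without a restart, for example while investigating a latency issue; runtime changes made through nodetool revert to the cassandra.yaml value on the next restart:
# Check the current compaction throughput
nodetool getcompactionthroughput
# Throttle compaction to 32 MBps until the next restart
nodetool setcompactionthroughput 32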
Disk access mode
The disk_access_mode setting in Apache Cassandra determines how Cassandra performs I/O when reading and writing SSTable files. When set to mmap, Cassandra uses memory-mapped files for SSTables, mapping them into the application’s address space. However, this can cause Cassandra to page excessively if the SSTables don’t fit into memory, resulting in increased utilization or Linux out-of-memory errors. To prevent this, set disk_access_mode to mmap_index_only, which memory-maps only the SSTable index files and accesses data files with standard I/O.
disk_access_mode: mmap_index_only
Benchmarking
It can be intimidating to set up a lab environment with all the recommended tools across different versions of Cassandra and configuration options. Fortunately, easy-cass-lab makes it straightforward to set up Cassandra test environments in AWS, and the stress tool easy-cass-stress enables user-friendly benchmarking. easy-cass-lab also bundles fio, the disk benchmarking tool widely used by Linux kernel developers to evaluate disk performance.
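As a hedged example, the following fio invocation approximates the small, random reads Cassandra issues for CQL queries; the file path, size, and runtime are placeholder values, and it assumes the EBS data volume is mounted at /var/lib/cassandra:
# Random 4 KB reads with direct I/O (bypassing the page cache) against a 10 GB test file
fio --name=cassandra-randread --filename=/var/lib/cassandra/fio-test --size=10G --rw=randread --bs=4k --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based --group_reporting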
Conclusion
By tuning compression, limiting compaction throughput, and setting readahead appropriately, you can minimize read amplification and keep your Cassandra cluster performing consistently on Amazon EBS while reducing the IOPS you need to provision.
Running Cassandra on EBS volumes like gp3 and io2 can provide great price-performance benefits compared to ephemeral storage, while also simplifying operational tasks like backups, upgrades, and node replacements. However, it requires careful tuning of configurations like readahead, compression chunk length, and compaction throughput to achieve optimal performance.
In this post, we covered how Cassandra interacts with the OS’s file system and page cache when reading from disk. We looked at various Linux tools like iostat, xfsdist, xfsslower, cachestat, and biolatency, which give insights into different layers of disk I/O performance.
We then applied those insights to recommend several best practices for Cassandra on Amazon EBS: increasing readahead block size, reducing compression chunk length for small partitions, and limiting compaction throughput.
Finally, we introduced the easy-cass-lab and easy-cass-stress tools to set up test environments on AWS for benchmarking Cassandra performance.
With the right tooling, monitoring, and configuration tuning, you can achieve great price-performance for running Cassandra deployments on EBS volumes while benefiting from operational advantages over ephemeral instance storage.
In addition to running Apache Cassandra on Amazon EC2 with Amazon EBS, AWS also offers Amazon Keyspaces (for Apache Cassandra), a fully managed database that is compatible with Cassandra but has some functional differences.
We’re looking to bring some of these optimizations into Cassandra itself, eliminating the need to tune readahead. If this sounds useful to you, follow the improvements happening in CASSANDRA-15452.
About the Author
Jon Haddad is an Apache Cassandra committer specializing in performance tuning, fixing broken clusters, and cost optimization. He’s an independent consultant and trainer who works with companies of all sizes to help them be successful. He previously worked on some of the biggest Cassandra deployments in the world.