AWS Database Blog
Best practices for running Apache Cassandra with Amazon EBS
This is a guest post written by Jon Haddad, an Apache Cassandra committer specializing in performance tuning, fixing broken clusters, and cost optimization. He’s an independent consultant and trainer who works with companies of all sizes to help them be successful. He previously worked on some of the biggest Cassandra deployments in the world.
AWS has long been a popular place to run Apache Cassandra, and Cassandra operators often ask what storage they should use to get the best price-performance. For the most part, the Cassandra community has opted for instance store: ephemeral SSDs on older instance types or AWS Nitro SSDs (NVMe) on newer ones. These drives provide low latency for small reads as well as high throughput for bulk operations such as compaction.
However, using instance store volumes with Cassandra has operational challenges. Operations such as exporting and importing backups require copying data files to an external location like Amazon Simple Storage Service (Amazon S3) with tools such as Medusa. Replacing a Cassandra node can be done from backups, but the time it takes is proportional to the amount of data you need to restore. Cassandra nodes generally store up to 1–2 TB of data, but this isn’t a hard limit, and a well-tuned cluster can store significantly more data per node.
Developers want to use Amazon Elastic Block Store (Amazon EBS) instead of instance storage with Cassandra because it helps simplify operations. Amazon EBS offers several advantages over an instance store that make it a solid choice:
- Replacing a failed node is almost instant
- It’s straightforward to quickly change instance types by detaching EBS volumes from one instance and reattaching them to a new one
- EBS snapshots are straightforward to take
- Many Cassandra nodes don’t need tens of thousands of IOPS and can perform well on Amazon EBS with only a few thousand provisioned IOPS
In this post, we discuss the basics of improving the performance of Amazon EBS with Cassandra to take advantage of these operational benefits. We explore some basic tools Cassandra operators use to gain insight into key performance metrics. You can then apply these metrics to modify key operating system (OS) tunables and Cassandra configuration. Finally, we review benchmarks of the performance gains from implementing best practices for Amazon EBS.
Amazon EBS options
Amazon EBS gp3 volumes are the latest generation of General Purpose SSD volumes. They offer more predictable performance scaling and prices that are up to 20% lower than gp2 volumes. Additionally, gp2 volumes required increasing volume size to achieve higher IOPS. For example, if you needed 10,000 IOPS per volume, you were required to have a 3.334 TB volume. With gp3, volume size is independent of provisioned IOPS. If you’re currently using gp2 volumes, you can upgrade them to gp3 using Amazon EBS Elastic Volumes operations. EBS gp3 volumes have a baseline of 3,000 IOPS, with additional IOPS up to a maximum of 16,000. If you need higher IOPS, you can use io2 Block Express, which supports up to 256,000 IOPS.
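If you want to migrate an existing gp2 volume, the following AWS CLI sketch shows one way to do it with Elastic Volumes; the volume ID, IOPS, and throughput values are placeholders to replace with your own:
# Change an existing volume to gp3 and provision IOPS and throughput (example values)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3 --iops 10000 --throughput 500
# Track the progress of the modification
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0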
The following is a high-level comparison of gp3 and io2 Block Express. EBS gp3 offers the following benefits:
- Single-digit millisecond latency
- Storage capacity up to 16 TiB per volume
- Provisioned IOPS up to 16,000 per volume
- Throughput up to 1,000 MiBps per volume
With io2 Block Express, you get the highest level of performance both in terms of throughput and low latency:
- Sub-millisecond average latency
- Storage capacity up to 64 TiB per volume
- Provisioned IOPS up to 256,000 per volume
- Throughput up to 4,000 MBps per volume
Several basic tunings are available to have your Cassandra cluster run optimally on Amazon EBS while avoiding costly overprovisioning. Skipping these steps can increase storage cost, increase latency, and limit throughput, so it’s important to take the time to implement these optimizations.
How Cassandra reads data from persistent storage
Before we explore possible optimizations, it’s critical to first understand how I/O is implemented in Cassandra. When Cassandra wants to read data off persistent storage, it doesn’t access the disk directly. Instead, Cassandra queries the file system, which is managed by the OS. Most of the time, it can be convenient to ignore that complexity, but when you need to tune performance, you need to understand it intimately. For instance, before reading directly from disk, the file system will check the OS page cache. Optimizing the page cache can improve latency and lower I/O usage. There are two types of reads you can see on a node:
- Random reads – These come from CQL queries and can access any location in any data file. We expect these queries to respond as quickly as possible, so we want to avoid reading any extra, unnecessary data, because that increases response times.
- Sequential reads – These are performed during Cassandra compaction and aren’t sensitive to latency. Compaction is a process that merges data files to boost read performance and remove deleted data. Here we want to achieve high throughput, and we’re going to read this data from the beginning of a file to the end of the file, in order. Consuming too many IOPS in compaction can impact IOPS available for CQL queries.
Performance tools
To optimize your system to get the best performance possible, it’s helpful to have the right tooling. In this section, we discuss helpful tools to understand the performance of different layers of Cassandra disk access.
When you’re trying to understand database performance, it’s important to first determine if there are any hardware bottlenecks. It can be difficult, if not impossible, to build an accurate mental model of the root cause without going through this process.
Because we’re tuning Cassandra for Amazon EBS, we focus on tools that can help you understand I/O. Several excellent ones are available in the sysstat and bcc-tools projects.
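As a hedged example, on an Ubuntu-based AMI you can install both packages as follows (package names vary by distribution; on Ubuntu, the bcc tools are packaged as bpfcc-tools and the binaries carry a -bpfcc suffix, as seen with biolatency-bpfcc later in this post):
# Install iostat (from sysstat) and the bcc tracing tools on Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y sysstat bpfcc-tools linux-headers-$(uname -r)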
Let’s explore five common tools for understanding I/O performance when working with Cassandra.
Get an overview using iostat
Linux iostat is a general-purpose tool that helps you understand your I/O usage at a high level. It has many display options, but for this example, we care most about transactions per second (TPS) and throughput (MBps), which help identify micro-bursting. Micro-bursting occurs when an EBS volume bursts high IOPS or throughput for significantly shorter periods than the collection interval. Because the burst lasts only a short time, Amazon CloudWatch may not reflect it. Exceeding provisioned IOPS can result in throttling and higher latency.
The tps column tells you approximately how close you are to reaching the provisioned IOPS of the device, and the MB/s column shows its read and write throughput. After each stage of optimization, use iostat to see the impact of your changes on throughput and IOPS. Note that EBS volumes are labeled as nvme devices, even though they’re Amazon EBS.
The following output is an example of running iostat while Cassandra is under load:
root@cassandra0:~$ iostat -dmxs 2 10
Linux 6.5.0-1016-aws (cassandra0) 03/27/24 _x86_64_ (8 CPU)
Device tps MB/s rqm/s await areq-sz aqu-sz %util
loop0 0.01 0.00 0.00 2.41 33.27 0.00 0.01
loop1 0.00 0.00 0.00 8.71 1.71 0.00 0.00
loop2 0.02 0.00 0.00 2.73 8.97 0.00 0.01
loop3 0.00 0.00 0.00 8.46 1.92 0.00 0.00
loop4 0.05 0.00 0.00 2.61 37.72 0.00 0.02
loop5 0.00 0.00 0.00 0.00 1.27 0.00 0.00
nvme0n1 5.17 0.08 2.15 2.88 15.39 0.01 0.97
nvme1n1 17.34 28.04 0.16 1.31 1655.44 0.02 0.39
View a histogram of file system performance using xfsdist
When you’re investigating performance, it’s often useful to understand each component of an I/O request. When performing a read, you make a request to the file system. Fortunately, bcc-tools makes it straightforward to get a breakdown of the distribution of response times at this level.
xfsdist shows how the file system, page cache, and storage device perform together. For example, unthrottled compaction without readahead enabled can often use up all the available IOPS on a disk, which can cause read latencies to spike. When a disk uses all of its provisioned IOPS, its performance is throttled, leading to latency spikes or throttled compaction. In the following example, you can observe outlier I/O requests:
$ xfsdist 1 10 -m
Tracing XFS operation latency... Hit Ctrl-C to end.
operation = write
msecs : count distribution
0 -> 1 : 31942 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 79 | |
The preceding output shows a small number of requests (79) that fall outside the desired response time. This indicates that you’ve gone over either your IOPS threshold or your throughput threshold. You can use the iostat command from the previous section to discover which limit you’re reaching. To address it, you can reduce resource usage by throttling compaction or increasing readahead, switch to an instance with more memory, or increase your provisioned IOPS or throughput. In the next section, we look at how to evaluate which is the best option.
The following is an example from a healthy volume. The response times all fall in the 0–1 millisecond bucket, which is exactly what you want from an EBS volume:
$ xfsdist 1 10 -m
Tracing XFS operation latency... Hit Ctrl-C to end.
operation = read
msecs : count distribution
0 -> 1 : 9735 |****************************************|
Get detailed information about slow reads using xfsslower
Now that you understand the high-level performance, it can be helpful to determine which file operations are slow, using xfsslower. If you use a different file system, equivalent tools such as ext4slower and zfsslower are available.
Each line in the output is one operation, which might have been served from the page cache or from the underlying block device. The COMM column shows the name of the thread that initiated the request, followed by the type of operation (R for read, W for write, S for sync), how many bytes were read, and the offset into the file in KB.
The first parameter tells xfsslower to show operations that took longer than N milliseconds. You can pass 0 to see every file system operation. This can generate a lot of data on a busy volume and should be used with care.
The following code provides an example:
$ xfsslower 1
Tracing XFS operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
19:20:16 HintsDispatche 19371 R 4096 44952 2.10 f4067d52-5f01-4ed6-9c7f-7d2503b5
19:20:16 CompactionExec 19371 R 16376 14760 0.03 nb-531-big-Data.db
19:20:16 CompactionExec 19371 R 16329 14631 0.04 nb-872-big-Data.db
19:20:16 CompactionExec 19371 R 16412 14598 0.01 nb-303-big-Data.db
19:20:16 CompactionExec 19371 R 16357 14565 0.01 nb-742-big-Data.db
19:20:16 CompactionExec 19371 R 16388 14743 0.02 nb-352-big-Data.db
19:20:16 CompactionExec 19371 R 16392 115279 2.34 nb-1094-big-Data.db
19:20:16 CompactionExec 19371 W 16387 1454049 0.03 nb-1204-big-Data.db
Learn about your page cache with cachestat
Because there’s a fast path (page cache) and a slow path (disk), it’s helpful to know what percentage of Cassandra’s requests go to each path. Any memory not used by Cassandra or another process on the machine is used for the page cache automatically. cachestat can tell you how many requests the page cache receives and what the hit rate is.
The following command prints the hit/miss rate every second, up to 10 readings:
$ cachestat 1 10
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
102989 0 0 100.0% 0.0% 10 1589
101581 0 0 100.0% 0.0% 10 1589
89928 1031 39 98.8% 1.1% 1 562
19243 18420 0 51.1% 48.9% 1 635
20840 18417 0 53.1% 46.9% 1 707
22867 18430 2 55.4% 44.6% 1 778
24927 18315 0 57.6% 42.4% 1 850
27529 18282 0 60.1% 39.9% 1 921
30444 18258 11 62.5% 37.5% 1 993
33881 18073 0 65.2% 34.8% 1 1063
The more RAM a machine has, the more data can be cached in memory. Access patterns generally follow the 80/20 rule, where 20% of the data is accessed 80% of the time. The more data you can keep in the page cache, the fewer IOPS are consumed, reducing utilization and avoiding throttling.
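As a quick check, the buff/cache column of free shows how much memory the kernel is currently using for the page cache and buffers:
# Report memory usage in human-readable units; buff/cache includes the page cache
free -h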
Drill down to the block level with biolatency
Finally, we get to the block device, or in this case the EBS volume. We want our requests to be as fast as possible, and to avoid impacting the device unnecessarily. Here you can see the distribution of response times at the device level, without the benefit of page cache. biolatency can tell you how many requests Cassandra performs as well as the distribution of response times:
$ biolatency-bpfcc 1 10
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 2 |*** |
32 -> 63 : 0 | |
64 -> 127 : 1 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 4 |****** |
1024 -> 2047 : 23 |****************************************|
2048 -> 4095 : 7 |************ |
Observations
Running the preceding commands on a running cluster and analyzing the results, then cross-referencing with the Amazon EBS documentation, can tell you several things:
- A single I/O operation can read or write up to 256 KiB with gp3
- Cassandra performs multiple small, random reads when reading data files
- Compaction is a sequential operation that can be optimized by using readahead, which reduces the number of disk operations needed for sequential reads
- Random I/O doesn’t benefit from large reads unless a lot of data is being read
- Returning data from the page cache is significantly faster than disk
- More RAM gives you more space for the page cache
Finding the ideal setting for readahead can help reduce the IOPS impact of compaction, at a slight cost to read latency. Using instances with more memory helps further by keeping more data in the page cache, reducing IOPS even more.
Best practices for Cassandra
In this section, we discuss several best practices for Cassandra.
Readahead
When you perform a read off persistent storage, the readahead setting causes the OS to pre-fetch the disk’s data and put it in the page cache. The value for readahead determines how much data is pre-fetched. This works well for sequential workloads like compaction, but isn’t as helpful for the small, random reads that happen when a user issues a query.
You can find your current readahead setting by running the following command and looking at the RA column (measured in 512-byte sectors) for each device:
root@cassandra0:~# blockdev --report
RO RA SSZ BSZ StartSec Size Device
ro 256 512 1024 0 26456064 /dev/loop0
ro 256 512 1024 0 58363904 /dev/loop1
ro 256 512 1024 0 67010560 /dev/loop2
ro 256 512 1024 0 91230208 /dev/loop3
ro 256 512 1024 0 40996864 /dev/loop4
rw 8 512 512 0 300000000000 /dev/nvme1n1
rw 256 512 4096 0 17179869184 /dev/nvme0n1
rw 256 512 4096 227328 17063460352 /dev/nvme0n1p1
rw 256 512 4096 2048 4194304 /dev/nvme0n1p14
rw 256 512 512 10240 111149056 /dev/nvme0n1p15
Minimizing readahead on ephemeral drives makes sense because it optimizes for read performance. It does require more IOPS for compaction, but with ephemeral NVMe drives, that cost is minimal. With Amazon EBS, the small reads performed during compaction can consume a higher percentage of your total provisioned IOPS. You can improve this by changing the readahead setting at the OS level. Using 16 KB as the minimum readahead value, which matches the Amazon EBS native block size, reads more data per I/O and therefore consumes fewer IOPS. Restart the Cassandra process for this setting to take effect. The blockdev command takes the value in 512-byte sectors, so double the desired size in KB. For example, to achieve 16 KB readahead, use a value of 32:
$ blockdev --setra 32 /dev/nvme1n1
The optimal setting depends on the workload. Readahead introduces a trade-off between compaction and read latency, so it’s most beneficial for write-heavy workloads that perform a lot of compaction and have more flexibility on read latency. With a drive provisioned at 3,000 IOPS, compaction is limited to about 12–15 MBps without readahead, and 50–60 MBps with readahead set to 16 KB.
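Note that blockdev --setra doesn’t persist across reboots. As a minimal sketch, assuming your data volume is /dev/nvme1n1, you can write the readahead value in KB to the device’s read_ahead_kb attribute, for example from a boot-time script or systemd unit:
# Equivalent to blockdev --setra 32: set readahead to 16 KB on the data volume
echo 16 > /sys/block/nvme1n1/queue/read_ahead_kb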
Tune compression
One of the compression settings is chunk_length_in_kb. Although it’s not obvious, this is one of the most impactful changes you can make to improve read performance. The chunk length is the size of the buffer used when compressing data. Larger buffers can compress slightly better, but they’re more expensive to read off disk: a bigger buffer means reading more data for a single read, even if you only need a few bytes. This can cause you to reach your provisioned throughput limit even though you’re issuing very few requests.
It’s a good idea to use nodetool tablehistograms to look at your partition sizes and nodetool tablestats to check your compression ratio when deciding whether to decrease the chunk length.
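For example, a hedged way to check the compression ratio for the keyspace and table used below (the ratio is compressed size divided by uncompressed size, so values close to 1.0 mean compression is providing little benefit):
# Show only the compression ratio line for this table
nodetool tablestats easy_cass_stress.keyvalue | grep -i "compression ratio"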
The output of nodetool tablehistograms tells you the distribution of how big your partition sizes are. This gives you an idea of what the maximum size of a read could be if you were to read from a single partition:
root@cassandra0:~$ nt tablehistograms easy_cass_stress keyvalue
easy_cass_stress/keyvalue histograms
Percentile Read Latency Write Latency SSTables Partition Size Cell Count
(micros) (micros) (bytes)
50% 14.00 10.00 0.00 215 1
75% 17.00 12.00 0.00 215 1
95% 35.00 17.00 0.00 258 1
98% 42.00 24.00 0.00 258 1
99% 50.00 29.00 0.00 258 1
Min 3.00 2.00 0.00 125 0
Max 310.00 215.00 0.00 258 1
In the preceding example, the largest partition is only 258 bytes. A best practice is to use a smaller chunk length to reduce read amplification and CPU overhead on tables with small partitions, or when using lightweight transactions or counters, because they perform a read before write. You have to use a power of two for the chunk length, and the smallest possible value is 4 KB, so we use that here:
cqlsh> alter table easy_cass_stress.keyvalue with compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};
If you’re running Cassandra 3.x, changing this from the old default of 64 KB to 16 KB will improve performance. In Cassandra 4.0 and later, the default is 16 KB. Tables created prior to 4.0 should be altered to use the new chunk length value.
You’ll need to recompress all your SSTables to see the benefit. If you’re using 4.0 or below, you can use the following code:
nodetool upgradesstables -a
If you’re using 4.1 and above, you can use recompress_sstables, which is more efficient:
nodetool recompress_sstables
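Either way, the SSTable rewrite shows up as a compaction, so you can monitor its progress with nodetool:
# List active compactions, including SSTable rewrites, in human-readable units
nodetool compactionstats -H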
See CASSANDRA-13241 for additional information.
Compaction throughput
It can be tempting to increase compaction throughput to a very high setting or make it unlimited, but this can cause spikes in disk usage that harm latency. Instead, set it low enough that compaction just barely keeps up, amortizing its cost over a long period of time. A good starting point is to set compaction throughput to 32 MBps in cassandra.yaml.
In versions before 4.1, you can use the following setting:
compaction_throughput_mb_per_sec: 32
In versions 4.1 and up, you can use the new setting:
compaction_throughput: 32MiB/s
Using a higher compaction throughput setting consumes more IOPS, so be sure to provision your volumes with enough IOPS and throughput to handle compaction, plus enough headroom for reads.
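You can also adjust compaction throughput at runtime without a restart, for example while investigating a latency issue; runtime changes made through nodetool revert to the cassandra.yaml value on the next restart:
# Check the current compaction throughput
nodetool getcompactionthroughput
# Throttle compaction to 32 MBps until the next restart
nodetool setcompactionthroughput 32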
Disk access mode
The disk_access_mode setting in Apache Cassandra determines how Cassandra performs I/O when reading and writing SSTable files. When set to mmap, Cassandra uses memory-mapped files for SSTables, mapping them into the application’s address space. However, this can cause Cassandra to page excessively if the SSTables don’t fit into memory, resulting in increased utilization or Linux out-of-memory errors. To prevent this, set disk_access_mode to mmap_index_only, which memory-maps only the SSTable index files and accesses data files with standard I/O.
disk_access_mode: mmap_index_only
Benchmarking
It can be intimidating to set up a lab environment with all the recommended tools across different versions of Cassandra and configuration options. Fortunately, easy-cass-lab makes it straightforward to set up Cassandra test environments in AWS, and the stress tool easy-cass-stress enables user-friendly benchmarking. easy-cass-lab also bundles fio, the disk benchmarking tool widely used by Linux kernel developers to evaluate disk performance.
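As a hedged example, the following fio invocation approximates the small, random reads Cassandra issues for CQL queries; the file path, size, and runtime are placeholder values, and it assumes the EBS data volume is mounted at /var/lib/cassandra:
# Random 4 KB reads with direct I/O (bypassing the page cache) against a 10 GB test file
fio --name=cassandra-randread --filename=/var/lib/cassandra/fio-test --size=10G --rw=randread --bs=4k --ioengine=libaio --iodepth=16 --direct=1 --runtime=60 --time_based --group_reporting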
Conclusion
By tuning compression, limiting compaction throughput, and setting readahead appropriately, you can minimize read amplification and keep your Cassandra cluster performing consistently on Amazon EBS while reducing the IOPS you need to provision.
Running Cassandra on EBS volumes like gp3 and io2 can provide great price-performance benefits compared to ephemeral storage, while also simplifying operational tasks like backups, upgrades, and node replacements. However, it requires careful tuning of configurations like readahead, compression chunk length, and compaction throughput to achieve optimal performance.
In this post, we covered how Cassandra interacts with the OS’s file system and page cache when reading from disk. We looked at various Linux tools like iostat, xfsdist, xfsslower, cachestat, and biolatency, which give insights into different layers of disk I/O performance.
We then applied those insights to recommend several best practices for Cassandra on Amazon EBS: increasing readahead block size, reducing compression chunk length for small partitions, and limiting compaction throughput.
Finally, we introduced the easy-cass-lab and easy-cass-stress tools to set up test environments on AWS for benchmarking Cassandra performance.
With the right tooling, monitoring, and configuration tuning, you can achieve great price-performance for running Cassandra deployments on EBS volumes while benefiting from operational advantages over ephemeral instance storage.
In addition to running Apache Cassandra on Amazon EC2 with Amazon EBS, AWS also offers Amazon Keyspaces (for Apache Cassandra), a fully managed database that is compatible with Cassandra but has some functional differences.
We’re looking to bring some of these optimizations into Cassandra itself, eliminating the need to tune readahead. If this sounds useful to you, follow the improvements happening in CASSANDRA-15452.
About the Author
Jon Haddad is an Apache Cassandra committer specializing in performance tuning, fixing broken clusters, and cost optimization. He’s an independent consultant and trainer who works with companies of all sizes to help them be successful. He previously worked on some of the biggest Cassandra deployments in the world.