
Do you care about P99 latency?


TL;DR: We found that Cassandra's P99 latency spikes stem from OS file-cache thrashing during compactions: prefetching pollutes the cache, and evictions block memory allocations. By selectively prefetching and proactively evicting file-cache pages, we achieve a 23–48% reduction in P99 latency without changing the Cassandra installation.

For the uninitiated, P99 (99th-percentile) latency marks the cutoff beyond which the slowest 1% of requests fall. While it sounds like a small edge case, lowering it is critical: in distributed services, a single user request might trigger dozens of internal DB calls, and if any one of those calls hits a P99 latency spike, the entire user request becomes slow. As the system scales, hitting a slow path on some call becomes all but inevitable.
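A quick back-of-the-envelope calculation makes the fan-out effect concrete: if a request fans out to n internal calls, and each call independently lands in the slow tail with probability 1%, the chance that the whole request is slow is 1 − 0.99ⁿ.

```python
# Probability that a fan-out request hits at least one P99-slow internal call,
# assuming each of n internal calls is independently slow with probability 1%.
def p_slow_request(n: int, p_tail: float = 0.01) -> float:
    return 1.0 - (1.0 - p_tail) ** n

for n in (1, 10, 50, 100):
    print(f"{n:>3} internal calls -> {p_slow_request(n):.1%} chance of a slow request")
```

With 50 internal calls, roughly 40% of user requests see tail latency; the "1% edge case" stops being an edge case almost immediately.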

Designing a system that guarantees low P99 latency is extremely difficult; improving the tail of an existing system is harder still. Tail latency arises from rare events (e.g. garbage collection pauses, network retries, and compactions), and it is non-trivial to remove their effects without fundamentally changing them. P99 latency is one of those things that, once you truly understand it, makes you go “Oh Sh*t!“.

So how do we reduce P99?

There are many sources of tail latency, so there isn’t a silver bullet that reduces it for a given system. Here we focus on one specific contributor: the interaction between storage and memory. This source of tail latency is difficult to address from within Cassandra alone because it emerges from the behavior of the OS rather than the database. Our approach reduces tail latency by improving the cooperation between userspace and the OS. Let me explain:

Background:

The OS maintains an in-memory file cache (aka page cache) of recently accessed file pages to exploit temporal locality. Typically, the file cache grows to fill whatever memory is left over after anonymous allocations (e.g. via malloc). Since file-cache pages are in-memory copies of data persisted in storage, they are readily evicted when the system is low on memory. The OS only evicts pages when an allocation drives free memory down to the low zone watermark; at that point, the allocation request waits until the OS has evicted pages up to the high zone watermark using a clock replacement algorithm. The OS also predicts future file accesses and prefetches adjacent file data to exploit spatial locality. This prediction algorithm is deliberately lightweight to keep latency out of the read path. More on this in a bit.
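To make the watermark mechanics concrete, here is a toy model of clock (second-chance) eviction. This is a deliberate simplification: real kernel reclaim involves per-zone LRU lists and much more, and the frame counts and watermark values below are made up for illustration.

```python
from collections import OrderedDict

# Toy model: evict with a clock hand once free frames drop to the low
# watermark, and keep evicting until the high watermark is restored.
HIGH_WATERMARK = 4   # stop evicting once this many frames are free
LOW_WATERMARK = 1    # start evicting when free frames drop to this
TOTAL_FRAMES = 8

pages = OrderedDict()  # page -> referenced bit, kept in clock order

def free_frames() -> int:
    return TOTAL_FRAMES - len(pages)

def touch(page) -> None:
    """Access a page: fault it in, running clock eviction if needed."""
    if page in pages:
        pages[page] = True               # set the referenced bit
        return
    if free_frames() <= LOW_WATERMARK:   # allocation hits the low watermark...
        while free_frames() < HIGH_WATERMARK:  # ...evict up to the high one
            victim, referenced = pages.popitem(last=False)
            if referenced:
                pages[victim] = False    # second chance: clear bit, requeue
            # else: the page is evicted
    pages[page] = False

for p in range(10):
    touch(p)
```

The key point for tail latency is in `touch`: the unlucky allocation that crosses the low watermark pays for evicting a whole batch of pages before it can proceed, which is exactly the stall SpeedyIO tries to keep off the read path.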

Cassandra is built on Log Structured Merge trees (LSM). Incoming writes are first recorded in memory and flushed to disk as immutable SSTables. Over time, Cassandra performs periodic compactions to merge these SSTables; dozens of small, sorted files are read concurrently, their contents are scanned, merged, and re-written into fewer, larger files in lexicographical order, after which the old files are deleted. This process is inherently I/O-intensive. Compactions repeatedly stream large volumes of data from disk, and generate sustained read-write traffic. While necessary for maintaining read performance and space efficiency, these sequential scans interact poorly with the OS page cache, competing with latency-critical reads and increasing memory pressure.

Cassandra, being mature software, already uses some nice systems tricks to minimize unnecessary cache evictions during compactions. It reads a fixed chunk of a file (~10 MB) and then calls fadvise with FADV_DONTNEED on that file range to evict those pages from memory before reading the next chunk. This limits the cache pollution from each moribund file to around 10 MB.

So what’s the problem?

There are two significant user-space flows here:

  1. Get operations on Cassandra that translate to reads on the filesystem.

  2. Compactions that translate to sequential reads and writes on the filesystem.

The OS infers sequentiality by scanning for pages adjacent to the requested page in the page cache. If any of those pages are present in memory, it classifies the access pattern as sequential and performs readahead up to read_ahead_kb beyond the requested page. This is a twofold problem:

  1. The presence of an adjacent page in memory doesn’t imply sequential access; they may have been read far apart in time.

    In the above animation, the file-page access sequence is 1, 4, 2, which is not sequential. But Linux checks sequentiality by checking for the existence of an adjacent page, so it deems the access to page 2 sequential and prefetches page 3.

  2. The OS prefetches file data for accesses from Get operations as well. Unless the user is doing range queries over keys or compacting files, prefetched file pages only increase eviction overhead and waste storage IOPS and bandwidth.
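A toy model of the adjacency heuristic makes the false positive in point 1 concrete. This is a deliberate simplification of the kernel's readahead logic (which also tracks per-file readahead state), not a faithful reimplementation:

```python
# Toy model: classify a page access as "sequential" if the previous
# page of the same file is already resident in the cache.
cache = set()          # resident page numbers
readahead_log = []     # pages the model would prefetch

def access(page: int) -> None:
    sequential = (page - 1) in cache
    cache.add(page)
    if sequential:
        cache.add(page + 1)          # prefetch the next page
        readahead_log.append(page + 1)

for p in (1, 4, 2):  # the access sequence from the animation above
    access(p)

# Pages 1 and 4 were unrelated accesses, yet page 2 is classified as
# sequential because page 1 happens to be resident, so page 3 gets
# prefetched even though nothing will read it.
print(readahead_log)  # -> [3]
```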

A quick conclusion one could jump to at this point is to turn off file prefetching entirely and explicitly prefetch the files being compacted. But a blanket policy like that only increases latency aberrations in the system.

Here is how SpeedyIO tames P99:

  1. It identifies which files are okay to prefetch based on the kind of file (*Data.db, *log.db, etc.), the size of the file, and its access pattern.

  2. It proactively evicts cold file pages from the cache, so free memory never depletes to the low zone watermark, where memory allocations have to wait for evictions.

Overall, this enables the OS to retain the hottest file data in memory, prefetch near-term data, and keep allocations off the slow path. If code speaks to you more than words, check it out here.
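As a rough sketch of the first policy, the decision might look like the following. The file-name patterns, size threshold, and function name here are hypothetical illustrations, not SpeedyIO's actual rules (those live in the repository linked above):

```python
import fnmatch

# Hypothetical policy sketch: which files are worth prefetching?
# Patterns and thresholds are illustrative, not SpeedyIO's real values.
PREFETCH_PATTERNS = ["*Data.db"]       # bulk data files read sequentially
MIN_PREFETCH_BYTES = 4 * 1024 * 1024   # tiny files gain little from readahead

def should_prefetch(name: str, size: int, sequential_run: int) -> bool:
    """Prefetch only large data files whose recent accesses form a real
    sequential run, rather than trusting a single adjacent cached page."""
    if not any(fnmatch.fnmatch(name, p) for p in PREFETCH_PATTERNS):
        return False
    if size < MIN_PREFETCH_BYTES:
        return False
    return sequential_run >= 2  # require a run, not one adjacent hit

print(should_prefetch("nb-1-big-Data.db", 512 << 20, sequential_run=3))   # True
print(should_prefetch("nb-1-big-Index.db", 512 << 20, sequential_run=3))  # False
```

The interesting design choice is the last check: requiring an observed run of adjacent accesses is what avoids the false positive from the 1, 4, 2 example above.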

SpeedyIO sits as a runtime library (LD_PRELOAD) in the Cassandra launcher. This is a single-line change in the launcher (path/to/cassandra_installation/bin/cassandra).

...

if [ "x$allow_root" != "xyes" ] ; then
    if [ "$(id -u)" = "0" ] || [ "$(id -g)" = "0" ] ; then
        echo "Running Cassandra as root user or group is not recommended - please start Cassandra using a different system user."
        echo "If you really want to force running Cassandra as root, use -R command line option."
        exit 1
    fi
fi

export LD_PRELOAD="/path/to/lib_speedyio_release.so" # This is all you need to add

# Start up the service
launch_service "$pidfile" "$foreground" "$properties" "$classname"

...

Experiments

Here we have a cluster of 32 nodes running Cassandra 5 on CentOS 8 (Linux 4.18). YCSB is used to generate a uniformly distributed 50-50 read-update load on the system (workload A with a uniform request distribution, in YCSB terminology). All experimental configurations can be found here.

The figure below shows P99 read latency with increasing load on the system. It shows a ~48% reduction in P99 at high load (80–100 kops). Note that vanilla Cassandra was tuned to the best of our knowledge using publicly available recommendations and our own empirical analysis; the gains shown here are in addition to those optimizations. Also note that write-latency data is omitted for brevity since it is not affected.

The following figure shows the same experiment conducted on Linux 6.18, exhibiting an approximately 23% reduction in P99 latency at high load levels (80–100 kops).

The next plot shows the P99 read latency at a sustained throughput of 80 kops over a 24-hour period. Vanilla Cassandra (orange) exhibits significant variability in P99 latency, whereas Cassandra with SpeedyIO (blue) maintains a more stable latency profile.

So what’s the catch?

SpeedyIO works best when a few underlying assumptions about the system and workload hold true. These assumptions are common in many real-world Cassandra deployments, but they are worth making explicit.

  • Storage latency vs. memory latency
    SpeedyIO delivers the most benefit when storage access is significantly slower than memory access. In these environments, incorrect page-cache decisions are expensive and directly amplify tail latency. By improving cache behavior under these conditions, SpeedyIO can meaningfully reduce P99 latency. When storage latency is already very low, the relative impact of cache optimization naturally diminishes.

  • Dataset size relative to memory
    SpeedyIO is most effective when the active dataset does not fully fit in memory and the page cache is under pressure. In this regime, deciding which pages to keep or evict has a large impact on tail latency. When available memory comfortably exceeds the dataset’s working set, cache misses are rare and the opportunity for improvement is limited.


Safety guarantees

SpeedyIO is conservative in what it is allowed to do:

  • Data correctness is never compromised

    • No write reordering

    • No partial reads

    • No interference with fsync or durability guarantees

  • I/O paths are never blocked

    • SpeedyIO does not introduce unbounded waits or stall read/write syscalls

  • Memory is never pinned

    • SpeedyIO does not prevent the kernel from reclaiming memory when under pressure

  • Runtime disable is supported

    • SpeedyIO can be turned off without restarting the Cassandra node

  • Incremental deployment is safe

    • Nodes can be enabled or disabled independently

    • There is no requirement for all nodes in a cluster to run SpeedyIO

  • Operator visibility

    • SpeedyIO emits descriptive logs intended to help SREs understand and debug behavior

  • Scope is limited to Cassandra

    • SpeedyIO only affects the Cassandra JVM process

    • Other applications running on the same system are not impacted


What’s next

SpeedyIO is designed to reduce tail latency by improving how the OS page cache interacts with Cassandra’s I/O patterns. Ongoing and future work focuses on broadening its applicability and validating its behavior across a wider range of deployment scenarios.

A near-term focus is extending testing and tuning across additional Cassandra compaction strategies, including Time Window Compaction Strategy (TWCS), Date Tiered Compaction Strategy (DTCS), and Unified Compaction Strategy (UCS). Each of these strategies exhibits distinct I/O access patterns, and adapting SpeedyIO’s policies to them is necessary to ensure consistent tail-latency improvements.

Another area of development is making SpeedyIO more kernel-aware. The current policies are primarily designed and tuned for Linux 4.x. While they function correctly on Linux 6.x, they do not fully exploit newer kernel memory-management behavior. Introducing kernel-version–specific policies would allow SpeedyIO to better align with evolving eviction, reclaim, and prefetch heuristics, and deliver more consistent results across kernel versions.

Improving support for shared environments is also an active area of work. At present, SpeedyIO assumes Cassandra is the dominant storage I/O workload on a node. Lightweight background tasks and analytics are supported, but colocating multiple I/O-intensive workloads can dilute or negate the benefits. Future iterations aim to better isolate Cassandra’s critical I/O paths so that tail-latency gains are preserved even in more heavily shared deployments.

Finally, SpeedyIO is not yet optimized for systems where memory is spread across multiple NUMA nodes. Adding NUMA-aware policies is a necessary step towards robust support on modern multi-socket and high-core-count machines.


Check out the code here! If you need help with your Cassandra clusters, contact us.