<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[SpeedyIO Blog]]></title><description><![CDATA[SpeedyIO Blog]]></description><link>https://blog.speedyio.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1762108237749/ff59c2bc-652f-4b00-bae0-9cda5f459f5b.png</url><title>SpeedyIO Blog</title><link>https://blog.speedyio.com</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 08 May 2026 08:49:00 GMT</lastBuildDate><atom:link href="https://blog.speedyio.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Do you care about P99 latency?]]></title><description><![CDATA[TL;DR: We have found that Cassandra's P99 spikes stem from OS file cache thrashing during compactions: prefetching pollutes the cache, and evictions block memory allocations. We show a 23-48% reduction in P99 latency without changing the Cassandra installation, ...]]></description><link>https://blog.speedyio.com/do-you-care-about-p99-latency</link><guid isPermaLink="true">https://blog.speedyio.com/do-you-care-about-p99-latency</guid><dc:creator><![CDATA[Shaleen Garg]]></dc:creator><pubDate>Tue, 03 Feb 2026 08:36:38 GMT</pubDate><content:encoded><![CDATA[<p>TL;DR: We have found that Cassandra's P99 spikes stem from OS file cache thrashing during compactions: prefetching pollutes the cache, and evictions block memory allocations. We show a 23-48% reduction in P99 latency without changing the Cassandra installation, by selectively prefetching and proactively evicting file cache pages.</p>
<p>For the uninitiated, P99 (99th percentile) latency marks the cutoff beyond which the slowest 1% of requests fall. While it sounds like a small edge case, lowering it is critical because in distributed services, a single user request might trigger dozens of internal DB calls. If any one of those calls hits a P99 latency spike, the entire user request becomes slow. As the system scales, a user hitting a slow path becomes all but inevitable.</p>
<p>Designing a system which guarantees a low P99 latency is extremely difficult; moreover, improving the tail of an existing system borders on impossible. This is because tail latency arises out of rare events (e.g. garbage collection, network retries, and compactions), and it is non-trivial to remove their effects without fundamentally changing them. P99 latency is one of those things that, once you truly understand, makes you go <a target="_blank" href="https://www.youtube.com/watch?v=lJ8ydIuPFeU">“Oh Sh*t!”</a>.</p>
<h2 id="heading-so-how-do-we-reduce-p99"><strong>So how do we reduce P99?</strong></h2>
<p>Since there are many sources of tail latency, there isn’t a silver bullet that reduces it for a given system. Here we focus on one specific contributor: the interaction between storage and memory. This source of tail latency is difficult to address from within Cassandra alone, as it emerges from the behavior of the OS rather than just the database. Our approach is to improve the <a target="_blank" href="https://www.nature.com/articles/s41598-020-75050-4">cooperation</a> between userspace and the OS. Let me explain:</p>
<h3 id="heading-background">Background:</h3>
<p>The OS maintains an in-memory file cache (aka page cache) of all recently accessed file pages for temporal locality. Typically, the size of the file cache is bounded by <a target="_blank" href="https://docs.kernel.org/admin-guide/mm/concepts.html#anonymous-memory">anonymous allocations</a> (e.g. using malloc) and the total available memory. Since file cache pages are an in-memory copy of the data persisted in storage, they are readily evicted when the system is low on memory. The OS only ever evicts pages when an allocation hits the <a target="_blank" href="https://www.kernel.org/doc/gorman/html/understand/understand005.html"><em>low zone watermark</em></a> in memory; at that point, the allocation request waits until the OS has evicted pages up to the <strong><em>high zone watermark</em></strong> using the <a target="_blank" href="https://www.kernel.org/doc/gorman/html/understand/understand013.html">clock replacement algorithm</a>. The OS also predicts future file accesses and prefetches adjacent file data for spatial locality. This prediction algorithm is designed to be lightweight so it adds little latency to the read path. More on this in a bit.</p>
<p>Cassandra is built on <a target="_blank" href="https://blog.speedyio.com/building-a-db-from-first-principles">Log Structured Merge trees</a> (LSM). Incoming writes are first recorded in memory and flushed to disk as immutable SSTables. Over time, Cassandra performs periodic compactions to merge these SSTables; dozens of small, sorted files are read concurrently, their contents are scanned, merged, and re-written into fewer, larger files in lexicographical order, after which the old files are deleted. This process is inherently I/O-intensive. Compactions repeatedly stream large volumes of data from disk, and generate sustained read-write traffic. While necessary for maintaining read performance and space efficiency, these sequential scans interact poorly with the OS page cache, competing with latency-critical reads and increasing memory pressure.</p>
<p>Cassandra, being mature software, does some nice systems tricks to minimize unnecessary cache pollution during compactions. It reads a fixed chunk of a file (~10 MB) and then calls <a target="_blank" href="https://linux.die.net/man/2/fadvise">fadvise</a> with <code>FADV_DONTNEED</code> on that file range to evict those pages from memory before reading the next chunk. This limits the cache pollution for each moribund file to around 10 MB.</p>
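<p>As a minimal sketch of this pattern (assuming Linux and Python 3’s <code>os.posix_fadvise</code>; Cassandra’s real implementation is in Java, so this is only illustrative):</p>
<pre><code class="lang-python">import os

CHUNK = 10 * 1024 * 1024  # ~10 MB, mirroring Cassandra's chunk size

def scan_and_evict(path):
    """Read a file sequentially, evicting each chunk from the page
    cache before moving on, so a soon-to-be-deleted file never
    pollutes more than ~10 MB of cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            buf = os.pread(fd, CHUNK, offset)
            if not buf:
                break
            # ... merge/process buf here ...
            # Tell the kernel these pages won't be needed again.
            os.posix_fadvise(fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
            offset += len(buf)
    finally:
        os.close(fd)
</code></pre>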
<h3 id="heading-so-whats-the-problem">So what’s the problem?</h3>
<p>There are two significant user-space flows here:</p>
<ol>
<li><p><code>Get</code> operations on Cassandra that translate to reads on the filesystem.</p>
</li>
<li><p>Compactions that translate to sequential reads and writes on the filesystem.</p>
</li>
</ol>
<p>The OS infers sequentiality by checking whether pages adjacent to the requested page are already in the page cache. If any of those pages are present in memory, it classifies the access pattern as sequential and performs readahead up to <a target="_blank" href="https://www.kernel.org/doc/html/v5.3/block/queue-sysfs.html#read-ahead-kb-rw"><code>read_ahead_kb</code></a> beyond the requested page. This is a twofold problem:</p>
<ol>
<li><p>The presence of an adjacent page in memory doesn’t imply sequential access; they may have been read far apart in time.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770049945481/c14a0b75-f975-43df-b440-8e3d733a613b.gif" alt class="image--center mx-auto" /></p>
<p> In the above animation, the file page access sequence is 1, 4, 2, which is not sequential. But Linux checks sequentiality by checking for the existence of an adjacent page, so it deems the access to page 2 sequential and prefetches page 3.</p>
</li>
<li><p>The OS prefetches file data for accesses from <code>Get</code> operations as well. Unless the user is running range queries over the keys in Cassandra or compacting files, prefetched file pages only increase eviction overhead and waste storage IOPS and bandwidth.</p>
</li>
</ol>
<p>A quick conclusion one could jump to at this point is to turn off file prefetching entirely and force-fetch the files being compacted. But this kind of blanket policy will only increase latency aberrations in the system.</p>
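<p>For illustration, here is roughly what that blanket policy amounts to; a sketch assuming Linux, where <code>POSIX_FADV_RANDOM</code> disables readahead for a descriptor (per-device readahead can similarly be zeroed via <code>/sys/block/&lt;dev&gt;/queue/read_ahead_kb</code>):</p>
<pre><code class="lang-python">import os

def disable_prefetch(fd):
    # Mark the whole file as random-access: the kernel stops
    # readahead for this descriptor entirely.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

def force_sequential(fd):
    # The opposite blanket policy: ask for aggressive readahead.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
</code></pre>
<p>Applied indiscriminately, the first call starves genuinely sequential readers like compactions, while the second floods the cache for point reads; the decision has to be made per file and per access pattern.</p>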
<h3 id="heading-here-is-how-speedyio-tames-p99">Here is how SpeedyIO tames P99:</h3>
<ol>
<li><p>It identifies which files are okay to prefetch based on the kind of file (<code>*Data.db, *log.db</code>, etc.), the size of the file, and its access pattern.</p>
</li>
<li><p>It proactively evicts cold file pages from the cache, so free memory never depletes to the low zone watermark where memory allocations have to wait for evictions.</p>
</li>
</ol>
<p>Overall, this enables the OS to retain the hottest file data in memory, prefetch near-term data, and keep allocations off the slow path. If <a target="_blank" href="https://github.com/shaleengarg/SpeedyIO/">code speaks to you more than words, check it out here</a>.</p>
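<p>The actual policies live in the repository linked above; as a flavor of the idea only, here is a hypothetical sketch (the file-name allow-list, size threshold, and <code>cold_ranges</code> input are all invented for illustration, not SpeedyIO’s real logic):</p>
<pre><code class="lang-python">import fnmatch
import os

PREFETCH_OK = ("*Data.db",)              # hypothetical allow-list by file kind
MAX_PREFETCH_BYTES = 256 * 1024 * 1024   # hypothetical size threshold

def may_prefetch(path, size):
    """Only prefetch small-enough files of kinds that are read sequentially."""
    name = os.path.basename(path)
    return size &lt;= MAX_PREFETCH_BYTES and any(
        fnmatch.fnmatch(name, pat) for pat in PREFETCH_OK)

def evict_cold(fd, cold_ranges):
    """Proactively drop cold ranges so allocations never stall waiting
    for reclaim at the low zone watermark. `cold_ranges` would come
    from access tracking (not shown)."""
    for offset, length in cold_ranges:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
</code></pre>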
<p>SpeedyIO sits as a runtime library (LD_PRELOAD) in the Cassandra launcher. This is a single-line change in the launcher (<code>path/to/cassandra_installation/bin/cassandra</code>).</p>
<pre><code class="lang-bash">...

if [ "x$allow_root" != "xyes" ] ; then
    if [ "$(id -u)" = "0" ] || [ "$(id -g)" = "0" ] ; then
        echo "Running Cassandra as root user or group is not recommended - please start Cassandra using a different system user."
        echo "If you really want to force running Cassandra as root, use -R command line option."
        exit 1
    fi
fi

## This is all you need to add; exported so it reaches the JVM child process
export LD_PRELOAD="/path/to/lib_speedyio_release.so"

# Start up the service
launch_service "$pidfile" "$foreground" "$properties" "$classname"

...
</code></pre>
<h2 id="heading-experiments">Experiments</h2>
<p>Here we have a 32-node cluster running Cassandra 5 on CentOS 8 (Linux 4.18). <a target="_blank" href="https://github.com/brianfrankcooper/YCSB">YCSB</a> is used to generate a uniformly distributed 50-50 read-update load on the system (called <code>workload A uniform</code> in YCSB terminology). All experimental configurations can be found <a target="_blank" href="https://github.com/shaleengarg/SpeedyIO/tree/main/indepth_experiments">here</a>.</p>
<p>The figure below shows P99 read latency with increasing load on the system. It shows a ~48% reduction in P99 at high load (80–100 kops). Note that vanilla Cassandra was tuned to the best of our knowledge using publicly available recommendations and our own empirical analysis; the gains shown here are in addition to those optimizations. Also note that write latency data is omitted for brevity since it is not affected.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769984007907/8b10df84-43b9-4e61-9161-e39f0f982dac.png" alt class="image--center mx-auto" /></p>
<p>The following figure shows the same experiment conducted on Linux 6.18, exhibiting an approximately 23% reduction in P99 latency at high load levels (80–100 kops).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770047220541/6a07b140-61ac-46a2-8dab-cac796b94b7d.png" alt class="image--center mx-auto" /></p>
<p>The next plot shows the P99 read latency at a sustained throughput of 80 kops over a 24-hour period. Vanilla Cassandra (orange) exhibits significant variability in P99 latency, whereas Cassandra with SpeedyIO (blue) maintains a more stable latency profile.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769673434739/fe598921-5d7a-4cd7-9244-3f58c9a03464.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-so-whats-the-catch">So what’s the catch?</h2>
<p>SpeedyIO works best when a few underlying assumptions about the system and workload hold true. These assumptions are common in many real-world Cassandra deployments, but they are worth making explicit.</p>
<ul>
<li><p><strong>Storage latency vs. memory latency</strong><br />  SpeedyIO delivers the most benefit when storage access is significantly slower than memory access. In these environments, incorrect page-cache decisions are expensive and directly amplify tail latency. By improving cache behavior under these conditions, SpeedyIO can meaningfully reduce P99 latency. When storage latency is already very low, the relative impact of cache optimization naturally diminishes.</p>
</li>
<li><p><strong>Dataset size relative to memory</strong><br />  SpeedyIO is most effective when the active dataset does not fully fit in memory and the page cache is under pressure. In this regime, deciding which pages to keep or evict has a large impact on tail latency. When available memory comfortably exceeds the dataset’s working set, cache misses are rare and the opportunity for improvement is limited.</p>
</li>
</ul>
<hr />
<h2 id="heading-safety-guarantees">Safety guarantees</h2>
<p>SpeedyIO is conservative in what it is allowed to do:</p>
<ul>
<li><p><strong>Data correctness is never compromised</strong></p>
<ul>
<li><p>No write reordering</p>
</li>
<li><p>No partial reads</p>
</li>
<li><p>No interference with fsync or durability guarantees</p>
</li>
</ul>
</li>
<li><p><strong>I/O paths are never blocked</strong></p>
<ul>
<li>SpeedyIO does not introduce unbounded waits or stall read/write syscalls</li>
</ul>
</li>
<li><p><strong>Memory is never pinned</strong></p>
<ul>
<li>SpeedyIO does not prevent the kernel from reclaiming memory when under pressure</li>
</ul>
</li>
<li><p><strong>Runtime disable is supported</strong></p>
<ul>
<li>SpeedyIO can be turned off without restarting the Cassandra node</li>
</ul>
</li>
<li><p><strong>Incremental deployment is safe</strong></p>
<ul>
<li><p>Nodes can be enabled or disabled independently</p>
</li>
<li><p>There is no requirement for all nodes in a cluster to run SpeedyIO</p>
</li>
</ul>
</li>
<li><p><strong>Operator visibility</strong></p>
<ul>
<li>SpeedyIO emits descriptive logs intended to help SREs understand and debug behavior</li>
</ul>
</li>
<li><p><strong>Scope is limited to Cassandra</strong></p>
<ul>
<li><p>SpeedyIO only affects the Cassandra JVM process</p>
</li>
<li><p>Other applications running on the same system are not impacted</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-whats-next">What’s next</h2>
<p>SpeedyIO is designed to reduce tail latency by improving how the OS page cache interacts with Cassandra’s I/O patterns. Ongoing and future work focuses on broadening its applicability and validating its behavior across a wider range of deployment scenarios.</p>
<p>A near-term focus is extending testing and tuning across additional Cassandra compaction strategies, including <code>TimeWindowCompactionStrategy</code> (TWCS), <code>DateTieredCompactionStrategy</code> (DTCS), and <code>UnifiedCompactionStrategy</code> (UCS). Each of these strategies exhibits distinct I/O access patterns, and adapting SpeedyIO’s policies to them is necessary to ensure consistent tail-latency improvements.</p>
<p>Another area of development is making SpeedyIO more kernel-aware. The current policies are primarily designed and tuned for Linux 4.x. While they function correctly on Linux 6.x, they do not fully exploit newer kernel memory-management behavior. Introducing kernel-version–specific policies would allow SpeedyIO to better align with evolving eviction, reclaim, and prefetch heuristics, and deliver more consistent results across kernel versions.</p>
<p>Improving support for shared environments is also an active area of work. At present, SpeedyIO assumes Cassandra is the dominant storage I/O workload on a node. Lightweight background tasks and analytics are supported, but colocating multiple I/O-intensive workloads can dilute or negate the benefits. Future iterations aim to better isolate Cassandra’s critical I/O paths so that tail-latency gains are preserved even in more heavily shared deployments.</p>
<p>Finally, SpeedyIO is not yet optimized for systems where memory is spread across multiple NUMA nodes. Adding NUMA-aware policies is a necessary step towards robust support on modern multi-socket and high-core-count machines.</p>
<hr />
<p><a target="_blank" href="https://github.com/speedy-io/SpeedyIO">Check out the code here!</a> If you need help with your Cassandra clusters, <a target="_blank" href="mailto:shaleen@speedyio.com?subject=SpeedyIO%20Inquiry">contact us</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Constructing B+ Trees and LSMs - Building a database from first principles]]></title><description><![CDATA[In this post, I want to take you on a journey of designing a database from first principles.
Read this as if you were the engineer tasked with building a data system for the hardware of each era. At every stage, think about what bottleneck you’d hit ...]]></description><link>https://blog.speedyio.com/building-a-db-from-first-principles</link><guid isPermaLink="true">https://blog.speedyio.com/building-a-db-from-first-principles</guid><dc:creator><![CDATA[Shaleen Garg]]></dc:creator><pubDate>Wed, 05 Nov 2025 16:35:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762263087675/c1a580d4-2065-4850-8ee6-05f6f8498748.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I want to take you on a journey of <em>designing a database from first principles</em>.</p>
<p>Read this as if <strong>you</strong> were the engineer tasked with building a data system for the hardware of each era. At every stage, think about what bottleneck you’d hit next and how you might fix it.</p>
<p>We’ll start in the 1970s, where the first data structures emerged to serve single-threaded systems and spinning disks. Then we’ll move into the 1990s, when workloads became write-heavy, storage became layered, and multi-core CPUs became commonplace,  forcing a rethink of how databases ingest and serve data.</p>
<p>Along the way, every fix for one bottleneck will reveal the next, showing how these trade-offs directly map to hardware behaviour: cachelines, disk pages and IO patterns.</p>
<p>By the end, you’ll not only understand <em>why</em> B+trees and LSMs look the way they do, but <em>what principles guide their evolution today.</em></p>
<h2 id="heading-first-principles-circa-1970"><strong>First Principles (circa 1970)</strong></h2>
<p>If you were asked to build data software in 1970, you’d define a simple contract: store and retrieve values by key, with four core operations:</p>
<ol>
<li><p><strong>Insert</strong> new key-value pairs</p>
</li>
<li><p><strong>Update</strong> existing keys</p>
</li>
<li><p><strong>Lookup</strong> a key</p>
</li>
<li><p><strong>Delete</strong> a key</p>
</li>
</ol>
<p>A straightforward first choice is an <strong>array</strong> of KV tuples.</p>
<p>Insertions are cheap, just an append.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235312647/245a708b-0489-4dd0-8d8b-2a138965bd8b.gif" alt="Insertions in a simple array of KV tuples" class="image--center mx-auto" /></p>
<p>Lookups/updates/deletes involve linear scans O(N) where N is the number of keys.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235366688/7f139d8e-de66-4b64-91fc-87b83a9ae5e6.gif" alt="Lookups in a simple array of KV tuples" class="image--center mx-auto" /></p>
<p>It is perfectly functional albeit painfully slow as N grows.</p>
<p>Lookups could be made quicker by sorting the array by key, enabling binary search, which is O(log N).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235468499/124b611f-7b9c-49c0-af7e-ffb7253fe140.gif" alt="Lookups on sorted array has logarithmic time complexity" class="image--center mx-auto" /></p>
<p>But inserts/deletes now require shifting elements; O(N) write cost.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235496109/b2f5928e-d4f3-4dda-86fb-3d753d5ba76a.gif" alt="Inserts/deletes involve a log time lookup but linear time element shifting" class="image--center mx-auto" /></p>
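<p>The trade-off is easy to reproduce with Python’s built-in <code>bisect</code> module; a sketch using <code>(key, value)</code> tuples:</p>
<pre><code class="lang-python">import bisect

log = []                        # unsorted array: O(1) insert, O(N) lookup

def log_insert(key, value):
    log.append((key, value))    # cheap append

def log_lookup(key):
    for k, v in log:            # linear scan
        if k == key:
            return v

sorted_arr = []                 # sorted array: O(log N) lookup, O(N) insert

def sorted_insert(key, value):
    bisect.insort(sorted_arr, (key, value))   # binary search + element shift

def sorted_lookup(key):
    i = bisect.bisect_left(sorted_arr, (key,))
    if i &lt; len(sorted_arr) and sorted_arr[i][0] == key:
        return sorted_arr[i][1]
</code></pre>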
<p>To speed up insertions, you might switch to a linked list; that way insertions and deletions won’t involve shifting other elements. But <strong>this sacrifices random access</strong>; no binary search, so lookups fall back to O(N) scans.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236530641/56a51e17-5a03-42f3-83f5-487d398dfde2.gif" alt="with sorted linked lists insertions can be done in constant time after lookup" class="image--center mx-auto" /></p>
<p>This is still extremely inefficient.</p>
<h3 id="heading-adding-guide-posts"><strong>Adding Guide Posts</strong></h3>
<p>What if there were binary-search-like scaffolding above the linked list so that it could be navigated easily?</p>
<ul>
<li><p>Internal nodes: keys + pointers to child nodes guiding the search.</p>
</li>
<li><p>Leaf nodes: actual KV pairs</p>
</li>
</ul>
<p>Traversals for insert/lookup/deletions are now O(log N).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236631461/9c422e0b-1025-4776-8162-ad0ad5770bc1.gif" alt="All operations involve a tree traversal which has logarithmic time complexity." class="image--center mx-auto" /></p>
<h3 id="heading-asymptotics-arent-the-whole-story-hardware-realities"><strong>Asymptotics aren’t the Whole Story: Hardware Realities</strong></h3>
<p>While the above system looks reasonably optimal in theory, asymptotics don’t guarantee good wall-clock performance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236711350/d7cb2b67-8c24-4faf-bdf7-5109dc7f8802.png" alt="von Neumann architecture: Each volatile/non-volatile memory is divided into discrete blocks of data eg. cache line, page and storage sectors" class="image--center mx-auto" /></p>
<p>On real machines:</p>
<ol>
<li><p>Each node visit risks a cache line miss; if the node is not in main memory (RAM), a block read from storage.</p>
</li>
<li><p>Deep trees amplify these misses.</p>
</li>
</ol>
<p>So, you decide to widen the nodes to hold many keys and child pointers (fan-out <code>m &gt; 2</code>). This reduces the number of accesses to slower memory by reducing the depth of the tree. <code>m</code> is chosen such that each node in the tree fits in a cache line or disk page to maximize locality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236799097/d4d8bdf2-2f94-4c8c-b8c8-16bceda52eae.gif" alt="Traversals now incur lower misses - higher performance." class="image--center mx-auto" /></p>
<p>Now all tree traversals are O(logₘ N). Performance is good, customers are flocking, stocks are rising, investors are happy - for now.</p>
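<p>A quick back-of-the-envelope check shows why fan-out matters; a sketch assuming 8-byte keys and 8-byte child pointers packed into a 4 KB disk page:</p>
<pre><code class="lang-python">import math

PAGE = 4096           # bytes per disk page
ENTRY = 8 + 8         # 8-byte key + 8-byte child pointer
m = PAGE // ENTRY     # fan-out: 256 entries per node

for n in (10**6, 10**9):
    print(n, "keys:",
          math.ceil(math.log(n, 2)), "levels at m=2 vs",
          math.ceil(math.log(n, m)), "levels at m=256")
# 1000000 keys: 20 levels at m=2 vs 3 levels at m=256
# 1000000000 keys: 30 levels at m=2 vs 4 levels at m=256
</code></pre>
<p>Three or four slow-memory accesses per lookup instead of twenty or thirty: the asymptotics barely change, but the wall clock does.</p>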
<h3 id="heading-context-changes"><strong>Context Changes</strong></h3>
<p>Fast forward to the late 1990s. Data is rapidly becoming the core of every product and service. You are now a senior architect at a large corp.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236840696/e1e4e3dc-9dca-441c-b3cf-6fa749812573.gif" alt="good old days when you could hear it connect to the internet." class="image--center mx-auto" /></p>
<p>Hardware and workloads have evolved: write rates now dwarf reads, storage tiers are multiplying, and CPUs have gone multi-core, i.e. multi-threading is mainstream. Databases are now expected to sustain very high write throughputs while keeping lookups fast and scaling cleanly with concurrency. The old B+ tree starts to show its limitations.</p>
<h2 id="heading-struggle-with-b-trees"><strong>Struggle with B+ trees</strong></h2>
<ul>
<li><p><strong>Limited write concurrency:</strong> Writes (updates and inserts) incur restructuring and in-place updates in the tree.</p>
<ul>
<li><p>Restructuring the tree limits concurrency. Even fine-grained locks can’t prevent contention in the upper levels of the tree, making DB writes difficult to scale.</p>
</li>
<li><p>Each write becomes a cascade of random read-modify-writes across multiple nodes in the tree: the leaf nodes for the actual data manipulation, and the inner nodes during restructuring. So the quantum of work per write operation is extremely high.</p>
</li>
</ul>
</li>
<li><p><strong>Poor cache locality:</strong> Nodes can be spread all over the disk; accessing different parts of the tree causes frequent cache line and page misses. Lots of random IO, i.e. low throughput.</p>
</li>
<li><p><strong>Limitations w.r.t. modern hardware:</strong> B+ trees assume a single storage hierarchy (RAM -&gt; HDD). But multiple storage levels have started to emerge (SSDs, network storage, etc.), and the B+ tree architecture is not aware of them.</p>
</li>
</ul>
<p>These are all symptoms of one core design flaw - <strong>in-place updates</strong>. To modify a key, one must:</p>
<ol>
<li><p>Navigate the inner nodes (random IO)</p>
</li>
<li><p>Rewrite the leaf page (read-modify-write)</p>
</li>
<li><p>Restructure the tree (more random IO)</p>
</li>
</ol>
<p>What if in-place updates are done away with altogether?</p>
<h3 id="heading-append-dont-overwrite"><strong>Append, Don’t Overwrite</strong></h3>
<p>The fastest operation any disk can perform is a sequential append - writing to the tail of a log. Treat every mutation (insert, update, delete) as a log record appended to the end of a file.</p>
<p>Lookups scan the log from newest to oldest, returning the first match. Multi-threaded writers serialize at the append point, but the critical section is tiny, and therefore quick.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237011457/8886cfec-b935-4bab-b0ca-f01a8731b845.gif" alt="example of insert, update, delete and lookup on a simple log " class="image--center mx-auto" /></p>
<p>At first glance, this solves many of the problems described above. Writes are simple and fast, with no random read-modify-writes. Lookups are sequential reads, which are disk-friendly, but they incur a lot of I/O for a single key - read amplification.</p>
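<p>A sketch of this log-structured contract (kept in memory here for brevity; on disk each record would be appended to a file):</p>
<pre><code class="lang-python">TOMBSTONE = object()   # sentinel marking a delete
log = []               # the tail of this list is the tail of the "file"

def put(key, value):
    log.append((key, value))        # insert and update are the same op

def delete(key):
    log.append((key, TOMBSTONE))    # a delete is just another record

def get(key):
    for k, v in reversed(log):      # scan newest to oldest
        if k == key:
            return None if v is TOMBSTONE else v
    return None
</code></pre>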
<h3 id="heading-reducing-read-amplification"><strong>Reducing Read Amplification</strong></h3>
<p>Digging through several research papers, you find the idea of probabilistic, hash-based query filtering promising: a function that takes in a key and reports whether it probably exists in a bucket. Here, bounded false positives are acceptable; false negatives are not.</p>
<p>Break the single large log into many smaller log files, each with a corresponding filter. During a lookup, only read candidate log files where the key is “possibly present”. The filter’s false-positive bound can be tuned by changing its parameters and hash functions. This removes many wasted reads, but the number of candidate log files can still be large, and each candidate file still has to be scanned.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237088545/7a1bd2c3-9411-43a2-85a7-0ce6b212d730.gif" alt class="image--center mx-auto" /></p>
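<p>This structure is a Bloom filter. A minimal sketch, deriving the <code>k</code> probe positions from one SHA-256 digest via double hashing (real implementations size the bit array and <code>k</code> for a target false-positive rate):</p>
<pre><code class="lang-python">import hashlib

class BloomFilter:
    def __init__(self, m_bits=8192, k_hashes=5):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k probe positions from one digest via double hashing.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 &lt;&lt; (p % 8)

    def maybe_contains(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] &amp; (1 &lt;&lt; (p % 8))
                   for p in self._positions(key))
</code></pre>
<p>Each small log file carries one of these; <code>maybe_contains</code> returning False lets a lookup skip that file entirely.</p>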
<p>Then you wonder: what if you sort all the keys in the log in lexicographical order? That could work, but since a key might appear multiple times within the same log, you’d lose the notion of recency - lookups might return outdated values. Food for thought: unique, sorted, on-disk logs would make life easy.</p>
<p>Consider keeping an in-memory sorted structure (a memtable) for the active write set, just like a B+ tree. Duplicate writes to a key are updated in place in the memtable. On reaching a size threshold, flush it as an immutable sorted string table (SST file) to disk. Name SST files monotonically to encode recency. Lookups can now use binary search on candidate SST files.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237132180/8c434942-1679-4e08-8c62-52c7ff7aeaf4.gif" alt class="image--center mx-auto" /></p>
<p>With this setup, asymptotically, writes are amortized O(log k) where k is the number of keys in the memtable, and reads within an SST file are O(log m) where m is the number of keys per SST file.</p>
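<p>A sketch of the memtable-to-SSTable flow; SSTs are kept here as in-memory sorted lists, but an on-disk sorted file would be searched the same way:</p>
<pre><code class="lang-python">import bisect

memtable = {}           # active write set; duplicate keys update in place
sstables = []           # flushed, immutable sorted runs, newest last
FLUSH_THRESHOLD = 4     # tiny, for illustration

def put(key, value):
    memtable[key] = value
    if len(memtable) &gt;= FLUSH_THRESHOLD:
        sstables.append(sorted(memtable.items()))   # flush as a sorted run
        memtable.clear()

def get(key):
    if key in memtable:                  # freshest data first
        return memtable[key]
    for sst in reversed(sstables):       # then newest SST to oldest
        i = bisect.bisect_left(sst, (key,))
        if i &lt; len(sst) and sst[i][0] == key:
            return sst[i][1]
    return None
</code></pre>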
<h3 id="heading-reducing-space-amplification"><strong>Reducing Space Amplification</strong></h3>
<p>Over time, as more memtables are flushed, the DB ends up with many SST files, each containing overlapping key ranges and multiple versions of the same key. Here the space amplification grows unchecked, contributing to storage costs.</p>
<p>Most workloads exhibit temporal locality (recent keys are manipulated/accessed more often) and spatial locality (nearby keys tend to be queried together). Since newer SST files primarily hold updated versions of hot keys, older ones can be merged periodically to keep only the most recent data and free up space.</p>
<p>Periodically, the system selects a group of SST files whose key ranges overlap and merges them into a single larger file. During the merge:</p>
<ul>
<li><p>Only the most recent version of each key is preserved.</p>
</li>
<li><p>Tombstones (deletes) are applied, dropping obsolete entries.</p>
</li>
<li><p>Once the new file is persisted, old files are deleted.</p>
</li>
</ul>
<p>“Compactions” help reclaim storage, reduce the number of candidate SST files to search during lookups, and improve locality on disk by keeping on-disk data roughly sorted by key.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237197975/26d7e730-afbb-4b8e-91e8-882166f8fe9b.gif" alt class="image--center mx-auto" /></p>
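<p>A sketch of one such merge (runs ordered oldest to newest; the <code>None</code>-as-tombstone convention is just for illustration):</p>
<pre><code class="lang-python">TOMBSTONE = None   # sketch convention: None marks a deleted key

def compact(runs):
    """Merge overlapping sorted runs into one, keeping only the
    newest version of each key and dropping tombstoned entries."""
    merged = {}
    for run in runs:            # oldest first...
        merged.update(run)      # ...so newer versions overwrite older ones
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)

old = [("a", 1), ("b", 2), ("d", 4)]
new = [("b", 20), ("d", TOMBSTONE)]
print(compact([old, new]))      # [('a', 1), ('b', 20)]
</code></pre>
<p>Real engines stream this as a k-way merge (e.g. with <code>heapq.merge</code>) instead of materializing a dict, but the newest-version-wins rule is the same.</p>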
<h3 id="heading-compaction-tuning-and-hardware-awareness"><strong>Compaction Tuning and Hardware Awareness</strong></h3>
<p>As data volume grows, compactions evolve from lightweight maintenance tasks into heavy background I/O operations that can compete with live reads and writes. Preserving predictable latency and throughput is paramount.</p>
<p>Most workloads naturally exhibit <strong>temporal locality</strong>: recent keys are updated and queried far more frequently than older ones. Consequently, newly created SST files tend to share key ranges with other recent ones.</p>
<p>Over time, this process produces a natural hierarchy of “levels”, small and hot at the top, large and cold at the bottom. Lower levels require less frequent compaction, keeping background I/O predictable and bounded.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237248963/beb4fd82-3c49-4c2c-aced-59dab91003e2.png" alt class="image--center mx-auto" /></p>
<p>This leveled organization keeps compactions local, caps space growth, and stabilizes performance. It mirrors the same principle that underlies modern hardware hierarchies: hot data close and fast, cold data dense and cheap. Even on a single storage device, leveling delivers these benefits - controlled space usage, bounded read latency, and sustained write throughput.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The exercise shows that database design is not about picking a fancy algorithm, but about <em>evolving a structure that works with real hardware and software constraints</em>. These hardware constraints may not always show themselves as functional problems but as performance penalties, for which the user pays continuously.</p>
<p>Here at <a target="_blank" href="http://www.speedyio.com"><strong>SpeedyIO</strong></a><strong>,</strong> we are bridging the gaps between hardware and software design, one pain-point at a time. We make sure that the components of the system cooperate with each other rather than fight one another.</p>
<p>“We’re building the future of a harmonious database system. If you are as obsessed with low-latency systems and real-world performance as we are - let’s talk.”</p>
]]></content:encoded></item></channel></rss>