<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[SpeedyIO Blog]]></title><description><![CDATA[SpeedyIO Blog]]></description><link>https://blog.speedyio.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1762108237749/ff59c2bc-652f-4b00-bae0-9cda5f459f5b.png</url><title>SpeedyIO Blog</title><link>https://blog.speedyio.com</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 08 May 2026 08:49:00 GMT</lastBuildDate><atom:link href="https://blog.speedyio.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Do you care about P99 latency?]]></title><description><![CDATA[TL;DR: We have found that Cassandra's P99 spikes stem from OS file cache thrashing during compactions: prefetching pollutes the cache, and evictions block memory allocations. We show a 23-48% reduction in P99 latency without changing the Cassandra installation, ...]]></description><link>https://blog.speedyio.com/do-you-care-about-p99-latency</link><guid isPermaLink="true">https://blog.speedyio.com/do-you-care-about-p99-latency</guid><dc:creator><![CDATA[Shaleen Garg]]></dc:creator><pubDate>Tue, 03 Feb 2026 08:36:38 GMT</pubDate><content:encoded><![CDATA[<p>TL;DR: We have found that Cassandra's P99 spikes stem from OS file cache thrashing during compactions: prefetching pollutes the cache, and evictions block memory allocations. We show a 23-48% reduction in P99 latency without changing the Cassandra installation, by selectively prefetching and proactively evicting file cache pages.</p>
<p>For the uninitiated, P99 (99th percentile) latency marks the cutoff beyond which the slowest 1% of requests fall. While it sounds like a small edge case, lowering it is critical because in distributed services, a single user request might trigger dozens of internal DB calls. If any one of those calls hits a P99 latency spike, the entire user request becomes slow. As the system scales, a user hitting a slow path becomes all but inevitable.</p>
<p>Designing a system which guarantees a low P99 latency is extremely difficult; moreover, improving the tail of an existing system borders on impossible. This is because tail latency arises out of rare events (e.g. garbage collection, network retries, and compactions), and it is non-trivial to remove their effects without fundamentally changing them. P99 latency is one of those things that, once you truly understand, makes you go <a target="_blank" href="https://www.youtube.com/watch?v=lJ8ydIuPFeU">“Oh Sh*t!”</a>.</p>
<h2 id="heading-so-how-do-we-reduce-p99"><strong>So how do we reduce P99?</strong></h2>
<p>Since there are many sources of tail latency, there isn’t a silver bullet that reduces it for a given system. Here we focus on one specific contributor: the interaction between storage and memory. This source of tail latency is difficult to address from within Cassandra alone, as it emerges from the behavior of the OS rather than just the database. Our approach is to improve the <a target="_blank" href="https://www.nature.com/articles/s41598-020-75050-4">cooperation</a> between userspace and the OS. Let me explain:</p>
<h3 id="heading-background">Background:</h3>
<p>The OS maintains an in-memory file cache (aka page cache) of all recently accessed file pages for temporal locality. Typically, the size of the file cache is bounded by <a target="_blank" href="https://docs.kernel.org/admin-guide/mm/concepts.html#anonymous-memory">anonymous allocations</a> (e.g. using malloc) and the total available memory. Since file cache pages are an in-memory copy of the data persisted in storage, they are readily evicted when the system is low on memory. The OS only ever evicts pages when an allocation hits the <a target="_blank" href="https://www.kernel.org/doc/gorman/html/understand/understand005.html"><em>low zone watermark</em></a> in memory; at that point, the allocation request waits until the OS has evicted pages up to the <strong><em>high zone watermark</em></strong> using the <a target="_blank" href="https://www.kernel.org/doc/gorman/html/understand/understand013.html">clock replacement algorithm</a>. The OS also predicts future file accesses and prefetches adjacent file data for spatial locality. This prediction algorithm is designed to be lightweight so it adds little latency to the read path. More on this in a bit.</p>
<p>Cassandra is built on <a target="_blank" href="https://blog.speedyio.com/building-a-db-from-first-principles">Log Structured Merge trees</a> (LSM). Incoming writes are first recorded in memory and flushed to disk as immutable SSTables. Over time, Cassandra performs periodic compactions to merge these SSTables; dozens of small, sorted files are read concurrently, their contents are scanned, merged, and re-written into fewer, larger files in lexicographical order, after which the old files are deleted. This process is inherently I/O-intensive. Compactions repeatedly stream large volumes of data from disk, and generate sustained read-write traffic. While necessary for maintaining read performance and space efficiency, these sequential scans interact poorly with the OS page cache, competing with latency-critical reads and increasing memory pressure.</p>
<p>Cassandra, being mature software, does some nice systems tricks to minimize unnecessary cache pollution during compactions. It reads a fixed chunk of a file (~10 MB) and then calls <a target="_blank" href="https://linux.die.net/man/2/fadvise">fadvise</a> with <code>FADV_DONTNEED</code> on that file range to evict those pages from memory before reading the next chunk. This limits the cache pollution for each moribund file to around 10 MB.</p>
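<p>As a minimal sketch of this pattern (assuming Linux and Python 3’s <code>os.posix_fadvise</code>; Cassandra’s real implementation is in Java, so this is only illustrative):</p>
<pre><code class="lang-python">import os

CHUNK = 10 * 1024 * 1024  # ~10 MB, mirroring Cassandra's chunk size

def scan_and_evict(path):
    """Read a file sequentially, evicting each chunk from the page
    cache before moving on, so a soon-to-be-deleted file never
    pollutes more than ~10 MB of cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            buf = os.pread(fd, CHUNK, offset)
            if not buf:
                break
            # ... merge/process buf here ...
            # Tell the kernel these pages won't be needed again.
            os.posix_fadvise(fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
            offset += len(buf)
    finally:
        os.close(fd)
</code></pre>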
<h3 id="heading-so-whats-the-problem">So what’s the problem?</h3>
<p>There are two significant user-space flows here:</p>
<ol>
<li><p><code>Get</code> operations on Cassandra that translate to reads on the filesystem.</p>
</li>
<li><p>Compactions that translate to sequential reads and writes on the filesystem.</p>
</li>
</ol>
<p>The OS infers sequentiality by checking whether pages adjacent to the requested page are already in the page cache. If any of those pages are present in memory, it classifies the access pattern as sequential and performs readahead up to <a target="_blank" href="https://www.kernel.org/doc/html/v5.3/block/queue-sysfs.html#read-ahead-kb-rw"><code>read_ahead_kb</code></a> beyond the requested page. This is a twofold problem:</p>
<ol>
<li><p>The presence of an adjacent page in memory doesn’t imply sequential access; they may have been read far apart in time.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770049945481/c14a0b75-f975-43df-b440-8e3d733a613b.gif" alt class="image--center mx-auto" /></p>
<p> In the above animation, the file page access sequence is 1, 4, 2, which is not sequential. But Linux checks sequentiality by checking for the existence of an adjacent page, so it deems the access to page 2 sequential and prefetches page 3.</p>
</li>
<li><p>The OS prefetches file data for accesses from <code>Get</code> operations as well. Unless the user is running range queries over the keys in Cassandra or compacting files, prefetched file pages only increase eviction overhead and waste storage IOPS and bandwidth.</p>
</li>
</ol>
<p>A quick conclusion one could jump to at this point is to turn off file prefetching entirely and force-fetch the files being compacted. But this kind of blanket policy will only increase latency aberrations in the system.</p>
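<p>For illustration, here is roughly what that blanket policy amounts to; a sketch assuming Linux, where <code>POSIX_FADV_RANDOM</code> disables readahead for a descriptor (per-device readahead can similarly be zeroed via <code>/sys/block/&lt;dev&gt;/queue/read_ahead_kb</code>):</p>
<pre><code class="lang-python">import os

def disable_prefetch(fd):
    # Mark the whole file as random-access: the kernel stops
    # readahead for this descriptor entirely.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

def force_sequential(fd):
    # The opposite blanket policy: ask for aggressive readahead.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
</code></pre>
<p>Applied indiscriminately, the first call starves genuinely sequential readers like compactions, while the second floods the cache for point reads; the decision has to be made per file and per access pattern.</p>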
<h3 id="heading-here-is-how-speedyio-tames-p99">Here is how SpeedyIO tames P99:</h3>
<ol>
<li><p>It identifies which files are okay to prefetch based on the kind of file (<code>*Data.db, *log.db</code>, etc.), the size of the file, and its access pattern.</p>
</li>
<li><p>It proactively evicts cold file pages from the cache, so free memory never depletes to the low zone watermark where memory allocations have to wait for evictions.</p>
</li>
</ol>
<p>Overall, this enables the OS to retain the hottest file data in memory, prefetch near-term data, and keep allocations off the slow path. If <a target="_blank" href="https://github.com/shaleengarg/SpeedyIO/">code speaks to you more than words, check it out here</a>.</p>
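<p>The actual policies live in the repository linked above; as a flavor of the idea only, here is a hypothetical sketch (the file-name allow-list, size threshold, and <code>cold_ranges</code> input are all invented for illustration, not SpeedyIO’s real logic):</p>
<pre><code class="lang-python">import fnmatch
import os

PREFETCH_OK = ("*Data.db",)              # hypothetical allow-list by file kind
MAX_PREFETCH_BYTES = 256 * 1024 * 1024   # hypothetical size threshold

def may_prefetch(path, size):
    """Only prefetch small-enough files of kinds that are read sequentially."""
    name = os.path.basename(path)
    return size &lt;= MAX_PREFETCH_BYTES and any(
        fnmatch.fnmatch(name, pat) for pat in PREFETCH_OK)

def evict_cold(fd, cold_ranges):
    """Proactively drop cold ranges so allocations never stall waiting
    for reclaim at the low zone watermark. `cold_ranges` would come
    from access tracking (not shown)."""
    for offset, length in cold_ranges:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
</code></pre>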
<p>SpeedyIO sits as a runtime library (LD_PRELOAD) in the Cassandra launcher. This is a single-line change in the launcher (<code>path/to/cassandra_installation/bin/cassandra</code>).</p>
<pre><code class="lang-bash">...

if [ "x$allow_root" != "xyes" ] ; then
    if [ "$(id -u)" = "0" ] || [ "$(id -g)" = "0" ] ; then
        echo "Running Cassandra as root user or group is not recommended - please start Cassandra using a different system user."
        echo "If you really want to force running Cassandra as root, use -R command line option."
        exit 1
    fi
fi

## This is all you need to add; exported so it reaches the JVM child process
export LD_PRELOAD="/path/to/lib_speedyio_release.so"

# Start up the service
launch_service "$pidfile" "$foreground" "$properties" "$classname"

...
</code></pre>
<h2 id="heading-experiments">Experiments</h2>
<p>Here we have a 32-node cluster running Cassandra 5 on CentOS 8 (Linux 4.18). <a target="_blank" href="https://github.com/brianfrankcooper/YCSB">YCSB</a> is used to generate a uniformly distributed 50-50 read-update load on the system (called <code>workload A uniform</code> in YCSB terminology). All experimental configurations can be found <a target="_blank" href="https://github.com/shaleengarg/SpeedyIO/tree/main/indepth_experiments">here</a>.</p>
<p>The figure below shows P99 read latency with increasing load on the system. It shows a ~48% reduction in P99 at high load (80–100 kops). Note that vanilla Cassandra was tuned to the best of our knowledge using publicly available recommendations and our own empirical analysis; the gains shown here are in addition to those optimizations. Also note that write latency data is omitted for brevity since it is not affected.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769984007907/8b10df84-43b9-4e61-9161-e39f0f982dac.png" alt class="image--center mx-auto" /></p>
<p>The following figure shows the same experiment conducted on Linux 6.18, exhibiting an approximately 23% reduction in P99 latency at high load levels (80–100 kops).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770047220541/6a07b140-61ac-46a2-8dab-cac796b94b7d.png" alt class="image--center mx-auto" /></p>
<p>The next plot shows the P99 read latency at a sustained throughput of 80 kops over a 24-hour period. Vanilla Cassandra (orange) exhibits significant variability in P99 latency, whereas Cassandra with SpeedyIO (blue) maintains a more stable latency profile.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769673434739/fe598921-5d7a-4cd7-9244-3f58c9a03464.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-so-whats-the-catch">So what’s the catch?</h2>
<p>SpeedyIO works best when a few underlying assumptions about the system and workload hold true. These assumptions are common in many real-world Cassandra deployments, but they are worth making explicit.</p>
<ul>
<li><p><strong>Storage latency vs. memory latency</strong><br />  SpeedyIO delivers the most benefit when storage access is significantly slower than memory access. In these environments, incorrect page-cache decisions are expensive and directly amplify tail latency. By improving cache behavior under these conditions, SpeedyIO can meaningfully reduce P99 latency. When storage latency is already very low, the relative impact of cache optimization naturally diminishes.</p>
</li>
<li><p><strong>Dataset size relative to memory</strong><br />  SpeedyIO is most effective when the active dataset does not fully fit in memory and the page cache is under pressure. In this regime, deciding which pages to keep or evict has a large impact on tail latency. When available memory comfortably exceeds the dataset’s working set, cache misses are rare and the opportunity for improvement is limited.</p>
</li>
</ul>
<hr />
<h2 id="heading-safety-guarantees">Safety guarantees</h2>
<p>SpeedyIO is conservative in what it is allowed to do:</p>
<ul>
<li><p><strong>Data correctness is never compromised</strong></p>
<ul>
<li><p>No write reordering</p>
</li>
<li><p>No partial reads</p>
</li>
<li><p>No interference with fsync or durability guarantees</p>
</li>
</ul>
</li>
<li><p><strong>I/O paths are never blocked</strong></p>
<ul>
<li>SpeedyIO does not introduce unbounded waits or stall read/write syscalls</li>
</ul>
</li>
<li><p><strong>Memory is never pinned</strong></p>
<ul>
<li>SpeedyIO does not prevent the kernel from reclaiming memory when under pressure</li>
</ul>
</li>
<li><p><strong>Runtime disable is supported</strong></p>
<ul>
<li>SpeedyIO can be turned off without restarting the Cassandra node</li>
</ul>
</li>
<li><p><strong>Incremental deployment is safe</strong></p>
<ul>
<li><p>Nodes can be enabled or disabled independently</p>
</li>
<li><p>There is no requirement for all nodes in a cluster to run SpeedyIO</p>
</li>
</ul>
</li>
<li><p><strong>Operator visibility</strong></p>
<ul>
<li>SpeedyIO emits descriptive logs intended to help SREs understand and debug behavior</li>
</ul>
</li>
<li><p><strong>Scope is limited to Cassandra</strong></p>
<ul>
<li><p>SpeedyIO only affects the Cassandra JVM process</p>
</li>
<li><p>Other applications running on the same system are not impacted</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-whats-next">What’s next</h2>
<p>SpeedyIO is designed to reduce tail latency by improving how the OS page cache interacts with Cassandra’s I/O patterns. Ongoing and future work focuses on broadening its applicability and validating its behavior across a wider range of deployment scenarios.</p>
<p>A near-term focus is extending testing and tuning across additional Cassandra compaction strategies, including <code>TimeWindowCompactionStrategy</code> (TWCS), <code>DateTieredCompactionStrategy</code> (DTCS), and <code>UnifiedCompactionStrategy</code> (UCS). Each of these strategies exhibits distinct I/O access patterns, and adapting SpeedyIO’s policies to them is necessary to ensure consistent tail-latency improvements.</p>
<p>Another area of development is making SpeedyIO more kernel-aware. The current policies are primarily designed and tuned for Linux 4.x. While they function correctly on Linux 6.x, they do not fully exploit newer kernel memory-management behavior. Introducing kernel-version–specific policies would allow SpeedyIO to better align with evolving eviction, reclaim, and prefetch heuristics, and deliver more consistent results across kernel versions.</p>
<p>Improving support for shared environments is also an active area of work. At present, SpeedyIO assumes Cassandra is the dominant storage I/O workload on a node. Lightweight background tasks and analytics are supported, but colocating multiple I/O-intensive workloads can dilute or negate the benefits. Future iterations aim to better isolate Cassandra’s critical I/O paths so that tail-latency gains are preserved even in more heavily shared deployments.</p>
<p>Finally, SpeedyIO is not yet optimized for systems where memory is spread across multiple NUMA nodes. Adding NUMA-aware policies is a necessary step towards robust support on modern multi-socket and high-core-count machines.</p>
<hr />
<p><a target="_blank" href="https://github.com/speedy-io/SpeedyIO">Check out the code here!</a> If you need help with your Cassandra clusters, <a target="_blank" href="mailto:shaleen@speedyio.com?subject=SpeedyIO%20Inquiry">contact us</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Constructing B+ Trees and LSMs - Building a database from first principles]]></title><description><![CDATA[In this post, I want to take you on a journey of designing a database from first principles.
Read this as if you were the engineer tasked with building a data system for the hardware of each era. At every stage, think about what bottleneck you’d hit ...]]></description><link>https://blog.speedyio.com/building-a-db-from-first-principles</link><guid isPermaLink="true">https://blog.speedyio.com/building-a-db-from-first-principles</guid><dc:creator><![CDATA[Shaleen Garg]]></dc:creator><pubDate>Wed, 05 Nov 2025 16:35:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762263087675/c1a580d4-2065-4850-8ee6-05f6f8498748.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I want to take you on a journey of <em>designing a database from first principles</em>.</p>
<p>Read this as if <strong>you</strong> were the engineer tasked with building a data system for the hardware of each era. At every stage, think about what bottleneck you’d hit next and how you might fix it.</p>
<p>We’ll start in the 1970s, where the first data structures emerged to serve single-threaded systems and spinning disks. Then we’ll move into the 1990s, when workloads became write-heavy, storage became layered, and multi-core CPUs became commonplace,  forcing a rethink of how databases ingest and serve data.</p>
<p>Along the way, every fix for one bottleneck will reveal the next, showing how these trade-offs directly map to hardware behaviour: cachelines, disk pages and IO patterns.</p>
<p>By the end, you’ll not only understand <em>why</em> B+trees and LSMs look the way they do, but <em>what principles guide their evolution today.</em></p>
<h2 id="heading-first-principles-circa-1970"><strong>First Principles (circa 1970)</strong></h2>
<p>If you were asked to build data software in 1970, you’d define a simple contract: store and retrieve values by key, with four core operations:</p>
<ol>
<li><p><strong>Insert</strong> new key-value pairs</p>
</li>
<li><p><strong>Update</strong> existing keys</p>
</li>
<li><p><strong>Lookup</strong> a key</p>
</li>
<li><p><strong>Delete</strong> a key</p>
</li>
</ol>
<p>A straightforward first choice is an <strong>array</strong> of KV tuples.</p>
<p>Insertions are cheap, just an append.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235312647/245a708b-0489-4dd0-8d8b-2a138965bd8b.gif" alt="Insertions in a simple array of KV tuples" class="image--center mx-auto" /></p>
<p>Lookups/updates/deletes involve linear scans O(N) where N is the number of keys.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235366688/7f139d8e-de66-4b64-91fc-87b83a9ae5e6.gif" alt="Lookups in a simple array of KV tuples" class="image--center mx-auto" /></p>
<p>It is perfectly functional albeit painfully slow as N grows.</p>
<p>Lookups could be made quicker by sorting the array by key, enabling binary search, which is O(log N).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235468499/124b611f-7b9c-49c0-af7e-ffb7253fe140.gif" alt="Lookups on sorted array has logarithmic time complexity" class="image--center mx-auto" /></p>
<p>But inserts/deletes now require shifting elements; O(N) write cost.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235496109/b2f5928e-d4f3-4dda-86fb-3d753d5ba76a.gif" alt="Inserts/deletes involve a log time lookup but linear time element shifting" class="image--center mx-auto" /></p>
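<p>The trade-off is easy to reproduce with Python’s built-in <code>bisect</code> module; a sketch using <code>(key, value)</code> tuples:</p>
<pre><code class="lang-python">import bisect

log = []                        # unsorted array: O(1) insert, O(N) lookup

def log_insert(key, value):
    log.append((key, value))    # cheap append

def log_lookup(key):
    for k, v in log:            # linear scan
        if k == key:
            return v

sorted_arr = []                 # sorted array: O(log N) lookup, O(N) insert

def sorted_insert(key, value):
    bisect.insort(sorted_arr, (key, value))   # binary search + element shift

def sorted_lookup(key):
    i = bisect.bisect_left(sorted_arr, (key,))
    if i &lt; len(sorted_arr) and sorted_arr[i][0] == key:
        return sorted_arr[i][1]
</code></pre>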
<p>To speed up insertions, you might switch to a linked list; that way insertions and deletions won’t involve shifting other elements. But <strong>this sacrifices random access</strong>; no binary search, so lookups fall back to O(N) scans.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236530641/56a51e17-5a03-42f3-83f5-487d398dfde2.gif" alt="with sorted linked lists insertions can be done in constant time after lookup" class="image--center mx-auto" /></p>
<p>This is still extremely inefficient.</p>
<h3 id="heading-adding-guide-posts"><strong>Adding Guide Posts</strong></h3>
<p>What if there were binary-search-like scaffolding above the linked list so that it could be navigated easily?</p>
<ul>
<li><p>Internal nodes: keys + pointers to child nodes guiding the search.</p>
</li>
<li><p>Leaf nodes: actual KV pairs</p>
</li>
</ul>
<p>Traversals for insert/lookup/deletions are now O(log N).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236631461/9c422e0b-1025-4776-8162-ad0ad5770bc1.gif" alt="All operations involve a tree traversal which has logarithmic time complexity." class="image--center mx-auto" /></p>
<h3 id="heading-asymptotics-arent-the-whole-story-hardware-realities"><strong>Asymptotics aren’t the Whole Story: Hardware Realities</strong></h3>
<p>While the above system looks reasonably optimal in theory, asymptotics don’t guarantee good wall-clock performance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236711350/d7cb2b67-8c24-4faf-bdf7-5109dc7f8802.png" alt="von Neumann architecture: Each volatile/non-volatile memory is divided into discrete blocks of data eg. cache line, page and storage sectors" class="image--center mx-auto" /></p>
<p>On real machines:</p>
<ol>
<li><p>Each node visit risks a cache line miss; if the node is not in main memory (RAM), a block read from storage.</p>
</li>
<li><p>Deep trees amplify these misses.</p>
</li>
</ol>
<p>So, you decide to widen the nodes to hold many keys and child pointers (fan-out <code>m &gt; 2</code>). This reduces the number of accesses to slower memory by reducing the depth of the tree. <code>m</code> is chosen such that each node in the tree fits in a cache line or disk page to maximize locality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236799097/d4d8bdf2-2f94-4c8c-b8c8-16bceda52eae.gif" alt="Traversals now incur lower misses - higher performance." class="image--center mx-auto" /></p>
<p>Now all tree traversals are O(logₘ N). Performance is good, customers are flocking, stocks are rising, investors are happy - for now.</p>
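<p>A quick back-of-the-envelope check shows why fan-out matters; a sketch assuming 8-byte keys and 8-byte child pointers packed into a 4 KB disk page:</p>
<pre><code class="lang-python">import math

PAGE = 4096           # bytes per disk page
ENTRY = 8 + 8         # 8-byte key + 8-byte child pointer
m = PAGE // ENTRY     # fan-out: 256 entries per node

for n in (10**6, 10**9):
    print(n, "keys:",
          math.ceil(math.log(n, 2)), "levels at m=2 vs",
          math.ceil(math.log(n, m)), "levels at m=256")
# 1000000 keys: 20 levels at m=2 vs 3 levels at m=256
# 1000000000 keys: 30 levels at m=2 vs 4 levels at m=256
</code></pre>
<p>Three or four slow-memory accesses per lookup instead of twenty or thirty: the asymptotics barely change, but the wall clock does.</p>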
<h3 id="heading-context-changes"><strong>Context Changes</strong></h3>
<p>Fast forward to the late 1990s. Data is rapidly becoming the core of every product and service. You are now a senior architect at a large corp.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762236840696/e1e4e3dc-9dca-441c-b3cf-6fa749812573.gif" alt="good old days when you could hear it connect to the internet." class="image--center mx-auto" /></p>
<p>Hardware and workloads have evolved: write rates now dwarf reads, storage tiers are multiplying, and CPUs have gone multi-core, i.e. multi-threading is mainstream. Databases are now expected to sustain very high write throughputs while keeping lookups fast and scaling cleanly with concurrency. The old B+ tree starts to show its limitations.</p>
<h2 id="heading-struggle-with-b-trees"><strong>Struggle with B+ trees</strong></h2>
<ul>
<li><p><strong>Limited write concurrency:</strong> Writes (updates and inserts) incur restructuring and in-place updates in the tree.</p>
<ul>
<li><p>Restructuring the tree limits concurrency. Even fine-grained locks can’t prevent contention in the upper levels of the tree, making DB writes difficult to scale.</p>
</li>
<li><p>Each write becomes a cascade of random read-modify-writes across multiple nodes in the tree: the leaf nodes for the actual data manipulation, and the inner nodes during restructuring. So the quantum of work per write operation is extremely high.</p>
</li>
</ul>
</li>
<li><p><strong>Poor cache locality:</strong> Nodes can be spread all over the disk; accessing different parts of the tree causes frequent cache line and page misses. Lots of random IO, i.e. low throughput.</p>
</li>
<li><p><strong>Limitations w.r.t. modern hardware:</strong> B+ trees assume a single storage hierarchy (RAM -&gt; HDD). But multiple storage levels have started to emerge (SSDs, network storage, etc.), and the B+ tree architecture is not aware of them.</p>
</li>
</ul>
<p>These are all symptoms of one core design flaw - <strong>in-place updates</strong>. To modify a key, one must:</p>
<ol>
<li><p>Navigate the inner nodes (random IO)</p>
</li>
<li><p>Rewrite the leaf page (read-modify-write)</p>
</li>
<li><p>Restructure the tree (more random IO)</p>
</li>
</ol>
<p>What if in-place updates are done away with altogether?</p>
<h3 id="heading-append-dont-overwrite"><strong>Append, Don’t Overwrite</strong></h3>
<p>The fastest operation any disk can perform is a sequential append - writing to the tail of a log. Treat every mutation (insert, update, delete) as a log record appended to the end of a file.</p>
<p>Lookups scan the log from newest to oldest, returning the first match. Multi-threaded writers serialize at the append point, but the critical section is tiny, and therefore quick.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237011457/8886cfec-b935-4bab-b0ca-f01a8731b845.gif" alt="example of insert, update, delete and lookup on a simple log " class="image--center mx-auto" /></p>
<p>At first glance, this solves many of the problems described above. Writes are simple and fast, with no random read-modify-writes. Lookups are sequential reads, which are disk-friendly, but they incur a lot of I/O for a single key - read amplification.</p>
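<p>A sketch of this log-structured contract (kept in memory here for brevity; on disk each record would be appended to a file):</p>
<pre><code class="lang-python">TOMBSTONE = object()   # sentinel marking a delete
log = []               # the tail of this list is the tail of the "file"

def put(key, value):
    log.append((key, value))        # insert and update are the same op

def delete(key):
    log.append((key, TOMBSTONE))    # a delete is just another record

def get(key):
    for k, v in reversed(log):      # scan newest to oldest
        if k == key:
            return None if v is TOMBSTONE else v
    return None
</code></pre>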
<h3 id="heading-reducing-read-amplification"><strong>Reducing Read Amplification</strong></h3>
<p>Digging through several research papers, you find the idea of probabilistic, hash-based query filtering promising: a function that takes in a key and reports whether it probably exists in a bucket. Here, bounded false positives are acceptable; false negatives are not.</p>
<p>Break the single large log into many smaller log files, each with a corresponding filter. During a lookup, only read candidate log files where the key is “possibly present”. The filter’s false-positive bound can be tuned by changing its parameters and hash functions. This removes many wasted reads, but the number of candidate log files can still be large, and each candidate file still has to be scanned.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237088545/7a1bd2c3-9411-43a2-85a7-0ce6b212d730.gif" alt class="image--center mx-auto" /></p>
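<p>This structure is a Bloom filter. A minimal sketch, deriving the <code>k</code> probe positions from one SHA-256 digest via double hashing (real implementations size the bit array and <code>k</code> for a target false-positive rate):</p>
<pre><code class="lang-python">import hashlib

class BloomFilter:
    def __init__(self, m_bits=8192, k_hashes=5):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k probe positions from one digest via double hashing.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 &lt;&lt; (p % 8)

    def maybe_contains(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] &amp; (1 &lt;&lt; (p % 8))
                   for p in self._positions(key))
</code></pre>
<p>Each small log file carries one of these; <code>maybe_contains</code> returning False lets a lookup skip that file entirely.</p>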
<p>Then you wonder: what if you sort all the keys in the log in lexicographical order? That could work, but since a key might appear multiple times within the same log, you’d lose the notion of recency - lookups might return outdated values. Food for thought: unique, sorted, on-disk logs would make life easy.</p>
<p>Consider keeping an in-memory sorted structure (a memtable) for the active write set, just like a B+ tree. Duplicate writes to a key are updated in place in the memtable. On reaching a size threshold, flush it as an immutable sorted string table (SST file) to disk. Name SST files monotonically to encode recency. Lookups can now use binary search on candidate SST files.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237132180/8c434942-1679-4e08-8c62-52c7ff7aeaf4.gif" alt class="image--center mx-auto" /></p>
<p>With this setup, asymptotically, writes are amortized O(log k) where k is the number of keys in the memtable, and reads within an SST file are O(log m) where m is the number of keys per SST file.</p>
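<p>A sketch of the memtable-to-SSTable flow; SSTs are kept here as in-memory sorted lists, but an on-disk sorted file would be searched the same way:</p>
<pre><code class="lang-python">import bisect

memtable = {}           # active write set; duplicate keys update in place
sstables = []           # flushed, immutable sorted runs, newest last
FLUSH_THRESHOLD = 4     # tiny, for illustration

def put(key, value):
    memtable[key] = value
    if len(memtable) &gt;= FLUSH_THRESHOLD:
        sstables.append(sorted(memtable.items()))   # flush as a sorted run
        memtable.clear()

def get(key):
    if key in memtable:                  # freshest data first
        return memtable[key]
    for sst in reversed(sstables):       # then newest SST to oldest
        i = bisect.bisect_left(sst, (key,))
        if i &lt; len(sst) and sst[i][0] == key:
            return sst[i][1]
    return None
</code></pre>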
<h3 id="heading-reducing-space-amplification"><strong>Reducing Space Amplification</strong></h3>
<p>Over time, as more memtables are flushed, the DB ends up with many SST files, each containing overlapping key ranges and multiple versions of the same key. Here the space amplification grows unchecked, contributing to storage costs.</p>
<p>Most workloads exhibit temporal locality (recent keys are manipulated/accessed more often) and spatial locality (nearby keys tend to be queried together). Since newer SST files primarily hold updated versions of hot keys, older ones can be merged periodically to keep only the most recent data and free up space.</p>
<p>Periodically, the system selects a group of SST files whose key ranges overlap and merges them into a single larger file. During the merge:</p>
<ul>
<li><p>Only the most recent version of each key is preserved.</p>
</li>
<li><p>Tombstones (deletes) are applied, dropping obsolete entries.</p>
</li>
<li><p>Once the new file is persisted, old files are deleted.</p>
</li>
</ul>
<p>“Compactions” help reclaim storage, reduce the number of candidate SST files to search during lookups, and improve locality on disk by keeping on-disk data roughly sorted by key.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237197975/26d7e730-afbb-4b8e-91e8-882166f8fe9b.gif" alt class="image--center mx-auto" /></p>
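<p>A sketch of one such merge (runs ordered oldest to newest; the <code>None</code>-as-tombstone convention is just for illustration):</p>
<pre><code class="lang-python">TOMBSTONE = None   # sketch convention: None marks a deleted key

def compact(runs):
    """Merge overlapping sorted runs into one, keeping only the
    newest version of each key and dropping tombstoned entries."""
    merged = {}
    for run in runs:            # oldest first...
        merged.update(run)      # ...so newer versions overwrite older ones
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)

old = [("a", 1), ("b", 2), ("d", 4)]
new = [("b", 20), ("d", TOMBSTONE)]
print(compact([old, new]))      # [('a', 1), ('b', 20)]
</code></pre>
<p>Real engines stream this as a k-way merge (e.g. with <code>heapq.merge</code>) instead of materializing a dict, but the newest-version-wins rule is the same.</p>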
<h3 id="heading-compaction-tuning-and-hardware-awareness"><strong>Compaction Tuning and Hardware Awareness</strong></h3>
<p>As data volume grows, compactions evolve from lightweight maintenance tasks into heavy background I/O operations that can compete with live reads and writes. Preserving predictable latency and throughput is paramount.</p>
<p>Most workloads naturally exhibit <strong>temporal locality</strong>: recent keys are updated and queried far more frequently than older ones. Consequently, newly created SST files tend to share key ranges with other recent ones.</p>
<p>Over time, this process produces a natural hierarchy of “levels”, small and hot at the top, large and cold at the bottom. Lower levels require less frequent compaction, keeping background I/O predictable and bounded.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762237248963/beb4fd82-3c49-4c2c-aced-59dab91003e2.png" alt class="image--center mx-auto" /></p>
<p>This leveled organization keeps compactions local, caps space growth, and stabilizes performance. It mirrors the same principle that underlies modern hardware hierarchies: hot data close and fast, cold data dense and cheap. Even on a single storage device, leveling delivers these benefits - controlled space usage, bounded read latency, and sustained write throughput.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The exercise shows that database design is not about picking a fancy algorithm, but about <em>evolving a structure that works with real hardware and software constraints</em>. These hardware constraints may not always show themselves as functional problems but as performance penalties, for which the user pays continuously.</p>
<p>Here at <a target="_blank" href="http://www.speedyio.com"><strong>SpeedyIO</strong></a><strong>,</strong> we are bridging the gaps between hardware and software design, one pain-point at a time. We make sure that the components of the system cooperate with each other rather than fight one another.</p>
<p>“We’re building the future of a harmonious database system. If you are as obsessed with low-latency systems and real-world performance as we are - let’s talk.”</p>
]]></content:encoded></item></channel></rss>