Constructing B+ Trees and LSMs - Building a database from first principles

In this post, I want to take you on a journey of designing a database from first principles.
Read this as if you were the engineer tasked with building a data system for the hardware of each era. At every stage, think about what bottleneck you’d hit next and how you might fix it.
We’ll start in the 1970s, where the first data structures emerged to serve single-threaded systems and spinning disks. Then we’ll move into the 1990s, when workloads became write-heavy, storage became layered, and multi-core CPUs became commonplace, forcing a rethink of how databases ingest and serve data.
Along the way, every fix for one bottleneck will reveal the next, showing how these trade-offs map directly to hardware behaviour: cache lines, disk pages, and I/O patterns.
By the end, you’ll not only understand why B+trees and LSMs look the way they do, but what principles guide their evolution today.
First Principles (circa 1970)
If you were asked to build data software in 1970, you’d define a simple contract: store and retrieve values by key, with four core operations:
Insert new key-value pairs
Update existing keys
Lookup a key
Delete a key
A straightforward first choice is an array of KV tuples.
Insertions are cheap, just an append.

Lookups, updates, and deletes involve linear scans: O(N), where N is the number of keys.

It is perfectly functional, albeit painfully slow as N grows.
Lookups can be made quicker by sorting the array by key, enabling binary search, which is O(log N).

But inserts and deletes now require shifting elements: an O(N) write cost.

To speed up insertions, you might switch to a linked list; that way, insertions and deletions won't involve shifting other elements. But this sacrifices random access: no binary search, so lookups fall back to O(N) scans.

This is still extremely inefficient.
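The trade-off between the naive array and the sorted array can be sketched in a few lines. This is a toy illustration of the two designs, not production code:

```python
import bisect

class AppendKV:
    """Unsorted array of KV tuples: O(1) insert, O(N) lookup."""
    def __init__(self):
        self.pairs = []                    # list of (key, value) tuples

    def insert(self, key, value):
        self.pairs.append((key, value))    # cheap append

    def lookup(self, key):
        for k, v in reversed(self.pairs):  # linear scan, newest first
            if k == key:
                return v
        return None

class SortedKV:
    """Sorted array: O(log N) lookup, O(N) insert due to shifting."""
    def __init__(self):
        self.keys, self.values = [], []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value         # update in place
        else:
            self.keys.insert(i, key)       # shifts later elements: O(N)
            self.values.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)  # binary search: O(log N)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None
```

Neither structure wins on both axes, which is exactly the tension the rest of the design has to resolve.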
Adding Guide Posts
What if there were binary-search-like scaffolding above the linked list, so that it could be navigated easily?
Internal nodes: keys plus pointers to child nodes, guiding the search.
Leaf nodes: the actual KV pairs.
Traversals for inserts, lookups, and deletes are now O(log N).

Asymptotics aren’t the Whole Story: Hardware Realities
While the above system looks reasonably optimal in theory, that doesn't guarantee good wall-clock performance.

On real machines:
Each node visit risks a cache-line miss; if the node isn't in main memory (RAM), a block read from storage.
Deep trees amplify these misses.
So, you decide to widen the nodes to hold many keys and child pointers (fan-out m > 2). This reduces the number of accesses to slower memory by reducing the depth of the tree. m is chosen so that each node in the tree fits in a cache line or disk page, maximizing locality.

Now all tree traversals are O(logₘ N). Performance is good, customers are flocking, the stock is rising; investors are happy, for now.
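A back-of-the-envelope calculation shows why wide nodes matter. The 4 KiB page and 8-byte key/pointer sizes below are illustrative assumptions, not figures from the text:

```python
import math

PAGE_SIZE = 4096   # bytes per disk page (assumed)
KEY_SIZE = 8       # bytes per key (assumed)
PTR_SIZE = 8       # bytes per child pointer (assumed)

# Each internal node holds roughly m child pointers and m keys;
# pack as many as possible into one page.
m = PAGE_SIZE // (KEY_SIZE + PTR_SIZE)   # fan-out of 256

def depth(n_keys, fanout):
    """Pages touched on a root-to-leaf traversal: ceil(log_fanout N)."""
    return math.ceil(math.log(n_keys, fanout))

# A binary tree over a billion keys is ~30 levels deep;
# a fan-out-256 tree reaches any key in ~4 page reads.
print(m, depth(10**9, 2), depth(10**9, m))
```

Shrinking the traversal from 30 slow-memory accesses to 4 is the entire point of the wide-node design.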
Context Changes
Fast forward to the late 1990s. Data is rapidly becoming the core of every product and service. You are now a senior architect at a large corp.

Hardware and workloads have evolved: write rates now dwarf reads, storage tiers are multiplying, and CPUs have become multi-core, i.e., multi-threading is mainstream. Databases are now expected to sustain very high write throughput while keeping lookups fast and scaling cleanly with concurrency. The old B+ tree starts to show its limitations.
Struggle with B+ trees
Limited write concurrency: writes (updates and inserts) incur restructuring and in-place updates in the tree.
Restructuring the tree limits concurrency. Even fine-grained locks can't prevent contention in the upper levels of the tree, making DB writes difficult to scale.
Each write becomes a cascade of random read-modify-writes across multiple nodes in the tree: the leaf nodes for the actual data manipulation, and the inner nodes during restructuring. So the quantum of work per write operation is extremely high.
Poor cache locality: nodes can be spread all over the disk; accessing different parts of the tree causes frequent cache-line and page misses. Lots of random I/O, i.e., low throughput.
Limitations on modern hardware: B+ trees assume a single storage hierarchy (RAM -> HDD). But multiple storage levels have emerged - SSDs, network storage, etc. - and the B+ tree architecture is not aware of them.
These are all symptoms of one core design flaw - in-place updates. To modify a key, one must:
Navigate the inner nodes (random I/O)
Rewrite the leaf page (read-modify-write)
Restructure the tree (more random I/O)
What if in-place updates were done away with altogether?
Append, Don’t Overwrite
The fastest operation any disk can perform is a sequential append - writing to the tail of a log. So treat every mutation - insert, update, delete - as a log record appended to the end of a file.
Lookups scan the log from newest to oldest, returning the first match. Multi-threaded writers serialize at the append point, but the critical section is tiny, and therefore quick.

At the outset, this solves many of the problems described above. Writes are simple and fast, with no random read-modify-writes. Lookups are sequential reads, which are disk-friendly, but they cost a lot of I/O for a single key - read amplification.
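A minimal sketch of the append-only design, with an in-memory list standing in for the log file (the record format here is an assumption):

```python
class LogKV:
    """Append-only log: every mutation is a record at the tail."""
    def __init__(self):
        self.log = []                     # (key, value_or_None) records

    def put(self, key, value):
        self.log.append((key, value))     # insert and update look identical

    def delete(self, key):
        self.log.append((key, None))      # tombstone record marks a delete

    def get(self, key):
        # Scan newest-to-oldest; the first match wins.
        # This scan is the read amplification described above.
        for k, v in reversed(self.log):
            if k == key:
                return v                  # None means "deleted"
        return None
```

Note how delete is just another append: the tombstone shadows older records instead of touching them in place.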
Reducing Read Amplification
Digging through several research papers, you find the idea of probabilistic, hash-based query filtering promising: a function that takes a key and answers, probabilistically, whether it exists in a set. Here, bounded false positives are acceptable; false negatives are not.
Break the single large log into many smaller log files, each with a corresponding filter. During a lookup, only read the candidate log files where the key is “possibly present”. The filter's false-positive rate can be tuned by changing its parameters and hash functions. This removes many wasted reads, but the number of candidate log files can still be large, and each one still has to be scanned.
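Such a filter is essentially a Bloom filter. A minimal sketch, where the bit count, hash count, and SHA-256-based hashing are illustrative choices:

```python
import hashlib

class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = 0                      # bit array packed into one int

    def _positions(self, key):
        # Derive n_hashes bit positions per key from salted hashes.
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p            # set the key's bits

    def might_contain(self, key):
        # All bits set => "possibly present"; any bit clear => definitely absent.
        return all(self.bits >> p & 1 for p in self._positions(key))
```

An added key always sets its bits, so a lookup can never miss it; only keys that were never added can (rarely) collide into a false positive.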

Then you wonder: what if you sort all the keys in the log in lexicographical order? That could work, but since a key might appear multiple times within the same log, you'd lose the notion of recency - lookups might return outdated values. Food for thought: unique, sorted on-disk logs would make life easy.
Consider keeping an in-memory sorted structure - a memtable - for the active write set, much like a small B+ tree. Duplicate writes to a key are updated in place in the memtable. On reaching a size threshold, flush it to disk as an immutable sorted string table (SST file). Name SST files monotonically to encode recency. Lookups can now use binary search within candidate SST files.

With this setup, writes are amortized O(log k), where k is the number of keys in the memtable, and reads within an SST file are O(log m), where m is the number of keys per SST file.
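The memtable-plus-SSTs scheme can be sketched as follows, with file I/O replaced by in-memory lists and an arbitrary flush threshold (both assumptions for illustration):

```python
import bisect

class TinyLSM:
    """Memtable (dict) flushed into immutable sorted runs (lists)."""
    def __init__(self, threshold=4):
        self.memtable = {}        # active write set; duplicates update in place
        self.ssts = []            # sorted runs; later position = more recent
        self.threshold = threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self._flush()

    def _flush(self):
        # Flush as a sorted, deduplicated, immutable run.
        self.ssts.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # freshest data first
            return self.memtable[key]
        for run in reversed(self.ssts):       # newest SST first
            keys = [k for k, _ in run]        # a real SST would index this
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]              # binary search within the run
        return None
```

Checking the newest SST first is what preserves recency once a key exists in several runs.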
Reducing Space Amplification
Over time, as more memtables are flushed, the DB ends up with many SST files, each containing overlapping key ranges and multiple versions of the same key. Space amplification grows unchecked, driving up storage costs.
Most workloads exhibit temporal locality (recent keys are manipulated/accessed more often) and spatial locality (nearby keys tend to be queried together). Since newer SST files primarily hold updated versions of hot keys, older ones can be merged periodically to keep only the most recent data and free up space.
Periodically, the system selects a group of SST files whose key ranges overlap and merges them into a single larger file. During the merge:
Only the most recent version of each key is preserved.
Tombstones (deletes) are applied, dropping obsolete entries.
Once the new file is persisted, old files are deleted.
“Compactions” help reclaim storage, reduce the number of candidate SST files to search during lookups, and improve locality on disk by periodically keeping on-disk data roughly sorted by key.
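The merge step above can be sketched as follows. Runs are assumed to be ordered oldest-to-newest, and None is assumed to mark a tombstone:

```python
def compact(runs):
    """Merge sorted runs (ordered oldest-to-newest) into one sorted run.

    Later runs win for duplicate keys; tombstones (value None) are dropped.
    """
    latest = {}
    for run in runs:                   # oldest first, so newer overwrites older
        for key, value in run:
            latest[key] = value
    # Drop tombstones and re-sort to produce the merged, compacted run.
    return sorted((k, v) for k, v in latest.items() if v is not None)
```

A production compactor would stream the runs through a k-way merge rather than materializing everything in a dict, but the version-resolution logic is the same.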

Compaction Tuning and Hardware Awareness
As data volume grows, compactions evolve from lightweight maintenance tasks into heavy background I/O operations that can compete with live reads and writes. Preserving predictable latency and throughput is paramount.
Most workloads naturally exhibit temporal locality: recent keys are updated and queried far more frequently than older ones. Consequently, newly created SST files tend to share key ranges with other recent ones.
Over time, this process produces a natural hierarchy of “levels”, small and hot at the top, large and cold at the bottom. Lower levels require less frequent compaction, keeping background I/O predictable and bounded.

This leveled organization keeps compactions local, caps space growth, and stabilizes performance. It mirrors the same principle that underlies modern hardware hierarchies: hot data close and fast, cold data dense and cheap. Even on a single storage device, leveling delivers these benefits - controlled space usage, bounded read latency, and sustained write throughput.
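The leveled layout can be summarized numerically. The 10x growth factor and 64 MB top-level capacity below mirror common LSM defaults (e.g. in RocksDB) but are assumptions here:

```python
GROWTH = 10           # size ratio between adjacent levels (assumed)
L0_CAPACITY_MB = 64   # capacity of the smallest, hottest level (assumed)

# Level capacities grow geometrically: small and hot at the top,
# large and cold at the bottom.
capacities = [L0_CAPACITY_MB * GROWTH**level for level in range(5)]
print(capacities)
```

Because each level is an order of magnitude larger than the one above it, a handful of levels covers terabytes of data while the hot set stays small and cheap to compact.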
Conclusion
The exercise shows that database design is not about picking a fancy algorithm, but about evolving a structure that works with real hardware and software constraints. These hardware constraints may not always show themselves as functional problems but as performance penalties, for which the user pays continuously.
Here at SpeedyIO, we are bridging the gaps between hardware and software design, one pain point at a time. We make sure that every component of the system cooperates with the others rather than fighting against them.
“We’re building the future of a harmonious database system. If you are as obsessed with low-latency systems and real-world performance as we are - let’s talk.”

