← DB Internals · intermediate · 14 min · 06 / 08 বাংলা

LSM Trees

The write-optimized alternative to B-trees — how RocksDB, Cassandra, and LevelDB handle massive write throughput.

LSM treecompactionmemtableSSTable

B-Trees vs LSM Trees

B-trees are read-optimized: lookups are fast (3-4 disk reads), but every write requires updating pages in place — random disk I/O.

LSM trees flip this tradeoff: all writes go to an in-memory buffer first, then flush to disk as sorted files. Writes are sequential (fast), but reads may need to check multiple files.

Real-World Analogy

Like a donation drop-off center — donors (writes) rapidly drop items into temporary bins (memtable). Periodically, volunteers sort and merge bins into organized shelves (SSTables). Collecting is fast; finding a specific item requires checking multiple bins.

The LSM Tree Architecture

class LSMTree {
	private memtable: SortedMap<string, Row>; // in-memory, sorted
	private immutableMemtables: SortedMap<string, Row>[] = []; // being flushed
	private levels: SSTable[][] = [[], [], [], []]; // L0, L1, L2, L3

	// WRITE: always goes to memtable (RAM) — blazing fast
	async put(key: string, value: Row): Promise<void> {
		// 1. Write to WAL (for durability)
		await this.wal.append(key, value);

		// 2. Insert into memtable (in-memory sorted structure)
		this.memtable.set(key, value);

		// 3. If memtable is full, flush to disk
		if (this.memtable.size >= this.maxMemtableSize) {
			this.immutableMemtables.push(this.memtable);
			this.memtable = new SortedMap();
			await this.flush();
		}
	}

	// READ: check memtable first, then each level
	async get(key: string): Promise<Row | null> {
		// 1. Check active memtable
		if (this.memtable.has(key)) return this.memtable.get(key)!;

		// 2. Check immutable memtables (being flushed)
		for (const mem of this.immutableMemtables) {
			if (mem.has(key)) return mem.get(key)!;
		}

		// 3. Check SSTable levels (L0 first, then L1, L2, ...)
		for (const level of this.levels) {
			for (const sst of level) {
				const result = await sst.get(key);
				if (result !== null) return result;
			}
		}

		return null;
	}
}

SSTables (Sorted String Tables)

When the memtable flushes to disk, it becomes an SSTable — an immutable, sorted file:

interface SSTable {
  // Metadata
  minKey: string;
  maxKey: string;
  bloomFilter: BloomFilter; // quick "definitely not here" check
  index: Map<string, number>; // sparse index: key → file offset

  // Data
  dataBlocks: DataBlock[]; // sorted key-value pairs

  // Lookup
  async get(key: string): Promise<Row | null> {
    // 1. Check bloom filter — fast rejection
    if (!this.bloomFilter.mightContain(key)) return null;

    // 2. Binary search the sparse index
    const blockOffset = this.findBlock(key);

    // 3. Read and search the data block
    const block = await this.readBlock(blockOffset);
    return block.find(key);
  }
}

Bloom filters are crucial for LSM reads. Without them, every read would check every SSTable file. A bloom filter tells you “definitely not in this file” with zero disk I/O — false positives are rare (~1%) and just mean one extra disk read.

Compaction

Over time, LSM trees accumulate many SSTable files. Compaction merges them to reduce read amplification and reclaim space:

// Level-based compaction (used by LevelDB, RocksDB)
async function compact(level: number): Promise<void> {
	// Pick overlapping SSTables from this level and the next
	const source = selectSSTables(level);
	const target = findOverlapping(level + 1, source);

	// Merge-sort all entries
	const merged = mergeSorted([...source, ...target]);

	// Write new SSTables to level + 1
	const newSSTables = await writeNewSSTables(merged, level + 1);

	// Remove old SSTables
	await deleteOldSSTables([...source, ...target]);
}

// Size-tiered compaction (used by Cassandra)
// Merge SSTables of similar size into larger ones
// Simpler but uses more space during compaction

Tradeoffs: Read/Write/Space Amplification

	B-Tree	LSM Tree
Write amplification	~10x (page splits, in-place updates)	~10-30x (compaction rewrites)
Read amplification	1x (single B-tree lookup)	~1-5x (check multiple levels)
Space amplification	~1.5x (page fill factor)	~1.1-2x (temporary duplicates)
Write throughput	Lower (random I/O)	Higher (sequential I/O)
Read latency	Lower (predictable)	Higher (varies)

Choose B-trees for read-heavy workloads (OLTP, typical web apps). Choose LSM trees for write-heavy workloads (time-series, logging, IoT, analytics ingestion).

Key Takeaways

LSM trees buffer writes in memory and flush sorted files to disk — all sequential I/O
Reads check multiple levels — bloom filters and sparse indexes minimize disk reads
Compaction merges files to keep read performance manageable and reclaim space
B-trees win on reads, LSM trees win on writes — pick based on your workload