← NoSQL · advanced · 12 min · 04 / 09

Wide-Column Stores

Cassandra and Bigtable: partition keys, clustering keys, query-first modeling, the write-optimized LSM path, and tunable consistency.

cassandrapartition keyclustering

Real-World Analogy

Think of a vast warehouse organized for putting things away, not browsing. Each pallet (partition) has a unique label, and a forklift can drive straight to it. Within the pallet, boxes are stacked in a fixed order (clustering). You can grab one pallet fast, or scan boxes within it in order — but you cannot ask “find every red box across all pallets” without driving the whole warehouse. To make that query fast, you build a second warehouse organized red-first. Wide-column stores trade flexible querying for the ability to write and read by partition at staggering scale.

The Wide-Column Model

A wide-column store (Cassandra, Google Bigtable, ScyllaDB, HBase) looks superficially like a table, but it behaves very differently. Data is grouped into partitions, and each partition holds an ordered set of rows, where each row is a sparse collection of columns. Different rows can have wildly different columns — hence “wide column.”

The model is best understood through its key structure. A primary key has two parts:

PRIMARY KEY ( (partition_key) , clustering_key1, clustering_key2 )
              ^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
              which node owns    sort order WITHIN the partition
              the data

The partition key decides which node in the cluster stores the data (via a hash). The clustering key(s) decide the sort order of rows inside that partition. Together they uniquely identify a row.

Partition Key vs Clustering Key

This distinction governs everything. Get it wrong and your cluster either melts under hot partitions or can’t answer your queries.

Consider storing a user’s messages, newest first:

CREATE TABLE messages_by_user (
  user_id   uuid,
  sent_at   timestamp,
  message_id uuid,
  body      text,
  PRIMARY KEY ( (user_id), sent_at, message_id )
) WITH CLUSTERING ORDER BY (sent_at DESC);

user_id is the partition key: all of one user’s messages live together on the same nodes, so fetching them is a single-partition read.
sent_at (then message_id) is the clustering key: within the partition, rows are stored sorted by time descending, so “give me this user’s 20 newest messages” is a fast, contiguous scan.

-- Efficient: hits one partition, reads rows in clustering order
SELECT * FROM messages_by_user
WHERE user_id = ? LIMIT 20;

-- ILLEGAL / slow: no partition key means scanning the whole cluster
SELECT * FROM messages_by_user
WHERE body = 'hello';

Two design rules fall out of this. The partition key must spread data evenly (a key like country creates giant hot partitions; a key like user_id spreads load). And every fast query must include the partition key — you cannot filter freely by arbitrary columns the way SQL allows.

Query-First Modeling

In a relational database you design tables around entities and let the planner figure out queries. In Cassandra you do the reverse: you list your queries first, then create one table per query, each laid out so the query is a single-partition read. The same data is duplicated across several tables, each keyed differently.

If you need messages both by user and by conversation, you build two tables:

CREATE TABLE messages_by_user (
  user_id uuid, sent_at timestamp, message_id uuid, body text,
  PRIMARY KEY ( (user_id), sent_at, message_id )
);

CREATE TABLE messages_by_conversation (
  conversation_id uuid, sent_at timestamp, message_id uuid, body text,
  PRIMARY KEY ( (conversation_id), sent_at, message_id )
);

Writing a message inserts into both tables. This denormalization feels wasteful coming from SQL, but storage is cheap and the payoff is that every read is a fast single-partition lookup. There is no join engine to fall back on, so you trade write amplification and duplication for predictable read performance at any scale.

Note: A useful mantra for wide-column modeling: “joins and ad-hoc filters are not available, so design the table to be the answer.” If a new query appears that no existing table serves efficiently, the fix is usually a new table (or a materialized view), not a clever WHERE clause.

The Write Path

Wide-column stores are write-optimized, and the reason is their storage engine: the LSM-tree (Log-Structured Merge-tree). A write does not seek into a file to update a row in place. Instead:

1. Append the write to a commit log (durability).
2. Apply it to an in-memory table (the memtable).
3. Return success to the client.   ← write is done, very fast

Later, asynchronously:
4. Flush the memtable to an immutable on-disk file (SSTable).
5. Periodically merge/compact SSTables, dropping superseded values.

Because every write is an append (sequential disk I/O, no random seeks, no read-before-write), wide-column stores ingest writes at enormous rates. Updates and deletes are also just appends — a delete writes a tombstone marker that hides older values until compaction physically removes them. This is why these stores excel at time-series data, event logs, IoT telemetry, and any append-heavy firehose.

The cost lands on reads: a single row’s current value may be spread across the memtable and several SSTables, so a read must merge them. Good partition/clustering design and compaction keep this fast; sloppy design (or too many tombstones) makes reads crawl.

Tunable Consistency (Intro)

Wide-column stores replicate each partition to several nodes (the replication factor, say RF=3). On each request you choose a consistency level that says how many replicas must respond before the operation counts as successful.

RF = 3   (three copies of every row)

Write at ONE     → 1 replica must ack   (fast, weak durability)
Write at QUORUM  → 2 of 3 must ack      (balanced)
Read  at QUORUM  → 2 of 3 must respond  (balanced)
Read/Write at ALL→ all 3                (strong, low availability)

The famous guarantee: if R + W > RF (read replicas plus write replicas exceed the replication factor), a read is guaranteed to see the latest write, because the read and write quorums must overlap on at least one replica. With RF=3, writing at QUORUM (2) and reading at QUORUM (2) gives 2 + 2 > 3 — strong consistency, while still tolerating one dead node.

Warning: Choosing low consistency levels (ONE) for both reads and writes maximizes availability and speed but means a read can easily miss a recent write — 1 + 1 is not greater than 3. Decide consistency per operation based on how much staleness that specific read or write can tolerate. We dig into quorums, conflict resolution, and read repair in chapter 7.

Wide-column stores are the right tool when you have massive write volume, predictable access patterns, and a need for linear horizontal scaling with no single point of failure. They are the wrong tool when your queries are ad hoc, your relationships are complex, or your scale is modest enough for a relational database.

Finished reading?

Mark complete to earn 30 XP