Sharding and partitioning
# Why?
- The main reason for wanting to partition is scalability.
- For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes.
- Large, complex queries can potentially be parallelized across many nodes, although this gets significant harder.
# Partition and Replication
- Partition is usually combined with replication so that copies of each partition are stored on multiple nodes.
- Even though each record belongs to exactly one partition, it may still be stored on several different nodes for Fault tolerance
- leader-follower replication model
# Partition of Key-Value Data
- Our goal with partitioning is to spread the data and the query load evenly across nodes.
- If the partitioning is unfair:
- skewed
- less effective
- Bottle neck
# Partition by Key Range
assign a continuous range of keys (from some minimum to some maximum) to each partition.
If you know the boundaries between the range, you can easily determine which partition contains a given key.
The range of keys is not necessarily evenly spaced, because your date may not be evenly distributed.
In order to distribute the data evenly, the partition boundaries need to adapt to the data.
Within each partition, we can keep keys in sorted order (SSTable). Some advantages:
- range scans are easy
- You can treat the key as a concatenated index in order to fetch several related records in one query (Multi-column indexes)
Disadvantages:
- certain access pattern can lead to hot spots For example: consider an application that stores data from a network of sensors, where the key is the timestamp of measurement (year-month-day-hour-minute-second). Range scan are very useful in this case, because they let you easily fetch, say, all the readings from a particular month. But if the key is timestamp, all writes in a day end up going to the same partition, so that partition can be overloaded while others sit idle. To avoid this problem, you need to use something other than the timestamp as the first element of the key. For example, prefixing each timestamp with the sensor name. Now, when you want to fetch the values of multiple sensors within a time range, you need to perform a separate range query for each sensor name.
# Partition by Hash of Key
- the hash function need not be cryptographically strong