Cassandra Overview
What is Cassandra?
Replication Factor
Example Write[*]
Flush Process[*]
Compaction[*]
With each Flush, SSTables accumulate
Compaction periodically consolidates SSTables into a single file using merge sort
Merges duplicate keys using last-write-wins policy
Removes records marked for deletion
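The compaction steps above can be sketched in a few lines. This is a minimal illustration, not Cassandra's actual implementation: each SSTable is modeled as a list of hypothetical (key, timestamp, value) tuples sorted by key, a value of None stands in for a deletion marker, and the function name compact is made up for this example. Real Cassandra also waits for a grace period before dropping deletion markers, which this sketch ignores.

```python
import heapq
from itertools import groupby
from operator import itemgetter

TOMBSTONE = None  # hypothetical deletion marker used only in this sketch


def compact(sstables):
    """Merge sorted SSTables into one sorted list.

    Each SSTable is a list of (key, timestamp, value) tuples sorted by key.
    For duplicate keys the newest timestamp wins; keys whose newest
    version is a tombstone are dropped from the output.
    """
    # Merge step of merge sort: stream all inputs in (key, timestamp) order.
    merged = heapq.merge(*sstables, key=itemgetter(0, 1))
    result = []
    for _key, versions in groupby(merged, key=itemgetter(0)):
        newest = max(versions, key=itemgetter(1))  # last-write-wins
        if newest[2] is not TOMBSTONE:             # skip deleted records
            result.append(newest)
    return result


older = [("a", 1, "x"), ("b", 1, "y"), ("c", 1, "z")]
newer = [("a", 2, "x2"), ("c", 3, TOMBSTONE)]
compacted = compact([older, newer])
```

Here key "a" keeps its newer value, "b" is carried over unchanged, and "c" disappears because its most recent version is a deletion marker.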
Disk IO[*]
Cassandra uses sequential instead of random IO, which matters far more on HDDs than on SSDs
Relational DBMSs front-load IO, but Cassandra back-loads it: IO is deferred until Compaction
File System needs to keep up with Compaction and its sequential IO; therefore, use local storage, NOT shared storage
Partitioning/Sharding[*]
Data are partitioned by hashing the partition key (the first part of the primary key) into a token
Each node is responsible for a range of tokens
Tokens can range from -2^63 to 2^63 - 1, but in this example we use 1 to 100
Partitioning Example
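A rough sketch of the 1-to-100 token scheme, under stated assumptions: the four node names and their range boundaries are invented for illustration, and MD5 stands in for Cassandra's real partitioner hash purely to get a stable number. Each node owns the tokens greater than the previous node's token, up to and including its own.

```python
import bisect
import hashlib

# Hypothetical ring: each node owns tokens up to and including its token.
# node1 owns 1-25, node2 owns 26-50, node3 owns 51-75, node4 owns 76-100.
NODE_TOKENS = [25, 50, 75, 100]
NODE_NAMES = ["node1", "node2", "node3", "node4"]


def token_for(partition_key: str) -> int:
    """Hash a partition key into a token between 1 and 100 (inclusive)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 100 + 1


def node_for_token(token: int) -> str:
    """Return the node whose token range contains the given token."""
    index = bisect.bisect_left(NODE_TOKENS, token)
    # Wrap around the ring if the token falls past the last boundary.
    return NODE_NAMES[index % len(NODE_NAMES)]


def node_for_key(partition_key: str) -> str:
    """Route a partition key to its owning node."""
    return node_for_token(token_for(partition_key))
```

Calling node_for_key("user42") always returns the same node for the same key, which is what lets any node in the cluster work out where a row lives without a central lookup.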
Components
ZooKeeper - a coordination service commonly used for leader election. Leader election is the process of designating a single service as the leader; all backups recognize the leader[*]
Aurora - a scheduler for batch jobs and long-running services
Persistent Volume - a storage volume that exists outside the task’s sandbox and will persist on the node even after the task dies or completes[*]