Cassandra Overview

Cassandra.png

What is Cassandra?

  • Database Management System (DBMS)

  • Open-source (Apache)

  • Distributed - runs across many commodity servers[*]

  • Designed to handle large amounts of data

  • Fault-tolerant

    • No Single Point of Failure (SPOF)

    • Runs across multiple data centers

    • Asynchronous, masterless replication

Fully Replicated[*]

Fully-Replicated.png

Replication Factor

Cassandra-Replication-Factor.png

Example Write[*]

Cassandra-Write-Example.png

Flush Process[*]

Cassandra-Flush-Process.png

Compaction[*]

Cassandra-Compaction.png
  • With each Flush, SSTables accumulate

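  • Compaction periodically consolidates several SSTables into a single file using a merge sort over their already-sorted keys

  • Resolves duplicate keys with a last-write-wins policy: the version with the newest timestamp is kept

  • Removes records marked for deletion (tombstones)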
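
The compaction steps above can be sketched in a few lines of Python. This is a toy model, not Cassandra's actual implementation: the `Cell` type, the `compact` function, and the use of `None` as a tombstone are all assumptions made for illustration. Real Cassandra also keeps tombstones for a grace period before purging them, which this sketch ignores.

```python
import heapq
from typing import Iterable, List, NamedTuple, Optional


class Cell(NamedTuple):
    """Hypothetical record: one version of one key."""
    key: str
    timestamp: int
    value: Optional[str]  # None stands in for a tombstone (deleted record)


def compact(sstables: Iterable[List[Cell]]) -> List[Cell]:
    """Merge key-sorted SSTables into one key-sorted table."""
    # heapq.merge streams the inputs in key order without loading
    # and re-sorting everything: the "merge" half of merge sort.
    merged = heapq.merge(*sstables, key=lambda cell: cell.key)

    result: List[Cell] = []
    newest: Optional[Cell] = None
    for cell in merged:
        if newest is not None and cell.key != newest.key:
            # Finished one key: keep it unless the winner is a tombstone.
            if newest.value is not None:
                result.append(newest)
            newest = None
        # Last-write-wins: remember the version with the newest timestamp.
        if newest is None or cell.timestamp > newest.timestamp:
            newest = cell
    if newest is not None and newest.value is not None:
        result.append(newest)
    return result


# Two SSTables from successive flushes; the newer one updates "a" and deletes "b".
older = [Cell("a", 1, "1"), Cell("b", 1, "2"), Cell("c", 1, "3")]
newer = [Cell("a", 2, "10"), Cell("b", 3, None)]
compacted = compact([older, newer])
print(compacted)
```

After compaction only the newest version of "a" and the untouched "c" remain; "b" is dropped because its newest version is a tombstone.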


Disk IO[*]

  • Cassandra writes to disk sequentially rather than with random IO, which matters most on HDDs, where seeks are expensive, and less on SSDs

  • Relational DBMSs front-load IO, but Cassandra back-loads it: much of the IO is deferred until Compaction

  • File System needs to keep up with Compaction and its sequential IO; therefore, use local storage, NOT shared storage

Partitioning/Sharding[*]

Cassandra-Sharding.png
  • Data are partitioned by hashing the partition key (the first part of the primary key) into a token

  • Each node is responsible for a range of tokens

  • Tokens can range from -2^63 to 2^63 - 1 (with the default Murmur3 partitioner), but this example uses 1 to 100
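
A minimal sketch of the token lookup described above, using the same simplified 1 to 100 token space. The four-node ring, the node names, and the functions `token_for` and `node_for_token` are hypothetical, and MD5 stands in for Cassandra's real hash function purely to keep the example within the Python standard library.

```python
import bisect
import hashlib

# Hypothetical ring: each node owns the tokens up to and including its own,
# so node-A owns 1-25, node-B owns 26-50, and so on.
RING = [(25, "node-A"), (50, "node-B"), (75, "node-C"), (100, "node-D")]


def token_for(partition_key: str) -> int:
    """Hash a partition key into the toy token space 1..100."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 + 1


def node_for_token(token: int) -> str:
    """Find the node whose token range contains the given token."""
    tokens = [node_token for node_token, _ in RING]
    # First node whose token is >= the given token; wrap around the ring.
    index = bisect.bisect_left(tokens, token) % len(RING)
    return RING[index][1]


for key in ("alice", "bob", "carol"):
    token = token_for(key)
    print(key, token, node_for_token(token))
```

Because the token depends only on the key, any node can compute which node owns a given row without consulting a central directory.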


Partitioning Example

Cassandra-Sharding-Example.png

Running Cassandra on Mesos[*]

Mesos

  • An operating system for the data center (the idea behind DC/OS, which is built on Mesos)

  • Pools machine resources (CPU, memory, storage) across the data center and manages them holistically

How Uber Does It[*]

Cassandra-on-Mesos.png

Components

  • ZooKeeper - a coordination service commonly used for leader election. Leader election is the process of designating a single service instance as the leader; all standby instances recognize the leader[*]

  • Aurora - a scheduler for batch jobs and long-running services

  • Persistent Volume - a storage volume that exists outside the task’s sandbox and will persist on the node even after the task dies or completes[*]
