Best Practices for Cassandra Cluster Sizing

1.0          Introduction

1.1            Purpose

This document is intended to provide some information about best practices for Cassandra and datastax enterprise Cluster sizing.

2.0          Cassandra Cluster sizing

Let’s talk about sizing our cluster. It is a big process, there are many rules to follow. We have to understand what we are doing.
The process

  • We will need to understand
  • Application data model and use case
  • Volume of data over time
  • Velocity of data: how fast it can go

2.1            Data Model and Use Case

  • What is the read/write bias? How many reads/ second and writes/second we are doing.
  • It is more read or more write?
  • What are my SLAs? The business process will give us the SLAs.
  • Do I need to tier out my data? We will be ok if my data disappear after a year or I need to keep it Indefinitely over time

2.2            Volume of data

  • Estimate the volume base on your application: You will to look at tables to know the different data type and amount of data per data type. How many rows per data you will have? Cassandra stores only data which you will give it.
  • Data per node

(Avg. amount of data per row) X (number of rows) / # nodes

2.3            Velocity of data

  • How many writes per second? If we have very writes per second, that can drive over you capacity
  • How many reads per second? It can follow over time with the writes
  • Volume of data per operation: If you are doing a million writes per second, how much volume of data is in each writes? Is it one KB writes or 1 MB writes

2.4            Factoring in replication

  • Multiply per node volume by Replication Factor
  • Example: RF=3

(Node volume 10G per day per node) x (RF)= 30G per node
If you know a volume data per node, we need to multiply that by the RF.

2.5            Testing limits

You need to understand also the hardware

  • Use Casandra-stress to simulate workload
  • Test production level servers: Use production level server
  • Monitor to find limits
  • Disk is the first thing to manage: How much disk do you have per node? , How much throughput can you get?
  • CPU is second: Do you see a lot of CPU when you run your operation? Do you have enough CPU?

3.0          Example sizing exercise

  • Let’s assume our application will create 100G of data per day
  • Writes will be 150k per second
  • Reads will be 100K per second

That we will expect to get a good idea how we can size our cluster.

3.1            Requirements before calculation

Replication factor of 3
Multiple Data Center: Here we have 2 DC
SLA: 25ms read latency 95th percentile (don’t use average, medium but only 95th percentile)

3.2            Testing summary

When we tested on production equipment, we find some limits. We knew we have to maintain:

  • Node capabilities to maintain < 25ms read
  • At maximum packet size, 50k writes/sec
  • At maximum packet size, 40k reads /sec
  • 4T per node to allow for compaction overhead

Now we have the estimation, what each node can handle to make the final calculation.

3.3            Sizing for Volume

  • (100G per day) X (30 days) X (6 months) = 18 T data
  • (18T data) X (RF 3) = 54 T total cluster load across the Cluster.
  • 3.4            How many nodes we will need to do that?

    • (54 T total data across the cluster) / (4T max per node) = 14 nodes
    • We have two data centers to manage)

    (14 nodes) X (2 data centers)  = 28 nodes total

    3.5            Sizing for Velocity

    • (150K writes/sec load) / (50k writes/sec per node) = 3 nodes
    • Volume capacity covers writes capacity: writes capacity is much lower we need on the Volume

    3.6            Future capacity

    • Validate your assumptions often
    • Monitor for change over time
    • Plan for increasing cluster size before you need it: We will be sur to have the hardware available, if we are on the cloud environment, it will be easy.
    • Be ready to draw down if needed with the decommission command