Apache Cassandra: A System Architect's First Look

Константин Потапов
40 min

An expert Cassandra deep dive for Python/backend engineers: from the masterless architecture to a production-ready cluster with monitoring.

When PostgreSQL Starts To Give Up

On a recent project, monitoring showed a 500ms p95 on queries against a table with 200M rows. Adding indexes brought it down to 400ms; partitioning shaved off another 100ms. Then user growth ate all of that progress within a month. Vertical scaling hit instance limits, and horizontal sharding for PostgreSQL turned into a maintenance nightmare.

The problem was not PostgreSQL - it was doing its job. The issue is that the relational model with ACID guarantees comes at a price once you need to scale horizontally. When your requirements are:

  • Linear scaling of write load (add a node -> write throughput grows proportionally)
  • High availability with no single point of failure (a node dies -> the system keeps working)
  • Geographic data distribution (data in multiple data centers)
  • Millions of writes per second with predictable latency

...then Cassandra is not just an alternative - it is a different class of solution.

Cassandra does NOT replace PostgreSQL. It is a tool for cases where the relational model becomes a bottleneck. If you do not have write-heavy loads of 100k+ writes per second or multi-datacenter replication requirements - stay with PostgreSQL.

What Cassandra Is in Plain English

Masterless architecture

Imagine PostgreSQL: one master accepts writes, N replicas serve reads. If the master goes down, the system is read-only until failover. This is the classic master-slave architecture with a single point of failure.

Cassandra works differently - peer-to-peer, every node is equal:

  • No master node - every node can accept both writes and reads
  • No single point of failure - 2 nodes out of 10 die, the other 8 keep working
  • Linear scalability - add a node and throughput grows proportionally
  • Geographic distribution - nodes across data centers without complicated setup

Analogy: a torrent network

PostgreSQL is a centralized file server: one source, everybody downloads from it. The server dies -> the file is unavailable.

Cassandra is a torrent: the file is split into pieces, every peer stores several pieces. Even if 30% of peers drop out -> the file is still available, you download from the remaining ones.

Key concepts

Keyspace - equivalent of a database in PostgreSQL. Stores tables and defines a replication strategy.

Table - a data structure, but not a relational one. A table in Cassandra stores rows by key rather than relations between entities.

Partition Key - defines which nodes store the data. All rows with the same partition key are guaranteed to be on the same nodes in the cluster.

CREATE TABLE users (
    user_id UUID,
    email TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id)  -- user_id is the partition key
);
 
-- All rows for user_id=123 will be on the same nodes
-- Cassandra hashes user_id and decides which nodes store this data
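
To see where a row will land, CQL exposes the partitioner hash directly via the token() function (the values come from the Murmur3 partitioner by default):

-- Which token does each row hash to?
SELECT token(user_id), user_id FROM users;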

Clustering Key - defines the order of rows inside a partition. Used for sorting and efficient range queries.

CREATE TABLE user_events (
    user_id UUID,
    event_time TIMESTAMP,
    event_type TEXT,
    payload TEXT,
    PRIMARY KEY (user_id, event_time)
    -- user_id = partition key (which nodes store the data)
    -- event_time = clustering key (how to sort inside the partition)
) WITH CLUSTERING ORDER BY (event_time DESC);
 
-- All events for user_id=123 live together (one partition)
-- Inside the partition they are sorted by event_time from new to old
-- Fetching the last 100 events is O(1) read without scanning the whole table

Replication Factor (RF) - how many copies of the data are stored. RF=3 means every row is physically stored on 3 different nodes.

Consistency Level - how many nodes must acknowledge an operation.

# The Python driver sets the consistency level on the statement, not on execute()
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
 
# Write with CL=QUORUM: wait for acknowledgement from a majority of replicas
stmt = SimpleStatement(insert_query, consistency_level=ConsistencyLevel.QUORUM)  # 2 of 3 for RF=3
session.execute(stmt, params)
 
# Write with CL=ONE: wait for any single replica
stmt = SimpleStatement(insert_query, consistency_level=ConsistencyLevel.ONE)  # Fast, but risks stale reads
session.execute(stmt, params)
 
# Read with CL=ALL: wait for all replicas
stmt = SimpleStatement(select_query, consistency_level=ConsistencyLevel.ALL)  # Slow, but the freshest data
session.execute(stmt, params)

Consistency guarantees

Cassandra is an AP system per the CAP theorem:

  • Availability - the system always responds even if nodes fail
  • Partition tolerance - works even if there are network issues between nodes
  • Consistency - eventual consistency: data will become the same everywhere over time, but NOT instantly

This means Cassandra chooses availability over strict consistency. When a node is down or the network is flaky, the system keeps working, but different nodes can temporarily see different versions of data.
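
A practical corollary: consistency is tunable per query. If the number of replicas acknowledging a write plus the number contacted on a read exceeds RF, the read is guaranteed to overlap at least one replica holding the latest acknowledged write. A minimal sketch with RF=3, reusing the insert_query/select_query placeholders above (QUORUM + QUORUM = 2 + 2 > 3):

# QUORUM write followed by QUORUM read: the two replica sets must intersect,
# so the read observes the acknowledged write (per-query strong consistency)
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
 
write_stmt = SimpleStatement(insert_query, consistency_level=ConsistencyLevel.QUORUM)
session.execute(write_stmt, params)
 
read_stmt = SimpleStatement(select_query, consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(read_stmt, params)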

Comparison with PostgreSQL:

PostgreSQL (CP)               | Cassandra (AP)
------------------------------|-------------------------------------------
Strong consistency            | Eventual consistency
Master dies -> read-only      | N nodes drop -> system keeps working
ACID transactions             | No transactions across partitions
Vertical scaling (up to 96TB) | Horizontal scaling (hundreds of nodes)
Complex sharding              | Automatic data distribution
Joins, foreign keys           | Denormalization, data duplication
Good for: OLTP, analytics     | Good for: time-series, logs, IoT, metrics

CAP theorem in plain words:

It is impossible to guarantee all at once:

  1. Consistency - all nodes see the same data
  2. Availability - the system always responds
  3. Partition tolerance - it works despite network partitions

Cassandra chooses AP: the system stays available, but consistency is eventual. PostgreSQL chooses CP: consistency is strict, but if the master is down - availability is broken until recovery.

Why Cassandra Matters for a Backend Developer

Typical scenarios

1. Time-series data (metrics, logs, events):

Problem: PostgreSQL stores metrics for millions of devices. Every second 100k writes, 8.6B rows per day. The table bloats to 5TB, queries on old data are slow, manual partitioning becomes a nightmare.

-- PostgreSQL: scales poorly for time-series
CREATE TABLE device_metrics (
    device_id UUID,
    metric_time TIMESTAMP,
    cpu_usage FLOAT,
    memory_usage FLOAT,
    PRIMARY KEY (device_id, metric_time)
);
 
-- Problems:
-- 1. INSERT load hits a single master
-- 2. Time-based partitioning must be created manually
-- 3. Queries on old partitions are slow
-- 4. Sharding by device_id needs an extra proxy layer (e.g. Citus)

With Cassandra this pattern is natural:

-- Cassandra: ideal for time-series
CREATE TABLE device_metrics (
    device_id UUID,
    metric_time TIMESTAMP,
    cpu_usage FLOAT,
    memory_usage FLOAT,
    PRIMARY KEY (device_id, metric_time)
) WITH CLUSTERING ORDER BY (metric_time DESC);
 
-- Advantages:
-- 1. INSERTs are distributed across all nodes automatically
-- 2. Data for each device_id live together (one partition)
-- 3. Inside the partition metrics are sorted by time
-- 4. Fetching the latest 1000 metrics of a device - O(1) read
-- 5. Add a node -> write throughput grows proportionally

Practical example:

from cassandra.cluster import Cluster
from datetime import datetime
import uuid
 
# Connect to the cluster
cluster = Cluster(['192.168.1.10', '192.168.1.11', '192.168.1.12'])
session = cluster.connect('monitoring')
 
# Write metrics - automatically distributed across nodes
device_id = uuid.uuid4()
for i in range(1000):
    session.execute(
        """
        INSERT INTO device_metrics (device_id, metric_time, cpu_usage, memory_usage)
        VALUES (%s, %s, %s, %s)
        """,
        (device_id, datetime.utcnow(), 45.2 + i * 0.1, 62.8 + i * 0.05)
    )
 
# Read the last 100 metrics - O(1) thanks to the clustering key
rows = session.execute(
    """
    SELECT metric_time, cpu_usage, memory_usage
    FROM device_metrics
    WHERE device_id = %s
    LIMIT 100
    """,
    (device_id,)
)
 
# All 100 rows are on the same nodes and sorted - fast read
for row in rows:
    print(f"{row.metric_time}: CPU {row.cpu_usage}%, MEM {row.memory_usage}%")

2. Logging and audit:

Problem: a centralized log server writes 500k lines per second. PostgreSQL hits disk IOPS limits, the master is overloaded, replication lags by hours.

-- Cassandra for logging
CREATE TABLE application_logs (
    app_name TEXT,
    log_date DATE,
    log_time TIMESTAMP,
    level TEXT,
    message TEXT,
    PRIMARY KEY ((app_name, log_date), log_time)
) WITH CLUSTERING ORDER BY (log_time DESC);
 
-- Composite partition key (app_name, log_date):
-- - Logs of each app for a given day stay together
-- - Automatic "partition per day" without manual creation
-- - Queries by app and date read from a single partition

Advantages:

  • Logs for different apps and days land on different nodes (balanced load)
  • Writing 500k lines/sec is spread across the whole cluster
  • Old logs from past months do not affect the speed of writing new ones
  • You can set TTL - logs older than 30 days are deleted automatically
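
For example, a 30-day retention can be set directly on the table above; a small sketch:

-- Every newly written log row expires after 30 days (2,592,000 seconds)
ALTER TABLE application_logs WITH default_time_to_live = 2592000;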

3. Counters and real-time metrics:

Problem: need to count clicks on millions of articles in real time. PostgreSQL requires UPDATE ... SET clicks = clicks + 1 - every UPDATE locks the row, under high contention throughput drops.

-- Cassandra has native counter columns
CREATE TABLE article_stats (
    article_id UUID PRIMARY KEY,
    views COUNTER,
    clicks COUNTER,
    shares COUNTER
);
 
-- Increment without locks
UPDATE article_stats
SET views = views + 1
WHERE article_id = 550e8400-e29b-41d4-a716-446655440000;

How it works:

Cassandra uses a special COUNTER data type that supports atomic increments without locks. Several nodes can increment the counter simultaneously, Cassandra resolves conflicts via CRDT (Conflict-free Replicated Data Type).

Important: COUNTER columns have restrictions:

  • Cannot be mixed with regular columns (apart from the primary key, a counter table holds only counter columns)
  • Cannot read and increment in a single operation (there are no transactions)
  • Counters are eventually consistent: each replica keeps its own shard of the count, and merges sum the shards, so concurrent increments from different nodes are combined rather than lost

4. Sessions and caching:

Problem: Redis stores user sessions but under growth it hits the memory limit of a single instance. Redis Cluster is complex to operate, and durability still depends on careful RDB/AOF tuning.

-- Cassandra for sessions
CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id UUID,
    data TEXT,  -- JSON with session data
    expires_at TIMESTAMP
) WITH default_time_to_live = 86400;  -- TTL 24 hours
 
-- Advantages:
-- 1. Persistence on disk (sessions survive restarts)
-- 2. Automatic deletion via TTL
-- 3. Linear scaling - add a node, more memory for sessions
-- 4. Geographic distribution - sessions replicate between data centers

5. Queues and message passing:

Problem: you need a task queue, but Kafka is overkill (just enqueue/dequeue), and Redis Streams offer weaker durability guarantees when a node fails.

-- Cassandra for simple queues
CREATE TABLE task_queue (
    queue_name TEXT,
    task_id TIMEUUID,
    payload TEXT,
    status TEXT,
    PRIMARY KEY (queue_name, task_id)
) WITH CLUSTERING ORDER BY (task_id ASC);
 
-- Enqueue
INSERT INTO task_queue (queue_name, task_id, payload, status)
VALUES ('email_notifications', now(), '{"to": "user@example.com"}', 'pending');
 
-- Dequeue (atomic update)
UPDATE task_queue
SET status = 'processing'
WHERE queue_name = 'email_notifications'
  AND task_id = <smallest pending task_id>
IF status = 'pending';  -- LWT for atomicity

Important: Lightweight transactions (LWT) are slow - use them only when atomicity is critical. For high-load queues prefer Kafka or RabbitMQ.
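
If you do reach for an LWT, the driver reports whether the condition held. A minimal sketch against the task_queue table above (claim_task and the concrete task_id are illustrative names):

# ResultSet.was_applied reflects the [applied] column Cassandra returns for LWTs
claim_task = session.prepare(
    """
    UPDATE task_queue SET status = 'processing'
    WHERE queue_name = ? AND task_id = ?
    IF status = 'pending'
    """
)
 
result = session.execute(claim_task, ('email_notifications', task_id))
if result.was_applied:
    ...  # this worker won the claim and can safely process the task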

When Cassandra Is Overkill

Do NOT use Cassandra if:

  • Load < 10k writes per second (PostgreSQL + indexes is enough)
  • You need JOINs and complex analytical queries (use PostgreSQL or ClickHouse)
  • Strict consistency is critical (bank transactions -> PostgreSQL with ACID)
  • The team is not ready to think in terms of denormalization (Cassandra requires a different modeling mindset)
  • Infrastructure budget is tight (minimum production cluster is 3 nodes)

Use Cassandra if:

  • You need linear write scalability
  • High availability is critical (no single point of failure)
  • Geographic data distribution (multi-datacenter)
  • Time-series data: metrics, logs, events
  • Denormalization is acceptable (data duplication is not an issue)

Installing and Running Cassandra

Docker Compose (for dev/testing)

Create docker-compose.yml:

services:
  cassandra:
    image: cassandra:5.0
    container_name: cassandra-node1
    ports:
      - "9042:9042" # CQL port
      - "7199:7199" # JMX monitoring
    environment:
      - CASSANDRA_CLUSTER_NAME=DevCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack1
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=512M
      - HEAP_NEWSIZE=128M
    volumes:
      - cassandra_data:/var/lib/cassandra
    healthcheck:
      test: ["CMD-SHELL", "cqlsh -e 'describe cluster'"]
      interval: 30s
      timeout: 10s
      retries: 5
 
volumes:
  cassandra_data:

Run:

# Start Cassandra
docker compose up -d
 
# Check status
docker compose ps
 
# Logs
docker compose logs -f cassandra
 
# Connect via cqlsh (CQL shell)
docker exec -it cassandra-node1 cqlsh
 
# Check the cluster in cqlsh
cqlsh> DESCRIBE CLUSTER;
cqlsh> SELECT * FROM system.local;

Docker Compose is only for dev/testing. In production use at least 3 nodes in different availability zones for fault tolerance.

Creating a Keyspace and a Table

Keyspace in Cassandra is the equivalent of a database in PostgreSQL. It defines the replication strategy.

-- Create a keyspace with SimpleStrategy (for dev)
CREATE KEYSPACE IF NOT EXISTS demo
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};
 
-- In production use NetworkTopologyStrategy
CREATE KEYSPACE IF NOT EXISTS production_app
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3  -- 3 copies of data in datacenter1
};
 
-- Use the keyspace
USE demo;
 
-- Create a table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email TEXT,
    full_name TEXT,
    created_at TIMESTAMP,
    last_login TIMESTAMP
);
 
-- Create a table with a composite primary key
CREATE TABLE user_sessions (
    user_id UUID,
    session_id TIMEUUID,
    ip_address TEXT,
    user_agent TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, session_id)
) WITH CLUSTERING ORDER BY (session_id DESC);

What the parameters mean:

  • SimpleStrategy - simple replication, ignores data center topology (dev only)
  • NetworkTopologyStrategy - respects data centers and racks (production)
  • replication_factor: 1 - one copy of the data (dev), in production minimum 3
  • PRIMARY KEY (user_id, session_id) - user_id = partition key, session_id = clustering key

Basic operations (CQL)

CQL (Cassandra Query Language) is syntactically similar to SQL but with key differences. CQL hides the internal storage structure (wide rows, column families) behind a familiar table interface.

Key differences from SQL:

  • No JOINs - denormalize data
  • WHERE works only with partition key and clustering key (or ALLOW FILTERING)
  • No transactions across partitions
  • No foreign keys or constraints
  • GROUP BY and aggregations are limited

Core operations:

-- INSERT
INSERT INTO users (user_id, email, full_name, created_at)
VALUES (uuid(), 'constantin@potapov.me', 'Constantin Potapov', toTimestamp(now()));
 
-- SELECT
SELECT * FROM users WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;
 
-- UPDATE
UPDATE users
SET last_login = toTimestamp(now())
WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;
 
-- DELETE
DELETE FROM users WHERE user_id = 550e8400-e29b-41d4-a716-446655440000;
 
-- Batch operations (careful: these two rows land in different partitions - see the warning below)
BEGIN BATCH
    INSERT INTO users (user_id, email, full_name) VALUES (uuid(), 'user1@ex.com', 'User 1');
    INSERT INTO users (user_id, email, full_name) VALUES (uuid(), 'user2@ex.com', 'User 2');
APPLY BATCH;

BATCH in Cassandra is NOT a transaction! A batch within one partition (same partition key) is applied atomically and in isolation. A logged batch across partitions guarantees only that all statements eventually apply - no isolation, no rollback - and an unlogged one not even that.

Python Clients for Cassandra

Official DataStax Driver

The standard choice for production:

pip install cassandra-driver

Basic example:

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import dict_factory
import uuid
 
# Configure execution profile
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='datacenter1')
    ),
    row_factory=dict_factory  # Return rows as dict
)
 
# Connect to the cluster
cluster = Cluster(
    contact_points=['192.168.1.10', '192.168.1.11', '192.168.1.12'],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    protocol_version=4,
)
 
session = cluster.connect('demo')
 
# Prepared statement for performance
insert_user = session.prepare(
    """
    INSERT INTO users (user_id, email, full_name, created_at)
    VALUES (?, ?, ?, toTimestamp(now()))
    """
)
 
# Insert data
user_id = uuid.uuid4()
session.execute(insert_user, (user_id, 'test@example.com', 'Test User'))
 
# Read data
rows = session.execute(
    "SELECT * FROM users WHERE user_id = %s",
    (user_id,)
)
 
for row in rows:
    print(f"User: {row['full_name']}, Email: {row['email']}")
 
# Close connection
cluster.shutdown()

Async version (aiocassandra)

For integration with FastAPI and other async frameworks:

pip install aiocassandra

FastAPI example:

from fastapi import FastAPI, HTTPException
from cassandra.cluster import Cluster
from cassandra.query import dict_factory
from aiocassandra import aiosession
from pydantic import BaseModel
from typing import Optional
from datetime import datetime
import uuid
 
app = FastAPI()
 
# Initialize Cassandra on startup
cluster = Cluster(['192.168.1.10'])
session = cluster.connect('demo')
session.row_factory = dict_factory
 
# Patch the session: aiosession() adds an awaitable execute_future()
aiosession(session)
 
# '?' placeholders only work in prepared statements
insert_user_stmt = session.prepare(
    """
    INSERT INTO users (user_id, email, full_name, created_at)
    VALUES (?, ?, ?, toTimestamp(now()))
    """
)
select_user_stmt = session.prepare("SELECT * FROM users WHERE user_id = ?")
 
class User(BaseModel):
    email: str
    full_name: str
 
class UserResponse(User):
    user_id: uuid.UUID
    created_at: Optional[datetime] = None
 
@app.on_event("shutdown")
def shutdown_event():
    cluster.shutdown()
 
@app.post("/users", response_model=UserResponse)
async def create_user(user: User):
    """Create a user"""
    user_id = uuid.uuid4()
 
    await session.execute_future(insert_user_stmt, (user_id, user.email, user.full_name))
 
    return UserResponse(
        user_id=user_id,
        email=user.email,
        full_name=user.full_name,
    )
 
@app.get("/users/{user_id}", response_model=UserResponse)
async def get_user(user_id: uuid.UUID):
    """Get a user"""
    rows = await session.execute_future(select_user_stmt, (user_id,))
 
    users = list(rows)
    if not users:
        raise HTTPException(status_code=404, detail="User not found")
 
    return UserResponse(**users[0])

Prepared Statements: performance and safety

Prepared statements are a must-have pattern for production:

# Bad: parsing CQL on every request
session.execute(
    f"INSERT INTO users (user_id, email) VALUES ({user_id}, '{email}')"
)
 
# Good: prepare once, reuse many times
insert_stmt = session.prepare(
    "INSERT INTO users (user_id, email, full_name) VALUES (?, ?, ?)"
)
 
# Prepared statement is parsed once, then Cassandra knows the query structure
# Instead of parsing CQL on every insert - just bind the parameters
for user in users:
    session.execute(insert_stmt, (user.id, user.email, user.name))

Advantages:

  • CQL is parsed once (saves CPU on the server)
  • Protection from CQL injection
  • Cassandra caches the execution plan

Prepared statements are especially important in loops. If you insert 10k rows, a prepared statement saves 10k CQL parsings - a tangible latency difference.

Data Modeling in Cassandra

Main principle: Query First Design

In PostgreSQL we model entities and relations (ER diagrams) and then build queries.

In Cassandra it is the opposite: first list the queries, then model tables for those queries.

Modeling algorithm:

  1. List all application queries (for example: "get the last 100 events of a user")
  2. Define the partition key (what will be in WHERE without ALLOW FILTERING)
  3. Define the clustering key (sorting inside the partition)
  4. Denormalize data (duplication is acceptable)
  5. Create separate tables for different query patterns

Example: blog platform

Application queries:

  1. Get an article by ID
  2. Get all articles of an author sorted by date
  3. Get the latest 20 articles (homepage)
  4. Get all comments for an article

Bad model (PostgreSQL style):

-- Anti-pattern: trying to emulate JOIN via WHERE
CREATE TABLE articles (
    article_id UUID PRIMARY KEY,
    author_id UUID,
    title TEXT,
    content TEXT,
    published_at TIMESTAMP
);
 
CREATE TABLE comments (
    comment_id UUID PRIMARY KEY,
    article_id UUID,
    author_id UUID,
    text TEXT
);
 
-- Problem 1: cannot efficiently get author's articles
-- SELECT * FROM articles WHERE author_id = ? ALLOW FILTERING;
-- ALLOW FILTERING scans ALL partitions - a disaster for performance!
 
-- Problem 2: no sorting by published_at without ALLOW FILTERING

Correct model (Query First):

-- Query 1: article by ID
CREATE TABLE articles_by_id (
    article_id UUID PRIMARY KEY,
    author_id UUID,
    title TEXT,
    content TEXT,
    published_at TIMESTAMP
);
 
-- Query 2: author's articles sorted by date
CREATE TABLE articles_by_author (
    author_id UUID,
    published_at TIMESTAMP,
    article_id UUID,
    title TEXT,
    content TEXT,  -- Denormalization: duplicate content
    PRIMARY KEY (author_id, published_at)
) WITH CLUSTERING ORDER BY (published_at DESC);
 
-- Query 3: recent articles (homepage)
CREATE TABLE articles_recent (
    bucket TEXT,  -- For example 'home_page'
    published_at TIMESTAMP,
    article_id UUID,
    author_id UUID,
    title TEXT,
    PRIMARY KEY (bucket, published_at)
) WITH CLUSTERING ORDER BY (published_at DESC);
 
-- Query 4: comments for an article
CREATE TABLE comments_by_article (
    article_id UUID,
    comment_id TIMEUUID,  -- TIMEUUID for sorting by creation time
    author_id UUID,
    text TEXT,
    PRIMARY KEY (article_id, comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);

Application code:

from cassandra.cluster import Cluster
from datetime import datetime
import uuid
 
cluster = Cluster(['192.168.1.10'])
session = cluster.connect('blog')
 
# '?' placeholders require prepared statements in the Python driver
insert_by_id = session.prepare(
    "INSERT INTO articles_by_id (article_id, author_id, title, content, published_at) VALUES (?, ?, ?, ?, ?)"
)
insert_by_author = session.prepare(
    "INSERT INTO articles_by_author (author_id, published_at, article_id, title, content) VALUES (?, ?, ?, ?, ?)"
)
insert_recent = session.prepare(
    "INSERT INTO articles_recent (bucket, published_at, article_id, author_id, title) VALUES (?, ?, ?, ?, ?)"
)
select_by_author = session.prepare(
    "SELECT * FROM articles_by_author WHERE author_id = ? LIMIT 20"
)
select_comments = session.prepare(
    "SELECT * FROM comments_by_article WHERE article_id = ?"
)
 
# Publish an article - write to THREE tables (denormalization)
article_id = uuid.uuid4()
author_id = uuid.uuid4()
title = "Apache Cassandra: first look"
content = "Expert breakdown of Cassandra..."
published_at = datetime.utcnow()
 
# Table 1: for query by ID
session.execute(insert_by_id, (article_id, author_id, title, content, published_at))
 
# Table 2: for author's articles
session.execute(insert_by_author, (author_id, published_at, article_id, title, content))
 
# Table 3: for homepage
session.execute(insert_recent, ('home_page', published_at, article_id, author_id, title))
 
# Read author's articles - a single-partition read by partition key
rows = session.execute(select_by_author, (author_id,))
 
# Read comments for an article - a single-partition read by partition key
rows = session.execute(select_comments, (article_id,))

What happens here:

  • One record (an article) physically lives in three tables with different keys
  • This is denormalization - data duplication for read performance
  • Each query pattern has its own table with an optimal key
  • No ALLOW FILTERING - every query uses the partition key

Good modeling rules

1. One query = one table

If you need different access patterns for the same data (by ID, by author, by date) - create different tables.

2. Denormalization is normal

Data duplication in Cassandra is fine. This is a deliberate trade-off: disk space for read performance.

3. Avoid large partitions

A partition is all rows with the same partition key. Recommendation: partition < 100MB, < 100k rows.

-- Bad: all events of all users in one partition
CREATE TABLE all_events (
    event_type TEXT,  -- partition key
    event_id TIMEUUID,
    user_id UUID,
    payload TEXT,
    PRIMARY KEY (event_type, event_id)
);
-- Problem: event_type='login' may contain billions of rows -> gigantic partition
 
-- Good: partition per user
CREATE TABLE user_events (
    user_id UUID,     -- partition key
    event_id TIMEUUID,
    event_type TEXT,
    payload TEXT,
    PRIMARY KEY (user_id, event_id)
);
-- user_id splits data into millions of small partitions

4. Use TIMEUUID for time-ordered data

TIMEUUID is a UUID v1 that embeds a timestamp. Ideal for a clustering key in time-series data.

CREATE TABLE messages (
    chat_id UUID,
    message_id TIMEUUID,  -- Contains creation timestamp
    sender_id UUID,
    text TEXT,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
 
-- Advantages:
-- 1. Automatic sorting by creation time
-- 2. Uniqueness even with millions of writes per second
-- 3. Can extract timestamp: SELECT dateOf(message_id) FROM messages
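
On the Python side a TIMEUUID maps to a version-1 uuid.UUID, and the driver ships helpers for the embedded timestamp. A small sketch (the helper functions live in cassandra.util):

import uuid
from datetime import datetime
from cassandra.util import uuid_from_time, datetime_from_uuid1
 
message_id = uuid.uuid1()                     # client-generated TIMEUUID
created_at = datetime_from_uuid1(message_id)  # extract the embedded timestamp
 
# Build a bound for a time-range query over the clustering key
since = uuid_from_time(datetime(2025, 1, 1))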

5. TTL for automatic deletion

-- Insert with TTL (Time To Live) 24 hours
INSERT INTO user_sessions (session_id, user_id, data)
VALUES (uuid(), uuid(), '{"key": "value"}')
USING TTL 86400;
 
-- Cassandra will delete this row in 24 hours automatically
-- No need for cron jobs to clean old data
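
You can also check how much lifetime a value has left with the TTL() function on a non-key column:

-- Remaining time to live (in seconds) for one session's data
SELECT TTL(data) FROM user_sessions WHERE session_id = 550e8400-e29b-41d4-a716-446655440000;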

Pitfalls and fixes

1. Tombstones and deleting data

Problem: In Cassandra a delete is not immediate. DELETE writes a tombstone (a deletion marker) that is kept for at least gc_grace_seconds and removed only by a later compaction.

-- DELETE creates a tombstone
DELETE FROM users WHERE user_id = ?;
 
-- Tombstone remains on disk until compaction (10 days by default)
-- If you delete a million rows -> a million tombstones
-- During SELECT Cassandra reads tombstones -> query slowdown

Solution 1: Use TTL instead of DELETE for temporary data:

-- Instead of DELETE after 24 hours use TTL
INSERT INTO sessions (session_id, data) VALUES (?, ?)
USING TTL 86400;
 
-- Cassandra deletes on its own in 24 hours, fewer tombstones

Solution 2: Tune gc_grace_seconds for faster tombstone cleanup:

-- By default tombstones are kept for 10 days (864000 seconds)
-- Lower this only if you run repair more often than the new window,
-- otherwise a lagging replica can "resurrect" deleted data
ALTER TABLE sessions WITH gc_grace_seconds = 86400;

Solution 3: Monitor tombstone_warn_threshold and tombstone_failure_threshold:

# cassandra.yaml
tombstone_warn_threshold: 1000 # Warning when 1000 tombstones in a query
tombstone_failure_threshold: 100000 # Error at 100k tombstones

2. Secondary Indexes - when they are NOT the solution

Problem: The temptation to create a secondary index for queries not by partition key:

-- Created a table with partition key = user_id
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email TEXT,
    full_name TEXT
);
 
-- Want to query by email -> create a secondary index
CREATE INDEX ON users (email);
 
-- The query works
SELECT * FROM users WHERE email = 'test@example.com';

Why this is bad:

  • Secondary index is not distributed - each node keeps a local index
  • A query by secondary index requires polling all nodes in the cluster
  • As the cluster grows to 100 nodes, every such query fans out to all 100
  • Latency grows linearly with the number of nodes

Fix: Create a separate table with the needed partition key:

-- Table for queries by email
CREATE TABLE users_by_email (
    email TEXT PRIMARY KEY,
    user_id UUID,
    full_name TEXT
);
 
-- Now query by email is O(1), uses partition key
SELECT * FROM users_by_email WHERE email = 'test@example.com';

3. ALLOW FILTERING - red flag

Problem: Cassandra does not allow WHERE on columns outside the partition or clustering key:

-- Error
SELECT * FROM users WHERE full_name = 'John';
-- Error: Cannot execute this query as it might involve data filtering
 
-- Cassandra suggests adding ALLOW FILTERING:
SELECT * FROM users WHERE full_name = 'John' ALLOW FILTERING;

Why ALLOW FILTERING is dangerous:

ALLOW FILTERING forces Cassandra to read all partitions and filter in memory. For a table with a million rows it means reading a million rows on every query.

Fix: Remodel the table for your query:

CREATE TABLE users_by_name (
    full_name TEXT,
    user_id UUID,
    email TEXT,
    PRIMARY KEY (full_name, user_id)
);

If you see ALLOW FILTERING in production code - it signals wrong data modeling. Redesign the table for the query pattern.

4. Hot Partitions

Problem: If the partition key is skewed, one partition can get most of the load.

-- Bad partition key: most requests go to one value
CREATE TABLE page_views (
    page_url TEXT,  -- partition key
    view_id TIMEUUID,
    user_id UUID,
    PRIMARY KEY (page_url, view_id)
);
 
-- Problem: homepage '/' gets 90% of the traffic
-- Partition '/' is overloaded while other nodes idle

Fix: Add bucketing:

-- Better: composite partition key
CREATE TABLE page_views (
    page_url TEXT,
    bucket INT,  -- Random number 0-99
    view_id TIMEUUID,
    user_id UUID,
    PRIMARY KEY ((page_url, bucket), view_id)
);
 
-- Now the homepage is spread across 100 partitions:
-- ('/', 0), ('/', 1), ..., ('/', 99)
-- Load is balanced across nodes

Write code:

import random
import uuid
 
# '?' placeholders require a prepared statement in the Python driver
insert_view = session.prepare(
    "INSERT INTO page_views (page_url, bucket, view_id, user_id) VALUES (?, ?, ?, ?)"
)
 
page_url = '/'
bucket = random.randint(0, 99)  # Random bucket 0-99
 
session.execute(insert_view, (page_url, bucket, uuid.uuid1(), user_id))

Read code:

# Read from all 100 buckets in parallel
from concurrent.futures import ThreadPoolExecutor
 
select_views = session.prepare(
    "SELECT * FROM page_views WHERE page_url = ? AND bucket = ? LIMIT 100"
)
 
def read_bucket(bucket):
    # Materialize the rows inside the worker thread
    return list(session.execute(select_views, (page_url, bucket)))
 
with ThreadPoolExecutor(max_workers=10) as executor:
    results = executor.map(read_bucket, range(100))
    all_views = [view for bucket_views in results for view in bucket_views]

5. Batch operations and performance

Problem: BATCH in Cassandra does NOT improve performance and often hurts it.

# Anti-pattern: batch across different partitions
from cassandra.query import BatchStatement
 
batch = BatchStatement()
for user in users:
    batch.add(insert_stmt, (user.id, user.email, user.name))
session.execute(batch)
 
# Problem: the coordinator node has to orchestrate writes to many
# partitions and becomes a bottleneck

Fix: Use async queries:

# Correct: parallel async inserts
futures = []
for user in users:
    future = session.execute_async(insert_stmt, (user.id, user.email, user.name))
    futures.append(future)
 
# Wait for all operations to complete
for future in futures:
    future.result()
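
The driver also ships a purpose-built helper for this pattern with bounded parallelism; a sketch using execute_concurrent_with_args:

from cassandra.concurrent import execute_concurrent_with_args
 
# Executes the prepared statement once per parameter tuple,
# keeping at most 100 requests in flight
params = [(user.id, user.email, user.name) for user in users]
results = execute_concurrent_with_args(
    session, insert_stmt, params, concurrency=100, raise_on_first_error=False
)
 
for success, result_or_exc in results:
    if not success:
        print(f"Insert failed: {result_or_exc}")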

When to use BATCH:

For atomic updates within a single partition, or to keep denormalized tables in sync:

BEGIN BATCH
    INSERT INTO users_by_id (user_id, email) VALUES (?, ?);
    INSERT INTO users_by_email (email, user_id) VALUES (?, ?);
APPLY BATCH;
 
-- These two rows live in different partitions, so this is a logged batch:
-- the batch log guarantees both writes eventually apply, or neither does
-- (atomicity without isolation). Only a single-partition batch is also
-- applied in isolation.

Monitoring and Observability

Key metrics

Node metrics:

  • Read Latency (p95, p99)
  • Write Latency (p95, p99)
  • Pending Compactions - number of pending compaction tasks
  • Heap Memory Usage
  • GC Pause Time

Table metrics:

  • SSTable Count - number of SSTable files on disk
  • Bloom Filter False Positive Rate
  • Tombstone Scanned - tombstones read during queries

Cluster metrics:

  • Hinted Handoff - number of deferred writes
  • Read Repair - frequency of inconsistency fixes on reads
  • Streaming - active data streams between nodes
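
Most of these signals are also available ad hoc from the command line via nodetool, before any dashboards are wired up:

# Cluster overview: node state, load, token ownership
nodetool status
 
# Per-table stats: SSTable count, tombstones scanned, latencies
nodetool tablestats monitoring.device_metrics
 
# Latency and partition-size histograms for a table
nodetool tablehistograms monitoring.device_metrics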

Prometheus + Grafana

Cassandra exposes metrics via JMX. Use JMX Exporter for Prometheus:

# docker-compose.yml
services:
  cassandra:
    image: cassandra:5.0
    environment:
      - JVM_OPTS=-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7070:/opt/jmx_exporter/config.yml
    ports:
      - "7070:7070" # Prometheus metrics
    volumes:
      - ./jmx_exporter:/opt/jmx_exporter
 
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
 
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

prometheus.yml:

scrape_configs:
  - job_name: "cassandra"
    static_configs:
      - targets: ["cassandra:7070"]

Query logging

Enable slow query log in cassandra.yaml:

# cassandra.yaml
slow_query_log_timeout: 500ms # Log queries longer than 500ms

Logs go to /var/log/cassandra/debug.log:

WARN  [ReadStage-2] 2025-12-03 10:15:32,145 ReadCommand.java:178 -
Query SELECT * FROM articles_by_author WHERE author_id = ? took 1234ms

Production Checklist

Before deploy

  • Replication: at least RF=3 for production
  • Topology: nodes in different availability zones (minimum 3 AZ)
  • Snitch: configure GossipingPropertyFileSnitch or Ec2Snitch for multi-AZ
  • Compaction: choose a strategy (SizeTieredCompactionStrategy vs LeveledCompactionStrategy)
  • JVM Heap: 8-16GB (no more than 50% RAM, no more than 32GB)
  • GC: use G1GC instead of CMS
  • Monitoring: set alerts on latency and pending compactions
  • Backups: automate snapshots (daily)
  • Tests: load testing with realistic data volume

Security

# cassandra.yaml
authenticator: PasswordAuthenticator # Instead of AllowAllAuthenticator
authorizer: CassandraAuthorizer # Instead of AllowAllAuthorizer
 
# Enable SSL for client-to-node
client_encryption_options:
  enabled: true
  keystore: /path/to/keystore.jks
  keystore_password: changeit
 
# Enable SSL for node-to-node
server_encryption_options:
  internode_encryption: all
  keystore: /path/to/keystore.jks
  keystore_password: changeit

Create a limited role:

-- Create a role for the app
CREATE ROLE app_user WITH PASSWORD = 'secure_password' AND LOGIN = true;
 
-- Grant rights only to a specific keyspace
GRANT SELECT, MODIFY ON KEYSPACE production_app TO app_user;
 
-- DROP was never granted, so app_user cannot drop the keyspace;
-- REVOKE removes a permission only if it was previously granted
REVOKE DROP ON KEYSPACE production_app FROM app_user;

Capacity planning

Formula for disk space:

Total Disk = (Data Size) * (Replication Factor) * (Compaction Overhead)
Compaction Overhead = 1.5 for STCS, 1.1 for LCS

Example: 1TB of data, RF=3, STCS:

Total Disk = 1TB * 3 * 1.5 = 4.5TB

Formula for number of nodes:

Nodes = Total Disk / (Disk per Node * 0.5)
0.5 = headroom for compaction and repair

Example: 4.5TB total, 500GB SSD per node:

Nodes = 4.5TB / (500GB * 0.5) = 18 nodes

Formula for RAM:

RAM per Node = (SSTable Index Size * 2) + Heap + OS Cache
SSTable Index Size ~ 1-5% of data on the node
Heap = 8-16GB
OS Cache = at least 8GB
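
A back-of-the-envelope helper tying these formulas together (a sketch; the constants are the rules of thumb above):

import math
 
def total_disk_tb(data_tb, rf=3, compaction_overhead=1.5):
    # Provisioned disk across the cluster: 1.5 for STCS, 1.1 for LCS
    return data_tb * rf * compaction_overhead
 
def nodes_needed(total_tb, disk_per_node_tb, headroom=0.5):
    # Keep ~50% of each node free for compaction and repair
    return math.ceil(total_tb / (disk_per_node_tb * headroom))
 
print(total_disk_tb(1.0))      # 4.5 TB for 1 TB of data, RF=3, STCS
print(nodes_needed(4.5, 0.5))  # 18 nodes with 500 GB per node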

Takeaways

You have learned to:

  • Understand Cassandra's architecture - peer-to-peer, no master nodes
  • Model data - query-first, denormalization
  • Integrate with Python - cassandra-driver, async version
  • Avoid pitfalls - tombstones, ALLOW FILTERING, hot partitions
  • Monitor the cluster - metrics, logs, alerts
  • Plan capacity - disk, RAM, nodes

Next steps:

  1. Study Lightweight Transactions (LWT) for atomic operations
  2. Try Materialized Views for automatic denormalization
  3. Configure Multi-Datacenter replication for geo distribution
  4. Explore Change Data Capture (CDC) for event streaming
  5. Implement Time Window Compaction Strategy for time-series data

Cassandra is a powerful tool for building scalable distributed systems. Start with a simple use case (time-series metrics), adopt the Query First modeling approach, and gradually move to more complex scenarios with multi-datacenter replication.

Cassandra turns the scaling problem into adding nodes, but it requires rethinking how you model data. Start with one keyspace and one table - add complexity gradually.