Compaction
Apache Cassandra: Compaction
Cassandra writes all incoming data to a commit log and to an in-memory data structure called a memtable. Once the memtable is full, its data is flushed to disk as an SSTable. SSTables are immutable, so updates and deletes never modify an existing file; they are written as new data in later SSTables. Over time multiple SSTables accumulate, and a single read may have to consult several of them. Compaction merges these SSTables, which helps with the following:
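• Removing deleted (tombstoned) data once the tombstone is older than the table's gc_grace_seconds.
• Consolidating the versions of each row into fewer SSTables, so reads have fewer files to check.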
• Reclaiming disk space used by expired TTL (time-to-live) data.
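During compaction, Cassandra merges several SSTables into a new one and keeps only the most recent version of each record. The new SSTable is smaller than the combined size of its inputs, and the input SSTables are deleted once it has been written.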
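The merge described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Cell` type, the `compact` function, and the use of plain dicts as SSTables are all hypothetical names chosen for illustration, and the handling of expired TTL data is simplified (it is dropped immediately rather than first becoming a tombstone).

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: Optional[str]              # None marks a tombstone (a delete)
    write_time: int                   # write timestamp; the newest one wins
    expires_at: Optional[int] = None  # absolute expiry time for TTL data


def compact(sstables, now, gc_grace=0):
    """Merge SSTables (dicts of key -> Cell) into one sorted SSTable."""
    merged = {}
    for table in sstables:
        for key, cell in table.items():
            current = merged.get(key)
            # Keep only the most recent version of each key.
            if current is None or cell.write_time > current.write_time:
                merged[key] = cell

    result = {}
    for key in sorted(merged):        # output stays sorted by key
        cell = merged[key]
        is_tombstone = cell.value is None
        is_expired = cell.expires_at is not None and cell.expires_at <= now
        if is_tombstone and cell.write_time + gc_grace <= now:
            continue                  # tombstone is old enough to purge
        if is_expired:
            continue                  # TTL has passed (simplified)
        result[key] = cell
    return result


old = {"a": Cell("v1", 1), "b": Cell("v1", 2), "c": Cell("tmp", 3, expires_at=50)}
new = {"a": Cell("v2", 10), "b": Cell(None, 11)}
print(compact([old, new], now=100))
```

With these inputs only key `a` survives, carrying its newer value `v2`: `b` was deleted and `c` has expired. Passing a large `gc_grace` keeps the tombstone for `b` in the output, which mirrors why recently deleted data still occupies space after a compaction.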
Key Types of Compaction in Cassandra
1. Size-Tiered Compaction Strategy (STCS):
• This is the default compaction strategy.
• It merges SSTables of similar size into larger SSTables over time.
• The goal is to reduce the total number of SSTables, improving read performance.
• Works well for write-heavy workloads but can cause latency spikes during large compactions.
Pros: Efficient for write-heavy workloads.
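Cons: Large compactions can cause read latency spikes and temporarily need substantial free disk space, because the input SSTables are kept until the merged SSTable has been written.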
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
) WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 4,
    'max_threshold': 32
};
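To make "SSTables of similar size" concrete, here is a rough Python sketch of size-tiered bucketing. It assumes the usual STCS rule that an SSTable joins a bucket when its size falls between 0.5 and 1.5 times the bucket's average; the function names are hypothetical, and the choice to compact the bucket of smallest SSTables first is a simplification made for this example, not a claim about how Cassandra prioritizes buckets.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size; start a new one.
            buckets.append([size])
    return buckets


def pick_compaction(sizes, min_threshold=4, max_threshold=32):
    """Return the SSTable sizes to merge next, or [] if nothing qualifies."""
    candidates = [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]
    if not candidates:
        return []
    # Assumption for this sketch: merge the cheapest (smallest) bucket first.
    smallest = min(candidates, key=lambda b: sum(b) / len(b))
    return smallest[:max_threshold]


print(size_tiered_buckets([10, 11, 9, 12, 100, 400]))
print(pick_compaction([10, 11, 9, 12, 100, 400]))
```

The four small SSTables form one bucket and reach `min_threshold`, so they are merged; the 100 MB and 400 MB SSTables sit alone in their buckets and wait until enough peers of similar size exist.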
2. Leveled Compaction Strategy (LCS):
• More suitable for read-heavy workloads.
• Data is organized into “levels” of fixed-size SSTables, with each level holding roughly ten times as much data as the one before it.
• Newly flushed SSTables land in Level 0 and are gradually merged into higher levels.
• Within each level above Level 0 the SSTables do not overlap, so a read touches at most one SSTable per level, resulting in predictable read performance.
• LCS rewrites data more often than STCS as it is promoted through the levels, so it generates more compaction I/O.
Pros: Efficient for read-heavy workloads with predictable read performance; needs less free disk headroom for compaction than STCS.
Cons: Generates considerably more compaction I/O than STCS, which makes it a poor fit for write-heavy workloads.
CREATE TABLE logs (
    log_id UUID PRIMARY KEY,
    event TEXT,
    event_time TIMESTAMP
) WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 160
};
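The effect of `sstable_size_in_mb` on the level layout is simple arithmetic, sketched below in Python. The sketch assumes each level holds ten times as much as the previous one, with Level 1 holding ten SSTables; the function names and the `fanout` parameter are hypothetical and only serve this illustration.

```python
def level_capacity_mb(level, sstable_size_mb=160, fanout=10):
    """Approximate capacity of a level (1, 2, 3, ...) in megabytes."""
    return sstable_size_mb * fanout ** level


def levels_needed(data_mb, sstable_size_mb=160, fanout=10):
    """Number of levels above Level 0 needed to hold data_mb of data."""
    level, remaining = 0, data_mb
    while remaining > 0:
        level += 1
        remaining -= level_capacity_mb(level, sstable_size_mb, fanout)
    return level


for level in (1, 2, 3):
    print(level, level_capacity_mb(level), "MB")
print(levels_needed(100_000))
```

Under these assumptions roughly 100 GB of data fits in three levels, so a read for one partition checks about three SSTables plus whatever currently sits in Level 0. A smaller `sstable_size_in_mb` means more, smaller SSTables and shallower capacity per level.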
3. Time-Window Compaction Strategy (TWCS):
• Best for time-series data.
• Groups SSTables into time windows (e.g., daily, weekly).
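• SSTables in the current window are compacted among themselves as new data arrives; once a window has passed, its SSTables are merged into a single SSTable.
• After that final merge a window's data is not rewritten again, which avoids the repeated rewriting that other strategies incur on time-series data.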
CREATE TABLE temperature_readings (
    sensor_id UUID,
    timestamp TIMESTAMP,
    temperature DECIMAL,
    PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};
Pros: Ideal for time-series data; prevents excessive rewriting of old data.
Cons: Poorly suited to other workloads. It assumes data arrives in roughly time order, so late writes, updates, or deletes of old data work against it.
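The grouping into time windows can be illustrated with a short Python sketch. It assumes each SSTable is assigned to a window by its newest write timestamp and that windows are aligned to the Unix epoch; `window_start` and `group_by_window` are hypothetical helper names, and the SSTables are represented as simple (name, timestamp) pairs.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, window=timedelta(days=1)):
    """Round a timezone-aware timestamp down to the start of its window."""
    return EPOCH + ((ts - EPOCH) // window) * window


def group_by_window(sstables, window=timedelta(days=1)):
    """Group (name, max_timestamp) pairs by the window they fall into."""
    windows = defaultdict(list)
    for name, max_ts in sstables:
        windows[window_start(max_ts, window)].append(name)
    return dict(windows)


utc = timezone.utc
sstables = [
    ("a", datetime(2024, 5, 1, 8, 0, tzinfo=utc)),
    ("b", datetime(2024, 5, 1, 23, 0, tzinfo=utc)),
    ("c", datetime(2024, 5, 2, 0, 30, tzinfo=utc)),
]
for start, names in group_by_window(sstables).items():
    print(start.date(), names)
```

With a one-day window, SSTables `a` and `b` share the 1 May window and would eventually be merged together, while `c` belongs to 2 May and is never compacted with them. This is why whole windows can be dropped cheaply once all their data has expired.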