What is log compaction?
Log compaction is a feature that automatically removes superseded records from a topic’s partitions. For each record key, Kafka retains at least the most recent record and discards the earlier records it replaces, which is useful when only the latest value per key matters.
When log compaction is enabled for a topic, Kafka keeps track of the key of each record that is written to the topic. If a later record is written with the same key as an earlier record, the later record supersedes the earlier one, and the earlier record becomes eligible for deletion.
When a compaction cycle occurs, Kafka compares the keys of all records in the log and retains only the most recent record for each key. Records that have been superseded by a newer record with the same key are deleted.
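Conceptually, a compaction cycle keeps only the latest record per key. A minimal Python sketch of that semantics (the keys and values below are made up for illustration; Kafka additionally treats a record with a null value as a “tombstone” that eventually removes the key entirely):

```python
def compact(records):
    """Simulate Kafka log compaction: keep only the latest record per key.

    `records` is a list of (key, value) pairs in log order. A value of
    None models a Kafka tombstone, which marks the key for deletion.
    """
    latest = {}
    for key, value in records:
        latest[key] = value  # a later write supersedes any earlier one
    # Tombstoned keys (value None) are eventually removed entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example.com"),
    ("user-1", "alice@new.example"),  # supersedes the first record
    ("user-2", None),                 # tombstone: delete user-2
]
print(compact(log))  # [('user-1', 'alice@new.example')]
```

Note that compaction preserves the ordering of the surviving records and never reorders the log; it only removes records that a newer record with the same key has made obsolete.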
Log compaction is typically used in topics that store state information, such as the latest value for a particular key, and where it’s not necessary to keep all the history of the changes to that key.
It’s worth noting that log compaction operates on the key of each record, not on the value: if several records share a key but carry different values, only the most recent one is kept, and the earlier ones with that key are removed. Also, log compaction is not applied to every topic; it has to be enabled explicitly by setting the topic’s cleanup.policy to compact.
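For example, compaction can be enabled with the standard Kafka CLI tools, either when a topic is created or on an existing topic (the broker address and topic name here are placeholders):

```shell
# Enable compaction at topic creation time
kafka-topics.sh --create --topic user-emails \
  --bootstrap-server localhost:9092 \
  --config cleanup.policy=compact

# Or switch an existing topic to compaction
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name user-emails \
  --add-config cleanup.policy=compact
```

Related topic-level settings such as min.cleanable.dirty.ratio and delete.retention.ms tune how aggressively the log cleaner runs and how long tombstones are retained.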
It’s important to be aware that log compaction deliberately discards superseded records, so it’s not recommended for topics that store critical data where the complete history of changes must be preserved.