How to manage full disks in Kafka

This topic describes how to perform O&M operations when the disk space of a Kafka cluster is full. In this topic, E-MapReduce (EMR) Kafka 2.4.1 is used.

Business scenario

Kafka stores log data on disks. When a disk fills up, the Kafka log directory on that disk goes offline, and the partition replicas stored there can no longer be read or written. This reduces the availability and fault tolerance of the affected partitions. It also increases the load on other brokers, because leadership of those partitions moves to replicas on other brokers. Resolve a full disk as soon as possible.

Overview

This topic covers two aspects of handling a full disk in a Kafka cluster: monitoring for full disks and recovering from them.

Monitor a full disk

Kafka service: You can configure alert rules for the OfflineLogDirectoryCount metric of EMR Kafka clusters in the CloudMonitor console to detect offline log directories in real time.

Recover a full disk

If the Kafka log directory on a disk becomes offline, you must first check whether the disk space is full.
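The check above can be sketched with standard tools. This is a minimal example, assuming a POSIX df and awk on the broker node. The function names disk_usage_pct and is_disk_full and the 95% default threshold are illustrative assumptions, not part of EMR or Kafka.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are assumptions, not EMR or Kafka tools).

# Print the usage percentage (integer, without the % sign) of the
# filesystem that holds the given directory.
disk_usage_pct() {
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. "87%".
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold.
# The default threshold of 95 is an arbitrary example value.
is_disk_full() {
  local pct
  pct=$(disk_usage_pct "$1")
  [ "$pct" -ge "${2:-95}" ]
}

# Example (not executed here):
#   is_disk_full /mnt/disk4/kafka/log && echo "log directory disk is full"
```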

When the disk space for a log directory is exhausted, consider the following O&M strategies:

Disk resizing: Increase disk capacity by resizing the cloud disk. This method is suitable when brokers use attached cloud disks. For more information, see the Resize a disk section of this topic.
Intra-node partition migration: Migrate partitions from a full disk to other disks on the same node. This method is suitable when disk usage on a broker node is imbalanced. For more information, see the Migrate partitions within a broker section of this topic.
Data cleanup: Delete log data from the full disk. This method is suitable when older data can be safely removed. For more information, see the Clear logs section of this topic.

Resize a disk

Description

If a disk on a broker is full, resize the disk to add capacity. This policy is simple to perform, carries low risk, and resolves the space shortage quickly.

Scenario

This policy is applicable to scenarios where disks are attached to a broker.

Procedure

Resize the data disks of the broker nodes in the E-MapReduce console. For more information, see Resize a disk.

Migrate partitions within a broker

Description

If the space of a disk is full on a broker, the Kafka log directory on the disk becomes offline. As a result, you cannot use the kafka-reassign-partitions.sh tool to migrate partitions. In this case, you can perform operations on the Elastic Compute Service (ECS) instance where the broker is deployed to move the partition replica data to other disks of the broker and modify the metadata in the corresponding Kafka data directory. This helps resolve the issue of insufficient disk space.

Scenario

This policy is applicable when disk usage on a broker is imbalanced: one disk is full while other disks on the same broker have relatively low usage.

Usage notes

This method only supports partition migration between disks within the same node.
Partition migration can cause I/O hotspots on disks, which may affect cluster performance. You must evaluate the impact of the data size and duration of each migration on your services.
Because this method involves non-standard operations, test it on a cluster that runs the same Kafka version before you apply it to a production cluster.

Procedure

The following steps reproduce an offline log directory on a test topic, and then migrate the partition by directly moving its files and modifying the related Kafka metadata. These are non-standard operations.

Create a test topic.
1. Log on to the master node of the Kafka cluster using SSH. For more information, see Log on to a cluster.
2. Run the following command to create a test topic. The partition replicas are distributed on broker 0 and broker 1.
```
kafka-topics.sh --bootstrap-server core-1-1:9092 --topic test-topic --replica-assignment 0:1 --create
```
  You can run the following command to view the topic details.
```
kafka-topics.sh --bootstrap-server core-1-1:9092 --topic test-topic --describe
```
  The returned information shows that broker 0 is in the In-Sync Replicas (ISR) list.
```
Topic: test-topic       PartitionCount: 1       ReplicationFactor: 2    Configs:
        Topic: test-topic       Partition: 0    Leader: 0       Replicas: 0,1   ISR: 0,1
```

Run the following command to simulate data writing.
```
kafka-producer-perf-test.sh --topic test-topic --record-size 1000 --num-records 600000000 --print-metrics --throughput 10240 --producer-props linger.ms=0 bootstrap.servers=core-1-1:9092
```

Modify log directory permissions on broker 0.
1. On the master node, switch to the emr-user account.
```
su emr-user
```
2. Log on to the corresponding core node without a password.
```
ssh core-1-1
```
3. Use sudo to obtain root permissions.
```
sudo su - root
```
4. Run the following command to find the disk where the partition is located.
```
sudo find / -name test-topic-0
```
  The following response indicates that the partition is in the /mnt/disk4/kafka/log directory.
```
/mnt/disk4/kafka/log/test-topic-0
```
5. Run the following command to set the permissions of the log directory on broker 0 to 000. This makes the directory inaccessible to Kafka, which simulates a disk error and takes the log directory offline.
```
sudo chmod 000 /mnt/disk4/kafka/log
```
6. Run the following command to check the status of test-topic.
```
kafka-topics.sh --bootstrap-server core-1-1:9092 --topic test-topic --describe
```
  The returned information shows that broker 0 is no longer in the ISR list.
```
Topic: test-topic       PartitionCount: 1       ReplicationFactor: 2    Configs:
        Topic: test-topic       Partition: 0    Leader: 1       Replicas: 0,1   ISR: 1
```
Stop the Kafka service on broker 0 in the EMR console.
Run the following command to move the partition of test-topic on broker 0 to another disk on the same node.
```
mv /mnt/disk4/kafka/log/test-topic-0 /mnt/disk1/kafka/log/
```
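Before running the move, it may be worth confirming that the destination disk has enough free space for the partition directory. The following is a minimal sketch, assuming POSIX du, df, and awk. The function name has_room_for is a hypothetical helper and not a Kafka or EMR tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not a Kafka or EMR tool).
# Succeed when the filesystem that holds DEST has more free space
# than the directory SRC currently occupies.
has_room_for() {
  local src=$1 dest=$2 need_kb free_kb
  # Size of the source directory in KiB.
  need_kb=$(du -sk "$src" | awk '{ print $1 }')
  # Available space on the destination filesystem in KiB (column 4 of df -Pk).
  free_kb=$(df -Pk "$dest" | awk 'NR == 2 { print $4 }')
  [ -n "$need_kb" ] && [ -n "$free_kb" ] && [ "$free_kb" -gt "$need_kb" ]
}

# Example (not executed here), using the paths from this procedure:
#   has_room_for /mnt/disk4/kafka/log/test-topic-0 /mnt/disk1/kafka/log \
#     && mv /mnt/disk4/kafka/log/test-topic-0 /mnt/disk1/kafka/log/
```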
Modify the metadata files.
The /mnt/disk4/kafka/log source directory and the /mnt/disk1/kafka/log destination directory each contain two checkpoint files that must be modified: replication-offset-checkpoint and recovery-point-offset-checkpoint.
- Modify the replication-offset-checkpoint file. Move the entries for test-topic from the replication-offset-checkpoint file in the source log directory to the replication-offset-checkpoint file in the destination log directory, and then update the total number of entries in each file that you changed. The following is an example of the modified destination file. In this example, 0 on the first line is the version number, 18 on the second line is the total number of entries (this must match the actual number of entry lines), and test-topic 0 4901378 on the last line is the migrated entry for test-topic:
```
0
18
__consumer_offsets 22 0
__consumer_offsets 8 0
__consumer_offsets 21 0
__consumer_offsets 9 0
__consumer_offsets 35 0
__consumer_offsets 33 0
__consumer_offsets 23 0
__consumer_offsets 47 0
__consumer_offsets 2 0
__consumer_offsets 14 0
__consumer_offsets 45 0
__consumer_offsets 10 0
__consumer_offsets 4 0
__consumer_offsets 17 0
__consumer_offsets 30 0
__consumer_offsets 36 0
__consumer_offsets 48 0
test-topic 0 4901378
```
- Modify the recovery-point-offset-checkpoint file. Move the entries for test-topic from the recovery-point-offset-checkpoint file in the source log directory to the recovery-point-offset-checkpoint file in the destination log directory, and then update the total number of entries in each file that you changed. The following is an example of the modified destination file. In this example, 13 on the second line is the total number of entries, and test-topic 0 4952628 is the migrated entry for test-topic:
```
0
13
__consumer_offsets 22 0
__consumer_offsets 8 0
__consumer_offsets 21 0
__consumer_offsets 9 0
__consumer_offsets 35 0
__consumer_offsets 33 0
__consumer_offsets 23 0
__consumer_offsets 47 0
__consumer_offsets 2 0
__consumer_offsets 14 0
test-topic 0 4952628
__consumer_offsets 45 0
__consumer_offsets 10 0
```
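The two checkpoint edits above follow the same pattern, so they can be scripted. The following is a minimal sketch, assuming the file layout shown in the samples (version on line 1, entry count on line 2, one entry per line after that). The function names rewrite_checkpoint and move_checkpoint_entries are hypothetical helpers, not Kafka tools. Copy both checkpoint files to a safe location before you try anything like this, and run it only while the broker is stopped.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are assumptions, not Kafka tools).

# Rewrite FILE as: its original version line, the entry count,
# then the entries read from stdin.
rewrite_checkpoint() {
  local file=$1 version tmp
  version=$(head -n 1 "$file")
  tmp=$(mktemp) || return 1
  # Buffer all of stdin (skipping blank lines) before FILE is touched.
  awk 'NF' > "$tmp"
  {
    printf '%s\n' "$version"
    awk 'END { print NR }' "$tmp"   # count must equal the number of entry lines
    cat "$tmp"
  } > "$file"                       # overwrite in place to keep owner and mode
  rm -f "$tmp"
}

# Move every entry for TOPIC from checkpoint file SRC to checkpoint file DEST
# and fix the entry count in both files.
move_checkpoint_entries() {
  local topic=$1 src=$2 dest=$3 moved
  moved=$(mktemp) || return 1
  awk -v t="$topic" 'NR > 2 && $1 == t' "$src" > "$moved"
  if [ ! -s "$moved" ]; then
    echo "no entries for $topic in $src" >&2
    rm -f "$moved"
    return 1
  fi
  # Destination: existing entries followed by the migrated ones.
  { awk 'NR > 2' "$dest"; cat "$moved"; } | rewrite_checkpoint "$dest"
  # Source: everything except the migrated entries.
  awk -v t="$topic" 'NR > 2 && $1 != t' "$src" | rewrite_checkpoint "$src"
  rm -f "$moved"
}

# Example (not executed here), using the paths from this procedure:
#   for f in replication-offset-checkpoint recovery-point-offset-checkpoint; do
#     move_checkpoint_entries test-topic /mnt/disk4/kafka/log/$f /mnt/disk1/kafka/log/$f
#   done
```

This sketch moves all partitions of the topic at once. If only some partitions of a topic are moved, the match would also need to compare the partition number in the second column.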
Run the following command to restore the permissions of the source log directory on broker 0.
```
sudo chmod 755 /mnt/disk4/kafka/log
```
Start the Kafka service on broker 0 in the EMR console.

Run the following command to verify that the cluster status is normal.
```
kafka-topics.sh --bootstrap-server core-1-1:9092 --topic test-topic --describe
```
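To check the result without reading the output by eye, the describe output can be piped through a small filter. The following is a minimal sketch, assuming the output layout shown earlier in this topic. The function name broker_in_isr is a hypothetical helper. It matches the ISR label case-insensitively, because the exact capitalization may differ from the samples.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not a Kafka tool).
# Read `kafka-topics.sh --describe` output on stdin. Succeed only if the
# given broker ID is in the ISR of every partition that lists it as a replica.
broker_in_isr() {
  awk -v id="$1" '
    /Partition:/ {
      is_replica = 0; in_isr = 0
      for (i = 1; i < NF; i++) {
        key = tolower($i)
        if (key == "replicas:" || key == "isr:") {
          n = split($(i + 1), ids, ",")
          for (j = 1; j <= n; j++) {
            if (ids[j] == id) {
              if (key == "isr:") in_isr = 1; else is_replica = 1
            }
          }
        }
      }
      if (is_replica) { total++; if (!in_isr) missing++ }
    }
    END { if (total > 0 && missing == 0) exit 0; exit 1 }
  '
}

# Example (not executed here):
#   kafka-topics.sh --bootstrap-server core-1-1:9092 --topic test-topic --describe \
#     | broker_in_isr 0 && echo "broker 0 is back in the ISR"
```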

Clear logs

Description

If a disk on a broker is full, delete business log data in chronological order, starting from the earliest data, until sufficient disk space is released. Do not delete data in the internal topics of the Kafka cluster.

Scenario

This policy is applicable to scenarios where outdated business log data can be deleted from a full disk.

If the data retention period is not changed, the disk may fill up again soon. This policy is therefore best suited to cases where the data volume surges because of special circumstances.

Usage notes

Do not delete data from topics whose names start with an underscore (_).

Procedure

Log on to the affected machine.
Find the full disk and delete unnecessary business data.
Follow these principles for data cleanup:
- To avoid unnecessary data loss, do not delete Kafka data directories directly.
- Identify topics that consume a large amount of space or are no longer needed. For selected partitions of these topics, delete log segments starting from the oldest. For each segment, delete its .log, .index, and .timeindex files together. Do not delete data from internal topics, such as __consumer_offsets and _schema.
Restart the Kafka service on the affected broker to bring the log directory back online.
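To pick the files for the cleanup step, it can help to list the oldest segments of a partition before deleting anything. The following is a minimal sketch, assuming that segment files are named after their 20-digit, zero-padded base offset, so that sorting by name orders them from oldest to newest. The function name oldest_segment_files is a hypothetical helper. It only prints file names, and it always skips the newest segment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not a Kafka tool).
# Print the .log, .index, and .timeindex files of the COUNT oldest segments
# in a partition directory. The newest segment is never listed.
# This function only prints paths; it deletes nothing.
oldest_segment_files() {
  local dir=$1 count=${2:-1} name base ext
  # Zero-padded base offsets sort lexicographically in chronological order.
  ls "$dir" | grep -E '^[0-9]{20}\.log$' | sort | sed '$d' | head -n "$count" |
    while read -r name; do
      base=${name%.log}
      for ext in log index timeindex; do
        if [ -e "$dir/$base.$ext" ]; then
          echo "$dir/$base.$ext"
        fi
      done
    done
}

# Example (not executed here): review the list first, then delete manually.
#   oldest_segment_files /mnt/disk4/kafka/log/test-topic-0 3
```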