How to replace a damaged local disk in a cluster - E-MapReduce

When you use an E-MapReduce (EMR) cluster built on instance families with local disks, such as instance families with local SSDs (i series) and big data instance families (d series), you may receive a notification that a local disk is damaged. This topic describes how to replace a damaged local disk in the cluster.

Precautions

We recommend that you resolve this issue by removing the abnormal node and adding a new one. This method prevents long-term impacts on your business operations.
Data on the original disk is lost after the disk is replaced. Ensure that your data has a sufficient number of replicas and is backed up before you proceed.
The disk replacement process includes stopping services, unmounting the disk, mounting a new disk, and restarting services. The replacement is usually completed within five business days. Before you perform the steps in this topic, evaluate whether the service's disk usage and the cluster's load can support your business operations while the services are stopped.

Procedure

Log on to the ECS console to view event details. The details include the instance ID, status, damaged disk ID, event progress, and related operations.

Step 1: Get information about the damaged disk

Log on to the node that contains the damaged disk using Secure Shell (SSH). For more information, see Log on to a cluster.

Run the following command to view the block device information.

```
lsblk
```

The response is similar to the following.

```
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdd    254:48   0  5.4T  0 disk /mnt/disk3
vdb    254:16   0  5.4T  0 disk /mnt/disk1
vde    254:64   0  5.4T  0 disk /mnt/disk4
vdc    254:32   0  5.4T  0 disk /mnt/disk2
vda    254:0    0  120G  0 disk
└─vda1 254:1    0  120G  0 part /
```

Run the following command to view the disk information.

```
sudo fdisk -l
```

The returned message is similar to the following.

```
Disk /dev/vdd: 5905.6 GB, 5905580032000 bytes, 11534336000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
```

From the output of the previous two steps, record the device name $device_name and the mount target $mount_path.
For example, if the device in the disk damage event is vdd, the device name is /dev/vdd and the mount target is /mnt/disk3.
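The lookup described above can be sketched as a small helper. This is only an illustration: the function name lookup_mount_path is hypothetical, and the sample text is copied from the lsblk output in this topic. On a real node you would pipe the output of lsblk into the function instead.

```shell
#!/bin/bash
# Hypothetical helper: read lsblk-style output from standard input and print
# the mount point of the device whose short name is passed as $1.
lookup_mount_path() {
  awk -v dev="$1" '$1 == dev && $NF ~ /^\// { print $NF }'
}

# Sample output taken from this topic (assumption: the damaged device is vdd).
sample='NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdd    254:48   0  5.4T  0 disk /mnt/disk3
vdb    254:16   0  5.4T  0 disk /mnt/disk1
vda    254:0    0  120G  0 disk'

device=vdd
device_name="/dev/$device"
mount_path=$(printf '%s\n' "$sample" | lookup_mount_path "$device")

echo "device_name=$device_name"
echo "mount_path=$mount_path"
```

A device without a mount point, such as vda in the sample, produces an empty result, which is a signal to check the event details again before you continue.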

Step 2: Isolate the damaged local disk

Stop the applications that read data from or write data to the damaged disk.
1. In the EMR console, click the cluster that contains the damaged disk. On the Cluster Services tab, find the EMR services that read from or write to the damaged disk. These services typically include storage services such as HDFS, HBase, and Kudu. In the area where the endpoint is deployed for the target service, choose > Stop to stop the service.
2. Run the following commands to stop the related management processes.
```
sudo crontab -l | grep -v "exporter_check.sh" | sudo crontab -
sudo service taihao_exporter stop
sudo service ilogtaild stop
sudo service ilogtaildclt stop
```
  Note: After you stop these management processes, the metric collection and log monitoring features for the node are affected. These features automatically recover after the disk is replaced and the processes are restarted.
You can also run the sudo fuser -mv $device_name command on the node to view the full list of processes that are using the disk, and then stop the services in the list from the EMR console.
Run the following command to set application-layer read and write fencing for the local disk.
```
sudo chmod 000 $mount_path
```
Run the following command to unmount the local disk.
```
sudo umount $device_name;sudo chmod 000 $mount_path
```
Important
If you do not unmount the disk, its device name may change after the disk is repaired and the fencing is removed. This may cause applications to read from or write to the wrong disk.
Update the fstab file.
1. Back up the existing /etc/fstab file.
2. Delete the record for the disk from the /etc/fstab file.
  For example, if the damaged disk in this topic is /dev/vdd, delete the record for that disk.
Start the applications that you stopped.
On the Cluster Services tab of the cluster that contains the damaged disk, find the EMR services that you stopped in Step 2. Then, for each service, in the area where the endpoint is deployed, choose > Start.
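The fstab update in this step (back up the file, then delete the record for the damaged disk) could be scripted roughly as follows. This is a sketch under assumptions: the function name remove_fstab_entry and the .bak suffix are hypothetical, and the record is assumed to identify the disk by its device path in the first field. Records that use a UUID or label are not matched.

```shell
#!/bin/bash
# Hypothetical helper: remove_fstab_entry DEVICE [FSTAB_FILE]
# Backs up the fstab file, then rewrites it without the record for DEVICE.
remove_fstab_entry() {
  local device="$1"
  local fstab="${2:-/etc/fstab}"

  # Keep a copy so that the original file can be restored if needed.
  cp -p "$fstab" "$fstab.bak" || return 1

  # Keep every line whose first field is not the device path.
  # Comments and blank lines are preserved.
  awk -v dev="$device" '$1 != dev' "$fstab.bak" > "$fstab"
}

# Example (requires root privileges on a real node):
#   remove_fstab_entry /dev/vdd
```

Passing a file path as the second argument lets you try the helper on a copy of /etc/fstab before you touch the real file.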

Step 3: Replace the disk

Repair the disk in the ECS console. For more information, see Isolate or repair local disks.

Step 4: Mount the disk

After the disk is repaired, mount it to use it as a new disk.

Run the following command to normalize the device name.
```
device_name=`echo "$device_name" | sed 's/x//1'`
```
This command normalizes device names. For example, a device name such as /dev/xvdk is changed to /dev/vdk.
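Note that the sed expression above removes the first x that appears anywhere in the value. As an alternative sketch, a stricter variant can rewrite only names that begin with /dev/xvd and leave all other names untouched. The function name normalize_device_name is hypothetical and is not part of the EMR tooling.

```shell
#!/bin/bash
# Hypothetical helper: rewrite /dev/xvd* to /dev/vd* and leave other names as is.
normalize_device_name() {
  echo "$1" | sed 's|^/dev/xvd|/dev/vd|'
}

normalize_device_name /dev/xvdk   # prints /dev/vdk
normalize_device_name /dev/vdd    # prints /dev/vdd (unchanged)
```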
Run the following command to create a mount directory.
```
sudo mkdir -p "$mount_path"
```
Run the following command to mount the disk.
```
sudo mount $device_name $mount_path;sudo chmod 755 $mount_path
```
If the disk fails to mount, perform the following steps:
1. Run the following command to create a primary partition on the disk. The two empty lines accept the default first and last sectors.
```
sudo fdisk $device_name << EOF
n
p
1


w
EOF
```
2. Run the following command to mount the disk again.
```
sudo mount $device_name $mount_path;sudo chmod 755 $mount_path
```
Before you modify the fstab file, set $fstype. Run the which mkfs.ext4 command to check whether ext4 exists. If it exists, set $fstype to ext4. Otherwise, set $fstype to ext3.
Run the following command to modify the fstab file.
```
echo "$device_name $mount_path $fstype defaults,noatime,nofail 0 0" | sudo tee -a /etc/fstab
```
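The choice of $fstype and the fstab record can be combined in a short sketch. The function names choose_fstype and add_fstab_entry are hypothetical, and command -v is used here in place of which as an assumption that the two are equivalent for this check. The helper appends the record only if an identical line does not already exist, so it is safe to run again.

```shell
#!/bin/bash
# Hypothetical helper: print ext4 if mkfs.ext4 is available, otherwise ext3.
choose_fstype() {
  if command -v mkfs.ext4 >/dev/null 2>&1; then
    echo ext4
  else
    echo ext3
  fi
}

# Hypothetical helper: add_fstab_entry DEVICE MOUNT_PATH FSTYPE [FSTAB_FILE]
# Appends the record unless the exact same line is already present.
add_fstab_entry() {
  local entry="$1 $2 $3 defaults,noatime,nofail 0 0"
  local fstab="${4:-/etc/fstab}"
  # -x matches whole lines only, -F treats the record as a fixed string.
  grep -qxF -- "$entry" "$fstab" || echo "$entry" >> "$fstab"
}

# Example (requires root privileges on a real node):
#   add_fstab_entry /dev/vdd /mnt/disk3 "$(choose_fstype)"
```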

Create a script file and select the script code based on the cluster type.

DataLake, DataFlow, OLAP, DataServing, and Custom clusters

```
#!/bin/bash
# Read the mount path of the new disk from the -p option.
while getopts p: opt
do
  case "${opt}" in
    p) mount_path=${OPTARG};;
  esac
done

# Stop if no mount path is given, so that no directories are created under /.
if [ -z "$mount_path" ]; then
  echo "Usage: $0 -p <mount_path>" >&2
  exit 1
fi

# make_dir OWNER:GROUP MODE SUBDIR
# Creates $mount_path/SUBDIR and sets its owner and permissions.
make_dir() {
  sudo mkdir -p "$mount_path/$3"
  sudo chown "$1" "$mount_path/$3"
  sudo chmod "$2" "$mount_path/$3"
}

# Service data directories
make_dir flink:hadoop 775 flink
make_dir hadoop:hadoop 755 hadoop
make_dir hdfs:hadoop 750 hdfs
make_dir root:root 755 yarn
make_dir impala:hadoop 755 impala
make_dir root:root 755 jindodata
make_dir root:root 755 jindosdk
make_dir root:root 755 kafka
make_dir root:root 755 kudu
make_dir root:root 755 mapred
make_dir root:root 755 starrocks
make_dir clickhouse:clickhouse 755 clickhouse
make_dir root:root 755 doris

# Log directories
make_dir root:root 755 log
make_dir clickhouse:clickhouse 755 log/clickhouse
make_dir kafka:hadoop 755 log/kafka
make_dir kafka:hadoop 755 log/kafka-rest-proxy
make_dir kafka:hadoop 755 log/kafka-schema-registry
make_dir kafka:hadoop 755 log/cruise-control
make_dir doris:doris 755 log/doris
make_dir hadoop:hadoop 755 log/celeborn
make_dir flink:hadoop 775 log/flink
make_dir root:root 755 log/flume
make_dir root:root 777 log/gmetric
make_dir hdfs:hadoop 755 log/hadoop-hdfs
make_dir hbase:hadoop 755 log/hbase
make_dir root:root 775 log/hive
make_dir impala:hadoop 755 log/impala
make_dir root:root 777 log/jindodata
make_dir root:root 777 log/jindosdk
make_dir kyuubi:hadoop 755 log/kyuubi
make_dir presto:hadoop 755 log/presto
make_dir spark:hadoop 755 log/spark
make_dir sssd:sssd 750 log/sssd
make_dir starrocks:starrocks 755 log/starrocks
make_dir taihao:taihao 755 log/taihao_exporter
make_dir trino:hadoop 755 log/trino
make_dir hadoop:hadoop 755 log/yarn
```
Data lake (Hadoop) clusters

```
#!/bin/bash
# Read the mount path of the new disk from the -p option.
while getopts p: opt
do
  case "${opt}" in
    p) mount_path=${OPTARG};;
  esac
done

# Stop if no mount path is given, so that no directories are created under /.
if [ -z "$mount_path" ]; then
  echo "Usage: $0 -p <mount_path>" >&2
  exit 1
fi

# make_dir OWNER:GROUP MODE SUBDIR
# Creates $mount_path/SUBDIR and sets its owner and permissions.
make_dir() {
  mkdir -p "$mount_path/$3"
  chown "$1" "$mount_path/$3"
  chmod "$2" "$mount_path/$3"
}

# Service data directories
make_dir hdfs:hadoop 1777 data
make_dir hadoop:hadoop 775 hadoop
make_dir hdfs:hadoop 755 hdfs
make_dir hadoop:hadoop 755 yarn
make_dir kudu:hadoop 755 kudu/master
make_dir kudu:hadoop 755 kudu/tserver

# Log directories
make_dir hadoop:hadoop 775 log
make_dir hdfs:hadoop 775 log/hadoop-hdfs
make_dir hadoop:hadoop 755 log/hadoop-yarn
make_dir hadoop:hadoop 755 log/hadoop-mapred
make_dir kudu:hadoop 755 log/kudu

# Runtime and temporary directories
make_dir hadoop:hadoop 777 run
make_dir hadoop:hadoop 777 tmp
```

Run the following commands to run the script file, create the service folders, and then delete the script. $file_path is the path to the script file.
```
chmod +x $file_path
sudo $file_path -p $mount_path
rm $file_path
```
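After the script runs, you may want to confirm that a service directory has the expected owner and permissions. The following is a minimal sketch that assumes GNU stat is available on the node. The function name check_dir is hypothetical, and the expected values in the example come from the script for your cluster type.

```shell
#!/bin/bash
# Hypothetical helper: check_dir DIR OWNER:GROUP MODE
# Compares the owner and octal permissions of DIR with the expected values.
check_dir() {
  actual="$(stat -c '%U:%G %a' "$1" 2>/dev/null)"
  if [ "$actual" = "$2 $3" ]; then
    echo "OK   $1"
  else
    echo "FAIL $1 (expected '$2 $3', found '${actual:-missing}')"
    return 1
  fi
}

# Example on a node (values from the DataLake script in this topic):
#   check_dir /mnt/disk3/hdfs hdfs:hadoop 750
```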

Use the new disk.

In the EMR console, restart the services that run on the node.

Run the following commands to start the management processes.

```
sudo service taihao_exporter start
sudo service ilogtaild start
sudo service ilogtaildclt start
(sudo crontab -l; echo "*/5 * * * * bash /usr/local/taihao_exporter/exporter_check.sh") | sudo crontab -
```

Verify that the disk is working correctly.
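One possible way to perform a basic check is to confirm that the path is a mount point and that a small file can be written to it and read back. This is only a sketch: the function name check_rw and the probe file name are hypothetical, and a successful check does not replace verification at the service level, such as confirming in the EMR console that the storage services report the disk as healthy.

```shell
#!/bin/bash
# Hypothetical helper: check_rw DIR
# Writes a small probe file under DIR, reads it back, and removes it.
check_rw() {
  local probe="$1/.disk_probe_$$"
  echo "probe" > "$probe" || return 1
  if [ "$(cat "$probe")" != "probe" ]; then
    rm -f "$probe"
    return 1
  fi
  rm -f "$probe"
}

# Example on a node (run as a user that can write to the mount path):
#   mountpoint -q /mnt/disk3 && check_rw /mnt/disk3 && echo "disk OK"
```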