Random downtime of CentOS 7 instances

Last Updated: May 10, 2022

Disclaimer: This article may contain information about third-party products. Such information is for reference only. Alibaba Cloud does not make a guarantee in any form of the performance and reliability of the third-party products, and potential impacts of operations on these products.

Problem description

A Linux instance may crash at random if it meets all three of the following conditions:

  • The instance was created from a CentOS 7.5 or 7.6 public or custom image.
  • The kernel version is 3.10.0-862 or 3.10.0-957.
  • Large file or directory operations are performed frequently.
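
The kernel version condition can be checked from a shell. The following is a minimal sketch, assuming that the affected builds are those whose `uname -r` output begins with 3.10.0-862 or 3.10.0-957. The function name `is_affected_kernel` is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: returns exit status 0 when the given kernel release
# string starts with one of the two versions listed above, 1 otherwise.
is_affected_kernel() {
    case "$1" in
        3.10.0-862|3.10.0-862.*|3.10.0-957|3.10.0-957.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the kernel that is currently running.
release="$(uname -r)"
if is_affected_kernel "$release"; then
    echo "Kernel $release is in the affected range."
else
    echo "Kernel $release is not in the affected range."
fi
```

On CentOS, the distribution release can be read from /etc/centos-release to check the first condition.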

Cause

The CentOS 7.5 and 7.6 kernels include a patch that adds support for the mq-deadline disk elevator (I/O scheduler). Because of a bug in this patch, the nr_phys_segments value of a disk request may exceed the max_segments limit set for the disk. When the virtio block driver detects this error, it deliberately triggers a kernel exception, and the resulting kernel panic brings the instance down. The kernel stack of such a crash is similar to the following.

[336627.578227] FS: 0000000000000000(0000) GS:ffff9612bfd00000(0000) knlGS:0000000000000000
[336627.579031] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[336627.579604] CR2: 00007f3c0aee4ac0 CR3: 00000000369c8000 CR4: 00000000003606e0
[336627.580317] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[336627.581029] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[336627.581740] Call Trace:
[336627.582004] [] ? __kmalloc+0x2e/0x230
[336627.582544] [] virtqueue_add_sgs+0x87/0xa0 [virtio_ring]
[336627.583256] [] __virtblk_add_req+0xc2/0x1d0 [virtio_blk]
[336627.583953] [] ? __blk_segment_map_sg+0x57/0x1a0
[336627.584580] [] ? __sbitmap_queue_get+0x2b/0xb0
[336627.585195] [] virtio_queue_rq+0x139/0x2b0 [virtio_blk]
[336627.585883] [] blk_mq_dispatch_rq_list+0x268/0x620
[336627.586526] [] ? elv_rb_del+0x2a/0x40
[336627.587064] [] blk_mq_do_dispatch_sched+0x7e/0x130
[336627.587707] [] blk_mq_sched_dispatch_requests+0x11e/0x1c0
[336627.588415] [] __blk_mq_run_hw_queue+0x50/0xc0
[336627.589031] [] blk_mq_run_work_fn+0x15/0x20
[336627.589621] [] process_one_work+0x17f/0x440
[336627.590212] [] worker_thread+0x126/0x3c0
[336627.590777] [] ? manage_workers.isra.25+0x2a0/0x2a0
[336627.592891] [] kthread+0xd1/0xe0
[336627.594856] [] ? insert_kthread_work+0x40/0x4
[336627.596928] [] ret_from_fork_nospec_begin+0x21/0x21
[336627.599043] [] ? insert_kthread_work+0x40/0x40
[336627.601105] Code: ff e9 06 fd ff ff 48 89 d9 44 89 f2 48 c7 c6 3c b3 22 c0 48 c7 c7 78 c0 22 c0 31 c0 e8 48 7c d7 c8 8b 43 60 e9 19 ff ff ff 0f 0b <0f> 0b e8 ea 07 00 00 8b 55 ac 48 c7 c6 88 b4 22 c0 48 c7 c7 a0
[336627.606857] RIP [] virtqueue_add+0x4a2/0x4d0 [virtio_ring]
[336627.609101] RSP
[336627.613759] ---[ end trace b23f6bcae8735444 ]---
[336627.615676] Kernel panic - not syncing: Fatal exception
[336628.691211] Shutting down cpus with NMI
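
To see which disks use the mq-deadline scheduler and what their max_segments limit is, the queue settings can be listed from sysfs. This is a minimal sketch, assuming the standard layout /sys/block/<disk>/queue/. The function name `show_queue_settings` and its optional directory argument are hypothetical; the argument exists only so that the function can be pointed at a test directory.

```shell
#!/bin/sh
# Hypothetical helper: prints the scheduler and max_segments of every disk
# found under the given root directory (default: /sys/block).
show_queue_settings() {
    root="${1:-/sys/block}"
    for q in "$root"/*/queue; do
        # Skip entries that have no readable scheduler file.
        [ -r "$q/scheduler" ] || continue
        disk="$(basename "$(dirname "$q")")"
        sched="$(cat "$q/scheduler")"
        segs="$(cat "$q/max_segments" 2>/dev/null || echo unknown)"
        echo "$disk scheduler=$sched max_segments=$segs"
    done
}

show_queue_settings
```

In the scheduler file, the active scheduler is the entry enclosed in square brackets.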

Solution

Take note of the following items:

  • Before you perform high-risk operations such as modifying the specifications or data of an Alibaba Cloud instance, we recommend that you check the disaster recovery and fault tolerance capabilities of the instance to ensure data security.
  • Before you modify the specifications or data of an Alibaba Cloud instance, such as an Elastic Compute Service (ECS) instance or an ApsaraDB RDS instance, we recommend that you create snapshots or enable backups for the instance. For example, you can enable log backups for an ApsaraDB RDS instance.
  • If you have granted specific users the permissions on sensitive information, such as usernames and passwords, or submitted sensitive information in the Alibaba Cloud Management Console, we recommend that you modify the sensitive information at the earliest opportunity.

Use one of the following solutions:

  1. Apply the new patch released by the Linux community to fix the kernel bug. Alternatively, download a newer kernel version from the community's official website, then manually compile and install it in the Linux instance. Then, restart the instance.
  2. If the preceding operations are not convenient, restart the instance in the console and let Alibaba Cloud fix the problem.
    Note: Be sure to restart the instance in the console. Do not restart the instance from within the operating system.

If the preceding solutions are inconvenient, you can use either of the following methods as a temporary workaround:

Method 1

Log on to the Linux instance through the management terminal, and run the following command to lower the max_sectors_kb parameter of all disks in the instance. Changing this parameter does not affect performance.

Tip: The default value of the max_sectors_kb parameter is 512. The command changes it to 384. In your environment, make sure that the new value is lower than the current value.

for f in /sys/block/*/queue/max_sectors_kb; do echo 384 > "$f"; done
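
Because an instance may have several disks with different current values, the parameter can also be lowered disk by disk. The sketch below assumes the standard sysfs path /sys/block/<disk>/queue/max_sectors_kb and only lowers values that are currently higher than the target. The function name `lower_max_sectors_kb` and its optional second argument (an alternative root directory for dry runs) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: lowers max_sectors_kb to the target value on every
# disk whose current value is higher. Usage: lower_max_sectors_kb 384
lower_max_sectors_kb() {
    target="$1"
    root="${2:-/sys/block}"
    for f in "$root"/*/queue/max_sectors_kb; do
        # Skip disks whose parameter file is missing or not writable.
        [ -w "$f" ] || continue
        current="$(cat "$f")"
        if [ "$current" -gt "$target" ]; then
            echo "$target" > "$f"
            echo "$f: $current -> $(cat "$f")"
        fi
    done
}

# Example (run as root): lower_max_sectors_kb 384
```

Values written to sysfs do not survive a restart, so the function would need to be run again after the instance restarts.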

Method 2

Log on to the Linux instance through the management terminal, and run the following command to change the scheduler parameter of all disks in the instance from mq-deadline to none. This change does not affect performance.

for f in /sys/block/*/queue/scheduler; do echo none > "$f"; done
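
To change only the disks that currently use mq-deadline and leave the others untouched, the active scheduler can be checked first. This is a minimal sketch, assuming the standard sysfs path /sys/block/<disk>/queue/scheduler, in which the active scheduler is shown in square brackets. The function name `set_scheduler_none` and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: writes "none" to the scheduler file of every disk
# whose active scheduler is mq-deadline. Usage: set_scheduler_none
set_scheduler_none() {
    root="${1:-/sys/block}"
    for f in "$root"/*/queue/scheduler; do
        # Skip disks whose scheduler file is missing or not writable.
        [ -w "$f" ] || continue
        case "$(cat "$f")" in
            *"[mq-deadline]"*)
                echo none > "$f"
                echo "$f: switched from mq-deadline to none"
                ;;
        esac
    done
}

# Example (run as root): set_scheduler_none
```

As with other sysfs settings, the change is lost when the instance restarts.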

Applicable scope

  • Elastic Compute Service (ECS)