An Elastic Compute Service (ECS) instance that runs a Linux operating system goes down when the instance encounters exceptions such as out of memory (OOM) errors, blue screen freezes, or kernel panics, or when the instance receives system event notifications about operating system crashes. You can troubleshoot the issue by using the self-service diagnostics tool or system kernel logs.
Identify the cause of a downtime issue
You can use the following methods to identify the cause of the downtime issue.
Method 1: (Recommended) Use the self-service diagnostics tool
In the top navigation bar, select the region and resource group of the resource that you want to manage.
On the Instance Troubleshooting tab, choose Instance Connection Errors or Startup Exceptions > Instance Downtime, select the ID of the instance on which the downtime issue occurred, and then click Start.

Identify the cause of the downtime issue and resolve the issue based on the returned diagnostic result and solution.
Method 2: Use system event notifications
Go to ECS console - Events.
In the left-side navigation pane, click Unexpected O&M Events.
Click Diagnose Operating System Issue for the instance on which the downtime issue occurred.
Identify the cause of the downtime issue and resolve the issue based on the returned diagnostic result and solution.
Method 3: Use kdump to view kernel logs
If kdump is installed and configured on your instance, the instance generates the vmcore-dmesg.txt file when a downtime issue occurs on the instance. In the file, you can obtain the kernel logs that were generated when the downtime issue occurred. The call trace information (typically, the information that starts with Call Trace:) in the kernel logs can help you identify the cause of the downtime issue and resolve the issue.
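The lookup described above can be sketched as a small shell helper. It is an illustration only: the /var/crash directory is an assumed default (the actual location depends on the path setting in the kdump configuration of your distribution), and show_call_trace is a hypothetical helper name, not part of kdump.

```shell
#!/bin/sh
# Print the "Call Trace:" line and up to MAX_LINES lines that follow it.
# Usage: show_call_trace FILE [MAX_LINES]
show_call_trace() {
    file="$1"
    max_lines="${2:-30}"
    grep -A "$max_lines" 'Call Trace:' "$file"
}

# Assumption: kdump writes its dumps under /var/crash. Override with CRASH_DIR if needed.
crash_dir="${CRASH_DIR:-/var/crash}"

# Pick the vmcore-dmesg.txt whose path sorts last. Dump directories are usually
# named with a timestamp, so this is typically the most recent dump.
latest=$(find "$crash_dir" -name 'vmcore-dmesg.txt' 2>/dev/null | sort | tail -n 1)

if [ -n "$latest" ]; then
    echo "Call trace from $latest:"
    show_call_trace "$latest"
else
    echo "No vmcore-dmesg.txt found under $crash_dir"
fi
```

The first function name after Call Trace:, together with the RIP: line above it, is usually the text to match against the headings in the common causes listed later in this topic.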
Hands-on practice
If you want to practice this topic, click Verify the Guestos panic diagnostics capability.
Common downtime causes and solutions
A Linux ECS instance goes down and the "not syncing: Out of memory: system-wide panic_on_oom is enabled" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "not syncing: Out of memory: system-wide panic_on_oom is enabled" message appears in a log. The following code shows an example of the call stack:
[3624965.306801] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[3624965.307824] CPU: 5 PID: 8510 Comm: AliDetect Kdump: loaded Tainted: GOE ------------ T 3.10.0-1127.10.1.el7.x86_64 #1
[3624965.308923] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[3624965.309671] Call Trace:
[3624965.309935] [<ffffffff8f37ffa5>] dump_stack+0x19/0x1b
[3624965.310444] [<ffffffff8f379541>] panic+0xe8/0x21f
[3624965.310913] [<ffffffff8edc26b5>] check_panic_on_oom+0x55/0x60
[3624965.311480] [<ffffffff8edc2aab>] out_of_memory+0x23b/0x4f0
[3624965.312027] [<ffffffff8f37b3e0>] __alloc_pages_slowpath+0x5db/0x729
[3624965.312628] [<ffffffff8edc91a6>] __alloc_pages_nodemask+0x436/0x450
[3624965.313233] [<ffffffff8ee18e78>] alloc_pages_current+0x98/0x110
[3624965.313808] [<ffffffff8edbe3d7>] __page_cache_alloc+0x97/0xb0
[3624965.314364] [<ffffffff8edc0f90>] filemap_fault+0x270/0x420
[3624965.314912] [<ffffffffc04ea7d6>] ext4_filemap_fault+0x36/0x50 [ext4]
[3624965.315530] [<ffffffff8ededf4a>] __do_fault.isra.61+0x8a/0x100
[3624965.316095] [<ffffffff8edee4fc>] do_read_fault.isra.63+0x4c/0x1b0
[3624965.316680] [<ffffffff8edf5d60>] handle_mm_fault+0xa20/0xfb0
[3624965.317231] [<ffffffff8f38d653>] __do_page_fault+0x213/0x500
[3624965.317775] [<ffffffff8f38da26>] trace_do_page_fault+0x56/0x150
[3624965.318378] [<ffffffff8f38cfa2>] do_async_page_fault+0x22/0xf0
[3624965.318954] [<ffffffff8f3897a8>] async_page_fault+0x28/0x30
Cause
An OOM error occurs on the instance due to insufficient memory resources. In addition, the vm.panic_on_oom kernel parameter is set to 1 or 2.
If the kernel parameter is set to 1, a kernel panic may occur or the OOM killer may start when the memory resources are insufficient.
If the kernel parameter is set to 2, a kernel panic is forcefully triggered when the memory resources are insufficient.
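To see which of these behaviors currently applies, you can read the parameter from /proc/sys/vm/panic_on_oom. The following sketch only reads the value and prints the matching description from this topic. describe_panic_on_oom is a hypothetical helper name used for illustration.

```shell
#!/bin/sh
# Map a vm.panic_on_oom value to the behavior described in this topic.
describe_panic_on_oom() {
    case "$1" in
        0) echo "0: the OOM killer starts when memory is insufficient" ;;
        1) echo "1: a kernel panic may occur or the OOM killer may start" ;;
        2) echo "2: a kernel panic is forcefully triggered" ;;
        *) echo "unexpected value: $1" ;;
    esac
}

# Read the live value if the file is readable. Nothing is changed.
if [ -r /proc/sys/vm/panic_on_oom ]; then
    describe_panic_on_oom "$(cat /proc/sys/vm/panic_on_oom)"
fi
```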
Solutions
Solution 1: Set the vm.panic_on_oom kernel parameter to 0
Set the vm.panic_on_oom kernel parameter to 0 to start the OOM killer when the memory resources are insufficient.
Important: If you set the vm.panic_on_oom kernel parameter to 0, the system starts the OOM killer to terminate processes that consume a large amount of memory resources when the memory resources are insufficient. The termination may affect system stability and running applications. Before you change the parameter value, make sure that you understand the impacts of the change on your instance and evaluate the memory management and application requirements of the system.
Connect to the ECS instance.
Run the following command to open the /etc/sysctl.conf file:
sudo vim /etc/sysctl.conf
Press the I key and make the following configuration:
vm.panic_on_oom = 0
The configuration prevents the system from crashing when the memory resources are insufficient.
Press the Esc key and then enter :wq to save the file and exit the editor.
Run the following command to load the change in the sysctl.conf file:
sudo sysctl -p
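If you prefer not to edit the file interactively, the same change can be scripted. The following is a minimal sketch, not an official procedure: set_sysctl_conf_value is a hypothetical helper, and it relies on the -i option of GNU sed, which is assumed to be available on the instance. Try it on a copy of the file first.

```shell
#!/bin/sh
# Set "KEY = VALUE" in a sysctl-style configuration file: replace an existing
# assignment of KEY if one exists, otherwise append a new line.
# Usage: set_sysctl_conf_value FILE KEY VALUE
set_sysctl_conf_value() {
    conf="$1"
    key="$2"
    value="$3"
    # Escape dots in the key so that they match literally in the regular expressions.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$conf"; then
        sed -i -E "s#^[[:space:]]*${pattern}[[:space:]]*=.*#${key} = ${value}#" "$conf"
    else
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}

# Example (run as root on the instance, then load the change):
#   set_sysctl_conf_value /etc/sysctl.conf vm.panic_on_oom 0
#   sysctl -p
```

After the change is loaded, reading /proc/sys/vm/panic_on_oom should return 0.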
Solution 2: Optimize memory usage
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
In most cases, OOM errors occur due to insufficient memory resources. You can check whether the memory usage is normal based on your business requirements. Then, increase the memory capacity of the system or reduce memory usage by using one of the following methods:
Upgrade the instance type
This method allows you to obtain more memory resources. For more information, see Change instance types.
Optimize applications
Check and optimize the memory usage of applications. For example, you can reduce memory leaks, optimize algorithms, or modify configurations.
A Linux ECS instance goes down and the "RIP: tcp_create_openreq_child" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP: tcp_create_openreq_child" message appears in a log. The following code shows an example of the call stack:
[8343753.027138] Oops: 0000 [#1] SMP PTI
[8343753.027431] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.4.0-122-generic #138-Ubuntu
[8343753.028127] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[8343753.028728] RIP: 0010:tcp_create_openreq_child+0x2fd/0x410
...
[8343753.036508] Call Trace:
[8343753.036710] <IRQ>
[8343753.036886] tcp_v4_syn_recv_sock+0x5a/0x400
[8343753.037234] tcp_get_cookie_sock+0x48/0x150
[8343753.037564] cookie_v4_check+0x581/0x6d0
[8343753.037880] tcp_v4_do_rcv+0x1a5/0x200
[8343753.038184] tcp_v4_rcv+0xc76/0xd10
[8343753.038551] ip_protocol_deliver_rcu+0x30/0x1b0
[8343753.038980] ip_local_deliver_finish+0x48/0x50
[8343753.039335] ip_local_deliver+0x73/0xf0
Cause
The downtime issue is caused by a bug in the operating system kernel. The bug leads to a null pointer dereference, which triggers the protection mechanism of the system and brings the instance down. For more information, see Bug details.
Solution
Upgrade the kernel version of the instance operating system to 5.4.0-123.139 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
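Before and after a kernel upgrade, it can help to compare the running kernel release with the version that contains the fix. The following sketch uses sort -V for a rough version-string comparison. kernel_at_least is a hypothetical helper, and the default of 5.4.0-123 is taken from the solution above. Package versions and the output of uname -r can differ in format between distributions, so treat the result as a hint and confirm it with the package manager of your distribution.

```shell
#!/bin/sh
# Return success if version string CURRENT sorts at or after REQUIRED.
# Usage: kernel_at_least CURRENT REQUIRED
kernel_at_least() {
    current="$1"
    required="$2"
    # sort -V orders version strings; the first line is the lower of the two.
    lowest=$(printf '%s\n%s\n' "$current" "$required" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Assumption: REQUIRED_KERNEL holds the fixed version named in the solution.
required="${REQUIRED_KERNEL:-5.4.0-123}"
running=$(uname -r)

if kernel_at_least "$running" "$required"; then
    echo "Running kernel $running is at or above $required"
else
    echo "Running kernel $running is older than $required"
fi
```

The same helper can be reused for the other kernel-upgrade solutions in this topic by setting REQUIRED_KERNEL to the version named there.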
A Linux ECS instance goes down and the "sysrq_handle_crash" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and restarts. In addition, the "RIP: sysrq_handle_crash" message appears in a log. The following code shows an example of the call stack:
[ 7262.769377] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_powerclamp iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper virtio_balloon shpchp cryptd parport_pc parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel serio_raw drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy
[ 7262.774113] CPU: 1 PID: 3818 Comm: bash Not tainted 3.10.0-514.26.2.el7.x86_64 #1
[ 7262.774699] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[ 7262.775317] task: ffff88040d3d5e20 ti: ffff8803cb7ac000 task.ti: ffff8803cb7ac000
[ 7262.775904] RIP: 0010:[<ffffffff813ee1d6>] [<ffffffff813ee1d6>] sysrq_handle_crash+0x16/0x20
...
[ 7262.784790] Call Trace:
[ 7262.784992] [<ffffffff813ee9f7>] __handle_sysrq+0x107/0x170
[ 7262.785450] [<ffffffff813eee6f>] write_sysrq_trigger+0x2f/0x40
[ 7262.785915] [<ffffffff8126be0d>] proc_reg_write+0x3d/0x80
[ 7262.786355] [<ffffffff811fe9fd>] vfs_write+0xbd/0x1e0
[ 7262.786759] [<ffffffff811ff51f>] SyS_write+0x7f/0xe0
[ 7262.787172] [<ffffffff81697809>] system_call_fastpath+0x16/0x1b
Cause
The following command is run on the instance to trigger the downtime issue:
echo c > /proc/sysrq-trigger
Solution
In normal cases, do not run the echo c > /proc/sysrq-trigger command to trigger a downtime.
Important: The echo c > /proc/sysrq-trigger command triggers a kernel crash and an immediate reboot. In most cases, the command is used in tests or to force a kernel crash if the system cannot be shut down in the normal way.
A Linux ECS instance goes down and the "RIP:get_target_pstate_use_performance" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP:get_target_pstate_use_performance" message appears in a log. The following code shows an example of the call stack:
[ 1.076899] divide error: 0000 [#1] SMP
[ 1.077669] Modules linked in:
[ 1.078302] CPU: 4 PID: 9 Comm: rcu_sched Not tainted 3.10.0-1127.19.1.el7.x86_64 #1
[ 1.079519] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[ 1.080724] task: ffff91c8fa111070 ti: ffff91c8fa11c000 task.ti: ffff91c8fa11c000
[ 1.081919] RIP: 0010:[<ffffffff85dc3089>] [<ffffffff85dc3089>] get_target_pstate_use_performance+0x29/0xc0
[ 1.083355] RSP: 0000:ffff91c8fa11fb40 EFLAGS: 00010006
[ 1.093192] Call Trace:
[ 1.093715] [<ffffffff85dc4081>] intel_pstate_update_util+0x161/0x310
[ 1.094550] [<ffffffff858e9523>] ? load_balance+0x1a3/0xa10
[ 1.095321] [<ffffffff858e4e87>] update_curr+0x127/0x1e0
[ 1.096123] [<ffffffff858e52a8>] dequeue_entity+0x28/0x5c0
[ 1.096894] [<ffffffff8586d3be>] ? kvm_sched_clock_read+0x1e/0x30
[ 1.097702] [<ffffffff858e5893>] dequeue_task_fair+0x53/0x660
[ 1.098490] [<ffffffff858debe5>] ? sched_clock_cpu+0x85/0xc0
[ 1.099266] [<ffffffff858d7a56>] deactivate_task+0x46/0xd0
Cause
The downtime issue may occur because the current_pstate frequency value of the Intel pstate driver is initialized to 0 during the startup of the instance. When the system switches processes, it calls the Intel pstate driver to adjust the performance mode to adapt to changes in system load. If the current_pstate frequency value of the Intel pstate driver is 0, a divide-by-zero error may occur. The error can lead to a system crash.
Solution
Upgrade the kernel version of the instance operating system to 4.18 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
A Linux ECS instance goes down and the "not syncing: Out of memory and no killable processes" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "not syncing: Out of memory and no killable processes" message appears in a log. The following code shows an example of the call stack:
[217894.026467] Out of memory: Kill process 17807 (php-fpm) score 4 or sacrifice child
[217894.027560] Killed process 17807 (php-fpm) total-vm:386252kB, anon-rss:6972kB, file-rss:144kB, shmem-rss:9020kB
[217894.910947] php-fpm invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[217894.912175] php-fpm cpuset=/ mems_allowed=0
[217894.913100] CPU: 0 PID: 18534 Comm: php-fpm Tainted: GOE ------------ 3.10.0-957.21.3.el7.x86_64 #1
[217894.914510] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[217894.915780] Call Trace:
[217894.916607] [<ffffffff8ff63107>] dump_stack+0x19/0x1b
[217894.917775] [<ffffffff8ff5db2a>] dump_header+0x90/0x229
[217894.918914] [<ffffffff8f901292>] ? ktime_get_ts64+0x52/0xf0
[217894.919979] [<ffffffff8f9584df>] ? delayacct_end+0x8f/0xb0
[217894.921026] [<ffffffff8f9ba834>] oom_kill_process+0x254/0x3d0
[217894.922097] [<ffffffff8f9ba2dd>] ? oom_unkillable_task+0xcd/0x120
[217894.923248] [<ffffffff8f9ba386>] ? find_lock_task_mm+0x56/0xc0
[217894.924364] [<ffffffff8f9bb076>] out_of_memory+0x4b6/0x4f0
[217894.925513] [<ffffffff8ff5e62e>] __alloc_pages_slowpath+0x5d6/0x724
Cause
The system has run out of memory, and no processes can be terminated to free up memory. As a result, the system fails.
Solution
Check whether the memory usage is normal based on your business requirements. Then, increase the memory capacity of the system or reduce memory usage by using one of the following methods:
Upgrade the instance type
This method allows you to obtain more memory resources. For more information, see Change instance types.
Optimize applications
Check the ECS instance for processes that consume a large amount of memory resources and determine whether the memory usage is normal. If the memory usage is abnormal, optimize the memory usage. For example, you can reduce memory leaks, optimize algorithms, or modify configurations.
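The check described above can be started with a quick look at overall memory and the largest processes. The following sketch is an illustration under stated assumptions: it expects a MemAvailable field in /proc/meminfo (not present on very old kernels) and a procps-style ps that supports --sort. mem_available_percent is a hypothetical helper name.

```shell
#!/bin/sh
# Print the share of memory that is still available, as a whole-number percentage.
# Usage: mem_available_percent [MEMINFO_FILE]
mem_available_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

echo "Available memory: $(mem_available_percent)%"

# List the five processes with the largest resident set size (RSS, in KiB).
if command -v ps >/dev/null 2>&1; then
    ps -eo pid,rss,comm --sort=-rss | head -n 6
fi
```

Whether the reported usage is normal still depends on your business requirements. A process whose RSS keeps growing over time is a typical candidate for a memory leak.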
A Linux ECS instance goes down and the "RIP:__list_del_entry_valid.cold" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)" message appears in a log. The following code shows an example of the call stack:
[1072741.548729] list_del corruption, ffff91bc2ad47048->prev is LIST_POISON2 (dead000000000200)
[1072741.549507] ------------[ cut here ]------------
[1072741.549886] kernel BUG at lib/list_debug.c:50!
[1072741.550275] invalid opcode: 0000 [#1] SMP PTI
[1072741.550646] CPU: 0 PID: 1583643 Comm: kworker/0:1 Tainted: G OE --------- - - 4.18.0-305.3.1.el8.x86_64 #1
[1072741.551468] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[1072741.552048] Workqueue: cgroup_destroy css_release_work_fn
[1072741.552462] RIP: 0010:__list_del_entry_valid.cold.1+0x45/0x4c
...
[1072741.560426] Call Trace:
[1072741.560638] css_release_work_fn+0x3f/0x240
[1072741.560983] process_one_work+0x1a7/0x360
[1072741.561300] worker_thread+0x30/0x390
[1072741.561622] ? create_worker+0x1a0/0x1a0
[1072741.561933] kthread+0x116/0x130
[1072741.562195] ? kthread_flush_work_fn+0x10/0x10
[1072741.562557] ret_from_fork+0x35/0x40
[1072741.562843] Modules linked in: AliSecGuard(OE) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl joydev pcspkr virtio_balloon i2c_piix4 ip_tables xfs libcrc32c ata_generic cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata crc32c_intel virtio_net net_failover serio_raw failover virtio_console virtio_blk
[1072741.566968] Features: eBPF/event
[1072741.567302] ---[ end trace 8f40bd2bf2a072e5 ]---
Cause
The downtime issue is caused by a bug in the operating system kernel. The bug leads to a list_del corruption error, as indicated by LIST_POISON2 (dead000000000200). For more information, see Bug details.
Solution
Upgrade the kernel version of the instance operating system to kernel-4.18.0-305.12.1.el8_4 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
A Linux ECS instance goes down and the "RIP:module_put" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP:module_put" message appears in a log. The following code shows an example of the call stack:
[86389.969666] CPU: 2 PID: 1426 Comm: Syn-1203-Tx Tainted: GOE ------------ 3.10.0-1160.53.1.el7.x86_64 #1
[86389.970626] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[86389.971377] task: ffff983118bfc200 ti: ffff982defd58000 task.ti: ffff982defd58000
[86389.972034] RIP: 0010:[<ffffffff8c91956d>] [<ffffffff8c91956d>] module_put+0x1d/0x80
...
[86389.979170] Call Trace:
[86389.979378] [<ffffffff8ca53b40>] cdev_put+0x20/0x30
[86389.979768] [<ffffffff8ca5098f>] __fput+0x1ef/0x230
[86389.980151] [<ffffffff8ca50abe>] ____fput+0xe/0x10
[86389.980526] [<ffffffff8c8c299b>] task_work_run+0xbb/0xe0
[86389.980946] [<ffffffff8c8a1954>] do_exit+0x2d4/0xa30
[86389.981375] [<ffffffff8c91358f>] ? futex_wait+0x11f/0x280
Cause
A use-after-free error is triggered when a system process accesses memory that has already been released. The error may activate the protection mechanism of the operating system or cause data errors. As a result, the system crashes.
Note: Use-after-free is a common type of software vulnerability that occurs when a program uses or accesses memory that has already been released. It may cause unpredictable behavior, such as crashes, data corruption, data leaks, or execution of malicious code.
Solution
Upgrade the kernel version of the instance operating system to kernel-4.18.0-305.12.1.el8_4 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
A Linux ECS instance goes down and the "containerd: page allocation failure" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "containerd: page allocation failure" message appears in a log. The following code shows an example of the call stack:
[1558839.130515] ------------[ cut here ]------------
[1558839.131215] kernel BUG at lib/idr.c:1163!
[1558839.131797] invalid opcode: 0000 [#1] SMP
[1558839.132411] Modules linked in: binfmt_misc AliSecGuard(OE) AliSecProcFilter64(OE) AliSecNetFlt64(OE) xt_CT xt_multiport ipt_rpfilter iptable_raw ip_set_hash_net ip_set_hash_ip ipip tunnel4 ip_tunnel veth ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables iptable_mangle nf_conntrack_netlink xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark xt_addrtype xt_set ip_set_bitmap_port ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set nfnetlink dummy xt_comment iptable_nat nf_nat_ipv4 nf_nat iptable_filter tcp_diag inet_diag overlay(T) sunrpc nfit ppdev libnvdimm iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 nf_conntrack_ipv4 nf_defrag_ipv4 ip_vs_sh ip_vs_wrr
[1558839.141715] ip_vs_rr ip_vs nf_conntrack libcrc32c br_netfilter bridge stp llc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy virtio drm_panel_orientation_quirks
[1558839.147553] CPU: 6 PID: 21465 Comm: kworker/6:0 Tainted: G OE ------------ T 3.10.0-957.21.3.el7.x86_64 #1
[1558839.149181] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[1558839.150656] Workqueue: events free_work
[1558839.151766] task: ffff8fbc4d6e9040 ti: ffff8fb8b898c000 task.ti: ffff8fb8b898c000
[1558839.153196] RIP: 0010:[<ffffffff967774e1>] [<ffffffff967774e1>] ida_simple_remove+0x41/0x50
...
[1558839.171901] Call Trace:
[1558839.173133] [<ffffffff966306c4>] __mem_cgroup_free+0x234/0x250
[1558839.174750] [<ffffffff966306f5>] free_work+0x15/0x20
[1558839.176259] [<ffffffff964b9ebf>] process_one_work+0x17f/0x440
[1558839.177872] [<ffffffff964baf56>] worker_thread+0x126/0x3c0
[1558839.179421] [<ffffffff964bae30>] ? manage_workers.isra.25+0x2a0/0x2a0
[1558839.181092] [<ffffffff964c1da1>] kthread+0xd1/0xe0
[1558839.182839] [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40
[1558839.184543] [<ffffffff96b75c37>] ret_from_fork_nospec_begin+0x21/0x21
[1558839.186238] [<ffffffff964c1cd0>] ? insert_kthread_work+0x40/0x40
...
Cause
The downtime issue is caused by a bug in the operating system kernel. When the memory control group (memcg) feature is enabled, all registered kernel memory caches are added to the memcg_caches[] array. If no memory resources are available, a memory insufficiency issue occurs and the system may crash.
Solution
We recommend that you upgrade the kernel version of CentOS 7.7 to kernel-3.10.0-1062.el7 or later and upgrade the kernel version of CentOS 7.6 to kernel-3.10.0-957.27.2.el7 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
A Linux ECS instance goes down and the "RIP:blk_mq_rq_timed_out" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP:blk_mq_rq_timed_out" message appears in a log. The following code shows an example of the call stack:
[8837401.113325] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[8837401.114219] IP: [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
[8837401.114892] PGD 8000000885d08067 PUD e1beda067 PMD 0
[8837401.115471] Oops: 0000 [#1] SMP
[8837401.115855] Modules linked in: AliSecNetFlt64(OE) AliSecGuard(OE) AliSecProcFilter64(OE) xt_multiport veth ipt_rpfilter ip6t_rpfilter ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_set iptable_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_filter ip6table_raw ip6_tables ip_set_hash_ip ip_set_hash_net ip_set sch_htb xt_nat xt_statistic ipt_REJECT nf_reject_ipv4 nf_tables iptable_mangle xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat tcp_diag inet_diag nfsv3 nfs_acl nfs lockd grace fscache overlay(T) sunrpc nfit libnvdimm iosf_mbi crc32_pclmul ppdev virtio_balloon joydev ghash_clmulni_intel parport_pc aesni_intel parport lrw gf128mul glue_helper i2c_piix4 ablk_helper pcspkr cryptd ip_vs_rr ip_vs_sh ip_vs_wrr ip_vs nf_conntrack ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_console virtio_blk failover cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci virtio_ring floppy drm_panel_orientation_quirks virtio libcrc32c br_netfilter bridge stp llc [last unloaded: AliSecNetFlt64]
[8837401.130281] CPU: 0 PID: 163944 Comm: kworker/0:1H Kdump: loaded Tainted: G OE ------------ T 3.10.0-1160.80.1.el7.x86_64 #1
[8837401.133029] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8a46cfe 04/01/2014
[8837401.134621] Workqueue: kblockd blk_mq_timeout_work
[8837401.135916] task: ffff88258a0b6300 ti: ffff8820c2b9c000 task.ti: ffff8820c2b9c000
[8837401.137422] RIP: 0010:[<ffffffffae575638>] [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
[8837401.139091] RSP: 0018:ffff8820c2b9fd18 EFLAGS: 00010246
[8837401.140371] RAX: 0000000000000000 RBX: ffff8819b6ad0000 RCX: 0000000000000000
[8837401.141838] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8819b6ad0000
[8837401.143314] RBP: ffff8820c2b9fd20 R08: 000000030ec11230 R09: df98ad67960c8828
[8837401.144732] R10: df98ad67960c8828 R11: ffff8822d9e17f00 R12: ffff8819b6863240
[8837401.146161] R13: 0000000000000002 R14: 0000000000000020 R15: 0000000000000002
[8837401.147605] FS: 0000000000000000(0000) GS:ffff8829bfc00000(0000) knlGS:0000000000000000
[8837401.149177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8837401.150426] CR2: 00000000000000d0 CR3: 00000003e570a000 CR4: 00000000003606f0
[8837401.151844] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[8837401.153287] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[8837401.154667] Call Trace:
[8837401.155579] [<ffffffffae57572c>] blk_mq_check_expired+0x6c/0x80
[8837401.157057] [<ffffffffae578dac>] bt_iter+0x5c/0x70
[8837401.158357] [<ffffffffae57984b>] blk_mq_queue_tag_busy_iter+0x13b/0x320
[8837401.159675] [<ffffffffae2e84c9>] ? pick_next_entity+0xa9/0x190
[8837401.160968] [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0
[8837401.162414] [<ffffffffae5756c0>] ? blk_mq_rq_timed_out+0xa0/0xa0
[8837401.163748] [<ffffffffae57428b>] blk_mq_timeout_work+0x8b/0x180
[8837401.165062] [<ffffffffae2c319f>] process_one_work+0x17f/0x440
[8837401.166329] [<ffffffffae2c42e6>] worker_thread+0x126/0x3c0
[8837401.167541] [<ffffffffae2c41c0>] ? manage_workers.isra.26+0x2b0/0x2b0
[8837401.169048] [<ffffffffae2cb4d1>] kthread+0xd1/0xe0
[8837401.170311] [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40
[8837401.171514] [<ffffffffae9c51f7>] ret_from_fork_nospec_begin+0x21/0x21
[8837401.172861] [<ffffffffae2cb400>] ? insert_kthread_work+0x40/0x40
[8837401.174091] Code: 83 84 c6 80 00 00 00 01 e8 f6 fe ff ff 5d c3 cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 53 48 8b 57 58 48 8b 47 38 48 89 fb 83 e2 02 <48> 8b 80 d0 00 00 00 74 4c 48 83 78 10 00 74 50 48 ba 00 00 00
[8837401.178255] RIP [<ffffffffae575638>] blk_mq_rq_timed_out+0x18/0xa0
[8837401.179436] RSP <ffff8820c2b9fd18>
[8837401.180300] CR2: 00000000000000d0
Cause
The downtime issue is caused by a bug in the operating system kernel. A program accesses a null pointer, which triggers a memory access error. The error causes the instance to crash and go down. For more information, see Bug details.
Solution
Upgrade the kernel version of the instance operating system to kernel-3.10.0-1160.88.1.el7 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
A Linux ECS instance goes down and the "RIP:strnlen" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP:strnlen" message appears in a log. The following code shows an example of the call stack:
[86390.829326] BUG: unable to handle kernel paging request at 0000000100620100
[86390.829510] IP: [<ffffffff9ed7f2ad>] strnlen+0xd/0x40
[86390.829632] PGD 0
[86390.829685] Oops: 0000 [#1] SMP
[86390.829766] Modules linked in: AliSecGuard(OE) binfmt_misc xt_conntrack iptable_filter iptable_nat nf_nat_ipv4 arc4 emp(OE) nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat nf_conntrack eudp(E) libcrc32c ppdev intel_powerclamp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc virtio_balloon parport i2c_piix4 pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect virtio_net virtio_console virtio_blk sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common drm crc32c_intel serio_raw floppy virtio_pci virtio_ring virtio drm_panel_orientation_quirks
[86390.831199] CPU: 2 PID: 1311 Comm: KeepAlive Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
[86390.831410] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
[86390.831580] task: ffff97c77add9040 ti: ffff97c77ade0000 task.ti: ffff97c77ade0000
[86390.831742] RIP: 0010:[<ffffffff9ed7f2ad>] [<ffffffff9ed7f2ad>] strnlen+0xd/0x40
......
[86390.833643] Call Trace:
[86390.833699] [<ffffffff9ed8105b>] string.isra.7+0x3b/0xf0
[86390.833805] [<ffffffff9ed82771>] vsnprintf+0x201/0x6a0
[86390.833908] [<ffffffff9ed82c1d>] vscnprintf+0xd/0x30
[86390.834011] [<ffffffff9ea9a24b>] vprintk_emit+0x11b/0x510
[86390.834143] [<ffffffff9ea9a8a9>] ? vprintk_default+0x29/0x40
[86390.834277] [<ffffffff9ed77ef0>] ? kobject_put+0x50/0x60
[86390.834407] [<ffffffff9ea9a65f>] vprintk+0x1f/0x30
[86390.834517] [<ffffffff9ea975ef>] __warn+0x7f/0x100
[86390.834618] [<ffffffff9ea976cf>] warn_slowpath_fmt+0x5f/0x80
[86390.834746] [<ffffffffc02e2b64>] ? close_eudp_mmap_dev+0x1b4/0x200 [eudp]
[86390.834896] [<ffffffff9ed77ef0>] kobject_put+0x50/0x60
[86390.835013] [<ffffffff9ec466f8>] cdev_put+0x18/0x30
[86390.835125] [<ffffffff9ec4350a>] __fput+0x21a/0x260
[86390.835232] [<ffffffff9ec4363e>] ____fput+0xe/0x10
[86390.835340] [<ffffffff9eabe79b>] task_work_run+0xbb/0xe0
[86390.835459] [<ffffffff9ea9dc61>] do_exit+0x2d1/0xa40
[86390.835568] [<ffffffff9ea9e44f>] do_group_exit+0x3f/0xa0
[86390.835695] [<ffffffff9eaaf24e>] get_signal_to_deliver+0x1ce/0x5e0
[86390.835830] [<ffffffff9ea2b527>] do_signal+0x57/0x6f0
[86390.835942] [<ffffffff9eac57e0>] ? hrtimer_get_res+0x50/0x50
[86390.836068] [<ffffffff9ea2bc32>] do_notify_resume+0x72/0xc0
[86390.836202] [<ffffffff9f175124>] int_signal+0x12/0x17
...
Cause
A third-party module named eudp is installed in the system. The module has bugs, such as passing incorrect parameters to the strnlen function. The bugs cause the instance to go down.
Solution
We recommend that you uninstall the eudp module.
Important: Before you perform the operations in this solution on an instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
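To confirm whether a suspect third-party module such as eudp is currently loaded, you can look it up in /proc/modules, which lists one loaded module per line with the module name in the first field. The following is a minimal sketch, and module_loaded is a hypothetical helper name. How to uninstall the module depends on how it was installed, so that step is not shown.

```shell
#!/bin/sh
# Return success if a kernel module named NAME appears in the modules list.
# Usage: module_loaded NAME [MODULES_FILE]
module_loaded() {
    name="$1"
    modules_file="${2:-/proc/modules}"
    # The first field of each line is the module name. Exit 0 only on an exact match.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# The module name comes from the bracketed suffix in the call trace, such as [eudp].
for mod in eudp; do
    if module_loaded "$mod" 2>/dev/null; then
        echo "$mod is loaded"
    else
        echo "$mod is not loaded"
    fi
done
```

The same check applies to any other module name that appears in square brackets at the end of a call trace entry.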
A Linux ECS instance goes down and the "RIP:filp_close" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP:filp_close" message appears in a log. The following code shows an example of the call stack:
[ 1891.552008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000036
[ 1891.552149] IP: [<ffffffff8801c67e>] filp_close+0xe/0x90
[ 1891.552239] PGD 40819b067 PUD 40819a067 PMD 0
[ 1891.552321] Oops: 0000 [#1] SMP
[ 1891.552380] Modules linked in: AliSecGuard(OE) AliSecNetFlt64(OE) tampercore(OE) tampercfg(OE) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter intel_powerclamp crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport_pc parport i2c_piix4 shpchp virtio_balloon pcspkr ip_tables ext4 mbcache jbd2 cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net virtio_console virtio_blk drm crct10dif_pclmul crct10dif_common virtio_pci crc32c_intel virtio_ring i2c_core serio_raw virtio floppy
[ 1891.553945] CPU: 3 PID: 2778 Comm: AliHips Tainted: G OE ------------ 3.10.0-862.14.4.el7.x86_64 #1
[ 1891.554107] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 9e9f1cc 04/01/2014
[ 1891.554228] task: ffff88d4cd7e4f10 ti: ffff88d4c5af8000 task.ti: ffff88d4c5af8000
[ 1891.554346] RIP: 0010:[<ffffffff8801c67e>] [<ffffffff8801c67e>] filp_close+0xe/0x90
......
[ 1891.555727] Call Trace:
[ 1891.555772] [<ffffffffc08d0d7c>] is_pathsite+0x1ac/0x400 [tampercore]
[ 1891.555878] [<ffffffff88055e1a>] ? bh_lru_install+0x18a/0x1e0
[ 1891.555974] [<ffffffff880563fc>] ? __find_get_block+0xbc/0x120
[ 1891.556069] [<ffffffff8805648d>] ? __getblk+0x2d/0x300
[ 1891.556160] [<ffffffffc02d956b>] ? search_dir+0x8b/0x120 [ext4]
[ 1891.556258] [<ffffffff87ebeed5>] ? wake_up_bit+0x25/0x30
[ 1891.556345] [<ffffffff88055b2d>] ? __brelse+0x3d/0x50
[ 1891.556432] [<ffffffffc02d9a69>] ? ext4_find_entry+0x299/0x570 [ext4]
[ 1891.556536] [<ffffffff880380cd>] ? __d_instantiate+0x2d/0xe0
[ 1891.556629] [<ffffffff88037446>] ? _d_rehash+0x36/0x40
[ 1891.556712] [<ffffffff88037473>] ? d_rehash+0x23/0x40
[ 1891.556795] [<ffffffff8803866c>] ? d_splice_alias+0xdc/0x120
[ 1891.556891] [<ffffffffc02da368>] ? ext4_lookup+0x118/0x170 [ext4]
[ 1891.556993] [<ffffffff8802b2b3>] ? lookup_fast+0xb3/0x230
[ 1891.557080] [<ffffffff8802ca48>] ? link_path_walk+0x238/0x8b0
[ 1891.558026] [<ffffffff8809769b>] ? proc_pid_permission+0x9b/0xc0
[ 1891.558976] [<ffffffff8802dfea>] ? path_lookupat+0x7a/0x8b0
[ 1891.559917] [<ffffffffc08d20db>] tamperhack_mkdir.part.4+0x12b/0x190 [tampercore]
[ 1891.560888] [<ffffffffc08d2185>] tamperhack_mkdir+0x45/0x50 [tampercore]
[ 1891.561828] [<ffffffff8852579b>] system_call_fastpath+0x22/0x27
[ 1891.562736] Code: ff 00 00 00 00 e9 d3 fe ff ff 0f 1f 00 b8 ea ff ff ff eb 9d e8 c4 7c e7 ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <48> 8b 47 38 48 89 fb 48 85 c0 74 5b 48 8b 47 28 49 89 f4 48 85
[ 1891.564925] RIP [<ffffffff8801c67e>] filp_close+0xe/0x90
Cause
A third-party module named Tampercore is installed in the system. The module has bugs that trigger errors when the filp_close function is called. The errors cause the instance to go down.
Solution
We recommend that you uninstall or upgrade the Tampercore module.
Important: Before you perform the operations in the solution on the instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
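Before you uninstall or upgrade the module, it can help to confirm that the module is actually loaded. The following sketch assumes a POSIX shell and reads /proc/modules, in which the first field of each line is a module name. The function name module_loaded and its optional file argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: succeed if a kernel module with the given name is currently loaded.
# /proc/modules lists one loaded module per line; the first field is the name.
# The second argument (a file path) is a hypothetical override for trying the
# function on sample data.
module_loaded() {
    name=$1
    modules_file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Example (commented out so that the sketch has no side effects):
# if module_loaded tampercore; then
#     echo "tampercore is loaded"
# fi
```

The function only reports whether the module is loaded. How the module is uninstalled or upgraded depends on the software that installed it.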
A Linux ECS instance goes down and the "VFS: Unable to mount root fs on unknown-block" message appears in a log
Problem description
A Linux ECS instance repeatedly goes down during startup and cannot be used as expected. In addition, the "VFS: Unable to mount root fs on unknown-block" message appears in a log. The following code shows an example of the call stack:
[ 1.573197] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 1.574179] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 3.10.0-1160.6.1.el7.x86_64 #1
[ 1.575045] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[ 1.575900] Call Trace:
[ 1.576246] [<ffffffff8f381400>] dump_stack+0x19/0x1b
[ 1.576845] [<ffffffff8f37a958>] panic+0xe8/0x21f
[ 1.577433] [<ffffffff8f98b794>] mount_block_root+0x291/0x2a0
[ 1.578122] [<ffffffff8f98b7f6>] mount_root+0x53/0x56
[ 1.578719] [<ffffffff8f98b935>] prepare_namespace+0x13c/0x174
[ 1.579425] [<ffffffff8f98b412>] kernel_init_freeable+0x222/0x249
[ 1.580150] [<ffffffff8f98ab28>] ? initcall_blacklist+0xb0/0xb0
[ 1.580838] [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
[ 1.581462] [<ffffffff8f36fa9e>] kernel_init+0xe/0x100
[ 1.582073] [<ffffffff8f394df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 1.582814] [<ffffffff8f36fa90>] ? rest_init+0x80/0x80
Cause
The root file system (rootfs) is damaged because a kernel update was interrupted or failed. As a result, the instance cannot find the file system of the root partition during startup.
Solution
We recommend that you replace the system disk of the instance or roll back disks by using snapshots that you created for the disks. For more information, see Replace the operating system of an instance or Roll back a disk by using a snapshot.
Important: Before you perform the operations in the solution on the instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
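One symptom that is sometimes associated with an interrupted kernel update is a kernel image in /boot that has no matching initramfs image. This is an assumption and is not confirmed by the source. The check can be run only while the instance still boots, for example right after a kernel update and before a restart. The following sketch assumes RHEL or CentOS file names (vmlinuz-<version> and initramfs-<version>.img). The function name kernels_missing_initramfs is hypothetical.

```shell
#!/bin/sh
# Sketch: print the version of every installed kernel image that has no
# non-empty initramfs image next to it.
# Assumes RHEL/CentOS-style names: vmlinuz-<version>, initramfs-<version>.img.
# The optional argument (boot directory) is a hypothetical override for trying
# the function on sample data.
kernels_missing_initramfs() {
    boot_dir=${1:-/boot}
    for kernel in "$boot_dir"/vmlinuz-*; do
        [ -e "$kernel" ] || continue              # the glob matched nothing
        version=${kernel##*/vmlinuz-}             # strip directory and prefix
        [ -s "$boot_dir/initramfs-$version.img" ] || echo "$version"
    done
}

# Example (commented out): list kernels that may fail to boot.
# kernels_missing_initramfs
```

An empty result does not prove that the root file system is intact. It only rules out this one symptom.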
A Linux ECS instance goes down and the "RIP:virtio_check_driver_offered_feature" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "RIP:virtio_check_driver_offered_feature" message appears in a log. The following code shows an example of the call stack:
[55686.388353] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
[55686.389223] IP: [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
[55686.390030] PGD 229af2067 PUD 21cbac067 PMD 0
[55686.390514] Oops: 0000 [#1] SMP
[55686.390867] Modules linked in: unix_diag AliSecGuard(OE) udp_diag tcp_diag inet_diag joydev binfmt_misc xfs libcrc32c dm_mod kvm_amd kvm irqbypass crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper parport_pc ablk_helper cryptd virtio_balloon pcspkr parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_blk virtio_console cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common ata_piix crc32c_intel virtio_pci libata serio_raw virtio_ring virtio drm_panel_orientation_quirks floppy
[55686.396603] CPU: 0 PID: 19222 Comm: fdisk Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.1.2.el7.x86_64 #1
[55686.397848] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
[55686.398578] task: ffff964836e8e2a0 ti: ffff964860370000 task.ti: ffff964860370000
[55686.399303] RIP: 0010:[<ffffffffc0047450>] [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
....
[55686.406216] Call Trace:
[55686.406473] [<ffffffffc0102b4c>] virtblk_ioctl+0x3c/0x70 [virtio_blk]
[55686.407098] [<ffffffff955608b5>] __blkdev_driver_ioctl+0x25/0x40
[55686.407697] [<ffffffffc03b5024>] dm_blk_ioctl+0x74/0xb0 [dm_mod]
[55686.408289] [<ffffffff955612fa>] blkdev_ioctl+0x28a/0xa20
[55686.408817] [<ffffffff95488771>] block_ioctl+0x41/0x50
[55686.409319] [<ffffffff9545d9e0>] do_vfs_ioctl+0x3a0/0x5a0
[55686.409845] [<ffffffff95305a82>] ? ktime_get+0x52/0xe0
[55686.410345] [<ffffffff955024ec>] ? security_file_ioctl+0x1c/0x20
[55686.410930] [<ffffffff9545dc81>] SyS_ioctl+0xa1/0xc0
[55686.411429] [<ffffffff9598cede>] system_call_fastpath+0x25/0x2a
[55686.411999] Code: d5 89 de 48 c7 c7 e0 93 04 c0 e8 4c 98 53 d5 5b 5d c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 8b 8f a0 00 00 00 48 89 e5 <8b> 91 90 00 00 00 85 d2 74 2c 48 8b 81 88 00 00 00 39 30 74 59
[55686.414738] RIP [<ffffffffc0047450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
Cause
Logical Volume Manager (LVM) is used and a logical volume (LV) is associated with a device that was deleted, such as the vdc device. LVM retains the configurations of the device. When you run commands that are related to the device, such as the blkid or fdisk command, the instance crashes.
Solutions
Solution 1: Run LVM commands to delete the configurations of the nonexistent device to ensure that LVM contains correct device settings.
Solution 2: Upgrade the kernel version to kernel-3.10.0-1160.6.1.el7 or later. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
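To decide whether Solution 2 applies, you can compare the running kernel with the version named above. The following sketch assumes GNU sort, whose -V option sorts version strings, and an x86_64 instance. The function name kernel_at_least is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the kernel release in $1 is at least the release in $2.
# Relies on GNU sort -V (version sort): if the required release sorts first
# (or both are equal), the current release is new enough.
kernel_at_least() {
    current=${1%.x86_64}      # drop the architecture suffix, if present
    required=$2
    lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Example (commented out): compare the running kernel with the fixed version.
# kernel_at_least "$(uname -r)" 3.10.0-1160.6.1.el7 || echo "kernel upgrade recommended"
```

With the kernel from the example call stack, 3.10.0-1062.1.2.el7.x86_64, the function fails, which matches the recommendation to upgrade.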
A Linux ECS instance goes down and the "Out of memory and no killable processes" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "Out of memory and no killable processes" message appears in a log. The following code shows an example of the call stack:
[28663.625353] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[28663.625363] [ 1799] 0 1799 26512 245 56 3 0 -1000 sshd
[28663.625367] [29219] 0 29219 10832 126 26 3 0 -1000 systemd-udevd
[28663.625375] Kernel panic - not syncing: Out of memory and no killable processes...
[28663.634374] CPU: 1 PID: 3578 Comm: kworker/u176:4 Tainted: G OE 3.10.0-1062.9.1.el7.x86_64 #1
[28663.676873] Call Trace:
[28663.679312] [<ffffffff8139f342>] dump_stack+0x63/0x81
[28663.684421] [<ffffffff811b2245>] panic+0xf8/0x244
[28663.689184] [<ffffffff811b98db>] out_of_memory+0x2eb/0x550
[28663.694726] [<ffffffff811be254>] __alloc_pages_may_oom+0x114/0x1c0
[28663.700959] [<ffffffff811bedb3>] __alloc_pages_slowpath+0x7d3/0xa40
[28663.707279] [<ffffffff811bf229>] __alloc_pages_nodemask+0x209/0x260
[28663.713599] [<ffffffff81216535>] alloc_pages_current+0x95/0x140
[28663.719573] [<ffffffff811ba5ee>] __get_free_pages+0xe/0x40
[28663.725113] [<ffffffff81075dae>] pgd_alloc+0x1e/0x160
[28663.730225] [<ffffffff810875e4>] mm_init+0x184/0x240
[28663.735249] [<ffffffff81088102>] mm_alloc+0x52/0x60
[28663.740186] [<ffffffff81257640>] do_execveat_common.isra.37+0x250/0x780
[28663.759839] [<ffffffff81257b9c>] do_execve+0x2c/0x30
[28663.764864] [<ffffffff810a231b>] call_usermodehelper_exec_async+0xfb/0x150
[28663.777246] [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
Causes
When the operating system kernel of the instance fails to allocate memory to processes and attempts to terminate some processes to release memory, no processes that are running on the instance can be terminated. As a result, the instance goes down. The issue may occur due to the following reasons:
Memory leaks occur in the system kernel.
The processes whose oom_score_adj value is set to -1000 consume excessive memory resources and cannot be terminated. As a result, the system has insufficient available memory resources.
Note: The oom_score_adj parameter is used to adjust the termination priority of a process when an OOM issue occurs. Valid values range from -1000 to 1000. The kernel selects processes to be terminated based on the OOM score (oom_score) of each process. A process that has a higher OOM score is more likely to be terminated. A process whose oom_score_adj value is -1000 is excluded from OOM termination.
Solutions
Check whether memory leaks exist in the system kernel.
For more information, see What do I do if an instance has a high percentage of slab_unreclaimable memory?
Check whether the oom_score_adj value is properly configured.
Run the following command to obtain the process ID (PID) of a process. You can also run the top or pgrep command to obtain the PID.
ps aux | grep <Process name>
Replace the <Process name> parameter with the name of the process that you want to query.
Run the following command to check the oom_score_adj value:
cat /proc/<PID>/oom_score_adj
Replace the <PID> parameter with the PID that you obtained.
Evaluate whether the OOM behavior of the process is reasonable based on the oom_score_adj value and your environment and requirements. If the oom_score_adj value is -1000, the kernel does not select the process for OOM termination. If such a process consumes excessive memory, the system can run out of available memory resources.
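The checks above can be sketched as two read-only helper functions. This is a minimal sketch under assumptions: slab_unreclaim_percent and oom_exempt_processes are hypothetical names, the optional path arguments exist only so that the functions can be tried on sample data, and the SUnreclaim field of /proc/meminfo is used as a rough indicator of unreclaimable slab memory.

```shell
#!/bin/sh
# Sketch: print unreclaimable slab memory as a percentage of total memory.
# Reads the MemTotal and SUnreclaim fields (both in kB) of a meminfo file.
slab_unreclaim_percent() {
    awk '/^MemTotal:/   { total = $2 }
         /^SUnreclaim:/ { slab = $2 }
         END { if (total > 0) printf "%.1f\n", slab * 100 / total }' "${1:-/proc/meminfo}"
}

# Sketch: print "PID NAME" for every process whose oom_score_adj is -1000.
# Such processes cannot be selected for OOM termination.
oom_exempt_processes() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score_adj" ] || continue
        # A process can exit while the loop runs; skip it in that case.
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        [ "$adj" = "-1000" ] || continue
        echo "${dir##*/} $(cat "$dir/comm" 2>/dev/null)"
    done
}

# Example (commented out):
# slab_unreclaim_percent
# oom_exempt_processes
```

Whether a given percentage indicates a leak depends on the workload, so treat the output as a starting point for the linked topic and not as a verdict.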
A Linux ECS instance goes down and the "Objects remaining in kmalloc" message appears in a log
Problem description
When you use the memory cgroup kmem feature on an instance, the instance goes down and the "Objects remaining in kmalloc" message appears in the kernel log. The following code shows an example of the call stack:
[80569.393775] BUG kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a0) (Tainted: P B W OE ------------ T): Objects remaining in kmalloc-256(15:94ef869ce655ebab64b08cd78ee00d16c20efd5737493b48293de41fe41b04a
[80569.397756] -----------------------------------------------------------------------------
[80569.397756]
[80569.400724] INFO: Slab 0xffffea0001e94a00 objects=32 used=1 fp=0xffff88007a528000 flags=0x1fffff00004080
[80569.402702] CPU: 21 PID: 26626 Comm: dockerd Tainted: P B W OE ------------ T 3.10.0-693.2.2.el7.x86_64 #1
[80569.404898] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8f19b21 04/01/2014
[80569.406747] ffffea0001e94a00 000000004eb9a19f ffff883afee53aa0 ffffffff816a3db1
[80569.408833] ffff883afee53b78 ffffffff811dbf54 ffffffff00000020 ffff883afee53b88
[80569.410731] ffff883afee53b38 656a624f8190fff8 616d657220737463 6e6920676e696e69
[80569.412630] Call Trace:
[80569.414005] [<ffffffff816a3db1>] dump_stack+0x19/0x1b
[80569.415627] [<ffffffff811dbf54>] slab_err+0xb4/0xe0
[80569.417204] [<ffffffff811e0623>] ? __kmalloc+0x1e3/0x230
[80569.420419] [<ffffffff811e1939>] kmem_cache_close+0x149/0x2e0
[80569.422006] [<ffffffff811e1ae4>] __kmem_cache_shutdown+0x14/0x80
[80569.423606] [<ffffffff811a6874>] kmem_cache_destroy+0x44/0xf0
[80569.425149] [<ffffffff811f6019>] kmem_cache_destroy_memcg_children+0x89/0xb0
[80569.426800] [<ffffffff811a6849>] kmem_cache_destroy+0x19/0xf0
[80569.428309] [<ffffffff8123b18e>] bioset_free+0xce/0x110
[80569.431306] [<ffffffffc06d0b43>] dm_destroy+0x13/0x20 [dm_mod]
[80569.432803] [<ffffffffc06d69be>] dev_remove+0x11e/0x180 [dm_mod]
[80569.435851] [<ffffffffc06d7015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
[80569.437363] [<ffffffffc06d7343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[80569.438882] [<ffffffff8121524d>] do_vfs_ioctl+0x33d/0x540
[80569.443291] [<ffffffff812154f1>] SyS_ioctl+0xa1/0xc0
[80569.446228] [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
Cause
When you use the memory cgroup kmem feature on an instance, kmem_cache_destroy cleans up the memcg cache and checks whether the refcount parameter is 0 before it destroys kmem_cache. If the refcount parameter is not 0, tasks are still attempting to use the memcg cache of kmem_cache to allocate slab memory. In this case, race conditions occur and cause the instance to go down.
Solution
We recommend that you disable the memory cgroup kmem feature on your ECS instance. Perform the following steps:
Run the following command to open the /etc/default/grub file:
vim /etc/default/grub
Press the I key to enter Insert mode and add the following content to the value of the line that starts with GRUB_CMDLINE_LINUX:
cgroup.memory=nokmem
Press the Esc key to exit Insert mode, enter :wq, and then press the Enter key to save and close the file.
Run the following command to update the GRand Unified Bootloader (GRUB) configuration:
grub2-mkconfig -o /boot/grub2/grub.cfg
Run the following command to restart the instance:
reboot
If the memory cgroup kmem feature cannot be disabled in the operating system of your instance by running commands, we recommend that you do not configure the memory.kmem.limit_in_bytes parameter in any program on your instance. This ensures that the memory cgroup kmem feature remains disabled.
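If you prefer not to edit the file interactively, the edit in the preceding steps can be sketched with sed. The sketch assumes GNU sed and a GRUB_CMDLINE_LINUX value that is enclosed in double quotation marks. The function names add_nokmem and nokmem_active are hypothetical. Back up /etc/default/grub before you try it.

```shell
#!/bin/sh
# Sketch: append cgroup.memory=nokmem to GRUB_CMDLINE_LINUX without an editor.
# Assumes GNU sed (-i) and a double-quoted value. The file argument defaults
# to /etc/default/grub and can point to a copy for a dry run.
add_nokmem() {
    grub_file=${1:-/etc/default/grub}
    # Do nothing if the parameter is already present.
    grep -q '^GRUB_CMDLINE_LINUX=.*cgroup\.memory=nokmem' "$grub_file" && return 0
    sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="\1 cgroup.memory=nokmem"/' "$grub_file"
}

# Sketch: succeed if the running kernel was booted with the parameter.
# The optional argument (a cmdline-format file) is a hypothetical override
# for trying the function on sample data.
nokmem_active() {
    grep -qw 'cgroup\.memory=nokmem' "${1:-/proc/cmdline}"
}

# Example (commented out):
# add_nokmem && grub2-mkconfig -o /boot/grub2/grub.cfg
# After the restart: nokmem_active && echo "kmem accounting is disabled"
```

The add_nokmem function only changes the file. The grub2-mkconfig command and the restart described above are still required before nokmem_active can succeed.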
A Linux ECS instance goes down at runtime and the "unable to handle kernel NULL pointer dereference" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "unable to handle kernel NULL pointer dereference" message appears in a log. The following code shows an example of the call stack:
[8794845.086660] BUG: unable to handle kernel NULL pointer dereference at (null)
[8794845.088500] IP: [<ffffffff8128f89c>] kref_get+0xc/0x30
[8794845.089355] PGD 812ca2067 PUD 6dd707067 PMD 0
[8794845.090303] Oops: 0000 [#1] SMP
[8794845.091005] last sysfs file: /sys/devices/system/cpu/online
[8794845.091861] CPU 3
[8794845.092212] Modules linked in: ysec_firewall_kmod(U) tcp_diag inet_diag nf_conntrack_netlink nfnetlink nf_conntrack_ipv6 nf_defrag_ipv6 ip6_tables xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ipv6 virtio_balloon virtio_net virtio_console i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ysec_firewall_kmod]
[8794845.101913]
[8794845.102621] Pid: 21908, comm: ysec_hids_mod_l Tainted: G W --------------- 2.6.32-504.16.2.el6.x86_64 #1 Alibaba Cloud Alibaba Cloud ECS
[8794845.105481] RIP: 0010:[<ffffffff8128f89c>] [<ffffffff8128f89c>] kref_get+0xc/0x30
[8794845.107400] RSP: 0018:ffff88045f5a3e38 EFLAGS: 00010292
[8794845.108628] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000fffffff3
[8794845.110501] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[8794845.112371] RBP: ffff88045f5a3e48 R08: 0000000000000000 R09: ffff88050f507f00
[8794845.114133] R10: 0000000000000003 R11: 0000000000000206 R12: ffffffff8161b040
[8794845.115994] R13: 0000000000000040 R14: 00007f4b457f94d0 R15: 0000000000000000
[8794845.117865] FS: 00007f4b457fb700(0000) GS:ffff880030380000(0000) knlGS:0000000000000000
[8794845.119846] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8794845.121055] CR2: 0000000000000000 CR3: 00000006f6837000 CR4: 00000000001406e0
[8794845.122807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[8794845.124685] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[8794845.126558] Process ysec_hids_mod_l (pid: 21908, threadinfo ffff88045f5a2000, task ffff8806d43acab0)
[8794845.128689] Stack:
[8794845.129414] ffff88045f5a3e68 0000000000000000 ffff88045f5a3e68 ffffffff810d6ae6
[8794845.131107] <d> ffffffff8161b040 ffff8806c03a3520 ffff88045f5a3ef8 ffffffff81203898
[8794845.133479] <d> 00007f4b457f9510 0000000000000000 ffff88045f5a3eb8 ffffffff8128c635
[8794845.136365] Call Trace:
[8794845.137127] [<ffffffff810d6ae6>] pidns_get+0x26/0x30
[8794845.138367] [<ffffffff81203898>] proc_ns_readlink+0xc8/0x180
[8794845.139665] [<ffffffff8128c635>] ? _atomic_dec_and_lock+0x55/0x80
[8794845.141008] [<ffffffff811ab151>] ? touch_atime+0x71/0x1a0
[8794845.142268] [<ffffffff81193b0e>] sys_readlinkat+0xfe/0x120
[8794845.143536] [<ffffffff81193b4b>] sys_readlink+0x1b/0x20
[8794845.144695] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Cause
The kernel or a driver accesses an invalid memory address.
Solutions
Solution 1: Upgrade the kernel version to a later version. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in the solution on the instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
Solution 2: Check whether unreliable third-party software or drivers are installed and uninstall the software or drivers. For more information, see How do I view the third-party software and drivers installed on an ECS instance?
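As a starting point for Solution 2, the "Modules linked in" entry of the kernel log marks modules that taint the kernel with flags in parentheses, such as (OE). In el7 kernels, O typically indicates an out-of-tree module and E an unsigned module, which makes flagged modules candidates for the third-party check. The following sketch extracts these modules from a saved log. The function name flagged_modules is hypothetical, and the sketch assumes that the module list fits on one log line.

```shell
#!/bin/sh
# Sketch: list modules that carry taint flags, such as (OE) or (U), in the
# "Modules linked in:" part of a kernel log file.
# The first grep keeps the module list up to the next "[" (for example,
# "[last unloaded: ...]"); the second grep keeps names that end in "(FLAGS)".
flagged_modules() {
    grep -o 'Modules linked in:[^[]*' "$1" |
        grep -oE '[A-Za-z0-9_]+\([A-Z]+\)' |
        sort -u
}

# Example (commented out): inspect the log that kdump generated.
# flagged_modules vmcore-dmesg.txt
```

For the example call stack above, the function reports ysec_firewall_kmod(U). A flag does not prove that a module is faulty. It only narrows down which modules to review first.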
A Linux ECS instance goes down and the "unable to handle kernel paging request at" message appears in a log
Problem description
A Linux ECS instance goes down at runtime and the "unable to handle kernel paging request at" message appears in a log. The following code shows an example of the call stack:
[85899.344803] BUG: unable to handle kernel paging request at ffffffffc0b0ceef
[85899.345643] IP: [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
[85899.346119] PGD 24f212067 PUD 24f214067 PMD 24e421067 PTE 0
[85899.346670] Oops: 0010 [#1] SMP
[85899.346982] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth rfkill ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink xt_addrtype br_netfilter tcp_diag inet_diag xt_set ip_set_hash_ip tampercfg(OE) overlay(T) ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iosf_mbi ppdev virtio_balloon crc32_pclmul parport_pc ghash_clmulni_intel parport shpchp i2c_piix4 aesni_intel lrw gf128mul glue_helper joydev
[85899.354796] ablk_helper pcspkr cryptd ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_net virtio_console virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix libata crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci i2c_core serio_raw virtio_ring floppy virtio [last unloaded: tampercore]
[85899.358255] CPU: 2 PID: 1 Comm: systemd Tainted: G OE ------------ T 3.10.0-862.14.4.el7.x86_64 #1
[85899.359264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[85899.360050] task: ffff9880fa2c0000 ti: ffff9880fa2c8000 task.ti: ffff9880fa2c8000
[85899.360817] RIP: 0010:[<ffffffffc0b0ceef>] [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
[85899.361636] RSP: 0018:ffff9880fa2cbd30 EFLAGS: 00010246
[85899.362181] RAX: 0000000000000000 RBX: 000055a50e52e3c0 RCX: 0000000000000000
[85899.362913] RDX: 0000000180080006 RSI: fffff786c5c52800 RDI: 0000000040000000
[85899.363645] RBP: ffff9880fa2cbf48 R08: ffff9880f14a0000 R09: 0000000180080005
[85899.364372] R10: 00000000f14a3001 R11: fffff786c5c52800 R12: ffff9880fa2cbd30
[85899.365107] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[85899.365840] FS: 00007fa181b3a940(0000) GS:ffff9883bfc80000(0000) knlGS:0000000000000000
[85899.366669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[85899.367257] CR2: ffffffffc0b0ceef CR3: 000000024ed44000 CR4: 00000000003606e0
[85899.367992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[85899.368728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[85899.369453] Call Trace:
[85899.369726] [<ffffffffa392579b>] system_call_fastpath+0x22/0x27
[85899.370339] Code: Bad RIP value.
[85899.370729] RIP [<ffffffffc0b0ceef>] 0xffffffffc0b0ceef
[85899.371292] RSP <ffff9880fa2cbd30>
[85899.373188] CR2: ffffffffc0b0ceef
Cause
The kernel or a driver accesses an invalid memory address.
Solutions
Solution 1: Upgrade the kernel version to a later version. For more information, see Upgrade the operating system kernel of a Linux ECS instance.
Important: Before you perform the operations in the solution on the instance on which the issue occurred, we recommend that you create snapshots for the instance to back up data. This prevents data loss caused by accidental operations. For information about how to create a snapshot, see Create a snapshot.
Solution 2: Check whether unreliable third-party software or drivers are installed and uninstall the software or drivers. For more information, see How do I view the third-party software and drivers installed on an ECS instance?
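To find out which of the cases in this topic a saved kernel log corresponds to, you can search the log for the messages that the section titles quote. The following sketch does this for a log file such as vmcore-dmesg.txt. The function name match_signatures is hypothetical, and the list contains only messages that are quoted in this topic.

```shell
#!/bin/sh
# Sketch: print every known panic message from this topic that appears in
# the given kernel log file. Uses fixed-string matching (grep -F), so the
# messages are not interpreted as regular expressions.
match_signatures() {
    log_file=$1
    while IFS= read -r signature; do
        grep -qF -- "$signature" "$log_file" && echo "$signature"
    done <<'EOF'
panic_on_oom is enabled
filp_close
VFS: Unable to mount root fs on unknown-block
virtio_check_driver_offered_feature
Out of memory and no killable processes
Objects remaining in kmalloc
unable to handle kernel NULL pointer dereference
unable to handle kernel paging request at
EOF
    return 0
}

# Example (commented out):
# match_signatures vmcore-dmesg.txt
```

A log can match more than one message. For example, a filp_close crash also contains the NULL pointer dereference message. In that case, start with the more specific match, which the function prints first.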