This topic describes the cause of and solutions to a memory leak in the Intel Software Guard Extensions (SGX) driver on Elastic Compute Service (ECS) instances that run Alibaba Cloud Linux 2.
Problem description
When a memory leak occurs on the Intel SGX driver of an Alibaba Cloud Linux 2 instance with the following configurations, system memory may be exhausted:
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit.
Kernel version: kernel-4.19.91-23.al7 or earlier. You can run the uname -r command to view the kernel version.
Instance family: c7t, r7t, or g7t.
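The kernel version condition above can be checked with a short script. This is a minimal sketch, not an official tool: the function name is_affected_kernel and the AFFECTED_MAX variable are made up for illustration, and the comparison assumes GNU sort with the -V (version sort) option is available on the instance.

```shell
#!/bin/sh
# Highest kernel release named as affected in this topic (without the arch suffix).
AFFECTED_MAX="4.19.91-23.al7"

# Hypothetical helper: succeeds if the release in $1 is at or below AFFECTED_MAX.
is_affected_kernel() {
    # Strip the architecture suffix, for example ".x86_64".
    release="${1%.x86_64}"
    # With version sort, the lower (or equal) release is printed first.
    lowest=$(printf '%s\n%s\n' "$release" "$AFFECTED_MAX" | sort -V | head -n 1)
    [ "$lowest" = "$release" ]
}

if is_affected_kernel "$(uname -r)"; then
    echo "Kernel $(uname -r) is in the affected range"
else
    echo "Kernel $(uname -r) is newer than $AFFECTED_MAX"
fi
```

The script checks only the kernel version. The image and instance family still need to be confirmed separately.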
Most of the memory is consumed by the Intel SGX test application. Error information similar to the following is displayed:
[ 71.938733] systemd-journal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 71.938735] systemd-journal cpuset=/ mems_allowed=0
[ 71.938738] CPU: 0 PID: 415 Comm: systemd-journal Not tainted 4.19.91-23.al7.x86_64 #1
[ 71.938738] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
[ 71.938739] Call Trace:
[ 71.938746] dump_stack+0x66/0x8b
[ 71.938749] dump_global_header+0x12/0x10f
[ 71.938750] oom_kill_process+0x2cf/0x310
[ 71.938752] out_of_memory+0xf7/0x4c0
[ 71.938754] __alloc_pages_nodemask+0xf07/0xfd0
[ 71.938757] ? blk_flush_plug_list+0xd7/0x220
[ 71.938759] pagecache_get_page+0x8c/0x350
[ 71.938761] filemap_fault+0x37e/0x6e0
[ 71.938764] ext4_filemap_fault+0x2c/0x3b
[ 71.938766] __do_fault+0x38/0x170
[ 71.938768] do_fault+0x2eb/0x640
[ 71.938769] __handle_mm_fault+0x621/0xa20
[ 71.938772] ? apic_timer_interrupt+0xa/0x20
[ 71.938774] handle_mm_fault+0x106/0x1c0
[ 71.938776] __do_page_fault+0x1ba/0x480
[ 71.938778] do_page_fault+0x32/0x140
[ 71.938780] ? async_page_fault+0x8/0x30
[ 71.938781] async_page_fault+0x1e/0x30
[ 71.938782] RIP: 0033:0x55a1ca49516f
[ 71.938786] Code: Bad RIP value.
[ 71.938787] RSP: 002b:00007ffcd58b22b0 EFLAGS: 00010246
[ 71.938788] RAX: 0000000000000000 RBX: 000055a1cbcc4400 RCX: a1fcdcf819d7e1e5
[ 71.938788] RDX: 00007f3d4d72a000 RSI: 000055a1cbcc2060 RDI: 000055a1cbcc4400
[ 71.938789] RBP: a1fcdcf819d7e1e5 R08: 00007ffcd58b23b0 R09: 00007ffcd58b23a8
[ 71.938790] R10: 000055a1ca49a935 R11: 00000000d1ba4319 R12: 000055a1cbcc4400
[ 71.938790] R13: 0000000000000011 R14: 000055a1cbcc2060 R15: a1fcdcf819d7e1e5
[ 71.938791] Task in / killed as a result of limit of host
[ 71.938792] Mem-Info:
[ 71.938795] active_anon:85 inactive_anon:410619 isolated_anon:0
active_file:150 inactive_file:353 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:6038 slab_unreclaimable:17336
mapped:98 shmem:403568 pagetables:1793 bounce:0
free:12881 free_pcp:440 free_cma:0
[ 71.938797] Node 0 active_anon:340kB inactive_anon:1642476kB active_file:600kB inactive_file:1412kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:392kB dirty:0kB writeback:0kB shmem:1614272kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 71.938798] Node 0 DMA free:7408kB min:392kB low:488kB high:584kB active_anon:0kB inactive_anon:8312kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:16kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 71.938800] lowmem_reserve[]: 0 1761 1761 1761 1761
[ 71.938801] Node 0 DMA32 free:44116kB min:44660kB low:55824kB high:66988kB active_anon:340kB inactive_anon:1633492kB active_file:688kB inactive_file:1812kB unevictable:0kB writepending:0kB present:1914960kB managed:1826408kB mlocked:0kB kernel_stack:2208kB pagetables:7156kB bounce:0kB free_pcp:1760kB local_pcp:1396kB free_cma:0kB
[ 71.938804] lowmem_reserve[]: 0 0 0 0 0
[ 71.938805] Node 0 DMA: 0*4kB 2*8kB (UM) 2*16kB (UE) 0*32kB 1*64kB (E) 3*128kB (UME) 3*256kB (UME) 2*512kB (ME) 3*1024kB (UME) 1*2048kB (E) 0*4096kB = 7408kB
[ 71.938810] Node 0 DMA32: 233*4kB (UMEH) 158*8kB (UMEH) 177*16kB (UMEH) 79*32kB (UEH) 34*64kB (UMEH) 16*128kB (UMEH) 6*256kB (E) 3*512kB (UE) 3*1024kB (ME) 3*2048kB (UME) 5*4096kB (M) = 44548kB
[ 71.938815] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 71.938816] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 71.938816] 404127 total pagecache pages
[ 71.938817] 0 pages in swap cache
[ 71.938818] Swap cache stats: add 0, delete 0, find 0/0
[ 71.938818] Free swap = 0kB
[ 71.938819] Total swap = 0kB
[ 71.938819] 482739 pages RAM
[ 71.938820] 0 pages HighMem/MovableOnly
[ 71.938820] 22160 pages reserved
[ 71.938820] 0 pages cma reserved
[ 71.938821] 0 pages hwpoisoned
[ 71.938821] Tasks state (memory values in pages):
[ 71.938822] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 71.938824] [ 415] 0 415 11814 85 147456 0 0 systemd-journal
[ 71.938826] [ 439] 0 439 11430 228 118784 0 -1000 systemd-udevd
[ 71.938827] [ 550] 0 550 22654 218 212992 0 0 rngd
[ 71.938828] [ 554] 81 554 15051 155 167936 0 -900 dbus-daemon
[ 71.938829] [ 573] 0 573 48803 120 180224 0 0 gssproxy
[ 71.938830] [ 585] 0 585 6598 91 98304 0 0 systemd-logind
[ 71.938831] [ 587] 0 587 4456 115 61440 0 0 assist_daemon
[ 71.938832] [ 597] 32 597 17316 135 188416 0 0 rpcbind
[ 71.938833] [ 601] 0 601 31598 153 106496 0 0 crond
[ 71.938834] [ 606] 997 606 29454 129 143360 0 0 chronyd
[ 71.938835] [ 616] 0 616 27553 33 57344 0 0 agetty
[ 71.938836] [ 819] 0 819 25740 516 221184 0 0 dhclient
[ 71.938837] [ 887] 0 887 121900 708 430080 0 0 rsyslogd
[ 71.938838] [ 953] 0 953 10512 391 102400 0 0 AliYunDunUpdate
[ 71.938839] [ 1078] 0 1078 32317 732 274432 0 0 AliYunDun
[ 71.938840] [ 1235] 0 1235 28237 261 266240 0 -1000 sshd
[ 71.938841] [ 1283] 0 1283 39209 337 348160 0 0 sshd
[ 71.938842] [ 1292] 0 1292 29086 317 90112 0 0 bash
[ 71.938843] [ 1310] 0 1310 87597 530 311296 0 -900 abrt-dbus
[ 71.938844] [ 1397] 0 1397 39209 347 348160 0 0 sshd
[ 71.938845] [ 1399] 0 1399 29080 279 81920 0 0 bash
[ 71.938846] [ 1430] 0 1430 27028 25 77824 0 0 dmesg
[ 71.938847] [ 1431] 0 1431 8392985 92 3219456 0 0 app
[ 71.938848] [ 1432] 0 1432 39209 339 356352 0 0 sshd
[ 71.938849] [ 1434] 0 1434 29053 276 81920 0 0 bash
[ 71.938850] [ 1470] 0 1470 2146 23 57344 0 0 systemd-cgroups
[ 71.938851] [ 1471] 0 1471 2146 23 57344 0 0 systemd-cgroups
[ 71.938852] [ 1472] 0 1472 2146 23 53248 0 0 systemd-cgroups
[ 71.938853] [ 1473] 0 1473 2143 15 57344 0 0 systemd-cgroups
[ 71.938854] Out of memory: Kill process 1431 (app) score 1 or sacrifice child
[ 71.939026] Killed process 1431 (app) total-vm:33571940kB, anon-rss:320kB, file-rss:48kB, shmem-rss:0kB
[ 71.942922] oom_reaper: reaped process 1431 (app), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Cause
The sgx_encl_mm_release_deferred function in arch/x86/kernel/cpu/sgx/encl.c does not correctly handle the reference count of the encl structure. When a process that uses Enclave Page Cache (EPC) memory is forked, the reference count of encl remains non-zero, so the encrypted (EPC) memory is not released and leaks. After EPC memory is exhausted, shared memory is used to swap out the encrypted memory, which eventually exhausts the non-encrypted memory.
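Because the leaked EPC pages are swapped out to shared memory, the problem may be visible as a steadily growing Shmem value before the OOM killer is triggered. The following is a minimal sketch for sampling that value. The helper name meminfo_kb is hypothetical, and the script assumes the standard /proc/meminfo layout (field name, value, unit).

```shell
#!/bin/sh
# Hypothetical helper: print the value in kB of a /proc/meminfo field.
# $1: field name without the colon, for example Shmem
# $2: optional path to a meminfo-formatted file (defaults to /proc/meminfo)
meminfo_kb() {
    awk -v key="$1" '$1 == (key ":") {print $2}' "${2:-/proc/meminfo}"
}

# Take three samples one second apart while the SGX workload is running.
for i in 1 2 3; do
    echo "Shmem=$(meminfo_kb Shmem) kB MemAvailable=$(meminfo_kb MemAvailable) kB"
    sleep 1
done
```

A Shmem value that keeps rising while MemAvailable falls is consistent with the leak described above, but it is not proof on its own, because other workloads also use shared memory.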
Solutions
Kernel upgrades may cause compatibility and stability issues. Review the kernel features in release notes for Alibaba Cloud Linux 2 and exercise caution when you upgrade the kernel version.
The restart operation temporarily stops the instance, which may interrupt running services and lead to data loss. Therefore, back up critical instance data and then restart the instance during off-peak hours.
If the instance's kernel version is 4.19.91-23.al7.x86_64 or earlier, perform the following steps:
Upgrade the kernel to the latest version.
sudo yum update kernel
Restart the instance for the new kernel version to take effect.
sudo reboot
If the instance's kernel version is 4.19.91-23.al7.x86_64, install a kernel live patch.
sudo yum install -y kernel-hotfix-5577959-`uname -r | awk -F"-" '{print $NF}'`
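The backquoted part of the live patch command derives the package suffix from the running kernel release. The sketch below shows that expansion without installing anything. The function name hotfix_package is made up for illustration, and the sample release string is the one named in this topic.

```shell
#!/bin/sh
# Hypothetical helper: build the hotfix package name for a kernel release string.
# $1: a kernel release such as the output of uname -r
hotfix_package() {
    # Keep only the text after the last "-", as the awk expression in the command above does.
    suffix=$(printf '%s\n' "$1" | awk -F"-" '{print $NF}')
    printf 'kernel-hotfix-5577959-%s\n' "$suffix"
}

# Show the package name that the install command would request for this release.
hotfix_package "4.19.91-23.al7.x86_64"
```

Running hotfix_package "$(uname -r)" on the instance prints the name that yum is asked to install, which helps confirm that the name matches the running kernel before the patch is applied.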