This topic describes why a memory leak can occur in the Intel Software Guard Extensions (SGX) driver of an Elastic Compute Service (ECS) instance that runs Alibaba Cloud Linux 2, and how to resolve the issue.
Problem description
In specific cases, a memory leak occurs in the Intel SGX driver of an Alibaba Cloud Linux 2 instance that meets the following conditions:
Image: Alibaba Cloud Linux 2.1903 LTS 64-bit
Kernel: kernel-4.19.91-23.al7 or earlier
Instance type: c7t, r7t, or g7t
The memory leak eventually exhausts system memory. Most of the memory is occupied by the Intel SGX test application, and out-of-memory (OOM) information similar to the following is displayed:
[ 71.938733] systemd-journal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 71.938735] systemd-journal cpuset=/ mems_allowed=0
[ 71.938738] CPU: 0 PID: 415 Comm: systemd-journal Not tainted 4.19.91-23.al7.x86_64 #1
[ 71.938738] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
[ 71.938739] Call Trace:
[ 71.938746] dump_stack+0x66/0x8b
[ 71.938749] dump_global_header+0x12/0x10f
[ 71.938750] oom_kill_process+0x2cf/0x310
[ 71.938752] out_of_memory+0xf7/0x4c0
[ 71.938754] __alloc_pages_nodemask+0xf07/0xfd0
[ 71.938757] ? blk_flush_plug_list+0xd7/0x220
[ 71.938759] pagecache_get_page+0x8c/0x350
[ 71.938761] filemap_fault+0x37e/0x6e0
[ 71.938764] ext4_filemap_fault+0x2c/0x3b
[ 71.938766] __do_fault+0x38/0x170
[ 71.938768] do_fault+0x2eb/0x640
[ 71.938769] __handle_mm_fault+0x621/0xa20
[ 71.938772] ? apic_timer_interrupt+0xa/0x20
[ 71.938774] handle_mm_fault+0x106/0x1c0
[ 71.938776] __do_page_fault+0x1ba/0x480
[ 71.938778] do_page_fault+0x32/0x140
[ 71.938780] ? async_page_fault+0x8/0x30
[ 71.938781] async_page_fault+0x1e/0x30
[ 71.938782] RIP: 0033:0x55a1ca49516f
[ 71.938786] Code: Bad RIP value.
[ 71.938787] RSP: 002b:00007ffcd58b22b0 EFLAGS: 00010246
[ 71.938788] RAX: 0000000000000000 RBX: 000055a1cbcc4400 RCX: a1fcdcf819d7e1e5
[ 71.938788] RDX: 00007f3d4d72a000 RSI: 000055a1cbcc2060 RDI: 000055a1cbcc4400
[ 71.938789] RBP: a1fcdcf819d7e1e5 R08: 00007ffcd58b23b0 R09: 00007ffcd58b23a8
[ 71.938790] R10: 000055a1ca49a935 R11: 00000000d1ba4319 R12: 000055a1cbcc4400
[ 71.938790] R13: 0000000000000011 R14: 000055a1cbcc2060 R15: a1fcdcf819d7e1e5
[ 71.938791] Task in / killed as a result of limit of host
[ 71.938792] Mem-Info:
[ 71.938795] active_anon:85 inactive_anon:410619 isolated_anon:0
active_file:150 inactive_file:353 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:6038 slab_unreclaimable:17336
mapped:98 shmem:403568 pagetables:1793 bounce:0
free:12881 free_pcp:440 free_cma:0
[ 71.938797] Node 0 active_anon:340kB inactive_anon:1642476kB active_file:600kB inactive_file:1412kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:392kB dirty:0kB writeback:0kB shmem:1614272kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 71.938798] Node 0 DMA free:7408kB min:392kB low:488kB high:584kB active_anon:0kB inactive_anon:8312kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:16kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 71.938800] lowmem_reserve[]: 0 1761 1761 1761 1761
[ 71.938801] Node 0 DMA32 free:44116kB min:44660kB low:55824kB high:66988kB active_anon:340kB inactive_anon:1633492kB active_file:688kB inactive_file:1812kB unevictable:0kB writepending:0kB present:1914960kB managed:1826408kB mlocked:0kB kernel_stack:2208kB pagetables:7156kB bounce:0kB free_pcp:1760kB local_pcp:1396kB free_cma:0kB
[ 71.938804] lowmem_reserve[]: 0 0 0 0 0
[ 71.938805] Node 0 DMA: 0*4kB 2*8kB (UM) 2*16kB (UE) 0*32kB 1*64kB (E) 3*128kB (UME) 3*256kB (UME) 2*512kB (ME) 3*1024kB (UME) 1*2048kB (E) 0*4096kB = 7408kB
[ 71.938810] Node 0 DMA32: 233*4kB (UMEH) 158*8kB (UMEH) 177*16kB (UMEH) 79*32kB (UEH) 34*64kB (UMEH) 16*128kB (UMEH) 6*256kB (E) 3*512kB (UE) 3*1024kB (ME) 3*2048kB (UME) 5*4096kB (M) = 44548kB
[ 71.938815] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 71.938816] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 71.938816] 404127 total pagecache pages
[ 71.938817] 0 pages in swap cache
[ 71.938818] Swap cache stats: add 0, delete 0, find 0/0
[ 71.938818] Free swap = 0kB
[ 71.938819] Total swap = 0kB
[ 71.938819] 482739 pages RAM
[ 71.938820] 0 pages HighMem/MovableOnly
[ 71.938820] 22160 pages reserved
[ 71.938820] 0 pages cma reserved
[ 71.938821] 0 pages hwpoisoned
[ 71.938821] Tasks state (memory values in pages):
[ 71.938822] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 71.938824] [ 415] 0 415 11814 85 147456 0 0 systemd-journal
[ 71.938826] [ 439] 0 439 11430 228 118784 0 -1000 systemd-udevd
[ 71.938827] [ 550] 0 550 22654 218 212992 0 0 rngd
[ 71.938828] [ 554] 81 554 15051 155 167936 0 -900 dbus-daemon
[ 71.938829] [ 573] 0 573 48803 120 180224 0 0 gssproxy
[ 71.938830] [ 585] 0 585 6598 91 98304 0 0 systemd-logind
[ 71.938831] [ 587] 0 587 4456 115 61440 0 0 assist_daemon
[ 71.938832] [ 597] 32 597 17316 135 188416 0 0 rpcbind
[ 71.938833] [ 601] 0 601 31598 153 106496 0 0 crond
[ 71.938834] [ 606] 997 606 29454 129 143360 0 0 chronyd
[ 71.938835] [ 616] 0 616 27553 33 57344 0 0 agetty
[ 71.938836] [ 819] 0 819 25740 516 221184 0 0 dhclient
[ 71.938837] [ 887] 0 887 121900 708 430080 0 0 rsyslogd
[ 71.938838] [ 953] 0 953 10512 391 102400 0 0 AliYunDunUpdate
[ 71.938839] [ 1078] 0 1078 32317 732 274432 0 0 AliYunDun
[ 71.938840] [ 1235] 0 1235 28237 261 266240 0 -1000 sshd
[ 71.938841] [ 1283] 0 1283 39209 337 348160 0 0 sshd
[ 71.938842] [ 1292] 0 1292 29086 317 90112 0 0 bash
[ 71.938843] [ 1310] 0 1310 87597 530 311296 0 -900 abrt-dbus
[ 71.938844] [ 1397] 0 1397 39209 347 348160 0 0 sshd
[ 71.938845] [ 1399] 0 1399 29080 279 81920 0 0 bash
[ 71.938846] [ 1430] 0 1430 27028 25 77824 0 0 dmesg
[ 71.938847] [ 1431] 0 1431 8392985 92 3219456 0 0 app
[ 71.938848] [ 1432] 0 1432 39209 339 356352 0 0 sshd
[ 71.938849] [ 1434] 0 1434 29053 276 81920 0 0 bash
[ 71.938850] [ 1470] 0 1470 2146 23 57344 0 0 systemd-cgroups
[ 71.938851] [ 1471] 0 1471 2146 23 57344 0 0 systemd-cgroups
[ 71.938852] [ 1472] 0 1472 2146 23 53248 0 0 systemd-cgroups
[ 71.938853] [ 1473] 0 1473 2143 15 57344 0 0 systemd-cgroups
[ 71.938854] Out of memory: Kill process 1431 (app) score 1 or sacrifice child
[ 71.939026] Killed process 1431 (app) total-vm:33571940kB, anon-rss:320kB, file-rss:48kB, shmem-rss:0kB
[ 71.942922] oom_reaper: reaped process 1431 (app), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Cause
The sgx_encl_mm_release_deferred function in arch/x86/kernel/cpu/sgx/encl.c fails to properly handle the reference count of the encl structure. When a process that occupies Enclave Page Cache (EPC) calls fork(), the reference count of encl never returns to zero, which causes EPC to leak. After the physical memory is exhausted, shared memory is used in place of EPC, and the non-EPC memory is then exhausted as well.
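Because the leak shows up as growing shared memory (the shmem value in the OOM output above), sampling /proc/meminfo while the SGX workload runs can help confirm the symptom. The following is a minimal sketch, not part of the official procedure. The function names meminfo_kb and watch_shmem are hypothetical helpers introduced here for illustration, and the sketch assumes that the Shmem and MemAvailable fields are present in /proc/meminfo.

```shell
# Hypothetical helper: print the value (in kB) of one field from a
# /proc/meminfo-style file. The file argument defaults to /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Hypothetical helper: sample Shmem and MemAvailable several times so that
# steady growth of shared memory becomes visible.
# $1 = number of samples (default 5), $2 = seconds between samples (default 2)
watch_shmem() {
    samples="${1:-5}"
    interval="${2:-2}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf 'Shmem=%s kB MemAvailable=%s kB\n' \
            "$(meminfo_kb Shmem)" "$(meminfo_kb MemAvailable)"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, running watch_shmem 10 1 prints ten samples at one-second intervals. A Shmem value that keeps rising while MemAvailable falls is consistent with the behavior described above, but is not proof of this specific bug.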
Solutions
To resolve the issue, perform the following operations:
Log on to the instance and run the following command to view the kernel version. If the kernel version is later than 4.19.91-23.al7.x86_64, the following operations are not applicable.
uname -r
Output similar to the following is displayed:
4.19.91-23.al7.x86_64
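The version comparison in this step can also be scripted. The following is a minimal sketch that assumes GNU sort with the -V (version sort) option is available. The function name classify_kernel and its output labels are hypothetical and were introduced only for this example.

```shell
# Threshold kernel release taken from this topic.
THRESHOLD="4.19.91-23.al7.x86_64"

# Hypothetical helper: classify a kernel release string.
#   hotfix       -> exactly the threshold release: install the kernel hotfix
#   update       -> earlier than the threshold: update the kernel
#   not-affected -> later than the threshold: the operations do not apply
classify_kernel() {
    current="$1"
    if [ "$current" = "$THRESHOLD" ]; then
        echo "hotfix"
        return
    fi
    # sort -V orders version strings; the first line is the earlier release.
    oldest=$(printf '%s\n%s\n' "$current" "$THRESHOLD" | sort -V | head -n 1)
    if [ "$oldest" = "$current" ]; then
        echo "update"
    else
        echo "not-affected"
    fi
}

# Classify the running kernel.
classify_kernel "$(uname -r)"
```

The labels map directly to the solutions described below: update corresponds to the kernel update, and hotfix corresponds to the hotfix installation.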
Select one of the following solutions based on your kernel version:
To resolve the issue for a kernel version earlier than 4.19.91-23.al7.x86_64, perform the following steps:
Update the kernel to the latest version.
yum update kernel
Restart the instance for the update to take effect.
reboot
If the issue persists on the latest kernel version, install the kernel hotfix as described in the following solution.
To resolve the issue for kernel version 4.19.91-23.al7.x86_64, run the following command to install the kernel hotfix:
yum install -y kernel-hotfix-5577959-`uname -r | awk -F"-" '{print $NF}'`
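The installation command builds the package name from the part of the kernel release that follows the last hyphen. The following sketch shows that derivation on its own. The function name hotfix_package is hypothetical, and the sketch uses shell parameter expansion in place of awk, which yields the same suffix.

```shell
# Hotfix ID taken from the installation command in this topic.
HOTFIX_ID="5577959"

# Hypothetical helper: build the hotfix package name for a kernel release.
# $1 = kernel release string (defaults to the running kernel).
hotfix_package() {
    release="${1:-$(uname -r)}"
    # ${release##*-} keeps the text after the last "-",
    # the same value that awk -F"-" '{print $NF}' prints.
    echo "kernel-hotfix-${HOTFIX_ID}-${release##*-}"
}

hotfix_package 4.19.91-23.al7.x86_64
# prints: kernel-hotfix-5577959-23.al7.x86_64
```

After the installation, one way to check that the package is present is to run rpm -q "$(hotfix_package)" on the instance.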