Simulate an out-of-memory (OOM) scenario on an Elastic Compute Service (ECS) instance by injecting a process that continuously consumes memory. This drill helps you test whether business processes are terminated, evaluate system recovery capabilities, and verify the effectiveness of monitoring and alerting mechanisms. Based on the drill results, you can develop response strategies so that your system quickly resumes normal operation after an OOM event in the production environment, which reduces the risk of business interruption.
Implementation principle
This solution uses the Cloud Assistant plugin ACS-ECS-HighMemory. Before injecting the fault, the plugin calculates the amount of memory to allocate. It then starts the trigger_oom injection process, which consumes memory at the specified rate until it reaches the target memory usage.
When an OOM event occurs, the operating system selects a process to terminate based on its score. The score is calculated from the memory that the process occupies and its oom_score_adj value. During fault injection, you can therefore adjust the oom_score_adj value of the injection process to control whether the operating system terminates the business process or the injection process. The oom_score_adj parameter accepts values from -1000 to 1000, and the default value is 0. A higher value makes the process more likely to be terminated. A value of -1000 prevents the OOM Killer from terminating the process.
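To see the two values that drive this selection, you can read them from procfs on a Linux instance. The following is a minimal sketch, not part of the plugin. The function name show_oom_score and the PROC_ROOT override are hypothetical and exist only for illustration. The files oom_score_adj and oom_score under /proc/PID are standard Linux interfaces.

```shell
# Print the OOM-related values that the kernel exposes for one process.
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
show_oom_score() {
    local pid="$1"
    local dir="${PROC_ROOT:-/proc}/$pid"
    if [ ! -r "$dir/oom_score_adj" ] || [ ! -r "$dir/oom_score" ]; then
        echo "cannot read OOM values in $dir" >&2
        return 1
    fi
    # oom_score_adj: the adjustment, from -1000 to 1000.
    # oom_score: the score the kernel currently assigns to the process.
    printf 'pid=%s oom_score_adj=%s oom_score=%s\n' \
        "$pid" "$(cat "$dir/oom_score_adj")" "$(cat "$dir/oom_score")"
}
```

For example, show_oom_score $$ prints the values for the current shell. While a drill is running, you could pass the PID of the trigger_oom process instead and compare its score with the score of your business process.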
Usage guide
Prerequisites
Cloud Assistant Agent is installed on the ECS instance for which you want to perform a drill.
The status of Cloud Assistant is Normal on the ECS instance. For more information, see View the status of Cloud Assistant and handle anomalies.
Fault injection
Log on to the ECS instance.
For more information, see Log on to a Linux instance using Workbench.
As a user with sudo privileges, run the Cloud Assistant plugin ACS-ECS-HighMemory.
sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params inject,[score=paramA],[percent=paramB],[rate=paramC],[duration=paramD]
The square brackets [] in the command format indicate optional parameters. Do not include the brackets when you run the command.
For example, to set the memory usage to 90% and the duration to 120 s, run the following command:
sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params inject,percent=90,duration=120
Parameters:
score (optional): Sets the oom_score_adj value for the injection process. The value can range from -1000 to 1000. This parameter determines whether the injection process or the business process is terminated when an OOM event occurs. To terminate the business process, set a negative score for the injection process, such as -100. If the business process is still not terminated, decrease the score further.
percent (optional): Specifies the target memory usage as a percentage of the total system memory. If you do not specify this parameter, an OOM event is triggered by default.
rate (optional): Specifies the rate of memory consumption in MB/s. The default value is 0, which indicates no limit.
duration (optional): Specifies the duration in seconds for which to maintain the target memory usage after it is reached. The memory is automatically released after the timeout. The default value is 300 s.
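Because all four parameters are optional, it can help to assemble the --params value in a script instead of typing it by hand. The following is a minimal sketch. The helper names is_int and build_inject_params are hypothetical and are not part of the plugin. The sketch keeps the parameter order shown in the command format above and checks only the score range, which is the one constraint documented here.

```shell
# Return success if the argument is an integer with an optional leading minus.
is_int() {
    case "${1#-}" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Build the value for --params from key=value arguments given in any order.
# With no arguments the result is just "inject", which triggers an OOM event.
build_inject_params() {
    local score="" percent="" rate="" duration="" arg params
    for arg in "$@"; do
        case "$arg" in
            score=*)    score="${arg#score=}" ;;
            percent=*)  percent="${arg#percent=}" ;;
            rate=*)     rate="${arg#rate=}" ;;
            duration=*) duration="${arg#duration=}" ;;
            *) echo "unknown parameter: $arg" >&2; return 1 ;;
        esac
    done
    if [ -n "$score" ]; then
        # oom_score_adj must be an integer from -1000 to 1000.
        if ! is_int "$score" || [ "$score" -lt -1000 ] || [ "$score" -gt 1000 ]; then
            echo "score must be an integer from -1000 to 1000" >&2
            return 1
        fi
    fi
    params="inject"
    if [ -n "$score" ];    then params="$params,score=$score"; fi
    if [ -n "$percent" ];  then params="$params,percent=$percent"; fi
    if [ -n "$rate" ];     then params="$params,rate=$rate"; fi
    if [ -n "$duration" ]; then params="$params,duration=$duration"; fi
    printf '%s\n' "$params"
}
```

You could then run sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params "$(build_inject_params percent=90 duration=120)", which passes the same value as the example command above.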
Verify that the fault was injected successfully.
Scenario without an OOM event:
Run the top command. If the system memory usage reaches the target percentage, the fault was injected successfully.
Alternatively, check the instance monitoring page of the ECS console or the CloudMonitor console. If the memory usage reaches the target percentage, the fault was injected successfully.

Scenario with an OOM event: Search the system log for entries that contain Out of memory.
dmesg -T | grep "Out of memory"
If matching entries are returned, an OOM event occurred on the ECS instance, which means the fault was injected successfully. Check whether the terminated process is the one you intended to terminate. If not, adjust the score parameter.
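To check which process was terminated without reading the log by eye, you can extract the PID and process name from the matching entries. The following is a minimal sketch. The function name oom_killed_processes is hypothetical, and the pattern assumes that the kernel writes entries of the form "Out of memory: Killed process PID (NAME)" or, on older kernels, "Out of memory: Kill process PID (NAME)". The exact wording depends on the kernel version, so adjust the pattern if your log differs.

```shell
# Read kernel log lines on stdin and print "PID NAME" for each OOM kill.
# Assumes the "Out of memory: Kill[ed] process PID (NAME)" message format.
oom_killed_processes() {
    sed -n 's/.*Out of memory: Kill\(ed\)\{0,1\} process \([0-9][0-9]*\) (\([^)]*\)).*/\2 \3/p'
}
```

A typical use is dmesg -T | oom_killed_processes. If the name that is printed is trigger_oom rather than your business process, the injection process was terminated, and a lower score value may be needed.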
Fault recovery
If you specified a target memory usage, you can use one of the following methods to remove the injected fault.
Method 1 (Recommended): Run the fault recovery command on the ECS instance. Verify that the memory usage drops to the level it was at before the fault injection.
sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params recover
Method 2: Wait for the fault to be automatically released after the timeout. By default, the system automatically releases the memory 300 s after the target memory usage is reached.
If you did not specify a target memory usage, an OOM event is triggered. The system usually recovers automatically. However, you may need to restart the ECS instance to prevent other processes from being unexpectedly terminated.
Drill example
Inject a fault to achieve a memory usage of 90% at a rate of 20 MB/s for a duration of 120 s.
sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params inject,percent=90,rate=20,duration=120
If output similar to the following is returned, the fault was successfully injected.

Verify the injection result.
Check using the top command
Run the top command. Then, press M to sort processes by memory usage and press m to display the memory usage progress bar.
The output shows that the current memory usage is 90.2%, and the trigger_oom injection process is using 84.4% of the memory.
After the timeout, the system automatically releases the memory. The memory usage returns to its pre-drill level, and the injection process exits.

Check in the instance monitoring details in the console

In the instance details in the console, view the memory usage. After the fault is injected, the system memory usage increases at the specified rate. It reaches the target of 90% and remains there for 120 s. Then, the memory usage drops to its pre-injection level, and the drill is complete.
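The drill timeline can also be estimated and sampled from the command line. The following is a minimal sketch with two hypothetical helpers that are not part of the plugin. mem_used_percent derives a usage percentage from MemTotal and MemAvailable in /proc/meminfo, which is an assumption and may differ slightly from the figure that top or the console reports. estimate_ramp_seconds estimates how long the ramp to the target takes, assuming that the rate is sustained and that the memory used by other processes stays constant.

```shell
# Print the memory usage as a percentage, computed as
# (MemTotal - MemAvailable) / MemTotal. The optional argument is the
# path of a meminfo-style file and defaults to /proc/meminfo.
mem_used_percent() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1
            printf "%.1f\n", (total - avail) * 100 / total
        }
    ' "$meminfo"
}

# Estimate the seconds needed to reach the target usage.
# Arguments: total memory in MB, memory already used in MB,
# target percentage, rate in MB/s. The result is rounded up.
estimate_ramp_seconds() {
    awk -v total="$1" -v used="$2" -v pct="$3" -v rate="$4" 'BEGIN {
        if (rate <= 0) exit 1
        need = total * pct / 100 - used
        if (need < 0) need = 0
        secs = need / rate
        printf "%d\n", (secs == int(secs)) ? secs : int(secs) + 1
    }'
}
```

For example, on a hypothetical instance with 8,000 MB of memory of which 1,000 MB is in use, estimate_ramp_seconds 8000 1000 90 20 gives 310, so the usage would reach 90% after about 310 s and then hold for the 120 s duration. Running mem_used_percent every few seconds during the drill lets you follow the same curve that the console chart shows.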
Common OOM causes and solutions
High memory usage can cause system stuttering and slow down internal service responses. To resolve this issue, you can troubleshoot and analyze the causes of high memory usage. For more information, see What do I do if the memory usage of a Linux instance is high?.
An OOM event may occur because the instance has insufficient available memory or a resource is frequently requested, which leads to resource exhaustion. To resolve this issue, you can analyze the cause of the OOM event. For more information, see How do I handle OOM issues in a Linux instance?.