Elastic Compute Service: Out-of-memory (OOM) drill

Last Updated: Dec 16, 2025

Simulate an out-of-memory (OOM) scenario on an Elastic Compute Service (ECS) instance by injecting a process that continuously consumes memory. This drill helps you test whether business processes are terminated, evaluate system recovery capabilities, and verify that monitoring and alerting mechanisms work. Based on the drill results, you can develop response strategies so that your system quickly resumes normal operation after an OOM event in the production environment, which reduces the risk of business interruption.

What are OOM and the OOM Killer?

Out-of-memory (OOM) is a condition in which the operating system does not have enough available memory to satisfy the memory requests of a process. This can make the process or the whole system unstable. To handle this condition, Linux uses a kernel mechanism called the OOM Killer. The OOM Killer assigns each process a score (oom_score) and terminates the highest-scoring processes first, which are typically processes with high memory usage and low priority. This frees up memory and prevents uncontrolled business and system crashes caused by memory exhaustion.
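
To see which processes the OOM Killer would currently favor, you can read the per-process score files that the Linux kernel exposes under /proc. The following is a minimal sketch, not part of the Cloud Assistant plugin; the function name list_oom_candidates and its optional arguments are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
# List the processes with the highest oom_score, highest first.
# Output columns (tab-separated): oom_score, oom_score_adj, PID, command name.
# Arguments (both optional, illustrative):
#   $1  root of the proc filesystem (default: /proc)
#   $2  number of rows to print     (default: 10)
list_oom_candidates() {
    local proc_root="${1:-/proc}" limit="${2:-10}"
    local dir pid score adj name
    for dir in "$proc_root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        pid="${dir##*/}"
        # A process can exit between the glob and the read; skip it quietly.
        score="$(cat "$dir/oom_score" 2>/dev/null)" || continue
        adj="$(cat "$dir/oom_score_adj" 2>/dev/null)" || continue
        name="$(cat "$dir/comm" 2>/dev/null)" || continue
        printf '%s\t%s\t%s\t%s\n' "$score" "$adj" "$pid" "$name"
    done | sort -k1,1nr | head -n "$limit"
}

# Example (not run here):
#   list_oom_candidates /proc 5
```

Running the function before and during a drill shows how the scores shift as the injection process grows.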

Implementation principle

This solution uses the Cloud Assistant plugin ACS-ECS-HighMemory. Before injecting the fault, the plugin calculates the amount of memory to allocate. Then, it starts the trigger_oom injection process. This process consumes memory at a specific rate until it reaches the target memory usage.

When an OOM event occurs, the operating system selects a process to terminate based on its score. A process's score is calculated based on the memory it occupies and its oom_score_adj value. Therefore, during fault injection, you can adjust the oom_score_adj parameter of the injection process to control whether the operating system terminates the business process or the injection process when an OOM event occurs. The oom_score_adj parameter accepts values from -1000 to 1000. The default value is 0. A higher value makes the process more likely to be terminated. A value of -1000 prevents the OOM Killer from terminating the process.
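
The oom_score_adj value of any process can also be inspected or changed directly through /proc/<pid>/oom_score_adj, for example to protect a business process during a drill. The sketch below illustrates the kernel interface only and does not show how the plugin sets the value internally; the function names are hypothetical. Lowering the value typically requires root privileges.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) if the argument is an integer from -1000 to 1000.
is_valid_oom_score_adj() {
    local value="$1" digits
    digits="${value#-}"              # strip one optional leading minus sign
    case "$digits" in
        ''|*[!0-9]*) return 1 ;;     # empty, or contains a non-digit
    esac
    [ "$value" -ge -1000 ] && [ "$value" -le 1000 ]
}

# Write a validated oom_score_adj value for the given PID.
# Returns 2 if the value is outside the accepted range.
set_oom_score_adj() {
    local pid="$1" value="$2"
    if ! is_valid_oom_score_adj "$value"; then
        echo "oom_score_adj must be an integer from -1000 to 1000" >&2
        return 2
    fi
    printf '%s\n' "$value" > "/proc/$pid/oom_score_adj"
}

# Example (not run here; 1234 is a placeholder PID):
#   cat /proc/1234/oom_score_adj
#   sudo bash -c "$(declare -f is_valid_oom_score_adj set_oom_score_adj); set_oom_score_adj 1234 -500"
```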

Usage guide

Prerequisites

Fault injection

  1. Log on to the ECS instance.

    For more information, see Log on to a Linux instance using Workbench.

  2. As a user with sudo privileges, run the Cloud Assistant plugin ACS-ECS-HighMemory.

    sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params inject,[score=paramA],[percent=paramB],[rate=paramC],[duration=paramD]

    The square brackets [] in the command format indicate optional parameters. Do not include the brackets when you run the command.

    For example, to set the memory usage to 90% and the duration to 120 s, run the following command:

    sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params inject,percent=90,duration=120

    Parameters:

    • score (optional): Sets the oom_score_adj for the injection process. The value can range from -1000 to 1000. This parameter determines whether the injection process or the business process is terminated when an OOM event occurs. To terminate the business process, set a negative score for the injection process, such as -100. If the business process is still not terminated, lower the score further.

    • percent (optional): Specifies the target memory usage as a percentage of the total system memory. If you do not specify this parameter, an OOM event is triggered by default.

    • rate (optional): Specifies the rate of memory consumption in MB/s. The default value is 0, which indicates no limit.

    • duration (optional): Specifies the duration in seconds for which to maintain the target memory usage after it is reached. The memory is automatically released after the timeout. The default value is 300 s.

  3. Verify that the fault was injected successfully.

    • Scenario without an OOM event:

      • Run the top command. If the system memory usage reaches the target percentage, the fault was injected successfully.

      • On the instance monitoring page of the ECS console or in the CloudMonitor console, if the memory usage reaches the target percentage, the fault was injected successfully.

        image

    • Scenario with an OOM event: Search for logs that contain Out of memory in the system log.

      dmesg -T | grep "Out of memory"

      The following output indicates that an OOM event occurred on the ECS instance, which means the fault was injected successfully. Check whether the terminated process is the one you intended to terminate. If it is not, adjust the score parameter.

      image
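
The injection and verification steps above can be scripted. The following sketch assembles the --params value from the documented optional parameters and checks memory usage against the target by reading /proc/meminfo. The function names are hypothetical, and the percentage computed here from MemTotal and MemAvailable is an assumption that may differ slightly from the figures shown by top or CloudMonitor.

```shell
#!/usr/bin/env bash
# Build the --params value for ACS-ECS-HighMemory.
# Pass "" for any setting you want to omit so that the plugin default applies.
# Note: omitting percent triggers an OOM event, as described above.
build_inject_params() {
    local score="$1" percent="$2" rate="$3" duration="$4"
    local params="inject"
    if [ -n "$score" ];    then params="$params,score=$score"; fi
    if [ -n "$percent" ];  then params="$params,percent=$percent"; fi
    if [ -n "$rate" ];     then params="$params,rate=$rate"; fi
    if [ -n "$duration" ]; then params="$params,duration=$duration"; fi
    printf '%s\n' "$params"
}

# Print used memory as a percentage of MemTotal, based on MemAvailable.
# $1 (optional, illustrative): path to a meminfo file, default /proc/meminfo.
mem_used_percent() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0 && avail != "")
                 printf "%.1f\n", (total - avail) * 100 / total
             else
                 exit 1
         }' "$meminfo"
}

# Succeed (exit status 0) when the used percentage is at or above the target.
reached_target() {
    awk -v used="$1" -v target="$2" 'BEGIN { exit !(used >= target) }'
}

# Example (not run here):
#   sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory \
#       --params "$(build_inject_params "" 90 "" 120)"
#   reached_target "$(mem_used_percent)" 90 && echo "target reached"
```

Because memory usage ramps up gradually when rate is set, a check like reached_target is best run in a loop with a short sleep rather than once.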

Fault recovery

If you specified a target memory usage, you can use one of the following methods to remove the injected fault.

  • Method 1 (Recommended): Run the fault recovery command on the ECS instance. Verify that the memory usage drops to the level it was at before the fault injection.

    sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params recover

  • Method 2: Wait for the fault to be automatically released after the timeout. By default, the system automatically releases the memory 300 s after the target memory usage is reached.

If you did not specify a target memory usage, an OOM event is triggered. The system usually recovers automatically. However, if other processes were unexpectedly terminated, you may need to restart the ECS instance to restore them.
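
After recovery, one way to confirm that the fault is gone is to check that the injection process has exited. The sketch below assumes that the injection process appears with the command name trigger_oom, as in the top output described in this topic; the function names and the optional proc root argument are hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) if a process with the given command name exists.
#   $1  command name to look for, for example trigger_oom
#   $2  root of the proc filesystem (optional, default: /proc)
process_running() {
    local name="$1" proc_root="${2:-/proc}" f
    for f in "$proc_root"/[0-9]*/comm; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Poll once per second until the process is gone.
# Returns 0 when it has exited, 1 if it is still running after the timeout.
#   $1  command name
#   $2  timeout in seconds (optional, default: 30)
#   $3  root of the proc filesystem (optional, default: /proc)
wait_for_exit() {
    local name="$1" timeout="${2:-30}" proc_root="${3:-/proc}" waited=0
    while process_running "$name" "$proc_root"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (not run here):
#   sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params recover
#   wait_for_exit trigger_oom 30 && echo "injection process has exited"
```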

Drill example

  1. Inject a fault to achieve a memory usage of 90% at a rate of 20 MB/s for a duration of 120 s.

    sudo acs-plugin-manager --exec --plugin ACS-ECS-HighMemory --params inject,percent=90,rate=20,duration=120

    If output similar to the following is returned, the fault was successfully injected.

    image

  2. Verify the injection result.

    Check using the top command

    1. Run the top command. Then, press M to sort processes by memory usage and press m to display the memory usage progress bar.

      The output shows that the current memory usage is 90.2%, and the trigger_oom injection process is using 84.4% of the memory.

      image

    2. After the timeout, the system automatically releases the memory. The memory usage returns to its pre-drill level, and the injection process exits.

      image

    Check in the instance monitoring details in the console

    image

    In the instance details in the console, view the memory usage. After the fault is injected, the system memory usage increases at the specified rate. It reaches the target of 90% and remains there for 120 s. Then, the memory usage drops to its pre-injection level, and the drill is complete.
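
When rate is set, you can roughly estimate how long the ramp-up phase of a drill takes before the duration timer starts: divide the memory still to be consumed by the rate. The sketch below is an approximation under stated assumptions: all sizes are in MB, the instance size and current usage in the example are hypothetical, and the plugin's exact accounting may differ.

```shell
#!/usr/bin/env bash
# Estimate the seconds needed to reach the target memory usage.
#   $1  total memory in MB
#   $2  memory already in use in MB
#   $3  target usage as a percentage of total memory
#   $4  consumption rate in MB/s (must be greater than 0)
estimate_ramp_seconds() {
    awk -v total="$1" -v used="$2" -v target="$3" -v rate="$4" 'BEGIN {
        if (rate <= 0) {
            # rate=0 means no limit, so no meaningful estimate exists.
            exit 1
        }
        need = total * target / 100 - used
        if (need < 0) need = 0
        printf "%.0f\n", need / rate
    }'
}

# Example (not run here): a hypothetical instance with 8000 MB of memory and
# 1200 MB in use, target 90%, rate 20 MB/s:
#   estimate_ramp_seconds 8000 1200 90 20
```

For the hypothetical values in the example, 6000 MB must be consumed at 20 MB/s, so the ramp-up takes about 300 s. With duration=120, the whole drill would then last roughly 420 s.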

Common OOM causes and solutions

  • High memory usage can make the system sluggish and slow down internal service responses. To resolve this issue, identify and analyze the causes of the high memory usage. For more information, see What do I do if the memory usage of a Linux instance is high?.

  • An OOM event may occur because the instance has insufficient available memory or a resource is frequently requested, which leads to resource exhaustion. To resolve this issue, you can analyze the cause of the OOM event. For more information, see How do I handle OOM issues in a Linux instance?.