Alibaba Cloud Linux 2 that uses the kernel version kernel-4.19.91-24.al7 or later supports the group identity feature. You can use the group identity feature to configure different identities for CPU control groups (cgroups) to define the priorities of process tasks in the cgroups.

Prerequisites

Note
  • Alibaba Cloud Linux 2 that uses the kernel version kernel-4.19.91-26, kernel-4.19.91-26.1, kernel-4.19.91-26.2, or kernel-4.19.91-26.3 does not support the group identity feature because the feature is disabled in these kernels. You can run the uname -r command to view the kernel version of Alibaba Cloud Linux 2.
  • For Alibaba Cloud Linux 2 that uses a kernel version in the range of kernel-4.19.91-25.1.al7 to kernel-4.19.91-25.5.al7, using the group identity feature causes downtime. You must upgrade the kernel version to kernel-4.19.91-25.6.al7 or later. For more information, see the FAQ section of this topic.
  • For Alibaba Cloud Linux 2 that uses the kernel version kernel-4.19.91-26.4 or later, /proc/sys/kernel/sched_group_identity_enabled is added for you to enable the group identity feature. Before you can use the group identity feature, you must run the echo 1 > /proc/sys/kernel/sched_group_identity_enabled command to enable the feature.
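As a minimal sketch of the last step, the following shell function writes 1 to the switch only if the file exists, which according to the note above is the case on kernel-4.19.91-26.4 or later. The function name enable_group_identity and its optional path argument are illustrative and not part of the product. Writing to the real file requires root privileges.

```shell
# Sketch: enable the group identity feature if the kernel exposes the switch.
# $1 (optional) overrides the switch path; the default is the path named above.
enable_group_identity() {
    gi_switch="${1:-/proc/sys/kernel/sched_group_identity_enabled}"
    if [ ! -e "$gi_switch" ]; then
        # Older kernels do not provide the switch file.
        echo "no switch at $gi_switch (kernel: $(uname -r))"
        return 1
    fi
    # Write 1 to enable the feature, then report the value read back.
    echo 1 > "$gi_switch" && echo "enabled: $(cat "$gi_switch")"
}
```

Calling enable_group_identity without arguments targets the real switch file. Passing a path is intended only for trying the function against a scratch file.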

Background information

When latency-sensitive tasks and computing tasks are deployed on the same instance, the Linux kernel scheduler must provide more scheduling opportunities to high-priority tasks to minimize scheduling latency and the impacts of low-priority tasks on kernel scheduling. For this scenario, Alibaba Cloud Linux 2 provides the group identity feature and adds interfaces used to configure scheduling priorities to CPU cgroups. Tasks with different priorities have the following characteristics:
  • High-priority tasks have the minimum wakeup latency.
  • Low-priority tasks do not affect the performance of high-priority tasks.
    • The wakeup of low-priority tasks does not affect the performance of high-priority tasks.
    • Low-priority tasks do not affect the performance of high-priority tasks by sharing hardware units.

How the group identity feature works

The group identity feature allows you to configure identities for CPU cgroups to define the priorities of tasks in the cgroups. The feature relies on a dual red-black tree architecture: in addition to the red-black tree of the Completely Fair Scheduler (CFS) scheduling queue, a second red-black tree is added to store low-priority tasks.

When the kernel schedules the tasks that have identities, the kernel processes the tasks based on their priorities. The following table describes the identities in descending order of priority.
Identity Description
ID_HIGHCLASS Identifies a high-priority task. A high-priority task has more opportunities to preempt resources than a normal- or low-priority task.
When the CFS schedules high-priority tasks, the following situations may occur:
  • If a high-priority task wakes up while a low-priority task is running, the high-priority task can unconditionally preempt resources from the low-priority task.
  • If a high-priority task wakes up while a normal-priority task is running and the virtual runtime (vruntime) of the high-priority task is less than that of the normal-priority task, the high-priority task can ignore the original scheduling policy and preempt resources. The original scheduling policy specifies that a task cannot preempt resources when its runtime on a CPU is less than the minimum runtime.
  • When tasks queue up to run, if a low- or normal-priority task is running, a high-priority task whose vruntime is less than that of the running task can ignore the original scheduling policy and preempt resources. The original scheduling policy specifies that a task cannot preempt resources when its runtime on a CPU is less than the minimum runtime.
ID_NORMAL Identifies a normal-priority task. A normal-priority task has more opportunities to preempt resources than a low-priority task.
When the CFS schedules normal-priority tasks, the following situations may occur:
  • If a normal-priority task wakes up while a low-priority task is running, the normal-priority task can unconditionally preempt resources from the low-priority task.
  • When tasks queue up to run, if a low-priority task is running, a normal-priority task whose vruntime is less than that of the running task can ignore the original scheduling policy and preempt resources. The original scheduling policy specifies that a task cannot preempt resources when its runtime on a CPU is less than the minimum runtime.
ID_UNDERCLASS Identifies a low-priority task.
When the CFS schedules low-priority tasks, the following situation may occur:
  • If an ID_SMT_EXPELLER task runs on one CPU of a simultaneous multithreading (SMT) core, it evicts low-priority tasks from the peer CPU.
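The wakeup rules in the table can be summarized as a small decision function. This is only a model of the documented behavior for illustration, not the scheduler's implementation. The function name wakeup_decision and its integer vruntime arguments are hypothetical.

```shell
# Sketch: model the wakeup rules from the identity table.
# Usage: wakeup_decision WAKING_ID RUNNING_ID WAKING_VRUNTIME RUNNING_VRUNTIME
# Prints "preempt" when an identity rule lets the waking task preempt,
# or "cfs" when the ordinary CFS policy decides.
wakeup_decision() {
    wd_waking="$1"; wd_running="$2"
    wd_waking_vr="$3"; wd_running_vr="$4"
    case "$wd_waking:$wd_running" in
        ID_HIGHCLASS:ID_UNDERCLASS|ID_NORMAL:ID_UNDERCLASS)
            # A higher identity preempts a low-priority task unconditionally.
            echo preempt ;;
        ID_HIGHCLASS:ID_NORMAL)
            # Preempts only if its vruntime is smaller; the minimum-runtime
            # restriction of the original policy is ignored in that case.
            if [ "$wd_waking_vr" -lt "$wd_running_vr" ]; then
                echo preempt
            else
                echo cfs
            fi ;;
        *)
            echo cfs ;;
    esac
}
```

For example, wakeup_decision ID_HIGHCLASS ID_NORMAL 100 500 prints preempt, whereas swapping the two vruntime values prints cfs.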

The preceding identities apply based on the resource management policies of CPU cgroups.
  • For tasks in cgroups of the same level, identity priorities take effect.
  • For tasks in cgroups of different levels, the identity priority of the higher-level task does not take effect, and that of the lower-level task takes effect.
  • For tasks with the same identity priority, the tasks compete for resources according to CFS policies. Note that the runtime of tasks identified by ID_UNDERCLASS or ID_NORMAL is not guaranteed to reach the minimum runtime.

Other identities
Identity Description
ID_SMT_EXPELLER Identifies the SMT expeller. When an SMT expeller runs on an SMT CPU, it evicts the tasks that are identified by ID_UNDERCLASS from the peer CPU.
ID_IDLE_SEEKER Specifies that when a task wakes up, the task attempts to find idle CPUs within the limits of scheduler policies.
ID_IDLE_SAVER Used with the sched_idle_saver_wmark kernel parameter. You can use sched_idle_saver_wmark to set a watermark for CPU idle time. When a task identified by ID_IDLE_SAVER wakes up, the task attempts to find an idle CPU whose idle time exceeds the specified watermark.

Interfaces

  • Interfaces used to configure identities
    The group identity feature provides two interfaces for you to configure task identities: /sys/fs/cgroup/cpu/$cg/cpu.identity and /sys/fs/cgroup/cpu/$cg/cpu.bvt_warp_ns. The $cg variable specifies the child cgroup directory node where a task is located. Before you use the interfaces, take note of the following items:
    • The cpu.bvt_warp_ns interface is a quick configuration interface. The written value of this interface can be converted to the value of identity.
    • Both the cpu.identity and cpu.bvt_warp_ns interfaces can be used to change the identities of cgroups.
    • The value of identity written by using the cpu.identity interface overwrites the last value of identity written by using the cpu.bvt_warp_ns interface. The value of the cpu.bvt_warp_ns interface remains unchanged.
    • The value of identity written by using the cpu.bvt_warp_ns interface overwrites the last value of identity written by using the cpu.identity interface. The value of the cpu.identity interface remains unchanged.
    • You can use one of the interfaces to configure task identities. We recommend that you do not configure both of the interfaces.
    • If you are unfamiliar with operations related to the operating system kernel, we recommend that you do not use the cpu.identity interface.
    The following table describes the interfaces.
    Interface Description
    cpu.identity The default value is 0, which indicates the ID_NORMAL identity.
    The interface is a 5-bit field. Valid values of each bit: 0 and 1. 0 indicates not to assume the identity. 1 indicates to assume the identity. Description of each bit:
    • If the interface value is left empty, it indicates the ID_NORMAL identity.
    • Bit 0: indicates the ID_UNDERCLASS identity.
    • Bit 1: indicates the ID_HIGHCLASS identity.
    • Bit 2: indicates the ID_SMT_EXPELLER identity.
    • Bit 3: indicates the ID_IDLE_SAVER identity.
    • Bit 4: indicates the ID_IDLE_SEEKER identity.

    For example, if you want to set the identity of a cgroup to ID_HIGHCLASS and ID_IDLE_SEEKER, set bit 1 and bit 4 to 1 and the other bits to 0 to obtain a binary value of 10010, which is converted to a decimal value of 18. Then, run the echo 18 > /sys/fs/cgroup/cpu/$cg/cpu.identity command to write 18 to cpu.identity.

    cpu.bvt_warp_ns The default value is 0, which indicates the ID_NORMAL identity. Valid values:
    • 2: indicates the ID_SMT_EXPELLER, ID_IDLE_SEEKER, and ID_HIGHCLASS identities. The corresponding value in cpu.identity is 22.
    • 1: indicates the ID_HIGHCLASS and ID_IDLE_SEEKER identities. The corresponding value in cpu.identity is 18.
    • 0: indicates the ID_NORMAL identity. The corresponding value in cpu.identity is 0.
    • -1: indicates the ID_UNDERCLASS and ID_IDLE_SAVER identities. The corresponding value in cpu.identity is 9.
    • -2: indicates the ID_UNDERCLASS and ID_IDLE_SAVER identities. The corresponding value in cpu.identity is 9.
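The bit arithmetic behind cpu.identity, and the mapping from cpu.bvt_warp_ns values to cpu.identity values, can be sketched in shell as follows. The bit positions and mappings are taken from the tables above. The function names identity_bit, identity_value, and bvt_to_identity are hypothetical helpers, not interfaces provided by the kernel.

```shell
# Sketch: print the bit position of an identity name (from the cpu.identity table).
identity_bit() {
    case "$1" in
        ID_UNDERCLASS)   echo 0 ;;
        ID_HIGHCLASS)    echo 1 ;;
        ID_SMT_EXPELLER) echo 2 ;;
        ID_IDLE_SAVER)   echo 3 ;;
        ID_IDLE_SEEKER)  echo 4 ;;
        *) return 1 ;;
    esac
}

# Combine identity names into the decimal value to write to cpu.identity.
# With no arguments the result is 0, which indicates ID_NORMAL.
identity_value() {
    iv_value=0
    for iv_name in "$@"; do
        iv_bit=$(identity_bit "$iv_name") || {
            echo "unknown identity: $iv_name" >&2
            return 1
        }
        iv_value=$(( iv_value | (1 << iv_bit) ))
    done
    echo "$iv_value"
}

# Map a cpu.bvt_warp_ns value to the corresponding cpu.identity value.
bvt_to_identity() {
    case "$1" in
        2)     identity_value ID_SMT_EXPELLER ID_IDLE_SEEKER ID_HIGHCLASS ;;
        1)     identity_value ID_HIGHCLASS ID_IDLE_SEEKER ;;
        0)     identity_value ;;
        -1|-2) identity_value ID_UNDERCLASS ID_IDLE_SAVER ;;
        *)     return 1 ;;
    esac
}
```

For the example in the table, identity_value ID_HIGHCLASS ID_IDLE_SEEKER prints 18, which is the value to write to /sys/fs/cgroup/cpu/$cg/cpu.identity.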
  • Interfaces used to enable or disable scheduling features
    You can run the following command to view the default settings of kernel scheduling features by using the sched_features interface:
    cat /sys/kernel/debug/sched_features
    The following table describes the scheduling features.
    Scheduling feature Description Default value
    ID_IDLE_AVG This feature is used with the ID_IDLE_SAVER identity to count the runtime of ID_UNDERCLASS tasks towards idle time. This prevents resource waste by ensuring that no CPUs remain idle when only ID_UNDERCLASS tasks are running. ID_IDLE_AVG: indicates that the feature is enabled.
    ID_RESCUE_EXPELLEE This feature is used in load balancing scenarios. If tasks cannot find available CPU resources, CPUs that are evicting ID_UNDERCLASS tasks are used for balancing loads. This feature helps ID_UNDERCLASS tasks get out of the evicted state as soon as possible. ID_RESCUE_EXPELLEE: indicates that the feature is enabled.
    ID_EXPELLEE_NEVER_HOT After this feature is enabled, when a task that is being evicted decides whether to migrate to another CPU, hot cache will not be a reason for migration requests to be denied. This feature helps ID_UNDERCLASS tasks get out of the evicted state as soon as possible. NO_ID_EXPELLEE_NEVER_HOT: indicates that the feature is disabled.
    ID_LOOSE_EXPEL When this feature is enabled, CPUs do not update their eviction states every time they select tasks. Instead, the states are automatically updated at the interval specified by the sched_expel_update_interval kernel parameter. This feature affects only the state updates that occur when CPUs select tasks, not the updates triggered by inter-processor interrupts (IPIs). NO_ID_LOOSE_EXPEL: indicates that the feature is disabled.
    ID_LAST_HIGHCLASS_STAY When this feature is enabled, the last ID_HIGHCLASS task that runs on a CPU cannot be migrated to another CPU. ID_LAST_HIGHCLASS_STAY: indicates that the feature is enabled.
    ID_EXPELLER_SHARE_CORE When this feature is enabled, ID_SMT_EXPELLER tasks preferentially run on physical cores on which ID_SMT_EXPELLER tasks are already running. When this feature is disabled, ID_SMT_EXPELLER tasks are distributed across physical cores so that they do not interfere with each other. ID_EXPELLER_SHARE_CORE: indicates that the feature is enabled.
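To check a single scheduling feature rather than read the whole list, the output of sched_features can be scanned for the feature name or its NO_-prefixed form, following the naming shown in the table above. The function name sched_feature_state and its optional file argument are illustrative. On mainline Linux kernels, a feature is toggled by writing its name or its NO_-prefixed name to the same file. Whether every feature listed here can be toggled that way on this kernel is an assumption.

```shell
# Sketch: print "enabled", "disabled", or "unknown" for a scheduling feature.
# $1 is the feature name without the NO_ prefix.
# $2 (optional) overrides the features file; the default is the path named above.
sched_feature_state() {
    sf_feature="$1"
    sf_file="${2:-/sys/kernel/debug/sched_features}"
    # The file holds whitespace-separated names; NO_<name> marks a disabled feature.
    for sf_token in $(cat "$sf_file"); do
        case "$sf_token" in
            "$sf_feature")    echo enabled;  return 0 ;;
            "NO_$sf_feature") echo disabled; return 0 ;;
        esac
    done
    echo unknown
    return 1
}
```

For example, sched_feature_state ID_LOOSE_EXPEL reports disabled on a system that still uses the default values from the table.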
  • Interfaces used by sysctl to configure kernel parameters
    Some capabilities of the group identity feature depend on the values of kernel parameters. The following table describes the parameters.
    Kernel parameter Description Unit Default value
    /proc/sys/kernel/sched_expel_update_interval The interval at which the eviction state is automatically updated when a CPU selects tasks. This parameter is valid only when the ID_LOOSE_EXPEL feature is enabled. ms 10
    /proc/sys/kernel/sched_expel_idle_balance_delay The minimum idle balance interval when a CPU is evicting tasks. A value of -1 indicates that idle balance is not allowed. If only ID_UNDERCLASS tasks exist on a CPU and the tasks are being evicted, the CPU is idle. Idle balance is performed on this CPU to improve load balancing, but this may adversely affect ID_UNDERCLASS tasks. You can set the sched_expel_idle_balance_delay parameter to alleviate this issue. ms -1
    /proc/sys/kernel/sched_idle_saver_wmark The watermark for CPU idle time. When an ID_IDLE_SAVER task wakes up, the task attempts to find an idle CPU whose idle time exceeds the specified watermark. ns 0
    /proc/sys/kernel/sched_group_identity_enabled For the kernel version kernel-4.19.91-26.4 or later, /proc/sys/kernel/sched_group_identity_enabled is added for you to enable the group identity feature. Before you can use the group identity feature, you must run the echo 1 > /proc/sys/kernel/sched_group_identity_enabled command to enable the feature. After the group identity feature is enabled, if the cpu.bvt_warp_ns or cpu.identity value of the cgroup is not zero, data cannot be written to the /proc/sys/kernel/sched_group_identity_enabled interface. N/A 0
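The current values of these kernel parameters can be listed together with their units, as in the following sketch. The parameter names and units come from the table above. The function name show_group_identity_params and its optional directory argument are hypothetical. Parameters that the running kernel does not expose are reported as not available.

```shell
# Sketch: print the group identity kernel parameters with their units.
# $1 (optional) overrides the directory; the default is /proc/sys/kernel.
show_group_identity_params() {
    gp_dir="${1:-/proc/sys/kernel}"
    # Each entry is name:unit; an empty unit means the table lists N/A.
    for gp_entry in sched_expel_update_interval:ms \
                    sched_expel_idle_balance_delay:ms \
                    sched_idle_saver_wmark:ns \
                    sched_group_identity_enabled:; do
        gp_name="${gp_entry%%:*}"
        gp_unit="${gp_entry#*:}"
        if [ -r "$gp_dir/$gp_name" ]; then
            echo "$gp_name = $(cat "$gp_dir/$gp_name")${gp_unit:+ $gp_unit}"
        else
            echo "$gp_name: not available"
        fi
    done
}
```

To change a value, write to the corresponding file as the root user, for example echo 10 > /proc/sys/kernel/sched_expel_update_interval.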

Information output

When you use the group identity feature, you can run the following command to view various parameters:
cat /proc/sched_debug
The following table describes the output parameters.
Parameter Description
nr_high_running The number of ID_HIGHCLASS tasks that are running on the current CPU.
nr_under_running The number of ID_UNDERCLASS tasks that are running on the current CPU.
nr_expel_immune The number of non-ID_UNDERCLASS tasks that are running on the current CPU.
smt_expeller Indicates whether ID_SMT_EXPELLER tasks are running on the current CPU. A value of 1 indicates that ID_SMT_EXPELLER tasks are running on the current CPU. A value of 0 indicates that no ID_SMT_EXPELLER tasks are running on the current CPU.
on_expel Indicates whether ID_SMT_EXPELLER tasks are running on the peer SMT CPU. A value of 1 indicates that ID_SMT_EXPELLER tasks are running on the peer CPU. A value of 0 indicates that no ID_SMT_EXPELLER tasks are running on the peer CPU.
high_exec_sum The cumulative runtime of ID_HIGHCLASS tasks on the current CPU.
under_exec_sum The cumulative runtime of ID_UNDERCLASS tasks on the current CPU.
h_nr_expel_immune The number of non-ID_UNDERCLASS tasks that are running on cfs_rq.
expel_start The difference between the minimum vruntimes of the two red-black trees when the CPU starts to evict tasks.
expel_spread The cumulative difference between the minimum vruntimes of the two red-black trees caused by CPU eviction states.
min_under_vruntime The minimum vruntime of the low-priority red-black tree.
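Because /proc/sched_debug contains many unrelated fields, it can help to filter the output for the parameters in the table. The following sketch does this with grep. The function name group_identity_stats and its optional file argument are illustrative, and the exact layout of each output line is not assumed.

```shell
# Sketch: show only the group identity fields of the scheduler debug output.
# $1 (optional) overrides the input file; the default is /proc/sched_debug.
group_identity_stats() {
    gs_file="${1:-/proc/sched_debug}"
    # The nr_expel_immune pattern also matches h_nr_expel_immune.
    grep -E 'nr_high_running|nr_under_running|nr_expel_immune|smt_expeller|on_expel|high_exec_sum|under_exec_sum|expel_start|expel_spread|min_under_vruntime' "$gs_file"
}
```

The function prints the matching lines unchanged, so the values keep the per-CPU and per-cfs_rq order in which the kernel reports them.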

FAQ

How do I upgrade from a kernel version in the range of kernel-4.19.91-25.1.al7 to kernel-4.19.91-25.5.al7 to kernel-4.19.91-25.6.al7 or later?

Solution:
  1. Log on to the instance.

    For more information, see Connect to a Linux instance by using a password or key.

  2. Run the following command to query the kernel version:
    uname -r
  3. Run the following command to upgrade the kernel version:
    yum update kernel
  4. Run the following command to restart the instance to make the new kernel version take effect:
    reboot
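Before you upgrade, you can check whether the running kernel falls in the affected range. The following sketch matches the version string against the range stated in this topic. The function name kernel_needs_upgrade is hypothetical, and the pattern assumes that uname -r prints the version without the kernel- prefix and possibly with an architecture suffix after al7.

```shell
# Sketch: succeed if the kernel version is in the range 25.1.al7 to 25.5.al7.
# $1 (optional) is a version string; the default is the running kernel.
kernel_needs_upgrade() {
    case "${1:-$(uname -r)}" in
        4.19.91-25.[1-5].al7*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, kernel_needs_upgrade && yum update kernel runs the upgrade only on an affected kernel. The reboot in the last step is still required for the new kernel to take effect.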