Plugsched: Live Update Linux Kernel Scheduler

This article gives an overview of OpenAnolis Plugsched and its benefits.

By Cloud Kernel Developers of OpenAnolis.

Plugsched is an SDK that enables live updates of the Linux kernel scheduler. It can dynamically replace the scheduler subsystem without rebooting the system or restarting applications, with downtime on the order of milliseconds. Plugsched helps developers dynamically add, delete, and modify kernel scheduling features in a production environment, which makes it possible to customize the scheduler for specific scenarios. The live-update capability also enables rollback.

Live updating the scheduler with Plugsched provides strong modification capability without changing the kernel source code, and it supports older kernel versions that are already running online. If reserved fields are added in advance to the key data structures of the kernel scheduler code, the data structures can be modified as well, which further improves the modification capability.

Open-Source Link of Plugsched

What is the background of Plugsched, and what problems does it aim to solve? There are four points:

  • Different policies fit different scenarios: Applications are extremely diverse and their characteristics vary widely (throughput-oriented workloads, microsecond-scale latency-critical workloads, soft real-time, and energy-efficiency requirements). This makes optimizing scheduling policies complex, and no one-size-fits-all strategy exists. It is therefore necessary to let users customize the scheduler for different scenarios.
  • The scheduler evolves slowly: The Linux kernel has been evolving and iterating for many years and has a large code base. The scheduler is one of the kernel's core subsystems; its structure is complex and tightly coupled with other OS subsystems, which makes development and debugging even harder. Linux rarely merges new scheduling classes and would be especially unlikely to accept a scenario-specific or non-generic scheduler, so scheduler development in upstream is slow.
  • Updating the kernel is hard: The scheduler is built into the kernel, so applying changes to the scheduler requires updating the kernel. The kernel release cycle is usually several months, so changes cannot be deployed quickly. Updating the kernel is even more expensive in a cluster, where it involves application migration and machine downtime.
  • A subsystem cannot be updated on its own: kpatch and livepatch are live-update techniques that work at function granularity; they have weak expressive ability and cannot implement complex code changes. eBPF is widely used in kernel networking but does not yet support the scheduler well, and even if it did, it would only allow small modifications to scheduling policies, so its expressive ability would also be weak.

Plugsched extracts the scheduler from the kernel as a kernel module, which can then replace the scheduler built into the kernel. New features and optimizations can be developed in a more agile way using the module and applied to the production environment without interrupting business.

Figure 1: Plugsched Business without Interruption

Plugsched has the following six advantages:

  • Decoupling from Kernel Release: The scheduler version is decoupled from the kernel version, and different business scenarios can use different scheduling strategies. This establishes a continuous O&M capability that speeds up scheduling bug fixes and the rollout of policy optimizations, and it accelerates innovation and iteration in the scheduler subsystem.
  • Strong Modification Capability: It can implement complex scheduling features and optimization strategies.
  • Simple Maintenance: It requires no changes to the kernel source code, or only slight ones, which keeps the kernel mainline clean. Non-generic scheduling policies are maintained independently outside the kernel source tree and are packaged as RPMs for release and deployment.
  • Simple to Use: A containerized SDK development environment and one-click RPM generation make development and testing simple and efficient.
  • Backward Compatible: It supports older kernel versions, so existing online businesses can benefit from new technologies promptly.
  • Efficient Performance: Millisecond-level downtime and negligible overhead.

Plugsched Application Case

Compared with kpatch and livepatch, which are live-update techniques at function granularity, Plugsched has stronger modification capability and a wider live-update scope that covers a whole subsystem. Plugsched can be used for bug fixes, performance optimization, and adding, deleting, or modifying features. Thanks to this strong modification capability, it can be applied to the following scenarios:

  • Quickly develop, verify, and release new features, and merge them into the kernel mainline once they are stable.
  • Customize and optimize for specific business scenarios, publish and maintain non-generic scheduler features using RPM packages.

Application Case 1: New Group Identity Scheduling Feature

Group Identity is a scheduling feature used by Alibaba Cloud in hybrid deployment scenarios. Based on the CFS scheduler, it adds a red-black tree to store low-priority tasks and assigns a default priority to each cgroup, which users can reconfigure. When high-priority tasks exist in the queue, low-priority tasks stop running. We used Plugsched to live update the scheduler of an older ANCK 4.19 kernel (without Group Identity) and ported Group Identity to the generated scheduler module, which involved seven files and more than 2,500 modified lines.

After installing this scheduler module, create two CPU cgroups in the system (cgroup A and cgroup B), bind them to the same CPU, set them to the highest and lowest priority respectively, and start one busy-loop task in each. In theory, whenever the task in cgroup A is running, the task in cgroup B stops running. The top tool then shows that only one busy-loop task is running, at 100% CPU utilization, which indicates that the Group Identity feature in the module has taken effect. After the module is dynamically removed, the two busy-loop tasks each occupy 50% of the CPU, which indicates that the module is no longer in effect.

Application Case 2: Decoupling from Kernel Release and Customizing the Scheduler

The older kernel used by an Alibaba Cloud customer showed excessive CPU utilization caused by an unreasonable load-statistics algorithm in the kernel scheduler. The bug fixes had been merged into the kernel mainline, but a new kernel version had not yet been released, and the business side did not intend to update the kernel, because a large number of businesses were deployed in the cluster and the cost of updating the kernel was high.

In addition, the customer's kernel developers had made targeted optimizations for their mixed business scenarios (Group Identity scheduling features) and wanted to merge them into the kernel mainline. Alibaba Cloud kernel developers found that the optimizations caused performance regressions in other scenarios and were therefore non-generic, so they could not be merged into the mainline.

As a result, the customer's kernel developers used Plugsched to port all the optimizations and fixes to the scheduler module and deployed them on a large scale. This case shows two advantages of Plugsched: decoupling from the kernel release and customizing the scheduler.

How to Use Plugsched

Plugsched currently supports Anolis OS 7.9 ANCK by default; other operating systems require adjusting the boundary configurations. To reduce the complexity of setting up a running environment, we provide container images and Dockerfiles, so developers do not need to build a development environment themselves. For convenience, we purchased an Alibaba Cloud ECS instance (64 CPUs + 128 GB) and installed Anolis OS 7.9 ANCK on it. We will live update its kernel scheduler.

1.  Log in to the cloud server and install the necessary basic software packages.

# yum install anolis-repos -y
# yum install podman kernel-debuginfo-$(uname -r) kernel-devel-$(uname -r) --enablerepo=Plus-debuginfo --enablerepo=Plus -y

2.  Create a temporary working directory and download the source code of the kernel.

# mkdir /tmp/work
# uname -r
4.19.91-25.2.an7.x86_64
# cd /tmp/work
# wget https://mirrors.openanolis.cn/anolis/7.9/Plus/source/Packages/kernel-4.19.91-25.2.an7.src.rpm
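
The source RPM name in step 2 follows from the running kernel release. The sketch below shows one way to derive it instead of typing it by hand. The helper names src_rpm_name and src_rpm_url are hypothetical, and the mirror path is simply the one used above, so it may not hold for other Anolis OS releases.

```shell
# Sketch: build the source RPM name and URL from a kernel release string.
# src_rpm_name and src_rpm_url are hypothetical helper names.

src_rpm_name() {
    # $1: kernel release as printed by `uname -r`,
    #     e.g. 4.19.91-25.2.an7.x86_64
    release=$1
    # Strip the trailing ".<arch>" component to get version-release.
    echo "kernel-${release%.*}.src.rpm"
}

src_rpm_url() {
    # Assumption: same mirror directory as in step 2.
    base=https://mirrors.openanolis.cn/anolis/7.9/Plus/source/Packages
    echo "$base/$(src_rpm_name "$1")"
}

# Example with the release shown above.
src_rpm_url 4.19.91-25.2.an7.x86_64
```

On a live system, `wget "$(src_rpm_url "$(uname -r)")"` would fetch the matching package, provided the mirror carries that release.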

3.  Start the container and spawn a shell.

# podman run -itd --name=plugsched -v /tmp/work:/tmp/work -v /usr/src/kernels:/usr/src/kernels -v /usr/lib/debug/lib/modules:/usr/lib/debug/lib/modules docker.io/plugsched/plugsched-sdk
# podman exec -it plugsched bash
# cd /tmp/work

4.  Extract the kernel source code.

# plugsched-cli extract_src kernel-4.19.91-25.2.an7.src.rpm ./kernel

5.  Boundary analysis and extraction.

# plugsched-cli init 4.19.91-25.2.an7.x86_64 ./kernel ./scheduler

6.  The extracted scheduler code is in ./scheduler/kernel/sched/mod. Add a new sched_feature and package it into an RPM.

diff --git a/scheduler/kernel/sched/mod/core.c b/scheduler/kernel/sched/mod/core.c
index 9f16b72..21262fd 100644
--- a/scheduler/kernel/sched/mod/core.c
+++ b/scheduler/kernel/sched/mod/core.c
@@ -3234,6 +3234,9 @@ static void __sched notrace __schedule(bool preempt)
    struct rq *rq;
    int cpu;

+    if (sched_feat(PLUGSCHED_TEST))
+        printk_once("I am the new scheduler: __schedule\n");
+
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    prev = rq->curr;
diff --git a/scheduler/kernel/sched/mod/features.h b/scheduler/kernel/sched/mod/features.h
index 4c40fac..8d1eafd 100644
--- a/scheduler/kernel/sched/mod/features.h
+++ b/scheduler/kernel/sched/mod/features.h
@@ -1,4 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
+SCHED_FEAT(PLUGSCHED_TEST, false)
+
/*
 * Only give sleepers 50% of their service deficit. This allows
 * them to run sooner, but does not allow tons of sleepers to

# plugsched-cli build /tmp/work/scheduler

7.  Copy the scheduler RPM to the host, exit the container, and view the current sched_features.

# cp /usr/local/lib/plugsched/rpmbuild/RPMS/x86_64/scheduler-xxx-4.19.91-25.2.an7.yyy.x86_64.rpm /tmp/work
# exit
exit
# cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS NO_WA_STATIC_WEIGHT UTIL_EST ID_IDLE_AVG ID_RESCUE_EXPELLEE NO_ID_EXPELLEE_NEVER_HOT NO_ID_LOOSE_EXPEL ID_LAST_HIGHCLASS_STAY

8.  Install the scheduler RPM. The new feature is now present but disabled.

# rpm -ivh /tmp/work/scheduler-xxx-4.19.91-25.2.an7.yyy.x86_64.rpm
# lsmod | grep scheduler
scheduler             503808  1
# dmesg | tail -n 10
[ 2186.213916] cni-podman0: port 1(vethfe1a04fa) entered forwarding state
[ 6092.916180] Hi, scheduler mod is installing!
[ 6092.923037] scheduler: total initialization time is        6855921 ns
[ 6092.923038] scheduler module is loading
[ 6092.924136] scheduler load: current cpu number is               64
[ 6092.924137] scheduler load: current thread number is           667
[ 6092.924138] scheduler load: stop machine time is            249471 ns
[ 6092.924138] scheduler load: stop handler time is            160616 ns
[ 6092.924138] scheduler load: stack check time is              85916 ns
[ 6092.924139] scheduler load: all the time is                1097321 ns
# cat /sys/kernel/debug/sched_features
NO_PLUGSCHED_TEST GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS NO_WA_STATIC_WEIGHT UTIL_EST ID_IDLE_AVG ID_RESCUE_EXPELLEE NO_ID_EXPELLEE_NEVER_HOT NO_ID_LOOSE_EXPEL ID_LAST_HIGHCLASS_STAY
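
In the sched_features listing, a disabled feature appears with a NO_ prefix, an enabled one appears under its plain name, and a feature the running scheduler does not know is missing altogether. The following sketch makes that check scriptable. feat_state is a hypothetical helper name, and the sample string is a shortened copy of the output above.

```shell
# Sketch: report the state of one scheduler feature.
# feat_state NAME TEXT prints "on", "off", or "absent", where TEXT is
# the content of /sys/kernel/debug/sched_features.
feat_state() {
    name=$1
    features=$2
    # Unquoted on purpose: split the listing into whitespace-separated words.
    for word in $features; do
        case $word in
            "$name")    echo on;  return 0 ;;
            "NO_$name") echo off; return 0 ;;
        esac
    done
    echo absent
}

# Shortened sample of the listing shown after installing the RPM.
sample="NO_PLUGSCHED_TEST GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY"
feat_state PLUGSCHED_TEST "$sample"        # prints: off
feat_state GENTLE_FAIR_SLEEPERS "$sample"  # prints: on
```

On the host, `feat_state PLUGSCHED_TEST "$(cat /sys/kernel/debug/sched_features)"` should print "absent" before the RPM is installed and "off" right after.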

9.  Enable the new feature; "I am the new scheduler: __schedule" then appears in dmesg.

# echo PLUGSCHED_TEST > /sys/kernel/debug/sched_features
# dmesg | tail -n 5
[ 6092.924138] scheduler load: stop machine time is            249471 ns
[ 6092.924138] scheduler load: stop handler time is            160616 ns
[ 6092.924138] scheduler load: stack check time is              85916 ns
[ 6092.924139] scheduler load: all the time is                1097321 ns
[ 6512.539300] I am the new scheduler: __schedule
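
Writing a feature name to sched_features enables it, and writing the same name with a NO_ prefix disables it again. A minimal sketch of a toggle helper follows. feat_word, set_feat, and the SCHED_FEATURES variable are hypothetical names introduced here, and writing to the real debugfs file needs root.

```shell
# Sketch: enable or disable a scheduler feature by name.
# SCHED_FEATURES can be overridden, for example to try this on a scratch file.
SCHED_FEATURES=${SCHED_FEATURES:-/sys/kernel/debug/sched_features}

# feat_word NAME on|off prints the word that has to be written.
feat_word() {
    case $2 in
        on)  echo "$1" ;;
        off) echo "NO_$1" ;;
        *)   echo "usage: feat_word NAME on|off" >&2; return 1 ;;
    esac
}

# set_feat NAME on|off writes that word to $SCHED_FEATURES.
set_feat() {
    word=$(feat_word "$1" "$2") || return 1
    echo "$word" > "$SCHED_FEATURES"
}

# Show what would be written to switch the test feature off again.
feat_word PLUGSCHED_TEST off   # prints: NO_PLUGSCHED_TEST
```

Because the diff above uses printk_once, the message is logged only the first time the feature is seen enabled, so switching it off and on again should not add a second line to dmesg.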

10.  Remove the scheduler RPM; the new feature is removed with it.

# rpm -e scheduler-xxx
# dmesg | tail -n 8
[ 6717.794923] scheduler module is unloading
[ 6717.809110] scheduler unload: current cpu number is               64
[ 6717.809111] scheduler unload: current thread number is           670
[ 6717.809112] scheduler unload: stop machine time is            321757 ns
[ 6717.809112] scheduler unload: stop handler time is            142844 ns
[ 6717.809113] scheduler unload: stack check time is              74938 ns
[ 6717.809113] scheduler unload: all the time is               14185493 ns
[ 6717.810189] Bye, scheduler mod has be removed!
#
# cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS NO_WA_STATIC_WEIGHT UTIL_EST ID_IDLE_AVG ID_RESCUE_EXPELLEE NO_ID_EXPELLEE_NEVER_HOT NO_ID_LOOSE_EXPEL ID_LAST_HIGHCLASS_STAY
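
The load and unload logs report their timings in nanoseconds, which makes them hard to compare at a glance with the millisecond-level downtime mentioned earlier. The sketch below converts those log lines to milliseconds. timings_ms is a hypothetical helper name, and it assumes the log format shown in steps 8 and 10.

```shell
# Sketch: convert "scheduler load/unload: ... time is N ns" lines to ms.
# Reads dmesg-style text on stdin and ignores every other line.
timings_ms() {
    awk '/scheduler (load|unload): .* time is +[0-9]+ ns/ {
        line = $0
        sub(/^.*scheduler /, "", line)   # drop the timestamp prefix
        n = split(line, f, " ")          # f[n-1] is the value, f[n] is "ns"
        label = f[1]
        for (i = 2; i <= n - 3; i++)     # rebuild the label before "is"
            label = label " " f[i]
        printf "%s = %.3f ms\n", label, f[n - 1] / 1000000
    }'
}

# Sample lines copied from the dmesg output in steps 8 and 10.
timings_ms <<'EOF'
[ 6092.924138] scheduler load: stop machine time is            249471 ns
[ 6092.924139] scheduler load: all the time is                1097321 ns
[ 6717.809112] scheduler unload: stop machine time is            321757 ns
[ 6717.809113] scheduler unload: all the time is               14185493 ns
EOF
```

For the sample lines this prints 0.249 ms and 1.097 ms for the load, and 0.322 ms and 14.185 ms for the unload. On a live system the same summary comes from `dmesg | timings_ms`.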

Note: Do not unload the scheduler module directly with the "rmmod" command! Use the standard "rpm" or "yum" commands to remove the scheduler package.

Now we know what Plugsched is and how it has been applied, but how does it work?

The scheduler subsystem is built into the kernel rather than being an independent module, and it is highly coupled to other parts of the kernel. Plugsched applies the idea of modularization: it provides a boundary analyzer that determines the boundary of the scheduler subsystem and extracts the scheduler from the kernel code into a separate directory. Developers can modify the extracted scheduler code, compile it into a new scheduler module, and dynamically replace the old scheduler in the running system. Boundary analysis and code extraction have to handle both functions and data in order to generate an independent module.

For functions, the scheduler module exports a set of key functions through which execution enters the module; these are called interface functions. By replacing these functions in the kernel, the kernel bypasses the original execution logic and enters the new scheduler module, which completes the function update. Functions compiled into the scheduler module are either interface functions or insiders; all other functions are called outsiders.

For data, the scheduler's important data, such as runqueue state and sched class state, can be reinitialized automatically using the state rebuild technique; this data is private data, while all other data is shared data. For flexibility, Plugsched also allows users to manually define private data, in which case the module retains the definitions of this data but must initialize it.

Plugsched classifies struct fields that are accessed only by the scheduler as inner fields, and all others as non-inner fields. The scheduler module may change the semantics of inner fields but must not change the semantics of non-inner fields. The module may even change the size of a whole data structure if all of its fields are inner fields. Last but most important, we recommend using the reserved fields of data structures rather than modifying existing ones.

Plugsched Design

Plugsched mainly consists of two parts, which form the core of the whole design: boundary analysis and code extraction for the scheduler subsystem, and live update of the scheduler module. The architecture is shown below:

Figure 2: Architecture of Plugsched

The first part is boundary analysis and code extraction for the scheduler module. The scheduler itself is not a module, so its boundary must be determined before it can be modularized. According to the boundary configuration (source code files, interface functions, and so on), the boundary analyzer extracts the scheduler code from the kernel source into a specified directory that serves as the code base. Developers can then modify the code and customize the scheduler. Finally, the scheduler module is compiled and packaged into an RPM that can be installed on the system. After installation, the module replaces the original scheduler built into the kernel. Installation goes through the following key steps:

  • Symbol Relocation: Relocate the undefined symbols in the scheduler module.
  • Stack Safety Check: As with kpatch, stacks must be inspected before functions are redirected; otherwise the system may crash. Plugsched runs the stack inspection in parallel, which improves efficiency and reduces downtime.
  • Redirections: Dynamically replace interface functions in the kernel with the corresponding functions in the module.
  • Scheduler State Rebuild: Automatically synchronize state between the old and new schedulers, which greatly simplifies keeping data state consistent.

Summary

On the whole, Plugsched frees the scheduler from the kernel. Developers can build specialized schedulers without being limited to the kernel's generic scheduler. Kernel maintenance becomes easier because kernel developers only need to focus on developing and iterating the generic scheduler, while customized schedulers can be released as RPM packages. The kernel scheduler code also becomes cleaner, since it is no longer mixed with scenario-specific optimizations.

In the future, Plugsched will support newer kernel versions and other platforms, improve its ease of use, and provide more application cases. Welcome to Plugsched!
