A printk deadlock refers to a situation in which multiple threads in the system wait for each other to release resources and cannot continue when the printk function in the Linux kernel is called to print logs. printk deadlocks adversely affect the operation of system applications and services and cause kernel downtime. Therefore, preventing and resolving printk deadlocks at the earliest opportunity is critical to ensuring system stability and reliability. This topic describes why a downtime issue occurs in Alibaba Cloud Linux due to a printk deadlock and how to resolve the issue.
Problem description
When a downtime issue occurs in the kernel of the Alibaba Cloud Linux operating system, the analysis of the vmcore file reveals the following symptoms.
When the downtime issue occurs, a dump file named vmcore is generated. You can view kernel logs in the vmcore file, obtain the calltrace information that starts with "Call Trace:" to analyze the cause, and then troubleshoot the issue.
The kernel logs displayed after the dmesg command is run contains
warninglogs that are related to scheduling and work_queue.The calltrace information of a specific process has the following characteristics:
The functions that were last called are intended to acquire spinlocks, such as the
_raw_spin_lock,queued_spin_lock_slowpath, andraw_spin_rq_lock_netstedfunctions.The system calls the
printkfunction,console_unlockfunction, and the preceding functions that acquire spinlocks in sequence.
Cause
The issue is caused by a deadlock that occurs when the printk mechanism of the Linux community is used. A printk deadlock is unlikely to occur, but may occur in kernel version 5.10.134-16.3 of Alibaba Cloud Linux 3.
Why does a printk deadlock occur?
If the printk function is called to print kernel logs after the kernel holds the spinlock of work_queue or runqueue (rq), printk calls the underlying Direct Rendering Manager (DRM) driver to attempt to lock work_queue or rq again. As a result, a printk deadlock occurs, which causes a system downtime issue.
NoteFor information about how the DRM driver attempts to lock an object, refer to the drm/fb-helper: Add fb_deferred_io support patch.
Why do
warninglogs related to scheduling and work_queue appear?When the kernel holds the spinlock of work_queue or rq and calls the printk function to print logs, printk prints warning messages about scheduling and work_queue to kernel logs. This is why the demesg command output contains warning logs for scheduling and work_queue.
Why does kernel version
5.10.134-16.3of Alibaba Cloud Linux 3 have a higher probability of printk deadlocks?Warninglogs for scheduling and work_queue are printed only in a few scenarios. In kernel version5.10.134-16.3of Alibaba Cloud Linux 3, the asynchronous unthrottle feature that is backported from the Linux community has a regression defect. The defect increases the probability of printingwarninglogs and results in a high probability of printk deadlocks in Alibaba Cloud Linux 3.
Affected versions
printk deadlocks are a known issue introduced in the drm/fb-helper: Add fb_deferred_io support patch in Linux 4.10 released by the Linux community.
The issue exists in kernel versions
4.19and5.10of Alibaba Cloud Linux.The probability of this issue is high in kernel version
5.10.134-16.3of Alibaba Cloud Linux 3.
Solution
Run the following command to change the log level to prevent the printk function from printing warning logs to a serial port.
If your journal system captures logs on a serial port but not kernel logs printed by running the dmesg command, exercise caution when you change the log level.
Change in the log level causes the
warninglogs on the serial port to be missing, but does not affect thewarninglogs in the kernel logs that are printed by using the dmesg command.
sysctl -w kernel.printk="<console_loglevel> <default_message_loglevel> <minimum_console_loglevel> <default_console_loglevel>" >> /etc/sysctl.confValid values of the kernel.printk parameter:
console_loglevel: specifies a log level. The printk function prints the logs whose log levels are higher than the specified log level on the serial port.default_message_loglevel: specifies the default log level. If no log level is specified, the printk function is called to print logs of the default level.minimum_console_loglevel: specifies the minimum value of theconsole_loglevelparameter.default_console_loglevel: specifies the default value of theconsole_loglevelparameter.
Linux defines eight log levels of the kernel logs. A lower log level indicates a higher priority. To prevent warning logs from being printed on the serial port, we recommend that you set the console_loglevel parameter to a value less than or equal to 4. Sample command:
sysctl -w kernel.printk="4 4 1 7" >> /etc/sysctl.confThe following log levels are available:
#define LOGLEVEL_EMERG 0 /* system is unusable */
#define LOGLEVEL_ALERT 1 /* action must be taken immediately */
#define LOGLEVEL_CRIT 2 /* critical conditions */
#define LOGLEVEL_ERR 3 /* error conditions */
#define LOGLEVEL_WARNING 4 /* warning conditions */
#define LOGLEVEL_NOTICE 5 /* normal but significant condition */
#define LOGLEVEL_INFO 6 /* informational */
#define LOGLEVEL_DEBUG 7 /* debug-level messages */