Alibaba Cloud Linux 3 supports the Unified Kernel Fault Event Framework (UKFEF) in kernel version 5.10.60-9.al8.x86_64. UKFEF collects exception events on Alibaba Cloud Linux 3 that may pose risks and generates event reports in a unified format. This topic describes the events that UKFEF collects, the methods that UKFEF uses to generate event reports, and the UKFEF control interfaces.

Background information

An operating system often shows recognizable symptoms or emits error messages before a serious issue occurs. During O&M, you can use this information to predict faults and take precautions. However, the information is generated in different formats and is scattered across different modules. As a result, you may encounter the following issues when you collect exception events on an operating system:
  • Operating system expertise is required to parse exception events and assess their impact.
  • Exception events are reported in a variety of formats, which makes automated O&M more difficult. Automated O&M tools must match each collected exception event against its format and then cleanse the data to filter out what is not needed.

To resolve the preceding issues, Alibaba Cloud Linux 3 provides UKFEF at the kernel layer. UKFEF collects information from a variety of exception events that may give rise to risks, determines the severity of the events, and then generates event reports in a unified format. These reports include the scenarios in which the issues occur and the recommended risk levels. This makes it easier to identify system exceptions during O&M. UKFEF also classifies known exception events and provides system risk reports that were unavailable in previous kernel versions.

UKFEF generates reports along multiple dimensions, such as the type, impact, and statistics of exception events, so that you can diagnose exceptions efficiently during O&M. In addition, UKFEF generates event reports by using more than one method to prevent data loss.

Event description

The following table describes the event types and event levels classified by UKFEF and the methods for generating event reports.
Event information | Description
Event type | UKFEF collects the following common events in the operating system kernel:
  • soft lockup
  • Read-Copy Update (RCU) stall
  • hung task
  • global Out of Memory (OOM)
  • cgroup OOM
  • page allocation failure
  • list corruption
  • bad mm_struct
  • I/O error
  • EXT4-fs error
  • Machine Check Exception (MCE)
  • fatal signal
  • warning
  • panic
Event level | UKFEF classifies exception events into three levels:
  • Slight: The exception event does not affect the normal running of the operating system, but services deployed on the operating system may experience jitter. You can continue to monitor the event for changes.
  • Normal: The exception event may affect the current application process. We recommend that you take measures such as terminating, restarting, or migrating the current application process.
  • Fatal: The exception event may cause fatal damage to the operating system. We recommend that you immediately migrate your workloads.
Event report | UKFEF generates event reports by using the following methods:
  • Kernel logs show the details of a single event in the following format:
    class Fault event[module:type]:messages. At cpu cpuid, task pid(cmdline). Total fault: cnt
    The details include the following parameters:
    • class: The level of the exception event.
    • module: The module to which the exception event belongs, including sched, mem, io, fs, net, and hardware. When an exception is caused by multiple modules, the value of this parameter is general.
    • type: The type of the exception event.
    • messages: The custom message of the exception event.
    • cpuid: The ID of the CPU on which the exception event occurs.
    • pid(cmdline): The process ID (pid) and command line (cmdline) of the process associated with the exception event.
      Note: If the value of pid is -1, no corresponding process is running.
    • cnt: The total number of occurrences of exception events of the current type since system startup.
  • The /proc/fault_events file shows the cumulative number of occurrences of exception events, in total, by level, and by type. Example:
    Total fault events: 0
    Slight: 0
    Normal: 0
    Fatal: 0
    soft lockup: 0
    rcu stall: 0
    hung task: 0
    global oom: 0
    cgroup oom: 0
    page allocation failure: 0
    list corruption: 0
    bad mm_struct: 0
    io error: 0
    ext4 fs error: 0
    mce: 0
    fatal signal: 0
    warning: 0
    panic: 0
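
Both report formats above are regular enough to parse with a short script. The following Python sketch is one possible approach, not part of UKFEF. It assumes that real kernel log lines follow the single-event template literally, including spacing and punctuation, and that /proc/fault_events consists of `name: count` lines as in the example. The function names `parse_fault_line` and `parse_fault_events` and the sample log line are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumption: kernel log lines match the documented template exactly.
# Adjust the pattern if real output uses different spacing or punctuation.
FAULT_LINE_RE = re.compile(
    r"(?P<level>Slight|Normal|Fatal) Fault event"
    r"\[(?P<module>[^:\]]+):(?P<type>[^\]]+)\]:\s*(?P<messages>.*?)\. "
    r"At cpu (?P<cpuid>\d+), task (?P<pid>-?\d+)\((?P<cmdline>.*?)\)\. "
    r"Total fault: (?P<cnt>\d+)"
)


def parse_fault_line(line: str) -> Optional[Dict[str, object]]:
    """Extract the template fields from one kernel log line, or return None."""
    match = FAULT_LINE_RE.search(line)
    if match is None:
        return None
    fields: Dict[str, object] = dict(match.groupdict())
    for key in ("cpuid", "pid", "cnt"):
        fields[key] = int(fields[key])  # numeric fields become integers
    return fields


def parse_fault_events(text: str) -> Dict[str, int]:
    """Turn the 'name: count' lines of /proc/fault_events into a dict."""
    counters: Dict[str, int] = {}
    for raw in text.splitlines():
        name, sep, value = raw.strip().rpartition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


if __name__ == "__main__":
    # Hypothetical line built from the template, not captured from a real system.
    sample = ("Slight Fault event[sched:soft lockup]:example message. "
              "At cpu 2, task 1234(myapp). Total fault: 1")
    print(parse_fault_line(sample))
    try:
        with open("/proc/fault_events") as f:
            print(parse_fault_events(f.read()))
    except OSError:
        print("/proc/fault_events is not available on this kernel")
```

A monitoring agent could call `parse_fault_events` periodically and alert when the Fatal counter increases, and use `parse_fault_line` on kernel log output to attach the CPU and process to each event.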

Control interfaces

Interface | Description
/proc/sys/kernel/fault_event_enable | Specifies whether UKFEF is enabled. Valid values:
  • 1: UKFEF is enabled.
  • 0: UKFEF is disabled.
/proc/sys/kernel/fault_event_print | Specifies whether UKFEF generates event reports. Valid values:
  • 1: UKFEF generates event reports.
  • 0: UKFEF does not generate event reports.
/proc/sys/kernel/panic_on_fatal_event | Specifies whether the operating system triggers a kernel panic when a Fatal event occurs. Valid values:
  • 1: A kernel panic is triggered.
  • 0: A kernel panic is not triggered.
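
The control interfaces are ordinary files under /proc/sys/kernel, so they can be read and written like any other sysctl file. The following Python sketch shows one way to inspect and change them. The helper names `read_knob` and `write_knob` and the `base` parameter are hypothetical and exist only for illustration. The sketch assumes that each file holds a single integer and that writes are performed as root.

```python
from pathlib import Path
from typing import Optional

SYSCTL_DIR = Path("/proc/sys/kernel")
# The three interface files listed in the table above.
KNOBS = ("fault_event_enable", "fault_event_print", "panic_on_fatal_event")


def read_knob(name: str, base: Path = SYSCTL_DIR) -> Optional[int]:
    """Return the integer value of a UKFEF switch, or None if the file is missing."""
    try:
        return int((base / name).read_text().strip())
    except FileNotFoundError:
        return None


def write_knob(name: str, enabled: bool, base: Path = SYSCTL_DIR) -> None:
    """Write 1 or 0 to a UKFEF switch. Typically requires root on a real system."""
    if name not in KNOBS:
        raise ValueError(f"unknown UKFEF interface: {name}")
    (base / name).write_text("1\n" if enabled else "0\n")


if __name__ == "__main__":
    # Read-only by default: print the current state of each switch.
    for knob in KNOBS:
        value = read_knob(knob)
        state = "not available" if value is None else str(value)
        print(f"{knob}: {state}")
```

For example, calling `write_knob("panic_on_fatal_event", True)` as root would make the operating system panic on the next Fatal event, so it is best tested on a non-production instance first.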