System Analyse Kit (SysAK) provides tools for routine monitoring, online issue diagnostics, and system failure recovery on Alibaba Cloud operating systems. SysAK was built from years of experience operating and maintaining millions of servers.
SysAK runs with minimal overhead. All tools combined use at most 3% CPU, and each individual tool uses at most 1% CPU. SysAK does not add load overhead or cause network jitter.
SysAK hooks functions in the kernel for runtime diagnostics and monitoring, which may cause system instability. Select an appropriate maintenance window to run diagnostic and monitoring commands.
Quick start
After you install SysAK, try these common commands:
sysak list -a # List all available tools
sysak loadtask -s # Show current system load summary
sysak iofsstat -T 10 # Monitor disk I/O for 10 seconds
sysak memleak -t slab -c # Quick check for slab memory leaks
sysak nosched -t 20 -s 30 # Detect scheduling delays > 20ms for 30s
sysak irqoff -t 5 60 # Detect interrupt-off periods > 5ms for 60s
sysak pingtrace -c <ip> # Trace network latency to a host
sysak memgraph -g # Show memory usage chart
sudo sysak mservice -S # Start system monitoring
sysak mservice -l # View monitoring data interactively
SysAK operates in two modes:
| Mode | Behavior |
|---|---|
| Monitoring | Runs in the background. Collects and tracks system metrics continuously. |
| Diagnostics | Runs on demand. Analyzes root causes of system issues in real time. |
O&M scenarios
SysAK covers three operations and maintenance (O&M) scenarios:
Routine monitoring: Monitor system resources, schedule and manage workloads, and control workload resources at a fine granularity. Track interrupts and jitter in real time.
Issue diagnostics: Diagnose abnormal loads, network jitter, memory leaks, I/O hangs, and performance exceptions online.
Failure recovery: Isolate and recover from partial failures such as deadlocks and crashes.
Install SysAK
Prerequisites
Linux with kernel version 3.10 or later (Alibaba Cloud Linux 2, Alibaba Cloud Linux 3, Anolis OS 8.4 ANCK, or CentOS 7)
x86_64 architecture
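As a rough sketch, the two prerequisites above can be checked with a short shell snippet. The function names kernel_at_least and sysak_prereqs_ok are illustrative and are not part of SysAK. The snippet assumes the kernel release string has the usual MAJOR.MINOR prefix that uname -r prints.

```shell
#!/bin/sh
# Return 0 if a kernel release string is at least MAJOR.MINOR.
# Example release string: "4.19.91-26.al7.x86_64" (the format printed by `uname -r`).
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}            # text before the first dot
    rest=${release#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}          # leading digits of the remainder
    [ "$major" -gt "$want_major" ] && return 0
    [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]
}

# Check the two documented prerequisites: kernel 3.10 or later, and x86_64.
sysak_prereqs_ok() {
    kernel_at_least "$(uname -r)" 3 10 && [ "$(uname -m)" = "x86_64" ]
}

if sysak_prereqs_ok; then
    echo "prerequisites met"
else
    echo "prerequisites not met: $(uname -r) on $(uname -m)"
fi
```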
Run uname -a to check the kernel version of your Elastic Compute Service (ECS) instance.
Alibaba Cloud Linux 2
Option 1: Install from the YUM repository (recommended)
Check available versions:
yum search sysak
Install the latest version:
sudo yum install -y sysak
Option 2: Install from an RPM package (if the Alibaba Cloud YUM repository is unavailable)
Download the latest RPM package that matches your kernel version:
Visit the Open Source Image Site to find RPM packages for your kernel version.
wget https://mirrors.openanolis.cn/sysak/packages/sysak-1.3.0-2.x86_64.rpm
Install the package:
sudo rpm -ivh --nodeps sysak-1.3.0-2.x86_64.rpm
Anolis OS 8.4 ANCK
Download the latest RPM package that matches your kernel version:
Visit anolis / sysak to find RPM packages for your kernel version.
wget https://mirrors.openanolis.cn/sysak/packages/sysak-1.3.0-2.x86_64.rpm
Install the package:
sudo rpm -ivh --nodeps sysak-1.3.0-2.x86_64.rpm
Other Linux distributions (kernel 3.10 or later)
For other Linux distributions such as CentOS 7, build SysAK from source. Only the open source version is available, and compatibility issues may occur. Visit anolis / sysak for build instructions.
Verify the installation
Confirm that SysAK is working:
sysak help
Expected output:
Usage: sysak [ cmd ] [ subcmd [ cmdargs ] ]
List all available tools:
sysak list -a
Common commands
| Command | Description |
|---|---|
| sysak help | Display usage information. Syntax: sysak [cmd] [subcmd [cmdargs]] |
| sysak list -a | List all supported tool features |
| sysak [subcmd] -h | Display help for a specific tool |
cmd: Management commands such as list and help.
subcmd: Tool-specific feature commands.
cmdargs: Arguments for tool commands.
System monitoring
Start monitoring
Start monitoring with one of the following methods:
Run the monitoring service directly:
sudo sysak mservice -S
Add SysAK as a persistent system service that starts automatically on boot:
sudo systemctl enable sysak
sudo systemctl start sysak
Configure monitoring
The configuration file is located at /usr/local/sysak/sysakmon.conf. After modifying the file, restart the service:
systemctl restart sysak
Configuration options:
| Option | Description | Default |
|---|---|---|
| server_mode http,local | Monitoring mode. http: expose metrics over HTTP. local: store and view data locally. | - |
| cron_period 60 | Sampling period in local mode (seconds) | 60 |
| output_file_path | Log storage path in local mode | /usr/local/sysak/tsar.data |
| mod_xxx on\|off | Enable (on) or disable (off) a specific metric | - |
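The sketch below shows one way to change an option from a script. It assumes the configuration file holds one whitespace-separated "option value" pair per line, as the option column above suggests. That layout is an assumption, so check your own sysakmon.conf before relying on it. The helper name set_conf_option is hypothetical, and the demo runs against a temporary file rather than the real configuration.

```shell
#!/bin/sh
# set_conf_option FILE KEY VALUE
# Rewrite the line whose first field is KEY, or append "KEY VALUE" if absent.
# Assumes one "key value" pair per line (an assumption about the file format).
set_conf_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        $1 == k { print k, v; found = 1; next }   # rewrite the matching line
        { print }                                  # keep every other line
        END { if (!found) print k, v }             # append if the key was absent
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Demo on a scratch file with illustrative contents.
demo=$(mktemp)
printf 'server_mode local\ncron_period 60\n' > "$demo"
set_conf_option "$demo" cron_period 30
set_conf_option "$demo" output_file_path /tmp/tsar.data
cat "$demo"
rm -f "$demo"
```

Against the real file, the same helper would be run as root on /usr/local/sysak/sysakmon.conf, followed by systemctl restart sysak so that the change takes effect.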
View monitoring data
| Mode | Command | Description |
|---|---|---|
| HTTP | curl http://127.0.0.1:9200/metrics/raw/ | System monitoring data |
| HTTP | curl http://127.0.0.1:9200/metrics/cgroup/raw | Control groups (cgroups) monitoring data |
| HTTP | curl http://127.0.0.1:9200/metrics/cgroup/$cgroupid/raw | Data for a specific cgroup |
| Local | sysak mservice -l | Interactive monitoring view |
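For scripted collection in HTTP mode, a small wrapper around curl can build the endpoint from the table above. This is a minimal sketch: port 9200 and the paths are taken from the table, while the function names and the scope argument are illustrative.

```shell
#!/bin/sh
# sysak_metrics_url HOST [system|cgroup] [CGROUP_ID]
# Print the metrics URL for a monitored host (paths as listed in the table).
sysak_metrics_url() {
    host=$1 scope=${2:-system} cgroup_id=${3:-}
    case $scope in
        system) printf 'http://%s:9200/metrics/raw/\n' "$host" ;;
        cgroup)
            if [ -n "$cgroup_id" ]; then
                printf 'http://%s:9200/metrics/cgroup/%s/raw\n' "$host" "$cgroup_id"
            else
                printf 'http://%s:9200/metrics/cgroup/raw\n' "$host"
            fi ;;
        *) echo "unknown scope: $scope" >&2; return 1 ;;
    esac
}

# Fetch metrics; --max-time keeps an unreachable host from hanging the call.
fetch_sysak_metrics() {
    curl --silent --show-error --max-time 5 "$(sysak_metrics_url "$@")"
}

# Example (requires a running monitoring service in http mode):
# fetch_sysak_metrics 127.0.0.1 cgroup
```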
For HTTP mode, replace 127.0.0.1 with the IP address of the monitored ECS instance.
Monitoring metrics reference
Metrics marked with a provider name are implemented by SysAK itself or by kernel features of Alibaba Cloud Linux and Anolis OS.
System resources
Computing resources
| Category | Metric | Description |
|---|---|---|
| CPU | user | User-mode CPU utilization |
| CPU | sys | System-mode CPU utilization |
| CPU | hirq | CPU utilization servicing hardware interrupts |
| CPU | sirq | CPU utilization servicing software interrupts |
| LOAD | load* | Average system load over the past 1 minute, 5 minutes, or 15 minutes |
Memory resources
| Category | Metric | Description |
|---|---|---|
| Memory | free | Amount of unused memory |
| Memory | used | Amount of used memory |
| Memory | buffer | Amount of memory used as buffers |
| Memory | cache | Amount of memory used as cache |
| Memory | total | Total memory |
| Memory | mem.util | Memory usage percentage |
| Swap | swpin | Number of pages swapped in |
| Swap | swapout | Number of pages swapped out |
| Swap | total | Total swap pages |
| Swap | swap.util | Swap usage percentage |
I/O resources
| Category | Metric | Description |
|---|---|---|
| I/O access | rrqms | Merged read requests per second |
| I/O access | wrqms | Merged write requests per second |
| I/O access | rs | Read requests per second |
| I/O access | ws | Write requests per second |
| I/O access | rsecs | Sectors read per second |
| I/O access | wsecs | Sectors written per second |
| I/O access | rqsize | Average request size |
| I/O access | qusize | Average request queue length |
| I/O access | svctm | Average I/O service duration |
| I/O access | io.util | Percentage of elapsed time during which I/O requests were issued to the device (device utilization) |
| Disk space | bfree | Unused data blocks |
| Disk space | bused | Used data blocks |
| Disk space | btotl | Total data blocks |
| Disk space | patition.util | Partition usage |
| Disk space | ifree | Available inodes |
| Disk space | itotl | Total inodes |
| Disk space | iutil | Inode usage |
Network resources
| Category | Metric | Description |
|---|---|---|
| Network traffic | bytin | Received bytes |
| Network traffic | bytout | Sent bytes |
| Network traffic | pktin | Total received packets |
| Network traffic | pktout | Total sent packets |
| TCP | active | Active TCP connections |
| TCP | pasive | Passive TCP connections |
| TCP | iseg | Received TCP packets |
| TCP | outseg | Sent TCP packets |
| UDP | idgm | Received UDP packets |
| UDP | odgm | Sent UDP packets |
System bottlenecks
I/O bottleneck
| Category | Metric | Description |
|---|---|---|
| Read/write latency | await | Average I/O waiting time |
| Read/write latency | rawait | Average I/O read waiting time |
| Read/write latency | wawait | Average I/O write waiting time |
Memory bottleneck
| Category | Metric | Description |
|---|---|---|
| Cache reclaim and defragmentation | kswapd | Number of times Kernel Swap Daemon (kswapd) reclaims pages |
| Cache reclaim and defragmentation | pg_kr | Pages asynchronously reclaimed |
| Cache reclaim and defragmentation | pg_dr | Pages directly reclaimed |
| Cache reclaim and defragmentation | kcompd | Number of times kcompactd compacts memory |
| Cache reclaim and defragmentation | dc_all | Number of direct memory compaction events |
| Cache reclaim and defragmentation | dc_fin | Number of completed direct memory compactions |
| Cache reclaim and defragmentation | oom | Number of out-of-memory (OOM) errors |
Network bottleneck
| Category | Metric | Description |
|---|---|---|
| Network transmission | pkterr | Error packets |
| Network transmission | pktdrp | Dropped packets |
| Network transmission | EstReset | Resets during ESTABLISHED TCP connections |
| Network transmission | AtmpFail | Failed TCP connection attempts |
| Network transmission | retran | TCP retransmission rate |
| Network transmission | noport | Nonexistent UDP ports or addresses |
| Network transmission | idmerr | Invalid UDP packets |
CPU bottleneck
| Category | Metric | Description | Provided by |
|---|---|---|---|
| Multitask concurrency | cswch | Number of context switches | - |
| Multitask concurrency | proc | Number of fork system calls | - |
| Ready queue delays | rqslow.dltnum | Times the ready queue wait exceeded the threshold | SysAK |
| Ready queue delays | rqslow.dlttm | Total latency when ready queue wait exceeded the threshold | SysAK |
System software bottleneck
| Category | Metric | Description | Provided by |
|---|---|---|---|
| Kernel critical resources | noschd.dltnum | Times CPU system-mode duration exceeded the threshold | SysAK |
| Kernel critical resources | noschd.dlttm | Total latency when CPU system-mode duration exceeded the threshold | SysAK |
System interruptions
| Category | Metric | Description | Provided by |
|---|---|---|---|
| Interrupt disable latency | irqoff.dltnum | Times the interrupt disable period exceeded the threshold | SysAK |
| Interrupt disable latency | irqoff.dlttm | Total latency when the interrupt disable period exceeded the threshold | SysAK |
Container metrics
These metrics are collected per container.
Computing resources
| Category | Metric | Description | Provided by |
|---|---|---|---|
| CPU | usr/sys/hriq/sirq | CPU utilization in user mode, system mode, hardware interrupts, and software interrupts | - |
| Load | nrun | Ready tasks in the container | Alibaba Cloud Linux and Anolis OS |
| Load | nunint | Tasks in the uninterruptible (D) state in the container | Alibaba Cloud Linux and Anolis OS |
| Load | load* | Average container load over the past 1 minute, 5 minutes, or 15 minutes | Alibaba Cloud Linux and Anolis OS |
Memory resources
| Category | Metric | Description | Provided by |
|---|---|---|---|
| Memory | total/free/used/cache/buffer | Total, available, used, cache, and buffer memory in the container | - |
| Memory bottleneck | pgfault | Page faults in the container | - |
| Memory bottleneck | pgmajfault | Major page faults, which require disk I/O (for example, swapping in a page or reading a file-backed mapping) | - |
| Memory bottleneck | mfailcnt | Failed memory allocation requests in the container | - |
| Memory bottleneck | drgl* | Global memory reclaim latency distribution | Alibaba Cloud Linux and Anolis OS |
| Memory bottleneck | drml* | Container memory reclaim latency distribution | Alibaba Cloud Linux and Anolis OS |
| Memory bottleneck | dcl* | Container memory compaction latency distribution | Alibaba Cloud Linux and Anolis OS |
I/O resources
| Category | Metric | Description | Provided by |
|---|---|---|---|
| I/O | riops | Read operations in the container | - |
| I/O | wiops | Write operations in the container | - |
| I/O | rbps | Bytes read from the container | - |
| I/O | wbps | Bytes written to the container | - |
| I/O | rwait | Read operation waiting time | Alibaba Cloud Linux and Anolis OS |
| I/O | wwait | Write operation waiting time | Alibaba Cloud Linux and Anolis OS |
| I/O | rsrv | Read service time | Alibaba Cloud Linux and Anolis OS |
| I/O | wsrv | Write service time | Alibaba Cloud Linux and Anolis OS |
| I/O | rioq | Queued read operations | Alibaba Cloud Linux and Anolis OS |
| I/O | wioq | Queued write operations | Alibaba Cloud Linux and Anolis OS |
| I/O | rioqsz | Bytes in queued read operations | Alibaba Cloud Linux and Anolis OS |
| I/O | wioqsz | Bytes in queued write operations | Alibaba Cloud Linux and Anolis OS |
| I/O | rarqsz | Average bytes per read operation | Alibaba Cloud Linux and Anolis OS |
| I/O | warqsz | Average bytes per write operation | Alibaba Cloud Linux and Anolis OS |
Hardware resources
| Category | Metric | Description |
|---|---|---|
| Resource bottleneck | llcref | Last Level Cache (LLC) accesses in the container |
| Resource bottleneck | llcmis | LLC misses in the container |
| Resource bottleneck | CPI | CPI (Cycles Per Instruction) in the container |
Diagnostics tools
System scanning
ossre_client
Automatically scans for potential issues across the system.
sysak ossre_client [ -a ] [ -p ] [ -i ]
| Option | Description |
|---|---|
| -a | Scan the entire system |
| -p | Scan for panic events only |
| -i | Scan for known issues only |
Some options can be used with the ossre server.
CPU and scheduling issues
loadtask
Diagnoses system load by identifying the processes with the highest loads and their causes.
sysak loadtask [ -m maxload ] [ -i interval ] [ -f outfile ] [ -d ] [ -s ] [ -g ]
| Option | Description | Default |
|---|---|---|
| -m maxload | Load threshold. Triggers automatic diagnostics when breached. If omitted, diagnoses immediately. | Immediate |
| -i interval | Scan interval in seconds (monitoring mode) | - |
| -f outfile | Output file path | /var/log/sysak/loadtask.log |
| -d | In monitoring mode: save all data when values exceed maxload (without -d, SysAK exits after the first detection) | Off |
| -s | Show load summary in the console | Off |
| -g | Generate a flame graph for the entire system | Off |
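The -m option already provides threshold-triggered diagnostics in monitoring mode. For a one-shot variant, for example from a cron job, the sketch below reads the 1-minute load average from /proc/loadavg and runs loadtask only when a threshold is exceeded. The function names are illustrative, and only the -s and -f options from the table are used.

```shell
#!/bin/sh
# load_exceeds LOAD THRESHOLD
# Succeed when LOAD (which may be fractional) is strictly greater than THRESHOLD.
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 > max + 0) }'
}

# diagnose_if_loaded MAX [OUTFILE]
# Run a one-shot loadtask diagnosis only if the 1-minute load is above MAX.
diagnose_if_loaded() {
    max=$1 outfile=${2:-/var/log/sysak/loadtask.log}
    read -r load1 rest < /proc/loadavg   # first field is the 1-minute load average
    if load_exceeds "$load1" "$max"; then
        sysak loadtask -s -f "$outfile"
    else
        echo "load $load1 is within threshold $max; nothing to do"
    fi
}

# Example: diagnose_if_loaded 20
```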
nosched
Diagnoses tasks that cannot be scheduled in a timely manner because the CPU has run in kernel mode for an extended period.
sysak nosched [--help] [-t THRESH(ms)] [-f LOGFILE] [-s duration(s)]
| Option | Description | Default |
|---|---|---|
| -t THRESH | Threshold for unscheduled time (milliseconds). Events exceeding this value are recorded. | 10 |
| -f LOGFILE | Log file path | /var/log/sysak/nosched/nosched.log |
| -s duration | Program run duration (seconds). Runs indefinitely if omitted. | Indefinite |
irqoff
Diagnoses interrupts that are disabled for an extended period.
sysak irqoff [--help] [-t THRESH(ms)] [-f LOGFILE] [duration(s)]
| Option | Description | Default |
|---|---|---|
| -t THRESH | Threshold for interrupt-disabled time (milliseconds). Events exceeding this value are recorded. | 10 |
| -f LOGFILE | Log file path | /var/log/sysak/irqoff/irqoff.log |
| duration | Program run duration (seconds). Runs indefinitely if omitted. | Indefinite |
runqslower
Diagnoses high task scheduling latency.
sysak runqslower [-s SPAN] [-t TID] [-f LOGFILE] [-P] [THRESH]
| Option | Description | Default |
|---|---|---|
| -s SPAN | Program run duration (seconds). Runs indefinitely if omitted. | Indefinite |
| THRESH | Threshold for preemption time (milliseconds). Events exceeding this value are recorded. | 50 |
| -f LOGFILE | Log file path | /var/log/sysak/runqslow/runqslow.log |
| -t TID | Filter to a specific thread ID. Monitors all threads if omitted. | All threads |
| -P | Record the name and TID of the previously preempted task | Off |
cpuirq
Shows interrupt binding and execution status for a CPU.
sysak cpuirq [-c cpu -b ] [ -t [ -i interval ] ]
| Option | Description |
|---|---|
| -c cpu | Specify a CPU |
| -b | Show interrupt binding information for the specified CPU |
| -t | Show the request with the most interrupts over a time period |
| -i interval | Data collection interval |
softirq
Records the running status (count or rate) of soft interrupts in the system.
sysak softirq [ option ] [ args ]
| Option | Description |
|---|---|
| -s | Source file containing initial data |
| -r | Output file |
Memory issues
memleak
Checks for kernel memory leaks (slab, vmalloc, and buddy allocator) and identifies where leaks occur.
sysak memleak [-t type] [-i interval] [-c]
| Option | Description | Default |
|---|---|---|
| -t type | Memory leak type: slab, vmalloc, or page | - |
| -i interval | Diagnostic period (seconds) | 300 |
| -c | Quick diagnostics mode. Determines whether memory is leaked without identifying exact locations. | Off |
mmaptrace
Identifies user-mode memory leak locations and provides call stacks for memory allocation requests.
The mmaptrace tool requires a separate component download. Run sysak list -a to check whether this tool is installed.
sysak mmaptrace [ option ] [ args ]
| Option | Description |
|---|---|
| -p <pid> | Monitor memory allocation for a specific process |
| -l | Monitor memory sizes requested by malloc and mmap |
| -s | Show the call stack for user-mode memory requests |
memgraph
Analyzes and visualizes memory usage.
sysak memgraph [ option ]
| Option | Description |
|---|---|
| -g | Show the memory usage chart |
| -f | Show page cache details |
| -a | Show anonymous memory details |
| -k | Check for memory leaks |
| -l | Show memory usage by system threads |
| -c | Show memory usage by system cgroups |
I/O issues
iosdiag
Diagnoses I/O latency and I/O hang conditions.
sysak iosdiag [ options ] subcmd [ cmdargs ]
Options:
| Option | Description |
|---|---|
| -u url | Upload diagnostic logs to the specified URL using curl. Logs are not uploaded if omitted. |
| -s latency\|hangdetect | Stop diagnostics for the specified subcommand |
Subcommands:
| Subcommand | Description |
|---|---|
| latency | Enable I/O latency diagnostics |
| hangdetect | Enable I/O hang diagnostics |
| -h | Show supported parameters (use after a subcommand) |
iofsstat
Collects disk I/O information at process and file granularity.
sysak iofsstat [-h] [-T TIMEOUT] [-t TOP] [-u UTIL_THRESH] [-b BW_THRESH] [-i IOPS_THRESH] [-c CYCLE] [-d DEVICE] [-p PID] [-j] [-f]
| Option | Description |
|---|---|
| -T TIMEOUT | Run duration (seconds) |
| -t TOP | Number of top I/O-consuming disks to display |
| -u UTIL_THRESH | I/O utilization threshold. Disks below this threshold are ignored. |
| -b BW_THRESH | Bandwidth threshold. Disks below this threshold are ignored. |
| -i IOPS_THRESH | IOPS threshold. Disks below this threshold are ignored. |
| -c CYCLE | Refresh interval (seconds) |
| -d DEVICE | Disk name to monitor |
| -p PID | Process ID to monitor |
| -j, --json | Output in JSON format |
| -f, --fs | Monitor and report partition information |
Network issues
pingtrace
Detects and traces network latency.
sysak pingtrace [ options ]
| Option | Description | Default |
|---|---|---|
| -v, --version | Show the version number | - |
| -h, --help | Show help information | - |
| -s, --server | Run in server mode | - |
| -c, --client ip | Run in client mode | - |
| -C, --count UINT | Number of probe packets | Unlimited |
| -i <interval_us> | Packet send interval (microseconds) | - |
| -t <UINT> | Program run duration (seconds) | - |
| -m, --maxdelay us | Ping latency threshold (microseconds). Only packets exceeding this latency are recorded. | 0 |
| -b <INT=556> | Probe packet size in bytes. Must be greater than 144. | 556 |
| --log TEXT=./pingtrace.log | Log file name | ./pingtrace.log |
| --logsize INT | Maximum log file size | - |
| --logbackup INT=3 | Maximum number of log file backups | 3 |
| --mode auto/pingpong/compact | PingTrace running mode | - |
| -o, --output image/json/log/imagelog | Output format | - |
| -n, --namespace | Check net namespace information | - |
| --nslocal | Indicate that client and server run on the same host (prevents redundant data when checking namespaces) | - |
| --userid UINT | Assign different user IDs per host to help resolve time desynchronization | - |
| --debug | Show debugging information (such as libbpf data) | - |
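Two details in the table are easy to get wrong: -i and -m take microseconds, and -b must be greater than 144 bytes. The sketch below validates those inputs and prints a client command line without running it. The helper names are hypothetical, and 192.0.2.10 in the example is a documentation placeholder address.

```shell
#!/bin/sh
# Succeed only for probe sizes accepted by -b (strictly greater than 144 bytes).
valid_probe_size() {
    case $1 in ''|*[!0-9]*) return 1 ;; esac   # reject non-numeric input
    [ "$1" -gt 144 ]
}

# pingtrace takes -i and -m in microseconds; convert from whole milliseconds.
ms_to_us() {
    echo $(( $1 * 1000 ))
}

# pingtrace_client_cmd SERVER_IP MAX_DELAY_MS [SIZE]
# Print (not run) a client command line; SIZE defaults to the documented 556.
pingtrace_client_cmd() {
    server_ip=$1 max_delay_ms=$2 size=${3:-556}
    valid_probe_size "$size" || { echo "probe size must be > 144" >&2; return 1; }
    echo "sysak pingtrace -c $server_ip -m $(ms_to_us "$max_delay_ms") -b $size"
}

# Example: record only probes slower than 2 ms.
pingtrace_client_cmd 192.0.2.10 2
```

The printed command is meant for the client host. Based on the -s and -c options above, the target host would run sysak pingtrace -s first.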
skcheck
Checks for TCP and socket leaks.
sysak skcheck [ options ] [ cmdargs ]
| Option | Description | Default |
|---|---|---|
| -s | Enable leak detection | - |
| -i | Socket enable threshold | 2000 |
| -l | Socket disable threshold | 500 |
Performance analysis
numa_access
Shows process information for a specified PID and Non-Uniform Memory Access (NUMA) information for a CPU.
sysak numa_access [ options ] [ cmdargs ]
| Option | Description |
|---|---|
| -p <pid> | Specify a process ID |
| -c <cpu> | Specify a CPU |
| -i <time> | Set a display interval |
hw_event
Shows container hardware events.
The hw_event tool requires a separate component download. Run sysak list -a to check whether this tool is installed.
sysak hw_event [ options ] [ cmdargs ]
| Option | Description | Default |
|---|---|---|
| -c <name> | Name of a container. If omitted, hardware events for all containers are displayed. | All containers |
| -s <time> | Run duration (seconds) | 5 |
syscall_slow
Analyzes lock contention among application threads when system call response times are slow.
sysak syscall_slow [-t THRESH(ms)] [-n sys_NR] <[-c COMM] [-p tid]> [-f LOGFILE] [duration(s)]
| Option | Description | Default |
|---|---|---|
| -t THRESH | System call response time threshold (milliseconds). Events exceeding this value are recorded. | 10 |
| -n sys_NR | Exclude specified system call IDs from tracing. Traces all system calls if omitted. | All |
| -c COMM / -p tid | Specify a task name (-c) or a thread ID (-p). Exactly one of the two is required; they cannot be combined. | Required |
| -f LOGFILE | Log file path | /var/log/sysak/syscall_slow/syscall_slow.log |
| duration | Program run duration (seconds). Runs indefinitely if omitted. | Indefinite |
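Because exactly one of -c and -p must be given, a wrapper can catch a bad invocation before the tool starts. The following is a minimal sketch that only prints the command line. The function name syscall_slow_cmd and its argument order are illustrative, and nginx and 4321 are placeholder values.

```shell
#!/bin/sh
# syscall_slow_cmd COMM TID [THRESH_MS] [DURATION_S]
# Pass an empty string for whichever of COMM / TID is not used.
syscall_slow_cmd() {
    comm=$1 tid=$2 thresh_ms=${3:-10} duration_s=${4:-}
    if [ -n "$comm" ] && [ -n "$tid" ]; then
        echo "specify a task name or a thread ID, not both" >&2; return 1
    elif [ -z "$comm" ] && [ -z "$tid" ]; then
        echo "a task name or a thread ID is required" >&2; return 1
    fi
    if [ -n "$comm" ]; then target="-c $comm"; else target="-p $tid"; fi
    # The duration is positional and optional; omitting it runs indefinitely.
    echo "sysak syscall_slow -t $thresh_ms $target${duration_s:+ $duration_s}"
}

# Examples: trace a task by name for 30 seconds, or a thread with defaults.
syscall_slow_cmd nginx '' 20 30
syscall_slow_cmd '' 4321
```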
Lock contention
ulockcheck
Analyzes lock contention among application threads.
The ulockcheck tool requires a separate component download. Run sysak list -a to check whether this tool is installed.
sysak ulockcheck -p <pid> | -s <thread pid> | -a | -t <0|1> | -d
| Option | Description |
|---|---|
| -p <pid> | Monitor lock contention among threads of a specified process |
| -a | Show the current lock owner and the top five lock requesters |
| -s <thread pid> | Show lock contention status for a monitored thread |
| -t <0\|1> | Enable output. If a thread waits for a lock for more than 100 milliseconds, display the user-mode call stack. |
| -d | Disable monitoring |
Virtualization
kvmexittime
Traces and diagnoses VM-exit events.
sysak kvmexittime [--help] [-p PID] [-t TID] [interval]
| Option | Description |
|---|---|
| -p <PID> | Specify a process ID |
| -t <TID> | Specify a thread ID |
| interval | Interval for tracing and analyzing VM-exit events |
| --help | Show help information |