Use Managed Service for Prometheus to monitor Windows OSs

This topic describes how to use Alibaba Cloud Managed Service for Prometheus to monitor Windows OSs.

Prerequisites

An ECS instance in a VPC is monitored by Managed Service for Prometheus. For more information, see Create a Prometheus instance to monitor an ECS instance.

Limits

You can install the exporter only for Prometheus instances for ECS.

Reference model for monitoring Windows metrics

The reference model for the overall monitoring of Windows metrics consists of three parts: metric collection, monitoring dashboard, and alert rules.

Metric collection

The basic monitoring metrics for Windows include the CPU, memory, disk, network, and process. Note that this topic uses WMI to refer to Windows Management Instrumentation.

CPU

As a core component for computing and control in a computer system, the CPU processes information and executes programs. The following table describes the main CPU metrics.

Metric	Level	Source	Description
CPU utilization (%)	Critical	WMI (PercentProcessorTime)	If the CPU utilization is constantly at 100%, the system may be bottlenecked. In this case, you can identify whether the machine is overloaded by checking the processor queue length. An overly high CPU utilization and excessively long processor queue mean that the OS lacks resources to complete its computing tasks.
Thread context switches	Major	WMI (ContextSwitchesPersec)	A context switch occurs when the processor switches to a new task after the processor completes a task or exits the execution of a task. For more information, see Context Switches. Frequent context switches indicate that the competition for CPU resources is fierce, and the system is bottlenecked. However, if context switches happen due to an increased interrupt rate from hardware devices, this issue may be caused by hardware driver problems.
Processor queue length	Critical	WMI (ProcessorQueueLength)	A thread in the processor queue is ready to run, but cannot run because other threads are using the processor. For a multi-processor system, if the value of the processor queue length is greater than twice the number of CPU cores, the system is bottlenecked by its CPU utilization. For more information, see Observing Processor Queue Length.
Tasks delayed due to interrupts	Major	WMI (DPCsQueuedPersec)	Deferred Procedure Call (DPC) provides an interrupt mechanism for low-priority tasks on Windows OSs. Some hardware requirements demand real-time access to the CPU to ensure that high-priority tasks such as the keyboard input are processed in time. Hence, high-priority tasks can interrupt and use the processor to handle their high-priority requests, which triggers context switches. As a result, the low-priority tasks on some devices may be postponed. DPCs allow real-time processes such as device driver processes to execute low-priority tasks after the high-priority tasks are performed. DPCs are created by the kernel and can only be called by a program in kernel mode. If the number of DPCs is too large or remains unchanged for an extended period, problems may occur on basic system software.
CPU utilization in privileged mode (%)	Major	WMI (PercentPrivilegedTime)	PrivilegedTime is the time that the CPU spends on processing program instructions in kernel mode. The system has regular interrupts that trigger context switches. The processing of these context switches requires privileged mode. The CPU reserves a small amount of resources for privileged mode (typically within 10%). If the CPU utilization in privileged mode exceeds 30%, check the PercentDPCTime and PercentInterruptTime first. If PercentDPCTime or PercentInterruptTime exceeds 20%, it indicates that hardware problems may be present. In this case, you can use tools such as Xperf to further analyze the processes that have problems.
CPU utilization of DPCs (%)	Major	WMI (PercentDPCTime)	DPCTime is the time that the CPU spends on processing DPCs. We recommend that you take PercentPrivilegedTime and PercentInterruptTime into consideration when you analyze this metric.
CPU utilization of interrupts (%)	Major	WMI (PercentInterruptTime)	PercentInterruptTime is the time that the CPU spends on processing interrupts. We recommend that you take PercentPrivilegedTime and PercentDPCTime into consideration when you analyze this metric.
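
The rules of thumb in the preceding table can be combined into a simple check. The following Python code is a minimal sketch, not part of Managed Service for Prometheus. The `CpuSample` class, the `diagnose_cpu` function, and the finding labels are hypothetical names introduced for illustration. The thresholds (twice the number of CPU cores, 30%, and 20%) are taken from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CpuSample:
    """One reading of the CPU counters described in the table (hypothetical type)."""
    queue_length: int           # ProcessorQueueLength
    privileged_percent: float   # PercentPrivilegedTime
    dpc_percent: float          # PercentDPCTime
    interrupt_percent: float    # PercentInterruptTime


def diagnose_cpu(sample: CpuSample, cores: int) -> List[str]:
    """Return findings based on the rules of thumb in the table."""
    findings = []
    # A queue longer than twice the number of cores suggests a CPU bottleneck.
    if sample.queue_length > 2 * cores:
        findings.append("cpu_bottleneck")
    # Privileged time above 30%: check DPC time and interrupt time next.
    if sample.privileged_percent > 30:
        if sample.dpc_percent > 20 or sample.interrupt_percent > 20:
            findings.append("possible_hardware_problem")
        else:
            findings.append("high_privileged_time")
    return findings
```

For example, a 4-core machine with a processor queue length of 10, a privileged-mode utilization of 35%, and a DPC utilization of 25% is reported as both CPU-bottlenecked and a candidate for hardware troubleshooting.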

Memory

Memory is used to store the CPU computing data and the data imported from external devices such as hard disks. The following table describes the main memory metrics.

Metric	Level	Source	Description
Available physical memory (MB)	Critical	WMI (AvailableMBytes)	When multiple processes compete for memory resources, paging occurs and performance degrades. The system needs sufficient memory resources to process workloads. If the memory availability remains low for a long time, segmentation faults and other serious problems may occur. In this case, we recommend that you increase the physical memory of your system and appropriately configure page combining for your memory.
Committed virtual memory (bytes)	Major	WMI (CommittedBytes)	CommittedBytes is the amount of committed virtual memory. The memory that has been allocated is counted as the committed virtual memory, including the physical memory and page files. When the committed virtual memory approaches or exceeds the physical memory of the system, disk paging is triggered. This affects the overall performance of the system. When the committed virtual memory approaches the maximum memory of the system, and the system reports out-of-memory errors, you must increase the amount of available physical memory or the size of page files. If the committed virtual memory keeps rising, we recommend that you monitor the related business and identify the root cause.
Memory of the nonpaged pool (bytes)	Major	WMI (PoolNonpagedBytes)	The Windows kernel and hardware devices preempt threads to execute time-sensitive tasks. To ensure the efficiency of task execution, the kernel and hardware devices are provided direct access to the physical memory. This is unlike user processes, which are provided access to the virtual memory. The nonpaged pool is not affected by disk paging. The problems that occur on the components that use the nonpaged pool memory have a critical impact on the system. For example, a memory leak in a driver that uses the nonpaged pool may cause the system to fail completely, because the data of processes in user mode that is stored in the memory must be dumped to the disk.
Page faults (per second)	Critical	WMI (PageFaultsPersec)	When a process requests a page that is not in its working set, one of the following types of page faults occurs. Soft page fault: the page that the process requests is found elsewhere in the memory. Hard page fault: the page that the process requests must be retrieved from a disk. This metric returns the sum of soft page faults and hard page faults. Soft page faults can be easily rectified. The OS can tolerate a large number of soft page faults. However, hard page faults are costly to fix, and typically delay process execution. If the number of hard page faults surges, we recommend that you increase your system memory. The OS continuously adjusts the memory allocated to processes when the system is heavily loaded. This results in frequent page faults.
Pages read from disks (per second)	Major	WMI (PagesInputPersec)	This metric indicates the number of pages read from disks to resolve hard page faults. You can use this metric together with PageFaultsPersec to determine the type of the faults that occur. If the value of PagesInputPersec is high, most of the faults are hard page faults. Otherwise, most of the faults are soft page faults. When a hard page fault occurs, the Windows OS attempts to read multiple contiguous pages into memory in the expectation of minimizing the number of read operations. This may increase the time that is spent on handling page faults, because unnecessary pages are read into memory and more disk bandwidth resources are consumed. To troubleshoot this issue, you can store page files on separate physical disks or add available RAM resources.
Page file space usage (%)	Recommend	WMI (PagingFile.PercentUsage)	Page files are hidden system files, similar to the swap space in Linux OSs. Page files are used to store infrequently accessed memory pages on the disk. This way, memory resources can be released for other processes. Page files are located on the disk. Data reads from and data writes to page files affect the overall system performance and result in disk space fragmentation, which further compromises system performance. By default, the Windows OS manages the page files by increasing or decreasing the file size. No manual intervention is required. In some cases, you may need to set the file size. If the page file space is fully occupied, the process fails to run due to insufficient system memory.
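
The relationship between committed virtual memory, physical memory, and the maximum memory of the system can be sketched as follows. This is an illustrative Python example only. The `commit_pressure` function, its return labels, and the 90% margin that is used to decide that the committed memory "approaches" the maximum are assumptions and do not come from this topic. The `commit_limit_bytes` parameter stands for the maximum memory of the system, which is the physical memory plus the page files.

```python
def commit_pressure(committed_bytes: int, physical_bytes: int,
                    commit_limit_bytes: int, margin: float = 0.9) -> str:
    """Classify memory pressure from CommittedBytes (hypothetical helper)."""
    # Close to the maximum memory: out-of-memory errors become likely.
    if committed_bytes >= margin * commit_limit_bytes:
        return "near_commit_limit"
    # More memory committed than physically present: disk paging is triggered.
    if committed_bytes >= physical_bytes:
        return "paging_likely"
    return "ok"
```

For example, 17 GiB of committed memory on a machine with 16 GiB of physical memory and a 20 GiB maximum is classified as `paging_likely`.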

Disk

Disks are external storage devices of computers, including HDDs, hybrid hard drives (HHDs), and SSDs. The following table describes the main disk metrics.

Metric	Level	Source	Description
Remaining disk space (%)	Critical	WMI (PercentFreeSpace)	The Windows OS must have sufficient available disk space. In addition to the regular processes that use disk space, the core system processes also use disk space to store logs and other types of data. If the available disk space on the OS is lower than 15%, an alert is generated.
Disk idle time (%)	Major	WMI (PercentIdleTime)	This metric indicates the percentage of time when the disk is idle. If you host the page files on a separate drive from the OS drive, you must track and monitor alerts of this metric on the OS main drive and the page file drive. If the disk idle time is constantly low, read and write operations are performed non-stop on the disk. In this case, we recommend that you monitor this metric closely. If the I/O value of the disk where the page files are stored is large, it indicates that the access to the memory pages increases. The performance of applications that are mapped from the memory to the page file is compromised. Therefore, we recommend that you host page files on free drives or drives that have higher processing speeds, such as SSDs. Additionally, the performance of application programs that require a large number of disk resources, such as databases, is severely compromised by a constantly high I/O.
Average time per read/write (in seconds)	Major	WMI (AvgDisksecPerRead/AvgDisksecPerWrite)	This metric indicates the average time that a read or write operation consumes. If disk operations take more than about 30 milliseconds, you can switch the operations to a disk that has a higher processing speed, such as an SSD.
Average length of a read/write request queue	Major	WMI (AvgDiskQueueLength)	If the average length of disk read or write queues exceeds twice the number of drives attached to the system, the system is bottlenecked by the disks.
Disk read/write operation rate (operations per second)	Major	WMI (DiskTransfersPersec)	If you host time-sensitive applications on your server, such as databases, you must monitor the disk I/O rate. The DiskTransfersPersec metric measures the read and write operations by disk. It is a combination of the DiskReadsPersec and DiskWritesPersec metrics. If the disk I/O is constantly high, the system may become unstable and service degradation may occur, especially when coupled with high memory and page file usage. To solve this problem, you can add disks that have higher processing speeds, such as SSDs, or reserve more memory for the cache of file systems.
File system cache in memory (bytes)	Recommend	WMI (CacheBytes)	This metric indicates the size of memory that is occupied by the file system cache. Page files are used to store memory files on the disk. The file system cache caches disk contents in memory for better access performance. However, if the system cache is too small, file access performance is slow. If the system cache is too large, programs may store memory pages on disks, which will also decrease file access performance. Typically, this issue is managed by Windows. However, in some cases, you must adjust the file cache by using tools such as CacheSet. Assume that you want to open multiple files larger than 1 GB. If you have set the FILE_FLAG_RANDOM_ACCESS flag when you call CreateFile, the cache manager will keep memory pages that have been viewed in the cache. When the data that is accumulated in the cache exceeds the size of the physical memory, your system performance is severely affected.
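
The disk thresholds in the preceding table can be checked together. The following Python code is a minimal sketch under the assumption that you already have the counter values at hand. The `diagnose_disk` function and its finding labels are hypothetical. The thresholds (15% free space, about 30 milliseconds per operation, and twice the number of drives) are taken from the table.

```python
from typing import List


def diagnose_disk(free_percent: float, avg_sec_per_transfer: float,
                  avg_queue_length: float, drives: int) -> List[str]:
    """Return findings based on the disk rules of thumb in the table."""
    findings = []
    # PercentFreeSpace below 15% should raise an alert.
    if free_percent < 15:
        findings.append("low_free_space")
    # AvgDisksecPerRead/AvgDisksecPerWrite above about 30 ms is considered slow.
    if avg_sec_per_transfer > 0.030:
        findings.append("slow_io")
    # AvgDiskQueueLength above twice the number of drives indicates a bottleneck.
    if avg_queue_length > 2 * drives:
        findings.append("disk_bottleneck")
    return findings
```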

Network

Networks are usually built on the TCP/IP protocol and enable real-time communication between computers. The following table describes the main network metrics.

Metric	Level	Source	Description
Network sending/receiving rate (bytes per second)	Major	WMI (BytesSentPersec/BytesReceivedPersec)	You can obtain the total throughput of the network port by checking the network sending rate and receiving rate. When the throughput exceeds 80% of the network port bandwidth, network saturation occurs. You can upgrade hardware to solve the problem. Most hardware is supported by gigabit network interface controllers (NICs) or NICs with higher specifications, so the network specifications of hardware do not cause performance bottlenecks. However, the bandwidth provided by network switches and network service providers may cause performance bottlenecks.
Network connections	Major	WMI	The number of network connections includes the numbers of Listen, Total, Established, and Non_Established connections. You can confirm whether the network is overloaded by comparing the absolute values of Established and Non_Established connections and checking the relationship between them. To detect connection leaks, you can observe whether Non_Established connections continue increasing.
TCP retransmission rate (times per second)	Critical	WMI (SegmentsRetransmittedPersec)	When a message segment that has been transmitted is not acknowledged within the TCP timeout window, the message segment is retransmitted. This is considered a TCP retransmission. When network congestion and network hardware failures occur, the TCP retransmission rate becomes elevated. In a healthy system, the TCP retransmission rate is typically lower than 5%. To ensure system performance, we recommend that you monitor this metric and configure reasonable alert rules.
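
The two percentage thresholds in the preceding table can be worked out as follows. This Python sketch is illustrative only. The function names are hypothetical, and the retransmission ratio is computed against the number of segments sent per second, which is an assumption because this topic does not state the denominator of the 5% reference value.

```python
def network_utilization(bytes_sent_per_sec: float, bytes_received_per_sec: float,
                        bandwidth_bits_per_sec: float) -> float:
    """Total throughput as a fraction of the network port bandwidth."""
    total_bits_per_sec = (bytes_sent_per_sec + bytes_received_per_sec) * 8
    return total_bits_per_sec / bandwidth_bits_per_sec


def retransmission_ratio(retransmitted_per_sec: float, sent_per_sec: float) -> float:
    """Retransmitted segments as a fraction of segments sent (assumed denominator)."""
    if sent_per_sec == 0:
        return 0.0
    return retransmitted_per_sec / sent_per_sec
```

For example, sending 60 MB/s and receiving 50 MB/s on a 1 Gbit/s port gives a utilization of 0.88, which exceeds the 80% saturation threshold.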

Process

A process is the basic unit for the OS to allocate and schedule resources. It is also the foundation of the OS structure. The following table describes the main process metrics.

Metric	Level	Source	Description
CPU occupancy time of processes	Major	WMI (PercentPrivilegedTime/PercentUserTime)	This metric shows the CPU utilization of processes. Pay attention to processes with high CPU utilization and processes whose CPU utilization fluctuates suddenly.
Process handles	Recommend	WMI (HandleCount)	When a process applies for resources such as windows, icons, and cursors, the Windows OS creates the resources as required. At the same time, the Windows OS allocates memory to these resources and returns the serial numbers that are attached to these resources. These serial numbers are handles. Windows places a limit on the number of handles that can be owned by a process. If a process has handle leaks, it cannot obtain resources when it reaches its limits on the number of handles.
Process threads	Recommend	WMI (ThreadCount)	A process contains one or more threads. This metric can be used to confirm whether the number of threads of a specified process meets your expectations.
Process memory working set (bytes)	Major	WMI (WorkingSet)	The working set of a process is the set of pages in the virtual address space of the process that currently reside in the physical memory. The working set contains only pageable memory allocations.
Total process I/Os (bytes per second)	Major	WMI (IODataBytesPerSec)	This metric indicates the total number of bytes that a process reads and writes per second. If you notice that the disk is unavailable or the disk response is slow, check whether the processes with a large amount of I/Os meet your expectations.
Process I/O requests (operations per second)	Major	WMI (IODataOperationsPerSec)	The rate at which a process issues read and write I/O operations.
Process page file size (bytes)	Recommend	WMI (PageFileBytes)	The amount of virtual memory that a process has reserved for use in the page files.

Monitoring dashboards

This section suggests metrics to include in your monitoring dashboards. The suggestions are based on the metrics that are commonly monitored by using Node Exporter, a widely used Prometheus exporter for Linux OSs.

Category	Metric
CPU	CPU utilization (%): the most important metric that can be used to determine the performance of Windows machines. DPC queue length, processor queue length, and context switches: the key metrics that provide insights into processor performance on Windows machines.
Memory	Physical memory usage and virtual memory usage (%): two of the most important metrics that are used to monitor whether Windows operates as expected. Page file usage (%) and page fault rate. Paged and nonpaged memory.
Disk	Disk space usage (%): the remaining available disk space. Disk idle rate (%): the metric that reflects the volume of activity on a disk. Disk read/write IOPS and disk read/write queue length: the metrics that reflect the activity of processes on a disk.
Network	Network inbound/outbound rate (bit/s): the core metric that reflects the volume of activity of a network. TCP connections (including Listen, Total, Non_Established, and Established connections): the metric that reflects the status of the process using the network at different phases. TCP retransmission rate (times per second): the metric that reflects the stability of the network for the external interactions of Windows.
Process	Process CPU utilization (%): the metric that shows the CPU utilization of a process. Process memory usage (%): the metric that shows the memory usage of a process. Process handles. Process I/O bytes: the metric that shows the read and write throughput of a process.

To provide O&M personnel with information on the overall running status of the managed Windows cluster, we recommend that you configure a Top N dashboard. A Top N dashboard includes key metrics such as CPU utilization, disk space usage, disk idle rate, and network traffic.

Alert rules

Based on the preceding description of the main metrics, we recommend that you configure the following default alert rules.

Category	Alert
CPU	CPU utilization: Generates an alert when this metric exceeds 80% for N minutes. This means that the system is bottlenecked by CPU utilization. Processor queue length: Generates an alert when this metric exceeds twice the number of CPU cores for N minutes.
Memory	Physical memory usage: Generates an alert when this metric exceeds 90% for N minutes.
Disk	Disk space usage: Generates an alert when this metric exceeds 85% for N minutes. This means that the remaining disk space is about to fall below 15% and the system may become unstable. Disk idle rate: Generates an alert when this metric is less than 15% for N minutes.
Network	Established network connections: Generates an alert when this metric value exceeds a custom value for N minutes. This means that the number of network connections is excessive. Non_Established network connections: Generates an alert when this metric value exceeds a custom value for N minutes. This means that the network connection overload exists, or exceptional connections are disabled. TCP retransmission rate: Generates an alert when this metric value exceeds a reference value for N minutes. This means that the network is overloaded, or the network is unstable. The reference value for this metric is 5%.
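
Every rule in the preceding table has the same shape: a threshold that must be breached continuously for N minutes. The following Python code is a minimal sketch of that condition, assuming one sample per minute with the oldest sample first. The `breached_for` function is a hypothetical helper and does not describe how Managed Service for Prometheus evaluates alert rules.

```python
from typing import Sequence


def breached_for(samples: Sequence[float], threshold: float,
                 n_minutes: int, above: bool = True) -> bool:
    """Return True if the most recent n_minutes samples all breach the threshold."""
    if n_minutes <= 0:
        raise ValueError("n_minutes must be positive")
    # Not enough history yet to decide.
    if len(samples) < n_minutes:
        return False
    recent = samples[-n_minutes:]
    if above:
        return all(value > threshold for value in recent)
    return all(value < threshold for value in recent)
```

For example, `breached_for([70, 85, 90, 95], 80, 3)` corresponds to the CPU utilization rule with N set to 3, and `breached_for([20, 10, 12], 15, 2, above=False)` corresponds to the disk idle rate rule with N set to 2.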

Pain points of using a self-managed Prometheus system to monitor Windows OSs

If your Windows OS is deployed on Elastic Compute Service (ECS) instances, you may encounter the following problems when you use a self-managed Prometheus system to monitor it.

- To ensure security and facilitate organization management, different businesses are typically deployed in separate virtual private clouds (VPCs). If you want to use a self-managed Prometheus system to monitor your business, you must deploy the self-managed Prometheus system in each VPC. This increases the deployment and O&M costs.
- You must configure Prometheus, Grafana, and Alertmanager in each self-managed monitoring system. The process is complex and requires a long time to complete.
- The self-managed Prometheus system does not have a service discovery mechanism that can be quickly implemented for Alibaba Cloud ECS. The targets that are deployed on ECS instances cannot be monitored based on the ECS tags. If you want to implement a similar mechanism, you must write code in Golang to call the POP API of Alibaba Cloud ECS and integrate the code into open source Prometheus. Then, you must compile, package, and deploy the modified Prometheus. This process is complex and complicates version upgrades.
- Most open source Grafana dashboards for Windows are not designed for specific services. You cannot customize the monitoring metrics based on the principles and best practices of Windows OSs.
- No alert template is available for Windows OSs. You must configure the alert rules yourself, which is time-consuming and requires a high level of technical expertise.

Comparison between a self-managed Prometheus system and Managed Service for Prometheus

Managed Service for Prometheus is compatible with the open source Prometheus ecosystem and provides out-of-the-box dashboards for you to monitor a wide variety of components. Managed Service for Prometheus can be used to monitor Container Service for Kubernetes (ACK) and self-managed Kubernetes clusters, and can be used with the remote write feature. Managed Service for Prometheus also provides metric monitoring capabilities for ECS instances that are deployed across multiple clouds or on a hybrid cloud. Managed Service for Prometheus supports the unified monitoring of multiple instances. This helps you query Prometheus metrics and receive alerts based on unified Grafana data sources.

Managed Service for Prometheus is seamlessly integrated with ECS. It collects the core monitoring metrics of Windows OSs, including the CPU, memory, disk, network, and process by design. Managed Service for Prometheus also provides out-of-the-box monitoring dashboards and alert metrics for Windows machines.

The following table compares the self-managed Prometheus system with Managed Service for Prometheus in the scenario of monitoring Windows machines.

Item	Self-managed Prometheus system	Managed Service for Prometheus
Deployment and O&M costs	You must purchase ECS instances and deploy Prometheus, Grafana, and Alertmanager individually in multiple VPCs. This results in high O&M costs.	Managed Service for Prometheus is a fully managed and out-of-the-box service that integrates Prometheus monitoring, Grafana dashboards, and the alert center.
Availability, performance, and storage capacity	The overall performance and availability are poor, and the storage capacity is small.	The overall performance and availability are high, and the storage capacity is large.
Service discovery	The service discovery of ECS instances is implemented by using the open source static_configs or third-party service registries. The service discovery process is complex and is costly to maintain.	Managed Service for Prometheus has aliyun_sd_configs. Similar to the LabelSelector for Kubernetes service discovery, you can use ECS tags to locate ECS targets. This greatly simplifies service configuration and O&M tasks.
Grafana dashboard	The open source Grafana dashboard only shows the collected Windows metrics. You cannot customize the monitoring metrics based on the principles and best practices of Windows machines.	Managed Service for Prometheus provides a professional dashboard template for monitoring Windows machines. The dashboard provides a quick and accurate overview on the running status of your Windows machines and helps you troubleshoot issues.
Alert rule	No alert template is available for monitoring Windows machines. You must configure the alert rules.	Managed Service for Prometheus provides professional and flexible alert metric templates based on the best practices of monitoring Windows machines. You can configure alert rules on the GUI.

Perform the following steps to use Managed Service for Prometheus to monitor your Windows OS:

Step 1: Configure the Windows OS

Install and configure Windows Exporter to expose the metrics to Managed Service for Prometheus. For more information, see How do I install and configure Windows Exporter?.
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center. In the Application Components section, click + Add of the Windows component.
In the panel that appears, select ECS Environment in the STEP1 section as the environment where Windows is deployed. In the STEP2 section, select the Prometheus instance where Windows resides.

In the STEP3 section, configure the parameters for integrating Managed Service for Prometheus.

Parameter	Description
Exporter Name	The unique name of the Windows Exporter.
Exporter Port Number	The listening port that is configured when you deploy the Windows Exporter.
Collection Path	The HTTP path of the Windows Exporter from which Managed Service for Prometheus collects monitoring metrics. The default value is `/metrics`.
Collection Interval (seconds)	The interval at which monitoring data is collected.
ECS Tag	The key-value pair of the tag that is added to the ECS instance where Windows Exporter is deployed. Managed Service for Prometheus uses this tag for service discovery.
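
Before you complete the integration, you may want to confirm that the Windows Exporter responds at the port and path that you plan to enter. The following Python code is a minimal sketch that uses only the standard library. The function names are hypothetical, the IP address is a documentation placeholder, and port 9182 is assumed to be the port that you configured for the exporter. Run `fetch_metric_names` from a machine that can reach the ECS instance.

```python
import urllib.request
from typing import Set


def build_scrape_url(host: str, port: int, path: str = "/metrics") -> str:
    """Build the URL from the Exporter Port Number and Collection Path values."""
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}{path}"


def metric_names(exposition_text: str) -> Set[str]:
    """Extract metric names from text in the Prometheus exposition format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or at the first space.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def fetch_metric_names(host: str, port: int, path: str = "/metrics",
                       timeout: float = 5.0) -> Set[str]:
    """Fetch the exporter endpoint and return the metric names that it exposes."""
    url = build_scrape_url(host, port, path)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))
```

For example, `fetch_metric_names("192.0.2.10", 9182)` returns a non-empty set if the exporter is reachable. An exception such as a timeout usually indicates that the port, the path, or the security group configuration needs to be checked.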

Step 2: View the Windows dashboards

By default, Managed Service for Prometheus provides the overview, the process, and the top N dashboards.

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the Windows card in the Integrated section. On the panel that appears, click the Dashboards tab, and click a dashboard name to view the Windows monitoring metrics.
- The overview dashboard displays important metrics about the CPU, memory, disk, and network of a specified Windows OS.
- The process dashboard displays the CPU, memory, thread, and I/O monitoring information of each process. You can troubleshoot exceptions that occur in processes based on the metrics provided on this dashboard.
- The top N dashboard displays the top five items for each key metric of the monitored Windows cluster, including the CPU, memory, disk, and network. The top N dashboard shows the overall health status of the Windows cluster in real time.

Step 3: Configure alert rules for monitoring the Windows OS

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the Windows card in the Integrated section. On the panel that appears, click the Alerts tab to view the Windows alert rules of the Prometheus instance that you selected. Managed Service for Prometheus provides 11 key alert metrics for Windows OSs, including the alert metrics for the CPU, memory, disk, and network. You can add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.

(Optional) Step 4: Customize Windows monitoring metrics

By default, the Windows Exporter of Managed Service for Prometheus collects the following items: cpu, cpu_info, memory, process, tcp, cs, logical_disk, net, os, system, textfile, and time.

You can modify the configuration file based on your business requirements to collect the metrics of Windows components such as Active Directory, Container, and Domain Name System (DNS). The new configurations of the Windows Exporter take effect after a restart. For more information, see Windows Exporter.
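
When you extend the default collection items, it helps to build the new list from the defaults so that no existing collector is dropped by accident. The following Python code is a minimal sketch. The `enabled_collectors` function is hypothetical. The collector names `ad`, `container`, and `dns`, and the use of a comma-separated list for the enabled collectors, are assumptions based on the open source Windows Exporter. Verify them against the version that you deploy.

```python
from typing import Iterable

# Default collection items listed in this topic.
DEFAULT_COLLECTORS = [
    "cpu", "cpu_info", "memory", "process", "tcp", "cs",
    "logical_disk", "net", "os", "system", "textfile", "time",
]


def enabled_collectors(extra: Iterable[str]) -> str:
    """Return the defaults plus extra collectors as a comma-separated list."""
    result = list(DEFAULT_COLLECTORS)
    for name in extra:
        # Keep the order stable and skip duplicates.
        if name not in result:
            result.append(name)
    return ",".join(result)
```

For example, `enabled_collectors(["ad", "dns"])` returns the default list followed by `ad` and `dns`. You can then copy the resulting value into the configuration file before you restart the Windows Exporter.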