This topic describes how to use Alibaba Cloud Managed Service for Prometheus to monitor Windows OSs.

Prerequisites

An ECS instance in a VPC is monitored by Alibaba Cloud Managed Service for Prometheus. For more information, see Create a Prometheus instance to monitor an ECS instance.

Limit

You can install the exporter only for Prometheus instances for ECS.

Reference model for monitoring Windows metrics

The reference model for the overall monitoring of Windows metrics consists of three parts: metric collection, monitoring dashboard, and alert rules.

Metric collection

The basic monitoring metrics for Windows include the CPU, memory, disk, network, and process. Note that this topic uses WMI to refer to Windows Management Instrumentation.

CPU

The CPU is the core computing and control component of a computer system. It processes information and executes programs. The following table describes the main CPU metrics.
Metric | Level | Property name | Description
CPU usage (%) | Critical | WMI (PercentProcessorTime) | If the CPU usage is constantly at 100%, the system may be bottlenecked. In this case, you can identify whether the machine is overloaded by checking the processor queue length.

An overly high CPU usage and excessively long processor queue mean that the OS lacks resources to complete its computing tasks.

Thread context switches | Major | WMI (ContextSwitchesPersec) | A context switch occurs when the processor switches to a new task after the processor completes a task or exits the execution of a task. For more information, see Context Switches.

Frequent context switches indicate that the competition for CPU resources is fierce, and the system is bottlenecked. However, if context switches happen due to an increased interrupt rate from hardware devices, this issue may be caused by hardware driver problems.

Processor queue length | Critical | WMI (ProcessorQueueLength) | A thread in the processor queue is ready to run, but cannot run because other threads are using the processor. For a multi-processor system, if the value of the processor queue length is greater than twice the number of CPU cores, the system is bottlenecked by its CPU usage. For more information, see Observing Processor Queue Length.
Tasks delayed due to interrupts | Major | WMI (DPCsQueuedPersec) | Deferred Procedure Call (DPC) provides an interrupt mechanism for low-priority tasks on Windows OSs. Some hardware requirements demand real-time access to the CPU to ensure that high-priority tasks such as the keyboard input are processed in time.

Hence, high-priority tasks can interrupt and use the processor to handle their high-priority requests, which triggers context switches. As a result, the low-priority tasks on some devices may be postponed.

DPCs allow real-time processes such as device driver processes to execute low-priority tasks after the high-priority tasks are performed. DPCs are created by the kernel and can only be called by a program in kernel mode. If the number of DPCs is too large or remains unchanged for an extended period, problems may occur on basic system software.

CPU usage in privileged mode (%) | Major | WMI (PercentPrivilegedTime) | PrivilegedTime is the time that the CPU spends on processing program instructions in kernel mode. The system has regular interrupts that trigger context switches. The processing of these context switches requires privileged mode. The CPU reserves a small amount of resources for privileged mode (typically within 10%).

If the CPU usage in privileged mode exceeds 30%, check the PercentDPCTime and PercentInterruptTime first. If PercentDPCTime or PercentInterruptTime exceeds 20%, it indicates that hardware problems may be present. In this case, you can use tools such as Xperf to further analyze the processes that have problems.

CPU usage of DPCs (%) | Major | WMI (PercentDPCTime) | DPCTime is the time that the CPU spends on processing DPCs. We recommend that you take PercentPrivilegedTime and PercentInterruptTime into consideration when you analyze this metric.
CPU usage of interrupts (%) | Major | WMI (PercentInterruptTime) | PercentInterruptTime is the time that the CPU spends on processing interrupts. We recommend that you take PercentPrivilegedTime and PercentDPCTime into consideration when you analyze this metric.
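
The CPU guidance in the table can be condensed into a small check. The following Python sketch is illustrative only: the function name, the parameter names, the finding labels, and the 95% cutoff that stands in for "constantly at 100%" are assumptions. The other thresholds (twice the number of cores, 30%, and 20%) are the ones quoted in the table.

```python
def diagnose_cpu(cpu_pct, queue_length, cores,
                 privileged_pct=0.0, dpc_pct=0.0, interrupt_pct=0.0):
    """Return a list of findings based on the thresholds quoted in the table."""
    findings = []

    # A queue longer than twice the core count points to a CPU bottleneck.
    if queue_length > 2 * cores:
        if cpu_pct >= 95.0:  # assumed stand-in for "constantly at 100%"
            findings.append("cpu_overloaded")
        else:
            findings.append("long_processor_queue")

    # Privileged time above 30%: check DPC and interrupt time next.
    if privileged_pct > 30.0:
        if dpc_pct > 20.0 or interrupt_pct > 20.0:
            findings.append("possible_hardware_problem")
        else:
            findings.append("high_privileged_time")

    return findings
```

For example, a 4-core machine at 99% CPU usage with 10 queued threads would be reported as overloaded, whereas the same queue length at 50% CPU usage would only be flagged for its queue.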

Memory

Memory is used to store the CPU computing data and the data imported from external devices such as hard disks. The following table describes the main memory metrics.
Metric | Level | Property name | Description
Available physical memory (MB) | Critical | WMI (AvailableMBytes) | When multiple processes compete for memory resources, paging increases and performance degrades. The system needs sufficient memory resources to process workloads.

If the memory availability remains low for a long time, segmentation faults and other serious problems may occur. In this case, we recommend that you increase the physical memory of your system and appropriately configure page combining for your memory.

Committed virtual memory (bytes) | Major | WMI (CommittedBytes) | CommittedBytes is the amount of committed virtual memory. The memory that has been allocated is counted as the committed virtual memory, including the physical memory and page files. When the committed virtual memory approaches or exceeds the physical memory of the system, disk paging is triggered. This affects the overall performance of the system.

When the committed virtual memory approaches the maximum memory of the system, and the system reports out-of-memory errors, you must increase the amount of available physical memory or the size of page files. If the committed virtual memory keeps rising, we recommend that you monitor the related business and identify the root cause.

Memory of the nonpaged pool (bytes) | Major | WMI (PoolNonpagedBytes) | The Windows kernel and hardware devices preempt threads to execute time-sensitive tasks. To ensure the efficiency of task execution, the kernel and hardware devices are provided direct access to the physical memory. This is unlike user processes, which are provided access to the virtual memory. The nonpaged pools are not affected by disk paging.

The problems that occur on the components that use the nonpaged pool memory have a critical impact on the system. For example, a memory leak in a driver that uses a nonpaged pool may cause the system to fail completely, because the data of processes in user mode that is stored in the memory must be dumped to the disk.

Page faults (per second) | Critical | WMI (PageFaultsPersec) | When a process requests a page that cannot be found in its working set, one of the following page faults occurs:
  • Soft page fault: the page that the process requests is found elsewhere in the memory.
  • Hard page fault: the page that the process requests must be retrieved from a disk.

This metric returns the sum of soft page faults and hard page faults. Soft page faults can be easily rectified. The OS can tolerate a large number of soft page faults. However, hard page faults are costly to fix, and typically delay process execution.

If the number of hard page faults surges, we recommend that you increase your system memory. The OS continuously adjusts the memory allocated to processes when the system is heavily loaded. This results in frequent page faults.

Pages read from disks (per second) | Major | WMI (PagesInputPersec) | This metric indicates the number of pages read from disks to resolve hard page faults. You can use this metric together with PageFaultsPersec to determine which type of fault dominates. If the value of PagesInputPersec is high, hard page faults are occurring. Otherwise, the faults are mostly soft page faults.

When a hard page fault occurs, the Windows OS attempts to read multiple contiguous pages into memory in the expectation of minimizing the number of read operations. This may increase the time that is spent on handling page faults, because unnecessary pages are read into memory and more disk bandwidth resources are consumed. To troubleshoot this issue, you can store page files on separate physical disks or add available RAM resources.

Page file space usage (%) | Recommend | WMI (PagingFile.PercentUsage) | Page files are hidden system files, similar to the swap space in Linux OSs. Page files are used to store infrequently accessed memory pages on the disk. This way, memory resources can be released for other processes.

Page files are located on the disk. Data reads from and data writes to page files affect the overall system performance and result in disk space fragmentation, which further compromises system performance.

By default, the Windows OS manages the page files by increasing or decreasing the file size. No manual intervention is required. In some cases, you may need to set the file size. If the page file space is fully occupied, the process fails to run due to insufficient system memory.
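
The way PageFaultsPersec and PagesInputPersec are read together can be sketched in a few lines of Python. This is a rough heuristic, not a precise classifier: the function name and the `high_input_threshold` default of 100 pages per second are hypothetical values chosen for illustration, and you should tune the threshold against your own baseline.

```python
def classify_page_faults(page_faults_per_sec, pages_input_per_sec,
                         high_input_threshold=100.0):
    """Guess which kind of page fault dominates.

    high_input_threshold is a hypothetical cutoff for a "high"
    PagesInputPersec value; it is not taken from Windows documentation.
    """
    if page_faults_per_sec <= 0:
        return "none"
    # Pages are only read from disk to resolve hard faults.
    if pages_input_per_sec >= high_input_threshold:
        return "mostly_hard"
    return "mostly_soft"
```

A result of `mostly_hard` that persists is the case in which the table suggests adding memory or moving page files to a separate physical disk.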

Disk

Disks are external storage devices of computers, including HDDs, hybrid hard drives (HHDs), and SSDs. The following table describes the main disk metrics.

Metric | Level | Property name | Description
Remaining disk space (%) | Critical | WMI (PercentFreeSpace) | The operating system must have sufficient available disk space. In addition to the regular processes that use disk space, the core system processes also use disk space to store logs and other types of data. If the available disk space on a Windows OS is lower than 15%, an alert is generated.
Disk idle time (%) | Major | WMI (PercentIdleTime) | This metric indicates the percentage of time when the disk is idle. If you host the page files on a separate drive from the OS drive, you must track and monitor alerts of this metric on the OS main drive and the page file drive.

If the disk idle time is constantly low, read and write operations are performed non-stop on the disk. In this case, we recommend that you monitor this metric closely.

If the I/O value of the disk where the page files are stored is large, it indicates that the access to the memory pages increases. The performance of applications that are mapped from the memory to the page file is compromised. Therefore, we recommend that you host page files on free drives or drives that have higher processing speeds, such as SSDs. Additionally, the performance of application programs that require a large number of disk resources, such as databases, is severely compromised by a constantly high I/O.

Average time per read/write (in seconds) | Major | WMI (AvgDisksecPerRead/AvgDisksecPerWrite) | This metric indicates the average time that a read or write operation consumes. If disk operations take more than about 30 milliseconds, you can switch the operations to a disk that has a higher processing speed, such as an SSD.
Average length of a read/write request queue | Major | WMI (AvgDiskQueueLength) | If the average length of disk read or write queues exceeds twice the number of drives attached to the system, the system is bottlenecked by the disks.
Disk read/write operation rate (operations per second) | Major | WMI (DiskTransfersPersec) | If you host time-sensitive applications on your server, such as databases, you must monitor the disk I/O rate.

The DiskTransfersPersec metric measures the read and write operations by disk. It is the sum of the DiskReadsPersec and DiskWritesPersec metrics.

If the disk I/O is constantly high, the system may become unstable and service degradation may occur, especially when coupled with high memory and page file usage. To solve this problem, you can add disks that have higher processing speeds, such as SSDs, or reserve more memory for the cache of file systems.

File system cache in memory (bytes) | Recommend | WMI (CacheBytes) | This metric indicates the size of memory that is occupied by the file system cache. Page files are used to store memory files on the disk. The file system cache caches disk contents in memory for better access performance.

If the system cache is too small, file access is slow. If the system cache is too large, programs may be forced to store memory pages on disks, which also decreases performance. Typically, Windows manages this balance automatically. However, in some cases, you must adjust the file cache by using tools such as CacheSet.

Assume that you want to open multiple files larger than 1 GB. If you have set the FILE_FLAG_RANDOM_ACCESS flag when you call CreateFile, the cache manager will keep memory pages that have been viewed in the cache. When the data that is accumulated in the cache exceeds the size of the physical memory, your system performance is severely affected.
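
The disk thresholds in the table (15% free space, about 30 milliseconds per operation, and a queue longer than twice the number of drives) can be combined into a single check. The following Python sketch assumes the metric values have already been collected; the function name, parameter names, and issue labels are illustrative.

```python
def check_disk(free_pct, avg_sec_per_read, avg_sec_per_write,
               avg_queue_length, drive_count):
    """Return the disk issues implied by the thresholds quoted in the table."""
    issues = []

    # Less than 15% free space is the alerting point named in the table.
    if free_pct < 15.0:
        issues.append("low_free_space")

    # Operations slower than about 30 ms suggest moving to a faster disk.
    if max(avg_sec_per_read, avg_sec_per_write) > 0.030:
        issues.append("slow_io")

    # A queue longer than twice the number of drives is a disk bottleneck.
    if avg_queue_length > 2 * drive_count:
        issues.append("queue_bottleneck")

    return issues
```

Note that the read and write latencies are expressed in seconds, matching the unit of AvgDisksecPerRead and AvgDisksecPerWrite.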

Network

Networks are usually built on the TCP/IP protocol and enable real-time communication between computers. The following table describes the main network metrics.

Metric | Level | Property name | Description
Network sending/receiving rate (bytes per second) | Major | WMI (BytesSentPersec/BytesReceivedPersec) | You can obtain the total throughput of the network port by checking the network sending rate and receiving rate. When the throughput exceeds 80% of the network port bandwidth, network saturation occurs. You can upgrade hardware to solve the problem.

Most hardware is supported by gigabit network interface controllers (NICs) or NICs with higher specifications, so the network specifications of hardware do not cause performance bottlenecks. However, the bandwidth provided by network switches and network service providers may cause performance bottlenecks.

Network connections | Major | WMI | The number of network connections includes the numbers of Listen, Total, Established, and Non_Established connections. You can confirm whether the network is overloaded by comparing the absolute values of Established and Non_Established connections and checking the relationship between them. To detect connection leaks, you can observe whether Non_Established connections continue increasing.
TCP retransmission rate (times per second) | Critical | WMI (SegmentsRetransmittedPersec) | When a message segment that has been transmitted is not acknowledged within the TCP timeout window, the message segment is retransmitted. This is considered a TCP retransmission. When network congestion and network hardware failures occur, the TCP retransmission rate becomes elevated.

In a healthy system, retransmitted segments typically account for less than 5% of the segments sent. To ensure system performance, we recommend that you monitor this metric and configure reasonable alert rules.
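
The two numeric rules in this table can be written out as simple arithmetic. In the following Python sketch, the function names and parameters are illustrative. Summing the sending and receiving rates follows the wording of the table; on a full-duplex link you may prefer to compare each direction against the bandwidth separately.

```python
def network_utilization(bytes_sent_per_sec, bytes_received_per_sec,
                        link_bits_per_sec):
    """Fraction of the port bandwidth in use (bytes are converted to bits)."""
    total_bits = (bytes_sent_per_sec + bytes_received_per_sec) * 8
    return total_bits / link_bits_per_sec


def is_saturated(utilization, limit=0.8):
    """The table treats more than 80% of the port bandwidth as saturation."""
    return utilization > limit


def retransmission_ratio(segments_retransmitted_per_sec, segments_sent_per_sec):
    """Share of sent segments that were retransmitted (0.05 means 5%)."""
    if segments_sent_per_sec <= 0:
        return 0.0
    return segments_retransmitted_per_sec / segments_sent_per_sec
```

For example, 50 MB/s sent plus 60 MB/s received on a 1 Gbit/s port is 880 Mbit/s, or 88% of the bandwidth, which exceeds the 80% limit.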

Process

A process is the basic unit for the OS to allocate and schedule resources. It is also the foundation of the OS structure. The following table describes the main process metrics.

Metric | Level | Property name | Description
CPU occupancy time of processes | Major | WMI (PercentPrivilegedTime/PercentUserTime) | This metric shows the CPU usage of processes. You must pay attention to the processes with high CPU usage or processes whose CPU usage fluctuates suddenly.
Process handles | Recommend | WMI (HandleCount) | When a process applies for resources such as windows, icons, and cursors, the Windows OS creates the resources as required. At the same time, the Windows OS allocates memory to these resources and returns the serial numbers that are attached to these resources. These serial numbers are handles.

Windows places a limit on the number of handles that can be owned by a process. If a process has handle leaks, it cannot obtain resources when it reaches its limits on the number of handles.

Process threads | Recommend | WMI (ThreadCount) | A process contains one or more threads. This metric can be used to confirm whether the number of threads of a specified process meets your expectations.
Process memory working set (bytes) | Major | WMI (WorkingSet) | The working set of a process is the set of pages in the virtual address space of the process that currently reside in the physical memory. The working set contains only pageable memory allocations.
Total process I/Os (bytes per second) | Major | WMI (IODataBytesPerSec) | This metric indicates the rate at which a process reads and writes bytes in I/O operations. If you notice that the disk is unavailable or the disk response is slow, check whether the processes with a large amount of I/Os meet your expectations.
Process I/O requests (operations per second) | Major | WMI (IODataOperationsPerSec) | The rate at which a process issues read and write I/O operations.
Process page file size (bytes) | Recommend | WMI (PageFileBytes) | The amount of virtual memory that a process has reserved for use in the page files.

Monitoring dashboards

This section provides suggestions for metrics that you should include in your monitoring dashboards. The suggestions are based on the metrics that are commonly monitored with Node Exporter, a widely used Prometheus exporter for Linux OSs.
Category | Metric
CPU
  • CPU usage (%): the most important metric that can be used to determine the performance of Windows machines.
  • DPC queue length, processor queue length, and context switches: the key metrics that provide insights into processor performance on Windows machines.
Memory
  • Physical memory usage and virtual memory usage (%): two of the most important metrics that are used to monitor whether Windows operates as expected.
  • Page file usage (%) and page fault rate (per second).
  • Paged and nonpaged memory.
Disk
  • Disk space usage (%): the percentage of disk space that is in use, which shows how much space remains available.
  • Disk idle rate (%): the metric that reflects the volume of activity on a disk.
  • Disk read/write IOPS and disk read/write queue length: the metrics that reflect the activity of processes on a disk.
Network
  • Network inbound/outbound rate (bit/s): the core metric that reflects the volume of activity of a network.
  • TCP connections (including Listen, Total, Non_Established, and Established connections): the metric that reflects the status of the process using the network at different phases.
  • TCP retransmission rate (times per second): the metric that reflects the stability of the network for the external interactions of Windows.
Process
  • Process CPU usage (%): the metric that shows the CPU usage of a process.
  • Process memory usage (%): the metric that shows the memory usage of a process.
  • Process handles.
  • Process I/O bytes: the metric that shows the read and write throughput of a process.

To provide O&M personnel with information on the overall running status of the managed Windows cluster, we recommend that you configure a Top N dashboard. A Top N dashboard includes key metrics such as CPU usage, disk space usage, disk idle rate, and network traffic.
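
The idea behind a Top N dashboard is a simple ranking of instances by one metric. The following Python sketch shows that ranking in isolation; the function name and the sample host names are hypothetical, and a real dashboard would run an equivalent query against the collected metrics.

```python
import heapq


def top_n(samples, n=5):
    """Return the n (instance, value) pairs with the highest values.

    samples maps an instance name to its current metric value,
    for example CPU usage in percent.
    """
    return heapq.nlargest(n, samples.items(), key=lambda item: item[1])


# Hypothetical CPU usage readings for three Windows hosts.
cpu_usage = {"win-a": 10.0, "win-b": 90.0, "win-c": 50.0}
busiest = top_n(cpu_usage, n=2)
```

The same function can be reused for disk space usage or network traffic. For disk idle rate, where low values are the concern, `heapq.nsmallest` would be the appropriate counterpart.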

Alert rules

Based on the preceding description of the main metrics, we recommend that you configure the following default alert rules.

Category | Alert
CPU
  • CPU usage: Generates an alert when this metric exceeds 80% for N minutes. This means that the system is bottlenecked by CPU usage.
  • Processor queue length: Generates an alert when this metric exceeds twice the number of CPU cores for N minutes.
Memory
  • Physical memory usage: Generates an alert when this metric exceeds 90% for N minutes.
Disk
  • Disk space usage: Generates an alert when this metric exceeds 85% for N minutes. This means that the disk is almost full and the system may become unstable.
  • Disk idle rate: Generates an alert when this metric is less than 15% for N minutes.
Network
  • Established network connections: Generates an alert when this metric value exceeds a custom value for N minutes. This means that the number of network connections is excessive.
  • Non_Established network connections: Generates an alert when this metric value exceeds a custom value for N minutes. This may indicate that the network is overloaded or that connections are being closed abnormally.
  • TCP retransmission rate: Generates an alert when this metric value exceeds a reference value for N minutes. This means that the network is overloaded, or the network is unstable. The reference value for this metric is 5%.
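
Every rule above has the same shape: a threshold that must be breached continuously for N minutes. The following Python sketch shows one way to evaluate that condition over a series of samples. It is a simplified illustration, not the evaluation logic of the managed service; the function name, the parameters, and the default 60-second sample interval are assumptions.

```python
def should_alert(samples, threshold, for_minutes,
                 interval_seconds=60, below=False):
    """Return True if every sample in the last for_minutes breaches threshold.

    samples is ordered oldest to newest, one value per interval_seconds.
    Set below=True for rules such as "disk idle rate is less than 15%".
    """
    needed = max(1, (for_minutes * 60) // interval_seconds)
    if len(samples) < needed:
        return False  # not enough history to cover the window yet
    recent = samples[-needed:]
    if below:
        return all(value < threshold for value in recent)
    return all(value > threshold for value in recent)
```

For example, with one sample per minute, `should_alert([70, 85, 90, 95], 80, 3)` is true because the last three samples all exceed 80, whereas a single dip below the threshold inside the window resets the condition.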

Pain points of using the self-managed Prometheus service to monitor Windows OSs

The Windows OS that you use is deployed on Elastic Compute Service (ECS) instances. You may encounter the following problems when you use the self-managed Prometheus service to monitor Windows OSs.
  1. To ensure security and facilitate organization management, we recommend that you deploy different services in separate virtual private clouds (VPCs). If you want to use a self-managed Prometheus service to monitor your business, you must deploy the self-managed Prometheus service in each VPC. This increases the deployment and O&M costs.
  2. You must configure Prometheus, Grafana, and Alertmanager in each self-managed monitoring system. The process is complex and requires a long time to complete.
  3. The self-managed Prometheus service does not have a ready-made service discovery mechanism for Alibaba Cloud ECS, so the targets that are deployed on ECS instances cannot be discovered based on ECS tags. To implement a similar mechanism, you must write Go code that calls the POP API of Alibaba Cloud ECS, integrate the code into the open source Prometheus service, and then compile, package, and deploy the result. This process is complex and complicates version upgrades.
  4. Most open source Grafana dashboards for Windows are not designed for specific services. You cannot customize the monitoring metrics based on the principles and best practices of Windows OSs.
  5. No alert template is available for Windows OSs. You must configure the alert rules yourself, which is an effort-consuming process with high technical requirements.

Comparison between self-managed Prometheus service and Alibaba Cloud Managed Service for Prometheus

Alibaba Cloud Managed Service for Prometheus is a managed monitoring service that is compatible with the open source Prometheus ecosystem and provides out-of-the-box dashboards for you to monitor a wide variety of components. It can be used to monitor Alibaba Cloud container services and self-managed Kubernetes clusters, and can be used with the remote write feature. It also provides metric monitoring capabilities for ECS instances that are deployed across multiple clouds or on a hybrid cloud, and supports the unified monitoring of multiple instances. This helps you query Prometheus metrics and receive alerts based on unified Grafana data sources.

Alibaba Cloud Managed Service for Prometheus is seamlessly integrated with ECS. It collects the core monitoring metrics of Windows OSs, including the CPU, memory, disk, network, and process by design. Alibaba Cloud Managed Service for Prometheus also provides out-of-the-box monitoring dashboards and alert metrics for Windows machines.

The following table compares the self-managed Prometheus service with Alibaba Cloud Managed Service for Prometheus in the scenario of monitoring Windows machines.

Item | Self-managed Prometheus service | Alibaba Cloud Managed Service for Prometheus
Deployment and O&M costs | You must purchase ECS instances and deploy Prometheus, Grafana, and Alertmanager individually in multiple VPCs. This results in high O&M costs. | Alibaba Cloud Managed Service for Prometheus is a fully managed and out-of-the-box service that integrates Prometheus monitoring, Grafana dashboards, and the alert center.
Availability, performance, and storage capacity | The overall performance and availability are poor, and the storage capacity is small. | The overall performance and availability are high, and the storage capacity is large.
Service discovery | The service discovery of ECS instances is implemented by using the open source static_configs or third-party service registries. The service discovery process is complex and is costly to maintain. | Alibaba Cloud Managed Service for Prometheus has aliyun_sd_configs. Similar to the LabelSelector for Kubernetes service discovery, you can use ECS tags to locate ECS targets. This greatly simplifies service configuration and O&M tasks.
Grafana dashboard | The open source Grafana dashboard only shows the collected Windows metrics. You cannot customize the monitoring metrics based on the principles and best practices of Windows machines. | Alibaba Cloud Managed Service for Prometheus provides a professional dashboard template for monitoring Windows machines. The dashboard provides a quick and accurate overview on the running status of your Windows machines and helps you troubleshoot issues.
Alert rule | No alert template is available for monitoring Windows machines. You must configure the alert rules. | Alibaba Cloud Managed Service for Prometheus provides professional and flexible alert metric templates based on the best practices of monitoring Windows machines. You can configure alert rules on the GUI.
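
The tag-based discovery described in the table works like a label selector: an instance is a target when it carries every tag in the selector. The following Python sketch illustrates only that matching idea. It is not the implementation of aliyun_sd_configs, and the function names, IP addresses, and tags are hypothetical.

```python
def matches_tags(instance_tags, selector):
    """True when the instance carries every key-value pair in the selector."""
    return all(instance_tags.get(key) == value
               for key, value in selector.items())


def discover_targets(instances, selector, port):
    """Return sorted "ip:port" scrape targets for the matching instances.

    instances maps an instance IP address to its dictionary of ECS tags.
    """
    return sorted(f"{ip}:{port}"
                  for ip, tags in instances.items()
                  if matches_tags(tags, selector))


# Hypothetical inventory: only the first instance is tagged as a Windows host.
inventory = {
    "10.0.0.1": {"app": "windows", "env": "prod"},
    "10.0.0.2": {"app": "linux", "env": "prod"},
}
targets = discover_targets(inventory, {"app": "windows"}, 9182)
```

With static target lists, adding a new instance means editing the configuration. With tag matching, tagging the instance is enough for it to be picked up.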

Use Alibaba Cloud Managed Service for Prometheus to monitor Windows OSs

Perform the following steps to use Alibaba Cloud Managed Service for Prometheus to monitor your Windows OS:

Step 1: Configure the Windows OS

  1. Install and configure Windows Exporter to expose the metrics to Alibaba Cloud Managed Service for Prometheus. For more information, see How do I install and configure Windows Exporter?.
  2. Log on to the ARMS console.
  3. In the left-side navigation pane, click Integration Center. In the Application Components section, click + Add of the Windows component.
  4. In the panel that appears, select ECS Environment in the STEP1 section as the environment where Windows is deployed. In the STEP2 section, select the Prometheus instance where Windows resides.
  5. In the STEP3 section, configure the parameters for integrating Alibaba Cloud Managed Service for Prometheus.
    Parameter | Description
    Exporter Name | The unique name of the Windows Exporter that exports the monitoring metrics of Windows OS.
    Exporter Port Number | The listening port that is configured when you deploy the Windows Exporter.
    Collection Path | The HTTP path of the Windows Exporter from which Alibaba Cloud Managed Service for Prometheus collects monitoring metrics. The default value is /metrics.
    Collection Interval (seconds) | The interval at which monitoring data is collected.
    ECS Tag | The key-value pair of the tag that is added to the ECS instance where Windows Exporter is deployed. Alibaba Cloud Managed Service for Prometheus uses this tag for service discovery.
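
Before you complete the integration, it can help to confirm that the port and collection path you plan to enter actually serve metrics. The following Python sketch builds the URL from those two parameters and fetches the raw metrics text. The function names are illustrative, and the sketch assumes that the exporter is reachable over plain HTTP from the machine where you run it.

```python
from urllib.request import urlopen


def scrape_url(host, port, path="/metrics", scheme="http"):
    """Build the URL that a scraper would request for one exporter."""
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{host}:{port}{path}"


def fetch_metrics(url, timeout=5):
    """Fetch the raw metrics text, for checking the exporter by hand."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `scrape_url("192.168.0.10", 9182)` yields `http://192.168.0.10:9182/metrics`. The IP address is a placeholder, and 9182 is the default port of the open source windows_exporter; use the port that you configured.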

Step 2: View the Windows dashboards

By default, Alibaba Cloud Managed Service for Prometheus provides the overview, the process, and the top N dashboards.

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
  4. Click the Windows card in the Integrated section. On the panel that appears, click the Dashboards tab, and click a dashboard name to view the Windows monitoring metrics.
    • The overview dashboard displays important metrics of a specified Windows OS, including the CPU, memory, disk, and network metrics.
    • The process dashboard displays the CPU, memory, thread, and I/O monitoring information of each process. You can troubleshoot exceptions that occur in processes based on the metrics provided on this dashboard.
    • The top N dashboard displays the top five items for each key metric of the monitored Windows cluster, including the CPU, memory, disk, and network. The top N dashboard shows the overall health status of the Windows cluster in real time.

Step 3: Configure alert rules for monitoring the Windows OS

  1. Log on to the ARMS console.
  2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
  3. Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
  4. Click the Windows card in the Integrated section. On the panel that appears, click the Alerts tab to view the Windows alert rules of the Prometheus instance that you selected. Alibaba Cloud Managed Service for Prometheus provides 11 key alert metrics for Windows OSs, including the alert metrics for the CPU, memory, disk, and network. You can add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.

(Optional) Step 4: Customize Windows monitoring metrics

By default, the Windows Exporter of Alibaba Cloud Managed Service for Prometheus collects the following items: cpu, cpu_info, memory, process, tcp, cs, logical_disk, net, os, system, textfile, and time.

You can modify the configuration file based on your business requirements to collect the metrics of Windows components such as Active Directory, Container, and Domain Name System (DNS). The new configurations of the Windows Exporter take effect after a restart. For more information, see Windows Exporter.
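
When you extend the default list, the exporter expects a single comma-separated value. The following Python sketch merges extra collectors into the defaults listed above without duplicates. The helper name is hypothetical, and the sketch assumes that the result is supplied through the `--collectors.enabled` flag or the `collectors.enabled` key of the configuration file, as in the open source windows_exporter. Verify the collector names against the version that you deploy.

```python
# Default collectors as listed in this topic.
DEFAULT_COLLECTORS = [
    "cpu", "cpu_info", "memory", "process", "tcp", "cs",
    "logical_disk", "net", "os", "system", "textfile", "time",
]


def enabled_collectors(extra=()):
    """Return the comma-separated collector list, defaults first, no duplicates."""
    ordered = []
    for name in list(DEFAULT_COLLECTORS) + list(extra):
        if name not in ordered:
            ordered.append(name)
    return ",".join(ordered)
```

For example, `enabled_collectors(["dns", "ad"])` appends the DNS and Active Directory collectors to the defaults. Remember that the new value takes effect only after the exporter restarts.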