Health score overview

This document explains the core concepts of the entity health score feature in Cloud Monitor 2.0 and how it works. Use this guide to quickly understand and start using the health score feature.

What is entity health score

Entity health score is a proactive health check feature in Cloud Monitor 2.0. It helps you quickly understand the health status of each entity in your system, such as applications, pods, and nodes.

The entity health score proactively identifies potential threats by continuously monitoring entities, enabling you to:

Quickly identify problems: Instantly determine which entities need attention through intuitive red, yellow, and green status indicators.
Receive early warnings: Detect abnormal trends before problems worsen to reduce the risk of faults.
Start out of the box: Built-in detection rules cover common health issues. You can also add custom alert rules for more flexible configuration.
Analyze impact scope: Quickly understand the upstream and downstream dependencies and the scope of impact for problem entities.

Scenarios

Scenario	Description
Routine monitoring	Quickly browse the status of all entities to confirm overall system health.
Fault localization	Start with entities that have an abnormal health score to view specific health events.
Impact scope assessment	Use the impact scope feature to understand the spread of a problem and its upstream and downstream dependencies.
Capacity planning	Identify entities with high resource usage to plan for scale-out in advance.

Key features

1. Quickly locate problem entities

When a workspace has tens or hundreds of applications, checking the metrics of each service individually is time-consuming and prone to errors. The health score list lets you:

View the health status of all entities on a single page.
Distinguish between Normal (green), Warning (yellow), and Critical (red) statuses by color.
Quickly filter for problem entities that require immediate attention.

2. Proactively detect potential threats

Health score inspection can detect several types of potential threats:

Performance degradation trends: For example, a significant increase in response time compared to the same time yesterday.
Unusual traffic patterns: For example, a sudden surge in requests.
Resource pressure warnings: For example, CPU or memory usage approaching the threshold.
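
As an illustration of the first threat type, the following minimal sketch compares a metric with its value at the same time yesterday. It is not the product's detection logic: the function names and the 50% limit are hypothetical and chosen only for the example.

```python
def day_over_day_change(current, same_time_yesterday):
    """Return the relative change versus the value at the same time yesterday."""
    if same_time_yesterday == 0:
        # Avoid division by zero: treat any growth from zero as unbounded.
        return float("inf") if current > 0 else 0.0
    return (current - same_time_yesterday) / same_time_yesterday


def is_degraded(current, same_time_yesterday, max_increase=0.5):
    """Flag the metric when it rose by more than max_increase (hypothetical 50% default)."""
    return day_over_day_change(current, same_time_yesterday) > max_increase


# Average response time in milliseconds: now versus the same time yesterday.
print(is_degraded(180.0, 100.0))  # True: an 80% increase exceeds the 50% limit
print(is_degraded(110.0, 100.0))  # False: a 10% increase stays within the limit
```

A relative comparison like this reacts to changes in trend even when the absolute value is still below a fixed threshold, which is what makes it useful for early warning.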

3. Analyze impact scope

When an entity has a health problem, you can use the impact scope feature to:

View the upstream and downstream dependencies of the entity.
Understand the potential scope of the problem's impact.
Quickly locate the root cause and propagation path of the problem.
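
Conceptually, upstream and downstream dependencies can be found by traversing a dependency graph. The sketch below uses a breadth-first search over a small, hypothetical call graph. The service names and the graph representation are assumptions made for illustration and do not reflect how Cloud Monitor stores topology.

```python
from collections import deque

# Hypothetical call graph: each key calls the services in its list
# (its downstream dependencies).
CALLS = {
    "gateway": ["orders", "users"],
    "orders": ["payments", "inventory"],
    "users": [],
    "payments": [],
    "inventory": [],
}


def reachable(graph, start):
    """Breadth-first search: every node reachable from start, excluding start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


def reverse(graph):
    """Flip every edge so that callees point back to their callers."""
    reversed_graph = {node: [] for node in graph}
    for caller, callees in graph.items():
        for callee in callees:
            reversed_graph.setdefault(callee, []).append(caller)
    return reversed_graph


downstream = reachable(CALLS, "orders")
upstream = reachable(reverse(CALLS), "orders")
print(sorted(downstream))  # ['inventory', 'payments']
print(sorted(upstream))    # ['gateway']
```

In this example, the callers of a problem entity (upstream) are the ones likely to be affected by it, while the services it calls (downstream) are candidates for the root cause.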

4. Flexible customization

The health score feature provides an out-of-the-box experience and also supports customization based on your business needs:

Extend detection rules: In addition to built-in rules, you can add your existing custom alert rules to the health score inspection system. This makes the health assessment more relevant to your business scenarios.
Customize assessment criteria: The thresholds for determining health status can be customized to meet the different definitions of entity health across teams.

Health status determination

Health status levels

Entity health score uses three status indicators:

Color	Status	Meaning
Green	Normal	All metrics are normal, or no health score rules are configured or enabled.
Yellow	Warning	An anomaly that requires attention exists.
Red	Critical	A critical issue that may affect business exists.

Event level threshold (default method)

By default, the system determines the health status from the event level threshold, that is, from the severity level of the health events associated with the entity:

Color	Status	Default Configuration
Green	Normal	No events
Yellow	Warning	A P3 (Warning) or P4 (Normal) event occurred.
Red	Critical	A P1 (Critical) or P2 (Error) event occurred.

You can customize the mapping between event levels and health statuses in Threshold Settings.
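
The default mapping can be expressed as a small lookup. The sketch below is illustrative only: the names are hypothetical, and it assumes that when an entity has events of several levels, the most severe status wins.

```python
# Default mapping from the table above; customizable in Threshold Settings.
DEFAULT_LEVEL_TO_STATUS = {
    "P1": "Critical",
    "P2": "Critical",
    "P3": "Warning",
    "P4": "Warning",
}

# A higher number means a worse status.
STATUS_RANK = {"Normal": 0, "Warning": 1, "Critical": 2}


def health_status(event_levels, level_to_status=DEFAULT_LEVEL_TO_STATUS):
    """Return the worst status implied by the event levels; no events means Normal."""
    status = "Normal"
    for level in event_levels:
        candidate = level_to_status[level]
        # Assumption: the most severe status among all events wins.
        if STATUS_RANK[candidate] > STATUS_RANK[status]:
            status = candidate
    return status


print(health_status([]))            # Normal
print(health_status(["P4", "P3"]))  # Warning
print(health_status(["P3", "P1"]))  # Critical
```

Passing a different dictionary as `level_to_status` models a customized mapping, for example one that treats P4 events as Normal.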

Risk index (advanced feature)

For advanced users with more fine-grained requirements, the system also provides the risk index as an optional determination method. The risk index is a quantified risk value that considers:

The number of health events.
The severity level of the events (P1 Critical / P2 Error / P3 Warning / P4 Normal).
Whether the events persist over time.

After you enable the risk index, the system determines the health status by comparing the risk index with the threshold. You can enable this feature and customize the risk index threshold in Threshold Settings.

Note: The risk index feature is disabled by default. It is suitable for advanced users who need more fine-grained health assessments.
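
This document does not give the risk index formula. The sketch below only illustrates how the three factors listed above could combine into one number that is then compared with a threshold. Every weight, factor, threshold, and name in it is hypothetical, and it assumes that "persistent" means the event is still ongoing.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the real weighting is not documented here.
SEVERITY_WEIGHT = {"P1": 8.0, "P2": 4.0, "P3": 2.0, "P4": 1.0}


@dataclass
class HealthEvent:
    level: str        # "P1" through "P4"
    persistent: bool  # assumption: True while the event is still ongoing


def risk_index(events, persistence_factor=1.5):
    """Sum a weight per event, boosting events that persist (hypothetical formula)."""
    total = 0.0
    for event in events:
        weight = SEVERITY_WEIGHT[event.level]
        if event.persistent:
            weight *= persistence_factor
        total += weight
    return total


def status_from_risk(index, warning_threshold=2.0, critical_threshold=8.0):
    """Compare the index with hypothetical thresholds to pick a health status."""
    if index >= critical_threshold:
        return "Critical"
    if index >= warning_threshold:
        return "Warning"
    return "Normal"


events = [HealthEvent("P3", persistent=True), HealthEvent("P4", persistent=False)]
index = risk_index(events)
print(index, status_from_risk(index))  # 4.0 Warning
```

Unlike the event level threshold, a quantified index lets many low-severity events add up to a worse status than any single one of them would produce.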

How it works

The health score feature is event-driven:

Continuous inspection → Anomaly detection → Health event generation → Status determination → Status display

Continuous inspection: The system periodically checks the metrics of entities based on health rules.
Anomaly detection: When a metric exceeds a threshold or changes abnormally, the system identifies it as a health problem.
Health event generation: The system generates a health event for each problem found and records the detailed context.
Status determination: The health status is determined based on the event level threshold or risk index.
Status display: The status is intuitively displayed through lists, timelines, and impact scope views.
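
The flow above can be sketched as a single inspection pass followed by status determination. This is a simplified illustration, not the product's implementation: the `Rule` and `HealthEvent` classes, the metric names, and the thresholds are hypothetical, and a real system would repeat the inspection periodically.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    metric: str
    threshold: float
    level: str  # event level raised when the rule fires, for example "P2"


@dataclass
class HealthEvent:
    entity: str
    rule: str
    level: str
    value: float  # metric value recorded as context


def inspect(entity, metrics, rules):
    """One inspection pass: emit a health event for each metric above its threshold."""
    events = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            events.append(HealthEvent(entity, rule.name, rule.level, value))
    return events


def determine_status(events):
    """Apply the default event level threshold to the generated events."""
    levels = {event.level for event in events}
    if levels & {"P1", "P2"}:
        return "Critical"
    if levels & {"P3", "P4"}:
        return "Warning"
    return "Normal"


rules = [
    Rule("High error rate", "error_rate", 0.05, "P2"),
    Rule("High CPU usage", "cpu_usage", 0.80, "P3"),
]
events = inspect("orders", {"error_rate": 0.01, "cpu_usage": 0.92}, rules)
print([e.rule for e in events])  # ['High CPU usage']
print(determine_status(events))  # Warning
```

Because the status is derived only from events, adding a new rule changes what can be detected without changing how the status is determined.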

Detection rule system

Health score uses a system of rules for continuous detection. The rules come from the following sources:

Built-in rules

The system has preset detection rules that cover common health problems. You can enable or disable specific rules and adjust their threshold parameters. For example, for the health score of applications in the Application Performance Management (APM) domain, the built-in rules include the following:

Detection Category	Detection Content
Error	Error rate exceeds threshold, average response time exceeds threshold, HTTP 5xx rate exceeds threshold, number of exceptions exceeds threshold, and more.
Anomaly	Period-over-period anomaly in error rate, average response time, request volume, and more.
Water Level (Saturation)	Number of Full GCs, total GC time, CPU usage, memory usage, number of abnormal JVM threads, and more.

When you enable a built-in rule on the inspection configuration page, the system automatically creates a corresponding alert rule. You can view these rules in Cloud Monitor 2.0 Alert Center > Alert Management > Alert Rules. By default, these alert rules do not send notifications. They are only used to generate health events for health score assessment. To receive notifications, you can configure a notification policy for them in the Alert Center.

For more information, see Built-in rules for health score.

Custom alert rules

In addition to built-in rules, you can add custom alert rules configured in the Alert Center to the health score inspection.

On the inspection configuration page, use the Add Custom Alert feature to select existing alert rules and add them to the health score inspection system.

Get started

Go to the Entity Explorer page and switch to the Health Score view.
View the health status list of your application services.
Click any entity to open its details page, then view the health event timeline and impact scope.
Manage detection rules and add custom alerts in Health Score Inspection Configuration.
Customize the criteria for health status determination in Threshold Settings.

For more information, see Health Score User Guide.