Natural Language-driven Fault Diagnosis

Business scenario

When an alert fires or a system anomaly surfaces, you need to pinpoint the root cause quickly. Manually querying multiple monitoring systems, log platforms, and topology maps is slow and prone to missing correlated signals across data sources.

STAROps Intelligent Conversations let you describe the problem in natural language. The Digital Employee automatically correlates multi-source data, reducing fault diagnosis from tens of minutes of manual querying to a few conversational turns.

Solution architecture

A fault diagnosis session involves the following stages:

Problem input: Start a conversation through the CloudMonitor (CMS) or Log Service (SLS) Intelligent Assistant and describe the alert content or observed anomaly.
Data correlation: Based on the problem description, the Digital Employee automatically queries multi-source data including Prometheus metrics, SLS logs, and topology relationships. When an @entity reference is used, the Digital Employee starts the analysis from that entity.
Multi-turn reasoning: Based on the initial analysis, the Digital Employee may suggest further investigation directions. Follow-up questions can guide the analysis toward specific dimensions.
Root cause output: The Digital Employee delivers a root cause analysis report containing anomalous metrics, log evidence, and topology relationships, along with remediation recommendations.

Procedure

Verify environment readiness

Before starting a fault diagnosis session, confirm the following prerequisites.

Checklist item	How to verify
Digital Employee created	In the STAROps console sidebar, click Digital Employee Management and confirm that the target Digital Employee status is normal.
Workspace has observability data sources connected	On the Workspace detail page, confirm that metric (Prometheus) and log (SLS) data sources are connected.
Unified Observability Model (UModel) configured	In the conversation input box, type `@` and confirm that an entity list (such as Service, Node, and Pod) appears. An empty list means UModel is not yet configured.
Default Rules configured (recommended)	Configuring Default Rules significantly improves diagnosis quality. For details, see Build a Custom O&M Intelligent Agent.

Scenario 1: Fault diagnosis through the CloudMonitor entry

The CloudMonitor Intelligent Assistant supports @entity references, making it well suited for alert-driven investigations that trace fault propagation along topology relationships.

Use @entity references

In the conversation input box, type @ to open the entity picker. The system displays the entities available in the current workspace. Entity types include Cluster, Node, Service, and Pod, depending on the UModel configuration.

After selecting a target entity, the Digital Employee starts the analysis from that entity and automatically correlates its upstream and downstream topology and associated metrics.
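To make the shape of an @entity reference concrete, the following minimal sketch extracts references written in the `@Type:name` form that appears in the example dialog in this section. The regular expression, the `EntityRef` type, and the `extract_entity_refs` function are illustrative assumptions, not part of STAROps. The entity picker resolves entities for you, so nothing like this is needed in normal use.

```python
import re
from typing import NamedTuple


class EntityRef(NamedTuple):
    """Hypothetical container for one @entity reference."""
    entity_type: str
    name: str


# Assumed syntax: "@" + entity type + ":" + entity name, e.g. "@Service:order-service".
# The name must start and end with a letter or digit, so trailing punctuation is ignored.
_ENTITY_REF = re.compile(r"@([A-Za-z]+):([A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?)")


def extract_entity_refs(message: str) -> list[EntityRef]:
    """Return every @Type:name reference found in a chat message, in order."""
    return [EntityRef(t, n) for t, n in _ENTITY_REF.findall(message)]
```

A message with no reference yields an empty list, which corresponds to a diagnosis that starts from the free-text description alone.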

Example dialog: Pod restart investigation

The following multi-turn dialog shows how to investigate a Pod restart alert and progressively narrow down the root cause.

Turn 1: Describe the problem

@Service:order-service Pods have been restarting frequently in the past 1 hour. Help me analyze the cause.

The Digital Employee automatically:

Queries the Pod list associated with order-service and their restart counts.
Checks the container exit codes and recent logs for the restarting Pods.
Analyzes the resource usage trends (CPU and memory) for the affected Pods.

Turn 2: Follow up for deeper analysis

If the initial analysis shows that containers were terminated with the reason OOMKilled, continue with a targeted follow-up:

Show me the JVM heap memory usage trend for this Pod, and check whether any large query requests were made recently.

The Digital Employee queries JVM metrics and request logs to identify memory leaks or large queries that may have caused the OOM.
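For readers who want to cross-check the restart counts and exit codes manually, the sketch below summarizes restarts from a Pod object in the JSON shape returned by the Kubernetes API (`status.containerStatuses[].restartCount` and `lastState.terminated`). The function name and the sample Pod are hypothetical, and this is not how the Digital Employee is implemented.

```python
from typing import Any


def summarize_restarts(pod: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one summary per container that has restarted at least once."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        if cs.get("restartCount", 0) == 0:
            continue
        # lastState.terminated describes how the previous container instance ended.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        summaries.append({
            "container": cs.get("name"),
            "restarts": cs["restartCount"],
            "reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
            "oom_killed": terminated.get("reason") == "OOMKilled",
        })
    return summaries


# Hypothetical sample data for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "restartCount": 4,
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "sidecar", "restartCount": 0, "lastState": {}},
        ]
    }
}
restart_summary = summarize_restarts(sample_pod)
```

A summary with `oom_killed` set to true is the signal that motivates the JVM heap follow-up in Turn 2.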

Scenario 2: Fault diagnosis through the Log Service entry

The Log Service Intelligent Assistant is optimized for log-centric analysis and is best suited to investigating error log surges or abnormal request latency.

Example dialog: Error log surge investigation

Turn 1: Describe the problem

payment-service has been throwing large numbers of 500 errors since 14:00. Help me find the cause.

The Digital Employee automatically:

Queries the error log distribution for payment-service around 14:00.
Categorizes errors by type, extracting the most frequent exception stack traces.
Correlates upstream call chains to check for cascading failures.
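The error categorization step can be pictured with a small sketch that counts exception types across 5xx log records. The record layout (`status` and `exception` keys) and the function name are assumptions made for illustration. Real SLS log fields depend on how your logs are collected.

```python
from collections import Counter
from typing import Iterable, Mapping


def top_exceptions(records: Iterable[Mapping], limit: int = 3) -> list[tuple[str, int]]:
    """Count exception types among records with a 5xx status, most frequent first."""
    counts = Counter(
        r["exception"]
        for r in records
        if r.get("status", 0) >= 500 and r.get("exception")
    )
    return counts.most_common(limit)
```

A single dominant exception type usually points to one failing dependency, while a flat distribution suggests a broader problem.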

Turn 2: Verify the fix

I have restarted the database connection pool. Confirm whether the error rate for payment-service has returned to normal.

The Digital Employee compares error rate trends before and after the fix to confirm whether the issue is resolved.
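A before-and-after comparison of this kind reduces to computing the error rate in two equal windows around the fix time. The following sketch shows that arithmetic under assumed inputs: each record is a mapping with a `time` (datetime) and a `status` (HTTP status code), and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Mapping, Optional


def error_rate(records: Iterable[Mapping], start: datetime, end: datetime) -> Optional[float]:
    """Fraction of requests in [start, end) with a 5xx status, or None if there were none."""
    window = [r for r in records if start <= r["time"] < end]
    if not window:
        return None
    errors = sum(1 for r in window if r["status"] >= 500)
    return errors / len(window)


def compare_around_fix(records: list, fix_time: datetime, window: timedelta):
    """Return (rate_before, rate_after) for equal windows on either side of the fix."""
    before = error_rate(records, fix_time - window, fix_time)
    after = error_rate(records, fix_time, fix_time + window)
    return before, after
```

Returning `None` for an empty window keeps "no traffic" distinct from "no errors", which matters when a service is still warming up after a restart.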

Follow-up techniques: improving AI diagnosis quality

When the AI's initial response is too shallow or misdirected, use the following follow-up strategies to steer the analysis.

Situation	Example follow-up	Expected outcome
Analysis too shallow	"Analyze this separately from three angles: JVM, connection pool, and downstream dependencies."	Guides the AI to explore multiple dimensions instead of a single perspective.
Analysis headed in the wrong direction	"The issue is not at the network layer. Focus on the application-layer error logs."	Corrects the analysis direction and avoids wasting time on irrelevant dimensions.
Missing context	"This service had a release at 14:00 today, upgrading the image from v2.3.1 to v2.4.0. Please re-analyze with this change in mind."	Provides business context the AI cannot retrieve automatically, such as release records and configuration changes.
Need time-based comparison	"Compare the metrics from the same time yesterday with today's data."	Detects anomalous deviations through baseline comparison.
Need topology analysis	"Check whether the upstream and downstream services of order-service also show anomalies."	Traces the source of cascading failures along topology relationships.
Verifying a fix	"I have scaled up the Pod replicas. Confirm whether the latency has dropped."	Validates whether the remediation measure has taken effect.
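The time-based comparison in the table above amounts to a baseline check: each of today's data points is compared with the point at the same time yesterday, and large relative changes are flagged. The sketch below illustrates the idea. The function name, the aligned-list input format, and the 50% default threshold are assumptions chosen for the example, not values used by STAROps.

```python
from typing import Sequence


def deviations_from_baseline(
    today: Sequence[float],
    yesterday: Sequence[float],
    threshold: float = 0.5,
) -> list[tuple[int, float, float, float]]:
    """Return (index, baseline, current, relative_change) for points beyond the threshold.

    Both sequences are assumed to be sampled at the same times of day, in the same order.
    """
    flagged = []
    for i, (cur, base) in enumerate(zip(today, yesterday)):
        if base == 0:
            continue  # relative change is undefined against a zero baseline
        change = (cur - base) / base
        if abs(change) > threshold:
            flagged.append((i, base, cur, change))
    return flagged
```

A relative threshold suits metrics whose normal level varies through the day, such as request rate. An absolute threshold may be a better fit for metrics with a fixed ceiling, such as CPU percentage.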

Diagnosis report output format

After diagnosis is complete, ask the AI to produce a structured diagnosis report using the following prompt:

Summarize the analysis and output a structured diagnosis report that includes root cause, impact scope, cascade relationships, and recommended actions.

The AI typically outputs a report in the following format:

Fault Diagnosis Report

Root Cause
  pod-storage-02 triggered OOMKilled due to an excessively low memory limit (2 GiB).

Impact Scope
  - Direct impact: Available Pods for service-storage dropped from 3 to 2.
  - Indirect impact: Upstream pod-data-processor triggered a retry storm, causing CPU on node-03 to spike to 92%.

Cascade Failure Chain
  pod-storage-02 OOM → service-storage latency increase → pod-data-processor retry storm → node-03 CPU spike

Recommended Actions
  1. [Immediate] Increase pod-storage-02 memory limit to 4 GiB.
  2. [Immediate] Confirm that pod-storage-02 has recovered.
  3. [Short-term] Configure a PodDisruptionBudget (minAvailable=2) for service-storage.
  4. [Long-term] Configure an exponential backoff retry strategy for pod-data-processor.
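The long-term action in the sample report mentions an exponential backoff retry strategy. As a generic illustration of that technique, and not something STAROps generates, the sketch below computes retry delays that double on each attempt up to a cap, with random jitter so that many clients do not retry in lockstep. The function name and default values are assumptions.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return one delay in seconds per retry, using exponential backoff with full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles with every attempt and is bounded by the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay between 0 and the ceiling.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

Spreading retries out in this way is what prevents the kind of retry storm described in the sample report's cascade failure chain.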

From diagnosis to continuous monitoring

If issues uncovered during diagnosis require ongoing attention, set up automated monitoring by creating a long-running Mission:

Create a long-running mission to inspect service-storage memory usage and Pod health every day. Notify me immediately if OOM risk is detected.

The AI will guide you through Mission creation and blueprint planning. For details, see Create and Plan a Long-running Mission.