Cloud Monitor: What is CloudMonitor 2.0

Last Updated:Dec 04, 2025

Alibaba Cloud CloudMonitor 2.0 is a one-stop observability platform that integrates Simple Log Service (SLS), CloudMonitor (CMS), and Application Real-Time Monitoring Service (ARMS). It unifies metrics, traces, logs, and events into a single view. Using UModel modeling and an observability graph, combined with visualization and alerting capabilities, CloudMonitor 2.0 automatically associates resources and performs intelligent diagnostics. This provides full-stack, end-to-end observability from the infrastructure to the application layer, which lets you quickly find and resolve potential issues to improve O&M efficiency. It is widely used in complex scenarios that involve microservices, containers, and cloud services.

CloudMonitor 2.0 uses AI-enhanced, cross-domain insights to analyze and predict system performance in real time. It identifies anomalies in advance and provides intelligent fault diagnosis and optimization suggestions. This helps businesses build a full-stack observability system for the AI-native era that is smarter, more efficient, and more cost-effective, ensuring business stability and security.

Try in Playground

Alibaba Cloud Playground provides a demo environment where you can try the main features of CloudMonitor 2.0.

Go to the Playground Demo. You will be directed to the workspace by default.

Benefits

One-stop unified observability

CloudMonitor 2.0 integrates the core capabilities of CloudMonitor (CMS), Simple Log Service (SLS), and Application Real-Time Monitoring Service (ARMS), unifying multiple data sources, such as metrics, logs, traces, and events. This eliminates the need to deploy and maintain separate monitoring tools. You can achieve comprehensive, end-to-end observability from the infrastructure to the application layer on a single platform, which significantly reduces the complexity and management costs of your monitoring system.

Unified data modeling

Based on the Universal Observability Model (UModel), CloudMonitor 2.0 breaks down data silos between metrics, logs, traces, and changes to build a panoramic digital view of your IT system. This model allows people, programs, and AI to understand and analyze observable data, helping you build true full-stack observability and improve the efficiency of issue resolution.

AI-driven intelligent analysis

Building on the unified observability data model, CloudMonitor 2.0 uses machine learning to perform deep pattern recognition and association analysis. This enables precise anomaly detection, trend prediction, and intelligent alert denoising. It also leverages large language models (LLMs) to turn complex observable data into deep insights. The result is "conversational O&M": you use natural language to interact with an intelligent assistant to quickly locate, analyze, and resolve issues.

Open and compatible with mainstream ecosystems

CloudMonitor 2.0 is fully compatible with the open source technology ecosystem. It natively supports industry-standard tools such as Prometheus, Grafana, OpenTelemetry, and Elasticsearch, so you can migrate and integrate your existing monitoring assets and technology stacks smoothly. You get unified monitoring for both cloud-native applications and hybrid cloud environments in a flexible, open solution without vendor lock-in.

Terms

Before using CloudMonitor 2.0, you need to understand the following basic concepts.

Term

Description

Workspace

A workspace is an abstract layer in CloudMonitor 2.0 that represents a collection of resources. It provides teams with unified management and data isolation for resource groups. The selected region is used to store the data and configuration information accessed by the workspace. Using workspaces, you can create multiple independent resource environments. Each environment can have its own set of objects, such as cloud services, infrastructure, server-side and frontend applications, and middleware. The resources within each group are isolated from each other. This prevents resource contention between different groups and improves the security of resource usage.

  • The region of a workspace is the region of its underlying data source. It is strictly confined to that region and is also the storage area for EntityStore.

  • WorkSpaceName: Must be globally unique, similar to a ProjectName. You can specify it during creation. If you do not specify a name, the system generates one automatically. The WorkSpaceName cannot be modified after creation and cannot be duplicated. It serves as the unique identifier in API calls.

  • WorkSpaceDisplayName: Specified by the user. It can be modified and can be duplicated.

  • Default workspace: The name format is default-cms-{uid}-{regionId}. The Simple Log Service (SLS) project bound to the default workspace has a similar name based on default-cms-{uid}-{regionId}, but it may have a random string appended, so it does not strictly follow this format.

  • The first workspace that a user creates in a region is always the default workspace.
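
The naming rules above can be sketched in a few lines of Python. This is only an illustration: the helper names, the account ID, the region, and the shape of the appended random string are assumptions, not part of any CloudMonitor 2.0 API.

```python
def default_workspace_name(uid: str, region_id: str) -> str:
    """Build the default workspace name: default-cms-{uid}-{regionId}."""
    return f"default-cms-{uid}-{region_id}"


def may_be_default_sls_project(project_name: str, uid: str, region_id: str) -> bool:
    """Loose check for the bound SLS project.

    The project name resembles the default workspace name but may carry an
    appended random string, so only the prefix is compared.
    """
    return project_name.startswith(default_workspace_name(uid, region_id))


# Hypothetical account ID, region, and suffix, for illustration only.
name = default_workspace_name("1234567890", "cn-hangzhou")
print(name)
print(may_be_default_sls_project(name + "-a1b2c3", "1234567890", "cn-hangzhou"))
```

Because the SLS project name does not strictly follow the format, a prefix match like this should be treated as a hint rather than a guarantee.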

Application (App)

An observable App is a medium for reading and writing data from data sources within a workspace. You can show or hide Apps in a workspace. An App typically represents the observability knowledge for a specific scenario. It has the following features:

  • Apps are lightweight. You can enable or disable them. When disabled, they are hidden from the navigation pane on the left but can be re-enabled at any time. Different users, at the RAM user level, can choose to enable different Apps.

  • Apps are stateless. The application center in each workspace contains all Apps. If a user does not have permission to access the data source associated with an App, the App will show no data when opened.

  • An App can be associated with and access multiple data sources. By switching data sources, it reveals the relationship with the IaaS layer.

  • The integration process in the Integration Center defines the initialization of the App and the creation of the data source.

  • An App can reference another App. For example, Explorer and Alerting can be embedded in other Apps or exist as standalone Apps.

Entity

An entity is an observable object, such as a container cluster or an ECS server.

Model (UModel)

UModel is a specification for defining observability data models. It is used to define models for various observable objects, including logs, metrics, traces, and entities, along with the relationships between them. This achieves unified definition and management of observable data.
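
To make the idea of entities and their relationships concrete, the following Python sketch models a tiny observability graph. It is not the UModel schema: the class names, the entity type labels, and the "runs_on" relation are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """An observable object, such as a container cluster or an ECS server."""
    entity_id: str
    entity_type: str  # hypothetical type label, e.g. "ecs.instance"


@dataclass
class ObservabilityGraph:
    entities: dict = field(default_factory=dict)   # entity_id -> Entity
    relations: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity

    def relate(self, source_id: str, relation: str, target_id: str) -> None:
        if source_id not in self.entities or target_id not in self.entities:
            raise KeyError("both entities must be registered before relating them")
        self.relations.append((source_id, relation, target_id))

    def neighbors(self, entity_id: str) -> list:
        """Entities directly reachable from entity_id, with the relation name."""
        return [(rel, self.entities[dst])
                for src, rel, dst in self.relations if src == entity_id]


graph = ObservabilityGraph()
graph.add_entity(Entity("svc-checkout", "apm.service"))
graph.add_entity(Entity("i-001", "ecs.instance"))
graph.relate("svc-checkout", "runs_on", "i-001")
print(graph.neighbors("svc-checkout"))
```

Keeping relations as explicit edges is what allows a query to move from an application-level entity to the infrastructure it depends on.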

Features

Feature

Description

Full-stack data collection and monitoring

  • Infrastructure monitoring

    • Cloud service monitoring: Real-time collection of performance metrics (CPU, memory, disk, network traffic, etc.) for cloud services such as ECS, RDS, SLB, Container Service for Kubernetes (ACK), Serverless Kubernetes (ASK), and Kubernetes pods.

    • Network performance monitoring: Provides network-layer monitoring capabilities, including network latency, packet loss rate, DNS resolution, and TCP/UDP connection status.

  • Application performance monitoring (APM)

    • Distributed tracing: Traces microservice invocation chains, supporting mainstream languages such as Java, Python, and Go. It displays API response times, error rates, dependency topologies, and more.

    • Code-level diagnostics: Pinpoints application performance bottlenecks through thread profiling, slow SQL statement detection, and stack tracing.

  • Log monitoring

    • Log collection and storage: Supports log collection from servers, containers, and Function Compute (FC). It is compatible with logging frameworks such as Log4j and Logback.

    • Real-time log analysis: Provides SQL-like query syntax, keyword-based alerting, and log clustering analysis, such as aggregating and counting error logs.
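
As a rough illustration of aggregating and counting error logs, the Python sketch below groups ERROR lines by a normalized message. It does not use the product's query syntax. The "LEVEL message" line format and the normalization rules are assumptions, and real log clustering is considerably more sophisticated.

```python
import re
from collections import Counter


def normalize(message: str) -> str:
    """Collapse variable parts (hex IDs, numbers) so similar messages cluster."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    return re.sub(r"\d+", "<NUM>", message)


def count_error_patterns(lines):
    """Count ERROR lines grouped by normalized message."""
    counts = Counter()
    for line in lines:
        level, _, message = line.partition(" ")
        if level == "ERROR":
            counts[normalize(message)] += 1
    return counts


# Hypothetical log lines in an assumed "LEVEL message" format.
sample_logs = [
    "ERROR timeout calling order-service after 3000 ms",
    "INFO request 42 served",
    "ERROR timeout calling order-service after 5000 ms",
    "ERROR null pointer at 0x7ffde4",
]
for pattern, n in count_error_patterns(sample_logs).most_common():
    print(n, pattern)
```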

Intelligent analysis and diagnostics

  • Anomaly detection and alerting

    • Dynamic threshold alerting: Uses machine learning algorithms to automatically learn historical patterns of metrics and identify abnormal fluctuations, such as a sudden spike in CPU utilization.

    • Multi-condition composite alerts: Supports alerts based on correlations across multiple metrics, such as triggering an alert when "CPU > 90% AND network packet loss rate > 5%".

  • Root cause analysis (RCA)

    • Intelligent association analysis: Automatically correlates abnormal metrics, log errors, and trace data to generate root cause analysis reports, such as an API timeout causing a downstream service cascade failure.

    • Time series data backtracking: Provides a historical data comparison feature to quickly locate the time of an anomaly and its impact scope.
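
The two alerting ideas above can be sketched as follows. The dynamic threshold here is a deliberately simple stand-in (mean plus k standard deviations over recent history), not the machine learning algorithm the product uses. The composite rule mirrors the example "CPU > 90% AND network packet loss rate > 5%". Function names and sample values are assumptions.

```python
from statistics import mean, stdev


def dynamic_threshold(history, k=3.0):
    """Upper bound learned from history: mean + k * sample standard deviation."""
    return mean(history) + k * stdev(history)


def is_anomalous(value, history, k=3.0):
    """Flag a value that exceeds the threshold derived from history."""
    return value > dynamic_threshold(history, k)


def composite_alert(cpu_percent, packet_loss_percent):
    """The example rule from the text: CPU > 90% AND packet loss rate > 5%."""
    return cpu_percent > 90 and packet_loss_percent > 5


# Hypothetical CPU utilization history, in percent.
cpu_history = [41.0, 43.5, 40.2, 44.1, 42.3, 39.8, 43.0]
print(is_anomalous(95.0, cpu_history))
print(composite_alert(cpu_percent=95.0, packet_loss_percent=6.2))
```

A static threshold of 90% would miss a jump from 42% to 70%, whereas a threshold derived from history reacts to deviations from the metric's usual range.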

Visualization and reporting

  • Custom monitoring dashboards

    • Drag-and-drop dashboards: Supports visualization components such as line charts, column charts, and topology graphs to flexibly display key metrics across resources and services.

    • Scenario-based template library: Provides pre-built monitoring templates for scenarios such as e-commerce sales promotions, container clusters, and database performance to create panoramic business views with one click.

  • Business overview dashboards

    • Real-time data display: Supports full-screen display of core business metrics, such as order volume and payment success rate, suitable for O&M war room scenarios.

    • Multi-tenant view isolation: Assigns data viewing permissions by team or line-of-business to ensure data security.

Alert and notification management

  • Multi-channel alert delivery

    • Notification channels: Supports alert push notifications through DingTalk, WeCom, text messages, email, and webhooks. It also supports alert silencing during specific periods, such as notifying only on-duty personnel during non-working hours.

    • Alert escalation policies: Set tiered alerts (such as "Notice -> Critical -> Fatal"). If there is no timely response, the notification is automatically escalated to other recipients.

  • Closed-loop alert management

    • Alert history and statistics: Records the status of alert handling (Confirmed, Resolved) and generates Mean Time To Repair (MTTR) analysis reports.

    • Integration with O&M tools: Alerts can automatically trigger actions in ticketing systems (such as DingTalk Yida) or automated O&M scripts (such as restarting a service).
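
A minimal sketch of tiered escalation and MTTR reporting, under stated assumptions: the tier names come from the example above, but the 15-minute wait per tier, the function names, and the data layout are hypothetical and do not reflect the product's configuration model.

```python
from datetime import datetime, timedelta

# Tier names come from the text; the 15-minute wait per tier is an assumption.
TIERS = ["Notice", "Critical", "Fatal"]
ESCALATE_AFTER = timedelta(minutes=15)


def tier_at(fired_at, now, acknowledged_at=None):
    """Return the tier reached at `now`; escalation stops once acknowledged."""
    effective = min(now, acknowledged_at) if acknowledged_at else now
    steps = int((effective - fired_at) / ESCALATE_AFTER)
    return TIERS[min(steps, len(TIERS) - 1)]


def mean_time_to_repair(incidents):
    """incidents: list of (fired_at, resolved_at) pairs; returns a timedelta."""
    durations = [resolved - fired for fired, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


t0 = datetime(2025, 1, 1, 12, 0)
print(tier_at(t0, t0 + timedelta(minutes=20)))
print(mean_time_to_repair([(t0, t0 + timedelta(minutes=10)),
                           (t0, t0 + timedelta(minutes=30))]))
```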

Openness and integration capabilities

  • Seamless ecosystem integration

    • Alibaba Cloud service integration: Deeply integrates with services such as Simple Log Service (SLS), Application Real-Time Monitoring Service (ARMS), and Apsara DevOps. This enables automatic data correlation, such as directly jumping from a log query to the corresponding abnormal trace.

    • Third-party tool compatibility: Supports open source protocols such as Prometheus, OpenTelemetry, and Telegraf. It is compatible with Grafana for visualization and Jenkins for continuous integration.

  • API and SDK support

    • OpenAPI management: Automate monitoring configurations through APIs, such as creating alert rules in batches or exporting monitoring data.

    • Custom metric reporting: Allows users to report business metrics (such as order volume and activity PV/UV) through an SDK to expand the scope of monitoring.
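
Since Prometheus is one of the supported open source protocols, one way to picture a custom business metric is as a sample in the Prometheus text exposition format. The sketch below only renders that text. It is not the CloudMonitor 2.0 SDK, the metric and label names are hypothetical, and label-value escaping is omitted for brevity.

```python
def to_prometheus_text(name, help_text, samples):
    """Render gauge samples in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical business metric: orders placed per sales channel.
print(to_prometheus_text(
    "order_volume",
    "Orders placed",
    [({"channel": "web"}, 128), ({"channel": "app"}, 342)],
))
```

How the rendered samples reach the platform (scrape, remote write, or an SDK call) depends on the integration you choose and is out of scope for this sketch.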

Security and high availability

  • Data security assurance

    • End-to-end encryption: Monitoring data is encrypted throughout the process, during both transmission (HTTPS) and storage (encrypted storage).

    • Permission management: Implements fine-grained permission management based on RAM roles, such as "read-only access" or "alert configuration permissions".

    • Security and compliance assurance: Complies with various international and domestic security standards to ensure the security and reliability of data transmission and storage during monitoring.

  • Service reliability

    • Globally distributed collection nodes: Monitoring nodes cover major regions worldwide to prevent data loss due to network jitter.

    • Redundant data storage: Monitoring data is stored in multiple replicas to ensure data recoverability.

Cost optimization features

  • Resource usage analysis

    • Idle resource identification: Automatically marks long-term underutilized resources, such as ECS instances and unattached EIPs, and generates suggestions for their release.

    • Cost allocation reports: Gathers statistics on cloud resource consumption by project, department, or tag to support cost allocation and budget control.

  • Adaptive data sampling

    • On-demand adjustment of collection frequency: Reduces the collection frequency for non-critical metrics, for example, from 1 minute to 5 minutes, to lower data storage costs.
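
The effect of a longer collection interval is simple arithmetic, sketched below for the 1-minute to 5-minute example. The function names are illustrative, and the calculation counts data points only. Actual storage savings also depend on compression and retention settings.

```python
SECONDS_PER_DAY = 86_400


def points_per_day(interval_seconds):
    """Number of data points one metric produces per day at a given interval."""
    return SECONDS_PER_DAY // interval_seconds


def point_reduction(old_interval_seconds, new_interval_seconds):
    """Fraction of data points saved by moving to a longer interval."""
    old = points_per_day(old_interval_seconds)
    new = points_per_day(new_interval_seconds)
    return 1 - new / old


print(points_per_day(60), points_per_day(300))
print(f"{point_reduction(60, 300):.0%} fewer data points per metric")
```

Moving from 1 minute to 5 minutes cuts each metric from 1,440 to 288 points per day, which is why the adjustment is reserved for non-critical metrics where coarser resolution is acceptable.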

Cross-region unified management

Supports centralized monitoring and management of resources distributed across multiple regions, simplifying O&M workflows.

Scenarios

Scenario

Scenario description

Benefits

Scenario 1: Full-stack unified monitoring and real-time observability graph

A business needs to monitor resources such as physical servers, container clusters, microservice applications, and databases in a hybrid cloud environment. Traditional tools are often fragmented, which leads to low O&M efficiency. CloudMonitor 2.0 builds an end-to-end observability graph by unifying the collection of metrics (such as CPU and memory), traces (such as API call chains), logs (such as error logs), and events (such as configuration changes). This provides a global, visualized view of the state across all resources and services.

  • Multi-source data fusion: Supports integration with over 50 data sources, covering infrastructure, middleware, and the application layer to eliminate data silos.

  • Visual dashboards: Custom views display resource topologies, service dependencies, and key performance indicators (KPIs).

  • Cross-domain association analysis: Automatically correlates abnormal metrics with related logs and trace information to quickly identify the root cause.
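
Cross-domain association can be pictured as joining signals on a shared entity and a time window. The Python sketch below is a simplified stand-in for that idea, not the product's correlation engine. The record fields, entity names, and the two-minute window are assumptions.

```python
from datetime import datetime, timedelta


def correlate(anomaly, logs, traces, window=timedelta(minutes=2)):
    """Collect logs and traces for the same entity within +/- window of the anomaly."""
    def near(record):
        return (record["entity"] == anomaly["entity"]
                and abs(record["time"] - anomaly["time"]) <= window)

    return {"logs": [r for r in logs if near(r)],
            "traces": [r for r in traces if near(r)]}


# Hypothetical records for illustration.
t = datetime(2025, 1, 1, 10, 0)
anomaly = {"entity": "svc-checkout", "metric": "latency_p99", "time": t}
logs = [
    {"entity": "svc-checkout", "time": t + timedelta(seconds=30), "message": "timeout"},
    {"entity": "svc-checkout", "time": t + timedelta(minutes=10), "message": "ok"},
]
traces = [
    {"entity": "svc-checkout", "time": t - timedelta(seconds=45), "trace_id": "t-1"},
    {"entity": "svc-cart", "time": t, "trace_id": "t-2"},
]
result = correlate(anomaly, logs, traces)
print(result)
```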

Scenario 2: Intelligent anomaly detection and fault prediction

It is difficult to manually identify potential faults during traffic spikes or in complex architectures. CloudMonitor 2.0 uses machine learning models to analyze historical data, predict risks such as system capacity bottlenecks and service response delays in real time, and trigger early warnings.

  • Root cause identification: Uses metrics, traces, and profiling data for real-time detection and computation, covering scenarios such as high latency, high error rates, exceptions, and out-of-memory (OOM) errors.

  • Impact analysis: Analyzes business impact on end users, frontend applications, and page requests, as well as application-level impact on applications, APIs, databases, containers, and ECS instances.

  • Copilot self-service exploration: Use generative AI to automatically obtain detection reports, solutions, and more.

  • Alert convergence: Converges alerts across products and instances to prevent multiple alerts for the same root cause.

Scenario 3: End-to-end full-stack tracing from client to server (APM)

In a microservices model, a single user request can involve dozens of service invocations and frontend-backend calls, which makes performance bottlenecks difficult to trace. CloudMonitor 2.0 combines full-stack tracing with code-level diagnostics, linking user experience with the underlying infrastructure. It builds a full-stack observability graph to accurately analyze issues such as slow queries and deadlocks.

  • Full-stack observability graph: Covers various observable objects, such as services, APIs, and cloud service instances. It includes rich observable data, such as metrics, events, and metadata, and provides cross-domain entity association relationships.

  • Associated data query and analysis

    • Upstream: Dynamically obtains real-time information about upstream access terminals to analyze business impact.

    • Downstream: Dynamically obtains real-time, complete monitoring information for downstream dependencies (such as middleware and databases) and containers.

  • Dynamic architecture awareness: Provides a panoramic, global topology and dynamically generates a complete Configuration Management Database (CMDB) with automatic discovery capabilities.

Scenario 4: Security, compliance, and threat insights

Businesses need to monitor security events, such as abnormal logons and data breaches, in real time while meeting compliance audit requirements. CloudMonitor 2.0 quickly detects potential threats through real-time log analysis and behavior pattern recognition.

  • Real-time threat detection: Identifies attack behaviors such as abnormal logons and SQL injection based on a rules engine and AI models.

  • Compliance audit reports: Automatically generates resource operation log reports to support compliance requirements such as MLPS and GDPR.

  • Automated response: Integrates with security groups or Web Application Firewall (WAF) to automatically block access from high-risk IP addresses.

Scenario 5: Resource optimization and cost management

A lack of transparency in cloud resource usage can easily lead to waste. CloudMonitor 2.0 analyzes resource utilization and recommends Auto Scaling policies and solutions for releasing idle resources.

  • Utilization analysis: Identifies underutilized resources, such as low-load ECS instances and unattached disks, and generates an optimization checklist.

  • Cost prediction: Predicts monthly bills based on historical consumption trends and provides cost-reduction suggestions.

  • Automated elasticity: Automatically scales Kubernetes clusters or serverless services in or out based on traffic.
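
Utilization analysis and cost prediction can be sketched with two small functions. The 10% CPU threshold, the field names, and the use of a least-squares linear trend are assumptions made for illustration. The product's actual models are not described in this document.

```python
def underutilized(instances, cpu_threshold=10.0):
    """Return IDs of instances whose average CPU utilization is below the threshold."""
    return [i["id"] for i in instances if i["avg_cpu_percent"] < cpu_threshold]


def predict_next_month(monthly_costs):
    """Fit a least-squares line through past monthly bills and extrapolate one month.

    Requires at least two months of history.
    """
    n = len(monthly_costs)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(monthly_costs) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_costs))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean + slope * (n - x_mean)


# Hypothetical instances and billing history.
instances = [
    {"id": "i-001", "avg_cpu_percent": 55.0},
    {"id": "i-002", "avg_cpu_percent": 3.2},
]
print(underutilized(instances))
print(predict_next_month([100.0, 110.0, 120.0]))
```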

Scenario 6: Intelligent alerting and automated O&M

Traditional alerting is prone to false positives or information overload. CloudMonitor 2.0 improves alert accuracy through alert denoising, dynamic thresholds, and tiered notification mechanisms. It also supports automated remediation actions.

  • Alert aggregation: Merges similar events to avoid duplicate notifications.

  • Multi-channel delivery: Pushes notifications to DingTalk, email, or text message based on severity level.

  • Automated playbooks: Triggers pre-configured scripts to perform operations such as service restarts or faulty node isolation.
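
Alert aggregation can be illustrated by merging alerts that share a key and fire close together. In this sketch the grouping key (resource plus rule), the five-minute window, and the record fields are all assumptions. It shows the general technique rather than the product's denoising logic.

```python
def aggregate_alerts(alerts, window_seconds=300):
    """Merge alerts with the same (resource, rule) that fire within
    window_seconds of the first alert in their group."""
    groups = []
    open_groups = {}  # key -> index of the most recent group for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["resource"], alert["rule"])
        idx = open_groups.get(key)
        if idx is not None and alert["ts"] - groups[idx]["first_ts"] <= window_seconds:
            groups[idx]["count"] += 1
        else:
            open_groups[key] = len(groups)
            groups.append({"key": key, "first_ts": alert["ts"], "count": 1})
    return groups


# Hypothetical alerts; ts is seconds since an arbitrary start.
alerts = [
    {"resource": "i-001", "rule": "cpu_high", "ts": 0},
    {"resource": "i-001", "rule": "cpu_high", "ts": 60},
    {"resource": "i-002", "rule": "cpu_high", "ts": 30},
    {"resource": "i-001", "rule": "cpu_high", "ts": 1000},
]
for group in aggregate_alerts(alerts):
    print(group)
```

Four raw alerts become three notifications here: the two early alerts on the same instance are merged, while the alert that arrives after the window opens a new group.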

Scenario 7: Managed services for open source observability components and intelligent O&M

Businesses widely use open source observability tools (such as Prometheus, Grafana, and OpenTelemetry) in hybrid or multicloud environments, but face the following challenges:

  1. High O&M complexity: Self-managed Prometheus clusters require you to manage the entire chain of data collection, storage, and alerting, which leads to high deployment and scaling costs.

  2. Data silo issues: OpenTelemetry trace data, Prometheus metrics, and Grafana dashboards are stored separately and lack unified analysis capabilities.

  3. Lack of intelligent capabilities: Open source tools rely on manual configuration of alert rules and root cause analysis, which makes it difficult to handle the dynamic nature of AI-native architectures.

  • Cost reduction and efficiency improvement: Managed services eliminate 90% of O&M workloads and increase resource utilization by 30%.

  • Full-stack observability: Covers infrastructure (Prometheus metrics), application performance (OpenTelemetry traces), and user experience (Grafana visualization).

  • Open compatibility: Supports seamless integration with the open source ecosystem (such as Prometheus Operator and Grafana plugins) to meet the needs of enterprise hybrid cloud technology stacks.

List of observable applications

Applications are grouped by application type. Each entry gives the application name followed by its description.

Resident

  • Alert Center: Manages all alert information centrally.

  • Application Center: Manages all applications and their related services centrally.

  • Integration Center: Provides integration and management for various observable objects and data.

  • Entity Explorer: Explores the status and performance of different observable objects.

  • Cloud Service Monitoring: Provides basic monitoring metric queries and alerting services for Alibaba Cloud services.

Application Observability

  • Application Monitoring: Provides real-time monitoring and fault diagnosis for application performance.

  • Real User Monitoring: Focuses on monitoring for web, mobile app, and miniapp scenarios.

  • AI Application Observability: Provides one-stop, full-stack observability for AI applications.

Service Monitoring

  • Prometheus Service: A fully managed cloud service for Prometheus to build a high-performance monitoring system.

  • Incident Response: Aggregates alert events into incidents for management.

  • Synthetic Monitoring: Simulates user requests to proactively monitor network quality, service availability, and user experience.

  • Database Observability: Provides one-stop observability for database services.

  • Log Audit: Records and audits operation logs.

Cloud Service Insights

  • PAI Insights: Provides one-stop, full-stack observability for Platform for AI (PAI).

  • Container Insights: Provides in-depth analysis of the running status of Kubernetes clusters.

  • ECS Insights: Provides advanced monitoring features for Elastic Compute Service (ECS).

Intelligent Exploration and Analysis

  • UModel Explorer: A debugging tool for entities and UModel.

  • Data Explorer: Explores and analyzes various monitoring metrics and data.

  • Event Center: Manages various types of event information centrally.

  • Dashboard: A comprehensive dashboard that displays key metrics.

  • Log Explorer: Provides log data exploration and analysis services.