Cloud Monitor Architecture & Observability Overview

Cloud Monitor 2.0 is a unified observability platform that integrates Simple Log Service (SLS), Cloud Monitor (CMS), and Application Real-Time Monitoring Service (ARMS). It consolidates metrics, traces, logs, and events into a single view. Using the UModel observability framework and observability graph, Cloud Monitor 2.0 combines visualization and alerting to automatically associate resources and perform intelligent diagnostics. It delivers full-stack, end-to-end observability—from infrastructure to applications—so you can quickly detect and resolve issues and improve operations and maintenance (O&M) efficiency. It supports complex environments such as microservices, containers, and cloud services.

Cloud Monitor 2.0 uses AI-enhanced cross-domain insights to analyze and predict system performance in real time. It detects anomalies early and provides intelligent fault diagnosis and optimization suggestions. This helps enterprises build a smarter, more efficient, and lower-cost full-stack observability system in the AI-native era—ensuring business stability and security.

Try in Playground

Alibaba Cloud Playground provides a demo environment where you can experience the main features of Cloud Monitor 2.0.

Open the Playground demo. By default, you enter a workspace directly.

Benefits

Unified observability

Cloud Monitor 2.0 deeply integrates the core capabilities of CMS, SLS, and ARMS. It unifies metrics, logs, traces, and events into one platform. You no longer need to deploy or maintain multiple standalone monitoring tools. Instead, you can obtain full-stack, end-to-end visibility—from infrastructure to applications—in one place. This reduces complexity and management costs.

Unified data modeling

UModel (Universal Observability Model) connects data silos—including metrics, logs, traces, and configuration changes—to build a complete digital view of your IT systems. People, programs, and AI can all understand and analyze observability data. This enables true full-stack observability and accelerates issue detection.

AI-powered intelligent analysis

Using the unified observability data model, Cloud Monitor 2.0 applies machine learning for deep pattern recognition and association analysis. It delivers anomaly detection, trend forecasting, and intelligent alert noise reduction. It also uses large language models to transform complex observability data into actionable insights. This enables “conversational O&M”: you can interact with an intelligent assistant in plain language to locate, analyze, and fix problems.

Open and compatible with mainstream ecosystems

Cloud Monitor 2.0 fully embraces open-source technology. It natively supports Prometheus, Grafana, OpenTelemetry, Elasticsearch, and other industry standards and tools. Your existing monitoring assets and technology stack migrate smoothly. Whether you run cloud-native applications or hybrid cloud environments, you get seamless, unified monitoring. This gives you a flexible, open, vendor-neutral solution.

Terms

Before using Cloud Monitor, learn these basic concepts.

Term	Description
Workspace	A workspace is an abstraction layer in Cloud Monitor 2.0 that groups a set of resources. It gives teams unified management and resource-group data isolation. With workspaces, you can create multiple independent resource environments. Each environment has its own set of objects, such as cloud services, infrastructure, server-side and frontend applications, and middleware. Resources in each group are isolated, which prevents resource conflicts across groups and improves security. Region: The region you select stores the workspace’s data and configuration, and is also the storage region for EntityStore. It is the same as the region of the underlying data source, so data stays local. WorkSpaceName: Globally unique, like ProjectName, and serves as the unique identifier in API calls. You can enter it when you create the workspace. If you leave it blank, the system generates one. You cannot change it after creation. WorkSpaceDisplayName: User-defined and editable. Duplicate names are allowed. Default workspace: The first workspace created in a region is always the default workspace, named `default-cms-{uid}-{regionId}`. Its linked SLS project name usually follows the same `default-cms-{uid}-{regionId}` pattern, but it may include extra random characters and does not always match this format exactly.
App	An app is a lightweight carrier for reading and writing data sources within a workspace. You can show or hide apps in the workspace. An app usually represents domain-specific observability knowledge for a particular scenario. Key traits: Apps are lightweight. You can enable or disable them—including whether they appear in the navigation pane on the left. Disabled apps can be re-enabled anytime. Different users (RAM users) can choose different apps. Apps are stateless. Every workspace includes all available apps. If a user lacks permissions for an app’s data source, the app shows no data when opened. An app can link to and access multiple data sources. Switching data sources reveals relationships with the IaaS layer. The integration flow in the Integration Center defines app initialization and data source creation. An app can embed another app. For example, Explorer and Alerting are embedded in many apps—and also run as standalone apps.
Entity	An entity is an observable object—such as a container cluster or an ECS instance.
Model (UModel)	UModel is a specification for defining observability data models. It defines models for logs, metrics, traces, entities, and their relationships, which enables unified definition and management of observability data.
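
The default workspace naming rule in the Workspace row can be illustrated with a short Python sketch. Only the `default-cms-{uid}-{regionId}` pattern comes from the table above. The helper names are hypothetical and not part of any Cloud Monitor SDK, and the assumptions that `uid` is numeric and that region IDs use lowercase letters, digits, and hyphens are made for illustration.

```python
import re

# Pattern from the Terms table: default-cms-{uid}-{regionId}.
# Assumption (not stated in the source): uid is numeric and regionId
# consists of lowercase letters, digits, and hyphens (e.g. "cn-hangzhou").
_DEFAULT_NAME = re.compile(r"^default-cms-(?P<uid>\d+)-(?P<region>[a-z0-9-]+)$")


def default_workspace_name(uid: str, region_id: str) -> str:
    """Build the default workspace name for an account and region."""
    return f"default-cms-{uid}-{region_id}"


def parse_default_workspace_name(name: str):
    """Return (uid, region_id) if the name matches the default pattern, else None."""
    match = _DEFAULT_NAME.match(name)
    if match is None:
        return None
    return match.group("uid"), match.group("region")
```

Because the linked SLS project name may carry extra random characters, a parser like this is only suitable for workspace names, not for SLS project names.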

Features

Features	Description
Full-stack data collection and monitoring	[Infrastructure monitoring] Cloud service monitoring: Collect real-time performance metrics, such as CPU, memory, disk, and network traffic, for ECS, RDS, SLB, container services (ACK/ASK), and Kubernetes pods. Network performance monitoring: Track network latency, packet loss rate, DNS resolution, and TCP/UDP connection status. [Application performance monitoring (APM)] Distributed tracing: Trace microservice call chains. Supports Java, Python, Go, and other major languages. Shows interface latency, error rate, and dependency topology. Code-level diagnostics: Locate application performance bottlenecks using thread profiling, slow SQL detection, and stack trace analysis. [Log monitoring] Log collection and storage: Collect logs from servers, containers, and Function Compute (FC). Compatible with Log4j, Logback, and other logging frameworks. Real-time log analysis: Run SQL queries, set keyword alerts, and use log clustering, for example to aggregate error logs.
Intelligent analysis and diagnostics	[Anomaly detection and alerting] Dynamic threshold alerting: Use machine learning to learn historical metric patterns and spot anomalies, such as sudden CPU spikes. Multi-condition alerting: Trigger alerts based on cross-metric conditions, such as “CPU > 90% AND packet loss rate > 5%”. [Root cause analysis (RCA)] Smart association analysis: Automatically link abnormal metrics, error logs, and trace data to generate root cause reports, for example, an API timeout that triggers a downstream service avalanche. Time-series data backtracking: Compare against historical data to quickly find when an anomaly started and its impact scope.
Visualization and reporting	[Custom monitoring dashboards] Drag-and-drop dashboard: Use line charts, column charts, topology graphs, and other visual components to flexibly display key metrics across resources and services. Scenario-based template library: Prebuilt templates for e-commerce sales promotions, container clusters, and database performance. Generate business-wide views with one click. [Business dashboards] Real-time data projection: Show core business metrics, such as order volume and payment success rate, in full-screen mode. Designed for O&M war rooms. Multi-tenant view isolation: Assign data viewing permissions by team or line of business to keep data secure.
Alerting and notification management	[Multi-channel alert delivery] Notification channels: Send alerts via DingTalk, WeCom, text message, email, and webhook. Supports time-based silence, such as notifying only on-call staff outside business hours. Alert escalation policy: Set tiered alerts, such as “notice → critical → fatal”. Escalate notifications automatically if no response arrives on time. [Alert lifecycle management] Alert history and statistics: Record alert status, such as confirmed or recovered, and generate MTTR (mean time to repair) reports. Integration with O&M tools: Trigger ticket systems (such as DingTalk Yida) or automated O&M scripts (such as restarting a service) when alerts fire.
Openness and integration	[Seamless ecosystem integration] Alibaba Cloud service integration: Deeply integrates with SLS, ARMS, and Apsara DevOps. Enables automatic data association, such as jumping from a log query directly to an abnormal trace. Third-party tool compatibility: Supports open-source protocols and tools such as Prometheus, OpenTelemetry, and Telegraf. Works with Grafana visualization and Jenkins continuous integration. [API and SDK support] OpenAPI management: Automate monitoring tasks, such as batch-creating alert rules or exporting monitoring data, using APIs. Custom metric reporting: Use SDKs to report business metrics, such as order volume or activity PV/UV, to extend monitoring coverage.
Security and high availability	[Data security] End-to-end encryption: Encrypt monitoring data in transit (HTTPS) and at rest (encrypted storage). Permission control: Use RAM roles for fine-grained permission management, such as read-only access or alert configuration permissions. Security and compliance: Meets international and domestic security standards to ensure safe and reliable data transmission and storage. [Service reliability] Global distributed collection points: Monitoring nodes cover major regions worldwide to avoid data loss due to network jitter. Data redundancy: Store monitoring data across multiple replicas to ensure recoverability.
Cost optimization	[Resource usage analysis] Idle resource identification: Automatically flag long-term low-load ECS instances and unbound EIPs, and generate release recommendations. Cost allocation reports: Track cloud resource consumption by project, department, or tag. Supports cost allocation and budget control. [Adaptive data sampling] Adjust collection frequency as needed: Lengthen the collection interval for non-critical metrics, such as from 1 minute to 5 minutes, to cut storage costs.
Cross-region unified management	Monitor and manage resources across multiple regions from one place. Simplify O&M workflows.
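
The multi-condition alerting example in the table above (“CPU > 90% AND packet loss rate > 5%”) can be sketched in plain Python. This is only an illustration of AND semantics across metrics, not how Cloud Monitor 2.0 evaluates rules. The names `Condition`, `should_alert`, `cpu_percent`, and `packet_loss_percent` are hypothetical.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use (illustrative subset).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Condition:
    metric: str       # hypothetical metric key in a sample
    op: str           # one of the keys in _OPS
    threshold: float

    def holds(self, sample: dict) -> bool:
        # A metric that is missing from the sample never satisfies the condition.
        value = sample.get(self.metric)
        return value is not None and _OPS[self.op](value, self.threshold)


def should_alert(conditions, sample: dict) -> bool:
    """AND semantics: fire only when every condition holds for the sample."""
    return all(c.holds(sample) for c in conditions)


# "CPU > 90% AND packet loss rate > 5%" expressed as two conditions.
RULE = [
    Condition("cpu_percent", ">", 90.0),
    Condition("packet_loss_percent", ">", 5.0),
]
```

With this rule, a sample with high CPU but normal packet loss does not fire, which is the point of combining conditions: it filters out spikes that a single-metric threshold would report.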

Scenarios

Scenario	Description	Advantages
Scenario 1: Unified full-stack monitoring and real-time observability graph	Enterprises must monitor physical servers, container clusters, microservice applications, and databases across hybrid cloud environments. Fragmented tools reduce O&M efficiency. Cloud Monitor 2.0 collects metrics (such as CPU and memory usage), traces (such as API call chains), logs (such as error logs), and events (such as configuration changes) in one place. It builds an end-to-end observability graph for global, cross-resource, cross-service visibility.	Multi-source data fusion: Supports over 50 data sources, from infrastructure and middleware to applications, to break down data silos. Visual dashboards: Customize views to show resource topology, service dependencies, and key performance indicators (KPIs). Cross-domain association analysis: Automatically link abnormal metrics with related logs and traces to identify root causes faster.
Scenario 2: Intelligent anomaly detection and failure prediction	During traffic spikes or in complex architectures, manually identifying hidden failures is difficult. Cloud Monitor 2.0 uses machine learning models to analyze historical data. It predicts risks—such as capacity bottlenecks and service latency—and triggers early warnings.	Root cause localization: Detect and compute in real time using metrics, traces, and profiling data. Covers latency, error rate, anomalies, and out-of-memory (OOM) cases. Impact scope analysis: Identify affected areas—such as end users, frontend applications, page requests, applications, interfaces, databases, and containers or ECS instances. Copilot self-service exploration: Use generative AI to retrieve detection reports and solutions on demand. Alert convergence: Converge alerts across products and instances. Group alerts with the same root cause to reduce noise.
Scenario 3: End-to-end full-stack tracing from client to server (APM)	In microservice architectures, a single user request may involve dozens of service calls and frontend or backend interactions. Pinpointing performance bottlenecks is challenging. Cloud Monitor 2.0 combines full-stack tracing with code-level diagnostics. It links upstream to user experience and downstream to infrastructure—building a full-stack observability graph to precisely analyze slow queries and deadlocks.	Full-stack observability graph: Covers services, interfaces, cloud service instances, and rich data—such as metrics, events, and metadata—with cross-domain entity associations. Associated data query and analysis: Upstream—retrieve real-time, dynamic upstream access endpoints and analyze business impact. Downstream—retrieve real-time, dynamic downstream dependencies—such as middleware, databases, and containers—with full monitoring data. Dynamic architecture awareness: Build a complete, global CMDB with auto-discovery and full topology views.
Scenario 4: Security compliance and threat insights	Enterprises need real-time monitoring of security events—such as abnormal logins and data breaches—and must meet compliance audit requirements. Cloud Monitor 2.0 uses real-time log analysis and behavior pattern recognition to detect threats quickly.	Real-time threat detection: Use rule engines and AI models to identify attacks—such as abnormal logins and SQL injection. Compliance audit reports: Automatically generate operation log reports to meet compliance needs—such as China’s classified protection scheme and GDPR. Automated response: Integrate with security groups or WAF to block high-risk IP addresses automatically.
Scenario 5: Resource optimization and cost management	Opaque cloud resource usage leads to waste. Cloud Monitor 2.0 analyzes resource utilization and recommends elastic scaling policies and idle resource release plans.	Utilization analysis: Flag low-load ECS instances and unmounted disks. Generate optimization checklists. Cost forecasting: Predict monthly bills based on historical spending trends and provide cost-saving tips. Automated elasticity: Scale Kubernetes clusters or Serverless services up or down based on traffic.
Scenario 6: Intelligent alerting and automated O&M	Traditional alerting often causes false positives or information overload. Cloud Monitor 2.0 uses alert noise reduction, dynamic thresholds, and tiered notifications to improve accuracy—and supports automated remediation actions.	Alert aggregation: Merge similar events to avoid duplicate notifications. Multi-channel delivery: Push alerts by severity—such as to DingTalk, email, or text message. Automated scripts: Trigger prebuilt scripts to restart services or isolate faulty nodes.
Scenario 7: Managed open-source observability components and intelligent O&M	Enterprises widely use open-source observability tools—such as Prometheus, Grafana, and OpenTelemetry—in hybrid or multicloud environments. But they face three challenges: High O&M complexity: Self-managed Prometheus clusters require managing data collection, storage, and alerting end to end. Deployment and scaling are costly. Data silos: OpenTelemetry trace data, Prometheus metrics, and Grafana dashboards reside in separate locations—lacking unified analysis. Lack of intelligence: Open-source tools rely on manual alert rule setup and root cause analysis. They struggle with the dynamism of AI-native architectures.	Cost savings and efficiency gains: The managed service eliminates 90% of O&M work. Resource utilization improves by 30%. Full-stack observability: Covers infrastructure (Prometheus metrics), application performance (OpenTelemetry traces), and user experience (Grafana visualization). Open and compatible: Integrates seamlessly with open-source ecosystems—such as Prometheus Operator and Grafana plugins—to fit enterprise hybrid cloud stacks.
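
The alert convergence and alert aggregation ideas in Scenarios 2 and 6 (grouping alerts that share a root cause so that one notification replaces many) can be sketched as a simple grouping step. This is a minimal illustration, not Cloud Monitor 2.0's actual algorithm. The alert fields `source`, `root_cause`, and `severity` are hypothetical, and the sketch assumes the root cause has already been identified upstream.

```python
from collections import defaultdict


def aggregate_alerts(alerts, key_fields=("root_cause", "severity")):
    """Merge alerts that share the same values for key_fields.

    Returns one summary dict per group, in first-seen order, with the
    number of merged alerts and the sorted list of distinct sources.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(field) for field in key_fields)
        groups[key].append(alert)

    summaries = []
    for key, items in groups.items():
        summary = dict(zip(key_fields, key))
        summary["count"] = len(items)
        summary["sources"] = sorted({a.get("source", "unknown") for a in items})
        summaries.append(summary)
    return summaries


# Three raw alerts, two of which share a root cause.
raw_alerts = [
    {"source": "ecs-1", "root_cause": "db-timeout", "severity": "critical"},
    {"source": "api-gw", "root_cause": "db-timeout", "severity": "critical"},
    {"source": "ecs-2", "root_cause": "disk-full", "severity": "notice"},
]
merged = aggregate_alerts(raw_alerts)
```

Here the three raw alerts collapse into two notifications. The choice of grouping key determines how aggressively noise is reduced: a coarser key merges more alerts but hides more detail.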

Observability apps

App type	App name	Description
Persistent	Alert Center	Manage all alert information in one place
Persistent	All Features	Manage all apps and related services in one place
Persistent	Integration Center	Integrate and manage observability objects and data
Persistent	Entity Explorer	Explore the status and performance of monitored objects
Persistent	Cloud Service Monitoring	Query and alert on basic monitoring metrics for Alibaba Cloud services
Application observability	Application Monitoring	Monitor application performance and diagnose faults in real time
Application observability	Real User Monitoring	Monitor web, mobile apps, and mini programs
Application observability	AI Application Observability	Deliver full-stack, integrated observability for AI applications
O&M monitoring	Prometheus Service	A fully managed Prometheus cloud service for high-performance monitoring
O&M monitoring	Incident Response	Group alert events into incidents and manage them
O&M monitoring	Synthetic Monitoring	Simulate user requests to proactively monitor network quality, service availability, and user experience
O&M monitoring	Database Observability	Deliver one-stop observability for database services
O&M monitoring	Log Audit	Record and review operation logs
Cloud service insights	PAI Insights	Deliver full-stack, one-stop observability for Platform for AI (PAI)
Cloud service insights	Container Insights	Analyze the operational status of Kubernetes clusters in depth
Cloud service insights	ECS Insights	Advanced monitoring for Elastic Compute Service (ECS)
Intelligent exploration and analysis	UModel Explorer	Entity and UModel debugging tool
Intelligent exploration and analysis	Data Explorer	Explore and analyze monitoring metrics and data
Intelligent exploration and analysis	Event Hub	Manage all types of event information in one place
Intelligent exploration and analysis	Dashboard	Dashboard showing key metrics
Intelligent exploration and analysis	Log Explorer	Provide log data exploration and analysis services