What is STAROps

STAROps is an intelligent O&M platform built on large language models (LLMs) and AI agent technologies. It combines cross-domain observability data with LLM reasoning: you define operational goals in natural language, and AI agents autonomously handle planning, execution, and verification. The result is 24/7 autonomous operations that protect business resilience in real time.

What does STAR stand for?

Each letter in STAR represents a core design principle of the platform:

S — Sense: holistic observability. STAROps breaks down data silos by collecting logs, metrics, traces, and topology data through a unified model (UModel). It builds a real-time digital twin of your system, capturing state changes at microsecond granularity and providing accurate, real-time context for AI-driven decisions.
T — Target: goal-oriented operations. Instead of reacting to alerts, STAROps focuses O&M on achieving goals. Define your business and operational objectives in natural language. The platform continuously evaluates deviation from targets across multiple dimensions and pinpoints factors that affect outcomes.
A — Autonomy: autonomous operations. Multi-agent coordination drives a full sense-decide-execute-verify loop without repeated human intervention, enabling 24/7 autonomous operations. High-risk actions retain Human-in-the-Loop (HIL) approval, balancing autonomy with safety.
R — Resilience: business resilience. STAROps shifts from reactive firefighting to proactive protection. Continuous inspection catches risks early. When incidents occur, the platform automatically executes remediation — scaling, rollback, or traffic shifting — and verifies the results, significantly reducing Mean Time to Recovery (MTTR). The platform captures operational experience in a knowledge graph, enabling resilience capabilities to grow over time.
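
The sense-decide-execute-verify loop and the HIL gate described above can be illustrated with a minimal Python sketch. This is not the STAROps implementation or API. `Target`, `Action`, `deviation`, `run_loop`, and the `p99_latency_ms` metric are all hypothetical names chosen for illustration, and the "system" is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Target:
    """Hypothetical operational goal: keep `metric` at or below `threshold`."""
    metric: str
    threshold: float


@dataclass
class Action:
    """Hypothetical remediation step acting on a simulated system state."""
    name: str
    high_risk: bool
    apply: Callable[[dict], None]


def deviation(state: dict, target: Target) -> float:
    """Positive when the observed metric exceeds the target threshold."""
    return state[target.metric] - target.threshold


def run_loop(state, target, plan, approve, max_rounds=5):
    """Sense -> decide -> execute -> verify until the target is met."""
    log = []
    for action in plan[:max_rounds]:
        if deviation(state, target) <= 0:             # sense: target met, stop
            break
        if action.high_risk and not approve(action):  # HIL gate for risky steps
            log.append(f"skipped {action.name} (not approved)")
            continue
        before = deviation(state, target)
        action.apply(state)                           # execute
        after = deviation(state, target)              # verify the effect
        log.append(f"{action.name}: deviation {before:.0f} -> {after:.0f}")
    return log


# Simulated incident: p99 latency is 900 ms against a 300 ms target.
state = {"p99_latency_ms": 900.0}
target = Target("p99_latency_ms", 300.0)
plan = [
    Action("scale_out", False,
           lambda s: s.update(p99_latency_ms=s["p99_latency_ms"] - 400)),
    Action("rollback", True,
           lambda s: s.update(p99_latency_ms=250.0)),
]
log = run_loop(state, target, plan, approve=lambda action: True)
print(log)
```

In this sketch the low-risk `scale_out` step runs without approval, while `rollback` runs only if the `approve` callback (standing in for a human reviewer) returns `True`.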

Core Capabilities

STAROps provides three core capabilities:

AI Chat

Use natural language to analyze alerts, query data, interpret metrics, and investigate logs instead of running command-line operations by hand. You can start sessions from multiple portals, including STAROps, CloudMonitor, and Log Service.

Missions

Missions enable long-running, asynchronous O&M plans that span days or months, triggered by schedules or events. They convert repetitive manual interventions into reliable automated processes. A built-in Human-in-the-Loop (HIL) mechanism ensures high-risk operations require explicit approval.

Digital Employees

SRE Agent, the digital employee of STAROps, is the platform's intelligent executor. You can build enterprise-specific SRE agents with custom responsibilities, permissions, tools, and skills tailored to your business scenarios, which reduces customization costs and improves R&D and operations productivity. Each digital employee serves as both a conversational assistant and a long-running task executor.

Core Strengths

Benefit	Description
Unified data platform	A unified observability data store for logs, topologies, metrics, and traces. Supports petabyte-level daily ingestion, exabyte-level storage, and analysis of hundreds of billions of records within seconds. Multi-zone deployment delivers 99.99% reliability.
Operations digital twin	Built on UModel, the digital twin uniformly models applications, services, resources, topologies, alerts, and change relationships. It supports custom extensions, real-time topology inference, and causal analysis.
Streaming data analysis operators	Provides general-purpose and AI-powered analysis operators that cover metric anomaly detection, log clustering, trace analysis, performance profiling, and change backtracking. These operators improve root cause analysis (RCA) effectiveness while reducing inference costs.
Flexible integration	Provides multiple integration options, such as OpenAPI, page integration, and IM integration (DingTalk and Feishu), so that STAROps fits into existing workflows.
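
As an illustration of what a streaming metric anomaly detection operator does, the following minimal Python sketch flags values that fall far from a rolling baseline. It uses a plain rolling z-score, which is a swapped-in textbook technique and not the algorithm STAROps uses. The class name `ZScoreDetector` and its parameters are assumptions made for this example.

```python
from collections import deque
from statistics import mean, pstdev


class ZScoreDetector:
    """Hypothetical streaming detector based on a rolling z-score."""

    def __init__(self, window=30, threshold=3.0, min_points=5):
        self.values = deque(maxlen=window)  # recent "normal" observations
        self.threshold = threshold
        self.min_points = min_points

    def observe(self, x: float) -> bool:
        """Return True if x is anomalous relative to the recent window."""
        anomalous = False
        if len(self.values) >= self.min_points:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                anomalous = abs(x - mu) / sigma > self.threshold
            else:
                anomalous = x != mu  # flat baseline: any change is unusual
        if not anomalous:
            self.values.append(x)    # keep outliers out of the baseline
        return anomalous


detector = ZScoreDetector()
stream = [100, 102, 98, 101, 99, 100, 500, 101]
flags = [detector.observe(v) for v in stream]
print(flags)
```

Processing one point at a time with a bounded window keeps memory constant, which is the property that makes this kind of operator suitable for streams. Running a cheap statistical filter like this before invoking an LLM is one plausible way to reduce inference costs.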

Security and Compliance Assurance

Fine-grained authorization policy: Hierarchical RAM role authorization for operators and digital employees manages what people can do and what agents can access, enforcing least-privilege access and reducing the risk of unauthorized operations.
Manual intervention: Tools connected via MCP can be configured with Human-in-the-Loop (HIL) approval. High-risk write operations require manual confirmation, and a blocking engine intercepts abnormal operations to prevent misoperations and malicious behavior.
Agent behavior audit: Retains complete records of conversation history, runtime artifacts, tool calls, CLI commands, and data access. Full-lifecycle agent behavior becomes traceable, repeatable audit evidence for compliance and security reviews.
End-to-end data encryption: All data in transit is encrypted via HTTPS/TLS. Observability data is encrypted at rest using KMS, and agent artifacts are likewise protected — ensuring data privacy and integrity throughout the pipeline.
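
The combination of least-privilege authorization and agent behavior auditing can be sketched in Python as follows. This is an illustrative model only. It does not use the RAM or STAROps APIs, and `AgentRole`, `AuditLog`, `call_tool`, and the tool names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentRole:
    """Hypothetical role: the set of tools an agent is allowed to call."""
    name: str
    allowed_tools: frozenset


@dataclass
class AuditLog:
    """Hypothetical in-memory audit trail of every tool-call attempt."""
    records: list = field(default_factory=list)

    def record(self, agent, tool, args, outcome):
        self.records.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "tool": tool,
            "args": args,
            "outcome": outcome,
        })


def call_tool(role, tool, args, tools, audit):
    """Run a tool only if the role allows it, and audit every attempt."""
    if tool not in role.allowed_tools:
        audit.record(role.name, tool, args, "denied")
        raise PermissionError(f"{role.name} may not call {tool}")
    result = tools[tool](**args)
    audit.record(role.name, tool, args, "ok")
    return result


# Simulated tools; a real deployment would call actual O&M tooling instead.
tools = {
    "query_logs": lambda service: f"logs for {service}",
    "restart_pod": lambda pod: f"restarted {pod}",
}
inspector = AgentRole("inspector", frozenset({"query_logs"}))
audit = AuditLog()

call_tool(inspector, "query_logs", {"service": "checkout"}, tools, audit)
try:
    call_tool(inspector, "restart_pod", {"pod": "checkout-0"}, tools, audit)
except PermissionError as err:
    print(err)

print([(r["tool"], r["outcome"]) for r in audit.records])
```

Recording denied attempts as well as successful ones is a deliberate choice in this sketch: a reviewer can then see what an agent tried to do, not only what it was permitted to do.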

Typical Scenarios

Scheduled intelligent inspection of Kubernetes clusters: automatically checks cluster health daily, generates structured reports, and compares them with historical results.
Core service high-availability assurance: continuously monitors core services and automatically performs root cause analysis (RCA) when an alert is triggered.
Natural language-driven fault diagnosis: narrows down the troubleshooting scope through multi-turn dialogue and performs correlation analysis using UModel topology.
Regular data quality checks: monitors data pipeline health on schedule and automatically notifies you of exceptions.
Automated O&M report generation: aggregates operational data weekly or monthly and generates structured reports.
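
The "compare with historical results" step in the inspection scenario can be pictured with a small Python sketch. The report shape (a dictionary from check name to status) and the function `diff_reports` are assumptions made for illustration, not the STAROps report format.

```python
def diff_reports(previous: dict, current: dict) -> dict:
    """Return checks whose status changed, as {check: (old, new)}.

    A value of None means the check was absent from that report.
    """
    changes = {}
    for check in sorted(previous.keys() | current.keys()):
        old, new = previous.get(check), current.get(check)
        if old != new:
            changes[check] = (old, new)
    return changes


# Hypothetical daily inspection results for a Kubernetes cluster.
yesterday = {"node_ready": "pass", "disk_pressure": "pass", "pending_pods": "pass"}
today = {"node_ready": "pass", "disk_pressure": "fail", "etcd_latency": "warn"}

changes = diff_reports(yesterday, today)
for check, (old, new) in changes.items():
    print(f"{check}: {old} -> {new}")
```

Reporting only the differences keeps a daily report short: unchanged checks such as `node_ready` are omitted, while regressions, new checks, and checks that disappeared are listed.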