
What Does Alibaba Cloud's Agent Infra Look Like

This article introduces Alibaba Cloud's Agent Infra, a comprehensive product matrix unveiled at the 2026 Summit to address challenges across the full agent lifecycle.

Like the Harness engineering story, Agent Infra is also a buzzword that practitioners keep bringing up. But in practice, there is still no absolute consensus on what Agent Infra should include.

Today, at the 2026 Alibaba Cloud Summit, Feifei Li, CTO of Alibaba Cloud and President of International Business, shared what Alibaba Cloud's Agent Infra looks like. It comprises six foundational infrastructure capabilities: Agent runtime, Agent orchestration, Agent governance, Agent memory, Agent data plane, and Agent security. Together they address six major challenges of agents: irregular bursty loads, large-scale dynamic orchestration, short lifecycles, complex data modalities and storage forms, dynamic environmental dependencies, and task-level security control.

At the Agent Native Infrastructure Subforum, Guoqiang Li, product lead for Alibaba Cloud Intelligent Cloud-Native Application Platform, shared the team's complete thinking and product practices in the field of agent engineering. He explained how a single Agent Infra can cover the full lifecycle of agent development, runtime, governance, operations, and optimization, from building and deployment to scaled operations. Here are the core takeaways from this session.

The Five Major Pain Points Enterprises Face When Building Agents

Today, enterprises are more eager than ever to put agents into production. Gartner predicts that by the end of 2026, 70% of enterprises will be running AI agents in production, and 40% of enterprise applications will embed agents to achieve new business growth, up from less than 5% at the beginning of 2025. But in this rush to move quickly, the engineering challenges are reaching the deep end.


First, agent architectures have many dependencies, so how can they be built and deployed quickly? There are many development frameworks and dependencies, and the runtime environment demands extremely high levels of isolation and elasticity. From a local IDE to production release, the path crosses multiple layers, including sandboxes, runtimes, model access, and credential management. The more steps there are, the longer the rollout cycle becomes.

Second, how can multiple agents be governed and collaborate? Multi-agent systems have become a trend for enterprise adoption. But if multiple agents each operate independently and communicate through black boxes, how can unified governance and control be achieved? How can humans and agents, and agents and agents, collaborate efficiently instead of working in silos?

Third, how can operational status be understood so costs remain under control? Agents are highly elastic, depend on many components, and have long call chains. Once token consumption gets out of control, costs become a black hole. Enterprises need to understand the runtime state of agents in real time from both operations and business perspectives.

Fourth, evaluation is difficult, so how can continuous optimization be achieved? Agent effectiveness is the key lifeline, but the runtime process is a black box, and traditional testing methods fall short. How to build an evaluation system and drive agents to evolve autonomously is a question facing every team.

Fifth, in complex architectures, operational issues are discovered slowly and are hard to fix. Agents further increase system complexity, and traditional SRE approaches are no longer sufficient. Intelligent methods are needed to ensure the continuity of emerging intelligent services.

These five pain points point to different stages of the agent lifecycle. Alibaba Cloud's answer is a complete Agent Infra product matrix.

Agent Infra Product Panorama: Five Platforms Cover the Full Lifecycle

The design logic of Alibaba Cloud's Agent Infra is to let enterprises "focus on outcomes and leave engineering to infrastructure." Centered on the five stages of agent development, runtime, governance, operations, and optimization, five core products each serve their own role:


  • AgentRun: A one-stop intelligent agent development and build platform
  • AgentTeams: A multi-agent governance and collaboration platform
  • AgentLoop - Observability: Full-stack observability for agents
  • AgentLoop - Evaluation and Optimization: Continuous agent optimization
  • STAROps: An all-domain intelligent operations platform

Next, let's break them down one by one in the order of the agent lifecycle, from building to operations.

AgentRun: High-Code-Centric One-Stop Agentic AI Infrastructure

Built on Function Compute, AgentRun is a one-stop Agentic AI infrastructure platform centered on high-code, with an open ecosystem and flexible composition, providing enterprise agents with full-lifecycle management for development, debugging, deployment, and operations.


Its core design philosophy is a dual-track approach: "high-code for flexible customization + low-code for rapid validation":

  • On the high-code side, AgentRun provides complete modules such as runtime, sandbox, observability, agent evaluation, memory and knowledge bases (Context Engineering), model connectors, and credential and security management, allowing developers to flexibly assemble them based on business needs.
  • On the low-code side, it is compatible with Alibaba Cloud Bailian, ModelScope, and other no-code/low-code platforms, as well as the MCP protocol and SDKs, delivering out-of-the-box rapid validation capabilities. At the same time, through the AI gateway Higress, it unifies access to open-source models and fine-tuned models (on PAI & FC & ACS), connecting the model inference path.

In one sentence: AgentRun is the engineering foundation that takes agents from "can run" to "run well."

AgentTeams: Letting AI Agents Form Real Teams

If AgentRun solves the question of "how to build a single agent," AgentTeams answers the question of "how multiple agents collaborate." This is a product upgrade from microservice governance to multi-agent governance.


2024-2025 was the trial period for single agents, 2025-2026 is the departmental multi-agent pilot phase, and 2026-2027 will usher in large-scale enterprise deployment. The new problems enterprises face are: agents scattered across departments with no unified governance view, black-box communication between agents that humans cannot effectively supervise, agents directly holding credentials that create security risks, and unmonitored token consumption leading to uncontrollable costs.

AgentTeams is a one-stop enterprise multi-agent governance and collaboration platform focused on four core needs: unified governance (multi-source agent onboarding without being tied to a single vendor), collaboration orchestration (Leader-Worker, human in the loop), security and compliance (enterprise SSO integration, full-chain auditing), and controllable cost (usage-based billing, token monitoring and limits).

  • Unified governance for multi-source agents: all agent communication is based on the Matrix protocol, achieving protocol-level decoupling. One Team can mix heterogeneous agents such as OpenClaw, QwenPaw, Claude Code, and in-house agents, eliminating framework lock-in.
  • Leader-Worker collaboration orchestration: collaboration is built on the Leader-Worker architecture. The Leader agent is responsible for intent understanding, task decomposition, and progress monitoring, while Worker agents execute assigned work. The Human-in-the-Loop design keeps the process 100% visible, letting users follow the communication between agents like reading a work group chat and intervene to correct course at any time.
  • Native IM integration: built-in Matrix-native IM, while also integrating mainstream enterprise IM tools such as DingTalk, Feishu, and WeCom. Employees can initiate tasks, supervise in real time, and approve interventions in the chat window they already know; agent execution results are sent back to IM, enabling collaboration like a "digital coworker."
  • Agent asset management (AI Registry): unified registration of Skills, MCP Servers, Agents, and Team templates, with Team-based assignment, versioned management, security review, and hot loading in runtime. REST to MCP supports zero-code conversion, so existing business systems can connect without modification.
  • Enterprise-grade security governance: a zero-trust architecture is used, with agents not holding credentials and gateways centrally controlling access. Identity permissions, cost metering, audit compliance, and data security span the entire agent lifecycle, meeting compliance requirements in finance, healthcare, manufacturing, and other industries.
  • Full-chain observability: based on OpenTelemetry Trace, it provides end-to-end tracing from user requests to model calls and tool execution. Token costs are analyzed by Team, Agent, and model dimensions, and the data works with AgentLoop to drive continuous agent evolution.
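
The Leader-Worker pattern with a human-in-the-loop gate, as described in the list above, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AgentTeams API: `Leader`, `Worker`, `approve`, and the `transcript` list are hypothetical names, and a plain Python list stands in for the chat room that humans would read.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical sketch: none of these names come from the AgentTeams product.

@dataclass
class Worker:
    name: str
    handle: Callable[[str], str]  # executes one subtask and returns a result

@dataclass
class Leader:
    workers: Dict[str, Worker]
    approve: Callable[[str, str], bool]  # human-in-the-loop gate per subtask
    transcript: List[str] = field(default_factory=list)  # readable like a group chat

    def decompose(self, task: str) -> List[Tuple[str, str]]:
        # Toy decomposition of "worker: subtask; worker: subtask"
        plan = []
        for part in task.split(";"):
            worker, subtask = part.split(":", 1)
            plan.append((worker.strip(), subtask.strip()))
        return plan

    def run(self, task: str) -> List[str]:
        results = []
        for worker_name, subtask in self.decompose(task):
            self.transcript.append(f"leader -> {worker_name}: {subtask}")
            if not self.approve(worker_name, subtask):
                # The human reviewer stopped this step; record it and move on.
                self.transcript.append(f"human: rejected '{subtask}'")
                continue
            result = self.workers[worker_name].handle(subtask)
            self.transcript.append(f"{worker_name} -> leader: {result}")
            results.append(result)
        return results

workers = {
    "researcher": Worker("researcher", lambda s: f"notes on {s}"),
    "writer": Worker("writer", lambda s: f"draft of {s}"),
}
# Assumed review rule: the human blocks any subtask that mentions "delete".
leader = Leader(workers, approve=lambda w, s: "delete" not in s)
results = leader.run("researcher: pricing; writer: summary; writer: delete archive")
```

Every message, including the rejected step, lands in `leader.transcript`, which is the property the "work group chat" analogy relies on: supervision works because nothing is exchanged outside the shared log.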

AgentTeams covers four core scenarios:

  • Enterprise digital employees: users initiate tasks through enterprise IM, and AgentTeams schedules Agent Teams by department for execution with full audit traceability.
  • Agent Team service enablement: administrators create Team pools configured by role, and business teams apply for access as needed through RBAC, with independent quotas and billing.
  • SaaS Agent Team enablement: SaaS vendors allocate independent Agent Teams to different tenants and control accessible Skills/MCP through permission policies, with isolated data and calls.
  • Governance of existing agents: heterogeneous agents already deployed and running can be centrally governed and orchestrated without modification, and assets can be unified, accumulated, and reused.
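
To make the tenant-isolation scenario concrete, here is a minimal sketch of a central gateway that checks a per-tenant permission policy and a token quota before any Skill or MCP call goes through. All names (`TeamPolicy`, `PolicyGateway`, `authorize`, the tenant and tool identifiers) are hypothetical and chosen for illustration; the article does not describe the actual policy format.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical sketch of per-tenant permission policies with token quotas.

@dataclass
class TeamPolicy:
    allowed_tools: FrozenSet[str]  # Skills / MCP servers this tenant's Team may call
    token_quota: int               # assumed hard cap for the billing period
    tokens_used: int = 0

class PolicyGateway:
    """Central checkpoint: agents only get allow/deny, never credentials."""

    def __init__(self, policies: Dict[str, TeamPolicy]):
        self.policies = policies

    def authorize(self, tenant: str, tool: str, est_tokens: int) -> bool:
        policy = self.policies.get(tenant)
        if policy is None or tool not in policy.allowed_tools:
            return False  # unknown tenant or tool outside the policy
        if policy.tokens_used + est_tokens > policy.token_quota:
            return False  # the call would exceed the tenant's quota
        policy.tokens_used += est_tokens  # meter usage per tenant
        return True

gateway = PolicyGateway({
    "tenant-a": TeamPolicy(frozenset({"crm-mcp", "report-skill"}), token_quota=1000),
    "tenant-b": TeamPolicy(frozenset({"report-skill"}), token_quota=500),
})

first_call = gateway.authorize("tenant-a", "crm-mcp", 400)      # allowed
cross_tenant = gateway.authorize("tenant-b", "crm-mcp", 10)     # tool not in policy
over_quota = gateway.authorize("tenant-a", "crm-mcp", 700)      # 400 + 700 > 1000
```

Keeping the check in one gateway rather than in each agent is what lets quotas and permissions be enforced uniformly across heterogeneous agents.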

The management layer of AgentTeams is implemented based on the open-source project HiClaw (an open-source framework for multi-agent governance and collaboration), while the agent core is Alibaba Cloud's self-developed agent engine QwenPaw, balancing flexibility and out-of-the-box usability. AgentTeams is currently in invite-only testing.

AgentLoop: Data Flywheel Drives Continuous Agent Evolution

Effectiveness is the lifeline of agents. But unlike traditional applications, whether an agent is "good to use" cannot be judged by a single launch; it needs a continuously running data flywheel to drive evolution. AgentLoop was created for exactly this purpose: an Agent full-lifecycle observability and data flywheel platform covering the two areas of observability and evaluation & optimization.


AgentLoop - Observability

The design goal of AgentLoop observability is "zero-modification access, full-chain visibility."

At the integration layer, AgentLoop supports multiple collection methods, including self-developed probes, OpenTelemetry SDK, and OTel eBPF, and is compatible with mainstream agent frameworks and platforms such as QwenPaw, HiClaw, Dify, Hermes-Agent, Coze, AgentScope, Alibaba Cloud Bailian applications, AgentRun, LangChain/LangGraph, and OpenAI, enabling out-of-the-box, non-intrusive integration.

At the analysis layer, AgentLoop provides multi-dimensional performance profiling and intelligent anomaly diagnosis, covering latency distribution, call hotspots, and token cost attribution, turning the "black-box agent" into a "transparent agent."
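
As a rough illustration of token cost attribution, the sketch below groups token usage from trace-like records by arbitrary dimensions. The span fields, model names, and prices are assumptions made for the example, not AgentLoop's data schema or real pricing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical prices per 1,000 tokens; real pricing will differ.
PRICE_PER_1K_TOKENS = {"model-small": 0.5, "model-large": 2.0}

def attribute_cost(spans: Iterable[dict], dims: Tuple[str, ...]) -> Dict[tuple, float]:
    """Sum token cost grouped by the given dimensions (e.g. team, agent, model)."""
    totals: Dict[tuple, float] = defaultdict(float)
    for span in spans:
        tokens = span["input_tokens"] + span["output_tokens"]
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[span["model"]]
        totals[tuple(span[d] for d in dims)] += cost
    return dict(totals)

# Assumed span records, one per model call.
spans = [
    {"team": "sales", "agent": "leader", "model": "model-large",
     "input_tokens": 1500, "output_tokens": 500},
    {"team": "sales", "agent": "worker", "model": "model-small",
     "input_tokens": 3000, "output_tokens": 1000},
    {"team": "ops", "agent": "worker", "model": "model-small",
     "input_tokens": 1000, "output_tokens": 1000},
]

by_team = attribute_cost(spans, ("team",))
by_team_model = attribute_cost(spans, ("team", "model"))
```

Because the grouping key is just a tuple of field names, the same function answers "which team spends the most" and "which model within a team dominates cost" without separate code paths.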

AgentLoop - Evaluation and Optimization

Observability finds problems; evaluation and optimization solve them. AgentLoop builds a complete data flywheel of Collect → Analyze → Evaluate → Optimize:

  • Collect: Non-intrusively capture full-chain agent interaction data, covering inputs, outputs, and every intermediate reasoning step.
  • Analyze: Perform multi-dimensional performance analysis on the collected data, intelligently locating bottlenecks and anomalous behavior.
  • Evaluate: Automatically score quality and quantify agent performance. Supports the Agent-as-a-Judge mode for more precise evaluation.
  • Optimize: Experiment-driven iteration, with data supporting every improvement. Supports intelligent tuning and autonomous evolution, including prompt optimization and skill iteration.
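
The four flywheel stages above can be sketched as a small experiment loop. This is a toy under stated assumptions: `toy_agent` and `toy_judge` are hypothetical stand-ins for a real agent and an Agent-as-a-Judge scorer, and the "optimization" is simply choosing the best-scoring prompt among candidates.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical flywheel sketch; agent, judge, and prompts are toy stand-ins.

def collect(agent: Callable[[str, str], str], prompt: str, inputs: List[str]) -> List[dict]:
    """Collect: record every interaction as an input/output trace."""
    return [{"input": x, "output": agent(prompt, x)} for x in inputs]

def analyze(traces: List[dict], judge: Callable[[dict], float], threshold: float = 0.5) -> List[dict]:
    """Analyze: keep the traces the judge scores below the threshold."""
    return [t for t in traces if judge(t) < threshold]

def evaluate(traces: List[dict], judge: Callable[[dict], float]) -> float:
    """Evaluate: score each trace and average the scores."""
    return mean([judge(t) for t in traces])

def optimize(agent, judge, candidates: List[str], inputs: List[str]) -> Dict[str, float]:
    """Optimize: run each candidate prompt as an experiment and score it."""
    return {p: evaluate(collect(agent, p, inputs), judge) for p in candidates}

def toy_agent(prompt: str, text: str) -> str:
    # Answers in upper case only when the prompt asks for it.
    return text.upper() if "upper" in prompt else text

def toy_judge(trace: dict) -> float:
    # Rewards fully upper-case outputs.
    return 1.0 if trace["output"].isupper() else 0.0

inputs = ["hi", "ok"]
failures = analyze(collect(toy_agent, "answer plainly", inputs), toy_judge)
scores = optimize(toy_agent, toy_judge, ["answer plainly", "answer in upper case"], inputs)
best_prompt = max(scores, key=scores.get)
```

The point of the structure is that every change to the prompt is backed by a score over the same collected inputs, so an improvement is measured rather than assumed.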

AgentLoop's evaluation capabilities also include continuous dataset construction and accumulation. Observability data is not just viewed and then discarded; it is accumulated into reusable evaluation datasets, making every online interaction fuel for optimizing agents. This flywheel gives agents true "continuous acceleration": the more they are used, the better they run. AgentLoop - Evaluation and Optimization is expected to enter public beta in June.

STAROps: An All-Domain Intelligent Operations Platform

Large-scale deployment of agents inevitably increases system complexity. When call chains span multiple layers of models, tools, middleware, and infrastructure, traditional manual operations are no longer enough. STAROps is Alibaba Cloud's all-domain intelligent operations platform, combining large-model capabilities with observability data to autonomously complete the full closed loop of sensing, decision-making, execution, and verification. Centered on Sense, Target, Autonomy, and Resilience, STAROps shifts the operations model from passive response to active autonomy, providing enterprises with 24/7 uninterrupted autonomous operations capabilities.


To achieve this goal, STAROps provides three core capabilities.

  • First is the intelligent assistant: STAROps directly converts natural language into unified query and analysis results for cross-domain observability data, with alert analysis, data queries, metric interpretation, and log diagnosis all completed in one chat window.
  • Second is the long-running task mechanism: STAROps turns operations from "a person watching the system run" into "agents continuously running on behalf of people." After a one-time goal alignment, subsequent inspections, alert analysis, anomaly handling, and verification all execute autonomously.
  • Third is the digital employee: enterprises can build dedicated SRE agents for each team and business scenario, customize scope of responsibility, permission boundaries, and skill sets, and turn the operations standards, response playbooks, and troubleshooting experience accumulated by the team into configurable "digital employees."
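
The closed loop of sensing, decision-making, execution, and verification behind the long-running task mechanism can be sketched as below. The function and its parameters (`run_closed_loop`, `sense`, `execute`, `max_rounds`) are hypothetical, and the "system" is a toy dictionary; this illustrates the loop shape only, not how STAROps implements it.

```python
from typing import Callable, List

# Hypothetical closed loop; the sense/execute callables stand in for real tooling.

def run_closed_loop(
    sense: Callable[[], float],
    threshold: float,
    execute: Callable[[], None],
    max_rounds: int = 3,
) -> List[str]:
    """Sense, decide, execute, then verify by sensing again on the next round."""
    log = []
    for round_no in range(1, max_rounds + 1):
        value = sense()                    # sense: read the current metric
        if value <= threshold:             # decide: healthy, goal reached
            log.append(f"round {round_no}: healthy ({value})")
            return log
        log.append(f"round {round_no}: anomaly ({value}), remediating")
        execute()                          # execute: apply the remediation
        # verify: the next iteration re-reads the metric
    log.append("escalate to human")        # budget exhausted without recovery
    return log

# Toy system: the error rate halves each time the remediation runs.
state = {"error_rate": 8.0}

def remediate() -> None:
    state["error_rate"] /= 2

log = run_closed_loop(lambda: state["error_rate"], threshold=2.0, execute=remediate)
```

The bounded `max_rounds` with an explicit escalation step is a deliberate choice in this sketch: autonomous remediation that cannot reach its goal should hand back to a person rather than loop indefinitely.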

Core technical advantages:

STAROps performs unified modeling of all-domain data. Using unified observability data as the foundation, its self-developed UModel unifies logs, metrics, traces, events, topology, and other data into a single model, building an operations graph specific to the customer's system. During AI analysis, it automatically perceives service clusters, dependent components, and call relationships, enabling end-to-end tracing from the business layer to the infrastructure layer. It also supports custom extensions by business scenario, enabling real-time topology reasoning and automatic correlation of fault causality.
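
One way to picture topology-aware fault correlation is a dependency graph walk that keeps only the deepest alerting entities. The sketch below is an assumption-laden toy: the `DEPENDS_ON` graph, the service names, and the heuristic itself are invented for illustration and do not represent UModel's schema or STAROps' reasoning.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical topology: edges point from an entity to what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout-api": ["order-svc", "auth-svc"],
    "order-svc": ["mysql", "redis"],
    "auth-svc": ["redis"],
    "mysql": [],
    "redis": [],
}

def root_cause_candidates(entry: str, alerting: Set[str]) -> List[str]:
    """Walk dependencies breadth-first and keep alerting entities whose own
    dependencies are all healthy (the deepest alerting nodes)."""
    seen, queue, candidates = {entry}, deque([entry]), []
    while queue:
        node = queue.popleft()
        deps = DEPENDS_ON.get(node, [])
        if node in alerting and not any(d in alerting for d in deps):
            candidates.append(node)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates

candidates = root_cause_candidates(
    "checkout-api", alerting={"checkout-api", "order-svc", "redis"}
)
```

Here the alerts on `checkout-api` and `order-svc` are treated as symptoms because something they depend on is also alerting, which leaves `redis` as the only candidate.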

At the data analysis layer, the platform includes general-purpose operators and observability AI operators, covering typical scenarios such as metric anomaly detection, log clustering, trace analysis, performance profiling, and change tracing, shortening the time needed to identify and handle root causes. At the same time, through algorithm light-weighting and optimized compute strategies, it significantly reduces the resource overhead of model inference.
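
As a simple example of what a metric anomaly detection operator might do, the sketch below flags points that deviate strongly from a rolling window using a z-score. The article does not say which algorithms STAROps uses; the technique, the function name `zscore_anomalies`, and the sample latencies are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List

def zscore_anomalies(values: List[float], window: int = 5, z: float = 3.0) -> List[int]:
    """Return indexes whose value deviates more than z standard deviations
    from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z:
            anomalies.append(i)
    return anomalies

# Assumed latency samples in milliseconds with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
spikes = zscore_anomalies(latency_ms)
```

A known limitation of this simple form is that a spike inflates the window's standard deviation for the next few points, so production detectors usually exclude flagged points from the history or use more robust statistics.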

In addition, STAROps has built a fault simulation system that is close to the production environment, connecting the loop of "fault injection - data collection - intelligent diagnosis - automatic repair." Combined with online status and offline simulation, it continuously iterates analysis models and operations strategies, forming an intelligent operations flywheel that is evaluable, rollbackable, and self-evolving.

Open-source contributions:

Along with the product release, Alibaba Cloud also open-sourced the UModel unified data model project and the RCA evaluation benchmark set, and jointly launched the "Industry Initiative for Enterprise General Semantic Standards" with more than 10 industry partners and academic institutions, including the China Academy of Information and Communications Technology, XPeng Motors, and the Institute of Software, Chinese Academy of Sciences. This allows enterprises to avoid being tied to a single vendor and to flexibly build intelligent operations systems based on public standards. UModel provides enterprises with directly reusable entity modeling and semantic governance standards, eliminating the high cost of starting from scratch; the RCA evaluation benchmark set covers more than 2,000 evaluation records and more than 700 operations scenarios, providing enterprises with a public benchmark for independently evaluating operations AI capabilities.


Outlook: In the AI Era, Effectiveness Is King

Looking back at Alibaba Cloud's entire Agent Infra design philosophy, one core idea runs throughout: in the AI era, effectiveness is king. Alibaba Cloud uses Agent Infra to help enterprises focus on outcomes and win the next round of growth in the intelligent era.

AgentRun makes building easier, AgentTeams makes collaboration transparent and secure, AgentLoop makes results measurable and evolvable, and STAROps makes operations intelligent, forming an organic whole.

The paradigm shift from deterministic systems to probabilistic agents has already happened. The underlying infrastructure is no longer just a resource pool; it must become a platform that supports the dynamic operation and continuous evolution of agents. When engineering complexity is absorbed by infrastructure, every bit of effort freed up by enterprises will directly translate into incremental business results.

Related Links

[1] https://aiops-benchmark.oss-cn-hongkong.aliyuncs.com/rca/rca100/v1.0/README.md
