Like the Harness engineering story, Agent Infra is also a buzzword that practitioners keep bringing up. But in practice, there is still no absolute consensus on what Agent Infra should include.
Today, at the 2026 Alibaba Cloud Summit, Feifei Li, CTO of Alibaba Cloud and President of its International Business, shared what Alibaba Cloud's Agent Infra looks like. It comprises six foundational infrastructure capabilities: Agent runtime, Agent orchestration, Agent governance, Agent memory, Agent data plane, and Agent security. Together they address six major challenges that agents pose: irregular bursty loads, large-scale dynamic orchestration, short lifecycles, complex data modalities and storage forms, dynamic environmental dependencies, and task-level security control.
At the Agent Native Infrastructure Subforum, Guoqiang Li, product lead for Alibaba Cloud Intelligent Cloud-Native Application Platform, shared the team's complete thinking and product practices in the field of agent engineering. The central question, from building and deployment to scaled operations, was how a single Agent Infra can cover the full lifecycle of agent development, runtime, governance, operations, and optimization. Here are the core takeaways from this session.
Today, enterprises are more eager than ever to put agents into production. Gartner predicts that by the end of 2026, 70% of enterprises will be running AI agents in production, and 40% of enterprise applications will embed agents to achieve new business growth, up from less than 5% at the beginning of 2025. But as enterprises rush ahead, the engineering challenges are getting deeper and harder.

First, agent architectures have many dependencies, so how can they be built and deployed quickly? There are many development frameworks and dependencies, and the runtime environment demands extremely high levels of isolation and elasticity. From a local IDE to production release, the path crosses multiple layers, including sandboxes, runtimes, model access, and credential management. The more steps there are, the longer the rollout cycle becomes.
Second, how can multiple agents be governed and collaborate? Multi-agent systems have become a trend for enterprise adoption. But if multiple agents each operate independently and communicate through black boxes, how can unified governance and control be achieved? How can humans and agents, and agents and agents, collaborate efficiently instead of working in silos?
Third, how can operational status be understood so costs remain under control? Agents are highly elastic, have many dependencies, and involve long call chains. Once token consumption gets out of control, costs become a black hole. Enterprises need to understand the runtime state of agents in real time from both operations and business perspectives.
Fourth, evaluation is difficult, so how can continuous optimization be achieved? Agent effectiveness is the key lifeline, but the runtime process is a black box, and traditional testing methods fall short. How to build an evaluation system and drive agents to evolve autonomously is a question facing every team.
Fifth, in complex architectures, operational issues are discovered slowly and are hard to fix. Agents further increase system complexity, and traditional SRE approaches are no longer sufficient. Intelligent methods are needed to ensure the continuity of emerging intelligent services.
These five pain points map to different stages of the agent lifecycle. Alibaba Cloud's answer is a complete Agent Infra product matrix.
The design logic of Alibaba Cloud's Agent Infra is to let enterprises "focus on outcomes and leave engineering to infrastructure." Centered on the five stages of agent development, runtime, governance, operations, and optimization, five core products each serve their own role:

Next, let's break them down one by one in the order of the agent lifecycle, from building to operations.
Built on Function Compute, AgentRun is a one-stop Agentic AI infrastructure platform centered on high-code, with an open ecosystem and flexible composition, providing enterprise agents with full-lifecycle management for development, debugging, deployment, and operations.

Its core design philosophy is a dual-track approach: "high-code for flexible customization + low-code for rapid validation":
In one sentence: AgentRun is the engineering foundation that takes agents from "can run" to "run well."
If AgentRun solves the question of "how to build a single agent," AgentTeams answers the question of "how multiple agents collaborate." This is a product upgrade from microservice governance to multi-agent governance.

2024-2025 was the trial period for single agents, 2025-2026 is the departmental multi-agent pilot phase, and 2026-2027 will usher in large-scale enterprise deployment. The new problems enterprises face are: agents scattered across departments with no unified governance view, black-box communication between agents that humans cannot effectively supervise, agents directly holding credentials and thereby creating security risks, and unmonitored token consumption leading to uncontrollable costs.
AgentTeams is a one-stop enterprise multi-agent governance and collaboration platform focused on four core needs: unified governance (multi-source agent onboarding without being tied to a single vendor), collaboration orchestration (Leader-Worker, human in the loop), security and compliance (enterprise SSO integration, full-chain auditing), and controllable cost (usage-based billing, token monitoring and limits).
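To make the "token monitoring and limits" idea concrete, here is a minimal sketch of a per-team token budget. It is an illustration only: `TokenBudget`, `BudgetExceeded`, and the agent names are hypothetical and do not reflect the AgentTeams API, which the talk did not describe at this level.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a call would push a team past its token limit."""


@dataclass
class TokenBudget:
    # Hypothetical per-team limit; not an actual AgentTeams setting.
    limit: int
    used: int = 0
    by_agent: dict[str, int] = field(default_factory=dict)

    def record(self, agent: str, tokens: int) -> int:
        """Record usage for one call and return the remaining budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{agent} would exceed the limit of {self.limit} tokens"
            )
        self.used += tokens
        self.by_agent[agent] = self.by_agent.get(agent, 0) + tokens
        return self.limit - self.used


budget = TokenBudget(limit=1000)
budget.record("planner", 400)
budget.record("coder", 500)
try:
    budget.record("coder", 200)  # 900 + 200 > 1000, so this call is rejected
    blocked = False
except BudgetExceeded:
    blocked = True
```

Rejecting the call before it is made, rather than after, is one possible design choice: it keeps spend bounded even when a single agent misbehaves, at the price of needing a token estimate up front.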
AgentTeams covers four core scenarios. First, enterprise digital employees: users initiate tasks through enterprise IM, and AgentTeams schedules Agent Teams by department to execute them, with full audit traceability. Second, Agent Team service enablement: administrators create Team pools configured by role, and business teams apply for access as needed through RBAC, with independent quotas and billing. Third, SaaS Agent Team enablement: SaaS vendors allocate independent Agent Teams to different tenants and control the accessible Skills/MCP through permission policies, with data and calls isolated per tenant. Fourth, governance of existing agents: heterogeneous agents that are already deployed and running can be centrally governed and orchestrated without modification, and their assets can be unified, accumulated, and reused.
The management layer of AgentTeams is built on the open-source project HiClaw (an open-source framework for multi-agent governance and collaboration), while the agent core is Alibaba Cloud's self-developed agent engine QwenPaw, balancing flexibility with out-of-the-box usability. AgentTeams is currently in invite-only testing.
Effectiveness is the lifeline of agents. But unlike traditional applications, whether an agent works well cannot be judged from a single launch; it needs a continuously running data flywheel to drive its evolution. AgentLoop was created for exactly this purpose: it is a full-lifecycle observability and data flywheel platform for agents, covering two areas, observability and evaluation & optimization.

▍AgentLoop - Observability
The design goal of AgentLoop observability is "zero-modification access, full-chain visibility."
At the integration layer, AgentLoop supports multiple collection methods, including self-developed probes, OpenTelemetry SDK, and OTel eBPF, and is compatible with mainstream agent frameworks and platforms such as QwenPaw, HiClaw, Dify, Hermes-Agent, Coze, AgentScope, Alibaba Cloud Bailian applications, AgentRun, LangChain/LangGraph, and OpenAI, enabling out-of-the-box, non-intrusive integration.
At the analysis layer, AgentLoop provides multi-dimensional performance profiling and intelligent anomaly diagnosis, covering latency distribution, call hotspots, and token cost attribution, turning the "black-box agent" into a "transparent agent."
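As a rough illustration of what token cost attribution could look like, the sketch below sums per-call costs by agent from a list of trace spans. The `Span` fields, the model names, and the prices in `PRICE_PER_1K` are all made up for the example; they are not AgentLoop's data model or real pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One model call inside a trace (hypothetical shape)."""
    trace_id: str
    agent: str
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical (input, output) prices per 1,000 tokens; not real pricing.
PRICE_PER_1K = {"model-a": (0.5, 1.5), "model-b": (0.1, 0.3)}


def span_cost(span: Span) -> float:
    """Cost of a single call from its token counts and the model's price."""
    price_in, price_out = PRICE_PER_1K[span.model]
    return (span.input_tokens * price_in + span.output_tokens * price_out) / 1000


def attribute_cost(spans: list[Span]) -> dict[str, float]:
    """Sum call costs per agent across all traces."""
    totals: dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s.agent] += span_cost(s)
    return dict(totals)


spans = [
    Span("t1", "planner", "model-a", 2000, 1000),
    Span("t1", "searcher", "model-b", 10000, 0),
    Span("t2", "planner", "model-a", 1000, 0),
]
costs = attribute_cost(spans)
```

Grouping by `agent` is only one cut; the same loop could key on `trace_id` or `model` to answer "which request" or "which model" is driving spend.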
▍AgentLoop - Evaluation and Optimization
Observability finds problems; evaluation and optimization solve them. AgentLoop builds a complete data flywheel of Collect → Analyze → Evaluate → Optimize:
AgentLoop's evaluation capabilities also include continuous dataset construction and accumulation. Observability data is not just viewed once and discarded; it is accumulated into reusable evaluation datasets, so that every online interaction becomes fuel for optimizing agents. This flywheel gives agents true "continuous acceleration": the more they are used, the better they run. AgentLoop - Evaluation and Optimization is expected to enter public beta in June.
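The dataset accumulation step can be pictured with a small sketch: well-rated online traces become evaluation cases, and a candidate agent is then scored against them. Everything here is an assumption made for illustration, including the `Trace` and `EvalCase` shapes, the 1-5 `user_rating` field, and exact-match scoring, which stands in for whatever evaluators a real platform would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    question: str
    answer: str
    user_rating: int  # hypothetical 1-5 feedback score


@dataclass(frozen=True)
class EvalCase:
    question: str
    reference: str


def build_dataset(traces, existing, min_rating=4):
    """Append well-rated traces as eval cases, skipping duplicate questions."""
    seen = {c.question for c in existing}
    dataset = list(existing)
    for t in traces:
        if t.user_rating >= min_rating and t.question not in seen:
            dataset.append(EvalCase(t.question, t.answer))
            seen.add(t.question)
    return dataset


def evaluate(agent, dataset):
    """Fraction of cases where the agent's output equals the reference exactly."""
    if not dataset:
        return 0.0
    hits = sum(agent(c.question) == c.reference for c in dataset)
    return hits / len(dataset)


traces = [
    Trace("capital of France?", "Paris", 5),
    Trace("2+2?", "5", 1),                    # low rating, not kept
    Trace("capital of France?", "Paris", 4),  # duplicate question, not kept
    Trace("2+2?", "4", 5),
]
dataset = build_dataset(traces, existing=[])

# A stand-in "agent": a lookup table with one wrong answer.
candidate = {"capital of France?": "Paris", "2+2?": "22"}.get
score = evaluate(candidate, dataset)
```

Each new batch of traces can be passed in with the previous dataset as `existing`, so the set of cases only grows, which is the accumulation the flywheel relies on.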
Large-scale deployment of agents inevitably increases system complexity. When call chains span multiple layers of models, tools, middleware, and infrastructure, traditional manual operations are no longer enough. STAROps is Alibaba Cloud's all-domain intelligent operations platform, combining large-model capabilities with observability data to autonomously complete the full closed loop of sensing, decision-making, execution, and verification. Centered on Sense, Target, Autonomy, and Resilience, STAROps shifts the operations model from passive response to active autonomy, providing enterprises with 24/7 uninterrupted autonomous operations capabilities.

To achieve this goal, STAROps provides three core capabilities.
Core technical advantages:
STAROps performs unified modeling of all-domain data. Using unified observability data as the foundation, its self-developed UModel unifies logs, metrics, traces, events, topology, and other data into a single model, building an operations graph specific to the customer's system. During AI analysis, it automatically perceives service clusters, dependent components, and call relationships, enabling end-to-end tracing from the business layer to the infrastructure layer. It also supports custom extensions by business scenario, enabling real-time topology reasoning and automatic correlation of fault causality.
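One way to picture topology-based fault correlation is a walk over a dependency graph. The sketch below is a simple heuristic chosen for illustration, not UModel's schema or STAROps' algorithm: among the alerting entities reachable from a symptom, it keeps those whose direct dependencies are all healthy. The entity names and the `DEPENDS_ON` map are hypothetical.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "checkout-api": ["order-agent"],
    "order-agent": ["llm-gateway", "orders-db"],
    "llm-gateway": ["gpu-node-1"],
    "orders-db": [],
    "gpu-node-1": [],
}


def reachable(start, graph):
    """Breadth-first walk: every entity reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def root_cause_candidates(symptom, alerting, graph):
    """Alerting entities under symptom whose direct dependencies are all healthy."""
    scope = reachable(symptom, graph) & alerting
    return sorted(
        e for e in scope
        if not any(d in alerting for d in graph.get(e, []))
    )


alerting = {"checkout-api", "order-agent", "llm-gateway", "gpu-node-1"}
candidates = root_cause_candidates("checkout-api", alerting, DEPENDS_ON)
```

Here the alerts on `checkout-api`, `order-agent`, and `llm-gateway` are each explained by an alerting dependency, leaving `gpu-node-1` as the only candidate.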
At the data analysis layer, the platform includes general-purpose operators and observability AI operators, covering typical scenarios such as metric anomaly detection, log clustering, trace analysis, performance profiling, and change tracing, shortening the time needed to identify and handle root causes. At the same time, through algorithm light-weighting and optimized compute strategies, it significantly reduces the resource overhead of model inference.
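Metric anomaly detection, the first scenario listed, can be sketched with a trailing-window z-score. This is a textbook baseline picked for illustration; the source does not say which algorithms the platform's operators use, and the `window` and `threshold` defaults are arbitrary.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 450, 101, 100]
spikes = zscore_anomalies(latency_ms)
```

Note that the spike itself enters the window for the following points and inflates their standard deviation, which is why the values right after it are not flagged; a production detector would usually handle this more carefully.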
In addition, STAROps has built a fault simulation system that closely mirrors the production environment, closing the loop of "fault injection - data collection - intelligent diagnosis - automatic repair." By combining online status with offline simulation, it continuously iterates on analysis models and operations strategies, forming an intelligent operations flywheel that can be evaluated, can be rolled back, and evolves on its own.
Open-source contributions:
Along with the product release, Alibaba Cloud also open-sourced the UModel unified data model project and the RCA evaluation benchmark set, and jointly launched the "Industry Initiative for Enterprise General Semantic Standards" with more than 10 industry partners and academic institutions, including the China Academy of Information and Communications Technology, XPeng Motors, and the Institute of Software, Chinese Academy of Sciences. This allows enterprises to avoid being tied to a single vendor and to flexibly build intelligent operations systems based on public standards. UModel provides enterprises with directly reusable entity modeling and semantic governance standards, eliminating the high cost of starting from scratch; the RCA evaluation benchmark set covers more than 2,000 evaluation records and more than 700 operations scenarios, providing enterprises with a public benchmark for independently evaluating operations AI capabilities.

Looking back at Alibaba Cloud's entire Agent Infra design philosophy, one core idea runs throughout: in the AI era, effectiveness is king. Alibaba Cloud uses Agent Infra to help enterprises focus on outcomes and win the next round of growth in the intelligent era.
AgentRun makes building easier, AgentTeams makes collaboration transparent and secure, AgentLoop makes results measurable and evolvable, and STAROps makes operations intelligent, forming an organic whole.
The paradigm shift from deterministic systems to probabilistic agents has already happened. The underlying infrastructure is no longer just a resource pool; it must become a platform that supports the dynamic operation and continuous evolution of agents. When engineering complexity is absorbed by infrastructure, every bit of effort freed up by enterprises will directly translate into incremental business results.
[1] https://aiops-benchmark.oss-cn-hongkong.aliyuncs.com/rca/rca100/v1.0/README.md