
The Changes in the Agent Development Toolchain and the Invariance of the Application Architecture

This article introduces the rapid evolution of Agent development toolchains across four stages, contrasting it with the relatively stable underlying Agent application architecture.

By Wang Chen

Almost every month, new commercial products or open-source projects emerge in the field of Agent development tools, yet the underlying Agent application architecture remains relatively stable.

The Agent Development Toolchain Is Rapidly Evolving

Large models have brought awareness and autonomy, but at the cost of determinism and consistency in their outputs. Whether foundation model vendors or companies providing development toolchains and operational guarantees, all are essentially working to improve output reliability; differing team DNA and industry judgment have simply produced different implementation paths. Below, we review the evolution of the Agent development toolchain in four stages, connecting several well-known development tools.

Stage One: Basic Development Framework

At the end of 2022, the release of ChatGPT let the world feel the general-intelligence potential of large language models for the first time. At that point, however, LLMs were still isolated intelligences, unable to harness the strength of the vast developer community to accelerate industry development.

This was followed by the first batch of Agent frameworks, such as LangChain and LlamaIndex, which reduced development complexity through modular abstractions (model communication, ChatClient, Prompt, formatted output, Embedding, and so on), enabling developers to quickly build chatbots, wire up context, and call models.
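To make the abstraction concrete, here is a framework-agnostic sketch of the kind of building blocks these frameworks provide: a prompt template plus a chat client. The names and the stubbed model call are illustrative assumptions, not any specific framework's API.

```python
# Framework-agnostic sketch of common Agent-framework abstractions:
# a prompt template and a chat client. Illustrative only.
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    template: str

    def format(self, **kwargs) -> str:
        # Fill named placeholders in the template.
        return self.template.format(**kwargs)

class ChatClient:
    def __init__(self, model: str):
        self.model = model

    def complete(self, prompt: str) -> str:
        # A real client would call the model API here; this stub echoes.
        return f"[{self.model}] response to: {prompt}"

prompt = PromptTemplate("Summarize the following text in one sentence:\n{text}")
client = ChatClient(model="example-llm")
print(client.complete(prompt.format(text="Agents combine models with tools...")))
```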

In 2024, Spring AI Alibaba launched, providing high-level AI API abstractions and cloud-native infrastructure integrations to help Java developers quickly build AI applications. Going forward, it will become part of the AgentScope ecosystem, positioned as the bridge between Spring and AgentScope, with an AgentScope Java version planned for release by the end of November this year, aligned with the capabilities of AgentScope Python.

With the industry's rapid development, these basic development frameworks have continued to evolve, gradually supporting or integrating capabilities such as retrieval, retrieval-augmented generation (RAG), memory, tools, evaluation, observability, and AI gateways; offering single-agent, workflow, and multi-agent development paradigms; and spawning framework-based deep-research and general-purpose agents such as DeepResearch and JManus.

Although development frameworks may not be as glamorous as model research, they play an irreplaceable role in quickly bringing the vast developer community into the AI development ecosystem.

Stage Two: Collaboration & Tools

Although large models are intelligent, they lack the tools to reach into the physical world; they can neither read from nor write to it. Meanwhile, the application development frameworks of the first stage are unfriendly to non-programmers, which hinders inter-team collaboration and the involvement of domain experts. Thus, between 2023 and 2024, low-code and even no-code platforms such as Dify and n8n were pushed into enterprise production environments: task processing is defined through workflows with if/else branches, and natural language can even generate simple frontend pages, improving collaboration between domain experts and programmers.
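The pattern these platforms expose visually can be sketched in a few lines of plain Python: nodes chained together, with an if/else branch on a classifier's output. The classifier and flows below are purely illustrative.

```python
# Minimal sketch of the workflow-with-branches pattern that low-code
# tools expose as a visual canvas. All nodes are illustrative.
def classify(ticket: str) -> str:
    # Branch node: route based on the ticket's content.
    return "refund" if "refund" in ticket.lower() else "general"

def refund_flow(ticket: str) -> str:
    return f"Routed to refunds team: {ticket}"

def general_flow(ticket: str) -> str:
    return f"Answered by FAQ bot: {ticket}"

def run_workflow(ticket: str) -> str:
    branch = classify(ticket)  # the if/else branch node
    return refund_flow(ticket) if branch == "refund" else general_flow(ticket)

print(run_workflow("I want a refund for my order"))
```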

On the tools front, OpenAI officially launched Function Calling in June 2023, and in November 2024 Anthropic released the MCP protocol to enable cross-model tool interoperability. The emergence of MCP in particular significantly activated the developer ecosystem.
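For reference, a Function Calling tool is described to the model as a JSON schema, and the model replies with structured call arguments instead of free text. Below is a minimal OpenAI-style tool definition; the weather tool itself is an invented example.

```python
# Minimal sketch of an OpenAI-style function-calling tool definition.
# The model receives schemas like this and responds with structured
# arguments ({"city": "...", "unit": "..."}) for the application to execute.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```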

Together, these two forces, collaboration-friendly platforms and standardized tools, pushed the Agent development toolchain into its second stage: Collaboration & Tools.

However, simply lowering the barrier to building applications and letting them reach external systems through tools has not solved the consistency and reliability of outputs. The evolution of the developer toolchain has therefore entered deep waters. In 2025, Andrej Karpathy popularized the term context engineering, sparking industry-wide resonance: how to select context, organize context, and dynamically adjust context structure across different tasks became the key to improving output stability, ushering in the stage of reinforcement learning (RL).

Stage Three: Reinforcement Learning

System prompts, knowledge bases, tools, and memory are the main components of context engineering. Although these mechanisms have matured, outputs still fluctuate, so the industry relies on RL to turn context engineering from static templates into dynamic, intelligent strategies (a minimal sketch follows this list). For example:

  • RAG retrieval ranking: RL optimizes document reordering strategies, making contexts closer to task semantics and reducing irrelevant noise.
  • Multi-turn dialogue memory: RL optimizes memory retention and forgetting strategies, allowing dialogue to maintain coherence over long-term interactions.
  • Tool calls: RL learns the timing and parameter construction of calls, improving the efficiency and accuracy of tool usage.
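As a toy sketch of the first item, the REINFORCE loop below learns a linear scoring policy that ranks documents by a hidden relevance signal. The features, reward, and hyperparameters are all illustrative assumptions, not any production system's method.

```python
# Toy REINFORCE sketch: learn a reranking policy that prefers documents
# with high (hidden) relevance. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_features = 8, 4

X = rng.normal(size=(n_docs, n_features))   # candidate document features
true_w = np.array([1.0, -0.5, 0.8, 0.2])    # hidden relevance direction
relevance = X @ true_w

w = np.zeros(n_features)                    # learned scoring weights
lr = 0.05

for _ in range(2000):
    scores = X @ w
    probs = np.exp(scores - scores.max())   # softmax policy over documents
    probs /= probs.sum()
    picked = rng.choice(n_docs, p=probs)    # sample a doc to rank first
    reward = relevance[picked]              # toy reward: its hidden relevance
    # REINFORCE: grad of log-prob of the pick is x_picked - E[x under policy]
    w += lr * reward * (X[picked] - probs @ X)

print("learned order:", np.argsort(X @ w)[::-1])
print("ideal order  :", np.argsort(relevance)[::-1])
```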

RL remains an industry-wide challenge: it depends on algorithmic expertise, demands deep domain know-how, and struggles to generalize. Even so, there are notable practical implementations.

Jina.AI was recently acquired by Elastic; its CEO, Dr. Han Xiao, shared in an article titled "The Future of Search Lies in a Bunch of Small Models" Jina.AI's research on search foundation models, which mainly comprises Embeddings, Reranker, and Reader:

  • Embeddings are vector models designed for multilingual and multimodal data, turning text or images into fixed-length vectors.
  • Reranker is a precise ranking model built on a query-documents design: given a query and a set of documents, both are fed directly into the model, which outputs the documents' relevance ranking for that query.
  • Reader primarily uses generative small language models (SLMs) to achieve intelligence over single documents, such as data cleaning, filtering, and extraction.
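The division of labor between Embeddings and Reranker can be sketched as a two-stage pipeline: a cheap bi-encoder recalls candidates, then a reranker rescores query-document pairs. The hashing "embeddings" and overlap-based "reranker" below are toy stand-ins, not Jina.AI's actual models or API.

```python
# Toy embed-then-rerank pipeline: recall with vector similarity,
# then rescore query-document pairs. Stand-ins only.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding: hash words into a bag-of-words vector.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def recall(query: str, docs: list[str], k: int = 3) -> list[str]:
    q = embed(query)
    order = np.argsort([float(q @ embed(d)) for d in docs])[::-1][:k]
    return [docs[i] for i in order]

def rerank(query: str, candidates: list[str]) -> list[str]:
    # A real reranker feeds (query, doc) pairs into a cross-encoder;
    # this toy version rescores by word overlap.
    overlap = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return sorted(candidates, key=overlap, reverse=True)

docs = ["embeddings turn text into vectors",
        "rerankers sort documents by relevance to a query",
        "readers extract and clean single documents",
        "completely unrelated cooking recipe"]
query = "how do rerankers sort documents"
print(rerank(query, recall(query, docs)))
```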

Additionally, Alibaba Cloud's API gateway provides RL-based tool optimization and semantic retrieval, improving the call quality of large batches of MCP tools while reducing call time. Through reordering and optional query rewriting, it pre-processes and filters the tool list before the request is sent to the large language model (sketched after the results below), improving response speed and selection accuracy in large-scale toolset scenarios while reducing token costs. In evaluations on Salesforce's open-source dataset with toolsets of different scales (50/100/200/300/400/500), the results indicate:

  • Accuracy improvement: with query rewriting and tool ranking, the selection accuracy of tools and parameters improves by up to 6%.
  • Response speed improvement: once the toolset exceeds 50 tools, response time (RT) drops significantly; in a 500-tool test scenario, response speed improves by up to 7x.
  • Cost reduction: token consumption (cost) is reduced by a factor of 4 to 6.
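Conceptually, the pre-processing step works as sketched below: rewrite the query, score each tool's description against it, and forward only a shortlist to the model. The rewrite and scoring functions here are illustrative stand-ins for the RL-learned components, not the gateway's real implementation.

```python
# Sketch of gateway-side tool pre-filtering: rewrite the query, score
# each tool description, keep only the top-k for the LLM. Illustrative.
def rewrite_query(query: str) -> str:
    # Placeholder for the learned rewrite; a real one would expand/normalize.
    return query.lower()

def score(query: str, description: str) -> int:
    # Stand-in for semantic scoring: plain word overlap.
    return len(set(query.split()) & set(description.lower().split()))

def prefilter_tools(query: str, tools: list[dict], k: int = 2) -> list[dict]:
    q = rewrite_query(query)
    return sorted(tools, key=lambda t: score(q, t["description"]), reverse=True)[:k]

tools = [
    {"name": "weather",  "description": "query the weather forecast for a city"},
    {"name": "flights",  "description": "search and book flights"},
    {"name": "currency", "description": "convert an amount between currencies"},
    {"name": "stocks",   "description": "query the latest stock price"},
]
shortlist = prefilter_tools("What is the weather forecast tomorrow?", tools)
print([t["name"] for t in shortlist])  # only the shortlist reaches the model
```

In production, the scoring is semantic (embeddings plus a learned reranker) rather than word overlap, which is what makes the accuracy and latency gains above possible.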

Enterprises that excel at RL tend to treat these practices as the competitive, revenue-generating core of their commercial products, so they do not empower developers as directly as frameworks or tools do. This led to the fourth stage, in which foundation model vendors engage in context engineering themselves.

Stage Four: Model Centralization

In October 2025, OpenAI released AgentKit and the Apps SDK, and Anthropic released Claude Skills, marking Agent engineering's entry into the era of model centralization.

  • OpenAI AgentKit and Apps SDK: provide an officially sanctioned Agent development toolchain that hosts memory, the tool registry, and external application calling logic directly on the model side, lowering development barriers.
  • Claude Skills: let the model itself load and manage skills; users merely provide input, while the model constructs the context and capability invocation chain internally.

Claude Skills go a step further: where MCP connects tools, Skills do not even require MCP; a Skill executes Python scripts to call APIs directly (a sketch of such a script follows), and the large model can even generate new Skills. This shifts the responsibilities of Agent context engineering, including construction, execution, and operation, from developers to the model side.
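As an illustration of that last point, a script bundled with a Skill might call an API directly, as in this minimal sketch. The endpoint and function are placeholders, not part of any real Skill.

```python
# Sketch of the kind of script a Skill might execute to call an API
# directly, with no MCP server in between. The endpoint is a placeholder.
import json
import urllib.request

def fetch_rates(base: str = "USD") -> dict:
    # Hypothetical REST endpoint; a real Skill would target an actual service.
    url = f"https://api.example.com/rates?base={base}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(json.dumps(fetch_rates(), indent=2))
```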

Agent Application Architecture is Relatively Stable

Compared with the rapidly evolving Agent development toolchain, the infrastructure that the Agent application architecture maps onto remains relatively stable.

As we shared in the "AI Native Application Architecture White Paper," an AI-native application architecture comprises 11 key elements: models, development frameworks, prompts, RAG, memory, tools, AI gateways, runtime, observability, evaluation, and security. Take AI gateways, runtime, observability, and security as examples:

  • AI Gateway: responsible for aggregating and intelligently scheduling models and tools, providing access authorization along with load balancing, rate limiting, and other traffic governance capabilities. Whichever toolchain developers use, they ultimately need a control hub with load balancing, rate limiting, identity authentication, and call chain tracing (a minimal rate-limiting sketch follows this list).
  • Runtime: provides the operating environment and computing power, handling task scheduling, state management, security isolation, timeout management, and concurrency tracking so that Agents run reliably and economically. Whether an Agent is deployed privately or orchestrated across multiple models on public cloud, it all comes down to GPU resource allocation, concurrent model inference, container isolation, and scheduling efficiency; these capabilities are not frequently rebuilt just because the tool layer changes.
  • Observability: because an Agent dynamically composes multiple elements (models, tools, RAG, memory, etc.) into a complex system, the lack of unified observability directly makes the engineering uncontrollable. The industry has largely reached consensus here: end-to-end log collection and trace propagation across applications, gateways, and inference engines, surfacing request throughput, error rates, and resource usage to keep applications stable, secure, and efficient. Its structure evolves slowly.
  • Security: Agent security still follows the general logic of cloud computing architecture, including identity authentication, access control, data de-identification, and protection against privilege escalation. In multi-model, multi-tenant environments in particular, the determinism of security policies matters more than the flexibility of the toolchain.
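To ground the gateway bullet, here is a minimal token-bucket rate limiter of the kind a gateway applies per caller. It is an illustrative sketch, not how Higress or any particular gateway implements it.

```python
# Minimal per-caller token-bucket rate limiter, one classic gateway
# responsibility. Illustrative sketch only.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity   # refill rate, burst size
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def admit(caller: str, rate: float = 5.0, burst: float = 10.0) -> bool:
    bucket = buckets.setdefault(caller, TokenBucket(rate, burst))
    return bucket.allow()

# A burst of ~10 requests is admitted, then further calls are throttled.
print([admit("tenant-a") for _ in range(12)])
```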

Rapid iteration and innovation in the toolchain improve output reliability; runtime-layer modules such as gateways, computing power, observability, and security keep applications running stably, economically, and safely. It is precisely this structure of "rapid change above, stability below" that lets the AI application ecosystem innovate at high speed without descending into systemic chaos.


If you want to learn more about Alibaba Cloud API Gateway (Higress), please click: https://higress.ai/en/
