Author: Wang Chen
Recently, two articles[1][2] have gained significant attention; both coincidentally point out that, as AI has developed to its current stage, the engineering aspect of AI applications has been underestimated.
However, "engineering" is a very general term that encompasses a wide range of content. Broadly speaking, non-algorithmic technical implementations and product design can all be categorized as engineering. This article provisionally divides engineering into product engineering and technical engineering, and attempts to simplify the construction of an AI Agent's engineering system through this lens.
Engineering = Product Engineering + Technical Engineering
The collaboration between these two components determines whether an AI Agent is "usable, easy to use, and scalable."
Product engineering focuses on the overall considerations of product philosophy, product business, interaction design, and user experience, ensuring that AI is no longer just a "black box," but that it can be perceptive, guiding, and feedback-rich, with a self-correcting mechanism. We will first deconstruct product engineering and then focus on key modules to elaborate on their role in achieving a successful AI Agent.
| Module | Definition |
|---|---|
| Demand Modeling | Clarify who the AI application serves, what problems it can solve, and avoid "using AI for the sake of using AI." |
| UI/UX Design | Transform the complex behaviors of AI into interfaces and processes that users can understand and operate. |
| Human-Machine Interaction Process | Allow the AI to "ask questions" and "confirm decisions," completing tasks rhythmically like an assistant. |
| Prompt Engineering | Make good use of prompts like a "magic wand" to enhance the quality and consistency of AI outputs. |
| Feedback Loop | Enable users to provide feedback on results, allowing the system to learn to improve or signal failures. |
| Permissions and Compliance | Control who can use what data and prevent AI abuse or data leaks. |
The first step in building an AI Agent is not choosing the model, but answering a question like a product manager: "Who is this AI supposed to help, what problems can it address, how will it solve them, to what extent can those problems be solved, and is the user willing to pay for this value?" This defines the fit between the product and the market.
Take Manus as an example[3]. It is billed as the world's first general-purpose AI agent, with the core idea of "a combination of hand and brain," emphasizing the shift of AI from a passive tool to an active collaborator.
Example: A user inputs "7-day budget of 20,000 yuan for a trip to Thailand," and Manus automatically completes currency conversion, hotel comparisons, itinerary planning, and exports a PDF manual.
This role definition requires clearly delineating AI's responsibilities and behavioral boundaries in the system prompts, ensuring that it possesses autonomy and reliability in executing tasks.
Example: A user requests an analysis of stock xx, and Manus automatically retrieves relevant data, carries out financial modeling, generates an interactive dashboard, and deploys it as an accessible website.
In demand modeling, user requirements need to be broken down into multiple sub-tasks, and corresponding execution processes and tool invocation strategies must be designed to ensure smooth task closure.
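The decomposition described above can be sketched as a simple plan structure. The sketch below uses the Thailand-trip example from earlier; the subtask names, tool names, and `plan_trip` helper are hypothetical illustrations, not Manus's actual internals.

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    """One step of a decomposed user request, with its tool-invocation strategy."""
    name: str
    tool: str                       # which capability the agent should invoke
    depends_on: list = field(default_factory=list)

def plan_trip(request: str) -> list:
    """Hypothetical decomposition of '7-day budget of 20,000 yuan for a trip to Thailand'."""
    return [
        SubTask("convert_budget", tool="currency_api"),
        SubTask("compare_hotels", tool="hotel_search", depends_on=["convert_budget"]),
        SubTask("plan_itinerary", tool="llm", depends_on=["compare_hotels"]),
        SubTask("export_pdf", tool="pdf_renderer", depends_on=["plan_itinerary"]),
    ]

plan = plan_trip("7-day budget of 20,000 yuan for a trip to Thailand")
print([t.name for t in plan])
```

The explicit `depends_on` edges are what make task closure checkable: the agent can verify that every prerequisite finished before invoking the next tool.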
In the demand modeling phase, the responsibilities and interactions of each layer need to be defined to ensure the coordinated operation of the entire system.
Example: During the task execution process in Manus, users can close devices or add instructions at any time; Manus will adjust the task execution process according to new instructions.
This design requires considering the flexibility and controllability of human-machine interaction in demand modeling, ensuring that users have sufficient control during collaboration.
This approach to demand modeling is similar to segmenting user task processes and identifying the areas where AI excels and should intervene, avoiding the "broad and vague" approach while allowing users to experience efficiency improvements in real-time.
Just as designing service boundaries for microservices entails defining which responsibilities lie with the order unit and which with the user inventory unit, AI applications also need to specify which tasks are handled by AI and which fall under business logic; these distinctions will directly determine the final user experience.
For example, DeepSeek was the first to visualize the "thought processes" of a large model; before generating a response, it displays the model's thought chain, allowing users to see that the AI is not guessing randomly but is logically thinking. Users are no longer passively receiving results; instead, they are participating in the thought process, establishing a collaborative relationship in problem-solving.
This design effectively enhances users' trust and acceptance, especially in scenarios involving multi-step tasks, complex document summaries, and cross-reference of information.
Currently, interaction strategies such as "progressive information presentation," "visualization of thought processes," and "structured results" have become standard for AI Agents. Users can view call chains and trace reference sources; Qwen even provides an option to delete reference sources, further reducing hallucinations caused by untrustworthy internet sources.
For example, NotebookLM's core concept is: users upload their own materials, and the AI assistant responds to questions and provides suggestions based on this material, acting as a "trusted knowledge advisor," empowering users to engage in more efficient and intelligent learning and research activities.
This product positioning indicates that the system prompts behind NotebookLM must not only fulfill basic language guidance but also implement more complex task instructions and safety constraints. We can break down its system prompts design approach from several key dimensions:
**Prompt:**
> You are a research assistant, and your responsibility is to help users understand the content of the documents they upload. When answering questions, only reference the provided materials without making subjective inferences.
This role definition not only controls the output range of the model but also makes it easier for users to trust the source of the responses psychologically.
**Prompt:**
> For each question, cite the relevant sections of the document, and list their titles and paragraph indices in markdown format.
This allows users to trace back and verify sources while reading model responses, creating a positive traceable experience. Trust in AI output quality often comes from this "source-citing" design.
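The "source-citing" design can be sketched as a post-processing step that attaches document titles and paragraph indices to an answer in markdown. The passage format below is an assumption for illustration, not NotebookLM's real data model.

```python
def format_answer_with_citations(answer: str, passages: list) -> str:
    """Append a markdown citation list (title + paragraph index) to an answer."""
    lines = [answer, "", "**Sources:**"]
    for p in passages:
        lines.append(f"- *{p['title']}*, paragraph {p['paragraph']}")
    return "\n".join(lines)

out = format_answer_with_citations(
    "The report projects 12% growth.",
    [{"title": "Q3 Report", "paragraph": 4},
     {"title": "Market Outlook", "paragraph": 2}],
)
print(out)
```

Keeping the citations machine-readable (title plus index, rather than free text) is what makes the traceability clickable in a UI.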
**Prompt:**
> Please extract the five most important points from the provided document and list them consecutively, maintaining an objective and neutral language style without providing further explanations.
Task-oriented prompt arrangement has transcended traditional "conversational question-and-answer" formats, resembling a form of "task script," laying the foundational capabilities for multimodal AI applications.
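A "task script" of this kind is usually assembled programmatically rather than typed by hand. The template and its parameters below are illustrative, not taken from any specific product.

```python
def build_extraction_prompt(n_points: int, style: str = "objective and neutral") -> str:
    """Compose a task-oriented extraction prompt instead of a free-form question."""
    return (
        f"Please extract the {n_points} most important points from the provided "
        f"document and list them consecutively, maintaining an {style} language "
        "style without providing further explanations."
    )

print(build_extraction_prompt(5))
```

Parameterizing the task script this way keeps output quality and format consistent across invocations, which is the point of prompt engineering as a module.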
For instance, Monica's memory function allows users browsing recommended memory entries to click "adopt as fact". These items will then be written into the memory database for use in subsequent conversations, while unadopted items will not be forcibly memorized. This "feedback → selective absorption → next-round tuning" mechanism enhances the traditional prompt + chat model.
Monica continuously learns from user characteristics to improve the accuracy of need understanding and response correctness. Essentially, it rebuilds context awareness in conversational interactions. Just like interpersonal communication, the longer two people interact, the more familiar they become with each other, enabling better understanding of each other's expressions.
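The "feedback → selective absorption → next-round tuning" loop can be sketched as a feedback-gated memory: only entries the user explicitly adopts are injected into later context. The `MemoryStore` class and its methods below are hypothetical, not Monica's actual API.

```python
class MemoryStore:
    """Minimal sketch of a feedback-gated memory: only user-adopted items persist."""

    def __init__(self):
        self.proposed = []   # candidate memory entries surfaced to the user
        self.adopted = []    # entries the user clicked "adopt as fact" on

    def propose(self, entry: str):
        self.proposed.append(entry)

    def adopt(self, entry: str):
        if entry in self.proposed:        # only surfaced candidates can be adopted
            self.adopted.append(entry)

    def context_for_next_turn(self) -> str:
        # Unadopted proposals are NOT injected; they are simply dropped.
        return "\n".join(f"Known fact: {e}" for e in self.adopted)

mem = MemoryStore()
mem.propose("User prefers answers in Chinese")
mem.propose("User works in finance")
mem.adopt("User prefers answers in Chinese")   # the user accepts one entry
print(mem.context_for_next_turn())
```

Routing memory through an explicit adopt step keeps the user in control of what the system "knows," which is the trust-building property the feedback loop is meant to provide.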
Technical engineering validates product engineering. Just as the fast fish ate the slow fish in the internet era, in the AI era the efficiency of technical engineering determines how quickly product engineering can be validated against the market and iterated on. Technical engineering is the logistics system that supports AI applications, encompassing architecture and modularity, tool invocation mechanisms, model and service integration, traffic and access control, data management and structured outputs, safety and isolation mechanisms, and DevOps with observability.
| Module | Definition |
|---|---|
| Architecture and Modularity | Break down the AI application into small modules, where each component has clear responsibilities, allowing for easier combination and maintenance. |
| Tool Invocation Mechanism | Enable the AI to invoke databases, check weather, place orders, etc., to truly "get things done". |
| Model and Service Integration | Integrate multiple models (DeepSeek, Qwen, local large models, etc.) for unified invocation and management. |
| Traffic and Access Control | Control usage frequency and access permissions for different users and models to prevent misuse or crashes. |
| Data Management and Structured Output | Convert the AI's free text into structured data, allowing the system to use it directly or store it in a database. |
| Safety and Isolation Mechanisms | Prevent data cross-use and unauthorized operations, which is particularly critical in multi-tenant or enterprise applications. |
| DevOps and Observability | Support gray releases, feature rollbacks, and performance alarms; log what happened during each invocation so problems can be identified and optimization metrics gathered, ensuring continuous and stable operation. |
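The "Data Management and Structured Output" row above typically means validating the model's free text against an expected schema before the rest of the system consumes it. A minimal sketch using only the standard library, with a hypothetical order-lookup shape:

```python
import json
from dataclasses import dataclass

@dataclass
class OrderStatus:
    order_id: str
    status: str

def parse_model_output(raw: str) -> OrderStatus:
    """Validate the model's free-text reply against the expected JSON shape."""
    data = json.loads(raw)                          # raises if not valid JSON
    missing = {"order_id", "status"} - set(data.keys())
    if missing:
        raise ValueError(f"model output missing fields: {missing}")
    return OrderStatus(order_id=data["order_id"], status=data["status"])

result = parse_model_output('{"order_id": "A-1024", "status": "shipped"}')
print(result.status)
```

Failing fast on malformed output is deliberate: a typed object either exists or the invocation is retried, so downstream business logic never has to guess at the model's formatting.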
Aside from Python-centered ecosystems such as LangChain and LangGraph, Spring AI Alibaba provides native, enterprise-level AI application orchestration for the Java community, making it a significant tool for building modular AI applications. Its core features extend the seven foundational capabilities listed above.
Suppose you are developing an intelligent customer service system for internal enterprise use, with requirements spanning prompt management for multiple business modules, backend tool invocation, and role-based access control.
Using Spring AI Alibaba, the system can be designed as follows:
- `@Prompt` annotations define prompt templates for the various business modules (e.g., "Customer Complaint Handling," "Expense Reimbursement Queries"); each module can be treated as a component class, making independent iteration and maintenance easier.
- `@Tool` or `@Function` annotations expose backend HTTP interfaces or local Java methods as functions the LLM can call, such as checking a customer's order status or triggering a CRM update. This is similar to LangChain's tool invocation mode, but natively supports the Spring Bean lifecycle and dependency injection.
- Roles carried in the JWT (e.g., `role=admin`) let the gateway enforce access permissions. It is like carrying a digital pass labeled "I am a VIP customer": the system scans the pass to confirm identity and permissions. This pattern is common in front-end/back-end separated architectures and in enterprise systems with SSO (single sign-on).

Moreover, AI gateways represented by Higress / Alibaba Cloud API Gateway offer further capabilities through a flexible, extensible plugin mechanism, allowing users to develop custom plugins to enrich functionality.
| Category | Traditional Application | Large Model Application |
|---|---|---|
| Observable Objects | Backend Logic, Database Queries, API Calls | Prompt Input/Output, Model Inference Process, Context Changes, Thought Chains |
| Focus Points | Performance Bottlenecks, Service Status, Exception Stacks | Reasonableness of Responses, Consistency, Deviations, Hallucination, and Potential Misinterpretations |
| Observable Granularity | Function-Level, Call Chain-Level | Token-Level, Semantic Level, Behavioral Path-Level |
For example, classic applications might focus on "Does this interface time out?" while large model applications also need to address questions like "Why did this model say something inappropriate?" and "Did it misunderstand the user's intention?"
For instance, an AI medical Q&A system might not produce any crash logs, yet if the model outputs an incorrect recommendation like "not advised to take cold medicine," then such an error requires semantic-level observability.
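Semantic-level observability means checking what the model said, not whether the service crashed. A toy rule-based checker makes the idea concrete; the rule list below is purely illustrative, not a real medical policy.

```python
# Toy semantic monitor: flags outputs that violate domain safety rules,
# even though the service itself returned successfully with no crash logs.
UNSAFE_PATTERNS = [
    "not advised to take",          # illustrative rule only
    "stop taking your medication",
]

def semantic_alerts(model_output: str) -> list:
    """Return the safety rules that the model's output appears to trigger."""
    text = model_output.lower()
    return [p for p in UNSAFE_PATTERNS if p in text]

alerts = semantic_alerts("It is not advised to take cold medicine.")
print(alerts)   # a non-empty list means the output needs human review
```

Production systems would replace the keyword list with model-based evaluation, but the monitoring contract is the same: an alert fires on output content, not on service errors.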
To tackle these issues, the first step is to determine which components are involved in a single invocation, and then connect all components through a calling chain. For complete pathway diagnostics, links need to be established; when a request encounters issues, it must be quickly identified which stage failed, whether it was the AI application or internal model inference.
The second step is to construct a full-stack observability data platform that can correlate all of this data well, including not just the links but also metrics; for example, GPU utilization within the model could help discern whether the issue is at the application or model level.
Finally, we should analyze model logs for input and output information from each invocation, using this data for evaluation and analysis to validate the quality of AI applications.
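The three steps above (link the call chain, correlate metrics, log model input/output) can be sketched as one trace record per invocation. The span names and the `gpu_util` field below are hypothetical placeholders, not a real telemetry schema.

```python
import time
import uuid

def trace_invocation(user_input: str) -> dict:
    """Record one end-to-end invocation: call-chain spans + metrics + model I/O."""
    trace = {
        "trace_id": str(uuid.uuid4()),
        "spans": [],          # step 1: the call chain across components
        "metrics": {},        # step 2: correlated metrics (e.g., GPU utilization)
        "model_io": {},       # step 3: input/output kept for quality evaluation
    }
    for component in ["gateway", "agent_app", "model_inference"]:
        trace["spans"].append({"component": component, "start": time.time()})
    trace["metrics"]["gpu_util"] = 0.83               # placeholder value
    trace["model_io"] = {"input": user_input, "output": "<model response>"}
    return trace

t = trace_invocation("analyze stock xx")
print([s["component"] for s in t["spans"]])
```

Because all three kinds of data share one `trace_id`, a bad answer can be walked back from the model's output, through the GPU metrics, to the exact component where the request went wrong.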
Building on these three approaches, we can provide, from a monitoring perspective, a range of observability techniques and core focus points at each level.
More importantly, the advancement of AI Agent engineering is not merely the concern of Agent Builders, but is also related to the evolution of the entire industry. Only by continuously investing in multiple dimensions—development platforms, traffic and access management, toolchains, observability, security, etc.—and constructing reliable, stable, and reusable application infrastructure can we truly drive the large-scale implementation of upstream Agent applications, fostering the formation of a new generation of "application supply chain" ecosystems centered around large models.
[1] https://mp.weixin.qq.com/s/UF2ox3WEfehk3QDMCHqXZw
[2] https://mp.weixin.qq.com/s/WdTiY8esxUuW5cqIh_NlpQ
[3] https://blog.csdn.net/Julialove102123/article/details/146196173