What is AgentLoop
AgentLoop is a full-lifecycle data observability and data flywheel platform from Alibaba Cloud for large language model (LLM) applications. It helps enterprises build a sustainable, self-evolving feedback loop for their AI agents. Working from key runtime data such as traces, logs, metrics, and conversations, AgentLoop provides end-to-end capabilities: data collection and observation, visual problem identification, evaluation and experimentation, and data refinement into evaluation sets, post-training datasets, and long-term memory that continuously improve online performance.
AgentLoop is not a traditional monitoring tool. It centers on the AI agent and turns runtime data into a data flywheel that drives continuous improvements in reliability, stability, and performance, so that agents can improve iteratively, verifiably, and at scale in a production environment.
Product positioning
AgentLoop is positioned as an "AI agent performance optimization platform" that provides end-to-end observability, evaluation, and monitoring capabilities:
Track the prompt, model output, latency, token consumption, and cost for every call.
Use custom evaluation rules, collect human feedback, run A/B tests, and manage prompt versions.
Integrate with mainstream frameworks and use visual dashboards to gain real-time insights into behavior, efficiently debug issues, optimize prompt engineering, and control inference costs.
Improve the reliability, iteration efficiency, and business value of LLM applications in production environments.
Features
1. End-to-end observability
Model application monitoring: Connect to your AI applications to view model application lists, application details, and topology relationships.
Trace analysis: Provides span lists, trace lists, scatter plots, end-to-end aggregation, end-to-end topology, and slow/error trace analysis.
Multi-dimensional metrics: Covers core metrics such as request count, error count, latency, token usage, session count, and user count.
Scenario-based analysis: Supports analysis of AI-specific operations, including embedding analysis, retrieval augmentation, tool calls, and method calls.
2. Data asset management (Dataset)
A Dataset in AgentLoop is a new type of data storage designed for AI scenarios. It turns read-only logs into manageable assets by providing full CRUD operations, a flexible schema, vector search, and multi-dimensional analysis capabilities.
Custom schema: Supports multiple field types such as text, long, double, and json. The json type supports indexing of nested sub-fields.
Full CRUD: Use standard SQL to perform INSERT, UPDATE, and DELETE operations to revise and evolve data.
Multi-dimensional retrieval: Combines full-text search, semantic search, and SQL analysis in four mix-and-match query modes.
Version traceability: Each data entry is automatically assigned a unique ID to support tracing, exporting, and regression testing.
3. Evaluation framework (Evaluation)
The evaluation framework provides developers with a measurable, reproducible, and automated quality governance system to address the engineering challenges posed by the non-deterministic nature of large models:
Quantify uncertainty: Transform ambiguous semantic feedback into precise statistical metrics.
Drive agile iteration: Reduce experiment cycles from weeks to minutes through automated assessment.
Ensure deployment reliability: Establish a standardized regression testing set to prevent regressions when fixing bugs.
Pre-built evaluators:
Category | Evaluator | Description |
General scenarios | | Detects whether the model output contains offensive, harmful, or inappropriate language. |
General scenarios | | Evaluates whether the model output complies with security and compliance requirements. |
General scenarios | | Evaluates whether the model output is logical and coherent. |
General scenarios | | Evaluates whether the model output covers the necessary information to answer the user's question. |
RAG evaluation | | Evaluates how well the retrieved context matches the user's question. |
RAG evaluation | | Evaluates whether the model's generated answer directly addresses the user's question. |
RAG evaluation | | Evaluates the richness and diversity of information in the retrieval results. |
RAG evaluation | | Detects redundant or duplicate content in the retrieved context. |
Tool use | | Evaluates whether the model selected the correct tool to handle the user's request. |
Tool use | | Evaluates whether the parameters passed by the model to the tool are accurate and complete. |
Agent evaluation | | Evaluates the overall quality and rationality of the agent's execution trajectory. |
Agent evaluation | | Evaluates the rationality of the tool selections made by the agent during execution. |
Agent evaluation | | Evaluates the success rate of the agent's tool calls. |
Custom evaluator: You can write custom evaluation prompts to use an LLM-as-a-Judge. This allows for quantitative scoring and in-depth diagnosis of AI application outputs based on custom dimensions, standards, and weights.
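As a rough illustration of how custom dimensions and weights might combine into one score, the following Python sketch builds a judge prompt per dimension and takes a weighted average. It does not use the AgentLoop API: the dimension names, the weights, the 1 to 5 scale, and the stub_judge function (a placeholder for a real model call) are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical dimensions and weights; a real evaluator would define its own.
WEIGHTS: Dict[str, float] = {"accuracy": 0.5, "completeness": 0.3, "tone": 0.2}


def build_judge_prompt(question: str, answer: str, dimension: str) -> str:
    """Render the evaluation prompt sent to the judge model for one dimension."""
    return (
        f"Rate the answer on '{dimension}' from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )


def weighted_score(
    question: str,
    answer: str,
    judge: Callable[[str], str],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Ask the judge once per dimension and return the weighted average score."""
    total_weight = sum(weights.values())
    total = 0.0
    for dimension, weight in weights.items():
        raw = judge(build_judge_prompt(question, answer, dimension))
        score = min(5, max(1, int(raw.strip())))  # clamp the reply to the 1..5 scale
        total += weight * score
    return total / total_weight


def stub_judge(prompt: str) -> str:
    """Placeholder for a model call: returns 3 for tone and 4 for everything else."""
    return "3" if "'tone'" in prompt else "4"


overall = weighted_score("What is AgentLoop?", "An observability platform.", stub_judge)
```

Replacing stub_judge with a function that calls a model service leaves the scoring logic unchanged, which keeps the weighting easy to test separately from the model.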
4. Experiments and playground
Experiment records and the playground provide a complete feedback loop, from prompt engineering and debugging to large-scale automated evaluation:
Experiment Plan: Create and manage experiment plans to establish business baselines and save combinations of model services, prompt templates, datasets, and evaluators.
Playground: Configure multiple sets of experiments in a visual interface, adjust inference parameters (such as Temperature and Top-p) in real time, and call data sources for single-instance or batch validation.
Experiment Records: An asset library for experiment executions. It records a snapshot of each task, including model service details, token consumption, Time to First Token (TTFT), and quantitative scores from evaluators.
Comparative Analysis: A multi-dimensional regression analysis tool that supports side-by-side comparison of two to five experiment records. It covers evaluation metric trends, configuration parameter differences, and sample-level semantic comparisons.
5. Long-term memory (Memory)
AgentLoop Memory is the core memory layer for an AI agent and provides persistent memory capabilities:
Maintain consistency across sessions: Persistently store key information, such as conversation history, task status, and decision rationale. This information is efficiently retrieved and injected as context into new interactions to provide the model with relevant background.
Enable highly adaptive personalization: Systematically record user preferences, such as formatting requirements and communication style, as well as historical behavior patterns and long-term goals. This allows the model to generate highly customized outputs.
Support deep reasoning based on historical information: Enhance the continuity and intelligence of interactions by empowering the AI agent to remember, learn, and evolve.
Memory strategies:
Facts: Extracts specific facts, events, and user-related preferences.
Episodic: Records and recalls specific events or interaction experiences, documenting what happened, when, and where.
Summary: Condenses user interaction content to extract key information and form a concise, coherent semantic representation.
Custom strategy: An extraction strategy defined by the user.
Benefits
1. End-to-end feedback loop
AgentLoop is not just an isolated monitoring tool but a data flywheel platform that covers the entire lifecycle of AI applications:
Data collection: Automatically collects runtime data such as traces, logs, metrics, and conversations.
Visual observation: Provides rich dashboards and trace analysis to quickly identify issues.
Evaluation and experimentation: Supports prompt tuning, model comparison, and automated evaluation.
Data refinement: Refines high-quality data into evaluation sets and post-training datasets.
Continuous optimization: Improves online performance through long-term memory and the data flywheel.
2. Enterprise-grade security and compliance
Multi-tenant data isolation: Keeps each tenant's data strictly separated and invisible to other tenants.
Complete audit logs: All create, read, update, and delete (CRUD) operations are recorded to meet enterprise compliance and audit requirements.
Data security: Powered by Alibaba Cloud's mature security system, providing capabilities like data encryption and access control.
3. Auto scaling and high availability
Auto scaling: Automatically adjusts resources based on business load without manual intervention.
High-concurrency support: Ensures timely data writing and retrieval even during traffic peaks.
Massive data storage: Built on the underlying storage of Log Service (SLS), it supports petabyte-scale data storage and queries that return within seconds.
4. Deep integration and open ecosystem
Framework integration: Deeply integrates with mainstream AI frameworks such as LangChain and LlamaIndex.
SDK support: Provides SDKs for multiple programming languages, including Python and Java.
Open APIs: Offers a complete set of APIs for custom integration and extension.
MCP Server: Supports MCP Server access for seamless integration with existing agent frameworks.
5. Cost optimization and FinOps
Token consumption analysis: Accurately calculates the token consumption and actual costs generated by each experiment and online call.
Cost optimization suggestions: Provides quantitative data to help decision-makers choose the most cost-effective model combination.
Resource utilization monitoring: Monitors storage and computing resource usage to prevent waste.
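The arithmetic behind token cost analysis and model selection can be sketched in a few lines of Python. This is an illustration only: the ModelPrice structure, the per-1,000-token prices, and the model names are hypothetical and do not reflect real pricing or the AgentLoop API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical price per 1,000 tokens, in an arbitrary currency unit."""
    input_per_1k: float
    output_per_1k: float


def call_cost(input_tokens: int, output_tokens: int, price: ModelPrice) -> float:
    """Cost of one call: input and output tokens are priced separately."""
    return (input_tokens / 1000) * price.input_per_1k + (
        output_tokens / 1000
    ) * price.output_per_1k


def cheapest_acceptable(
    costs: Dict[str, float], scores: Dict[str, float], min_score: float
) -> Optional[str]:
    """Pick the lowest-cost model whose evaluation score meets the quality bar."""
    acceptable = [model for model in costs if scores.get(model, 0.0) >= min_score]
    return min(acceptable, key=lambda model: costs[model]) if acceptable else None


# Illustrative numbers only.
cost = call_cost(input_tokens=2000, output_tokens=500, price=ModelPrice(0.5, 1.5))
choice = cheapest_acceptable(
    costs={"model-a": 1.0, "model-b": 0.4, "model-c": 0.7},
    scores={"model-a": 0.9, "model-b": 0.5, "model-c": 0.8},
    min_score=0.75,
)
```

Filtering by a minimum evaluation score before comparing cost keeps the cheapest model from winning when its quality is unacceptable.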
Core concepts
Dataset
A dataset is a new type of data storage designed by AgentLoop for AI scenarios. It is the core carrier for managing the entire lifecycle of AI application data.
Field types
Type | Description | Optional capabilities | Example |
text | Text type | chn: Enables Chinese word segmentation. embedding: Enables vector indexing. | question, answer |
long | Long integer type | --- | input_tokens, latency_ms |
double | Floating-point type | --- | score, confidence |
json | Nested JSON type | json_keys: Defines sub-field indexes. | metadata, scores |
Built-in fields
Field | Type | Description |
id | text | A unique primary key automatically generated by the system. UPDATE/DELETE operations must use this field. |
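To show why UPDATE and DELETE are keyed on id, the sketch below uses Python's built-in sqlite3 module as a stand-in for a Dataset. It does not connect to AgentLoop: the table name qa_dataset, its columns, and the UUID-based id generation are assumptions made for the example.

```python
import sqlite3
import uuid


def new_row_id() -> str:
    """Stand-in for the system-generated primary key."""
    return uuid.uuid4().hex


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qa_dataset ("
    "id TEXT PRIMARY KEY, question TEXT, answer TEXT, score REAL)"
)

# INSERT: add a sample with a freshly generated id.
row_id = new_row_id()
conn.execute(
    "INSERT INTO qa_dataset (id, question, answer, score) VALUES (?, ?, ?, ?)",
    (row_id, "What is AgentLoop?", "A monitoring tool.", 0.4),
)

# UPDATE: revise the sample after annotation, addressing it by id.
conn.execute(
    "UPDATE qa_dataset SET answer = ?, score = ? WHERE id = ?",
    ("A data observability and data flywheel platform.", 0.9, row_id),
)
updated = conn.execute(
    "SELECT answer, score FROM qa_dataset WHERE id = ?", (row_id,)
).fetchone()

# DELETE: remove the sample, again by id.
conn.execute("DELETE FROM qa_dataset WHERE id = ?", (row_id,))
remaining = conn.execute("SELECT COUNT(*) FROM qa_dataset").fetchone()[0]
```

Because every change targets a single id, each revision can be traced back to one specific sample.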
Evaluation
Evaluation is the core of the AgentLoop quality governance system. It transforms vague "semantic feelings" into precise "statistical metrics" through automated assessments.
Evaluation task components
Data source: Supports three data sources: traces/spans, logs (Logstore), and datasets.
Evaluator: An automated scoring mechanism based on LLM-as-a-Judge.
Sampling strategy: Supports setting a sampling rate and a maximum number of samples to balance evaluation coverage and cost.
Execution strategy: Supports two modes: continuous evaluation based on new data and evaluation based on historical data.
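The interaction between a sampling rate and a maximum number of samples can be sketched as follows. This is a minimal illustration rather than AgentLoop's actual sampling logic: the function name, its parameters, and the fixed seed are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_for_evaluation(
    records: Iterable[T], sampling_rate: float, max_samples: int, seed: int = 0
) -> List[T]:
    """Keep each record with probability sampling_rate, stopping at max_samples."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    selected: List[T] = []
    for record in records:
        if len(selected) >= max_samples:
            break  # the cap bounds evaluation cost
        if rng.random() < sampling_rate:
            selected.append(record)
    return selected


sampled = sample_for_evaluation(range(1000), sampling_rate=0.1, max_samples=20)
```

The rate controls coverage and the cap bounds cost. In this simple version the cap favors records that arrive earlier, so a production implementation might prefer reservoir sampling when even coverage matters.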
Evaluator types
Preset evaluator: A general-purpose evaluator built into the system. It covers dimensions such as toxicity, security, coherence, and completeness.
RAG evaluator: An evaluator designed for Retrieval-Augmented Generation (RAG) scenarios.
Tool use evaluator: Evaluates the correctness of AI agent tool selection and parameter passing.
Agent evaluator: Evaluates the AI agent's execution trace, the reasonableness of tool selection, and the call success rate.
Custom evaluator: You can write custom evaluation prompts based on your business scenarios.
Experiment
An experiment is a tool provided by AgentLoop for prompt engineering and model performance optimization. It supports batch experiment runs with multiple configuration sets and in-depth comparative analysis.
Experiment plan
An experiment plan establishes a business baseline. It saves combinations of model services, prompt templates, datasets, and evaluators to ensure experiment traceability and environmental consistency.
Playground
The playground is the experiment execution engine. It supports the following:
Load configurations from an experiment plan with one click.
Run batch inference tasks.
Get instant feedback on evaluation results from the LLM-as-a-Judge evaluator.
Adjust inference parameters, such as Temperature and Top-p, in real time.
Experiment records
An experiment record is a task execution log and a results snapshot. It includes:
Model service details.
Token consumption (cost).
Time to First Token (TTFT).
Quantified scores from the evaluator.
Task execution status.
Comparative analysis
Comparative analysis is a decision support tool. It supports the following:
Select two to five experiment records for side-by-side comparison.
Set a baseline group to calculate the delta (difference) for the experimental group across dimensions such as accuracy, time consumption, and cost.
Highlight text-level differences to quickly locate output variations.
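The delta calculation against a baseline group can be sketched in Python. The record names and metric values below are hypothetical, and the function is an illustration of the idea rather than the AgentLoop implementation.

```python
from typing import Dict

Metrics = Dict[str, float]


def compare_to_baseline(
    records: Dict[str, Metrics], baseline: str
) -> Dict[str, Metrics]:
    """Return per-metric deltas (candidate minus baseline) for each other record."""
    if not 2 <= len(records) <= 5:
        raise ValueError("comparison expects two to five experiment records")
    if baseline not in records:
        raise KeyError(f"unknown baseline record: {baseline}")
    base = records[baseline]
    return {
        name: {
            metric: round(values[metric] - base[metric], 6)
            for metric in base
            if metric in values
        }
        for name, values in records.items()
        if name != baseline
    }


# Hypothetical experiment records; the numbers are illustrative only.
records = {
    "prompt-v1": {"accuracy": 0.82, "latency_ms": 1200.0, "cost": 0.40},
    "prompt-v2": {"accuracy": 0.87, "latency_ms": 950.0, "cost": 0.55},
}
deltas = compare_to_baseline(records, baseline="prompt-v1")
```

A positive delta is an improvement for accuracy but a regression for latency and cost, so each metric needs its own interpretation when a decision is made.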
Memory
The AgentLoop MemoryStore is the core memory layer for AI agents. It provides persistent memory capabilities.
MemoryStore
A MemoryStore is a storage container for memory data. It stores all short-term and long-term memory information for an AI agent or application.
Memory Strategy
A memory strategy is a series of memory extraction rules that determine how information is processed from short-term memory into long-term memory:
Strategy | Description |
Facts | Extracts specific facts, events, and user-related preferences. |
Episodic | Records and recalls specific events or interaction experiences, including when, where, and what happened. |
Summary | Condenses and summarizes user interaction content. It extracts key information to form a concise and coherent semantic representation. |
Custom strategy | An extraction strategy defined by the user. |
Event
An event is the basic unit of short-term memory. It corresponds to a piece of raw data sent by the client.
Short-term memory
Short-term memory stores conversations to track immediate context. It is the core unit for recording single-event context and is mainly used to maintain real-time contextual consistency and continuity in a session.
Long-term memory
Long-term memory stores extracted insights. It is a core functional module for persistently storing key user information, behavioral patterns, and business knowledge. It supports context awareness and personalized services across sessions and over time.
Model application
A model application is the core observable object in AgentLoop. It represents an AI application instance.
Application detail dimensions
Instance overview: Number of requests, number of errors, time consumption, number of instances, and CPU usage.
Associated instances: Application interfaces, Kubernetes clusters, infrastructure, and upstream/downstream dependencies.
Associated topology: The upstream and downstream topology network related to the application.
Application overview: Number of model calls, token usage, number of traces, number of spans, number of sessions, and number of users.
Performance analysis: Trends in the number of model calls, number of errors, and time consumption.
Token analysis: Token usage, average token usage per session, and average token usage per request.
Operation analysis: Embedding analysis, retrieval augmentation, tool calls, and method calls.
Trace analysis: Span list, trace list, scatter plot, end-to-end aggregation, and end-to-end topology.
Relationship with CloudMonitor 2.0
AgentLoop is deeply integrated with the overall CloudMonitor 2.0 product:
Unified console: Provides a dedicated entry point for AI application observability in the CloudMonitor 2.0 console.
Data interoperability: The observability data of AI applications is integrated with infrastructure monitoring data to achieve full-stack observability.
Alert integration: Supports configuring alert rules based on AI application metrics, which are integrated with the CloudMonitor alert system.
Unified permissions: Reuses the Resource Access Management (RAM) permission system of CloudMonitor 2.0 to achieve unified identity authentication and access control.
Relationship with SLS
Log Service (SLS) serves as the underlying infrastructure, providing fundamental data storage and computing capabilities:
Logstore: Stores raw log data.
Metricstore: Stores metric data.
AgentLoop builds on SLS to provide business abstractions and high-level capabilities. This lets you focus on your application without managing underlying storage details.
Use cases
Scenario 1: AI application performance monitoring and troubleshooting
Description: A company needs to monitor the performance of its AI customer service application in real time so that it can quickly find and resolve problems after the application goes live.
AgentLoop solution:
Integrate the AI application using the AgentLoop SDK to automatically collect trace, log, and metric data.
View real-time metrics, such as the number of requests, errors, and latency, on the model application page.
Use trace analysis to locate specific slow or failed requests.
Use token analysis to monitor costs.
Value: Reduces the average time to find failures from hours to minutes, minimizing business losses.
Scenario 2: Prompt engineering optimization and performance evaluation
Description: An AI application development team needs to continuously optimize prompts to improve the quality of model outputs.
AgentLoop solution:
Configure multiple sets of prompt experiments in the playground.
Use datasets to perform batch validation.
Use the evaluator to automatically score results and quantify prompt performance.
Use comparative analysis to identify the best prompt version.
Incorporate the optimized prompts into the experiment plan.
Value: Reduces the prompt iteration cycle from weeks to days and improves model output quality by over 30%.
Scenario 3: Model version upgrade regression testing
Description: A company plans to upgrade its underlying model version and needs to evaluate the new version's performance.
AgentLoop solution:
Use a dataset to build an evaluation benchmark set.
Create a comparative experiment to run the old and new model versions simultaneously.
Use the evaluator to compare performance across multiple dimensions.
Use comparative analysis to identify performance degradation.
Make an upgrade decision based on the data.
Value: Avoids business risks from blind upgrades and ensures the stability of the model upgrade process.
Scenario 4: Bad case management and the data flywheel
Description: After an AI application goes live, it generates many bad cases that need to be systematically managed and optimized.
AgentLoop solution:
Use evaluation tasks to automatically identify low-scoring samples.
Import bad cases into a dataset for manual annotation.
Update the data after annotation to generate optimization suggestions.
Compile the high-quality data into a training dataset.
Use the data flywheel to continuously optimize model performance.
Value: Establishes a data-driven, continuous optimization loop, making the AI application smarter with use.
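The bad-case workflow above can be sketched as two small Python steps: select low-scoring samples for annotation, then turn the annotated samples into training rows. The field names (question, answer, score, annotation) and the threshold are assumptions for illustration and are not a documented AgentLoop schema.

```python
from typing import Dict, List


def select_bad_cases(results: List[Dict], threshold: float) -> List[Dict]:
    """Keep samples scored below the threshold and add an empty annotation slot."""
    bad_cases = []
    for result in results:
        if result["score"] < threshold:
            bad_cases.append({**result, "annotation": None})
    return bad_cases


def to_training_rows(annotated: List[Dict]) -> List[Dict]:
    """Use the human-corrected answer as the target; skip unannotated samples."""
    return [
        {"question": case["question"], "answer": case["annotation"]}
        for case in annotated
        if case.get("annotation")
    ]


# Hypothetical evaluation results.
results = [
    {"question": "How do I reset my password?", "answer": "I don't know.", "score": 0.2},
    {"question": "What are your opening hours?", "answer": "9:00 to 18:00.", "score": 0.9},
]
bad_cases = select_bad_cases(results, threshold=0.6)

# Simulate manual annotation of the first bad case.
bad_cases[0]["annotation"] = "Open Settings, choose Security, then select Reset password."
training_rows = to_training_rows(bad_cases)
```

Keeping the original answer next to the annotation preserves the before-and-after pair, which is useful for later regression testing.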
Scenario 5: Building long-term memory for an AI agent
Description: You are building a personalized AI assistant that needs to remember user preferences and interaction history.
AgentLoop solution:
Create a MemoryStore.
Configure memory strategies, such as Facts, Episodic, and Summary.
Use the SDK to add conversation records.
During a conversation, retrieve relevant memories and inject them into the context.
Provide personalized responses based on memory.
Value: Enhances user experience and interaction consistency, allowing the AI assistant to truly 'understand' users.
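The retrieve-and-inject step can be illustrated with a toy in-memory store. This sketch does not use the AgentLoop SDK: ToyMemoryStore, its methods, and the word-overlap ranking (a crude stand-in for semantic retrieval) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyMemoryStore:
    """In-memory stand-in for a memory store; holds plain-text memories."""
    memories: List[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.memories.append(text)

    def search(self, query: str, top_k: int = 2) -> List[str]:
        """Rank memories by the number of words shared with the query."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(memory.lower().split())), memory)
            for memory in self.memories
        ]
        relevant = [item for item in scored if item[0] > 0]
        relevant.sort(key=lambda item: item[0], reverse=True)
        return [memory for _, memory in relevant[:top_k]]


def build_prompt(store: ToyMemoryStore, user_message: str) -> str:
    """Inject retrieved memories into the context ahead of the user message."""
    context = "\n".join(f"- {memory}" for memory in store.search(user_message))
    return f"Known about the user:\n{context}\n\nUser: {user_message}"


store = ToyMemoryStore()
store.add("User prefers answers formatted as bullet lists")
store.add("User is planning a trip to Hangzhou")
store.add("User likes green tea")

prompt = build_prompt(store, "Suggest a trip itinerary for Hangzhou")
```

Only memories that share words with the query are injected, which keeps the prompt short. A real memory layer would typically rank by semantic similarity instead of exact word overlap.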
Best practices
Practice 1: Establish a complete evaluation system
Define evaluation metrics: Determine core evaluation metrics, such as accuracy, security, and compliance, based on your business scenarios.
Build evaluation sets: Use Dataset to build evaluation datasets that cover core business scenarios.
Configure evaluation tasks: Create continuously running evaluation tasks to monitor online data quality.
Set alert thresholds: Set alerts for key evaluation metrics to promptly identify quality issues.
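A threshold check on an evaluation metric can be sketched as follows. The function name, the window of recent scores, the threshold, and the minimum sample count are hypothetical and do not describe how CloudMonitor alert rules are configured.

```python
from statistics import mean
from typing import Sequence


def should_alert(
    recent_scores: Sequence[float], threshold: float, min_samples: int = 5
) -> bool:
    """Alert when the mean of recent scores drops below the threshold.

    Windows with fewer than min_samples scores never alert, which avoids
    firing on a single noisy evaluation.
    """
    if len(recent_scores) < min_samples:
        return False
    return mean(recent_scores) < threshold


# Illustrative windows of evaluator scores.
degraded = should_alert([0.9, 0.8, 0.4, 0.3, 0.2], threshold=0.6)
healthy = should_alert([0.9, 0.9, 0.9, 0.9, 0.9], threshold=0.6)
too_few = should_alert([0.1], threshold=0.6)
```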
Practice 2: Establish prompt version management standards
Manage prompts with experiment plans: Create a separate experiment plan for each business scenario.
Follow version naming conventions: Use semantic version numbers, such as v1.0.0-basic.
Record changes: Document the reason for and effect of each change in the experiment description.
Perform regular regressions: Rerun historical experiments periodically to confirm that new versions do not regress.
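A naming convention is easier to keep when it is checked automatically. The Python sketch below validates names shaped like v1.0.0-basic and sorts them numerically. The exact pattern (a lowercase label after the hyphen) is an assumption inferred from the single example above, not a rule that AgentLoop enforces.

```python
import re
from typing import Tuple

# Assumed convention: v<major>.<minor>.<patch>-<label>, for example v1.0.0-basic.
VERSION_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-([a-z0-9]+(?:-[a-z0-9]+)*)$")


def parse_version(name: str) -> Tuple[int, int, int, str]:
    """Split a version name into numeric parts and a label, or raise ValueError."""
    match = VERSION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"version name does not follow the convention: {name!r}")
    major, minor, patch, label = match.groups()
    return int(major), int(minor), int(patch), label


# Sorting by the parsed tuple orders v1.10.0 after v1.2.0, unlike a string sort.
names = ["v1.10.0-cot", "v1.2.0-basic", "v1.0.0-basic"]
ordered = sorted(names, key=parse_version)
```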
Practice 3: Build a data flywheel
Collect data: Collect all runtime data from your AI applications.
Clean data: Use evaluation tasks to automatically identify and label problematic data.
Store data: Store high-quality data in Dataset to build your enterprise data assets.
Apply data: Use the data for model fine-tuning, prompt optimization, and knowledge base updates.
Practice 4: Optimize costs
Monitor token consumption: Use token analysis to track the cost of each application.
Select models: Compare the cost-effectiveness of different models through experiments.
Use sampling strategies: Use sample-based evaluation for non-core data to reduce costs.
Clean up resources: Periodically delete unused datasets and experiment records.