Agent-lens - ApsaraDB for ClickHouse - Alibaba Cloud Documentation Center

Agent-lens is an agent observability feature of ApsaraDB for ClickHouse Enterprise Edition, built on ClickHouse and Langfuse. Through end-to-end tracing, prompt management, and automated evaluation, it makes agent behavior traceable, costs measurable, and performance assessable. This addresses common challenges in production environments, such as unpredictable outputs, hidden costs, and unquantified risks.

Core capabilities

LLM observability

Observability is crucial for understanding and debugging LLM applications. Unlike traditional software, LLM applications involve complex, non-deterministic interactions that make monitoring and debugging challenging. Agent-lens provides tracing that records what happens inside your LLM application, including token consumption, LLM call latency, and tool calls, and lets you slice the data along different dimensions.

End-to-end tracing: Agent-lens divides complex execution processes into three core layers to provide complete observability.
- Session (session layer): A complete record of an agent session that involves multiple rounds of user interaction. You can use the Session view to review the entire context and pinpoint where an agent hallucinates or experiences context drift.
- Trace (trace layer): A single interaction from user input to agent output. The system breaks down inputs, outputs, execution time, and token consumption. It also visualizes the agent's execution path as a tree or graph, helping developers quickly identify performance bottlenecks and high-cost steps.
- Generation/Span (execution layer): A transparent view of all atomic operations within a trace. This includes the execution time, tokens consumed, and intermediate results for each Generation and Span, enabling targeted optimizations.
Session and user tracking: Track multi-turn conversations as sessions and associate them with user information.
Agent visualization: Display the agent's execution flow as a tree or graph.
Flexible data collection: Supports zero-code integration with mainstream LLM application development frameworks such as Dify, as well as data collection through the native Python/JS SDKs and OpenTelemetry. For a complete list of supported data collection methods, see the official integrations documentation.
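
The three layers described above can be pictured as a small nested data model. The following is a minimal sketch for illustration only: the class and field names (`Session`, `Trace`, `Observation`, `latency_ms`, `tokens`) are assumptions made for this example and are not the Agent-lens or Langfuse schema. Observations are kept in a flat list for brevity, whereas the product displays them as a tree or graph.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    """Execution layer: one atomic step inside a trace.

    In this sketch a "generation" is an LLM call and a "span" is any other step.
    """
    name: str
    kind: str            # "generation" or "span"
    latency_ms: float
    tokens: int = 0      # steps that do not call a model consume no tokens


@dataclass
class Trace:
    """Trace layer: a single interaction from user input to agent output."""
    trace_id: str
    user_input: str
    output: str
    observations: List[Observation] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(o.tokens for o in self.observations)

    def slowest_step(self) -> Optional[Observation]:
        # Returns None for a trace without observations.
        return max(self.observations, key=lambda o: o.latency_ms, default=None)


@dataclass
class Session:
    """Session layer: a multi-turn conversation associated with one user."""
    session_id: str
    user_id: str
    traces: List[Trace] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(t.total_tokens() for t in self.traces)


session = Session(session_id="s-1", user_id="u-42")
session.traces.append(Trace(
    trace_id="t-1",
    user_input="Where is my order?",
    output="It ships tomorrow.",
    observations=[
        Observation("retrieve_order", "span", latency_ms=120.0),
        Observation("answer", "generation", latency_ms=850.0, tokens=310),
    ],
))

print(session.total_tokens())                  # 310
print(session.traces[0].slowest_step().name)   # answer
```

Aggregating from the execution layer upward is what allows token consumption and latency to be reported per step, per trace, and per session.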

Prompt management

Prompt management is a systematic approach to storing, versioning, and retrieving the prompts used in your LLM applications. Instead of hardcoding prompts in your application, you can manage them centrally in Langfuse. Because prompts are hosted in Langfuse, non-technical team members can update them directly in the UI. The application automatically fetches the latest version, eliminating the need for engineering involvement or a new deployment.

Agent-lens prompt management provides the following capabilities:

Decouple prompts from code: Separate prompts and model parameters from your application code. This allows business teams to adjust prompt content directly without needing to redeploy the application.
Agile iteration: The application code only needs to reference a prompt's tag. Business teams can then modify the prompt associated with that tag, and the changes take effect in real time.
A/B testing: Supports multiple prompt versions for canary releases and A/B testing. You can route traffic to different prompts via your code and use the Trace view to quantitatively compare their performance.
Playground: Before pushing to production, you can preview and compare the execution paths and performance of an agent with different prompt versions. After confirming the results, you can release the new version.
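
The tag-based lookup and traffic routing described above can be sketched as follows. This is an illustrative in-memory model, not the Agent-lens or Langfuse API: `PromptRegistry`, `ab_tag`, the tag names `production` and `canary`, and the prompt name `support-reply` are all hypothetical. The hash-based split is one common way to keep a given user on the same variant and is an assumption of this example.

```python
import hashlib
from typing import Dict, List


class PromptRegistry:
    """Hypothetical in-memory stand-in for a central prompt store."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}    # name -> texts (index = version - 1)
        self._tags: Dict[str, Dict[str, int]] = {}   # name -> tag -> version number

    def publish(self, name: str, text: str) -> int:
        """Store a new version of a prompt and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        return len(versions)

    def tag(self, name: str, tag: str, version: int) -> None:
        """Point a tag at a specific version; moving a tag needs no code change."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._tags.setdefault(name, {})[tag] = version

    def get(self, name: str, tag: str) -> str:
        """Resolve a tag to the prompt text it currently points at."""
        return self._versions[name][self._tags[name][tag] - 1]


def ab_tag(user_id: str, canary_percent: int) -> str:
    """Deterministically route a user to the 'canary' or 'production' tag."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "production"


registry = PromptRegistry()
v1 = registry.publish("support-reply", "Answer briefly: {question}")
v2 = registry.publish("support-reply", "Answer briefly and cite the order ID: {question}")
registry.tag("support-reply", "production", v1)
registry.tag("support-reply", "canary", v2)

# The application references only a tag, never a hardcoded prompt.
chosen_tag = ab_tag("u-42", canary_percent=10)
prompt = registry.get("support-reply", chosen_tag).format(question="Where is my order?")

# Promoting the new version: the next fetch of "production" returns version 2.
registry.tag("support-reply", "production", v2)
```

Recording the chosen tag on each trace is what makes it possible to compare the two variants afterwards.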

Evaluation

Agent-lens provides a repeatable verification mechanism for LLM application behavior that enables data-driven decisions. Evaluation also helps you catch regressions before releasing changes. Agent-lens supports the following evaluation methods:

Offline evaluation: Test your application with a fixed dataset before deployment. You can run new prompts or models against test cases, review the scores, and iterate until you are satisfied with the performance before deploying the changes.
Online evaluation: Capture real-world issues in the production environment by scoring live trace data. When you discover an edge case not covered by your dataset, you can add it back to the dataset to create a closed-loop optimization cycle.
Scores: A score is a generic data object in Langfuse used to store evaluation results. Whether you are assessing the quality of LLM output through manual annotation, LLM-as-a-Judge models, programmatic checks, or end-user feedback, the results are stored uniformly as scores. Scores can be attached to traces, observations, sessions, or Dataset Runs.
Each score has three core attributes: a Name, a Value, and a Data Type. The supported data types are NUMERIC, CATEGORICAL, BOOLEAN, and TEXT.
LLM-as-a-Judge: Use a large language model as a judge to automatically score agent outputs based on predefined business rules and compliance requirements. This allows you to flexibly measure dimensions like relevance, safety, tone, or factual accuracy.
Manual scoring and annotation: Supports structured manual review processes. You can create and manage custom queues to assign specific traces or sessions to different reviewers or teams.
Score via SDK: Add scores programmatically using the Langfuse API or SDK. This supports building custom evaluation pipelines, performing deterministic checks (such as format validation and keyword matching), and integrating with automated workflows.
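
The score attributes and the deterministic checks mentioned above can be illustrated with a small sketch. It does not use the Langfuse SDK: the `Score` class, the `check_output` function, and the score names `valid_json` and `keyword_coverage` are hypothetical, and the way values are represented here (for example a Python `bool` for BOOLEAN) is an assumption of this example rather than the platform's storage format.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List, Union


class DataType(Enum):
    NUMERIC = "NUMERIC"
    CATEGORICAL = "CATEGORICAL"
    BOOLEAN = "BOOLEAN"
    TEXT = "TEXT"


@dataclass(frozen=True)
class Score:
    """Illustrative score object: a name, a value, and a data type, attached to a trace."""
    name: str
    value: Union[float, str, bool]
    data_type: DataType
    trace_id: str

    def __post_init__(self) -> None:
        expected = {
            DataType.NUMERIC: (int, float),
            DataType.CATEGORICAL: (str,),
            DataType.BOOLEAN: (bool,),
            DataType.TEXT: (str,),
        }[self.data_type]
        # bool is a subclass of int in Python, so reject it explicitly for NUMERIC.
        if self.data_type is DataType.NUMERIC and isinstance(self.value, bool):
            raise TypeError(f"score {self.name!r}: NUMERIC value must not be a bool")
        if not isinstance(self.value, expected):
            raise TypeError(f"score {self.name!r}: value does not match {self.data_type.value}")


def check_output(trace_id: str, output: str, required_keywords: List[str]) -> List[Score]:
    """Deterministic checks: format validation (is it JSON?) and keyword matching."""
    try:
        json.loads(output)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False

    lowered = output.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
    coverage = hits / len(required_keywords) if required_keywords else 1.0

    return [
        Score("valid_json", valid_json, DataType.BOOLEAN, trace_id),
        Score("keyword_coverage", coverage, DataType.NUMERIC, trace_id),
    ]


scores = check_output("t-1", '{"status": "shipped", "eta": "tomorrow"}', ["shipped", "refund"])
for s in scores:
    print(s.name, s.value)
# valid_json True
# keyword_coverage 0.5
```

In a real pipeline, the resulting scores would be submitted through the Langfuse API or SDK instead of being printed.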

Unique advantages

100% open source compatibility: Agent-lens is 100% compatible with the open source Langfuse at the kernel level. Whether you are a new user or migrating from the open source version, you can integrate without refactoring your existing code.
Simplified operations: Agent-lens offers one-click deployment and encapsulates a complex architecture into simple console operations.
Broader framework support: Agent-lens supports zero-code data collection for 14 mainstream frameworks, such as Dify and RAGFlow, as well as data ingestion via the OpenTelemetry protocol. It also provides a data collection plugin for OpenClaw.
Integrated log and agent observation: You can quickly enable Agent-lens on your existing ClickHouse database. By using a unified data foundation, you can integrate traditional IT monitoring logs with agent operational data. You can then combine information like business order numbers and user IDs to quickly locate business anomalies caused by agent behavior.
Intelligent LLM analysis: Building on LLM-as-a-Judge, Agent-lens provides intelligent insights for low scores. When performance bottlenecks or call failures occur, it offers root cause analysis suggestions based on the trace context.
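
The correlation of IT logs with agent data described under "Integrated log and agent observation" amounts to a join on a shared business key. The sketch below shows the idea on in-memory rows. All names and values (`app_logs`, `agent_traces`, `order_id`, the sample messages) are hypothetical, and in practice both datasets would be tables in the same ClickHouse database and would be joined with a query rather than in application code.

```python
from typing import Dict, List, Tuple

# Hypothetical application log rows.
app_logs: List[Dict] = [
    {"order_id": "o-1001", "user_id": "u-42", "level": "ERROR", "message": "refund issued twice"},
    {"order_id": "o-1002", "user_id": "u-7", "level": "INFO", "message": "order shipped"},
]

# Hypothetical agent trace rows that carry the same business keys.
agent_traces: List[Dict] = [
    {"trace_id": "t-1", "order_id": "o-1001", "user_id": "u-42",
     "tool_calls": ["issue_refund", "issue_refund"]},
    {"trace_id": "t-2", "order_id": "o-1002", "user_id": "u-7",
     "tool_calls": ["lookup_order"]},
]


def traces_behind_errors(logs: List[Dict], traces: List[Dict]) -> List[Tuple[str, Dict]]:
    """Return (log message, trace) pairs that share an order ID with an ERROR log."""
    by_order: Dict[str, List[Dict]] = {}
    for trace in traces:
        by_order.setdefault(trace["order_id"], []).append(trace)
    return [
        (log["message"], trace)
        for log in logs
        if log["level"] == "ERROR"
        for trace in by_order.get(log["order_id"], [])
    ]


for message, trace in traces_behind_errors(app_logs, agent_traces):
    print(message, "->", trace["trace_id"], trace["tool_calls"])
# refund issued twice -> t-1 ['issue_refund', 'issue_refund']
```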

Use cases

Agent-lens is suitable for all stages of LLM application development, from prototyping to production monitoring.

Scenario	Capability	Description
Development and debugging	Issue troubleshooting	When an LLM produces an unexpected output, use end-to-end tracing to quickly determine whether the root cause is the prompt, incorrectly retrieved content, or misconfigured model call parameters.
	Complex workflow visualization	For complex applications involving multi-step reasoning, tool calls, or agent collaboration, visualize the execution path as a graph to help developers understand the internal logic.
	Cost and performance analysis	Monitor token consumption and latency for each call during the development phase to optimize prompts and reduce unnecessary costs.
Prompt engineering and management	Centralized management	Manage prompts scattered across your code centrally on the Langfuse platform, with support for version control and rollbacks.
	A/B testing and iteration	Non-technical team members, such as product managers, can modify prompts directly in the UI. The changes take effect in real time, accelerating the iteration cycle without requiring code redeployment.
	Performance comparison	Use the evaluation feature to compare the performance of different prompt versions on the same dataset and select the best option.
Quality assurance and evaluation	Offline regression testing	Before releasing new features or modifying prompts, run automated evaluation experiments to ensure changes do not cause regressions in existing functionality.
	Multi-dimensional scoring	Use LLM-as-a-Judge, rule-based checks, or manual annotation to quantitatively score outputs for accuracy, safety, relevance, and other dimensions.
	Golden dataset management	Maintain a standard set of test cases to serve as a benchmark for application quality.
Production environment monitoring	Real-time observability	Monitor key metrics in the production environment, such as error rates, average response times, and token costs, to promptly detect anomalies.
	User behavior analysis	Track the multi-turn interaction history of specific users or sessions to analyze user satisfaction and common issue patterns.
	Edge case discovery	Identify low-scoring outputs through online evaluation and automatically route them to a review queue or to your test dataset, creating a closed-loop optimization cycle.
Team collaboration and compliance	Cross-functional collaboration	Engineers, product managers, and data scientists can share the same set of trace data and evaluation results. This breaks down information silos.
	Auditing and traceability	Retain complete input/output logs and operational records to meet internal compliance or external regulatory requirements.
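
As an illustration of the production monitoring rows in the table above, the following sketch computes an error rate, an average response time, and total token consumption from a batch of trace records, then selects low-scoring traces as candidates for a review queue or test dataset. The `TraceRecord` fields, the sample values, and the 0.5 threshold are assumptions made for this example and do not reflect the Agent-lens data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TraceRecord:
    trace_id: str
    latency_ms: float
    tokens: int
    error: bool
    score: Optional[float] = None   # online evaluation score in [0, 1], if one exists


def summarize(records: List[TraceRecord]) -> Dict[str, float]:
    """Aggregate key production metrics over a batch of traces."""
    n = len(records)
    if n == 0:
        return {"error_rate": 0.0, "avg_latency_ms": 0.0, "total_tokens": 0}
    return {
        "error_rate": sum(r.error for r in records) / n,
        "avg_latency_ms": sum(r.latency_ms for r in records) / n,
        "total_tokens": sum(r.tokens for r in records),
    }


def low_scoring(records: List[TraceRecord], threshold: float) -> List[str]:
    """Return IDs of scored traces below the threshold; unscored traces are skipped."""
    return [r.trace_id for r in records if r.score is not None and r.score < threshold]


records = [
    TraceRecord("t-1", 800.0, 300, False, 0.9),
    TraceRecord("t-2", 1200.0, 500, False, 0.2),
    TraceRecord("t-3", 400.0, 0, True),
    TraceRecord("t-4", 1600.0, 700, False, 0.4),
]

print(summarize(records))
# {'error_rate': 0.25, 'avg_latency_ms': 1000.0, 'total_tokens': 1500}
print(low_scoring(records, threshold=0.5))
# ['t-2', 't-4']
```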