All Products
Search
Document Center

Simple Log Service:Enable Controlled Operation of OpenClaw with One-Click SLS Integration

Last Updated:Mar 14, 2026

This topic describes how to use the Alibaba Cloud Simple Log Service (SLS) Integration Center to integrate OpenClaw AI Agent logs with one click. You can use the built-in audit and observability dashboards to create an out-of-the-box solution for security audits and operational monitoring.

Background information

OpenClaw security risks: Why controlled operation is crucial

OpenClaw is one of the most prominent open source AI Agent platforms of 2026. It allows large language models (LLMs) to directly manipulate the file system, run Shell commands, browse the web, and send messages. This capability transforms the inference capabilities of an LLM into actual system operations. This "autonomous execution" capability is both its core value and its core risk.

  • Industry security incidents: Risks are real, not hypothetical

    In early 2026, several security vendors disclosed a series of vulnerabilities and incidents related to OpenClaw.

    Source

    Findings

    Security research statistics

    Over 40,000 OpenClaw instances are accessible on the public internet across multiple countries. About 15,000 of these are unpatched or use default configurations, posing a risk of remote control. About 93% of exposed instances have severe authentication bypass vulnerabilities.

    GitHub Security Advisory

    (GHSA-g8p2-7wf7-98mq)

    The Control UI trusts the gatewayUrl in the URL parameters and connects automatically. If a user clicks a malicious link, the gateway token can be stolen and sent to an attacker's server. This leads to one-click remote code execution (RCE) with a CVSS score of 8.8, even if the gateway is only listening on the local machine. This was fixed in v2026.1.29.

    Skills supply chain

    Over 800 malicious Skills were found in the OpenClaw Skills registry, making up about 20% of all published packages. These malicious Skills include credential theft and back door implantation. Installing unreviewed Skills is equivalent to escalating Agent permissions.

    Research from Unit 42 and others

    Indirect Prompt Injection (IDPI) has been observed in real-world scenarios. Attackers embed hidden instructions in web page content. When the Agent fetches the content, it mistakenly runs the instructions, leading to data exfiltration or unauthorized operations.

    Regulation and warnings

    Regulators in multiple countries are focusing on the risks of AI agents. The Ministry of Industry and Information Technology (MIIT) issued a warning about the security risks of the open source OpenClaw AI agent, recommending timely updates and security hardening.

  • Code audit data: The frequency of OpenClaw's own security fixes

    Industry reports illustrate the external threat landscape. An audit of OpenClaw's own code repository reveals another dimension: the project itself frequently fixes security issues. By analyzing the security semantics of Git history and commit messages, you can quantify the scale and distribution of security-related code changes over a period. This helps determine where the attack surface is concentrated.

    Filtering and classifying recent commits in OpenClaw show that risks are highly concentrated in the entry and execution layers:

    Module

    Number of security fixes

    Percentage

    Main risks

    src/tools/

    52

    35%

    Command injection, directory traversal

    src/gateway/

    38

    26%

    Access control, authentication and authorization

    src/auth/

    18

    12%

    Authentication bypass, CSRF

    src/sandbox/

    15

    10%

    Directory traversal, SSRF

    src/hooks/

    12

    8%

    Prompt injection, information leakage

    The tool execution layer (tools/) and the ingress gateway layer (gateway/) are where the risks of "autonomous operation" and "multi-entry access" manifest. Static code audits can only cover committed changes. They cannot account for runtime behavior variations, configuration combinations, or attack paths driven by external inputs.

  • Why runtime protection alone is not enough

    When configured correctly, OpenClaw's architecture can effectively reduce the attack surface. However, from a security engineering perspective, it relies on runtime checks within the same trust domain, which has several inherent limitations:

    Protection Layer

    Mechanism and Capabilities

    Inherent Limitations

    Tool Policy Pipeline

    Before a tool is called, a policy based on factors such as sender, channel, and tool name decides whether to allow, deny, or require manual approval. It supports ACP approval flows.

    Policy misconfigurations, rule omissions, or policy bypasses (such as indirectly achieving high-risk effects through a legitimate toolchain) can lead to unauthorized execution. A lack of independent auditing after policy changes makes post-incident attribution difficult.

    Stuck/Loop Detection

    Detects when a session makes no substantive progress over several rounds (such as no new user or assistant messages, only repeated tool calls) and triggers an alert or termination.

    This feature can only identify "no-progress" loops. It cannot identify multi-step operation chains that are logically consistent but have disastrous results (such as gradual inducement to delete or exfiltrate data). False positives and false negatives depend on threshold tuning.

    Command allowlist/denylist

    Filters executable commands for tools such as exec and shell using an allowlist or denylist to reduce arbitrary command execution.

    Obfuscated or encoded commands (such as base64-decoded execution or aliased and multi-line concatenation) have historically bypassed filters, leading to corresponding CVEs and fixes. The lists often lag behind new attack techniques.

    Context and security instructions

    Injects constraints such as "Do not do X" or "Require approval before doing Y" through the System Prompt, relying on the model to comply.

    In long conversations, context window compression, summarization, or truncation can dilute or cause the model to "forget" key security instructions. Adversarial inputs can attempt to override or weaken these constraints (Prompt Injection).

    Therefore, runtime protection acts like a city wall. It can block most known attack paths but cannot guarantee that configurations are always correct. It also cannot prevent unknown bypasses or logical misuse. In a security architecture, you need a complementary "sentry" to continuously observe and audit the Agent's callers, consumption, tool call sequences, and results.

Solution overview

Observability acts as this "sentry." It uses logs, metrics, and traces to continuously monitor Agent behavior. This supports audit traceability and usage compliance. With anomaly detection, it helps answer questions such as "Who made the call?", "How much did it cost?", and "What exactly was done?". This allows for early detection and response when policies fail or new types of attacks occur, before the impact spreads.

Mapping the three pillars of observability to AI Agents

Observability is built on the three pillars of Logs, Metrics, and Traces. In the OpenClaw scenario, the relationship between these pillars, their data sources, and the core questions they answer are as follows:

Pillar

OpenClaw data source

Core question answered

Logs (Session audit logs)

~/.openclaw/agents/<id>/sessions/*.jsonl

What did the Agent do? Which tools did it call? How many tokens and how much cost did it consume?

Logs (Application operational logs)

/tmp/openclaw/openclaw-YYYY-MM-DD.log

What went wrong with the system? Webhook failures, authentication denials, gateway anomalies?

Metrics

OTLP output from the diagnostics-otel plug-in

Are the current costs and latency normal? Are there any stuck sessions or abnormal retries?

Traces

OTLP output from the diagnostics-otel plug-in

What steps did a single message go through from receipt to full response? How is the call chain connected?

All three pillars are indispensable. With only Metrics, you cannot answer "who" or "why" costs have spiked. With only Session logs, you cannot assess the overall system health and anomaly inflection points. With only application operational logs, you cannot see the Agent's business behavior and tool call sequences. Working together, the three pillars can support security auditing, cost management, and troubleshooting and operations and maintenance (O&M).

Alibaba Cloud SLS capabilities and advantages

As a foundational platform for observability, SLS has the following natural advantages for the OpenClaw scenario:

  • Powerful data ingestion capabilities that natively align with the OpenClaw technology stack

    LoongCollector has powerful OneAgent collection capabilities with native support for both logs and the OpenTelemetry Protocol (OTLP). Agent Session logs are often long because they carry model interaction context. LoongCollector provides high-performance collection for long text logs. It integrates seamlessly with OpenClaw's built-in diagnostics-otel plug-in, allowing Metrics and Traces to be written directly to SLS using OTLP.

  • Rich query, analysis, and processing operators

    Session logs are in a nested JSON format (such as message.content, message.usage.cost, and message.toolName). SLS provides an SQL + SPL compute engine and a rich set of parsing, filtering, and aggregation operators. You can index and analyze nested fields in real time without additional extract, transform, and load (ETL) processing.

  • Security and compliance capabilities

    RAM access control, sensitive data masking, and encrypted storage meet audit trail and compliance requirements. SLS is certified for network security products, making it a suitable observability and audit foundation for classified protection and industry compliance scenarios. Alerting channels support DingTalk, text messages, and email, which facilitates timely notification and response to security events and cost or anomaly alerts.

  • Fully managed, pay-as-you-go, and elastic scaling

    Log analysis is an all-in-one process that includes collection, storage, indexing, query, dashboard, and alerting, and is fully managed by Logstores and Metricstores. For small-scale Agents, the log volume is not large, and the pay-as-you-go cost is low. As traffic increases, the service scales elastically, which eliminates the need to reserve capacity or manually scale out. You do not need to build your own Elasticsearch, Prometheus, or other systems.

Therefore, SLS is well-suited as the observability and audit foundation for the controlled operation of OpenClaw, supporting auditing, cost management, anomaly detection, security compliance, and O&M across multiple scenarios.

SLS now offers a one-stop integration solution for OpenClaw:

  • Use the Integration Center to configure collection paths and parsing methods through a wizard. The configurations are automatically generated, delivered, and applied. This creates a unified entry point and a unified project for Session logs, application logs, and OTLP telemetry. This one-stop integration significantly reduces the complexity and O&M costs associated with fragmented data sources.

  • A single set of Session data can be used for security auditing, along with cost and behavior analysis, serving multiple scenarios.

  • Pre-built dashboards for auditing, cost analysis, and operational metrics provide an out-of-the-box, closed-loop solution for observing controlled operations.

Procedure

Step 1: Ingest logs (using session logs as an example)

Session logs are the core data source for security auditing. They record every conversation turn, every tool call, and every token consumed.

Prerequisites

Procedure

  1. Log on to the Simple Log Service console. In the right-side pane, click Quick Data Import. On the integration card, select OpenClaw-Session Log, and then select the destination project and Logstore.

  2. On the Machine Group Configurations tab, select the machine group that you created when you installed LoongCollector from the Source Machine Group list and add it to the Applied Machine Group list.

    Troubleshoot abnormal heartbeats
  3. On the Logtail Configuration page, Simple Log Service automatically fills in the built-in collection configuration. If no changes are needed, click Next.

    • The Configuration Name is set by default. You can change it as needed.

    • Under Other Global Configurations, the Log Topic Type is set by default.

      About topic generation mode: LoongCollector can automatically extract the topic and session_id from the file path. If you use a custom file path that does not match the pre-filled path, you must modify the configuration.
    • The File Path is automatically filled in by default.

      About the text file path: The pre-filled file path assumes a default installation by a non-root user on a Linux host. If your actual path is different, you must modify it.
    • In the Processing Method section, a combination of processing plug-ins is configured by default.

      About time parsing: By default, OpenClaw outputs logs in the UTC+0 time zone. If you have customized the time zone, you must also modify the time zone in the time parsing plug-in to avoid time mismatches.
  4. On the Query and Analysis Configurations page, Simple Log Service automatically creates built-in indexes and reports. You can view them later in the query and analysis interface and on the Dashboard page.

    • The built-in indexes are as follows: image

    • The dashboards are as follows:

      • OpenClaw Behavior Analysis Dashboard

      • OpenClaw Audit Dashboard

      • OpenClaw Metrics Dashboard

      • OpenClaw Token Analysis Dashboard

Step 2: Audit and observe

SLS provides pre-built dashboards for OpenClaw that cover four dimensions: security auditing, cost analysis, behavior analysis, and operational metrics.

  1. Log on to the Simple Log Service console. In the Projects section, select the destination project.

  2. In the Log Storage section, find the destination Logstore and click Search & Analyze to verify the integration and check the log format.

  3. In the Dashboard, view the preset dashboards.

    • Security audit dashboard

      Visibility into an Agent's behavior is directly related to system security and compliance risks. Abnormal behavior often shows signs before actual damage occurs. The security audit dashboard is the core dashboard for the controlled operation of OpenClaw. It focuses on answering the core question: "What is the Agent doing, are there any high-risk actions, and who is performing unauthorized operations?". It provides real-time behavior monitoring, threat detection, and post-incident traceability capabilities across dimensions such as behavior overview, high-risk commands, prompt injection, and data exfiltration.

      • Security audit statistics overview page:

        • This page provides a single-screen risk snapshot of OpenClaw's security posture, centered on a multi-dimensional count of high-risk operations within a specified time window. Six metrics are displayed side-by-side: high-risk command execution, outbound web requests, outbound command-line operations, outbound communication tool usage, sensitive file access, and prompt injection. Paired with day-over-day comparison data, this helps security teams quickly determine if the current risk level is abnormal without delving into details.

        • Pay special attention to the count of high-risk operations after a prompt injection event. Ordinary high-risk operations may stem from the legitimate needs of a task. However, high-risk behavior triggered after an injection is a strong threat signal. It indicates that the injected malicious instruction has driven the Agent to take action. Even if there are false positives, such signals should trigger the highest level of manual review, rather than waiting for further confirmation. Therefore, the "number of sessions with tool calls after injection" is the signal with the highest threat confidence level in the entire overview. Three such sessions often have a higher priority than hundreds of ordinary high-risk commands.

        • The high-risk sessions table aggregates risk counts for each dimension on a per-session basis. It automatically sorts sessions by a comprehensive risk score, bringing the sessions that most require manual intervention to the top. Security teams do not need to sift through logs one by one. They can start traceability directly from the highest-risk session, which significantly shortens the time window from detection to response.

      • Skills usage analysis

        • Skills usage analysis examines OpenClaw's capability boundaries from an attack surface perspective. Skills are OpenClaw's native capability extension mechanism and also a major attack vector for malicious prompt injection. Users often inadvertently install Skills with security vulnerabilities or embedded malicious instructions, which provides attackers with a controllable capability entry point. Therefore, the distribution of Skills calls is not just a usage statistic, but also an important basis for attack path analysis.

        • The usage distribution pie chart helps security teams quickly establish a baseline understanding of Skills calls: which Skills are high-frequency mainstream calls, and which are marginal and low-frequency. If the proportion of an uncommon Skill suddenly rises, or a new, unseen Skill appears, it often means the Agent is being guided down an unintended capability path and requires timely investigation.

        • The content in the new Skills table is particularly critical. Newly introduced Skills have not undergone a thorough security assessment. Their permission boundaries and behavior patterns are blind spots for the security team. By sorting in reverse chronological order of the first call time, you can capture newly emerging Skills in the environment at the earliest opportunity and complete a review before they are abused.

      • High-risk command execution monitoring

        • One of OpenClaw's innovative capabilities is the autonomous execution of system commands, which also makes it an ideal springboard for attackers. Once an Agent is subjected to prompt injection or controlled by a malicious Skill, an attacker can use the Agent's system access permissions to perform destructive operations such as deleting files, elevating privileges, or exfiltrating data. All these actions are initiated in the Agent's identity, which makes them extremely difficult to distinguish from normal task behavior.

        • The core value of high-risk command execution monitoring is to establish an independent observability layer outside of runtime protection. OpenClaw's tool permission system already implements controls at the runtime level. However, policy configuration errors, vaguely defined permission boundaries, or uncovered edge cases can all lead to high-risk commands passing through at the runtime level unnoticed. The observability layer operates independently of the protection mechanism, which ensures that even if there are oversights at runtime, high-risk operations will not go completely undetected.

        • The significance of the timeline view is not just counting, but helping security teams identify behavior patterns. An isolated, single high-risk command has a different risk meaning than a dense series of calls within a short period. The latter is often a typical feature of an Agent being controlled to systematically execute malicious instructions and requires immediate intervention. The details table provides the complete traceability context, which allows security teams to quickly trace from an abnormal signal to the specific session and original command.

      • Prompt injection detection

        • Prompt injection is the core attack method for driving an AI to perform harmful actions. Regardless of the attack path, whether it is direct user input, a return from a Skills call, or external data read by tools such as web_fetch or read, the malicious instruction must ultimately be incorporated into the prompt to influence the Agent. The prompt is the final convergence point for all attack paths.

        • The distribution of injection sources can help determine the actual nature of the risk. Injections from direct user input are usually intentional, whereas injections carried by a toolResult are often unknown to the user. For personal assistant-type Agents such as OpenClaw, indirect injection is the main threat. Skills installed by the user or external content accessed can all become injection vectors and are difficult for the user to proactively identify and avoid.

        • The value of injection classification lies in identifying the attack intent, not just flagging an anomaly. For the same injection event, ROLE_HIJACK and JAILBREAK mean the attacker is trying to break through the Agent's behavioral boundaries. HIDDEN_INSTRUCTION represents a more covert implantation technique. The response priority and handling methods for these types are different. Continuously observing changes in the classification distribution also helps to discover concentrated attempts on specific attack surfaces.

        • The details table records the triggering tool, session context, and original content for each injection event. This allows security teams to quickly drill down from classified statistics to specific events, which completes the full loop from pattern recognition to traceability and response.

      • Sensitive data exfiltration detection

        • In the context of an Agent, data exfiltration is often not a single event, but a chain of behavior that consists of multiple steps: the Agent is guided to read a sensitive file, the content enters the model's context, and is then transmitted out through subsequent tool calls. Observing any single step in isolation makes it difficult to judge the threat. Only by associating file access with outbound behavior can the full intent of the attack be reconstructed.

        • Sensitive data exfiltration detection uses a funnel analysis approach, which progressively narrows down the noise to accurately locate real threats. The first layer performs a full recording of sensitive file access, classified by asset type: SSH_KEY, ENV_FILE, CREDENTIALS, CONFIG_SECRET, and HISTORY, to establish an access baseline. The second layer independently tracks outbound behavior by channel (API_CALL, MESSAGE_SEND, WEB_ACCESS, EMAIL) to identify potential data exit points. The third layer correlates the two in the time dimension. If sensitive file access and an outbound operation occur in close succession within the same session and a short time window, it is marked as a high-priority exfiltration event.

        • The core value of this mechanism lies in causal positioning rather than single-point alerting. An Agent reading an SSH_KEY is not necessarily a threat. Initiating an API_CALL is not necessarily a threat. However, when both occur in the same session within a minute-level interval, and the outbound parameters carry sensitive file content, the threat confidence level increases significantly. The behavior chain analysis table directly presents the time difference between access_time and outbound_time and the complete call parameters, which allows security teams to complete traceability and judgment without manually correlating logs.

    • Token analysis dashboard

      Token consumption is directly related to operational costs, and its fluctuations are often an early signal of system anomalies (such as context expansion caused by prompt injection). The token analysis dashboard focuses on the core questions of "Where is the money being spent?", "Is the spending reasonable?", and "Are there any anomalies?". It provides usage monitoring, cost analysis, and anomaly detection capabilities from the perspectives of overall overview, model dimension trends, and sessions.

      About cost data: The cost field in the dashboard comes from OpenClaw's usage.cost. OpenClaw does not natively support tiered billing, and the cacheRead + cacheWrite calculation logic cannot be kept consistent with the provider's. It only estimates the cost of a single call based on inputTokens × input + outputTokens × output + .... Therefore, the costs on the dashboard should be considered a baseline for cost estimation, not an exact bill. For models where cost is not configured, the cost column shows 0.

      Qwen3.5-Plus configuration example

      Take the Qwen3.5-Plus model as an example. For information about the cost of Alibaba Cloud Model Studio API calls, see Model List.image.png

      The model cost configuration in .openclaw is:
      {
        "id": "qwen3.5-plus",
        "name": "Qwen3.5 Plus",
        "cost": {
          "input": 0.4, // From the lowest tier input price
          "output": 2.4, // From the lowest tier output price
          "cacheRead": 0.2, // Estimated as half of the input
          "cacheWrite": 0
        },
      }
      • Overall summary and model distribution

        • The top of the dashboard provides a 1-day comparison of total tokens and total cost: today's vs. yesterday's usage (unit: 10,000 tokens), today's vs. yesterday's cost (unit: CNY), and the day-over-day comparison ratio. This makes it easy to quickly determine if there has been a sudden increase in usage or cost. The day-over-day comparison is the first signal of a cost anomaly. If the day-over-day ratio exceeds a preset threshold (such as ±30%), it usually means there has been prompt expansion, a recursive invocation, or an abnormal session, and you should immediately drill down to investigate.

      • Consumption trends by provider/model (time series)

        • The model tokens trend and model cost trend are two time series charts (relative to 1 week) that share a timeline and legend. They show the token consumption and cost changes of each model over time, distinguished by color. Pay close attention to token spikes. This is often not just a cost issue, but a risk signal for security and stability. Prompt injection causing the context to be maliciously filled, tool calls getting stuck in an infinite loop, or sessions continuously expanding because loop detection was not triggered will all appear as a steep rise in one of the curves on the trend chart. The two charts are color-coded by model. A model switch is directly reflected as a change in the color composition, which lets you confirm the switch time and the models involved without extra inference, making it easy to determine if it was an expected change.

      • Top consumption by session and by host/pod (column charts)

        • The column charts form a 2×2 layout, answering "Who is spending the money?" and "Which machine or container is spending the money?" from the dimensions of session and host (or pod, in a container scenario), associating the data with specific responsible entities:

          • Top Tokens By Session / Top Cost By Session: The total tokens and costs for each session over the past week are sorted in descending order. In practice, the cost distribution of an Agent often exhibits a long-tail characteristic, where a few sessions account for the vast majority of consumption. Identifying these "head sessions" is the first step in cost optimization.

          • Top Tokens By Host / Top Cost By Host: Tokens and costs aggregated by host (instance) or pod, used for cost analysis and risk localization in multi-instance deployments. In an enterprise environment, a host or pod is usually tied to a specific team, line-of-business, or user. By combining this with asset ownership, you can map consumption data to the specific responsible party. This not only supports cost allocation but also helps to quickly lock down potential risk users or out-of-control sessions when an instance's consumption is abnormal.

      • Model token details table (cost breakdown)

        • The details table (relative to 1 week) lists the following for each model: totalTokens, inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, and their corresponding totalCost, inputCost, outputCost, cacheReadCost, and cacheWriteCost. It supports sorting and filtering, which lets you directly answer "Which model spent the most money?" and "How much did input/output account for?". The ratio of inputTokens to outputTokens reflects the Agent's interaction pattern: a high input ratio suggests a redundant prompt or context, while a high output ratio may mean the model is generating a lot of invalid content. The cacheReadTokens ratio directly reflects the benefits of the cache policy. A higher ratio means lower actual billing, which provides a quantitative basis for prompt engineering and cache tuning.

    • Behavior analysis dashboard

      The behavior analysis dashboard uses the session as the basic unit to record and classify the operational behavior of OpenClaw, answering the basic but critical question, "What did the Agent do in the current time window?".

      • Session statistics

        • The count cards at the top break down tool calls by behavior type into dimensions such as command execution, background processes, web requests, communication tools, and file read/write, which provides a quick snapshot of the overall behavior composition. Call exceptions are listed separately, making it easy to judge system stability at a glance.

        • The session statistics table is expanded by session, recording the number of calls for each session in each behavior dimension. In the screenshot, the total number of tool calls for the first session reaches 1,925, including 1,364 command executions and 561 file read/writes, which is an order of magnitude different from other sessions. Such abnormally active sessions are often worth prioritizing for review. The table is sorted by the last active time. Combined with the call distribution of each dimension, you can quickly identify sessions with abnormal behavior patterns.

      • Tool call volume statistics and error analysis

        • Tool calls are the only channel for the Agent to interact with the outside world. Changes in their call volume and error rate directly reflect the Agent's operational health. The tool call timeline shows the composition of call frequencies for each time period, color-coded by tool type. Abnormal spikes are the first entry point for troubleshooting. Combined with changes in the composition of tool types, you can quickly determine which type of operation drove the call surge. The error rate trend chart shares a timeline with the call volume timeline. A peak in the error rate does not necessarily coincide with a peak in call volume. The time difference between the two can often reveal the true source of the problem: whether a certain type of tool is continuously failing during a specific period, or whether a certain task has introduced an abnormal call pattern.

        • The full tool call log provides the protocol error, execution status, and return content for each call, which lets you quickly drill down from a trend anomaly to the specific failed call to locate the root cause.

      • External interactions

        • External interactions record all outbound behaviors initiated by the Agent during its operation, including API calls, web access, message sending, and email sending, presented and classified by session, tool name, and interaction type.

        • For an Agent, external interactions are both a necessary means to complete tasks and a potential risk exit point. A full record of external interaction behavior, on the one hand, helps the team to understand the Agent's actual capability boundaries and usage habits. On the other hand, it provides a complete behavioral context when an anomaly occurs, which supports cross-tool, cross-session association analysis and traceability.

Step 3: Explore observable data with custom queries

The built-in dashboards provide general-purpose audit and observation views. In actual security operations, dashboards are often the starting point for "discovering problems," not the end point. When the audit dashboard flags a high-risk session, the token trend chart shows an abnormal spike, or an operational metric alert is triggered, you often need to further drill down from the statistical overview to specific events to reconstruct the complete behavior chain and confirm the root cause. The SLS query and analysis engine provides flexible custom exploration capabilities for this process.

Log data models: The basis for custom analysis

Understanding the data structure is a prerequisite for custom exploration. The SLS integration solution has pre-built indexes based on audit analysis needs, so users can query directly without additional configuration. The following two types of logs form the core data source for custom analysis:

  • Session logs — Record the Agent's complete business behavior and are the main basis for security auditing and cost analysis. These are the logs ingested in Step 1: Ingest logs (using session logs as an example).

    Field path

    Type

    Audit analysis purpose

    __tag__:__session_id__

    text

    Unique session identifier. A key field for isolating and aggregating by session.

    type

    text

    Entry type: session (session metadata), message (dialogue message), or compaction (context compression summary). Used to filter for auditable dialogue records.

    message.role

    text

    Message role: user (user input), assistant (model response), or toolResult (tool return). Used to locate the acting entity.

    message.content

    text

    Message content. Includes user input, model output, and tool parameters/return values. Supports injection detection, sensitive data matching, and full-text search.

    message.provider

    message.model

    text

    Model provider and model name. Used for cost analysis and behavior statistics by model.

    message.usage.totalTokens

    message.usage.cost.total

    long / double

    Token usage and estimated cost. Used for abnormal consumption detection and session-level cost sorting.

    message.stopReason

    text

    Response termination reason: stop (normal end), toolUse (tool call triggered, next entry is usually toolResult), error / aborted / timeout (abnormal end). A key field for filtering abnormal sessions.

    message.toolName

    message.isError

    text / bool

    Tool call name and execution status. Used with the toolResult role for tool-level auditing.

    id, parentId

    text

    Entry ID and parent ID. Used to build the dialogue tree and restore message order. The id of a session type entry is the sessionId.

    timestamp

    text

    Event timestamp. Used for time window filtering, sorting, and defining alert scopes.

  • Runtime logs — Record the operational status of the gateway and various subsystems. They are the data foundation for troubleshooting and system health analysis.

    Note

    Select the OpenClaw-Runtime Log card and follow the steps in Step 1: Ingest logs (using session logs as an example) to ingest the logs.

    Field path

    Type

    Audit analysis purpose

    _meta.logLevelName

    text

    Log level (TRACE / DEBUG / INFO / WARN / ERROR / FATAL). Used to focus on ERROR and FATAL for anomaly troubleshooting.

    _meta.path

    text

    Source code file path and line number. Precisely correlates to the code location for stack analysis.

    Numeric key "0"

    object (JSON)

    Structured context. Usually contains the subsystem field (such as gateway, channels, telegram, or plugins).

    Numeric key "1" and subsequent keys

    text

    Log message content and stack content. Supports full-text search and keyword matching.

Session-level drill-down: From high-risk sessions to complete behavior chains

Typical scenario: The "High-Risk Sessions" list on the audit dashboard flags a high-risk session. The security team needs to reconstruct the complete interaction process of that session to confirm if the threat is real.

In a multi-instance deployment environment, the logs of each OpenClaw instance are centrally written to the same SLS Logstore. The first step in custom exploration is to isolate by Session ID to narrow the view to a single session. This clarifies "who triggered which requests, called which tools, and how the model responded at what time," which provides a clear boundary for compliance evidence.

  1. Log on to the Simple Log Service console. In the Projects section, select the destination project.

  2. In the Log Storage section, find the destination Logstore and click Search & Analyze to explore the data. Use the query * AND __tag__:__session_id__:<Session_Id> to filter the logs. Replace <Session_Id> with the actual session ID.

  3. After you filter the sessions, on the Raw Logs page, go to the Raw Data tab, find the target log, and click the Query Log-004 icon. You can then preview the context and reconstruct the complete behavior chain within the session in its original order: user input, model inference, tool calling requests, and tool execution results. The sequence of events is clear at a glance. This capability is especially crucial in auditing scenarios. It not only helps you identify abnormal call sequences (such as reading a sensitive file immediately followed by an exfiltration operation), but also provides a complete contextual view for reproducing security incidents and preserving evidence.

Runtime troubleshooting: Keyword search and aggregation analysis

Typical scenario: The operational metrics dashboard alerts you to a sudden increase in the error rate. You need to quickly locate the faulty module and root cause from a large volume of runtime logs.

SLS supports a combination of full-text search and structured field search. Combined with a time range, you can progressively narrow down the scope of your investigation. A typical troubleshooting path involves two steps: first, narrow the scope, then quantify the distribution.

Step 1: Filter progressively to lock down the problem

  1. Filter by log level: Use the query _meta.logLevelName: ERROR OR _meta.logLevelName: WARN OR _meta.logLevelName: FATAL to filter for all error and warning logs, focusing your attention on abnormal events.

  2. Drill down by subsystem: Add a field condition to the error logs. For example, "0.subsystem": plugins. The analytic statement would be (_meta.logLevelName: ERROR OR _meta.logLevelName: WARN OR _meta.logLevelName: FATAL) AND "0.subsystem": plugins. This narrows the scope to a specific subsystem. These two filtering steps can quickly locate the error logs.

Step 2: Use SQL aggregation to quantify the global distribution

Keyword filtering locates individual events, while SQL aggregation analysis elevates individual logs to a global statistical view. For example, grouping and aggregating the subsystem field with the analytic statement _meta.logLevelName: ERROR OR _meta.logLevelName: WARN OR _meta.logLevelName: FATAL | SELECT "0.subsystem" AS subsystem, count(1) AS c GROUP BY subsystem can visually present the error distribution across various subsystems. This helps to quickly identify concentrated anomalies and points the way for further investigation.

Step 4: Correlate multiple data sources for a closed-loop process from anomaly detection to root cause analysis

So far, we have introduced data ingestion, built-in dashboards, and custom exploration based on observable data. In actual O&M and auditing, observable data sources are not used in isolation. Instead, they follow a fixed collaborative pattern, progressively converging and mutually verifying:

OTEL Metrics → Application logs (error context) → Session audit logs (complete behavior chain). A typical investigation path is as follows: OTEL metrics detect an anomaly (such as a spike in latency, a surge in tokens, or a sudden increase in the error rate). You then locate the error details in the application logs for the corresponding time window (Webhook timeout, authentication failure, gateway anomaly). Finally, you drill down into the session audit logs to reconstruct the complete tool call sequence, model interaction content, and cost consumption for that session to confirm the root cause and preserve audit evidence.