LoongCollector + ACS Agent Sandbox: Build a Production-grade AI Agent Runtime Platform

This article introduces a production-grade AI Agent runtime platform combining ACS Agent Sandbox for security and LoongCollector for observability.

1. Security and Observability Challenges of AI Agents

With the rapid development of Large Language Models (LLMs), AI Agents are moving from the lab to production. From intelligent customer service to code assistants, and from data analytics to automated O&M, AI Agents are transforming how we work. However, unlike traditional applications, AI Agents possess two distinct characteristics:

Unpredictable behavior: The same input might generate different outputs and invoke different toolchains.

Execution capability: Agents don't just "talk"; they "act"—accessing data, invoking APIs, and executing operations.

These two characteristics present entirely new challenges.

Core Challenge 1: Runtime Security (What are Agents permitted to do? Who defines the boundaries?)

Consider this scenario: A customer service Agent answering a query is subjected to a prompt injection attack. It accidentally accesses another user's order information, or even triggers a refund API. This is a real-world security risk, not science fiction.

AI Agent security risks primarily stem from two areas:

1. Lack of strong isolation in execution environments

Agents require data access and tool invocation at runtime. Without strict permission controls, prompt injections or accidental triggers can lead to unauthorized access, data leaks, or unintended operations—such as an Agent bypassing security checks to access a restricted database.

2. Lack of control over external capabilities

The greatest threats often arise from the abuse of external capabilities—such as abnormal outbound calls, SSRF/intranet probing, or sensitive data persistence and exfiltration. For example, an Agent might be tasked with "checking the weather" but actually initiates a scan of internal network services.
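
To make the second risk concrete, here is a minimal Python sketch of an egress check that a tool-calling layer could run before an Agent issues an outbound request. It is not part of ACS Agent Sandbox. The function name `check_target` and the allowlist contents are hypothetical, and the check only inspects literal IP addresses and an explicit hostname allowlist. It does not resolve DNS, which a real guard would also need to handle.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of external hosts the Agent is permitted to call.
ALLOWED_HOSTS = {"api.weather.example.com"}


def check_target(url: str) -> str:
    """Classify an outbound URL as allowed or denied, with a reason."""
    host = urlparse(url).hostname
    if host is None:
        return "deny:invalid-url"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address: fall back to the hostname allowlist.
        return "allow" if host in ALLOWED_HOSTS else "deny:not-allowlisted"
    # Literal IPs in internal ranges are typical SSRF / intranet-probing targets.
    if ip.is_private or ip.is_loopback or ip.is_link_local:
        return "deny:internal-address"
    # Public literal IPs are not on the allowlist in this sketch.
    return "deny:not-allowlisted"


for candidate in (
    "https://api.weather.example.com/v1?city=Hangzhou",
    "http://10.0.0.5:8080/admin",
    "http://169.254.169.254/latest/meta-data",
):
    print(candidate, "->", check_target(candidate))
```

A check like this only covers the application layer; the point of a sandbox is to enforce the same boundary at the network level, where the Agent cannot bypass it.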

Core Challenge 2: Full-link Observability (What did the Agent do? Why did it do it? How effective was it?)

Traditional applications are largely deterministic: the same input usually yields the same output. AI Agents, however, may make different decisions on each run, which creates three major observability hurdles:

1. Behavior is hard to reproduce and troubleshoot

For the same query, an Agent might use Tool A today, Tool B tomorrow, or simply provide a direct answer the day after. When errors occur, identifying the exact point of failure is difficult.

2. Difficulty in cost control and attribution

Costs are driven by LLM token consumption and external API calls, both of which fluctuate significantly. It is often unclear which users, tasks, or models are driving up expenses.
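
As an illustration of what cost attribution involves, the sketch below sums estimated token cost per user and model from a list of usage records. The record fields (`user`, `model`, `prompt_tokens`, `completion_tokens`) and the prices are assumptions made for this example, not values from any real provider or from OpenClaw.

```python
from collections import defaultdict

# Hypothetical prices per 1,000 tokens; real prices depend on the model provider.
PRICE_PER_1K_TOKENS = {"model-a": 0.002, "model-b": 0.02}


def attribute_cost(records):
    """Return estimated cost keyed by (user, model)."""
    totals = defaultdict(float)
    for record in records:
        tokens = record["prompt_tokens"] + record["completion_tokens"]
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[record["model"]]
        totals[(record["user"], record["model"])] += cost
    return dict(totals)


usage = [
    {"user": "alice", "model": "model-a", "prompt_tokens": 1500, "completion_tokens": 500},
    {"user": "alice", "model": "model-a", "prompt_tokens": 800, "completion_tokens": 200},
    {"user": "bob", "model": "model-b", "prompt_tokens": 2000, "completion_tokens": 1000},
]

# List the most expensive (user, model) pairs first.
for key, cost in sorted(attribute_cost(usage).items(), key=lambda kv: -kv[1]):
    print(key, round(cost, 4))
```

Aggregation like this is only possible if every LLM call is recorded with its user, task, and model, which is why token usage needs to be captured as telemetry rather than reconstructed from invoices.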

3. Quality is hard to measure and optimize

Output quality depends on model capability, prompt design, and retrieval data. Because these factors change constantly, it is difficult to pinpoint what is working, what isn't, and how to optimize.

Why Is a Specialized Solution Necessary?

Traditional monitoring and security solutions fall short in AI Agent scenarios:

| Dimension | Traditional application | AI Agent | Why they differ |
| --- | --- | --- | --- |
| Security boundary | Controlled by code logic | Requires mandatory runtime isolation | Agent behavior is decided by the LLM; code cannot fully predict its actions. |
| Observability targets | Requests, responses, and logs | Reasoning chain, tool calls, and token consumption | Must record the decision-making process, not just inputs and outputs. |
| Cost model | Computing resources | Tokens and API calls | Costs correlate with data volume, model choice, and task complexity. |
| Troubleshooting | Reviewing logs and error stacks | Requires a full replay of the decision chain | Non-determinism makes issues difficult to reproduce. |

This is why a runtime platform and observability solution specifically designed for AI Agents are essential. Let's explore how ACS Agent Sandbox and LoongCollector address these challenges.

2. ACS Agent Sandbox and LoongCollector: Comprehensive Security and Observability

ACS Agent Sandbox provides a secure execution environment based on Kubernetes, while LoongCollector acts as a telemetry data collector to provide agents with comprehensive monitoring and analysis. Together, their deep integration forms a complete production-grade execution platform for AI Agents.

2.1 ACS Agent Sandbox: Providing Runtime Security

Alibaba Cloud Container Compute Service (ACS) Agent Sandbox is a specialized environment for running AI Agents. Built on Kubernetes, it provides a secure, isolated, and scalable platform.

2.2 LoongCollector: Providing Sandbox Observability

LoongCollector is a unified telemetry collector open-sourced by the Alibaba Cloud Observability team. Designed for cloud-native and high-performance scenarios, it offers unique advantages for AI Agent use cases:

Extreme Performance and Ultra-low Overhead

AI Agents are compute-intensive, so observability components must be lightweight to avoid impacting business operations:

Zero-copy architecture: Utilizes Memory Arena and zero-copy to minimize unnecessary memory overhead.

Event pooling and reuse: High-frequency object pooling reduces memory allocation and Garbage Collection (GC) pressure.

High single-core throughput: A single core can support log collection throughput of up to 500 MB/s.

Unified Collection: Full Coverage of Logs, Metrics, and Traces

Logs: Supports stdout/stderr and file logs; automatically associates Kubernetes metadata such as Pods, Namespaces, and Labels.

Metrics: Native support for Prometheus Exporter, system metrics (CPU, memory, network, and disk I/O), and GPU metrics (NVIDIA DCGM).

Traces: Full support for OpenTelemetry.

Edge Computing: Moving Processing to the Data Source

Beyond collection, it performs edge-side preprocessing to reduce transmission and storage costs:

● High-performance C++ plugins and Structured Process Language (SPL) engine.

● Supports complex processing: Filtering, transformation, and aggregation.

● Edge-side dimensionality reduction: Minimizing noise and data volume at the source.

Enterprise-Grade Reliability: Ensuring Zero Data Loss and Stable Operations

Data reliability:

● At-least-once delivery semantics.

● Local disk caching: Persisting data to disk during network anomalies and retransmitting upon recovery.

● Automatic retry and exponential backoff.

● Backpressure and rate limiting: Protects the system during downstream congestion.
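
The retry behavior described above can be sketched in a few lines of Python. This is an illustration of capped exponential backoff combined with at-least-once semantics, not LoongCollector's actual implementation; the function names and default values are assumptions for this example.

```python
import time


def backoff_delays(base=0.5, factor=2.0, max_delay=30.0, attempts=6):
    """Yield one capped exponential backoff delay (in seconds) per attempt."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor


def send_with_retry(send, batch, sleep=time.sleep, **backoff_kwargs):
    """Call send(batch) until it succeeds or the attempts are exhausted.

    Sleeps after each failed attempt. Returns True on success. On False the
    caller should keep the batch (for example in a local disk buffer) and
    retransmit later, which gives at-least-once rather than at-most-once
    delivery.
    """
    for delay in backoff_delays(**backoff_kwargs):
        try:
            send(batch)
            return True
        except ConnectionError:
            sleep(delay)
    return False


print(list(backoff_delays(base=1, factor=2, max_delay=5, attempts=5)))
```

Because a batch may be resent after a timeout even though the first attempt actually arrived, at-least-once delivery can produce duplicates, so downstream consumers should tolerate them.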

Operational reliability:

● Multi-tenant pipeline isolation.

● Priority scheduling: Ensuring critical data is processed first.

● Hot updates and graceful changes: Configuration changes take effect without restarts or service interruptions.

Unified Management for Large-Scale Elastic Scenarios

ConfigServer: Centralized configuration management supporting tens of thousands of Agents.

Remote configuration delivery: Changes take effect in real time without requiring manual login.

Status and performance monitoring: A unified view of health and resource overhead.

2.3 Deep Integration: LoongCollector Provides Zero-Intrusion, Automated, and Highly Reliable Observability for Sandbox

● ACS management automatically injects the LoongCollector container into the Sandbox.

● LoongCollector collects log files through shared file path mounting.

● LoongCollector uses the Pod network to scrape Prometheus metrics from AI Agents or to receive OpenTelemetry data from them.

Through the deep integration of ACS Agent Sandbox and LoongCollector, we have built a comprehensive production-grade platform for AI Agents:

| Capability dimension | ACS Agent Sandbox | LoongCollector | Combined value |
| --- | --- | --- | --- |
| Security | Container isolation, access control, and network isolation | Collection isolation and send isolation | Multi-layered security guarantees |
| Reliability | Automatic fault recovery, multiple replicas, and health checks | At-least-once delivery, local cache, and automatic retries | End-to-end reliability assurance |
| Scalability | Automatic scaling and startup in seconds | High performance, low overhead, and edge compute | Elasticity to handle traffic fluctuations |
| Manageability | Unified configuration and canary releases | ConfigServer and remote delivery | Unified management of large-scale clusters |
| Observability | Automatic injection and metadata association | Full-link collection and real-time processing | Complete observability view |

3. Running OpenClaw Using ACS Agent Sandbox and LoongCollector

OpenClaw is a trending AI application that redefines the boundaries of AI assistants. Its core value is no longer just answering questions, but understanding intent, planning steps, and invoking tools to complete tasks—acting as an "always-on" digital employee. Next, let's explore how to run OpenClaw securely and with full observability using ACS Agent Sandbox and LoongCollector.

3.1 Enabling Sandbox LoongCollector Injection for ACK and ACS Clusters

ACK clusters

Note: Install the following components in advance:

● Install the LoongCollector component in Components and Add-ons.

● Install the ACK Virtual Node component in Components and Add-ons.

● Install the ack-agent-sandbox-controller component in Components and Add-ons.

● To expose services via EIP, install the ack-extend-network-controller component from the Marketplace. Refer to the help document for specific configuration steps.

Modify the eci-profile ConfigMap in the kube-system namespace. The slsMachineGroup parameter defines the Sandbox machine group identifier; we recommend using a unique identifier different from the ACK DaemonSet group.

ACS clusters

Note: Install the following components first:

● Go to Components and Add-ons and install the ack-agent-sandbox-controller component (version ≥0.5.3).

● To expose services via EIP, go to Components and Add-ons in the ACK cluster and install the ack-extend-network-controller component.

● Go to Components and Add-ons and install the alibaba-log-controller component.

The machine group identifier is the unified ACS cluster group ID: k8s-log-${cluster_id}

3.2 Deploying OpenClaw in ACS Agent Sandbox

Enable the OpenTelemetry (OTel) plugin for OpenClaw

Note

● Ensure extensions/diagnostics-otel is included when packaging the OpenClaw image.

● You must enable diagnostics-otel in the configuration to report metrics and trace data.

Configure ~/.openclaw/openclaw.json

Note: The endpoint configured here will be required for the LoongCollector collection configuration later.

{  
  "plugins": {  
    "allow": ["diagnostics-otel"],  
    "entries": {  
      "diagnostics-otel": { "enabled": true }  
    }  
  },  
  "diagnostics": {  
    "enabled": true,  
    "otel": {  
      "enabled": true,  
      "endpoint": "http://127.0.0.1:4318",  
      "protocol": "http/protobuf",  
      "serviceName": "openclaw-gateway",  
      "traces": true,  
      "metrics": true,  
      "logs": true,  
      "sampleRate": 1,  
      "flushIntervalMs": 60000  
    }  
  }  
}  

OpenClaw sandbox deployment example

Below is a simplified example of creating an OpenClaw sandbox directly using a Sandbox CR:

apiVersion: agents.kruise.io/v1alpha1  
kind: Sandbox  
metadata:  
  name: openclaw  
  namespace: default  
spec:  
  template:  
    metadata:  
      labels:  
        alibabacloud.com/acs: 'true'  
        app: openclaw  
    spec:  
      containers:  
        - name: openclaw  
          # Replace with the actual OpenClaw image address  
          image: <open-claw image address>   
          imagePullPolicy: IfNotPresent   
          resources:  
            limits:  
              cpu: '4'  
              memory: 8Gi  
            requests:  
              cpu: '4'  
              memory: 8Gi  
          securityContext:  
            readOnlyRootFilesystem: false  
          terminationMessagePath: /dev/termination-log  
          terminationMessagePolicy: File  
      dnsPolicy: ClusterFirst  
      paused: true  
      restartPolicy: Always  
      schedulerName: default-scheduler  
      securityContext: {}  
      terminationGracePeriodSeconds: 1  

3.3 Full Observability Collection Configuration

As described in "Is Your OpenClaw Really Running Under Control?", OpenClaw produces the following observability data:

| Observability data | OpenClaw data source |
| --- | --- |
| Logs (Session) | ~/.openclaw/agents/<id>/sessions/*.jsonl |
| Logs (Application) | /tmp/openclaw/openclaw-YYYY-MM-DD.log |
| Metrics | diagnostics-otel plugin OTLP output |
| Traces | diagnostics-otel plugin OTLP output |

Session logs

apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-session-log  
spec:  
  config:  
    aggregators: []  
    global: {}  
    inputs:  
      - Type: input_file  
        # This path varies depending on the run path of the openclaw image.  
        FilePaths:  
          - /home/node/.openclaw/agents/main/sessions/*.jsonl  
        MaxDirSearchDepth: 0  
        FileEncoding: utf8  
        EnableContainerDiscovery: true  
        # Filter containers based on the OpenClaw sandbox information.  
        ContainerFilters:  
          K8sPodRegex: ^(openclaw.*)$  
    processors:  
      - Type: processor_parse_json_native  
        SourceKey: content  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-session-log  
    sample: ''  
  # Replace this with the sandbox machine group name of the ACK or ACS cluster.  
  machineGroups:  
    - name: <your-sandbox-machine-group>  
  # The project to which logs are collected.  
  project:  
    name: k8s-log-xxx  
  # The Logstore to which logs are collected.  
  logstores:  
    - name: openclaw-session-log  
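
To show what the JSON parsing step in this pipeline achieves, the following Python sketch expands the JSON string stored under `content` into top-level fields. It only illustrates the effect on a successfully parsed line; the real plugin's handling of parse failures and of the original field may differ. The sample record fields (`type`, `tool`, `durationMs`) and the `pod` field are hypothetical, since the OpenClaw session schema is not described in this article.

```python
import json


def parse_json_content(event: dict, source_key: str = "content") -> dict:
    """Expand the JSON object stored under source_key into top-level fields.

    If the value is missing or is not a JSON object, the event is returned
    unchanged so the raw line is not lost.
    """
    raw = event.get(source_key)
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return event
    if not isinstance(parsed, dict):
        return event
    expanded = {k: v for k, v in event.items() if k != source_key}
    expanded.update(parsed)
    return expanded


# One hypothetical line from a sessions/*.jsonl file, plus collector metadata.
line = '{"type": "tool_call", "tool": "search", "durationMs": 120}'
print(parse_json_content({"content": line, "pod": "openclaw"}))
```

Once the fields are expanded at the edge, they can be queried and aggregated as structured columns instead of being searched as raw text.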

Application logs

apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-app-log  
spec:  
  config:  
    aggregators: []  
    global: {}  
    inputs:  
      - Type: input_file  
        FilePaths:  
          - /tmp/openclaw/*.log  
        MaxDirSearchDepth: 0  
        FileEncoding: utf8  
        EnableContainerDiscovery: true  
        # Filter containers based on OpenClaw sandbox information.  
        ContainerFilters:  
          K8sPodRegex: ^(openclaw.*)$  
    processors:  
      - Type: processor_parse_json_native  
        SourceKey: content  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-app-log  
    sample: ''  
  # Replace this with the name of the sandbox machine group for your ACK or ACS cluster.  
  machineGroups:  
    - name: <your-sandbox-machine-group>  
  # The destination project for data collection.  
  project:  
    name: k8s-log-xxx  
  # The destination Logstore for data collection.  
  logstores:  
    - name: openclaw-app-log
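
Before applying a configuration like this, it can help to check which Pods and files the filters would select. The Python sketch below evaluates the `K8sPodRegex` pattern and the file path glob from the configuration above against sample names. The sample Pod names and paths are made up, and LoongCollector uses its own matching engines, so treat this as an approximation: for this simple anchored pattern the results should agree, but Python's `fnmatchcase` lets `*` cross directory separators, which the collector's `MaxDirSearchDepth: 0` setting would not.

```python
import re
from fnmatch import fnmatchcase

POD_REGEX = re.compile(r"^(openclaw.*)$")  # value of K8sPodRegex above
LOG_GLOB = "/tmp/openclaw/*.log"           # value of FilePaths above


def pod_selected(pod_name: str) -> bool:
    """Return True if the Pod name matches the container filter regex."""
    return POD_REGEX.match(pod_name) is not None


def file_selected(path: str) -> bool:
    """Return True if the file path matches the collection glob."""
    return fnmatchcase(path, LOG_GLOB)


for name in ("openclaw", "openclaw-7f9c", "my-openclaw"):
    print(name, pod_selected(name))

for path in ("/tmp/openclaw/openclaw-2025-01-31.log", "/tmp/openclaw/openclaw.json"):
    print(path, file_selected(path))
```

Note that the regex is anchored at the start, so a Pod whose name merely contains "openclaw" (such as the hypothetical `my-openclaw`) is not selected.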

OpenTelemetry

apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-otel-config  
spec:  
  config:  
    # This corresponds to the logstores below. It distributes and stores OpenTelemetry logs, metrics, and trace data.  
    aggregators:  
      - Type: aggregator_opentelemetry  
        MetricsLogstore: openclaw-otel-metrics  
        TraceLogstore: openclaw-otel-traces  
        LogLogstore: openclaw-otel-logs  
    global: {}  
    inputs:  
      - Type: service_otlp  
        Protocals:  
          HTTP:  
            # Corresponds to the diagnostics-otel Endpoint enabled in OpenClaw.  
            Endpoint: '127.0.0.1:4318'  
            ReadTimeoutSec: 10  
            ShutdownTimeoutSec: 5  
            MaxRecvMsgSizeMiB: 64  
    processors: []  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-otel-logs  
  # Replace with the sandbox machine group name for the ACK or ACS cluster.
  machineGroups:
    - name: <your-sandbox-machine-group>
  # The destination project for data collection.
  project:
    name: k8s-log-xxx
  # The destination Logstores for data collection. OpenTelemetry has three data types, so you must define three Logstores.
  # For metrics data, set telemetryType to Metrics.
  logstores:  
    - name: openclaw-otel-logs  
    - name: openclaw-otel-metrics  
      telemetryType: Metrics  
    - name: openclaw-otel-traces
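
Because the OpenClaw exporter endpoint and the LoongCollector listener endpoint are configured in two separate places, a mismatch silently drops all metrics and traces. The Python sketch below compares the two values using inline copies of the relevant fragments from the configurations in this article. The helper name `endpoints_match` is hypothetical, and the comparison is a plain string match on host and port, so it would not recognize a wildcard listener address such as 0.0.0.0.

```python
from urllib.parse import urlparse


def endpoints_match(openclaw_cfg: dict, otlp_input: dict) -> bool:
    """Compare the host:port of OpenClaw's OTLP endpoint with the collector's HTTP listener."""
    exporter = urlparse(openclaw_cfg["diagnostics"]["otel"]["endpoint"])
    listener = otlp_input["Protocals"]["HTTP"]["Endpoint"]
    return f"{exporter.hostname}:{exporter.port}" == listener


# Fragment of ~/.openclaw/openclaw.json from section 3.2.
openclaw_cfg = {
    "diagnostics": {"otel": {"endpoint": "http://127.0.0.1:4318", "protocol": "http/protobuf"}}
}

# Fragment of the service_otlp input from the pipeline configuration above.
otlp_input = {
    "Type": "service_otlp",
    "Protocals": {"HTTP": {"Endpoint": "127.0.0.1:4318"}},
}

print(endpoints_match(openclaw_cfg, otlp_input))
```

The loopback address works here because the injected LoongCollector container and the OpenClaw container share the Pod network.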

3.4 Summary: Fully Resolving OpenClaw Security Challenges

Sandbox runs OpenClaw securely and in isolation

● Each Sandbox runs in an isolated kernel environment, preventing malicious code from attacking host system programs.

● Each Sandbox uses an isolated temporary file system to prevent unauthorized reading, tampering, or deletion of host files.

LoongCollector enables full-stack observability for OpenClaw

| OpenClaw observability data type | Questions addressed |
| --- | --- |
| Session logs | What did the Agent do? Which tools were called, which parameters were passed, and what was the result of each step? |
| Application logs | Where did the system fail? For example, did a Webhook fail or a message queue get blocked? |
| OTel traces | What happened to a message from reception to response? How is the trace path linked? |
| OTel metrics | How much is it costing right now? Is latency within normal limits? Are there any frozen sessions? |

4. Summary and Outlook

The production-readiness of AI Agents is not a matter of "if," but "how." Security and observability are not optional—they are essential requirements.

If you are building an AI agent application:

Start now by prioritizing runtime security and observability.

Choose the right tools instead of reinventing the wheel.

Establish best practices and promote them within your team.

Continually learn and optimize to ensure your Agents create real value.

Both ACS Agent Sandbox and LoongCollector are open platforms; we invite you to try them and share your feedback. Together, let's build a more secure, reliable, and efficient production environment for AI Agents. We hope this article provides valuable reference and inspiration for your observability journey.
