Container Compute Service: Clone an Agent Sandbox with Checkpoint

Last Updated: Mar 31, 2026

When you develop complex AI Agents, you often need to save their running state to quickly reuse a sandbox environment. The Checkpoint feature allows you to use the E2B SDK to create a snapshot of a running container and then clone it. This process preserves both filesystem and memory data, reducing environment initialization costs.

Prerequisites

  1. Upgrade the acs-virtual-node component to v2.17.0 or later.

  2. Install related components.

Limitations

  1. The Checkpoint feature currently supports only ACS general-purpose computing.

  2. You can create a Checkpoint only after the Pod is in the Running and Ready state.

  3. Only one Checkpoint can be running for the same Pod at any given time. After a Checkpoint reaches a terminal state (Succeeded or Failed), you can create another one.

  4. After a Checkpoint task enters the Running state, you cannot interrupt it by deleting the Checkpoint resource.
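
Limitations 2 and 3 can be read as a precondition check before you request a Checkpoint. The following is a minimal sketch of that logic only; it does not call ACS or the E2B SDK. `PodState` and `can_create_checkpoint` are hypothetical names introduced for illustration, and treating every non-terminal Checkpoint state as blocking is an assumption based on limitation 3.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Terminal Checkpoint states named in limitation 3.
TERMINAL_PHASES = {"Succeeded", "Failed"}


@dataclass
class PodState:
    """Hypothetical, simplified view of a Pod (not an ACS or Kubernetes type)."""
    phase: str
    ready: bool
    # States of the Checkpoints that already target this Pod.
    checkpoint_phases: List[str] = field(default_factory=list)


def can_create_checkpoint(pod: PodState) -> Tuple[bool, str]:
    """Return (allowed, reason) according to limitations 2 and 3."""
    if pod.phase != "Running" or not pod.ready:
        return False, "Pod must be Running and Ready"
    # Assumption: any Checkpoint that is not yet terminal blocks a new one.
    if any(p not in TERMINAL_PHASES for p in pod.checkpoint_phases):
        return False, "another Checkpoint has not reached a terminal state"
    return True, "ok"


if __name__ == "__main__":
    print(can_create_checkpoint(PodState("Running", True, ["Succeeded"])))
    print(can_create_checkpoint(PodState("Running", True, ["Running"])))
```

A check like this only avoids requests that are known to be rejected; the cluster remains the source of truth for the actual Pod and Checkpoint states.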

Configure Checkpoint content retention

A Checkpoint can retain the following content:

  • Filesystem: Retained by default.

  • Memory: Optional.

By default, a Checkpoint created through the E2B SDK inherits the spec.persistentContents configuration from the original sandbox and automatically ignores the ip retention setting.

Example 1: Inherit filesystem and memory retention settings

# Original sandbox configuration
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
spec:
  persistentContents:
    - filesystem
    - memory
  ...
---
# Resulting Checkpoint configuration
apiVersion: agents.kruise.io/v1alpha1
kind: Checkpoint
spec:
  persistentContents: # Automatically inherits the original configuration
    - filesystem
    - memory
  ...

Example 2: Automatically filter out the ip retention setting

# Original sandbox configuration
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
spec:
  persistentContents:
    - ip
    - filesystem
  ...
---
# Resulting Checkpoint configuration
apiVersion: agents.kruise.io/v1alpha1
kind: Checkpoint
spec:
  persistentContents: # The ip setting is automatically removed, and only filesystem is retained.
    - filesystem
  ...
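
The two examples follow one rule: the Checkpoint copies the sandbox's persistentContents list and drops the ip entry. The following sketch restates that rule in Python for illustration. `derive_checkpoint_contents` is a hypothetical helper, not an SDK or controller function, and the real controller may apply additional validation.

```python
from typing import Iterable, List


def derive_checkpoint_contents(sandbox_contents: Iterable[str]) -> List[str]:
    """Copy the sandbox's retention settings, ignoring the ip setting."""
    return [item for item in sandbox_contents if item != "ip"]


# Example 1: filesystem and memory are inherited unchanged.
print(derive_checkpoint_contents(["filesystem", "memory"]))
# Example 2: the ip setting is removed and only filesystem is retained.
print(derive_checkpoint_contents(["ip", "filesystem"]))
```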

You can also use a SandboxSet to manage the retained content for all sandboxes created from its template:

apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  name: code-interpreter-fs
  namespace: default
spec:
  replicas: 2
  persistentContents:
    - filesystem # Retain only the filesystem, not the memory.
  ...

Snapshot and clone a sandbox

E2B SDK

Create the original sandbox

The following examples use two different SandboxSet templates, one that retains memory and one that does not, to demonstrate the restore behavior.

  1. Deploy the SandboxSets. Save each of the following manifests as a YAML file, and then run the kubectl apply -f <YAML_FILE> command for each file.

    code-interpreter-mem.yaml

    apiVersion: agents.kruise.io/v1alpha1
    kind: SandboxSet
    metadata:
      name: code-interpreter-mem
      namespace: default
    spec:
      # The size of the warm pool. We recommend setting this slightly larger than your estimated request burst.
      replicas: 2
      persistentContents: # Retain memory.
        - memory
        - filesystem
      template:
        metadata:
          labels:
            # Optional. Schedules the sandbox Pod to ACS in an ACK cluster.
            alibabacloud.com/acs: "true"
        spec:
          initContainers:
            # Declare agent-runtime as a sidecar container to automatically inject runtime components like envd into the sandbox container.
            - name: runtime
              image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/agent-runtime:v0.0.2
              command: [ "sh", "/workspace/entrypoint_inner.sh" ]
              volumeMounts:
                # Shared directory with the main container.
                - name: envd-volume
                  mountPath: /mnt/envd
              env:
                - name: ENVD_DIR
                  value: /mnt/envd
                # This environment variable allows the sidecar to share the resources of the main container without incurring additional costs.
                - name: __IGNORE_RESOURCE__
                  value: "true"
              restartPolicy: Always
          containers:
          - name: sandbox
            # The officially maintained e2b code-interpreter image. It supports pulling from any region over a VPC.
            image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/code-interpreter:v1.6
            imagePullPolicy: IfNotPresent
            # We recommend setting resource requests. Otherwise, the sandbox may be assigned a very small specification in the ACS environment, which can affect performance.
            resources:
              limits:
                cpu: 1
                memory: 1Gi
              requests:
                cpu: 1
                memory: 1Gi
            startupProbe:
              failureThreshold: 10
              httpGet:
                path: /health
                port: 49999
              initialDelaySeconds: 1
              periodSeconds: 2
              timeoutSeconds: 1
            env:
              # Specify the location for the envd components injected by the runtime.
              - name: ENVD_DIR
                value: /mnt/envd
            volumeMounts:
              # Shared directory with the runtime container.
              - name: envd-volume
                mountPath: /mnt/envd
            lifecycle:
              postStart:
                exec:
                  command: [ "/bin/bash", "-c", "/mnt/envd/envd-run.sh" ]
          # Ensures the container is terminated quickly, increasing the probability of reuse.
          terminationGracePeriodSeconds: 1
          volumes:
            - name: envd-volume
              emptyDir: { }

    code-interpreter-no-mem.yaml

    apiVersion: agents.kruise.io/v1alpha1
    kind: SandboxSet
    metadata:
      name: code-interpreter-no-mem
      namespace: default
    spec:
      # The size of the warm pool. We recommend setting this slightly larger than your estimated request burst.
      replicas: 2
      persistentContents: # Do not retain memory.
        - filesystem
      template:
        metadata:
          labels:
            # Optional. Schedules the sandbox Pod to ACS in an ACK cluster.
            alibabacloud.com/acs: "true"
        spec:
          initContainers:
            # Declare agent-runtime as a sidecar container to automatically inject runtime components like envd into the sandbox container.
            - name: runtime
              image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/agent-runtime:v0.0.2
              command: [ "sh", "/workspace/entrypoint_inner.sh" ]
              volumeMounts:
                # Shared directory with the main container.
                - name: envd-volume
                  mountPath: /mnt/envd
              env:
                - name: ENVD_DIR
                  value: /mnt/envd
                # This environment variable allows the sidecar to share the resources of the main container without incurring additional costs.
                - name: __IGNORE_RESOURCE__
                  value: "true"
              restartPolicy: Always
          containers:
          - name: sandbox
            # The officially maintained e2b code-interpreter image. It supports pulling from any region over a VPC.
            image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/code-interpreter:v1.6
            imagePullPolicy: IfNotPresent
            # We recommend setting resource requests. Otherwise, the sandbox may be assigned a very small specification in the ACS environment, which can affect performance.
            resources:
              limits:
                cpu: 1
                memory: 1Gi
              requests:
                cpu: 1
                memory: 1Gi
            startupProbe:
              failureThreshold: 10
              httpGet:
                path: /health
                port: 49999
              initialDelaySeconds: 1
              periodSeconds: 2
              timeoutSeconds: 1
            env:
              # Specify the location for the envd components injected by the runtime.
              - name: ENVD_DIR
                value: /mnt/envd
            volumeMounts:
              # Shared directory with the runtime container.
              - name: envd-volume
                mountPath: /mnt/envd
            lifecycle:
              postStart:
                exec:
                  command: [ "/bin/bash", "-c", "/mnt/envd/envd-run.sh" ]
          # Ensures the container is terminated quickly, increasing the probability of reuse.
          terminationGracePeriodSeconds: 1
          volumes:
            - name: envd-volume
              emptyDir: { }

  2. Create a sandbox using the E2B SDK. For more information, see Create an Agent Sandbox.

    # Import the E2B SDK
    from e2b_code_interpreter import Sandbox
    
    # Create a sandbox with memory retention enabled.
    sbx_with_mem = Sandbox.create("code-interpreter-mem")
    print(f"mem-sandbox id: {sbx_with_mem.sandbox_id}")
    # Create a sandbox that retains only the filesystem.
    sbx_no_mem = Sandbox.create("code-interpreter-no-mem")
    print(f"fs-sandbox id: {sbx_no_mem.sandbox_id}")

  3. Initialize the sandbox state by writing a memory variable and filesystem data.

    def init_mem_fs(sbx):
      sbx.run_code("a = 1") # Write a variable to memory.
      sbx.files.write("/my-file", "hello") # Write data to a file.
      
      # Verify that the data was written successfully.
      print(sbx.run_code("print(a)"))
      print(sbx.files.read("/my-file"))
    
    init_mem_fs(sbx_with_mem)
    init_mem_fs(sbx_no_mem)

Create Checkpoint

Replace the <YOUR_SANDBOX_..._ID> placeholders with your actual sandbox IDs, and then create a snapshot of each sandbox's current state.

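from e2b_code_interpreter import Sandbox

# Reconnect to the sandboxes created in the previous section.
sbx_with_mem = Sandbox.connect("<YOUR_SANDBOX_WITH_MEMORY_ID>")
sbx_no_mem = Sandbox.connect("<YOUR_SANDBOX_WITHOUT_MEMORY_ID>")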

snapshot_with_mem = sbx_with_mem.create_snapshot()
snapshot_no_mem = sbx_no_mem.create_snapshot(headers={
  "x-e2b-kruise-snapshot-keep-running": "true",                 # Specifies whether the sandbox continues to run after the Checkpoint is created. If set to false, the Pod's status changes to Succeeded. The default value is true.
  "x-e2b-kruise-snapshot-ttl": "30m",                           # The time-to-live (TTL) for the created Checkpoint. It is automatically deleted after this period. If not set, the Checkpoint persists until manually deleted.
  "x-e2b-kruise-snapshot-persistent-contents": "filesystem",    # The content to retain in the Checkpoint. By default, it inherits the retention settings of the sandbox. Currently, only `filesystem` and the combination of `memory` and `filesystem` are supported.
  "x-e2b-kruise-snapshot-wait-success-seconds": "60",           # The timeout in seconds for Checkpoint creation to complete. The default value is 60.
})

print(f"Snapshot ID with memory: {snapshot_with_mem.snapshot_id}")
print(f"Snapshot ID without memory: {snapshot_no_mem.snapshot_id}")

# After the Checkpoint is created, you can safely kill the original sandbox.
sbx_with_mem.kill()
sbx_no_mem.kill()

Header parameters

The ack-sandbox-manager supports custom headers to extend Checkpoint capabilities:

| Parameter | Description | Default |
| --- | --- | --- |
| x-e2b-kruise-snapshot-keep-running | Specifies whether the sandbox continues to run after the Checkpoint is created. If set to false, the Pod status changes to Succeeded. | true |
| x-e2b-kruise-snapshot-ttl | The time-to-live (TTL) for the Checkpoint. The Checkpoint is automatically deleted after this period (for example, 30m). If not set, it persists indefinitely. | None |
| x-e2b-kruise-snapshot-persistent-contents | Manually overrides the content that the Checkpoint retains. Supported values are `filesystem` and `memory,filesystem`. | Inherits from the sandbox configuration |
| x-e2b-kruise-snapshot-wait-success-seconds | The timeout in seconds for Checkpoint creation to complete. | 60 |
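
Because every header value is a string, it can be convenient to build the headers from typed arguments and send only the ones you set. The following is a minimal sketch. `build_snapshot_headers` is a hypothetical helper, not part of the E2B SDK or ack-sandbox-manager. The header names come from the parameter list above, and rejecting any persistent-contents value other than the two listed spellings is an assumption.

```python
from typing import Dict, Optional

# Values listed as supported for x-e2b-kruise-snapshot-persistent-contents.
SUPPORTED_CONTENTS = {"filesystem", "memory,filesystem"}


def build_snapshot_headers(
    keep_running: Optional[bool] = None,
    ttl: Optional[str] = None,
    persistent_contents: Optional[str] = None,
    wait_success_seconds: Optional[int] = None,
) -> Dict[str, str]:
    """Return only the headers that were explicitly set; omitted ones keep their defaults."""
    headers: Dict[str, str] = {}
    if keep_running is not None:
        headers["x-e2b-kruise-snapshot-keep-running"] = "true" if keep_running else "false"
    if ttl is not None:
        headers["x-e2b-kruise-snapshot-ttl"] = ttl  # for example "30m"
    if persistent_contents is not None:
        # Assumption: only the two spellings listed above are accepted.
        if persistent_contents not in SUPPORTED_CONTENTS:
            raise ValueError(f"unsupported persistent contents: {persistent_contents!r}")
        headers["x-e2b-kruise-snapshot-persistent-contents"] = persistent_contents
    if wait_success_seconds is not None:
        headers["x-e2b-kruise-snapshot-wait-success-seconds"] = str(wait_success_seconds)
    return headers


if __name__ == "__main__":
    print(build_snapshot_headers(ttl="30m", persistent_contents="filesystem"))
```

The resulting dictionary can be passed as the `headers` argument of `create_snapshot`, as in the example in the previous section.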

Clone a sandbox from a Checkpoint

  1. To clone a sandbox, pass the snapshot ID returned in the previous step as the template parameter to the create API. Standard extensions such as timeout, auto_pause, and CSI mounts still apply.

    # Use the snapshot ID as a template to create a new sandbox.
    clone_with_mem = Sandbox.create("<YOUR_SNAPSHOT_WITH_MEMORY_ID>")
    clone_no_mem = Sandbox.create("<YOUR_SNAPSHOT_WITHOUT_MEMORY_ID>")

  2. Check the data in the cloned sandbox to verify the restore results.

    # Verify the clone that has memory retention enabled.
    print(clone_with_mem.run_code("print(a)"))
    print(clone_with_mem.files.read("/my-file"))
    # Verify the clone that retains only the filesystem.
    print(clone_no_mem.run_code("print(a)"))
    print(clone_no_mem.files.read("/my-file"))

    Expected output:

    Execution(Results: [], Logs: Logs(stdout: ['1\n'], stderr: []), Error: None)
    hello
    Execution(Results: [], Logs: Logs(stdout: [], stderr: []), Error: ExecutionError(name='NameError', value="name 'a' is not defined", traceback="---------------------------------------------------------------------------NameError                                 Traceback (most recent call last)Cell In[1], line 3\n      1 import os; os.environ['E2B_SANDBOX'] = 'true'\n----> 3 print(a)\nNameError: name 'a' is not defined"))
    hello

    • Both cloned sandbox instances correctly restore the /my-file file in the filesystem.

    • Only clone_with_mem successfully restores the memory variable a.
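
To check the restore result in code rather than by reading the printed output, you can inspect the object that `run_code` returns. The following is a minimal sketch that assumes the returned object exposes `error` and `logs.stdout` attributes, as the printed `Execution(...)` representation above suggests. `variable_restored` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins shaped like the two results in the expected output, so the sketch runs without a sandbox.

```python
from types import SimpleNamespace
from typing import Any


def variable_restored(execution: Any, expected_stdout: str) -> bool:
    """Return True if the cell ran without error and printed the expected text."""
    if execution.error is not None:
        return False
    return "".join(execution.logs.stdout).strip() == expected_stdout


# Stand-ins shaped like the two Execution values in the expected output above.
restored = SimpleNamespace(error=None, logs=SimpleNamespace(stdout=["1\n"], stderr=[]))
lost = SimpleNamespace(
    error=SimpleNamespace(name="NameError", value="name 'a' is not defined"),
    logs=SimpleNamespace(stdout=[], stderr=[]),
)

print(variable_restored(restored, "1"))  # True
print(variable_restored(lost, "1"))      # False
```

With real sandboxes, you would pass `clone_with_mem.run_code("print(a)")` and `clone_no_mem.run_code("print(a)")` in place of the stand-ins.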

Sandbox CR

Create a sandbox

Save the following content as a sandbox.yaml file, and then run the kubectl apply -f sandbox.yaml command.

apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
  name: code-demo
spec:
  template: 
    metadata:
      labels:
        agent: code-demo
        # Use ACS computing resources.
        alibabacloud.com/acs: "true"
    spec:
      automountServiceAccountToken: false
      containers:
      - name: my-session
        image: registry-ap-southeast-1.ack.aliyuncs.com/acs/code-interpreter:v1.6
        env:
        - name: GODEBUG
          value: multipathtcp=0
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            ephemeral-storage: "30Gi" # Declare 30 GiB of storage space.
        ports:
        - containerPort: 49999
          name: interpreter

Create a Checkpoint

  1. Create a snapshot of the target sandbox by creating a Checkpoint CR. Save the following content as a sandbox-checkpoint.yaml file, and then run the kubectl apply -f sandbox-checkpoint.yaml command.

    apiVersion: agents.kruise.io/v1alpha1
    kind: Checkpoint
    metadata:
      name: checkpoint-code-demo
      namespace: default
    spec:
      # The name of the target Pod.
      podName: code-demo
      # Specifies whether the Pod should remain in the Running state after the Checkpoint is created. If set to false, the Pod status changes to Succeeded.
      keepRunning: true
      # The time-to-live (TTL) for the Checkpoint. After this duration, the Checkpoint resource is automatically deleted. For example: 30m, 30h, 30d.
      # If not specified, the resource persists until you manually delete the Checkpoint CR.
      ttlAfterFinished: 30h
      # The content to preserve. Currently, only `filesystem` or the combination of `memory` and `filesystem` is supported.
      # If not specified, both `memory` and `filesystem` are preserved by default.
      persistentContents:
      - memory
      - filesystem

  2. View the checkpointId.

    kubectl get checkpoint checkpoint-code-demo -n default -o jsonpath='{.status.checkpointId}'
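
If you script this step, you can read the same field from the JSON form of the resource. The following is a minimal sketch. `checkpoint_id_from_json` is a hypothetical helper, and the sample document is illustrative and trimmed to the fields used here. Only the `.status.checkpointId` path is taken from the command above.

```python
import json
from typing import Optional


def checkpoint_id_from_json(raw: str) -> Optional[str]:
    """Return .status.checkpointId, or None if the field is not populated yet."""
    obj = json.loads(raw)
    return (obj.get("status") or {}).get("checkpointId") or None


# Illustrative output of `kubectl get checkpoint ... -o json`, trimmed to the relevant fields.
sample = json.dumps({
    "kind": "Checkpoint",
    "metadata": {"name": "checkpoint-code-demo", "namespace": "default"},
    "status": {"checkpointId": "example-id"},
})

print(checkpoint_id_from_json(sample))  # example-id
```

In practice, the raw JSON could come from running `kubectl get checkpoint checkpoint-code-demo -n default -o json`, for example through `subprocess.run`. A return value of None indicates that the ID is not available yet.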

Clone a new sandbox

  1. Replace <CHECKPOINT_ID> with the checkpointId from the previous step. Save the following content as a sandbox-clone.yaml file, and then run the kubectl apply -f sandbox-clone.yaml command.

    apiVersion: agents.kruise.io/v1alpha1
    kind: Sandbox
    metadata:
      name: code-demo-clone
    spec:
      template:
        metadata:
          labels:
            agent: code-demo-clone
            # Use ACS computing resources.
            alibabacloud.com/acs: "true"
          annotations:
            # You must configure this annotation. Otherwise, you cannot create a Checkpoint for the Pod.
            ops.alibabacloud.com/pause-enabled: "true"
            # Replace with the correct Checkpoint ID.
            checkpoint.alibabacloud.com/restore-from: "<CHECKPOINT_ID>"
        spec: # The spec of the cloned sandbox must be consistent with that of the original Pod.
          automountServiceAccountToken: false
          containers:
          - name: my-session
            image: registry-ap-southeast-1.ack.aliyuncs.com/acs/code-interpreter:v1.6
            env:
            - name: GODEBUG
              value: multipathtcp=0
            resources:
              requests:
                cpu: 1
                memory: 1Gi
                ephemeral-storage: "30Gi" # Declare 30 GiB of storage space.
            ports:
            - containerPort: 49999
              name: interpreter

  2. View the status of the Sandbox resource and its corresponding Pod.

    kubectl get sandbox/code-demo-clone pod/code-demo-clone -o wide

    Expected output:

    NAME                                       STATUS    AGE   SHUTDOWN_TIME   PAUSE_TIME   MESSAGE
    sandbox.agents.kruise.io/code-demo-clone   Running   71m
    
    NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE                            NOMINATED NODE   READINESS GATES
    pod/code-demo-clone   1/1     Running   0          71m   172.16.x.xx   virtual-kubelet-cn-hangzhou-h   <none>           <none>
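
Because the spec of the cloned sandbox must be consistent with that of the original Pod, one option is to derive the clone manifest in step 1 from the original manifest instead of maintaining two copies by hand. The following is a minimal sketch. `build_clone_manifest` is a hypothetical helper, the `original` dictionary is the sandbox from "Create a sandbox" trimmed to a few fields, and the annotation keys are the ones used in sandbox-clone.yaml.

```python
import copy
import json
from typing import Any, Dict


def build_clone_manifest(original: Dict[str, Any], clone_name: str, checkpoint_id: str) -> Dict[str, Any]:
    """Copy a Sandbox manifest and add the annotations used to restore from a Checkpoint."""
    clone = copy.deepcopy(original)  # leave the original manifest untouched
    clone["metadata"]["name"] = clone_name
    template_meta = clone["spec"]["template"].setdefault("metadata", {})
    template_meta.setdefault("labels", {})["agent"] = clone_name
    annotations = template_meta.setdefault("annotations", {})
    annotations["ops.alibabacloud.com/pause-enabled"] = "true"
    annotations["checkpoint.alibabacloud.com/restore-from"] = checkpoint_id
    return clone


# The original Sandbox, trimmed to a few fields for brevity.
original = {
    "apiVersion": "agents.kruise.io/v1alpha1",
    "kind": "Sandbox",
    "metadata": {"name": "code-demo"},
    "spec": {
        "template": {
            "metadata": {"labels": {"agent": "code-demo", "alibabacloud.com/acs": "true"}},
            "spec": {
                "automountServiceAccountToken": False,
                "containers": [{
                    "name": "my-session",
                    "image": "registry-ap-southeast-1.ack.aliyuncs.com/acs/code-interpreter:v1.6",
                }],
            },
        }
    },
}

clone = build_clone_manifest(original, "code-demo-clone", "<CHECKPOINT_ID>")
print(json.dumps(clone, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, which accepts JSON manifests as well as YAML. The Pod spec is copied verbatim, so it cannot drift from the original.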

Related documentation

Create an Agent Sandbox