When you develop complex AI Agents, you often need to save their running state to quickly reuse a sandbox environment. The Checkpoint feature allows you to use the E2B SDK to create a snapshot of a running container and then clone it. This process preserves both filesystem and memory data, reducing environment initialization costs.
Prerequisites
Upgrade the acs-virtual-node component to v2.17.0 or later.
Install related components.
Limitations
The Checkpoint feature currently supports only ACS general-purpose computing.
You can create a Checkpoint only after the Pod is in the Running and Ready state.
Only one Checkpoint can be running for the same Pod at any given time. After a Checkpoint reaches a terminal state (Succeeded or Failed), you can create another one.
After a Checkpoint task enters the Running state, you cannot interrupt it by deleting the Checkpoint resource.
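The limitations above amount to a simple admission check. The following Python sketch is illustrative only: the function name `can_create_checkpoint` and its arguments are hypothetical and are not part of the E2B SDK or the Checkpoint API. Only the state names (Running, Ready, Succeeded, Failed) come from the list above.

```python
from typing import Iterable

# Terminal Checkpoint states named in the limitations above.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def can_create_checkpoint(pod_phase: str, pod_ready: bool,
                          existing_checkpoint_phases: Iterable[str]) -> bool:
    """Hypothetical helper: return True if a new Checkpoint may be created."""
    # The Pod must be both Running and Ready.
    if pod_phase != "Running" or not pod_ready:
        return False
    # Every earlier Checkpoint for this Pod must have reached a terminal state.
    return all(phase in TERMINAL_PHASES for phase in existing_checkpoint_phases)


print(can_create_checkpoint("Running", True, []))                        # True
print(can_create_checkpoint("Running", True, ["Succeeded", "Running"]))  # False
print(can_create_checkpoint("Pending", False, []))                       # False
```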
Configure Checkpoint content retention
A Checkpoint can retain the following content:
Filesystem: Retained by default.
Memory: Optional.
By default, a Checkpoint created through the E2B SDK inherits the spec.persistentContents configuration from the original sandbox and automatically ignores the ip retention setting.
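As a rough illustration of this inheritance rule, the sketch below resolves the content list that a Checkpoint would retain. The function `resolve_persistent_contents` is hypothetical and only models the behavior described in this document (inherit from the sandbox, drop `ip`, allow `filesystem` alone or `memory` together with `filesystem`). The actual resolution is performed by the service, not by client code.

```python
from typing import Iterable, List, Optional

# Combinations this document lists as supported for a Checkpoint.
SUPPORTED = [{"filesystem"}, {"memory", "filesystem"}]


def resolve_persistent_contents(sandbox_contents: Iterable[str],
                                override: Optional[Iterable[str]] = None) -> List[str]:
    """Hypothetical model of which contents a Checkpoint retains."""
    # An explicit per-snapshot setting wins; otherwise inherit the sandbox setting.
    requested = set(override if override is not None else sandbox_contents)
    # The ip retention setting is ignored for Checkpoints.
    requested.discard("ip")
    if requested not in SUPPORTED:
        raise ValueError(f"unsupported persistentContents: {sorted(requested)}")
    return sorted(requested)


print(resolve_persistent_contents(["ip", "memory", "filesystem"]))
# ['filesystem', 'memory']
print(resolve_persistent_contents(["memory", "filesystem"], override=["filesystem"]))
# ['filesystem']
```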
You can also use a SandboxSet to manage the retained content for all sandboxes created from its template:
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  name: code-interpreter-fs
  namespace: default
spec:
  replicas: 2
  persistentContents:
  - filesystem # Retain only the filesystem, not the memory.
  ...
Snapshot and clone a sandbox
E2B SDK
Create the original sandbox
The following examples use two different SandboxSet templates, one that retains memory and one that does not, to demonstrate the restore behavior.
Deploy the SandboxSet. Save the following content as YAML files and run the kubectl apply -f <YAML_FILE> command.
Create a sandbox using the E2B SDK. For more information, see Create an Agent Sandbox.
# Import the E2B SDK
from e2b_code_interpreter import Sandbox

# Create a sandbox with memory retention enabled.
sbx_with_mem = Sandbox.create("code-interpreter-mem")
print(f"mem-sandbox id: {sbx_with_mem.sandbox_id}")

# Create a sandbox that retains only the filesystem.
sbx_no_mem = Sandbox.create("code-interpreter-no-mem")
print(f"fs-sandbox id: {sbx_no_mem.sandbox_id}")
Initialize the sandbox state by writing a memory variable and filesystem data.
def init_mem_fs(sbx):
    sbx.run_code("a = 1")  # Write a variable to memory.
    sbx.files.write("/my-file", "hello")  # Write data to a file.
    # Verify that the data was written successfully.
    print(sbx.run_code("print(a)"))
    print(sbx.files.read("/my-file"))

init_mem_fs(sbx_with_mem)
init_mem_fs(sbx_no_mem)
Create Checkpoint
Replace <YOUR_SANDBOX_..._ID> with your actual sandbox ID to create a snapshot of the sandbox's current state.
sbx_with_mem = Sandbox.connect("<YOUR_SANDBOX_WITH_MEMORY_ID>")
sbx_no_mem = Sandbox.connect("<YOUR_SANDBOX_WITHOUT_MEMORY_ID>")
snapshot_with_mem = sbx_with_mem.create_snapshot()
snapshot_no_mem = sbx_no_mem.create_snapshot(headers={
    # Specifies whether the sandbox continues to run after the Checkpoint is created.
    # If set to false, the Pod's status changes to Succeeded. The default value is true.
    "x-e2b-kruise-snapshot-keep-running": "true",
    # The time-to-live (TTL) for the created Checkpoint. It is automatically deleted after this period.
    # If not set, the Checkpoint persists until manually deleted.
    "x-e2b-kruise-snapshot-ttl": "30m",
    # The content to retain in the Checkpoint. By default, it inherits the retention settings of the sandbox.
    # Currently, only `filesystem` and the combination of `memory` and `filesystem` are supported.
    "x-e2b-kruise-snapshot-persistent-contents": "filesystem",
    # The timeout in seconds for Checkpoint creation to complete. The default value is 60.
    "x-e2b-kruise-snapshot-wait-success-seconds": "60",
})
print(f"Snapshot ID with memory: {snapshot_with_mem.snapshot_id}")
print(f"Snapshot ID without memory: {snapshot_no_mem.snapshot_id}")
# After the Checkpoint is created, you can safely kill the original sandbox.
sbx_with_mem.kill()
sbx_no_mem.kill()
Clone a sandbox from a Checkpoint
To clone a sandbox, pass the snapshot ID returned in the previous step as the template parameter to the create API. Standard extensions such as timeout, auto_pause, and CSI mounts still apply.
# Use the snapshot ID as a template to create a new sandbox.
clone_with_mem = Sandbox.create("<YOUR_SNAPSHOT_WITH_MEMORY_ID>")
clone_no_mem = Sandbox.create("<YOUR_SNAPSHOT_WITHOUT_MEMORY_ID>")
Check the data in the cloned sandbox to verify the restore results.
# Verify the clone that has memory retention enabled.
print(clone_with_mem.run_code("print(a)"))
print(clone_with_mem.files.read("/my-file"))
# Verify the clone that retains only the filesystem.
print(clone_no_mem.run_code("print(a)"))
print(clone_no_mem.files.read("/my-file"))
Expected output:
Execution(Results: [], Logs: Logs(stdout: ['1\n'], stderr: []), Error: None)
hello
Execution(Results: [], Logs: Logs(stdout: [], stderr: []), Error: ExecutionError(name='NameError', value="name 'a' is not defined", traceback="---------------------------------------------------------------------------NameError Traceback (most recent call last)Cell In[1], line 3\n 1 import os; os.environ['E2B_SANDBOX'] = 'true'\n----> 3 print(a)\nNameError: name 'a' is not defined"))
hello
Both cloned sandbox instances correctly restore the /my-file file in the filesystem.
Only clone_with_mem successfully restores the memory variable a.
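The same check can be automated instead of read off the printed output. The sketch below assumes that the Execution object exposes `error` and `logs.stdout` attributes, as the printed representation above suggests. `restored_value` is a hypothetical helper, and the stand-in objects exist only so that the snippet runs without a sandbox.

```python
from types import SimpleNamespace
from typing import Optional


def restored_value(execution) -> Optional[str]:
    """Hypothetical helper: return the printed value, or None if the run failed."""
    # A failed run (for example, a NameError) carries a non-None error.
    if execution.error is not None:
        return None
    return "".join(execution.logs.stdout).strip()


# Stand-ins shaped like the two Execution results in the expected output.
with_mem = SimpleNamespace(error=None,
                           logs=SimpleNamespace(stdout=["1\n"], stderr=[]))
no_mem = SimpleNamespace(error=SimpleNamespace(name="NameError"),
                         logs=SimpleNamespace(stdout=[], stderr=[]))

print(restored_value(with_mem))  # 1
print(restored_value(no_mem))    # None
```

Under the same assumption, calling `restored_value(clone_with_mem.run_code("print(a)"))` against a live clone would be expected to return "1", and the filesystem-only clone would return None.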
Sandbox CR
Create a sandbox
Save the following content as a sandbox.yaml file, and then run the kubectl apply -f sandbox.yaml command.
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
  name: code-demo
spec:
  template:
    metadata:
      labels:
        agent: code-demo
        # Use ACS computing resources.
        alibabacloud.com/acs: "true"
    spec:
      automountServiceAccountToken: false
      containers:
      - name: my-session
        image: registry-ap-southeast-1.ack.aliyuncs.com/acs/code-interpreter:v1.6
        env:
        - name: GODEBUG
          value: multipathtcp=0
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            ephemeral-storage: "30Gi" # Declare 30 GiB of storage space.
        ports:
        - containerPort: 49999
          name: interpreter
Create a Checkpoint
Create a snapshot of the target sandbox by creating a Checkpoint CR. Save the following content as a sandbox-checkpoint.yaml file, and then run the kubectl apply -f sandbox-checkpoint.yaml command.
apiVersion: agents.kruise.io/v1alpha1
kind: Checkpoint
metadata:
  name: checkpoint-code-demo
  namespace: default
spec:
  # The name of the target Pod.
  podName: code-demo
  # Specifies whether the Pod should remain in the Running state after the Checkpoint is created. If set to false, the Pod status changes to Succeeded.
  keepRunning: true
  # The time-to-live (TTL) for the Checkpoint. After this duration, the Checkpoint resource is automatically deleted. For example: 30m, 30h, 30d.
  # If not specified, the resource persists until you manually delete the Checkpoint CR.
  ttlAfterFinished: 30h
  # The content to preserve. Currently, only `filesystem` or a combination of `memory` and `filesystem` are supported.
  # If not specified, both `memory` and `filesystem` are preserved by default.
  persistentContents:
  - memory
  - filesystem
View the checkpointId.
kubectl get checkpoint checkpoint-code-demo -n default -o jsonpath='{.status.checkpointId}'
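If you create Checkpoints from a script, the manifest can be generated rather than edited by hand. The sketch below builds the same Checkpoint CR as a Python dictionary and prints it as JSON, which kubectl apply -f also accepts. `build_checkpoint_manifest` is a hypothetical helper; the field names are taken from the manifest above, and the supported-contents check mirrors the comment in that manifest.

```python
import json
from typing import List, Optional


def build_checkpoint_manifest(name: str, pod_name: str, namespace: str = "default",
                              keep_running: bool = True,
                              ttl: Optional[str] = None,
                              contents: Optional[List[str]] = None) -> dict:
    """Hypothetical helper: build a Checkpoint CR as a dict."""
    spec = {"podName": pod_name, "keepRunning": keep_running}
    if ttl is not None:
        # Omitting ttlAfterFinished keeps the Checkpoint until it is deleted manually.
        spec["ttlAfterFinished"] = ttl
    if contents is not None:
        if set(contents) not in ({"filesystem"}, {"memory", "filesystem"}):
            raise ValueError(f"unsupported persistentContents: {contents}")
        spec["persistentContents"] = list(contents)
    return {
        "apiVersion": "agents.kruise.io/v1alpha1",
        "kind": "Checkpoint",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


manifest = build_checkpoint_manifest("checkpoint-code-demo", "code-demo",
                                     ttl="30h", contents=["memory", "filesystem"])
print(json.dumps(manifest, indent=2))
```

Writing the printed JSON to a file and passing that file to kubectl apply -f should be equivalent to applying the YAML version, assuming the field names above are correct for your cluster version.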
Clone a new sandbox
Replace <CHECKPOINT_ID> with the checkpointId from the previous step. Save the following content as a sandbox-clone.yaml file, and then run the kubectl apply -f sandbox-clone.yaml command.
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
  name: code-demo-clone
spec:
  template:
    metadata:
      labels:
        agent: code-demo-clone
        # Use ACS computing resources.
        alibabacloud.com/acs: "true"
      annotations:
        # You must configure this annotation. Otherwise, you cannot create a Checkpoint for the Pod.
        ops.alibabacloud.com/pause-enabled: "true"
        # Replace with the correct Checkpoint ID.
        checkpoint.alibabacloud.com/restore-from: "<CHECKPOINT_ID>"
    spec:
      # The spec of the cloned sandbox must be consistent with that of the original Pod.
      automountServiceAccountToken: false
      containers:
      - name: my-session
        image: registry-ap-southeast-1.ack.aliyuncs.com/acs/code-interpreter:v1.6
        env:
        - name: GODEBUG
          value: multipathtcp=0
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            ephemeral-storage: "30Gi" # Declare 30 GiB of storage space.
        ports:
        - containerPort: 49999
          name: interpreter
View the status of the Sandbox resource and its corresponding Pod.
kubectl get sandbox/code-demo-clone pod/code-demo-clone -o wide
Expected output:
NAME                                       STATUS    AGE   SHUTDOWN_TIME   PAUSE_TIME   MESSAGE
sandbox.agents.kruise.io/code-demo-clone   Running   71m

NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE                            NOMINATED NODE   READINESS GATES
pod/code-demo-clone   1/1     Running   0          71m   172.16.x.xx   virtual-kubelet-cn-hangzhou-h   <none>           <none>
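Because the clone must keep the original Pod spec, one way to avoid drift is to derive the clone manifest from the original programmatically. In this sketch, `make_clone_manifest` is a hypothetical helper that deep-copies a Sandbox manifest (held as a dictionary), renames it, and adds the two annotations shown above. The abbreviated `original` dictionary stands in for the full sandbox.yaml content.

```python
import copy


def make_clone_manifest(original: dict, clone_name: str, checkpoint_id: str) -> dict:
    """Hypothetical helper: derive a clone Sandbox manifest from the original."""
    clone = copy.deepcopy(original)  # leave the original manifest untouched
    clone["metadata"]["name"] = clone_name
    template_meta = clone["spec"]["template"].setdefault("metadata", {})
    template_meta.setdefault("labels", {})["agent"] = clone_name
    annotations = template_meta.setdefault("annotations", {})
    annotations["ops.alibabacloud.com/pause-enabled"] = "true"
    annotations["checkpoint.alibabacloud.com/restore-from"] = checkpoint_id
    return clone


# Abbreviated stand-in for the original Sandbox manifest.
original = {
    "apiVersion": "agents.kruise.io/v1alpha1",
    "kind": "Sandbox",
    "metadata": {"name": "code-demo"},
    "spec": {"template": {
        "metadata": {"labels": {"agent": "code-demo", "alibabacloud.com/acs": "true"}},
        "spec": {"containers": [{"name": "my-session"}]},
    }},
}

clone = make_clone_manifest(original, "code-demo-clone", "<CHECKPOINT_ID>")
print(clone["spec"]["template"]["metadata"]["annotations"])
# The Pod spec is carried over unchanged.
print(clone["spec"]["template"]["spec"] == original["spec"]["template"]["spec"])  # True
```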
Related documentation
Create an Agent Sandbox