ACS Agent Sandbox exposes Prometheus metrics for instance lifecycle, resource status, and runtimes via its two core components: Sandbox Controller and Sandbox Manager. You can collect these metrics with either Managed Service for Prometheus or a self-hosted Prometheus instance and visualize them on a Grafana dashboard.
Prerequisites
On your cluster's Add-ons page, ensure the following components meet the minimum version requirements:
ack-agent-sandbox-controller: v0.5.14 or later.
ack-sandbox-manager: v0.6.1 or later.
Enable Prometheus monitoring for Agent Sandbox
Alibaba Cloud Prometheus
Go to the ARMS Prometheus Integration Management page. In the upper-left corner, select the region of your cluster. On the Integrated Environments tab, locate your cluster and click its instance name to open the instance details page.
On the instance details page, click Add Integration next to Addon Type. In the panel that appears, find and click Agent Sandbox Monitoring, keep the default integration name, and then click OK.
Self-managed Prometheus
Configure scrape rules
Agent Sandbox monitoring involves scraping metrics from two components:
Sandbox Controller: Exposes metrics through the /metrics endpoint on the Kubernetes API server.
Sandbox Manager: Deployed in the sandbox-system namespace and exposes metrics on HTTP port 8080 at the /metrics path.
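Before you add a scrape job, it can help to confirm that an endpoint actually serves metrics. The following Python sketch parses Prometheus text exposition output and lists the sample names it finds. It assumes that you have made the Sandbox Manager endpoint reachable locally, for example at http://localhost:8080/metrics through a port-forward. That URL and the function names are illustrative assumptions, not part of the product.

```python
import re
import urllib.request

# Matches the sample name at the start of a line, e.g. `foo_total{a="b"} 1`.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text: str) -> list[str]:
    """Return the sorted, de-duplicated sample names in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def fetch_metric_names(url: str = "http://localhost:8080/metrics") -> list[str]:
    """Fetch a /metrics endpoint and list its sample names.

    The default URL is an assumption (a local port-forward to Sandbox Manager).
    """
    with urllib.request.urlopen(url, timeout=10) as response:
        return metric_names(response.read().decode("utf-8"))
```

Calling `fetch_metric_names()` with the endpoint forwarded prints nothing by itself; iterate over the returned list to inspect the names. Histogram metrics appear as separate `_bucket`, `_sum`, and `_count` samples in this output.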
Open Source Prometheus
In prometheus.yml, add the following scrape jobs to the scrape_configs section.
Sandbox Controller
scrape_configs:
  - job_name: agent-sandbox-controller
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: https
    honor_labels: true
    honor_timestamps: true
    params:
      hosting: ["true"]
      job: ["agent-sandbox-controller"]
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [default]
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: false
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      server_name: kubernetes
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: apiserver
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_service_label_provider]
        separator: ;
        regex: kubernetes
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: https
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Node;(.*)
        target_label: node
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
        separator: ;
        regex: Pod;(.*)
        target_label: pod
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: service
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: (.*)
        target_label: job
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_service_label_component]
        separator: ;
        regex: (.+)
        target_label: job
        replacement: ${1}
        action: replace
      - separator: ;
        regex: (.*)
        target_label: endpoint
        replacement: https
        action: replace
Sandbox Manager
scrape_configs:
  - job_name: sandbox-manager
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: /metrics
    scheme: http
    honor_labels: true
    honor_timestamps: true
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - sandbox-system
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_endpoint_port_name
        separator: ;
        regex: manager
        replacement: $1
        action: keep
Prometheus Operator
The community edition of Prometheus Operator uses the ServiceMonitor custom resource to define scrape rules.
Sandbox Controller
Save the following content as sandbox-controller-servicemonitor.yaml.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: ack-prometheus-operator # Change this as needed based on the labelSelector configuration of your Prometheus Operator.
  name: sandbox-controller
  namespace: monitoring
spec:
  endpoints:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      bearerTokenSecret:
        key: ''
      honorLabels: true
      honorTimestamps: true
      interval: 30s
      params:
        hosting:
          - 'true'
        job:
          - agent-sandbox-controller
      path: /metrics
      port: https
      relabelings:
        - action: keep
          regex: https
          sourceLabels:
            - __meta_kubernetes_endpoint_port_name
        - action: replace
          sourceLabels:
            - __meta_kubernetes_namespace
          targetLabel: namespace
        - action: replace
          regex: Node;(.*)
          replacement: '${1}'
          separator: ;
          sourceLabels:
            - __meta_kubernetes_endpoint_address_target_kind
            - __meta_kubernetes_endpoint_address_target_name
          targetLabel: node
        - action: replace
          regex: Pod;(.*)
          replacement: '${1}'
          separator: ;
          sourceLabels:
            - __meta_kubernetes_endpoint_address_target_kind
            - __meta_kubernetes_endpoint_address_target_name
          targetLabel: pod
        - action: replace
          sourceLabels:
            - __meta_kubernetes_service_name
          targetLabel: service
        - action: replace
          regex: ^$
          sourceLabels:
            - __meta_kubernetes_service_label_component
          targetLabel: __tmp_job_fallback
        - action: replace
          regex: (.+);
          replacement: '${1}'
          separator: ;
          sourceLabels:
            - __meta_kubernetes_service_name
            - __meta_kubernetes_service_label_component
          targetLabel: job
        - action: replace
          replacement: https
          targetLabel: endpoint
      scheme: https
      scrapeTimeout: 30s
      tlsConfig:
        ca: {}
        caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        cert: {}
        serverName: kubernetes
  jobLabel: component
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      component: apiserver
      provider: kubernetes
Create the ServiceMonitor resource.
kubectl apply -f sandbox-controller-servicemonitor.yaml
Sandbox Manager
Save the following ServiceMonitor content as a YAML file and run kubectl apply -f to create the resource.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: ack-prometheus-operator # This must match the serviceMonitorSelector of your Prometheus Operator for discovery and scraping to work. Change this value as needed based on your actual labelSelector.
    app.kubernetes.io/instance: ack-sandbox-manager
    app.kubernetes.io/name: ack-sandbox-manager
    component: sandbox-manager
  name: sandbox-manager
  namespace: sandbox-system
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: manager
  namespaceSelector:
    matchNames:
      - sandbox-system
  selector:
    matchLabels:
      app.kubernetes.io/instance: ack-sandbox-manager
      app.kubernetes.io/name: ack-sandbox-manager
      component: sandbox-manager
View monitoring dashboards
Alibaba Cloud Prometheus
Log on to the Alibaba Cloud console. In the left navigation bar, select Operations > Prometheus Monitoring. On the Others tab, you can view the following Sandbox monitoring dashboards.
Sandbox Instance: View the status, lifecycle, and resource usage of specific Sandbox instances.
Sandbox Controller: View cloud-side lifecycle management for Sandbox resources, including resource statistics and management performance.
Sandbox Manager: View the execution status of Sandbox resource declarations, such as the execution performance of the E2B protocol.
Self-managed Prometheus
If you use self-managed Prometheus, you can import the following Grafana dashboard JSON templates and configure the data source.
Name | Version | Description | Download |
Sandbox Instance | v1.0.0 | Monitors the metadata, current status, and resource usage of Sandbox instances. | |
Sandbox Controller | v1.0.0 | Monitors cloud-side lifecycle management for Sandbox resources, including resource statistics and management performance. | |
Sandbox Manager | v1.0.0 | Monitors the execution status of Sandbox resource declarations, such as the execution performance of the E2B protocol. |
Billing
Enabling Alibaba Cloud Prometheus Monitoring may incur additional charges. For details, see the billing overview.
Metrics
This section lists the Prometheus metrics exposed by the Sandbox Controller and Sandbox Manager components. You can use these metrics to configure alerting rules or create custom dashboards.
Sandbox controller metrics
The Sandbox Controller manages the lifecycle of Sandbox instances and SandboxSet resources. The following metrics are exposed from the controller's /metrics endpoint.
Sandbox instance metrics
Use these metrics to monitor the basic information, lifecycle status, and readiness of each Sandbox instance in the cluster.
Status-based metrics, such as sandbox_status_unpaused, sandbox_status_unpaused_time, and sandbox_status_inplace_updating, generate a time series only when an instance enters the corresponding state. If no instances in the cluster are in that state (for example, if a pause operation has never been performed), querying the corresponding metric returns no data. This is expected and does not indicate a scrape failure.
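Because an empty result is normal for these metrics, scripts that read them should treat "no series" as zero rather than as an error. The following Python sketch shows one way to do that with the JSON body returned by the Prometheus HTTP API instant-query endpoint (/api/v1/query). The function name and the way you obtain the response are assumptions for illustration. The sketch only counts series and makes no assumption about how the status metrics encode their values.

```python
def count_series(query_response: dict) -> int:
    """Count the time series in a Prometheus instant-query or range-query response.

    An empty result list means that no instance is currently in the queried
    state. It is reported as 0 instead of being treated as a failure.
    """
    if query_response.get("status") != "success":
        # A failed query carries an "error" field in the Prometheus API response.
        raise ValueError(f"query failed: {query_response.get('error', 'unknown error')}")
    data = query_response.get("data", {})
    # Only vector and matrix results hold one list entry per series.
    if data.get("resultType") not in ("vector", "matrix"):
        raise ValueError("expected a vector or matrix result")
    return len(data.get("result", []))
```

For example, passing the parsed JSON response of a query for sandbox_status_unpaused returns 0 on a cluster where no pause operation has ever been performed.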
Metric name | Type | Description | Labels |
| Gauge | The Unix timestamp of the Sandbox instance's creation. | |
| Gauge | The current phase of the Sandbox instance. The value is 1 for the current phase. Possible values for the | |
| Gauge | Indicates if the Sandbox instance is ready. | |
| Gauge | The Unix timestamp of when the instance last entered the Ready state. | |
| Gauge | Indicates if the | |
| Gauge | Indicates if the | |
| Gauge | The Unix timestamp when the | |
| Gauge | The Unix timestamp when the | |
Sandbox instance resource metrics
Each Sandbox instance corresponds to a pod. Its resource metrics are the same as the cluster's cAdvisor pod resource metrics. For detailed metric descriptions, see Container cluster basic metrics.
SandboxSet metrics
Use these metrics to monitor the replica status of SandboxSet resources and to identify issues such as insufficient replicas or scaling anomalies.
Metric name | Type | Description | Labels |
| Gauge | The current number of SandboxSet replicas. | |
| Gauge | The current number of available SandboxSet replicas. | |
| Gauge | The desired number of SandboxSet replicas. | |
Sandbox manager metrics
Sandbox Manager handles sandbox resource claims, lifecycle operations (such as clone, delete, pause, resume, and snapshot), and the routing proxy. The Manager's /metrics endpoint exposes the following metrics.
Sandbox claim metrics
Use these metrics to monitor sandbox resource claim operations, including success rate, duration, and retry count.
Metric name | Type | Description | Labels |
| Counter | Total number of claim operations. | - |
| Counter | The number of sandbox creation requests, broken down by result. | |
| Histogram | Duration of claim operations, in seconds. | - |
| Histogram | Number of retries per claim operation. | - |
Lifecycle operation metrics
Use these metrics to monitor the duration and outcome of Sandbox lifecycle operations. This helps you evaluate performance and troubleshoot failures.
Metric name | Type | Description | Labels |
| Histogram | Duration of the Sandbox clone operation, in seconds. | - |
| Histogram | Duration of the Sandbox delete operation, in seconds. | - |
| Counter | Total number of Sandbox delete requests, broken down by result. | |
| Histogram | Duration of the Sandbox pause operation, in seconds. | - |
| Histogram | Duration of the Sandbox resume operation, in seconds. | - |
| Histogram | Duration of the Sandbox snapshot creation, in seconds. | - |
Route and network metrics
Use these metrics to monitor the proxy routing table and peer nodes in Sandbox Manager, and the performance of route synchronization.
Metric name | Type | Description | Labels |
| Gauge | Current number of routes in the proxy routing table. | - |
| Gauge | Current number of connected peer nodes. | - |
| Histogram | Duration of route synchronization, in seconds. | - |
| Counter | Total number of route synchronizations. | - |