View EAS events in CloudMonitor - Platform For AI - Alibaba Cloud Documentation Center

Elastic Algorithm Service (EAS) pushes two types of events to CloudMonitor in real time: Service events (service-level) and ServiceInstance events (pod instance-level). Use these events to monitor your inference services, trigger alert notifications when a pod crashes or a spot instance is preempted, audit configuration changes, and automate operational responses — for example, by routing critical events to a notification service or invoking a function to restart a failed pod.

View EAS events

Use the console

Log on to the CloudMonitor console.
In the left navigation pane, choose Event Center > System Event.
On the Event Monitoring tab, select PAI from the product drop-down list, and click Search.
In the Actions column of the target event, click Details to view the event details. Example:

Event parameters

Parameter	Description
Product	The service code. For Platform for AI (PAI), the code is `learn`.
Name	The event name. See the Name column in Supported EAS events.
Level	The event level: `INFO`, `WARN`, or `CRITICAL`.
Status	The event status. See the Status column in Supported EAS events.
RegionId	The region ID of the service, for example, `cn-shanghai` for China (Shanghai).
ResourceId	The resource ID. See Policy description.
InstanceName	The service name or service instance name.
Time	The time the event occurred, expressed as a UNIX millisecond timestamp.
GroupId	The CloudMonitor application group that the EAS service belongs to. Empty by default.
Content	The core content of the event. For Service events, this is a plain string. For ServiceInstance events, this is a JSON object — see Fields of the Content parameter.

Fields of the Content parameter

The Content field for ServiceInstance events is a JSON object. Example:

{
  "serviceName": "my-inference-service",
  "serviceId": "eas-cn-shanghai-1234567",
  "serviceGroup": "default",
  "resourceType": "DedicatedResource",
  "instanceType": "ecs.gn6i-c4g1.xlarge",
  "cpu": 4,
  "memory": 16384,
  "gpu": 1,
  "gpuMemory": 16,
  "nvidiaName": "Tesla T4",
  "role": "Standard",
  "isBurst": false,
  "isSpot": false,
  "callerUid": "123456789012345678",
  "timestamp": 1711929600000,
  "restartCount": 2,
  "exitCode": "1",
  "status": "CrashLoopBackOff",
  "reason": "OOMKilled",
  "message": "Container killed due to out-of-memory condition"
}

Field descriptions:

Field	Description
serviceName	The service name of the instance.
serviceId	The service ID of the instance.
serviceGroup	The service group the instance belongs to.
resourceType	The resource group type: `PublicResource` (public resource group) or `DedicatedResource` (dedicated resource group).
instanceType	The instance type.
cpu	The number of CPUs used by the instance.
memory	The memory used by the instance, in MB.
gpu	The number of GPUs used by the instance.
gpuMemory	The GPU memory used by the instance, in GB.
nvidiaName	The name of the GPU used by the instance.
role	The service role: `Queue` (queue service), `DataLoader` (offline service), or `Standard` (standard service).
isBurst	Whether auto scaling is enabled for the instance's resource group: `true` or `false`.
isSpot	Whether the instance is a preemptible instance: `true` or `false`.
callerUid	The UID of the Alibaba Cloud account used to deploy the EAS service.
timestamp	The last startup time of the container.
restartCount	The number of times the instance has been restarted.
exitCode	The exit status code of the instance. Empty by default.
status	The status of the instance. See the Status column in Supported EAS events.
reason	The reason the event occurred.
message	The event message.

Use the API

Call DescribeSystemEventAttribute to retrieve EAS events programmatically.

Create and enable alert rules

Set up alert rules to receive notifications when specific EAS events occur — for example, when a pod enters CrashLoopBackOff or a spot instance is about to be released.

Use CRITICAL event levels for events that require immediate action (such as CrashLoopBackOff or Unavailable) and WARN for events that warrant investigation (such as ErrImagePull or Evicted).

Use the console

Create a system event-triggered alert rule. Configure the following parameters:
- Product type: Select PAI.
- Event type: Select ServiceInstance or Service.
- Event level: Select one or more event levels.
- Event name: Select one or more event names from the Name column in Supported EAS events.
- Keyword filtering: Enter keywords to match against event content and filter which events trigger the alert.
(Optional) Use alert callbacks for system events (legacy) to send event data to an external endpoint, such as a webhook or notification service.

Use the API

Create an event-triggered alert rule and enable it.

FAQ

Does a service instance refer to an inference service or a pod?

A service instance refers to a pod instance, not the inference service itself. The Service event type covers service-level changes (such as replica count changes or service unavailability), while the ServiceInstance event type covers pod-level events (such as pod restarts, image pull failures, or preemption).

Appendix: Supported EAS events

EAS defines service-level and service instance-level events as follows.

Type	Name	Event level	Status	Description
ServiceInstance	EAS:ServiceInstance:Running	INFO	Running	The pod is running normally.
	EAS:ServiceInstance:Pending	INFO	Pending	The pod is waiting to be scheduled or started.
	EAS:ServiceInstance:Completed	INFO	Completed	The pod has completed execution.
	EAS:ServiceInstance:Terminating	INFO	Terminating	The pod is in the process of being terminated.
	EAS:ServiceInstance:Terminated	INFO	Terminated	The pod has been terminated.
	EAS:ServiceInstance:Unknown	WARN	Unknown	The pod state cannot be determined, typically due to a communication error with the node.
	EAS:ServiceInstance:Evicted	WARN	Evicted	The pod was evicted due to resource pressure on the node.
	EAS:ServiceInstance:ErrImagePull	WARN	ErrImagePull	The container image could not be pulled, often due to an invalid image reference or missing credentials.
	EAS:ServiceInstance:ImagePullBackOff	WARN	ImagePullBackOff	Image pull is being retried after repeated failures.
	EAS:ServiceInstance:CrashLoopBackOff	CRITICAL	CrashLoopBackOff	The container is repeatedly crashing. Check the `exitCode` and `reason` fields for the root cause.
	EAS:ServiceInstance:Error	CRITICAL	Error	The pod encountered a runtime error.
	EAS:ServiceInstance:Failed	CRITICAL	Failed	The pod failed to start or complete successfully.
	EAS:ServiceInstance:SpotToBeReleased	WARN	SpotToBeReleased	The preemptible instance is about to be released. Prepare for service disruption or failover.
Service	EAS:Service:ReplicasChanged	INFO	ReplicasChanged	The number of replicas for the service has changed, for example, due to auto scaling.
	EAS:Service:StatusChanged	INFO	StatusChanged	The service status has changed.
	EAS:Service:Unavailable	CRITICAL	Unavailable	The service is unavailable and cannot process requests.
	EAS:Service:UpdateFailed	CRITICAL	UpdateFailed	A service update failed to apply.