Platform For AI: View ServiceInstance events in CloudMonitor

Last Updated:May 08, 2023

Elastic Algorithm Service (EAS) defines the ServiceInstance event type in CloudMonitor to help you monitor the events of each EAS service instance. The EAS event controller pushes ServiceInstance events to CloudMonitor in real time. You can view ServiceInstance events, perform O&M or audits on the events, and configure alert rules for the events in the CloudMonitor console or by calling API operations. This topic describes how to view ServiceInstance events, and how to create and enable alert rules for the events.

View ServiceInstance events

In the CloudMonitor console

To view ServiceInstance events in the CloudMonitor console, perform the following steps:

  1. Log on to the CloudMonitor console.
  2. In the left-side navigation pane, choose Event Monitoring > System Event.
  3. On the Event Monitoring tab, select PAI from the product selection drop-down list and click Search to view system events of EAS.

  4. Find the system event that you want to view and click Details in the Actions column.

    [Figure: example of the event details]

    The following table describes the parameters.

    | Parameter | Description |
    | --- | --- |
    | Product | The code of the service. For example, the code of Machine Learning Platform for AI (PAI) is learn. |
    | Name | The name of the event. For information about PAI system events, see the Event name column of the table in the Supported ServiceInstance events section. |
    | Level | The severity level of the event. Valid values: INFO, WARN, and CRITICAL. |
    | Status | The status of the event. For information about the status of PAI system events, see the Event status column of the table in the Supported ServiceInstance events section. |
    | RegionId | The region ID of the service. For example, the ID of the China (Shanghai) region is cn-shanghai. |
    | ResourceId | The ID of the resource. For more information, see Policy description. |
    | InstanceName | The name of the service instance. |
    | Time | The time at which the event occurred, as a UNIX timestamp in milliseconds: the number of milliseconds that have elapsed since 00:00:00 UTC on Thursday, January 1, 1970. |
    | GroupId | The CloudMonitor application group to which the EAS service belongs. By default, this parameter is empty. |
    | Content | The content of the event. The value is in the JSON format. For more information, see Fields of the Content parameter. |

    Fields of the Content parameter

    | Field | Description |
    | --- | --- |
    | serviceName | The service name of the instance. |
    | serviceId | The service ID of the instance. |
    | serviceGroup | The service group to which the instance belongs. |
    | resourceType | The type of the resource group to which the instance belongs. Valid values: PublicResource (public resource group) and DedicatedResource (dedicated resource group). |
    | instanceType | The instance type. |
    | cpu | The number of CPUs used by the instance. |
    | memory | The memory usage of the instance. Unit: MB. |
    | gpu | The number of GPUs used by the instance. |
    | gpuMemory | The GPU memory usage of the instance. Unit: GB. |
    | nvidiaName | The name of the GPU used by the instance. |
    | role | The service role of the instance. Valid values: Queue (the queue service), DataLoader (the offline service), and Standard (the standard service). |
    | isBurst | Specifies whether auto scaling is enabled for the resource group of the instance. Valid values: false and true. |
    | isSpot | Specifies whether the instance is a preemptible instance. Valid values: false and true. |
    | callerUid | The UID of the Alibaba Cloud account that is used to deploy the EAS service. |
    | timestamp | The time when the event occurred, in UTC. |
    | restartCount | The number of times the instance has restarted. |
    | exitCode | The exit status code of the instance. By default, this parameter is empty. |
    | status | The status of the instance. For information about the valid values, see the Event status column of the table in the Supported ServiceInstance events section. |
    | reason | The reason why the event occurred. |
    | message | The information about the event. |
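
The parameters described above can be handled in a few lines of Python. The following is a minimal sketch, not an official client: the sample record and all of its values are hypothetical, and it assumes that Time may arrive as either a string or an integer and that Content arrives as a JSON string. It shows how to convert the millisecond UNIX timestamp to a UTC datetime and how to decode the Content fields.

```python
import json
from datetime import datetime, timezone

# Hypothetical event record. The keys follow the parameter tables in this
# topic, but every value is made up for illustration.
sample_event = {
    "Product": "learn",
    "Name": "EAS:ServiceInstance:CrashLoopBackOff",
    "Level": "CRITICAL",
    "Status": "CrashLoopBackOff",
    "RegionId": "cn-shanghai",
    "InstanceName": "demo-instance",
    "Time": "1683504000000",  # UNIX time in milliseconds
    "Content": json.dumps({
        "serviceName": "demo_service",
        "role": "Standard",
        "isSpot": False,
        "restartCount": 5,
        "status": "CrashLoopBackOff",
        "reason": "example reason",
    }),
}


def event_time_utc(event):
    """Convert the millisecond UNIX timestamp in Time to an aware UTC datetime."""
    millis = int(event["Time"])  # accepts both "1683504000000" and 1683504000000
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def parse_content(event):
    """Decode the Content parameter; pass it through if it is already a dict."""
    content = event["Content"]
    return json.loads(content) if isinstance(content, str) else content


def summarize(event):
    """Build a one-line summary that is convenient for logs or audits."""
    content = parse_content(event)
    return "{} {} {} service={} restarts={}".format(
        event_time_utc(event).isoformat(),
        event["Level"],
        event["Name"],
        content.get("serviceName"),
        content.get("restartCount"),
    )


print(summarize(sample_event))
```

Using `content.get(...)` rather than direct indexing keeps the sketch tolerant of fields that are empty by default, such as exitCode.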

API operation

You can also call an API operation to view ServiceInstance events. For more information, see DescribeSystemEventAttribute.
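
As a rough illustration, the request for such a query could be assembled as a plain dictionary before it is handed to an SDK or a signed HTTP client. This sketch makes several assumptions that should be checked against the DescribeSystemEventAttribute reference: the parameter names (Product, EventType, Name, Level, StartTime, EndTime, PageNumber, PageSize) and the millisecond time format are assumed, and `build_event_query` is a hypothetical helper. Only the product code learn and the event type ServiceInstance come from this topic.

```python
import time


def build_event_query(name=None, level=None, hours=1, now_ms=None, page_size=50):
    """Assemble query parameters for ServiceInstance events in a time window.

    The parameter names are assumptions; verify them against the API reference.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    params = {
        "Product": "learn",              # product code of PAI, per this topic
        "EventType": "ServiceInstance",  # event type defined by EAS
        "StartTime": str(now_ms - hours * 3600 * 1000),
        "EndTime": str(now_ms),
        "PageNumber": 1,
        "PageSize": page_size,
    }
    # Optional filters are added only when the caller supplies them.
    if name:
        params["Name"] = name
    if level:
        params["Level"] = level
    return params


# Example: CRITICAL events from the last two hours.
query = build_event_query(level="CRITICAL", hours=2)
print(query)
```

Keeping the parameter assembly separate from the network call makes it easy to inspect or unit test the query before any request is sent.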

Create and enable an event-triggered alert rule

In the CloudMonitor console

  1. Create a system event-triggered alert rule. For more information, see Create a system event-triggered alert rule. Take note of the following parameters:

    • Product Type: Select PAI.

    • Event Type: Select ServiceInstance, which is the only supported value. This type covers events that are related to EAS service instances.

    • Event Level: Select one or more severity levels based on your business requirements.

    • Event Name: Select the name of the event that you want to monitor. The available names are listed in the Event name column of the table in the Appendix section. You can select one or more event names.

    • Keyword Filtering: Specify the keywords and condition that are used to filter the events.

  2. Enable the system event-triggered alert rule. For more information, see Enable system event-triggered alert rules.
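
To reason about which events a rule would select before you create it, the parameters above can be modeled locally. The following sketch is only an approximation: `rule_matches` is a hypothetical helper, and the way keywords are compared (substring match against the Content value, combined with either "any" or "all") is an assumption about Keyword Filtering, not a description of how CloudMonitor evaluates rules.

```python
import json


def rule_matches(event, levels, names=(), keywords=(), require_all=False):
    """Return True if a rule with these settings would select the event.

    levels:      severity levels chosen for Event Level
    names:       event names chosen for Event Name (empty means any name)
    keywords:    strings searched for in the event content (assumed behavior)
    require_all: True to require every keyword, False to accept any keyword
    """
    if event.get("Level") not in levels:
        return False
    if names and event.get("Name") not in names:
        return False
    if not keywords:
        return True
    content = event.get("Content", "")
    text = content if isinstance(content, str) else json.dumps(content)
    hits = [keyword in text for keyword in keywords]
    return all(hits) if require_all else any(hits)


# Hypothetical event used to try out a rule definition.
event = {
    "Name": "EAS:ServiceInstance:Failed",
    "Level": "CRITICAL",
    "Content": '{"serviceName": "demo_service", "reason": "example reason"}',
}
print(rule_matches(event, levels={"CRITICAL"}, keywords=["demo_service"]))
```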

API operation

You can also call an API operation to create and enable an event-triggered alert rule. For more information, see Create a system event-triggered alert rule and Enable system event-triggered alert rules.

Appendix: Supported ServiceInstance events

The following table describes the ServiceInstance events that are defined by EAS based on the lifecycle of a service instance.

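| Event type | Event name | Event level | Event status |
| --- | --- | --- | --- |
| ServiceInstance | EAS:ServiceInstance:Running | INFO | Running |
| ServiceInstance | EAS:ServiceInstance:Pending | INFO | Pending |
| ServiceInstance | EAS:ServiceInstance:Completed | INFO | Completed |
| ServiceInstance | EAS:ServiceInstance:Terminating | INFO | Terminating |
| ServiceInstance | EAS:ServiceInstance:Terminated | INFO | Terminated |
| ServiceInstance | EAS:ServiceInstance:Unknown | WARN | Unknown |
| ServiceInstance | EAS:ServiceInstance:Evicted | WARN | Evicted |
| ServiceInstance | EAS:ServiceInstance:ErrImagePull | WARN | ErrImagePull |
| ServiceInstance | EAS:ServiceInstance:ImagePullBackOff | WARN | ImagePullBackOff |
| ServiceInstance | EAS:ServiceInstance:CrashLoopBackOff | CRITICAL | CrashLoopBackOff |
| ServiceInstance | EAS:ServiceInstance:Error | CRITICAL | Error |
| ServiceInstance | EAS:ServiceInstance:Failed | CRITICAL | Failed |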
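
For scripts that process these events, the table above can be encoded as a small lookup. This is a minimal sketch: the mapping is copied from the table, while the helper names and the severity ordering INFO < WARN < CRITICAL are assumptions introduced for illustration.

```python
# Event status -> event level, copied from the table in this appendix.
EVENT_LEVELS = {
    "Running": "INFO",
    "Pending": "INFO",
    "Completed": "INFO",
    "Terminating": "INFO",
    "Terminated": "INFO",
    "Unknown": "WARN",
    "Evicted": "WARN",
    "ErrImagePull": "WARN",
    "ImagePullBackOff": "WARN",
    "CrashLoopBackOff": "CRITICAL",
    "Error": "CRITICAL",
    "Failed": "CRITICAL",
}

PREFIX = "EAS:ServiceInstance:"
SEVERITY = ["INFO", "WARN", "CRITICAL"]  # assumed ordering, lowest first


def level_of(event_name):
    """Return the level for a full event name such as EAS:ServiceInstance:Failed."""
    if not event_name.startswith(PREFIX):
        raise ValueError("not a ServiceInstance event: " + event_name)
    return EVENT_LEVELS[event_name[len(PREFIX):]]


def names_at_or_above(min_level):
    """List the full event names whose level is at least min_level."""
    threshold = SEVERITY.index(min_level)
    return [
        PREFIX + status
        for status, level in EVENT_LEVELS.items()
        if SEVERITY.index(level) >= threshold
    ]


print(level_of("EAS:ServiceInstance:Evicted"))
print(names_at_or_above("CRITICAL"))
```

A list such as `names_at_or_above("WARN")` can serve as a starting point when you decide which event names to select in an alert rule.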