Platform for AI: Monitoring and logging

Last Updated: Feb 06, 2024

Alibaba Cloud Health Status

We recommend that you track the health status of your Alibaba Cloud resources so that you can handle exceptions as early as possible. For more information, visit Alibaba Cloud Health Status.

On the Alibaba Cloud Health Status page, you can check the health status of each service in different regions and subscribe to Really Simple Syndication (RSS) feeds about service exceptions.
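
If you consume such a feed programmatically rather than in a feed reader, the items can be extracted with a standard XML parser. The following is a minimal sketch that assumes the feed follows the common RSS 2.0 layout (`channel` containing `item` elements with `title`, `pubDate`, and `description`). The sample feed text and the function name `parse_feed_items` are made up for illustration; the actual Alibaba Cloud Health Status feed may use different fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed used only for illustration. It is not real
# Alibaba Cloud Health Status data, and the real feed may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example service status feed</title>
    <item>
      <title>Example incident in region A</title>
      <pubDate>Tue, 06 Feb 2024 08:00:00 GMT</pubDate>
      <description>Illustrative text only.</description>
    </item>
    <item>
      <title>Example incident in region B</title>
      <pubDate>Tue, 06 Feb 2024 09:30:00 GMT</pubDate>
      <description>Illustrative text only.</description>
    </item>
  </channel>
</rss>"""


def parse_feed_items(feed_xml: str) -> list[dict[str, str]]:
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when a child element is missing.
            "title": item.findtext("title", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_feed_items(SAMPLE_FEED):
        print(entry["published"], "-", entry["title"])
```

In practice you would download the feed text from the URL shown on the Alibaba Cloud Health Status page and pass it to the same function.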

CloudMonitor

CloudMonitor Basic is a free service that provides real-time monitoring for Platform for AI (PAI). CloudMonitor Basic allows you to track the operational status of cloud resources, resource usage in Elastic Compute Service (ECS), website performance, and disruptions in business operations.

To use the monitoring capabilities of CloudMonitor Basic, you must enable CloudMonitor Basic for PAI. For more information, see Cloud service monitoring.

Enable alerts for critical metrics all at once

CloudMonitor Basic allows you to enable alerts for multiple critical PAI metrics at the same time and establish an alert system efficiently. This way, you can gain comprehensive insights into the usage of your cloud resources and the operational status of your business. For more information, see Enable the initiative alert feature.

Configure custom alerts for desired metrics

You can create a custom dashboard to manage all metrics that you want to monitor on a single platform. For more information, see Manage the monitoring charts of a custom dashboard.

You can configure alert rules for each metric to receive important notifications by using multiple notification methods, including phone calls, text messages, emails, DingTalk chatbots, and the Alibaba Cloud app.

You can also create an alert blacklist to block alerts for specific metrics. For more information, see Manage blacklist policies.
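
The interaction between alert rules, notification methods, and a blacklist can be pictured with a small model. The following sketch is purely conceptual and does not use or represent the CloudMonitor API: the class `AlertRule`, the function `evaluate_rules`, the metric names, and the threshold semantics (alert when a value exceeds the threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule: notify when a metric exceeds a threshold."""
    metric: str
    threshold: float
    channels: tuple[str, ...]  # for example: "email", "dingtalk"


def evaluate_rules(
    rules: list[AlertRule],
    samples: dict[str, float],
    blacklist: set[str],
) -> list[tuple[str, str, float]]:
    """Return (channel, metric, value) for every alert that should be sent.

    Metrics listed in the blacklist are skipped even if they breach
    their threshold, which mimics blocking alerts for specific metrics.
    """
    notifications = []
    for rule in rules:
        if rule.metric in blacklist:
            continue  # alerts for this metric are blocked
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            for channel in rule.channels:
                notifications.append((channel, rule.metric, value))
    return notifications


if __name__ == "__main__":
    # Hypothetical metric names and values.
    rules = [
        AlertRule("gpu_utilization", 90.0, ("email", "dingtalk")),
        AlertRule("cpu_utilization", 80.0, ("email",)),
    ]
    samples = {"gpu_utilization": 95.0, "cpu_utilization": 85.0}
    print(evaluate_rules(rules, samples, blacklist={"cpu_utilization"}))
```

In this model the blacklist takes precedence over every rule for the same metric. Whether CloudMonitor applies the same precedence is described in the blacklist policy documentation, not here.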

Cloud Config

Cloud Config is a free auditing service that helps you monitor the configuration changes of all cloud resources and ensure the continuous compliance of your cloud infrastructure.

Track resource configuration changes

Cloud Config can audit the operations of your Alibaba Cloud account and of the Resource Access Management (RAM) users created by that account. By default, configuration changes are recorded every 10 minutes.

Enable the compliance pre-check for MLPS 2.0

Cloud Config uses rules that align with the Baseline for Multi-Level Protection Scheme (MLPS) 2.0 to evaluate the compliance of cloud resource configurations. You can enable the compliance pre-check for MLPS 2.0 with a few clicks. The system automatically and continuously checks your resources for compliance. You can also download the pre-check report and submit it to an inspection agency.

Query and analyze audit data in real time

You can send the historical configuration changes and non-compliant events of your resources to a Logstore in Simple Log Service. This way, you can query and analyze audit data in a centralized manner. For more information, see Deliver resource data to a Logstore in Simple Log Service.

ActionTrail

You can enable ActionTrail for PAI to monitor and record the operations of your Alibaba Cloud account in a centralized manner, including logon to the PAI console and access to cloud resources. This way, you can perform security analysis, intrusion detection, resource change tracking, and compliance auditing based on the records.

ActionTrail logs access to cloud services that is made through the Alibaba Cloud Management Console, API operations, or developer tools. For information about the audit events, see Audit events of ECS.

By default, ActionTrail tracks and retains the events of the previous 90 days. If you need to retain events for a longer period of time, create a trail that sends events to a Simple Log Service Logstore or an Object Storage Service (OSS) bucket. For more information, see Getting Started.
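
To decide whether an event you need is still covered by the default retention or requires a trail, you can compare its timestamp with the 90-day window. The following is a minimal sketch of that arithmetic only. It does not call ActionTrail. The function name `is_within_default_retention` is hypothetical, and treating an event that is exactly 90 days old as still retained is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default retention period stated in this topic.
DEFAULT_RETENTION = timedelta(days=90)


def is_within_default_retention(
    event_time: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the event falls inside the default 90-day window.

    Both timestamps must be timezone-aware. The boundary handling
    (exactly 90 days counts as retained) is an assumption.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - event_time <= DEFAULT_RETENTION


if __name__ == "__main__":
    reference = datetime(2024, 2, 6, tzinfo=timezone.utc)
    recent = reference - timedelta(days=30)
    old = reference - timedelta(days=120)
    print(is_within_default_retention(recent, reference))
    print(is_within_default_retention(old, reference))
```

Events for which this check returns False are only available if a trail was already delivering events to a Logstore or an OSS bucket at the time they occurred.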

After you create a trail to send events to a Simple Log Service Logstore or an OSS bucket, you can query or analyze the events in the Simple Log Service or OSS console. For more information, see Query events in the Simple Log Service or OSS console.

If you need to trace a historical event, submit a ticket to request the required permissions.

Workspace notification

PAI provides a notification mechanism for workspaces. You can create notification rules to monitor the status of Deep Learning Containers (DLC) jobs and pipeline jobs, or trigger related events based on the approval status of model versions. You can receive notifications through multiple notification methods, such as DingTalk, phone calls, and emails. For more information, see Workspace notification.

Tensorboard

You can create a Tensorboard in Machine Learning Designer or for a DLC job to view visualized analytical reports of model training. For more information, see the following topics: