Platform for AI: Use the AI assistant of Lingjun

Last Updated: Jan 10, 2025

PAI AIMaster and the AI assistant of Lingjun together form a fully automated system that restores training jobs when exceptions or faults occur. After you install the AI assistant of Lingjun and enable the job monitoring and recovery feature of Platform for AI (PAI), the system automatically reports the fault or exception information and isolates the faulty node when a training job becomes faulty or abnormal. The training job can then be restored quickly without manual intervention. This topic describes how to configure the AI assistant of Lingjun.

Prerequisites

A Lingjun cluster with Container Service for Kubernetes (ACK) activated is created. For more information, see Create a Lingjun cluster with ACK activated.

Features

After you install the AI assistant of Lingjun and use the Resource Access Management (RAM) Roles for Service Accounts (RRSA) feature to complete authorization, you must enable the AIMaster automatic fault tolerance and EasyCkpt features of PAI when you submit a training job. When a fault or an exception occurs, the AI assistant uses its underlying alert system to interact with PAI and report the fault or exception information. It then selects a handling method based on the phase in which the fault or exception occurs and on the parallel policy, automatically isolates the faulty node, and quickly restores the job from the checkpoint. The AI assistant of Lingjun provides the following features:

  • Exception collection and reporting: The AI assistant of Lingjun automatically interacts with PAI based on its alert system.

  • Fault isolation: The AI assistant of Lingjun automatically isolates faulty nodes.

  • Exception handling: The AI assistant of Lingjun triggers PAI to create checkpoints and quickly restore training jobs when alerts are reported.

Procedure

  1. Log on to the ACK console.

  2. Install the ack-ai-installer component.

    1. On the Clusters page, find the cluster that you want to manage and click the name of the cluster. On the cluster details page, choose Operations > Add-ons in the left-side navigation pane.

    2. On the Add-ons page, click the Others tab. On the Others tab, find the ack-ai-installer component, and click Install in the lower-right corner of the component.

    3. In the message that appears, confirm the information and click OK.

  3. Enable the RRSA feature for the cluster.

    1. In the left-side navigation pane of the cluster details page, click Cluster Information.

    2. On the cluster details page, click the Basic Information tab. In the Cluster Information section, click Enable RRSA to the right of RRSA OIDC. For more information, see Use RRSA to authorize different pods to access different cloud services.

    3. In the message that appears, click Confirm.

  4. Install the ack-pod-identity-webhook component. For more information, see ack-pod-identity-webhook.

  5. Create a RAM role named aiph-ack-rrsa-role.

    1. Log on to the RAM console with your Alibaba Cloud account.

    2. In the left-side navigation pane, choose Identities > Roles.

    3. On the Roles page, click Create Role.

    4. In the Create Role panel, select IdP for Select Trusted Entity and click Next.

    5. In the Configure Role step, configure the parameters that are described in the following table and click OK.

      Parameter        Description
      RAM Role Name    Enter aiph-ack-rrsa-role in the RAM Role Name field.
      IdP Type         Set this parameter to OIDC.
      Select IdP       Select an identity provider (IdP). The IdP is named in the ack-rrsa-<cluster_id> format. <cluster_id> indicates the ID of your cluster.
      Conditions       • oidc:iss: Use the default value.
                       • oidc:aud: Select sts.aliyuncs.com.
                       • oidc:sub: Set the condition operator to StringEquals. The value is in the system:serviceaccount:aiph-ops:aiph-manager format.

  6. Attach the AliyunCSReadOnlyAccess policy to the RAM role that you created in the previous step, and grant the role a custom permission that allows the AI assistant of Lingjun to call API operations. The following sample code shows an example of the custom policy. For more information about how to create a custom policy, see Create custom policies. For more information about how to grant permissions to a RAM role, see Grant permissions to a RAM role.

    Note: If you grant the following permission to the RAM role, the AI assistant of Lingjun is authorized to perform automated O&M operations on Lingjun nodes. The note is not part of the policy document. Do not paste it into the policy editor, because JSON does not support comments.

    {
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "eflo:*"
          ],
          "Resource": [
            "acs:eflo:*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "cms:DescribeSystemEventAttribute",
          "Resource": "acs:cms:*"
        }
      ],
      "Version": "1"
    }
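
The naming conventions in step 5 and the custom policy in step 6 can be sketched in Python, for example to generate the values before you enter them in the RAM console. This is a minimal sketch and not part of the product: the helper names idp_name, oidc_sub, and custom_policy and the cluster ID in the usage lines are hypothetical. The namespace aiph-ops and the service account aiph-manager are read from the oidc:sub value in step 5, assuming the standard Kubernetes system:serviceaccount:<namespace>:<name> subject format.

```python
import json

# Values read from the oidc:sub condition in step 5 (assumed to follow the
# Kubernetes system:serviceaccount:<namespace>:<name> subject format).
NAMESPACE = "aiph-ops"
SERVICE_ACCOUNT = "aiph-manager"


def idp_name(cluster_id: str) -> str:
    """Return the IdP name in the ack-rrsa-<cluster_id> format from step 5."""
    return f"ack-rrsa-{cluster_id}"


def oidc_sub(namespace: str = NAMESPACE,
             service_account: str = SERVICE_ACCOUNT) -> str:
    """Return the value to enter for the oidc:sub condition."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def custom_policy() -> dict:
    """Build the custom policy document shown in step 6."""
    return {
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["eflo:*"],
                "Resource": ["acs:eflo:*"],
            },
            {
                "Effect": "Allow",
                "Action": "cms:DescribeSystemEventAttribute",
                "Resource": "acs:cms:*",
            },
        ],
        "Version": "1",
    }


if __name__ == "__main__":
    # "c0123456789example" is a hypothetical cluster ID used for illustration.
    print(idp_name("c0123456789example"))
    print(oidc_sub())
    # json.dumps emits comment-free JSON that can be pasted into the policy editor.
    print(json.dumps(custom_policy(), indent=2))
```

Building the policy as a dictionary and serializing it with json.dumps guarantees that the pasted text is valid JSON, which a hand-edited copy that still contains the note line would not be.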

What to do next

After you configure the AI assistant of Lingjun, you must enable the AIMaster automatic fault tolerance and EasyCkpt features of PAI when you submit a training job. This way, if a fault or an exception occurs, the system can quickly restore the training job.