
Container Service for Kubernetes: Deploy the cloud-native AI suite

Last Updated: Mar 08, 2024

You can deploy the cloud-native AI suite in Container Service for Kubernetes (ACK) Pro clusters, ACK Serverless Pro clusters, and ACK Edge Pro clusters that run Kubernetes 1.18 or later. This topic describes how to deploy the suite in a cluster and how to install and configure AI Dashboard and AI Developer Console.

Prerequisites

Deploy the cloud-native AI suite

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Cloud-native AI Suite in the left-side navigation pane.

  3. On the Cloud-native AI Suite page, click Deploy. On the page that appears, select the components that you want to install.

    The following table describes the configuration items in the console (first two columns), the component that each item installs and its namespace (next two columns), and the cluster types that support the component (last three columns).

    | Configuration item | Description | Component name and description | Namespace | ACK Pro cluster | ACK Serverless Pro cluster | ACK Edge Pro cluster |
    | --- | --- | --- | --- | --- | --- | --- |
    | Elasticity | Specify whether to enable the Elastic Training and Elastic Inference features. For more information, see Model training and Elastic inference services. | ack-alibaba-cloud-metrics-adapter, an auto scaling component. | kube-system | Yes | No | No |
    | Acceleration | Specify whether to enable the Fluid Data Acceleration feature. For more information, see Overview of Fluid. | ack-fluid, a data caching and acceleration component. | fluid-system | Yes | Yes | Yes |
    | Scheduling | Specify whether to enable the Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU scheduling, and NPU scheduling) feature. You can click Advanced to configure custom parameters. | ack-ai-installer, a scheduling component. | kube-system | Yes | No | Yes |
    | Scheduling | Specify whether to enable the Kube Queue feature. For more information, see Use ack-kube-queue to manage job queues. | ack-kube-queue, a Kube queue component. | kube-queue | Yes | Yes | Yes |
    | Interaction Mode | Arena: Select this option if you want to use the Arena CLI. To use the Arena CLI, you also need to install and configure the Arena client. After you install the Arena client, you can use the Arena CLI to integrate Kubeflow training operators. You can click Advanced to configure custom parameters. If you select Kube Queue, Console, and Kubeflow Pipelines at the same time, the Arena option is required. For more information, see Configure the Arena client. | ack-arena, a machine learning CLI. This component is required. | kube-system | Yes | Yes | Yes |
    | Interaction Mode | Console: Deploy a lightweight machine learning platform for AI. You can click Advanced to configure custom parameters. | ack-pai, a lightweight machine learning platform for AI. | pai-system | Yes | No | No |
    | Interaction Mode | Console: After you select this option to deploy AI Suite Console, configure parameters in the Note dialog box. For more information, see Install and configure AI Dashboard and AI Developer Console. | ack-ai-dashboard, a visualized O&M console. | kube-ai | Yes | No | No |
    | Interaction Mode | Console: After you select this option to deploy AI Suite Console, configure parameters in the Note dialog box. For more information, see Install and configure AI Dashboard and AI Developer Console. | ack-ai-dev-console, a deep learning development console. | kube-ai | Yes | No | No |
    | Console Data Storage | After you set Interaction Mode to Console, set Console Data Storage to Pre-installed MySQL or ApsaraDB RDS. For more information, see Install and configure AI Dashboard and AI Developer Console. | ack-mysql, a MySQL database component. | kube-ai | Yes | No | No |
    | Workflow | After you select Kubeflow Pipelines, you can set Workflow Data Storage to Pre-installed MinIO or OSS. For more information, see Install and configure Kubeflow Pipelines. | ack-ai-pipeline, a platform for building end-to-end machine learning workflows. | kube-ai | Yes | No | No |
    | Monitoring | Specify whether to install the Monitoring Component. For more information, see Work with cloud-native AI dashboards. | ack-arena-exporter, a cluster monitoring component. | kube-ai | Yes | No | No |

  4. Click Deploy Cloud-native AI Suite in the lower part of the page. The system checks the environment and the dependencies of the selected components. After the environment and dependencies pass the check, the system deploys the selected components.

    After the components are installed, the Components list shows the following information:

    • The names and versions of the components that are installed in the cluster. You can deploy or uninstall components from this list.

    • Whether a component is updatable. If it is, you can update the component.

  5. After ack-ai-dashboard and ack-ai-dev-console are installed, you can find the hyperlinks to AI Dashboard and AI Developer Console in the upper-left corner of the Cloud-native AI Suite page. Click a hyperlink to access the corresponding console.
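The component table in step 3 can also be expressed as data, which may help if you script a pre-check before selecting components. The following is a minimal sketch: the component names, namespaces, and support values are copied from that table, while the `COMPONENTS` structure, the cluster-type keys (`pro`, `serverless_pro`, `edge_pro`), and the `supported_components` helper are hypothetical names introduced for illustration and are not part of any ACK API.

```python
# Support matrix copied from the component table in step 3.
# Tuple layout: (namespace, ACK Pro, ACK Serverless Pro, ACK Edge Pro).
COMPONENTS = {
    "ack-alibaba-cloud-metrics-adapter": ("kube-system", True, False, False),
    "ack-fluid": ("fluid-system", True, True, True),
    "ack-ai-installer": ("kube-system", True, False, True),
    "ack-kube-queue": ("kube-queue", True, True, True),
    "ack-arena": ("kube-system", True, True, True),
    "ack-pai": ("pai-system", True, False, False),
    "ack-ai-dashboard": ("kube-ai", True, False, False),
    "ack-ai-dev-console": ("kube-ai", True, False, False),
    "ack-mysql": ("kube-ai", True, False, False),
    "ack-ai-pipeline": ("kube-ai", True, False, False),
    "ack-arena-exporter": ("kube-ai", True, False, False),
}

# Hypothetical short keys for the three cluster types; each maps to a tuple index.
CLUSTER_TYPE_INDEX = {"pro": 1, "serverless_pro": 2, "edge_pro": 3}


def supported_components(cluster_type):
    """Return the sorted names of components that the cluster type supports."""
    idx = CLUSTER_TYPE_INDEX[cluster_type]
    return sorted(name for name, row in COMPONENTS.items() if row[idx])


if __name__ == "__main__":
    for cluster_type in CLUSTER_TYPE_INDEX:
        print(cluster_type, supported_components(cluster_type))
```

A lookup like this only mirrors the table as of this topic's last update; the console's own environment and dependency check in step 4 remains the authoritative validation.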

Install and configure AI Dashboard and AI Developer Console

  1. In the Interaction Mode section of the Cloud-native AI Suite page, select Console. The Note dialog box appears, as shown in the following figure.

    • If Authorized is displayed, perform Step 3.

    • If Unauthorized is displayed in red and the OK button is dimmed, perform Step 2.

      [Figure: Note dialog box]

  2. Create a custom policy to grant permissions to the RAM worker role.

    1. Create a custom policy.

      1. Log on to the RAM console. In the left-side navigation pane, choose Permissions > Policies.

      2. On the Policies page, click Create Policy.

      3. Click the JSON tab. Add the following content to the Action field and click Next to edit policy information.

         "log:GetProject",
         "log:GetLogStore",
         "log:GetConfig",
         "log:GetMachineGroup",
         "log:GetAppliedMachineGroups",
         "log:GetAppliedConfigs",
         "log:GetIndex",
         "log:GetSavedSearch",
         "log:GetDashboard",
         "log:GetJob",
         "ecs:DescribeInstances",
         "ecs:DescribeSpotPriceHistory",
         "ecs:DescribePrice",
         "eci:DescribeContainerGroups",
         "eci:DescribeContainerGroupPrice",
         "log:GetLogStoreLogs",
         "ims:CreateApplication",
         "ims:UpdateApplication",
         "ims:GetApplication",
         "ims:ListApplications",
         "ims:DeleteApplication",
         "ims:CreateAppSecret",
         "ims:GetAppSecret",
         "ims:ListAppSecretIds",
         "ims:ListUsers"
      4. Specify the Name parameter in the k8sWorkerRolePolicy-{ClusterID} format and click OK.

    2. Grant permissions to the RAM worker role of the cluster.

      1. Log on to the RAM console. In the left-side navigation pane, choose Identities > Roles.

      2. In the search box, enter the name of the RAM worker role, which is in the KubernetesWorkerRole-{ClusterID} format. Click Grant Permissions in the Actions column of the worker role.

      3. In the Select Policy section, click Custom Policy.

      4. In the search box, enter the name of the custom policy that you created, which is in the k8sWorkerRolePolicy-{ClusterID} format, and select the policy.

      5. Click OK.

    3. Return to the Note dialog box and click Authorization Check. If the authorization is successful, Authorized is displayed and the OK button becomes available. Then, perform Step 3.
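For reference, the Action entries listed in Step 2 can be assembled into a complete policy document. The sketch below assumes the usual RAM policy layout (`Version`, `Statement`, `Effect`, `Action`, `Resource`) and uses `"Resource": "*"`, which this topic does not specify, so narrow it if your security requirements call for that. The `build_worker_role_policy` helper and the sample cluster ID are hypothetical.

```python
import json

# Actions copied from the list in Step 2 of this topic.
ACTIONS = [
    "log:GetProject", "log:GetLogStore", "log:GetConfig",
    "log:GetMachineGroup", "log:GetAppliedMachineGroups",
    "log:GetAppliedConfigs", "log:GetIndex", "log:GetSavedSearch",
    "log:GetDashboard", "log:GetJob",
    "ecs:DescribeInstances", "ecs:DescribeSpotPriceHistory",
    "ecs:DescribePrice",
    "eci:DescribeContainerGroups", "eci:DescribeContainerGroupPrice",
    "log:GetLogStoreLogs",
    "ims:CreateApplication", "ims:UpdateApplication", "ims:GetApplication",
    "ims:ListApplications", "ims:DeleteApplication", "ims:CreateAppSecret",
    "ims:GetAppSecret", "ims:ListAppSecretIds", "ims:ListUsers",
]


def build_worker_role_policy(cluster_id):
    """Return (policy_name, role_name, policy_document) for a cluster ID."""
    policy_name = f"k8sWorkerRolePolicy-{cluster_id}"
    role_name = f"KubernetesWorkerRole-{cluster_id}"
    document = {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(ACTIONS),
                # Assumption: this topic does not state a Resource value.
                "Resource": "*",
            }
        ],
    }
    return policy_name, role_name, document


if __name__ == "__main__":
    # "c0123456789abcdef" is a made-up cluster ID for illustration.
    name, role, doc = build_worker_role_policy("c0123456789abcdef")
    print(f"Policy name: {name}")
    print(f"Attach to role: {role}")
    print(json.dumps(doc, indent=2))
```

The printed JSON can be compared with what the JSON tab of the RAM console shows after you add the actions.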


  3. Select a method to access AI Dashboard and a method to access AI Developer Console. Then, click OK.

    You can use a private IP address, internal domain name, or public domain name to access AI Dashboard or AI Developer Console.

    Note
    • If you want to use a private IP address to access AI Dashboard or AI Developer Console, select Private IP in the Note dialog box.

    • For more information about how to use a private IP address or internal domain name to access AI Dashboard or AI Developer Console, see Access AI Dashboard.

  4. Set the Console Data Storage parameter.

    After you select Console, the Console Data Storage parameter appears below the Interaction Mode parameter. You can select Pre-installed MySQL or ApsaraDB RDS.

    Pre-installed MySQL

    Make sure that enhanced SSDs (ESSDs) are attached to the nodes in the cluster. The cloud-native AI suite automatically deploys a MySQL database in the cluster, so you do not need to purchase an ApsaraDB RDS instance. However, data security is not guaranteed by service-level agreements (SLAs).

    Important

    Data loss may occur if cluster exceptions occur or the MySQL database is accidentally deleted.

    To check the types of disks that are mounted to the cluster, perform the following steps:

    1. Log on to the ACK console and click Clusters in the left-side navigation pane.

    2. On the Clusters page, click the name of the cluster that you want to manage and choose Volumes > Persistent Volumes in the left-side navigation pane.

    3. On the Persistent Volumes page, take note of the name of the persistent volume (PV) that you want to manage.

    4. Go to the ECS console. In the left-side navigation pane, choose Storage & Snapshots > Disks. On the Disks page, search for the disk by using the name that you obtained. You can view the disk type in the Disk Category column.

    ApsaraDB RDS

    Note
    1. Purchase an ApsaraDB RDS instance and create a database and account. For more information, see Getting Started. For more information about the billing of ApsaraDB RDS, see Billing overview.

    2. Click Deploy Cloud-native AI Suite in the lower part of the page to install the cloud-native AI suite.

    3. Click the name of the cluster and choose Configurations > Secrets in the left-side navigation pane.

    4. Select kube-ai from the Namespace drop-down list on the top of the page.

    5. Click Create from YAML in the upper-right corner of the page.

    6. Copy the following YAML content to create a Secret named kubeai-rds:

      apiVersion: v1
      kind: Secret
      metadata:
        name: kubeai-rds
        namespace: kube-ai
      type: Opaque
      stringData:
        MYSQL_HOST: "Your RDS URL"
        MYSQL_DB_NAME: "Database name"
        MYSQL_USER: "Database username"
        MYSQL_PASSWORD: "Database password"

      | Parameter | Description |
      | --- | --- |
      | name | The name of the Secret. |
      | namespace | The name of the namespace. |
      | MYSQL_HOST, MYSQL_DB_NAME, MYSQL_USER, MYSQL_PASSWORD | The parameters related to the ApsaraDB RDS for MySQL instance. For more information, see Create an ApsaraDB RDS for MySQL instance and Create accounts and databases. |
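The `stringData` field of a Secret accepts plain-text values, and Kubernetes stores them base64-encoded under `data`. As a rough illustration, the sketch below builds the equivalent `data` form of the kubeai-rds Secret and rejects empty values before a manifest is produced. Only the Secret name, namespace, and key names come from this topic; the `build_rds_secret` helper and the sample connection values are hypothetical.

```python
import base64
import json

# Keys that the kubeai-rds Secret in this topic defines.
REQUIRED_KEYS = ("MYSQL_HOST", "MYSQL_DB_NAME", "MYSQL_USER", "MYSQL_PASSWORD")


def build_rds_secret(values):
    """Build the kubeai-rds Secret with base64-encoded `data` entries."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "kubeai-rds", "namespace": "kube-ai"},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(values[key].encode("utf-8")).decode("ascii")
            for key in REQUIRED_KEYS
        },
    }


if __name__ == "__main__":
    # Made-up sample values; replace them with your own RDS settings.
    secret = build_rds_secret({
        "MYSQL_HOST": "rm-example.mysql.rds.aliyuncs.com",
        "MYSQL_DB_NAME": "kubeai",
        "MYSQL_USER": "kubeai_user",
        "MYSQL_PASSWORD": "change-me",
    })
    print(json.dumps(secret, indent=2))
```

Base64 is an encoding, not encryption, so treat the generated manifest as sensitive in the same way as the plain-text YAML.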

Install and configure Kubeflow Pipelines

After you select Kubeflow Pipelines, you must set the Workflow Data Storage parameter.

Pre-installed MinIO

Make sure that ESSDs are attached to the nodes in the cluster. The cloud-native AI suite automatically deploys MinIO in the cluster, so you do not need to purchase an Object Storage Service (OSS) bucket. However, data security is not guaranteed by SLAs.

Important

Data loss may occur if cluster exceptions occur or MinIO is accidentally deleted.

To check the types of disks that are mounted to the cluster, perform the following steps:

  1. Log on to the ACK console and click Clusters in the left-side navigation pane.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Volumes > Persistent Volumes in the left-side navigation pane.

  3. On the Persistent Volumes page, take note of the name of the persistent volume (PV) that you want to manage.

  4. Go to the ECS console. In the left-side navigation pane, choose Storage & Snapshots > Disks. On the Disks page, search for the disk by using the name that you obtained. You can view the disk type in the Disk Category column.

OSS

  1. After you install the cloud-native AI suite, click the name of the cluster. Then, choose Configurations > Secrets in the left-side navigation pane.

  2. Select kube-ai from the Namespace drop-down list on the top of the page.

  3. Click Create from YAML in the upper-right corner of the page.

  4. Copy the following YAML content to create a Secret named kubeai-oss. After the Secret is created, an OSS bucket named mlpipeline-<clusterid> is created. For more information about the billing of OSS, see Billing overview.

    apiVersion: v1
    kind: Secret
    metadata:
      name: kubeai-oss
      namespace: kube-ai
    type: Opaque
    stringData:
      ENDPOINT: "oss-cn-hangzhou.aliyuncs.com"   
      ACCESS_KEY_ID: "****"     
      ACCESS_KEY_SECRET: "****"  

    | Parameter | Description |
    | --- | --- |
    | name | The name of the Secret. |
    | namespace | The name of the namespace. |
    | ENDPOINT | The endpoint of OSS. In this example, the endpoint of OSS in the China (Hangzhou) region is used. For more information, see Regions and endpoints. |
    | ACCESS_KEY_ID, ACCESS_KEY_SECRET | Enter the AccessKey pair of your account. For information about how to obtain an AccessKey pair, see Obtain an AccessKey pair. |

    Important

    To ensure data security, we recommend that you enter the AccessKey pair of a RAM user. If you use a RAM user, attach the AliyunOSSFullAccess policy to the RAM user.
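The values in the kubeai-oss Secret follow simple naming patterns. As a rough sketch, the helpers below derive the endpoint and the expected bucket name from a region ID and a cluster ID, and read the AccessKey pair from environment variables instead of hard-coding it. This assumes that the public endpoint follows the `oss-<region ID>.aliyuncs.com` pattern seen in the example and that `<clusterid>` is substituted verbatim, so confirm both against Regions and endpoints and against your OSS console. The function names, the `OSS_ACCESS_KEY_ID` and `OSS_ACCESS_KEY_SECRET` variable names, and the sample IDs are hypothetical.

```python
import os


def oss_public_endpoint(region_id):
    """Assumed pattern, inferred from the example endpoint in this topic."""
    return f"oss-{region_id}.aliyuncs.com"


def pipeline_bucket_name(cluster_id):
    """Bucket name that this topic says is created: mlpipeline-<clusterid>."""
    return f"mlpipeline-{cluster_id}"


def oss_secret_string_data(region_id, env=None):
    """Build the stringData mapping for the kubeai-oss Secret."""
    env = os.environ if env is None else env
    return {
        "ENDPOINT": oss_public_endpoint(region_id),
        # Hypothetical variable names; use whatever your tooling provides.
        "ACCESS_KEY_ID": env["OSS_ACCESS_KEY_ID"],
        "ACCESS_KEY_SECRET": env["OSS_ACCESS_KEY_SECRET"],
    }


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    sample_env = {"OSS_ACCESS_KEY_ID": "****", "OSS_ACCESS_KEY_SECRET": "****"}
    print(oss_secret_string_data("cn-hangzhou", env=sample_env))
    print(pipeline_bucket_name("c0123456789abcdef"))
```

Passing the environment as a parameter keeps the helper easy to exercise without real credentials.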