
Container Service for Kubernetes: Overview of the cloud-native AI suite

Last Updated: Oct 19, 2023

The cloud-native AI suite is a Container Service for Kubernetes (ACK) solution built on cloud-native AI technologies and products. It helps you use cloud-native architectures and technologies to quickly build an AI-assisted production system in ACK, and it provides full-stack optimization for AI and machine learning applications and systems. This topic describes the architecture, key features, and use scenarios of the cloud-native AI suite, and explains how to work with it.

Architecture

The cloud-native AI suite is built on Kubernetes. It centrally manages heterogeneous resources and provides standard Kubernetes clusters and APIs on which the key components run. These components manage and maintain resources, schedule and scale AI jobs, accelerate data access, orchestrate workflows, integrate big data services, manage the lifecycle of AI jobs, manage AI artifacts, and perform O&M tasks. The cloud-native AI suite also optimizes AI DevOps: it supports AI dataset management, and allows you to develop, train, and evaluate AI models and deploy models as inference services.

You can use key components through the CLI, SDKs for different programming languages, and the console. With the help of these components and tools, you can build, extend, or customize your AI production systems on demand. The cloud-native AI suite also allows you to integrate Alibaba Cloud AI services, open source AI frameworks, and third-party AI capabilities by using the same components and tools.

In addition, the cloud-native AI suite integrates with Machine Learning Platform for AI (PAI) to help you build a high-performance, elastic, one-stop AI platform. You can use PAI services such as Data Science Workshop (DSW), Deep Learning Containers (DLC), and Elastic Algorithm Service (EAS), and ACK improves the elasticity and efficiency of AI model development, training, and inference for these services. The cloud-native AI suite also allows you to deploy Lightweight Machine Learning Platform for AI in ACK clusters with a few clicks to simplify AI development. You can integrate the algorithms and engines that PAI has optimized into containerized applications to accelerate model training and inference. For more information about Machine Learning Platform for AI, see What is Machine Learning Platform for AI?

The following figure shows the architecture of the cloud-native AI suite.

[Figure: architecture of the cloud-native AI suite]

Key features

The cloud-native AI suite uses Kubernetes as the base, and provides full-stack support and optimization for AI and machine learning applications and systems. The following table describes the key features provided by the cloud-native AI suite.

Feature

Description

References

Centralized management of heterogeneous resources

  • Support for heterogeneous resources: In addition to the resources that ACK supports, the cloud-native AI suite supports heterogeneous resources such as NVIDIA GPUs, NPUs, VPUs, and RDMA networks. You can use the cloud-native AI suite to centrally schedule, manage, and maintain these resources.

  • Monitoring and maintenance: The cloud-native AI suite monitors GPUs in multiple dimensions and displays visualized information about the allocation, use, and health status of GPUs.

  • Resource utilization improvement: The cloud-native AI suite supports GPU sharing, GPU memory isolation, and topology-aware GPU scheduling to help you improve resource utilization.
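To make the utilization benefit of GPU sharing concrete, the following minimal sketch packs jobs onto GPUs by requested GPU memory instead of giving each job a whole GPU. It is an illustration only: the first-fit-decreasing heuristic, the 16 GiB GPU size, and the job sizes are assumptions chosen for the example, and the sketch does not model the ACK scheduler or GPU memory isolation.

```python
from typing import List


def pack_by_gpu_memory(requests_gib: List[int], gpu_memory_gib: int) -> List[List[int]]:
    """Place GPU memory requests on GPUs with first-fit decreasing.

    Returns one list of requests for each GPU that ends up in use.
    """
    gpus: List[List[int]] = []
    for request in sorted(requests_gib, reverse=True):
        if request > gpu_memory_gib:
            raise ValueError(f"request of {request} GiB exceeds GPU memory")
        for gpu in gpus:
            # Share an already used GPU if the request still fits.
            if sum(gpu) + request <= gpu_memory_gib:
                gpu.append(request)
                break
        else:
            # No used GPU has room, so take a new one.
            gpus.append([request])
    return gpus


# Hypothetical workload: six jobs that each need only part of a 16 GiB GPU.
jobs = [4, 8, 2, 6, 4, 8]
shared = pack_by_gpu_memory(jobs, gpu_memory_gib=16)
print(f"GPUs needed with exclusive allocation: {len(jobs)}")
print(f"GPUs needed with memory-based sharing: {len(shared)}")
```

With exclusive allocation every job occupies one GPU regardless of how little memory it needs, which is the waste that sharing is meant to remove.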

AI job scheduling

  • Multiple scheduling policies: The ACK scheduler extends the Kubernetes-native scheduling framework for batch jobs such as AI distributed training jobs. A variety of batch scheduling policies are supported, including gang scheduling (coscheduling), First In First Out (FIFO) scheduling, capacity scheduling, fair sharing, and bin packing and spread.

  • Job queues: The cloud-native AI suite provides priority-based job queues to allow you to customize the priorities of jobs and configure elastic quotas for tenants.

  • Workflow orchestration: You can integrate Kubeflow Pipelines or Argo Workflows to orchestrate workflows for complex AI jobs.
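Of the policies listed above, gang scheduling is the one that matters most for distributed training, because a job whose workers start only partially holds GPUs without making progress. The following is a minimal sketch of the all-or-nothing rule, not the ACK scheduler implementation. The `Job` type, the job names, and the choice to let later jobs proceed while an oversized job stays pending are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    workers: int          # pods that must all start together
    gpus_per_worker: int


def gang_admit(queue: List[Job], free_gpus: int) -> Tuple[List[str], List[str]]:
    """Walk the queue in submission order and admit a job only if ALL of
    its workers fit. A job that does not fit receives no GPUs at all."""
    admitted: List[str] = []
    pending: List[str] = []
    for job in queue:
        demand = job.workers * job.gpus_per_worker
        if demand <= free_gpus:
            free_gpus -= demand
            admitted.append(job.name)
        else:
            pending.append(job.name)
    return admitted, pending


# Hypothetical queue on a cluster with 8 free GPUs.
queue = [
    Job("train-a", workers=4, gpus_per_worker=1),
    Job("train-b", workers=8, gpus_per_worker=1),
    Job("train-c", workers=2, gpus_per_worker=2),
]
admitted, pending = gang_admit(queue, free_gpus=8)
print("admitted:", admitted)
print("pending: ", pending)
```

Without the all-or-nothing rule, the second job could start four of its eight workers and hold those GPUs while it waits for the rest, which would also block the third job.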

Elastic AI jobs

Elastic scheduling for distributed deep learning jobs: The cloud-native AI suite dynamically scales the number of workers and nodes of a running job without interrupting model training or affecting model precision. It adds workers to accelerate training when the cluster has idle resources, and releases workers when the cluster cannot provide sufficient resources, so that training can continue during resource shortages. This mode improves the overall resource utilization of the cluster, reduces the impact of node failures on training jobs, and shortens the time that jobs wait to launch. For more information, see Elastic training.
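The scale-out and scale-in behavior described above can be sketched as a small decision function. This is a simplified illustration, not the suite's actual controller: the minimum and maximum worker bounds, the one-GPU-per-worker default, and the use of a negative idle count to represent GPUs that the cluster needs back are all assumptions.

```python
def target_workers(
    current: int,
    idle_gpus: int,
    min_workers: int,
    max_workers: int,
    gpus_per_worker: int = 1,
) -> int:
    """Return the worker count an elastic job should scale to.

    idle_gpus > 0 means the cluster has spare GPUs to lend to the job.
    idle_gpus < 0 means the cluster needs that many GPUs back.
    """
    # Floor division rounds toward negative infinity, so a shortage always
    # releases enough whole workers to cover the GPUs being reclaimed.
    delta = idle_gpus // gpus_per_worker
    return max(min_workers, min(max_workers, current + delta))


# Hypothetical job running 4 workers, allowed to run between 2 and 8.
scale_out = target_workers(current=4, idle_gpus=6, min_workers=2, max_workers=8)
scale_in = target_workers(current=4, idle_gpus=-3, min_workers=2, max_workers=8)
steady = target_workers(current=4, idle_gpus=0, min_workers=2, max_workers=8)
print(scale_out, scale_in, steady)
```

The lower bound keeps the job running at reduced speed during a shortage instead of stopping it, and the upper bound prevents one job from absorbing every idle GPU.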

AI data orchestration and acceleration

Fluid: introduces the dataset concept. Fluid provides training jobs with a data abstraction and provides a data orchestration and acceleration platform to help you manage datasets, enforce access control, and accelerate data access. Fluid can ingest data from different storage services and aggregate the data into the same dataset. You can also connect Fluid to on-cloud or on-premises storage services in a hybrid cloud environment to manage data and accelerate data access. In addition, Fluid can be extended to support a variety of distributed cache services. You can configure a cache service for each dataset and use features such as dataset warmup, cache capacity monitoring, and elastic scaling to greatly reduce the overheads of remotely ingesting data for training jobs and improve the efficiency of GPU computing.
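As a rough illustration of the dataset abstraction, the following sketch builds a Fluid Dataset and a matching cache runtime as Kubernetes manifests. The field names follow the open source Fluid `data.fluid.io/v1alpha1` API. The dataset name, the mount point URL, the choice of `AlluxioRuntime` as the cache engine, the replica count, and the cache sizes are placeholders chosen for the example, so check them against the Fluid version and runtime that your cluster uses.

```python
import json
from typing import Any, Dict, List


def fluid_manifests(name: str, mount_point: str, cache_quota: str) -> List[Dict[str, Any]]:
    """Build a Dataset and a cache runtime that share the same name."""
    dataset = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name},
        # Remote storage that backs the dataset.
        "spec": {"mounts": [{"mountPoint": mount_point, "name": name}]},
    }
    runtime = {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "AlluxioRuntime",
        # Fluid binds a runtime to the dataset with the same name.
        "metadata": {"name": name},
        "spec": {
            "replicas": 2,  # number of cache workers (placeholder)
            "tieredstore": {
                "levels": [
                    {
                        "mediumtype": "MEM",
                        "path": "/dev/shm",
                        "quota": cache_quota,
                        "high": "0.95",
                        "low": "0.7",
                    }
                ]
            },
        },
    }
    return [dataset, runtime]


# Hypothetical dataset name and source URL.
manifests = fluid_manifests("demo-dataset", "https://example.com/datasets/demo/", "2Gi")
bundle = {"apiVersion": "v1", "kind": "List", "items": manifests}
print(json.dumps(bundle, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f` in a cluster where Fluid is installed.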

AI job lifecycle management

  • Arena: provides an abstraction of data preparation and management, model development, model training, model evaluation, model inference services, and online O&M. Arena is a command-line tool that can help you manage the key components in AI DevOps. Arena simplifies the management of underlying resources and environments, job scheduling, and GPU allocation and monitoring. Arena is compatible with mainstream AI frameworks and tools, including TensorFlow, PyTorch, Horovod, Spark, JupyterLab, TF-Serving, and Triton. Arena also provides SDKs for Golang, Java, and Python for secondary development.

  • Visualized O&M: provides easy-to-use dashboards and a developer console to allow you to view the status of your cluster and quickly submit training jobs.
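To give a feel for how Arena hides the underlying Kubernetes objects, the following sketch assembles an `arena submit tf` command for a single-worker TensorFlow training job. It only builds and prints the command. The job name, the image address, and the training script path are hypothetical, and only a small subset of Arena's flags is shown.

```python
import shlex
from typing import List


def arena_submit_tf(name: str, image: str, gpus: int, command: str) -> List[str]:
    """Build the argument list for a standalone TensorFlow training job."""
    return [
        "arena", "submit", "tf",
        f"--name={name}",
        f"--gpus={gpus}",
        f"--image={image}",
        command,  # the training command is passed as the last argument
    ]


# Hypothetical job: the image and the script path are placeholders.
cmd = arena_submit_tf(
    name="tf-demo",
    image="registry.example.com/demo/tf-train:latest",
    gpus=1,
    command="python /app/train.py --epochs 1",
)
print(shlex.join(cmd))
```

On a machine where Arena is installed and configured for the cluster, the same list can be passed to `subprocess.run(cmd, check=True)`. After submission, `arena get tf-demo` shows the job status and `arena logs tf-demo` prints the training logs.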

Use scenarios

The cloud-native AI suite is suitable for continuously improving the utilization of heterogeneous resources and efficiently handling heterogeneous workloads such as AI jobs.

[Figure: use scenarios of the cloud-native AI suite]

Scenario 1: Continuously improve the utilization of heterogeneous resources

The cloud-native AI suite provides an abstraction of heterogeneous resources in the cloud, including computing resources (such as CPUs, GPUs, NPUs, VPUs, and FPGAs), storage resources (OSS, NAS, CPFS, and HDFS), and network resources (TCP and RDMA). You can use the cloud-native AI suite to centrally manage, maintain, and allocate these resources, and continuously improve the resource utilization based on resource scaling and software/hardware optimization.

Scenario 2: Efficiently handle heterogeneous workloads such as AI jobs

The cloud-native AI suite is compatible with mainstream open source engines such as TensorFlow, PyTorch, Horovod, Spark, and Flink, and also supports self-managed engines and runtimes. The cloud-native AI suite allows you to run heterogeneous workloads, manage the lifecycle of jobs, and schedule workflows to ensure the scale and performance of your training jobs. The cloud-native AI suite also continuously optimizes training jobs in terms of performance, efficiency, and costs, optimizes the user experience of development and maintenance, and improves the engineering efficiency.

User roles

The cloud-native AI suite defines the following user roles.

  • O&M administrator: Responsible for building AI infrastructure and daily administration. For more information, see Deploy the cloud-native AI suite, Manage users, Manage elastic quota groups, and Manage datasets.

  • Algorithm engineer and data scientist: Use the cloud-native AI suite to manage jobs. For more information, see Model training, Manage models, Model evaluation, and Model analysis and optimization.

Work with the cloud-native AI suite

Follow the steps in the following figure and table to use the cloud-native AI suite based on your user role.

[Figure: procedure for working with the cloud-native AI suite]

Step

Description

Console

1. Preparations

(O&M administrator)

Create an Alibaba Cloud account

Create an Alibaba Cloud account and complete the real-name verification. For more information, see Create an Alibaba Cloud account.

Alibaba Cloud signup page

Create an ACK cluster

Activate ACK and create an ACK cluster. We recommend that you use the following cluster configuration. For more information, see Create an ACK managed cluster.

  • Cluster type: ACK Pro cluster, ACK Serverless cluster, or ACK Edge Pro cluster.

  • Cluster version: 1.18 or later.

  • Region: the region in which you activated ACK.

ACK console
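The recommended cluster configuration above can be checked in a small preflight script before you install the suite. The following sketch assumes that you already have the cluster type and version as strings. The function names and the sample version strings are illustrative, and the sketch does not call any ACK API.

```python
from typing import Tuple

RECOMMENDED_TYPES = {
    "ACK Pro cluster",
    "ACK Serverless cluster",
    "ACK Edge Pro cluster",
}
MIN_VERSION = (1, 18)


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.20.4' or '1.26.3-aliyun.1'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def meets_recommendation(cluster_type: str, version: str) -> bool:
    # Compare versions as integer tuples. A plain string comparison would
    # wrongly rank "1.9" above "1.18".
    return cluster_type in RECOMMENDED_TYPES and parse_version(version) >= MIN_VERSION


ok_cluster = meets_recommendation("ACK Pro cluster", "1.26.3-aliyun.1")
old_cluster = meets_recommendation("ACK Pro cluster", "1.9.7")
print(ok_cluster, old_cluster)
```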

Configure cluster dependencies and create dependent cloud resources (optional)

  • Install and configure AI Dashboard and AI Developer Console:

    • Install the monitoring agent in the ACK cluster and activate Log Service.

    • Create a policy for the cluster in the Resource Access Management (RAM) console. For more information, see Authorization.

    • If you want to use an internal domain name or a public domain name to access AI Dashboard and AI Developer Console, install the NGINX Ingress controller and enable internal access or Internet access for the controller.

    • To use a pre-installed MySQL database as the storage, make sure that the nodes in the cluster are mounted with enhanced SSDs (ESSDs).

    • To use a Relational Database Service (RDS) database as the storage, you need to purchase an ApsaraDB RDS instance and create a Secret named kubeai-rds in the kube-ai namespace.

    For more information, see Install and configure AI Dashboard and AI Developer Console.

  • Install and configure Kubeflow Pipelines:

2. System and environment

(O&M administrator)

Activate and install the cloud-native AI suite

  1. Open the activation page and activate the cloud-native AI suite.

  2. Install the cloud-native AI suite and relevant components. For more information, see Install the cloud-native AI suite.

ACK console

Manage users and quotas

  1. Add quota nodes and set resource quotas.

  2. Create users and user groups, allocate resources, and associate quota groups.

    For more information, see Manage users, Manage user groups, and Manage elastic quota groups.

  3. Generate a kubeconfig file and a logon token for a newly created user. For more information, see Generate a kubeconfig file and a logon token for a newly created user.

AI Dashboard and kubectl
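The idea behind elastic quota groups in the steps above can be sketched as follows. The sketch assumes the common capacity scheduling model in which each group has a guaranteed minimum and a maximum that it can reach by borrowing idle resources from other groups. The `QuotaGroup` type, the group names, and the numbers are hypothetical, and reclaiming borrowed resources through preemption is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuotaGroup:
    name: str
    min_gpus: int       # guaranteed share (assumed semantics)
    max_gpus: int       # ceiling that borrowing cannot exceed
    used_gpus: int = 0


def can_admit(group: QuotaGroup, request: int, groups: List[QuotaGroup], cluster_gpus: int) -> bool:
    """Check whether a request fits the group's ceiling and the idle GPUs."""
    if group.used_gpus + request > group.max_gpus:
        return False
    free = cluster_gpus - sum(g.used_gpus for g in groups)
    return request <= free


# Hypothetical cluster with 8 GPUs shared by two teams.
team_a = QuotaGroup("team-a", min_gpus=4, max_gpus=8, used_gpus=4)
team_b = QuotaGroup("team-b", min_gpus=4, max_gpus=6, used_gpus=0)
groups = [team_a, team_b]

borrow_ok = can_admit(team_a, 2, groups, cluster_gpus=8)   # borrows idle GPUs
over_max = can_admit(team_a, 5, groups, cluster_gpus=8)    # would exceed max
print(borrow_ok, over_max)
```

In this model a team can exceed its guaranteed share while other teams are idle, which keeps GPUs busy without removing the ceiling that protects other tenants.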

Prepare data

  1. Create datasets.

  2. Accelerate datasets. This step is optional. For more information, see Fluid overview.

(Algorithm engineer and data scientist)

The cloud-native AI suite allows algorithm engineers and data scientists to use Arena, the web console, and the AI Developer Console to develop models, train models, deploy inference services, and manage jobs.

ACK console

3. Model training and deployment

(Algorithm engineer and data scientist)

When you use Arena or the AI Developer Console, you can perform the following steps to train and deploy models:

Develop models

  1. Create and use a Jupyter notebook. For more information, see Create and use a Jupyter notebook.

  2. Use the Jupyter notebook to develop and test a model.

  3. Use the Jupyter notebook to submit code to a Git repository.

Train models

  1. Use the AI Developer Console or Arena to submit a training job.

  2. View the logs or TensorBoard data of the job.

    For more information, see Model management.

Manage and evaluate models

  1. Create a model and associate it with a training job.

  2. Submit a model evaluation job.

  3. Compare evaluation results.

    For more information, see Manage models and Evaluate a model.

Deploy models

Deploy a model as an inference service. For more information, see Deploy AI services.

AI Developer Console and Arena

Use Lightweight Machine Learning Platform for AI to develop, train, and deploy models.

Lightweight Machine Learning Platform for AI

4. Monitoring and maintenance

(O&M administrator)

Monitor and maintain resources

View the dashboards of various resources, including clusters, nodes, training jobs, and resource quotas. For more information, see Work with cloud-native AI dashboards.

AI Dashboard

Manage quotas

  • Create, query, update, and delete quota groups and resources in quota groups.

  • Change resource types.

    For more information, see Manage elastic quota groups.

Manage users

Create, query, update, and delete users or user groups. For more information, see Manage users and Manage user groups.

Manage datasets

  • Create, query, update, and delete datasets and data. For more information, see Manage datasets.

  • Accelerate datasets. For more information, see Overview of Fluid.

Manage elastic jobs

View elastic jobs and job details. For more information, see View elastic jobs.

5. Billing and payments

(O&M administrator)

Payments

Billing Management

Accounting

Billing

For more information, see Billing of the cloud-native AI suite.

References

  • Getting Started (for beginners): Walks you through a few hands-on practices that help you quickly apply the cloud-native AI suite to your development and O&M work. For more information, see Cloud-native AI suite user guide and Cloud-native AI suite operations and maintenance guide.

  • Release notes: Describes the changes in each release of the cloud-native AI suite.