The cloud-native AI suite is a Container Service for Kubernetes (ACK) solution powered by cloud-native AI technologies and products. It helps you fully utilize cloud-native architectures and technologies to quickly develop an AI-assisted production system in ACK, and provides full-stack optimization for AI and machine learning applications and systems. This topic describes the architecture, key features, and use scenarios of the cloud-native AI suite, and explains how to work with it.
The cloud-native AI suite uses Kubernetes as the base. It centrally manages heterogeneous resources, and provides standard Kubernetes clusters and APIs to run key components, manage and maintain resources, schedule and scale AI jobs, accelerate data access, orchestrate workflows, integrate big data services, manage the lifecycle of AI jobs, manage AI artifacts, and perform O&M tasks. The cloud-native AI suite also optimizes AI DevOps. It supports AI dataset management, and allows you to develop, train, and evaluate AI models and deploy models as inference services.
You can use key components through the CLI, SDKs for different programming languages, and the console. With the help of these components and tools, you can build, extend, or customize your AI production systems on demand. The cloud-native AI suite also allows you to integrate Alibaba Cloud AI services, open source AI frameworks, and third-party AI capabilities by using the same components and tools.
In addition, the cloud-native AI suite supports seamless integration with Machine Learning Platform for AI to help you develop a high-performance, elastic one-stop AI platform. You can use services such as Data Science Workshop (DSW), Deep Learning Containers (DLC), and Elastic Algorithm Service (EAS) provided by Machine Learning Platform for AI. ACK can greatly improve the elasticity and efficiency of AI model development, training, and inference for the preceding services. The cloud-native AI suite also allows you to deploy Lightweight Machine Learning Platform for AI in ACK clusters with a few clicks to make AI development much easier. You can integrate algorithms and engines that are deeply optimized by Machine Learning Platform for AI based on years of experience into containerized applications to greatly accelerate model training and inference. For more information about Machine Learning Platform for AI, see What is Machine Learning Platform for AI?
The following figure shows the architecture of the cloud-native AI suite.
The cloud-native AI suite uses Kubernetes as the base, and provides full-stack support and optimization for AI and machine learning applications and systems. The following table describes the key features provided by the cloud-native AI suite.
Centralized management of heterogeneous resources
AI job scheduling
Elastic AI jobs
Elastic scheduling for distributed deep learning jobs: The cloud-native AI suite dynamically scales the number of workers and nodes without interrupting model training or reducing model precision. It adds workers to accelerate training when the cluster has idle resources and releases workers when the cluster runs short of resources, so that training continues during resource shortages. This mode improves the overall resource utilization of the cluster, reduces the impact of node failures on running jobs, and shortens the time that jobs wait to launch.
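The scale-out and scale-in behavior described for elastic jobs can be sketched as a small decision function. This is an illustration only, not the scheduler that the cloud-native AI suite uses: the `ElasticJob` class, the `target_workers` function, and the one-GPU-per-worker assumption are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ElasticJob:
    """Hypothetical description of an elastic training job (not an ACK API)."""
    min_workers: int      # fewest workers the job can train with
    max_workers: int      # most workers the job can make use of
    current_workers: int  # workers running right now


def target_workers(job: ElasticJob, idle_gpus: int, gpus_reclaimed: int = 0) -> int:
    """Return the worker count the job should scale to.

    Assumes one GPU per worker. `idle_gpus` is spare capacity the job may
    borrow; `gpus_reclaimed` is capacity the cluster needs back.
    """
    if gpus_reclaimed > 0:
        # Release workers under resource pressure, but never drop below the minimum.
        return max(job.min_workers, job.current_workers - gpus_reclaimed)
    # Add workers while idle GPUs exist, but never exceed the maximum.
    return min(job.max_workers, job.current_workers + idle_gpus)


job = ElasticJob(min_workers=2, max_workers=8, current_workers=4)
print(target_workers(job, idle_gpus=3))                    # 7
print(target_workers(job, idle_gpus=10))                   # 8 (capped at max_workers)
print(target_workers(job, idle_gpus=0, gpus_reclaimed=3))  # 2 (held at min_workers)
```

Clamping to a minimum and a maximum is what keeps the job training during a shortage while still letting it use idle capacity when the cluster has some.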
AI data orchestration and acceleration
Fluid: introduces the dataset concept. Fluid provides training jobs with a data abstraction and a data orchestration and acceleration platform that helps you manage datasets, enforce access control, and accelerate data access. Fluid can ingest data from different storage services and aggregate the data into a single dataset. You can also connect Fluid to on-cloud or on-premises storage services in a hybrid cloud environment to manage data and accelerate data access. In addition, Fluid can be extended to support a variety of distributed cache services. You can configure a cache service for each dataset and use features such as dataset warmup, cache capacity monitoring, and elastic scaling to reduce the overhead of remotely ingesting data for training jobs and improve the efficiency of GPU computing.
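To make the dataset concept more concrete, the following sketch builds a manifest for a Fluid `Dataset` that aggregates two storage locations into one dataset. It assumes the open source Fluid `data.fluid.io/v1alpha1` API with a `spec.mounts` list. The `fluid_dataset` helper, the dataset name, and the bucket paths are hypothetical placeholders. A cache runtime is configured separately and is not shown here.

```python
import json
from typing import Any, Dict


def fluid_dataset(name: str, mounts: Dict[str, str]) -> Dict[str, Any]:
    """Build a Fluid Dataset manifest that aggregates several storage sources.

    `mounts` maps a mount name to a storage URI. Both are supplied by the
    caller; nothing here contacts a cluster.
    """
    return {
        "apiVersion": "data.fluid.io/v1alpha1",  # assumed Fluid API version
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "mounts": [
                {"name": mount_name, "mountPoint": uri}
                for mount_name, uri in mounts.items()
            ]
        },
    }


# Hypothetical bucket paths; replace them with your own storage locations.
manifest = fluid_dataset(
    "training-data-demo",
    {
        "images": "oss://example-bucket/images/",
        "labels": "oss://example-bucket/labels/",
    },
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, because kubectl accepts JSON manifests as well as YAML.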
AI job lifecycle management
The cloud-native AI suite is suitable for continuously improving the utilization of heterogeneous resources and efficiently handling heterogeneous workloads such as AI jobs.
Scenario 1: Continuously improve the utilization of heterogeneous resources
The cloud-native AI suite provides an abstraction of heterogeneous resources in the cloud, including computing resources (such as CPUs, GPUs, NPUs, VPUs, and FPGAs), storage resources (OSS, NAS, CPFS, and HDFS), and network resources (TCP and RDMA). You can use the cloud-native AI suite to centrally manage, maintain, and allocate these resources, and continuously improve the resource utilization based on resource scaling and software/hardware optimization.
Scenario 2: Efficiently handle heterogeneous workloads such as AI jobs
The cloud-native AI suite is compatible with mainstream open source engines such as TensorFlow, PyTorch, Horovod, Spark, and Flink, and also supports self-managed engines and runtimes. The cloud-native AI suite allows you to run heterogeneous workloads, manage the lifecycle of jobs, and schedule workflows to ensure the scale and performance of your training jobs. The cloud-native AI suite also continuously optimizes training jobs in terms of performance, efficiency, and costs, optimizes the user experience of development and maintenance, and improves the engineering efficiency.
The cloud-native AI suite defines the following user roles.
Algorithm engineer and data scientist
Work with the cloud-native AI suite
Follow the steps in the following figure to use the cloud-native AI suite based on the user role that you assume.
Create an Alibaba Cloud account
Create an Alibaba Cloud account and complete the real-name verification. For more information, see Create an Alibaba Cloud account.
Create an ACK cluster
Activate ACK and create an ACK cluster. We recommend that you use the following cluster configuration. For more information, see Create an ACK managed cluster.
Configure cluster dependencies and create dependent cloud resources (optional)
2. System and environment
Activate and install the cloud-native AI suite
Manage users and quotas
AI Dashboard and kubectl
(Algorithm engineer and data scientist)
The cloud-native AI suite allows algorithm engineers and data scientists to use Arena, the web console, and the AI Developer Console to develop models, train models, deploy inference services, and manage jobs.
3. Model training and deployment
(Algorithm engineer and data scientist)
When you use Arena or the AI Developer Console, you can perform the following steps to train and deploy models:
Manage and evaluate models
Deploy a model as an inference service. For more information, see Deploy AI services.
AI Developer Console and Arena
Use Lightweight Machine Learning Platform for AI to develop, train, and deploy models.
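As a rough illustration of the Arena-based training step in this stage, the sketch below assembles an `arena submit tfjob` command line without running it. It assumes the open source Arena CLI and its `--name`, `--image`, `--gpus`, and `--workers` options. The `arena_submit_tfjob` helper, the job name, the image, and the training script path are hypothetical, so check the flags against `arena submit tfjob --help` for your Arena version.

```python
import shlex
from typing import List


def arena_submit_tfjob(name: str, image: str, gpus: int,
                       workers: int, command: str) -> List[str]:
    """Assemble (but do not execute) an Arena command for a TensorFlow job."""
    return [
        "arena", "submit", "tfjob",
        f"--name={name}",        # job name shown when you list jobs
        f"--image={image}",      # container image that holds the training code
        f"--gpus={gpus}",        # GPUs requested per worker
        f"--workers={workers}",  # number of training workers
        command,                 # command that runs inside each worker
    ]


# All values below are placeholders for illustration.
argv = arena_submit_tfjob(
    name="tf-demo",
    image="registry.example.com/demo/tensorflow-train:latest",
    gpus=1,
    workers=2,
    command="python /app/train.py",
)
print(shlex.join(argv))
```

Building the argument list first makes the command easy to review, log, or pass to `subprocess.run(argv, check=True)` on a machine where Arena is installed and configured.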
4. Monitoring and maintenance
Monitor and maintain resources
View the dashboards of various resources, including clusters, nodes, training jobs, and resource quotas. For more information, see Work with AI dashboards.
Manage elastic jobs
View elastic jobs and job details. For more information, see View elastic jobs.
5. Billing and payments
For more information, see Billing of the cloud-native AI suite.
Getting Started (for beginners)
Helps you quickly apply the cloud-native AI suite to your development and O&M work through a few hands-on examples. For more information, see Cloud-native AI suite user guide and Cloud-native AI suite operations and maintenance guide.
Lists the release notes for the cloud-native AI suite.