How to Build a Cloud Native AI Platform Based on Kubernetes
A Brief History of AI and Cloud Native
Looking back at the history of AI, we find that it is no longer a new field. From being first defined at the Dartmouth workshop in 1956 to Geoffrey Hinton's proposal of the "Deep Belief Network" in 2006, AI has gone through three waves of development. In the past decade in particular, AI technology has made remarkable progress, driven by deep learning as the core algorithm, large-scale computing power represented by GPUs, and the accumulation of massive production data. Unlike the previous two waves, this time AI has achieved breakthroughs in machine vision, speech recognition, and natural language understanding, and has been successfully applied in industries such as commerce, healthcare, education, industry, security, and transportation. It has even spawned new fields such as autonomous driving and AIoT.
However, with the rapid advancement and widespread application of AI technology, many enterprises and institutions have also found that it is not easy to keep the "algorithm + computing power + data" flywheel running efficiently and to produce commercially valuable AI capabilities at scale. Expensive investments in computing power and operations, low production efficiency of AI services, and the lack of interpretability and generality of AI algorithms have all become heavy barriers in front of AI users.
Similar to the history of AI, the representative technology of the cloud native field, containers, is not a new concept either. It can be traced back to the birth of UNIX chroot in 1979, where the prototype of containers began to emerge. Next came a series of lightweight, kernel-based resource isolation technologies such as FreeBSD Jails, Linux VServer, OpenVZ, and Warden. It was not until Docker appeared in 2013 that the interface for using data center computing resources was redefined, with complete container object encapsulation, standard API definitions, and a friendly development and operations experience. As is well known, Kubernetes won the competition against Docker Swarm and Mesos, becoming the de facto standard for container orchestration and scheduling. Kubernetes and containers (including Docker, containerd, CRI-O, and other container runtimes and management tools) constitute the cloud native core technology defined by the CNCF. After five years of rapid development, the cloud native technology ecosystem now covers all aspects of IT systems, such as container runtimes, networking, storage and cluster management, observability, elasticity, DevOps, service mesh, serverless architecture, databases, and data warehouses.
Why Cloud Native AI Appears
According to the 2021 CNCF survey report, 96% of surveyed companies are using or evaluating Kubernetes, and these workloads include many AI-related businesses. Gartner has predicted that by 2023, 70% of AI applications will be developed based on container and serverless technology.
Indeed, while developing the Alibaba Container Service for Kubernetes (ACK) product, we have seen a growing number of users who want to manage GPU resources, develop and run deep learning and big data tasks, and deploy and elastically manage AI services in Kubernetes container clusters. These users come from a wide range of industries: fast-growing internet, gaming, and live-streaming companies, emerging fields such as autonomous driving and AIoT, and traditional sectors such as government, finance, and manufacturing.
Users encounter many common challenges when developing, producing, and using AI capabilities, including high AI development barriers, low engineering efficiency, high costs, complex software environment maintenance, scattered management of heterogeneous hardware, uneven allocation of computing resources, and cumbersome storage access. Many of these users already use cloud native technology and have successfully improved the development and operations efficiency of their applications and microservices. They hope to replicate that experience in the AI field and build AI applications and systems on cloud native technology.
In the early stages, users can quickly build GPU clusters with Kubernetes, Kubeflow, and NVIDIA Docker, access storage services through standard interfaces, and automatically schedule AI jobs and allocate GPU resources. Trained models can then be deployed in the cluster, which covers the basic AI development and production process. As requirements for production efficiency grew, however, users ran into more problems: low GPU utilization, poor scalability of distributed training, jobs that cannot be scaled elastically, slow access to training data, and a lack of dataset, model, and task management, which makes it hard to obtain real-time logs, monitoring, and visualization. Model releases lack quality and performance verification, and there are no service-oriented operation and governance measures after going online. Kubernetes and containers have a steep learning curve, and the user experience does not match the habits of data scientists. Add to this the difficulty of team collaboration and sharing, frequent resource contention, and even data security issues.
To fundamentally solve these problems, the AI production environment must be upgraded from a "small workshop" model to a "resource pooling + AI engineering platformization + multi-role collaboration" model. To help "cloud native + AI" users with such demands evolve in a flexible, controllable, and systematic way, we launched the cloud native AI suite on top of ACK. As long as users have an ACK Pro cluster or any standard Kubernetes cluster, they can quickly customize their own AI platform with the cloud native AI suite, freeing data scientists and algorithm engineers from complex and inefficient environment management, resource allocation, and task scheduling, and leaving them more energy for creative algorithm design and model tuning ("alchemy").
How to define cloud native AI
As enterprise IT architectures move further into cloud computing, cloud native technologies such as containers, Kubernetes, and service meshes have helped a large number of application services land quickly on cloud platforms and have delivered great value in scenarios such as elasticity, microservices, servitization, and DevOps. At the same time, IT decision-makers are considering how to use cloud native technology to support more types of workloads with a unified architecture and technology stack, to avoid the "chimney" systems, duplicated investment, and fragmented operations that result from using different architectures and technologies for different workloads.
Deep learning and AI tasks are among the important workloads the community is seeking to support with cloud native technology. In fact, more and more deep learning systems and AI platforms are being built on containers and Kubernetes clusters. In 2020, we explicitly proposed the concept, core scenarios, and reference technology architecture of "cloud native AI", in order to give this new field a concrete definition, a feasible roadmap, and best practices.
Alibaba Cloud Container Service ACK defines cloud native AI - fully utilizing cloud resource elasticity, heterogeneous computing power, standardized services, as well as cloud native technologies such as containers, automation, and microservices, to provide AI/ML with end-to-end solutions that are efficient, cost-effective, scalable, and replicable.
Core Scenarios
We will focus on two core scenarios in the field of cloud native AI: continuously optimizing the efficiency of heterogeneous resources, and efficiently running heterogeneous workloads such as AI.
Scenario 1: Optimizing Heterogeneous Resource Efficiency
Abstract, manage, operate, and allocate the various heterogeneous computing (such as CPU, GPU, NPU, VPU, FPGA, and ASIC), storage (OSS, NAS, CPFS, HDFS), and network (TCP, RDMA) resources within Alibaba Cloud IaaS or customer IDCs, and continuously improve resource utilization through elasticity and software-hardware co-optimization.
Scenario 2: Running heterogeneous workloads such as AI
Be compatible with mainstream or user-owned computing engines and runtimes such as TensorFlow, PyTorch, Horovod, ONNX, Spark, and Flink; run various heterogeneous workloads, manage job lifecycles, schedule task workflows, and guarantee task scale and performance. On one hand, continuously improve the cost-effectiveness of running tasks; on the other hand, continuously improve the development and operations experience and engineering efficiency.
Around these two core scenarios, more customized user scenarios can be built, such as constructing MLOps processes that match users' habits; or, based on the characteristics of computer vision (CV) AI services, jointly scheduling CPUs, GPUs, and VPUs (Video Processing Units) to support data processing pipelines at different stages; or extending support for higher-order distributed training frameworks in large-model pre-training and fine-tuning scenarios, combined with task and data scheduling strategies as well as elasticity and fault-tolerance solutions, to optimize the cost and success rate of large-model training tasks.
Reference Architecture
To support the core scenarios described above, we propose a cloud native AI reference technology architecture.
The cloud native AI reference technology architecture follows the principles of componentization, scalability, and assemblability with a white-box design. It exposes interfaces through standard Kubernetes objects and APIs, allowing developers and operators to select any components as needed for assembly and secondary development, and to quickly customize and build their own AI platforms.
The reference architecture is based on Kubernetes. Downwards, it encapsulates unified management of heterogeneous resources; upwards, it provides a standard Kubernetes cluster environment and APIs to run the core components, delivering resource operations and management, AI task scheduling and elastic scaling, data access acceleration, workflow orchestration, big data service integration, AI job lifecycle management, AI artifact management, unified operations, and other services. Further up, it targets the main links in the AI production process (MLOps), supporting AI dataset management, model development, training, and evaluation, and model inference services. Through unified command-line tools, multi-language SDKs, and console interfaces, users can use it directly or customize and extend it, and through the same components and tools it can also integrate cloud AI services, open-source AI frameworks, and third-party AI capabilities.
What cloud native AI capabilities do enterprises need
We have defined the concept, core scenarios, design principles, and reference architecture of cloud native AI. Next, what kind of cloud native AI product should be provided to users? By studying the AI tasks users run on ACK, observing the cloud native, MLOps, and MLSys communities, and analyzing containerized AI products in the industry, we summarized the characteristics and key capabilities that a cloud native AI product needs to possess.
High efficiency
It mainly covers three dimensions: efficiency of heterogeneous resource utilization, efficiency of AI job running and management, and efficiency of tenant sharing and team collaboration.
Good compatibility
Be compatible with common AI computing engines and models; support various storage services and provide general data access performance optimization; integrate with various big data services and be easy for business applications to integrate. In particular, it must meet data security and privacy protection requirements.
Scalable
The product architecture should be scalable, assemblable, and reproducible. It should provide standard APIs and tools for easy use, integration, and secondary development. Under the same architecture, the product implementation should adapt to delivery in various environments such as public cloud, proprietary cloud, hybrid cloud, and edge.
Alibaba Cloud Container Service ACK Cloud Native AI Suite
Based on the cloud native AI reference architecture and targeting the core scenarios and user needs mentioned above, the Alibaba Cloud Container Service team officially released the ACK cloud native AI suite product in 2021 [5], which has been launched in 27 regions worldwide for public testing, helping Kubernetes users quickly customize and build their own cloud native AI platform.
The ACK cloud native AI suite is mainly aimed at AI platform development and operations teams, helping them build and manage AI infrastructure quickly and at low cost. At the same time, the functions of each component are encapsulated into command-line and console tools for direct use by data scientists and algorithm engineers. After users install the cloud native AI suite in an ACK cluster, all components are ready to use out of the box, quickly enabling unified management of heterogeneous resources and low-barrier AI task management. For installation and usage, please refer to the product documentation [6].
Architecture Design
The functions of the cloud native AI suite are divided from bottom to top:
1. Heterogeneous resource layer, including heterogeneous computing power optimization and heterogeneous storage access
2. AI scheduling layer, including optimization of various scheduling strategies
3. AI elastic layer, including elastic AI training tasks and elastic AI inference services
4. AI data acceleration and orchestration layer, including dataset management, distributed cache acceleration, and big data service integration
5. AI job management layer, including AI job lifecycle management services and toolsets
6. AI operation and maintenance layer, including monitoring, logging, elastic scaling, fault diagnosis, and multi-tenancy
7. AI artifact repository, including AI container images, models, and AI experiment records
The following will briefly introduce the main capabilities and basic architecture design of the heterogeneous resource layer, AI scheduling layer, AI elastic task layer, AI data acceleration and orchestration layer, and AI job management layer in the cloud native AI suite product.
1. Unified management of heterogeneous resources
On top of ACK, the cloud native AI suite adds support for various heterogeneous resources such as NVIDIA GPUs, Hanguang 800 NPUs, FPGAs, VPUs, and RDMA high-performance networks, covering almost all device types on Alibaba Cloud. This makes using these resources in Kubernetes as simple as using CPU and memory: users just declare the required amount of each resource in the task specification. For expensive resources such as GPUs and NPUs, various utilization optimization methods are also provided.
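For example, once the corresponding device plugin is installed, requesting GPUs in a Pod spec looks much like requesting CPU and memory. The snippet below is a minimal sketch using the standard `nvidia.com/gpu` extended resource name; the image and resource amounts are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-demo        # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/tensorflow:23.10-tf2-py3   # illustrative training image
      command: ["python", "train.py"]
      resources:
        limits:
          cpu: "8"
          memory: 32Gi
          nvidia.com/gpu: 2      # request two whole GPU cards, with the same syntax as CPU/memory
```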
GPU monitoring
We provide multi-dimensional GPU monitoring, making GPU allocation, usage, and health status clear at a glance. The built-in Kubernetes NPD (Node Problem Detector) automatically detects GPU device anomalies and raises alerts.
GPU elastic scaling
Combining the various elastic scaling capabilities of ACK elastic node pools, GPU resources can automatically scale on demand at both the node level and the task-instance level. Scaling can be triggered by GPU resource usage metrics as well as user-defined metrics. The scale-out node types include regular EGS instances as well as ECI instances, spot instances, and so on.
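As a sketch of metric-driven scaling for GPU workloads, the HorizontalPodAutoscaler below scales an inference Deployment on an external GPU-utilization metric. It assumes a metrics pipeline (for example, a DCGM exporter plus a Prometheus adapter) already exposes a metric named `DCGM_FI_DEV_GPU_UTIL` through the external metrics API; the names and thresholds are illustrative, and ACK's own scaling components may use different metric names.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resnet50-serving-hpa          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resnet50-serving            # illustrative inference Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL  # assumes an adapter exposes this metric externally
        target:
          type: AverageValue
          averageValue: "60"          # scale out when average GPU utilization exceeds ~60%
```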
GPU shared scheduling
Native Kubernetes only supports scheduling at the granularity of a whole GPU card, so a task container monopolizes the card. If the task's model is relatively small (in compute and GPU memory usage), this leaves GPU resources idle and wasted. The cloud native AI suite provides GPU shared scheduling, which allocates one GPU card to multiple task containers for shared use according to each model's GPU compute and memory requirements, so that more tasks can run and GPU utilization is maximized.
At the same time, to avoid interference between multiple containers sharing a GPU, Alibaba Cloud cGPU technology is integrated to ensure the security and stability of GPU sharing: GPU resource usage is isolated between containers, with no overuse and no cross-container impact when errors occur.
GPU shared scheduling also supports multiple allocation strategies: single-container single-GPU sharing, commonly used for model inference; single-container multi-GPU sharing, commonly used to debug distributed model training in development environments; and GPU-card Binpack/Spread strategies, which balance allocation density and availability of GPU cards. Similar fine-grained shared scheduling is also available for Alibaba's self-developed Hanguang 800 AI inference chip in ACK clusters.
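As an illustration of fine-grained sharing, the Pod below requests a slice of GPU memory instead of a whole card. To the best of our knowledge, ACK's GPU sharing extension exposes GPU memory as the extended resource `aliyun.com/gpu-mem` (in GiB); treat the resource name, image, and amounts as an assumption-laden sketch rather than exact product syntax.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-inference          # illustrative name
spec:
  containers:
    - name: model-server
      image: registry.example.com/models/resnet50-serving:latest  # illustrative image
      resources:
        limits:
          # Request 4 GiB of GPU memory rather than a whole card; several such
          # Pods can then be packed onto one GPU, with cGPU enforcing isolation.
          aliyun.com/gpu-mem: 4
```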
The improvement in GPU utilization from GPU shared scheduling and cGPU isolation is evident, and there have been many user cases. For example, one customer pooled hundreds of GPU instances of different specifications by the GPU memory dimension and, using GPU shared scheduling with automatic elastic scaling based on GPU memory, deployed hundreds of AI models. This not only significantly improved resource utilization but also greatly reduced operational complexity.
GPU topology aware scheduling
With the increasing complexity of deep learning models and the growing volume of training data, distributed training with multiple GPU cards has become very common. For algorithms optimized with gradient descent, frequent data exchange is required between GPU cards, and communication bandwidth often becomes the bottleneck that limits GPU computing performance.
GPU topology-aware scheduling obtains the topology of all GPU cards on each compute node. When selecting GPU resources for multi-card training tasks, the scheduler considers interconnect links such as NVLink, PCIe switches, QPI, and RDMA NICs based on this topology information, and automatically selects the GPU combination that provides the maximum communication bandwidth, avoiding bandwidth limits on GPU computing efficiency. Enabling GPU topology-aware scheduling is purely a configuration change, with zero intrusion into model and training code.
Comparing the performance of multi-card GPU distributed training tasks scheduled this way against vanilla Kubernetes shows that GPU topology-aware scheduling delivers efficiency gains at zero cost for the same computing engine (such as TensorFlow or PyTorch) and model (ResNet-50, VGG16, etc.). Compared with the smaller ResNet-50 model, VGG16 training introduces far more data transfer, so better GPU interconnect bandwidth has a more significant impact on its training performance.
2. AI Task Scheduling (Cybernetes)
AI distributed training is a typical batch workload. A training job submitted by a user is usually executed by a group of subtasks running together. Different AI training frameworks group subtasks differently: a TensorFlow training job may consist of multiple Parameter Server subtasks and multiple Worker subtasks, while Horovod's Ring-AllReduce training uses one Launcher and multiple Workers.
In Kubernetes, a subtask typically corresponds to a Pod. Scheduling a training job therefore becomes jointly scheduling a group of subtask Pods, similar to how Hadoop YARN schedules MapReduce and Spark batch jobs. The default Kubernetes scheduler can only treat a single Pod as the scheduling unit, which cannot meet the needs of batch task scheduling.
The ACK scheduler (Cybernetes [7]) extends the native Kubernetes scheduling framework and implements many typical batch scheduling strategies, including Gang Scheduling (coscheduling), FIFO scheduling, Capacity Scheduling, Fair Sharing, and Binpack/Spread. It adds a priority-based task queue, supporting custom task priority management and elastic resource quota control per tenant, and it can integrate Kubeflow Pipelines or Argo cloud native workflow engines to provide workflow orchestration for complex AI tasks. Together, the "task queue + task scheduler + workflow + task controller" architecture provides the basic support for building batch task systems on Kubernetes while also significantly improving overall cluster resource utilization.
At present, the cloud native AI suite supports not only deep learning training tasks such as TensorFlow, PyTorch, and Horovod, but also big data tasks such as Spark and Flink, as well as MPI high-performance computing jobs. We have also contributed some of these batch scheduling capabilities to the upstream Kubernetes scheduler open-source project and continue to drive the community toward cloud native batch system infrastructure.
• Gang Scheduling
Resources are allocated to a job as a whole only when the cluster can satisfy all of its subtasks; otherwise no resources are allocated to it. This avoids resource deadlocks in which partially scheduled jobs hold resources without making progress and starve other jobs. A minimal manifest sketch follows this list.
• Capacity Scheduling
Through elastic quotas and tenant queues, the minimum resource needs of each tenant are guaranteed while elastic quota sharing improves overall resource utilization.
• Binpack Scheduling
Jobs are packed onto one node first; when that node's resources are insufficient, scheduling moves on to the next node. This suits single-machine multi-GPU training tasks, avoids cross-machine data transfer, and prevents the resource fragmentation caused by large numbers of small tasks.
• Resource Reservation Scheduling
Reserve resources for specific jobs and allocate or reuse them in a targeted manner. This not only guarantees that specific tasks obtain resources deterministically, but also speeds up resource allocation, and it supports user-defined reservation policies.
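The sketch below illustrates the gang-scheduling convention referenced above: every worker Pod of a distributed job carries the same pod-group labels, and the scheduler only binds the group once at least `min-available` members can be placed. The label keys follow the upstream coscheduling plugin convention as we understand it; in practice an ACK cluster or a Kubeflow operator would normally stamp these labels for you, so treat names and counts as illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0                                          # one of four identical worker Pods
  labels:
    pod-group.scheduling.sigs.k8s.io/name: tf-job-demo       # all subtask Pods share this group name
    pod-group.scheduling.sigs.k8s.io/min-available: "4"      # schedule only if all 4 workers fit at once
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: registry.example.com/train/tf-worker:latest     # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```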
3. Elastic AI Tasks
Elasticity is one of the most important fundamental capabilities of cloud native and Kubernetes container clusters. Taking resource conditions, application availability, and cost constraints into account, intelligently scaling the number of service instances and cluster nodes must both meet the application's basic quality of service and avoid excessive cloud resource consumption. The cloud native AI suite supports both elastic model training tasks and elastic model inference services.
Elastic Training
Elastic training supports flexible scaling of distributed deep learning training tasks. During training, the number of worker instances and nodes can be scaled dynamically while the overall training progress and model accuracy are maintained. When cluster resources are idle, adding more workers accelerates training; when resources are tight, some workers can be released to keep training running at a baseline pace. This greatly improves overall cluster utilization, mitigates the impact of compute node failures, and significantly reduces the time users wait for submitted jobs to start.
It is worth noting that, in addition to resource elasticity, elastic deep learning training also requires support from the computing engine and the distributed communication framework, as well as coordinated optimization of the algorithm, data partitioning strategy, training optimizer, and so on (for example, finding the best way to keep the global batch size and learning rate matched) to guarantee the model's accuracy targets and performance requirements. At present, there is no universal way in the industry to achieve strongly reproducible elastic training on arbitrary models. Under certain constraints (model characteristics, distributed training method, resource scale, elasticity amplitude, etc.), for example ResNet-50 and BERT within 32 GPU cards, there are stable methods with PyTorch and TensorFlow to obtain satisfactory elastic training benefits.
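In the open-source ecosystem, one way to express such an elastic job is through the Kubeflow training-operator's PyTorchJob with an elastic policy, sketched below. The field names reflect our understanding of that project; the ACK suite's elastic training may use its own controllers and parameters, and all names, images, and replica counts are illustrative.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-resnet50              # illustrative name
spec:
  elasticPolicy:
    rdzvBackend: c10d                 # torch elastic rendezvous backend
    minReplicas: 2                    # training keeps running with as few as 2 workers
    maxReplicas: 8                    # and can scale out to 8 when resources are idle
    maxRestarts: 10
  pytorchReplicaSpecs:
    Worker:
      replicas: 4                     # desired worker count; the operator scales within [min, max]
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train/resnet50-elastic:latest  # illustrative image
              resources:
                limits:
                  nvidia.com/gpu: 1
```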
Elastic Inference
By leveraging Alibaba Cloud's rich resource elasticity capabilities, AI inference services can be scaled out and in automatically using various resource and elasticity options, such as shutdown-without-reclaim instances, elastic resource pools, scheduled scaling, preemptible (spot) instance scaling, and Spotlet mode, maintaining an optimized balance between service performance and cost.
In fact, the elasticity and service-oriented operation of online AI model inference services are quite similar to those of microservices and web applications, and many existing cloud native technologies can be used directly for online inference. However, AI model inference still has many unique aspects, such as model optimization methods, pipelined serving, fine-grained scheduling and orchestration, and adaptation to heterogeneous runtime environments. These have specialized demands and approaches, which we will not expand on here.
4. AI Data Orchestration and Acceleration (Fluid)
Running AI, big data, and other tasks on the cloud with a cloud native architecture brings the benefits of elastic computing resources, but it also brings challenges such as data access latency and high remote data bandwidth overhead caused by the separation of compute and storage. Especially in GPU deep learning training scenarios, repeatedly reading large amounts of training data remotely can seriously slow down GPU computing.
On the other hand, Kubernetes only provides a standard interface (CSI) for accessing and managing heterogeneous storage services; it does not define how applications use and manage data in container clusters. When running training tasks, data scientists need to manage dataset versions, control access permissions, preprocess datasets, and accelerate reads of heterogeneous data. There is no standard solution for this in Kubernetes, and it is one of the important capabilities missing from the cloud native container community.
The ACK cloud native AI suite abstracts the process by which compute tasks use data, proposes the concept of an elastic Dataset, and implements it as a "first-class citizen" in Kubernetes. Around the elastic Dataset, ACK created the data orchestration and acceleration system Fluid, which implements capabilities such as dataset management (CRUD operations), permission control, and access acceleration.
Fluid can configure a caching service for each Dataset: during training, data is automatically cached close to the compute task for the next round of iterative computation, and new training tasks can be scheduled onto compute nodes that already hold the cached dataset. In addition, dataset warm-up, cache capacity monitoring, and elastic cache scaling greatly reduce the cost of tasks pulling data remotely. Fluid can aggregate multiple different storage services (such as OSS and HDFS) as data sources into the same Dataset, and it can also access storage services in different locations, enabling data management and access acceleration in hybrid cloud environments.
Algorithm engineers use a Fluid Dataset in AI task containers much as they would a PVC (Persistent Volume Claim) in Kubernetes: they only need to specify the Dataset name in the task description file and mount it as a volume on the path from which the task reads data. Ideally, adding GPU resources to a training task should accelerate training almost linearly; in practice, however, the bandwidth available when many GPUs concurrently pull data from an OSS bucket often becomes the limit, and simply adding GPUs does not speed up training. With Fluid Dataset's distributed caching, this bottleneck of concurrent remote data pulling is effectively removed, and distributed training scales to more GPUs with a better speedup ratio.
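A minimal Fluid sketch looks like the following: a Dataset describing an OSS prefix as its data source, an AlluxioRuntime providing the distributed cache, and a training Pod that simply mounts the PVC Fluid creates under the Dataset's name. Bucket, endpoint, image, and sizes are illustrative placeholders, and a real OSS mount would typically also need credentials.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet-demo                      # Fluid exposes a PVC with this name
spec:
  mounts:
    - mountPoint: oss://my-training-bucket/imagenet/          # illustrative OSS path
      name: imagenet
      options:
        fs.oss.endpoint: oss-cn-hangzhou-internal.aliyuncs.com  # illustrative endpoint
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet-demo                      # must match the Dataset name
spec:
  replicas: 2                              # number of cache workers
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 20Gi                        # cache capacity per worker
        high: "0.95"
        low: "0.7"
---
apiVersion: v1
kind: Pod
metadata:
  name: train-with-cache
spec:
  containers:
    - name: trainer
      image: registry.example.com/train/resnet50:latest       # illustrative image
      command: ["python", "train.py", "--data", "/data/imagenet"]
      volumeMounts:
        - name: training-data
          mountPath: /data
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: imagenet-demo           # the PVC created by Fluid for the Dataset
```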
The ACK team, together with Nanjing University and Alluxio, jointly initiated the Fluid open-source project [8], which is hosted by the CNCF as a sandbox project. We hope to work with the community to advance the development and adoption of cloud native data orchestration and acceleration. Fluid's architecture can be extended to multiple distributed caching services through CacheRuntime plugins; it currently supports caching engines such as Alibaba Cloud EMR's JindoFS, open-source Alluxio, and JuiceFS. Good progress has been made both in open-source community co-construction and in adoption by cloud users.
The main capabilities of Fluid's cloud native data orchestration and acceleration include:
Through data affinity scheduling and distributed caching engine acceleration, the integration between data and computing is achieved, thereby accelerating computing access to data.
Manage data independently of storage and isolate resources through Kubernetes namespaces to achieve secure data isolation.
Combine data from different storage systems for computation, offering an opportunity to break down the data silos caused by the differences between those systems.
5. AI Job Lifecycle Management (Arena)
All components of the ACK cloud native AI suite are delivered to AI platform developers and operators as standard Kubernetes interfaces (CRDs) and APIs, which makes them convenient and easy to use for building a cloud native AI platform on Kubernetes.
But for data scientists and algorithm engineers developing and training AI models, Kubernetes syntax and operations are a "burden". They are more used to debugging code in IDEs such as Jupyter Notebook and submitting and managing training tasks through a command line or web interface. Logging, monitoring, storage access, GPU resource allocation, and cluster maintenance during task execution should be built-in capabilities that can be operated easily with tools.
The production process of an AI service mainly includes data preparation and management, model development and building, model training, model evaluation, and the online operation of model inference services. These stages iterate in a loop: the model is continuously optimized and updated as online data changes, released online through inference services, new data is collected, and the next iteration begins.
The cloud native AI suite abstracts the main steps of this production process and manages them with the command-line tool Arena [9]. Arena completely shields the complexity of underlying resource and environment management, task scheduling, GPU allocation, and monitoring, and is compatible with mainstream AI frameworks and tools, including TensorFlow, PyTorch, Horovod, Spark, JupyterLab, TensorFlow Serving, Triton, and more.
In addition to the command-line tool, Arena provides Golang/Java/Python SDKs for secondary development. Considering that many data scientists are used to managing resources and tasks through a web interface, the cloud native AI suite also provides a simple operations platform and a development console, meeting the need to quickly browse cluster status and submit training tasks. Together, these components form Arena, the cloud native AI job management toolset.
Kubeflow [10] is the mainstream open-source project in the Kubernetes community for supporting machine learning workloads. Arena has built-in support for Kubeflow's TFJob, PyTorchJob, and MPIJob task controllers, as well as the open-source SparkApplication task controller, and it can also integrate the KFServing and Kubeflow Pipelines projects. Users do not need to install Kubeflow separately in an ACK cluster; they only need to use Arena.
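Under the hood, an Arena TensorFlow training submission is translated into a Kubeflow TFJob object along the lines of the sketch below (a PS/Worker layout with GPU requests). This is a generic Kubeflow-style example with illustrative names and images, not the exact object Arena generates.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-distributed              # illustrative name
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/train/mnist:latest   # illustrative image
              command: ["python", "train.py"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/train/mnist:latest
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```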
It is worth mentioning that the ACK team contributed the Arena command line and SDK tools to the Kubeflow community early on. Many enterprise users have built their own AI platform MLOps processes and experiences by extending and encapsulating Arena.
The cloud native AI suite mainly solves problems for the following two types of users:
• AI developers, such as data scientists and algorithm engineers
• AI platform operation and maintenance personnel, such as AI cluster administrators
AI developers can use the Arena command line or the development console web interface to create Jupyter Notebook development environments, submit and manage model training tasks, query GPU allocation, view training task logs, monitoring, and visualization data in real time, compare and evaluate different model versions, and deploy selected models to the cluster. They can also perform A/B testing, canary releases, traffic control, and elastic scaling.
Outlook
The demand for digitization and intelligence in the wave of enterprise IT transformation is growing stronger, and the most prominent need is how to quickly and accurately mine new business opportunities and model innovations from massive business data, so as to better cope with changing and uncertain market challenges. AI is undoubtedly one of the most important means for enterprises to achieve this goal now and in the future. Clearly, continuously improving AI engineering efficiency, optimizing AI production costs, lowering the barrier to AI, and making AI capabilities broadly accessible has great practical value and social significance.
We clearly see that the advantages of cloud native in distributed architecture, standardized APIs, and a rich ecosystem are rapidly making it the new interface through which users consume cloud computing efficiently, helping businesses improve service agility and scalability. More and more users are actively exploring how to improve the production efficiency of AI services with cloud native container technology, and, further, how to support the enterprise's various heterogeneous workloads with the same technology stack.
Alibaba Cloud ACK provides the cloud native AI suite, which supports the AI production and operations process from the underlying heterogeneous computing resources, AI task scheduling, and AI data acceleration up to computing engine compatibility and AI job lifecycle management. With a simple user experience, an extensible architecture, and a unified, optimized implementation, it accelerates users' customization of their own cloud native AI platforms. We will also continue to share our work with the community to jointly promote the development and adoption of cloud native AI.
Looking back at the development history of AI, we will find that this is no longer a new field. From being first defined at the Dartmouth academic seminar in 1956 to Geoffrey Hinton's proposal of the "Deep Believe Network" in 2006, AI has gone through three waves of development. Especially in the past decade, AI technology has made remarkable progress, driven by deep learning as the core algorithm, GPU as the representative of large computing power, and the accumulation of massive production data. Unlike the previous two attempts, this time AI has achieved breakthroughs in technologies such as machine vision, speech recognition, and natural language understanding, and has successfully landed in many industries such as commerce, healthcare, education, industry, safety, and transportation. It has even spawned new fields such as autonomous driving and AIoT.
However, with the rapid advancement and widespread application of AI technology, many enterprises and institutions have also found that it is not easy to ensure the efficient operation of the "algorithm+computing power+data" flywheel and produce AI capabilities with commercial implementation value on a large scale. The expensive investment in computing power and operation and maintenance costs, low production efficiency of AI services, and the lack of interpretability and universality of AI algorithms have all become heavy barriers in front of AI users.
Similar to the history of AI, the representative technology in the cloud native field - containers - is not a new concept. It can be traced back to the birth of UNIX chroot technology in 1979, which began to emerge as a prototype of containers. Next came a series of lightweight kernel based resource isolation technologies such as FreeBSD Jails, Linux VServer, OpenVZ, Warden, etc. Until the emergence of Docker in 2013, it redefined the interface for using data center computing resources with comprehensive container object encapsulation, standard API definitions, and a friendly development and operation experience. As everyone knows, Kubernetes won the competition with Docker Swarm and Mesos, becoming the de facto standard for container layout and scheduling. Kubernetes and containers (including Docker, Containered, CRI-O, and other container runtime and management tools) constitute the cloud native core technology defined by CNCF. After 5 years of rapid development, the cloud native technology ecosystem has covered all aspects of IT systems such as container runtime, network, storage and cluster management, observability, elasticity, DevOps, service grid, serverless architecture, databases, data warehouses, etc.
Why Cloud Native AI Appears
According to the 2021 CNCF survey report, 96% of surveyed companies are using or evaluating Kubernetes, which includes the needs of many AI related businesses. Gartner has predicted that by 2023, 70% of AI applications will be developed based on container and serverless technology.
In reality, in the process of developing the Alibaba Container Service for Kubernetes (ACK) product, we see an increasing number of users hoping to manage GPU resources, develop and run deep learning and big data tasks, deploy and flexibly manage AI services in Kubernetes container clusters. Users come from various industries, including fast-growing companies such as the internet, games, live streaming, and autonomous driving Emerging fields such as AIoT even include traditional industries such as government enterprises, finance, and manufacturing.
Users encounter many common challenges when developing, generating, and using AI capabilities, including high AI development barriers, low engineering efficiency, high costs, complex software environment maintenance, dispersed heterogeneous hardware management, uneven allocation of computing resources, and cumbersome storage access. Many of these users are already using cloud native technology and have successfully improved the development and operation efficiency of applications and microservices. They hope to replicate the same experience to the AI field and build AI applications and systems based on cloud native technology.
In the early stages, users can quickly build GPU clusters using Kubernetes, Kubelow, and NVIDIA Docker, access storage services through standard interfaces, and automatically implement AI job scheduling and GPU resource allocation. The trained model can be deployed in the cluster, which basically realizes the AI development and production process. Subsequently, users had higher requirements for production efficiency and also encountered more problems. For example, GPU utilization is low, distributed training scalability is poor, homework cannot be flexibly scaled, training data access is slow, there is a lack of dataset, model, and task management, which makes it difficult to easily obtain real-time logs, monitoring, and visualization. Model publishing lacks quality and performance verification, and there is a lack of service-oriented operation and governance measures after going online. Kubernetes and containers have high usage thresholds, and the user experience does not conform to the habits of data scientists, Difficulties in team collaboration and sharing, frequent resource competition, and even data security issues.
To fundamentally solve these problems, the AI production environment must be upgraded from a "single handedly small workshop" model to a "resource pooling+AI engineering platformization+multi role collaboration" model. In order to help "cloud native+AI" users with such demands achieve flexible, controllable, and systematic evolution, we have launched cloud native AI suite products on the basis of ACK. As long as users have an ACK Pro cluster or any standard Kubernetes cluster, they can quickly customize their own AI platform using cloud native AI suites. Free data scientists and algorithm engineers from the complex and inefficient environmental management, resource allocation, and task scheduling work, leaving them with more energy for "brain hole" algorithms and "alchemy".
How to define cloud native AI
With the gradual deepening of enterprise IT architecture towards cloud computing, cloud native technologies such as containers, Kubernetes, and service grids have helped a large number of application services quickly land on cloud platforms, and have gained great value in scenarios such as elasticity, microserviceization, service-oriented, and DevOps. At the same time, IT decision-makers are also considering how to use cloud native technology to support more types of workloads with a unified architecture and technology stack. To avoid the burden of using different architectures and technologies for different loads, resulting in a "chimney" system, repetitive investment, and fragmented operation and maintenance.
Deep learning and AI tasks are one of the important workloads for communities seeking cloud native support. In fact, more and more deep learning systems and AI platforms have been built on containers and Kubernetes clusters. In 2020, we explicitly proposed the concept, core scenarios, and reference technology architecture of "cloud native AI" in order to provide a concrete definition, a feasible roadmap, and best practices for this new field.
Alibaba Cloud Container Service ACK defines cloud native AI - fully utilizing cloud resource elasticity, heterogeneous computing power, standardized services, as well as cloud native technologies such as containers, automation, and microservices, to provide AI/ML with end-to-end solutions that are efficient, cost-effective, scalable, and replicable.
Core Scenarios
We will focus on two core scenarios in the field of cloud native AI: continuously optimizing the efficiency of heterogeneous resources, and efficiently running heterogeneous workloads such as AI.
Scenario 1: Optimizing Heterogeneous Resource Efficiency
Abstract, manage, operate, and allocate various heterogeneous computing (such as CPU, GPU, NPU, VPU, FPGA, ASIC), storage (OSS, NAS, CPFS, HDFS), and network (TCP, RDMA) resources within Alibaba Cloud IaaS or customer IDC, and continuously improve resource utilization through elasticity and software hardware collaborative optimization.
Scenario 2: Running heterogeneous workloads such as AI
Compatible with mainstream or user owned computing engines and runtimes such as Tensorflow, Python, Horovod, ONNX, Spark, Flink, etc., running various heterogeneous workload processes, managing job lifecycles, scheduling task workflows, and ensuring task scale and performance. On the one hand, continuously improving the cost-effectiveness of running tasks, and on the other hand, continuously improving the development and maintenance experience and engineering efficiency.
Around these two core scenarios, more customized user scenarios can be expanded, such as building MLOps processes that conform to user usage habits; Alternatively, based on the characteristics of CV class (Computer Vision) AI services, mixed scheduling CPUs, GPUs, and VPUs (Video Process Units) can support data processing pipelines at different stages; It can also be extended to support higher-order distributed training frameworks for large model pre training and fine-tuning scenarios, combined with task and data scheduling strategies, as well as elastic and fault-tolerant solutions, to optimize the cost and success rate of large model training tasks.
Reference Architecture
In order to support the core scenario mentioned above, we propose a cloud native AI reference technology architecture.
The cloud native AI reference technology architecture follows the principles of componentization, scalability, and assemblability in a white box design. It exposes interfaces using Kubernetes standard objects and APIs, supporting developers and maintenance personnel to select any component as needed for assembly and secondary development, and quickly customizing to build their own AI platform.
The reference architecture is based on the Kubernetes container service, encapsulating the unified management of various heterogeneous resources downwards, and providing standard Kubernetes cluster environment and API upwards to run various core components, achieving resource operation and maintenance management, AI task scheduling and elastic scaling, data access acceleration, workflow orchestration, big data service integration, AI job lifecycle management, various AI product management, unified operation and maintenance, and other services. Further targeting the main links in AI production processes (MLOps), supporting AI dataset management, AI model development, training, evaluation, and model inference services; And through unified command-line tools, multilingual SDKs, and console interfaces, it supports users to directly use or customize development. And through the same components and tools, it can also support the integration of cloud based AI services, open source AI frameworks, and third-party AI capabilities.
What cloud native AI capabilities do enterprises need
We have defined the concept, core scenarios, design principles, and reference architecture of cloud native AI. Next, what kind of cloud native AI products do you need to provide to users? By conducting user research on AI tasks running on ACK, observing communities such as cloud native, MLOps, and MLSys, and analyzing some containerized AI products in the industry, we have summarized the characteristics and key capabilities that a cloud native AI product needs to possess.
High efficiency
It mainly includes three dimensions: efficiency of heterogeneous resource utilization, efficiency of AI job operation and management, efficiency of tenant sharing and team collaboration
Good compatibility
Compatible with common AI computing engines and models. Support various storage services and provide universal data access performance optimization capabilities. It can be integrated with various big data services and easily integrated by business applications. In particular, it is necessary to comply with the requirements of data security and privacy protection.
• Scalable
The product architecture achieves scalability, assembly, and reproducibility. Provide standard APIs and tools for easy use, integration, and secondary development. Under the same architecture, the product implementation should strive to adapt to various environments such as public cloud, proprietary cloud, hybrid cloud, and edge for delivery.
Alibaba Cloud Container Service ACK Cloud Native AI Suite
Based on the cloud native AI reference architecture and targeting the core scenarios and user needs mentioned above, the Alibaba Cloud Container Service team officially released the ACK cloud native AI suite product in 2021 [5], which has been launched in 27 regions worldwide for public testing, helping Kubernetes users quickly customize and build their own cloud native AI platform.
The ACK cloud native AI suite is mainly aimed at AI platform development and operation teams, helping them quickly and low-cost manage AI infrastructure locally. Simultaneously encapsulate the functions of each component into command line and console tools for direct use by data scientists and algorithm engineers. After users install the cloud native AI suite in the ACK cluster, all components are ready to use out of the box, quickly achieving unified management of heterogeneous resources and low threshold AI tasks. For product installation and usage, please refer to the product documentation [6].
architecture design
The functions of the cloud native AI suite are divided from bottom to top:
1. Heterogeneous resource layer, including heterogeneous computing power optimization and heterogeneous storage access
2. AI scheduling layer, including optimization of various scheduling strategies
3. AI elastic layer, including elastic AI training tasks and elastic AI inference services
4. AI data acceleration and orchestration layer, including dataset management, distributed cache acceleration, and big data service integration
5. AI job management layer, including AI job lifecycle management services and toolsets
6. AI operation and maintenance layer, including monitoring, logging, elastic scaling, fault diagnosis, and multi tenant
7. AI product warehouse, including AI container images, models, and AI experimental records
The following will briefly introduce the main capabilities and basic architecture design of the heterogeneous resource layer, AI scheduling layer, AI elastic task layer, AI data acceleration and orchestration layer, and AI job management layer in the cloud native AI suite product.
1. Unified management of heterogeneous resources
The cloud native AI suite has added support for various heterogeneous resources such as Nvidia GPU, optical 800 NPU, FPGA, VPU, RDMA high-performance network, etc. on ACK, covering almost all device types on Alibaba Cloud. This makes using these resources in Kubernetes as simple as using CPU and memory, as simply declaring the required number of resources in task parameters. For expensive resources such as GPU and NPU, various resource utilization optimization methods are also provided.
GPU monitoring
We provide multi-dimensional monitoring capabilities for GPUs, making their allocation, usage, and health status clear at a glance. Automatically detect and alert GPU device anomalies through the built-in Kubernetes NPD (Node Problem Detector).
GPU elastic scaling
Combining the various elastic scaling capabilities provided by the ACK elastic node pool, GPUs can automatically scale as needed at both the number of resource nodes and the number of running task instances. The conditions that trigger resilience include both GPU resource usage indicators and user-defined indicators. The expansion node type supports regular EGS instances, as well as ECI instances, Spot instances, and so on.
GPU shared scheduling
Native Kubernetes only support scheduling based on the granularity of the entire GPU card, and the task container will monopolize the GPU card. If the task model is relatively small (computational complexity, graphics memory usage), it will cause idle and wasted GPU card resources. The cloud native AI suite provides GPU shared scheduling capabilities, allowing for the allocation of a GPU card to multiple task containers for shared use based on the model's GPU computing power and graphics storage requirements. In theory, this way, more tasks can be used to maximize the utilization of GPU resources.
At the same time, in order to avoid mutual interference between multiple containers sharing GPUs, Alibaba Cloud cGPU technology is also integrated to ensure the security and stability of GPU sharing. The use of GPU resources between containers is isolated from each other. There will be no overuse or mutual impact when errors occur.
GPU shared scheduling also supports multiple allocation strategies, including single container single GPU card sharing, commonly used to support model inference scenarios; Single container multi GPU card sharing, commonly used to debug distributed model training in the development environment; The GPU card Binpack/Spread strategy can balance the allocation density and availability of GPU cards. Similar fine-grained shared scheduling capabilities are also applicable to Alibaba's self-developed 800 AI inference chip in ACK clusters.
The improvement of GPU resource utilization through GPU shared scheduling and cGPU isolation is evident, and there have been many user cases. For example, a certain customer has deployed hundreds of different models of GPU instances in a unified pooling based on the visual memory dimension, utilizing GPU shared scheduling and automatic elastic scaling of visual memory, and deploying hundreds of AI models. It not only significantly improves resource utilization, but also significantly reduces operational complexity.
GPU topology aware scheduling
With the increasing complexity of deep learning models and the amount of training data, distributed training with multiple GPU cards has become very common. For algorithms that use gradient descent optimization, frequent data transmission is required between multiple GPU cards, and data communication bandwidth often becomes a bottleneck limiting GPU computing performance.
The GPU topology aware scheduling function obtains the topology structure of all GPU cards on the computing node. When selecting GPU resources for multi card training tasks, the scheduler considers all interconnection links such as NVLINK, PCIe Switch, QPI, and RDMA network cards based on topology information, and automatically selects the GPU card combination that can provide the maximum communication bandwidth to avoid bandwidth limitations affecting GPU computing efficiency. Enabling GPU topology aware scheduling is completely configurable and has zero intrusion on both the model and training task code.
By comparing the performance of scheduling and running multi card GPU distributed training tasks with ordinary Kubernetes, it can be found that GPU topology aware scheduling can bring efficiency improvement at zero cost for the same computing engine (such as Tensorflow, Python, etc.) and model tasks (Resnet50, VGG16, etc.). Compared to the smaller Resnet50 model, due to the large amount of data transmission introduced during the VGG16 model training process, a better GPU interconnect bandwidth will have a more significant impact on training performance improvement.
2. AI Task Scheduling (Cybernetes)
AI distributed training is a typical type of batch task. A training task (Job) submitted by a user is usually completed by a group of subtasks running together at runtime. Different AI training frameworks have different strategies for grouping subtasks. For example, Tensorflow training tasks can use grouping of multiple Parameter Servers subtasks and multiple Workers subtasks. When Horovod runs Ring all reduce training, it uses a strategy of one Launcher and multiple Workers.
In Kubernetes, a subtask typically corresponds to a Pod container process. Scheduling a training task is transformed into a joint scheduling of one or more sub task Pod containers. This is similar to Hadoop Yarn's scheduling of MapReduce and Spark batch jobs. The default scheduler for Kubernetes can only use a single Pod container as the scheduling unit, which cannot meet the needs of batch task scheduling.
The ACK scheduler (Cybernetes [7]) extends the Kubernetes native scheduling framework and implements many typical batch scheduling strategies, including Gang Scheduling (Coscheduling), FIFO scheduling, Capacity Scheduling, Fair sharing, Binpack/Spread, etc. A new priority task queue has been added, supporting customized task priority management and tenant elastic resource quota control; It can also integrate Kubelow Pipelines or Argo cloud native workflow engines to provide workflow orchestration services for complex AI tasks. The architecture composed of "task queue+task scheduler+workflow+task controller" provides basic support for building batch task systems based on Kubernetes, while also significantly improving the overall resource utilization of the cluster.
At present, the cloud native AI suite environment can not only support various deep learning AI task training such as TensorFlow, Python, Horovod, but also support big data tasks such as Spark and Flink, as well as MPI high-performance computing jobs. We have also contributed some batch task scheduling capabilities to the Kubernetes scheduler upstream open-source project and continue to drive the community's evolution towards cloud native task system infrastructure.
• Gang Scheduling
Resources are allocated to a task as a whole only when the cluster can satisfy all of its subtasks; otherwise, no resources are allocated to it at all. This avoids the resource deadlocks that arise when partially scheduled tasks hold resources without running and crowd out other tasks (see the sketch after this list).
• Capacity Scheduling
By providing elastic quotas and tenant queues, the minimum resource requirements of each tenant are guaranteed, while elastic quota sharing improves overall resource utilization.
• Binpack Scheduling
Tasks are preferentially packed onto a single node; when that node's resources are insufficient, they are placed onto the next node in turn. This suits single-machine multi-GPU training tasks, avoids cross-machine data transmission, and prevents the resource fragmentation caused by large numbers of small tasks.
• Resource Reservation Scheduling
Resources are reserved for specific jobs and then allocated or reused in a targeted manner. This not only guarantees that specific tasks obtain resources deterministically, but also speeds up resource allocation, and it supports user-defined reservation policies.
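As a hedged sketch of the Gang Scheduling strategy above, the Python snippet below emits a PodGroup plus one of its worker Pods, following the upstream Kubernetes scheduler-plugins coscheduling convention (a PodGroup CRD and a pod-group label). The API group, label key, and names are assumptions and may differ from the exact objects used by the ACK scheduler.

import yaml

# PodGroup declaring that 4 workers must be schedulable together;
# otherwise none of them is started (all-or-nothing).
pod_group = {
    "apiVersion": "scheduling.x-k8s.io/v1alpha1",   # assumed upstream API group
    "kind": "PodGroup",
    "metadata": {"name": "tf-demo"},
    "spec": {"minMember": 4},
}

# One worker Pod associated with the group via a label (assumed key).
worker = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "tf-demo-worker-0",
        "labels": {"scheduling.x-k8s.io/pod-group": "tf-demo"},
    },
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "worker",
            "image": "tensorflow/tensorflow:latest-gpu",
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}
print(yaml.safe_dump_all([pod_group, worker], sort_keys=False))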
3. Elastic AI Tasks
Elasticity is one of the most important fundamental capabilities of cloud native and Kubernetes container clusters. Taking resource conditions, application availability, and cost constraints into account, intelligently scaling the number of service instances and cluster nodes should both preserve the application's basic quality of service and avoid wasting cloud resources. The cloud native AI suite supports both elastic model training tasks and elastic model inference services.
Elastic Training
Elastic training supports elastic scheduling of distributed deep learning training tasks. During training, the number of worker instances and nodes can be scaled dynamically while the overall training progress and model accuracy are preserved. When cluster resources are idle, adding workers accelerates training; when resources are tight, some workers can be released so that training still makes basic progress. This greatly improves overall cluster utilization, mitigates the impact of compute node failures, and significantly reduces the time users wait for jobs to start after submission.
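One concrete mechanism for this kind of elasticity is Horovod Elastic, one of the frameworks the suite supports. The sketch below shows the general pattern with PyTorch and synthetic data; the model, dataset, and hyperparameters are placeholders rather than anything prescribed by the suite.

import torch
import horovod.torch as hvd

hvd.init()
model = torch.nn.Linear(10, 2)
optimizer = hvd.DistributedOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.01),
    named_parameters=model.named_parameters())

# When workers join or leave, Horovod re-forms the communication ring and
# resumes every worker from the last committed state.
@hvd.elastic.run
def train(state):
    for state.epoch in range(state.epoch, 10):
        data = torch.randn(32, 10)                 # placeholder batch
        target = torch.randint(0, 2, (32,))
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()
        state.commit()                             # checkpoint consistent state

state = hvd.elastic.TorchState(model, optimizer, epoch=0)
train(state)

Such a script is typically launched with horovodrun's elastic options (a minimum and maximum worker count plus a host discovery script), while the scheduler decides when workers are actually added or removed.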
It is worth noting that, beyond resource elasticity, elastic deep learning training also requires support from the computing engine and the distributed communication framework, as well as coordinated tuning of the algorithm, the data partitioning strategy, the training optimizer, and other aspects (for example, finding the right way to adapt the global batch size and the learning rate) in order to meet the accuracy targets and performance requirements of model training. There is currently no universal method in the industry that delivers elastic training with reliably reproducible results on arbitrary models. Under certain constraints (model characteristics, distributed training method, resource scale, elasticity amplitude, and so on), such as ResNet-50 or BERT trained with PyTorch or TensorFlow on up to 32 GPU cards, there are stable methods that achieve satisfactory elastic training benefits.
Elastic Inference
By leveraging Alibaba Cloud's rich resource elasticity capabilities, AI inference services can be automatically scaled out and in by making full use of resource types and elasticity modes such as no-recovery-on-downtime instances, elastic resource pools, scheduled elasticity, spot (bidding) instance elasticity, and Spotlet mode, maintaining an optimized balance between service performance and cost.
In fact, the elasticity and service-oriented operation of online AI model inference services are quite similar to those of microservices and web applications, and many existing cloud native technologies can be applied to inference services directly. However, AI model inference still has many unique aspects, such as model optimization methods, serving pipelines, fine-grained scheduling and orchestration, and adaptation to heterogeneous runtime environments, each with specialized requirements and techniques that we will not expand on here.
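For example, a standard HorizontalPodAutoscaler is often enough for basic inference elasticity. The sketch below emits an autoscaling/v2 HPA for a hypothetical "resnet50-serving" Deployment scaled on CPU utilization; GPU-utilization or QPS-based scaling would additionally require a custom or external metrics adapter, which is not shown.

import yaml

# Scale the inference Deployment between 2 and 20 replicas, targeting
# 60% average CPU utilization across its Pods.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "resnet50-serving-hpa"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "resnet50-serving",            # hypothetical service
        },
        "minReplicas": 2,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 60},
            },
        }],
    },
}
print(yaml.safe_dump(hpa, sort_keys=False))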
4. AI Data Orchestration and Acceleration (Fluid)
Running AI, big data, and other tasks on the cloud with a cloud native architecture brings the benefits of elastic computing resources, but the separation of compute and storage also introduces challenges such as data access latency and high bandwidth overhead for remote data. In GPU deep learning training scenarios in particular, repeatedly reading large amounts of training data from remote storage can seriously reduce GPU computing efficiency.
On the other hand, Kubernetes only provides a standard interface (CSI) for accessing and managing heterogeneous storage services, without defining how applications use and manage data in container clusters. When running training tasks, data scientists need to manage dataset versions, control access permissions, preprocess datasets, and accelerate reads from heterogeneous data sources. There is no standard solution for this in Kubernetes, and it is one of the important capabilities missing from the cloud native container community.
The ACK cloud native AI suite abstracts the process by which computing tasks use data, proposes the concept of an elastic dataset, and implements it as a "first-class citizen" in Kubernetes. Around the elastic dataset, ACK built the data orchestration and acceleration system Fluid, which implements capabilities such as dataset management (CRUD operations), permission control, and access acceleration.
Fluid can configure a caching service for each dataset. During training, data is automatically cached close to the computing task for the next round of iterative computation, and new training tasks can be scheduled onto compute nodes that already hold the dataset's cached data. In addition, dataset warm-up, cache capacity monitoring, and elastic scaling of the cache greatly reduce the cost of pulling data remotely. Fluid can aggregate multiple different storage services (such as OSS and HDFS) as data sources of the same dataset, and can also access storage services in different locations to provide data management and access acceleration in hybrid cloud environments.
Algorithm engineers use a Fluid Dataset in AI task containers much as they would use a PVC (Persistent Volume Claim) in Kubernetes: they only need to specify the Dataset name in the task description file and mount it as a volume at the path from which the task reads data. Ideally, adding GPU resources to a training task should accelerate training almost linearly; in practice, however, the bandwidth of many GPUs concurrently pulling data from the same OSS bucket often becomes the limit, and simply adding GPUs does not effectively speed up training. The distributed caching capability of Fluid Dataset removes this bottleneck of concurrent remote data access, giving distributed training a better speed-up ratio as GPUs are added.
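The sketch below illustrates this usage pattern with Fluid's data.fluid.io/v1alpha1 API: a Dataset backed by an assumed OSS bucket plus an AlluxioRuntime cache. The bucket path, cache sizes, and replica count are illustrative assumptions, and exact field names should be checked against the Fluid version in use.

import yaml

# Dataset pointing at a (hypothetical) OSS bucket of training data.
dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "imagenet"},
    "spec": {"mounts": [{
        "name": "imagenet",
        "mountPoint": "oss://example-bucket/imagenet/",   # placeholder bucket
    }]},
}

# AlluxioRuntime with the same name provides the distributed cache,
# here two cache workers with a 20 GiB memory-tier cache each.
runtime = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "AlluxioRuntime",
    "metadata": {"name": "imagenet"},
    "spec": {
        "replicas": 2,
        "tieredstore": {"levels": [{
            "mediumtype": "MEM",
            "path": "/dev/shm",
            "quota": "20Gi",
        }]},
    },
}
print(yaml.safe_dump_all([dataset, runtime], sort_keys=False))

Once the Dataset is bound, Fluid exposes a PVC with the same name ("imagenet"), which the training task mounts as a volume exactly as described above.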
The ACK team, together with Nanjing University and Alluxio, jointly initiated the Fluid open-source project [8], which is hosted as a CNCF sandbox project. We hope to work with the community to promote the development and adoption of cloud native data orchestration and acceleration. Fluid's architecture can be extended to support multiple distributed caching services through its CacheRuntime plugin mechanism; it currently supports caching engines such as Alibaba Cloud EMR's JindoFS, the open-source Alluxio, and JuiceFS. Good progress has been made both in co-building the open-source community and in adoption by cloud users.
The main capabilities of Fluid's cloud native data orchestration and acceleration include:
• Accelerating computing access to data by bringing data and computation together through data affinity scheduling and distributed caching engine acceleration.
• Managing data independently of storage and isolating resources through Kubernetes namespaces to achieve secure data isolation.
• Combining data from different storage systems for computation, which offers an opportunity to break down the data silos caused by the differences between those systems.
5. AI Job Lifecycle Management (Arena)
All components provided by the ACK cloud native AI suite are delivered to AI platform developers and operations personnel as standard Kubernetes interfaces (CRDs) and APIs, which makes it very convenient to build a cloud native AI platform on top of Kubernetes.
But for the data scientists and algorithm engineers who develop and train AI models, Kubernetes syntax and operations are a burden. They are more accustomed to debugging code in environments such as Jupyter Notebook and to submitting and managing training tasks through a command line or web interface. Logging, monitoring, storage access, GPU resource allocation, and cluster maintenance during task execution should all be built-in capabilities that can be operated easily with tools.
The production process of AI services mainly includes stages such as data preparation and management, model development and construction, model training, model evaluation, and the online operation and maintenance of model inference services. These stages iterate in a cycle: the model is continuously optimized and updated based on changes in online data, published online through inference services, new data is collected, and the next iteration begins.
The cloud native AI suite abstracts the main steps of this production process and manages them through the command-line tool Arena [9]. Arena completely hides the complexity of underlying resource and environment management, task scheduling, GPU allocation, and monitoring, and it is compatible with mainstream AI frameworks and tools, including TensorFlow, PyTorch, Horovod, Spark, JupyterLab, TF Serving, Triton, and more.
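As a hedged illustration of that workflow, the Python snippet below shells out to the arena CLI to submit a small distributed TensorFlow job and then query its status and logs. The subcommand and flag names follow Arena's public examples but should be verified against arena --help for the installed version, and the image and training command are placeholders.

import subprocess

job_name = "resnet50-demo"

# Submit a 2-worker TensorFlow training job with one GPU per worker.
subprocess.run([
    "arena", "submit", "tf",
    f"--name={job_name}",
    "--workers=2",
    "--gpus=1",
    "--image=tensorflow/tensorflow:latest-gpu",
    "python train.py",                 # training command inside the image
], check=True)

# Check job status and stream its logs with the same tool.
subprocess.run(["arena", "get", job_name], check=True)
subprocess.run(["arena", "logs", job_name], check=True)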
In addition to the command-line tool, Arena also provides Golang, Java, and Python SDKs, making it easy for users to build their own tooling on top of it. Considering that many data scientists are accustomed to managing resources and tasks through a web interface, the cloud native AI suite also provides a simple operation and maintenance platform and a development console, which meet the need to quickly browse cluster status and submit training tasks. Together, these components form Arena, the cloud native AI job management toolset.
Kubeflow [10] is a mainstream open-source project in the Kubernetes community that supports machine learning workloads. Arena has built-in support for Kubeflow's TFJob, PyTorchJob, and MPIJob task controllers, as well as the open-source SparkApplication task controller, and it can also integrate with the KFServing and Kubeflow Pipelines projects. Users do not need to install Kubeflow separately in an ACK cluster; they only need to use Arena.
It is worth mentioning that the ACK team contributed the Arena command line and SDK tools to the Kubeflow community at an early stage. Many enterprise users have built their own AI platform MLOps processes and experiences by extending and encapsulating Arena.
The cloud native AI suite mainly solves problems for the following two types of users:
• AI developers, such as data scientists and algorithm engineers
• AI platform operation and maintenance personnel, such as AI cluster administrators
AI developers can use the arena command line or the development console web interface to create Jupyter Notebook development environments, submit and manage model training tasks, query GPU allocation, view training task logs, monitoring, and visualization data in real time, compare and evaluate different model versions, and deploy selected models to the cluster. They can also perform A/B testing, canary releases, traffic control, and elastic scaling.
Outlook
The demand for digitization and intelligence in the wave of enterprise IT transformation is becoming increasingly strong, and the most prominent need is to quickly and accurately mine new business opportunities and model innovations from massive business data in order to cope with ever-changing and uncertain market challenges. AI is undoubtedly one of the most important means of helping enterprises achieve this goal, now and in the future. Clearly, continuously improving the efficiency of AI engineering, optimizing AI production costs, lowering the barrier to AI adoption, and making AI capabilities broadly accessible has great practical value and social significance.
We clearly see that the advantages of cloud native in distributed architecture, standardized APIs, and ecosystem richness are rapidly making it a new interface through which users consume cloud computing efficiently, helping businesses improve service agility and scalability. More and more users are actively exploring how to improve the production efficiency of AI services with cloud native container technology, and how to support the various heterogeneous workloads within the enterprise with the same technology stack.
Alibaba Cloud ACK provides the cloud native AI suite, which supports the AI production and operations process from the underlying heterogeneous computing resources, AI task scheduling, and AI data acceleration up to compatibility with upper-layer computing engines and AI job lifecycle management. With a simple user experience, an extensible architecture, and a unified, optimized product implementation, it accelerates users in building customized cloud native AI platforms. We will also share some of our achievements with the community to jointly promote the development and implementation of the cloud native AI field.