Apsara Conference 2022 | AI Cloud-Native Middle-Office Practice
Date: Oct 1, 2022
Foreword
The arrival of the AI era poses greater challenges to the abundance and agility of enterprises' underlying IT resources. Building on Alibaba Cloud's stable and elastic GPU cloud servers, its leading GPU container sharing and isolation technology, and its K8S cluster management platform, TAL achieves flexible resource scheduling through a cloud-native architecture, laying a solid technical foundation for an agile and efficient AI middle office.
At the 2020 Yunqi Conference, Liu Dongdong, head of TAL's AI middle office, shared his understanding of cloud-native AI and TAL's AI middle-office practice. This article is organized from the content of that speech.
Hello everyone, I'm Liu Dongdong, technical director of TAL's AI Middle Office. The topic I bring to you today is "A Brief Talk on TAL's Cloud Native AI".
My sharing is mainly divided into four parts:
First, the challenges that AI services pose to cloud native.
Second, AI and cloud-native service deployment.
Third, AI and cloud-native service governance.
Finally, I want to talk about the organic combination of K8S and Spring Cloud.
Challenges of AI Services to Cloud Native
First, let's talk about the challenges that AI services pose to cloud native. In the cloud-native era, one of the most prominent characteristics of AI services is that they demand far more computing power and far higher service stability.
Our services are no longer single, standalone services; they have evolved into cluster services. At the same time, the stability requirement has been pushed from three nines (99.9%) to five nines (99.999%).
These problems can no longer be solved by the original, traditional technical architecture, so we need a new one.
What is this new technical architecture? It's cloud native.
Let's take a look at the changes cloud native has brought us. I summarize the biggest changes into four main points and two aspects.
The four main points are the four characteristics of DevOps, continuous delivery, microservices, and containers. The two aspects are service deployment and service governance. Of course, cloud native is also summarized systematically by the twelve-factor methodology.
Today's focus is on service deployment and service governance.
Under the cloud-native wave, how do we deal with service deployment and service governance?
First, we combine AI with cloud-native service deployment: through K8S plus technologies such as resource virtualization and resource pooling, we meet the order-of-magnitude growth in AI services' demand for all kinds of hardware resources.
Second, we combine AI services organically with cloud-native service governance: through service-governance technologies such as service discovery, HPA, and load balancing, we meet the five-nines SLA requirement of AI services.
Cloud-native deployment of AI services
The first point is to talk about how to combine AI with cloud-native service deployment.
First, let's take a look at the characteristics of service deployment in the AI era.
The first is the contradiction between hardware demand and cost: the hardware demand of AI services has grown by orders of magnitude, while hardware budgets have not.
Second, the hardware requirements of AI services are diverse: high GPU, high CPU, high memory, and even mixed requirements.
Third, AI services require resource isolation, so that each AI service can use its resources independently without interfering with the others.
Fourth, AI services require resource pooling. An AI service should not need to know the specific configuration of the machine it runs on; once all resources are pooled, resource fragmentation is reduced and utilization improves.
Finally, AI services need burst resources. Because traffic is unpredictable, an enterprise needs to retain the ability to expand the resource pool at any time.
What is our solution?
First, we use Docker's container-based virtualization to isolate resources.
Then we use GPU-sharing technology to pool GPU, memory, and CPU resources, and manage the whole pool uniformly.
Finally, we use K8S features such as taints and tolerations to configure services flexibly (see the manifest sketch below).
In addition, it is recommended to buy some machines with high specifications, mainly to further reduce fragmentation.
Of course, it is also necessary to monitor the hardware of the entire cluster and to make full use of ECS's time-rule scheduling features (the cron in the figure below is a time-based job scheduler) to cope with peak traffic.
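To make the taint/toleration and resource-request configuration above concrete, here is a minimal K8S manifest sketch. It is illustrative only: the namespace, image, taint key, and resource figures are assumptions rather than TAL's actual settings, and the GPU request assumes the standard NVIDIA device plugin (a GPU-sharing scheduler would instead expose a fractional resource such as GPU memory).

```yaml
# Hypothetical Deployment for an OCR inference service; all values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-inference
  namespace: ai-platform            # assumed namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ocr-inference
  template:
    metadata:
      labels:
        app: ocr-inference
    spec:
      # Matches a hypothetical taint on the GPU node pool:
      #   kubectl taint nodes <node> gpu-pool=true:NoSchedule
      tolerations:
      - key: "gpu-pool"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: ocr
        image: registry.example.com/ai/ocr:latest   # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
            nvidia.com/gpu: 1       # whole-GPU request via the NVIDIA device plugin
```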
Next, let's take a closer look at how TAL's AI middle office solves these AI deployment problems.
This page is the node-level service management view. Through it we can clearly see the deployment situation on each server, including resource usage, which pods are deployed on which nodes, and so on.
The second is the service deployment page of the AI middle office. Through configuration files we can precisely control the memory, CPU, and GPU usage of each pod, and with techniques such as taints we can satisfy diversified server deployment needs.
According to our comparative experiments, deploying in the cloud-native way saves about 65% of the cost compared with users deploying on their own, and as the AI cluster grows, the advantage becomes even larger, both in economic benefit and in handling temporary traffic expansion.
AI and Cloud Native Service Governance
Next, we will discuss AI and cloud-native service governance.
Let me briefly introduce what a microservice is. A microservice is essentially an architectural style: a single application is developed as a set of small services, each running in its own process and communicating through lightweight mechanisms such as HTTP APIs.
These services are built around the business itself and can be managed centrally through automated deployment and similar means. They can be written in different languages and use different storage resources.
In summary, what are the characteristics of microservices?
First, microservices are small enough that each one does only one thing.
Second, microservices are stateless.
Third, microservices are independent of each other, and they are interface-oriented.
Finally, microservices are highly autonomous; each is responsible only for itself.
Having seen these characteristics of microservices, let's compare them with the characteristics of AI services. We found that AI services are naturally suited to microservices. Each AI service essentially does only one thing: for example, the OCR service only provides OCR, and the ASR service mainly provides ASR.
Moreover, each AI service request is independent. As a simple example, one OCR request is essentially unrelated to another OCR request.
AI services also have an inherent demand for horizontal scaling. Why? Because AI services' appetite for resources is very large, this kind of scaling is very necessary.
Dependencies between AI services are also particularly small. For example, our OCR service has little need for NLP services or other AI services.
Finally, all AI services can expose their capabilities through declarative HTTP APIs.
Taking a closer look at AI services, however, you will find that not all of them can be turned into microservices directly. So what did we do?
First, AI services need to be made stateless. These stateless services are treated like cattle rather than pets: they are disposable and do not store anything on local disk or in process memory. This allows a service to be deployed on any node, anywhere.
Of course, not every service can be stateless. What if a service has state? We store the request state in components such as the configuration center, the log center, Redis, MQ, and SQL databases, and we ensure the high reliability of those components.
This is the overall architecture diagram of TAL's AI middle-office PaaS. The outermost layer is the service interface layer, which exposes AI capabilities externally.
The most important part of the platform layer is the service gateway, which is mainly responsible for dynamic routing, traffic control, load balancing, and authentication. Below that are functions such as service discovery, the registry, fault tolerance, configuration management, and elastic scaling.
Below that is the business layer, which consists of what we call the AI inference services.
At the bottom is the K8S cluster provided by Alibaba Cloud.
In other words, in the overall architecture, K8S is responsible for service deployment and Spring Cloud is responsible for service governance.
How do we realize the overall architecture just described with concrete technical components?
First, we use Eureka as the registry to implement service discovery and registration for the distributed system. Server configuration properties are managed through the Apollo configuration center, with dynamic updates supported. The gateway isolates the inner and outer layers. Circuit breaking is done with Hystrix, mainly divided into time-based and count-based breakers, which protects our services from being blocked.
Load balancing with Feign balances the overall traffic and consumes the registration information from Eureka. The Kafka message bus is our component for asynchronous processing. Authentication is done with OAuth2 plus RBAC, which covers user login and interface-level authorization management and ensures safety and reliability.
For link tracing we use SkyWalking. Through this APM architecture we can track the status of each request, which makes it easy to locate problems and raise alerts.
Finally, the log system collects logs from the entire cluster in a distributed manner through Filebeat plus Elasticsearch (ES).
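As a rough illustration of how the Eureka, Feign, and Hystrix pieces of this stack fit together on the client side, the following application.yml sketch uses standard Spring Cloud Netflix properties. The service name, registry URL, timeouts, and thresholds are assumptions made for illustration, not TAL's actual settings.

```yaml
spring:
  application:
    name: ocr-service                         # hypothetical service name
eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka.example.internal:8761/eureka/   # assumed registry address
  instance:
    prefer-ip-address: true
feign:
  hystrix:
    enabled: true                             # wrap Feign calls in Hystrix circuit breakers
hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 2000       # time-based breaking: fail slow calls fast
      circuitBreaker:
        requestVolumeThreshold: 20            # count-based breaking: minimum calls in the window
        errorThresholdPercentage: 50          # open the circuit above this error rate
ribbon:
  ConnectTimeout: 500
  ReadTimeout: 2000
```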
At the same time, we have also developed some services of our own, such as the deployment service and the control service, which are mainly responsible for communicating with K8S and collecting the service deployment information and K8S-related hardware information of the entire cluster.
The alerting system is built with Prometheus plus a monitor service; it collects hardware data and is responsible for resource, business, and other related alerts.
The data service is mainly used for data download and data return, and for capturing data in our inference scenarios.
The throttling service limits each customer's requests and QPS.
HPA is actually the most important part. Our HPA supports not only memory-level and CPU-level scaling, but also rules based on P99 latency, QPS, GPU usage, and so on.
Last is the statistics service, which is mainly used to count calls, such as request volumes.
Through a unified console we provide a one-stop solution for AI developers, solve all service-governance problems on a single platform, and improve the automation of operations: an AI service that used to take several people to maintain has become one person maintaining more than a dozen AI services.
This page displays the configuration pages related to service routing, load balancing, and rate limiting.
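As a hedged sketch of what such routing and rate-limiting configuration might look like, assuming the gateway is Spring Cloud Gateway (the article does not name the exact gateway implementation), the route id, path, and limits below are illustrative assumptions:

```yaml
spring:
  cloud:
    gateway:
      routes:
      - id: ocr-route                           # hypothetical route
        uri: lb://ocr-service                   # "lb://" resolves the target through the Eureka registry
        predicates:
        - Path=/api/ocr/**
        filters:
        - StripPrefix=1
        - name: RequestRateLimiter              # Redis-backed limiter; also requires a KeyResolver bean
          args:
            redis-rate-limiter.replenishRate: 50    # steady-state requests per second (illustrative)
            redis-rate-limiter.burstCapacity: 100   # burst allowance (illustrative)
```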
This page shows some of our interface-level alerts, as well as hardware alerts at the deployment level.
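For hardware alerts of this kind, a Prometheus alerting rule might look like the sketch below. It assumes NVIDIA's dcgm-exporter is running on the GPU nodes; the rule group name, threshold, and labels are illustrative assumptions rather than TAL's actual rules.

```yaml
groups:
- name: ai-platform-gpu                        # hypothetical rule group
  rules:
  - alert: GpuUtilizationHigh
    expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL) > 90   # dcgm-exporter GPU utilization metric
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU utilization above 90% for 10 minutes on {{ $labels.instance }}"
```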
This is log retrieval, including the real-time log functions.
This is the operation page for manual scaling and auto scaling. Auto scaling includes HPA at the CPU and memory levels, as well as HPA based on response time and scheduled (timed) scaling.
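A minimal sketch of such an autoscaling rule in K8S terms is shown below, assuming the autoscaling/v2 API. CPU and memory targets work out of the box, whereas a P99-latency or QPS rule (shown here as a Pods metric) requires a custom-metrics adapter such as the Prometheus adapter, and scheduled scaling would be handled by a separate CronHPA-style controller. All names and numbers are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ocr-inference-hpa                      # hypothetical name
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ocr-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70                 # scale out when average CPU passes 70%
  - type: Pods
    pods:
      metric:
        name: http_request_p99_latency_ms      # assumed custom metric exposed via a metrics adapter
      target:
        type: AverageValue
        averageValue: "200"                    # keep per-pod P99 latency around 200 ms
```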
The organic combination of K8S and Spring Cloud
Finally, let's talk about the organic combination of K8S and Spring Cloud.
Take a look at these two diagrams. The one on the left shows how Spring Cloud goes from the registry to routing; the one on the right shows how a K8S Service maps to its pods.
The two diagrams are structurally very similar. So how do we do it? We bind the Spring Cloud application to the K8S Service: the load-balancer address registered in Spring Cloud is actually converted into the address of the K8S Service. This is what makes it possible to combine K8S with Spring Cloud at the routing level and achieve the final effect.
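One hedged way to picture this binding: each application registers the DNS name and port of its K8S Service in Eureka instead of a pod IP, so Spring Cloud routing lands on the Service, and the Service in turn balances across the pods. The namespace and service name below are illustrative assumptions:

```yaml
eureka:
  instance:
    hostname: ocr-inference.ai-platform.svc.cluster.local   # K8S Service DNS name (assumed namespace)
    non-secure-port: 80                                      # port exposed by the K8S Service
    prefer-ip-address: false                                 # register the hostname, not the pod IP
```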
Spring Cloud is a Java technology stack, but the languages of AI services are diverse, including C++, Java, and even PHP.
To achieve cross-language support, we introduced sidecar technology: the AI service and the sidecar communicate through RPC, which shields the language differences.
The main functions of the sidecar are application service discovery and registration, routing, link tracing, and health checks.
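The article does not name the exact sidecar implementation; in a Spring Cloud Netflix stack, one common option is the spring-cloud-netflix-sidecar module, whose application class is annotated with @EnableSidecar and which registers a co-located non-JVM process in Eureka and proxies its health check. A minimal sketch, assuming a C++ inference process that listens on port 8000 and exposes an HTTP health endpoint:

```yaml
server:
  port: 5678                                   # port of the sidecar itself
spring:
  application:
    name: ocr-cpp-service                      # name under which the C++ service appears in Eureka (assumed)
sidecar:
  port: 8000                                   # port of the co-located C++ inference process (assumed)
  health-uri: http://localhost:8000/health     # health endpoint the C++ process must expose (assumed)
eureka:
  client:
    serviceUrl:
      defaultZone: http://eureka.example.internal:8761/eureka/   # assumed registry address
```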