TAL AI Middle Platform Practice
First, let's talk about the challenges AI services pose in the cloud-native era. The defining features of AI services today are that they need far more computing power and much stronger service stability.
Our services no longer run as single instances; they have moved to clusters. At the same time, the stability requirement has risen from three nines (99.9%) to five nines (99.999%).
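To make the jump from three nines to five nines concrete, the short sketch below converts an availability target into a yearly downtime budget. The function name is only illustrative, and the year is assumed to have 365 days.

```python
# Yearly downtime budget for an availability target expressed in "nines".
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Return the downtime allowed per year, in minutes, for N nines."""
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001, 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 5):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: about {minutes:.2f} minutes of downtime per year")
```

Three nines leaves roughly 526 minutes (almost nine hours) per year, while five nines leaves only a little over five minutes, which is why manual recovery is no longer an option.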
These problems can no longer be solved by the traditional technical architecture, so we need a new one.
What is this new technical architecture? It is cloud native.
Let's take a look at the changes that cloud native has brought us. I summarize the biggest changes as four main points and two major aspects.
The four main points are DevOps, continuous delivery, microservices, and containers. The two aspects are service deployment and service governance. Of course, there is also the systematic summary known as the twelve factors.
Today's focus is on service deployment and service governance.
Under the wave of cloud native, how do we deal with service deployment and service governance?
First, we combine AI with cloud-native service deployment: K8S, together with resource virtualization, resource pooling, and similar technologies, addresses the order-of-magnitude growth in AI services' demand for hardware resources.
Second, we combine AI services with cloud-native service governance. Service governance technologies, including service discovery, HPA, and load balancing, address the AI services' requirement for a five-nines SLA.
Cloud-native deployment of AI services
The first topic is how to combine AI with cloud-native service deployment.
First, what characterizes service deployment in the AI era?
The first is a tension between hardware demand and cost. The hardware demand of AI services has grown by orders of magnitude, but the hardware budget has not.
Second, AI services have diverse hardware requirements. Some need a lot of GPU, some a lot of CPU, some a lot of memory, and some a mix of these.
Third, AI services require resource isolation. Each AI service should be able to use its resources independently, without interfering with the others.
Fourth, AI services need resource pooling. An AI service should not have to know the specific configuration of a machine. Once all resources are pooled, resource fragmentation goes down and utilization goes up.
Finally, AI services need burst resources. Because traffic is unpredictable, enterprises must be able to expand the resource pool at any time.
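The effect of pooling on fragmentation can be illustrated with a toy comparison. The sketch below is not TAL's scheduler: the service names, GPU counts, and the assumption that every replica needs exactly one GPU are all made up for illustration.

```python
# Toy comparison: dedicated machines per service vs. one shared GPU pool.
# Assumption: every replica needs exactly 1 GPU, so no bin-packing is needed.

def place_dedicated(capacity: dict, demand: dict) -> dict:
    """Each service may only use the GPUs of its own machine."""
    return {svc: min(need, capacity[svc]) for svc, need in demand.items()}


def place_pooled(capacity: dict, demand: dict) -> dict:
    """All GPUs form one pool; services are served in order until it is empty."""
    free = sum(capacity.values())
    placed = {}
    for svc, need in demand.items():
        placed[svc] = min(need, free)
        free -= placed[svc]
    return placed


# Hypothetical numbers: three 4-GPU machines, one per service.
capacity = {"ocr": 4, "asr": 4, "nlp": 4}
demand = {"ocr": 6, "asr": 2, "nlp": 3}

total = sum(capacity.values())
for name, placer in (("dedicated", place_dedicated), ("pooled", place_pooled)):
    placed = placer(capacity, demand)
    used = sum(placed.values())
    unmet = sum(demand.values()) - used
    print(f"{name}: {used}/{total} GPUs used, {unmet} replicas unplaced")
```

With dedicated machines, the OCR service runs out of GPUs while idle GPUs sit on the other two machines; with a pool, the same hardware serves the whole demand.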
What is our solution?
First, we use Docker's virtualization technology to isolate resources.
Then we use GPU sharing technology to pool resources such as GPU, memory, and CPU, and manage all of them in a unified manner.
Finally, we use K8S features such as taints and tolerations to configure services flexibly.
In addition, we recommend buying some high-configuration machines, mainly to further reduce fragmentation.
Of course, it is also necessary to monitor the hardware of the entire cluster and to make full use of ECS's time-rule scheduling features (cron in the figure below is a time-based job scheduling task) to deal with peak traffic.
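As a rough illustration of time-rule scheduling for peak traffic, the sketch below picks a replica count from the hour of day. The schedule, the replica counts, and the function name are hypothetical and are not taken from the ECS or K8S APIs.

```python
from datetime import datetime

# Hypothetical peak windows: (start_hour, end_hour, replicas).
# Hours are in [0, 24) and the end hour is exclusive.
PEAK_RULES = [
    (8, 12, 10),   # morning peak
    (18, 22, 20),  # evening peak
]
BASELINE_REPLICAS = 3  # replica count outside every peak window


def scheduled_replicas(now: datetime) -> int:
    """Return the replica count that the time rules ask for at `now`."""
    for start, end, replicas in PEAK_RULES:
        if start <= now.hour < end:
            return replicas
    return BASELINE_REPLICAS


print(scheduled_replicas(datetime(2024, 1, 1, 19, 30)))  # inside the evening window
print(scheduled_replicas(datetime(2024, 1, 1, 3, 0)))    # outside every window
```

A cron-style job would evaluate such a rule periodically and scale the service ahead of the known peak instead of waiting for load-based metrics to react.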
Next, let's take a closer look at how TAL's AI middle platform solves these AI deployment problems.
This page is our Node service management. Through this page, we can clearly see the deployment status of each server, including resource usage, which pods are deployed, which nodes they run on, and so on.
The second is the service deployment page of the AI platform. We can precisely control the memory, CPU, and GPU usage of each pod through compressed files. Technologies such as taints also let us satisfy diverse server deployment requirements.
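The sketch below shows, in simplified form, how per-pod resource settings and taints interact. The pod is written as a Python dict that mirrors the shape of a K8S pod spec; the names, image, and taint key are hypothetical, and the matching logic is a reduced version of the K8S rules (it ignores `PreferNoSchedule` and toleration timeouts). `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin, assumed here to be installed.

```python
# Hypothetical pod: explicit CPU, memory, and GPU settings plus one toleration.
ocr_pod = {
    "metadata": {"name": "ocr-demo"},
    "spec": {
        "containers": [{
            "name": "ocr",
            "image": "example/ocr:latest",
            "resources": {
                "requests": {"cpu": "2", "memory": "4Gi"},
                "limits": {"cpu": "4", "memory": "8Gi", "nvidia.com/gpu": 1},
            },
        }],
        "tolerations": [
            {"key": "hardware", "operator": "Equal",
             "value": "gpu", "effect": "NoSchedule"},
        ],
    },
}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An "Exists" toleration without a key matches every taint key.
        return toleration.get("key") in (None, taint["key"])
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(pod: dict, node_taints: list) -> bool:
    """A pod fits a node only if every blocking taint is tolerated."""
    tolerations = pod["spec"].get("tolerations", [])
    blocking = [t for t in node_taints
                if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in blocking)


gpu_node_taints = [{"key": "hardware", "value": "gpu", "effect": "NoSchedule"}]
print(can_schedule(ocr_pod, gpu_node_taints))
```

Tainting GPU nodes this way keeps CPU-only services off expensive hardware, while services that declare the toleration can still land there.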
According to our comparative experiments, cloud-native deployment saves about 65% of the cost compared with users deploying on their own. Moreover, the advantages in cost and in temporary traffic expansion grow as the AI cluster grows.
AI and Cloud Native Service Governance
Next, let’s discuss AI and cloud-native service governance.
Briefly, what is a microservice? Microservices are an architectural style: a single application is developed as a set of small services, each running in its own process and communicating through lightweight mechanisms such as HTTP APIs.
These services are built around the business itself and can be managed centrally through automated deployment and other means. They can also be written in different languages and use different storage resources.
In summary, what are the characteristics of microservices?
First, microservices are small enough that each does only one thing.
Second, microservices are stateless.
Third, microservices are independent of each other, and they are interface-oriented.
Finally, microservices are highly autonomous: each is responsible only for itself.
With these characteristics in mind, compare AI services with microservices. We found that AI services are naturally suited to microservices. Each AI service essentially does only one thing: the OCR service only does OCR, and the ASR service only does ASR.
Each AI service request is also independent. To give a simple example, one OCR request is essentially unrelated to another OCR request.
AI services inherently demand horizontal scaling. Why? Because their thirst for resources is very large, so scaling out becomes necessary.
The dependencies between AI services are also extremely small. For example, our OCR service depends very little on NLP or other AI services.
All AI services can expose their AI capabilities through declarative HTTP interfaces or other APIs.
Taking a closer look, however, not every AI service can be turned into a microservice as it stands. So, what did we do?
First, the AI service has to be made stateless. These stateless services are treated like livestock rather than pets: they are stateless and disposable, and they do not use local disk or memory to store request data. In this way, a service can be deployed on any node, anywhere.
Of course, not all services can be stateless. What if a service has state? We store the state of these requests in external components such as the configuration center, the log center, Redis, MQ, and SQL databases, and we ensure that these components are highly reliable.
This is the overall architecture diagram of TAL's AI middle platform PaaS. The outermost layer is the service interface layer, which provides AI capabilities to the outside world.
The most important part of the platform layer is the service gateway, which is mainly responsible for dynamic routing, flow control, load balancing, authentication, and so on. Further down are service discovery, the registration center, fault tolerance, configuration management, elastic scaling, and other functions.
Below that is the business layer, which is what we call the AI inference services.
At the bottom is the K8S cluster provided by Alibaba Cloud.
In other words, in the overall architecture K8S is responsible for service deployment and Spring Cloud is responsible for service governance.
How do we implement this architecture technically?
First, we use Eureka as the registration center to implement service discovery and registration for the distributed system. The configuration properties of the servers are managed through the configuration center, Apollo, which supports dynamic updates. The gateway isolates the inner and outer layers. The circuit breaker, Hystrix, mainly offers time-based and count-based circuit breaking, which protects our services from being blocked.
Load balancing together with Feign balances the overall traffic and consumes the registration information held in Eureka. The message bus, Kafka, is the component for asynchronous processing. Authentication is done through OAuth2+RBAC, which covers user login as well as interface authentication management and ensures safety and reliability.
Link tracking uses SkyWalking. Through this APM architecture, we can track the status of each request, which makes it easy to locate problems and raise alerts for each request.
Finally, the log system collects the logs of the entire cluster in a distributed manner through Filebeat+ES.
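The circuit breaking mentioned above for Hystrix can be sketched generically. The class below is not Hystrix's API or implementation; it is a minimal illustration in which the circuit opens after a number of consecutive failures (the count-based part) and lets a trial request through after a cool-down period (the time-based part). The thresholds and names are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: trips on failure count, recovers after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast so callers do not pile up behind a broken service.
                raise RuntimeError("circuit open: request rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(ocr_client, image)`, where `ocr_client` stands for any hypothetical function that calls the service.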
At the same time, we have also developed some services of our own, such as the deployment service and the Control service. They are mainly responsible for communicating with K8S and for collecting the deployment status of services across the K8S cluster, along with K8S-related hardware information.
The alarm system is built on Prometheus+Monitor, which collects hardware data and handles alarms related to resources and services.
The data service is mainly used for downloads, including data backflow, and for capturing data from our inference scenarios.
The throttle service limits each customer's requests and QPS.
HPA is actually the most important part. It supports not only memory-level and CPU-level scaling, but also rules based on P99, QPS, and GPU.
The last is the statistics service, which mainly counts calls, such as the number of requests.
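Per-customer QPS limiting of the kind the throttle service performs is often implemented with a token bucket. The sketch below is one such approach, not necessarily the one used on the platform; the class names and the customer limits are hypothetical.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity        # start full so a short burst is allowed
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1          # each request consumes one token
            return True
        return False


class Throttle:
    """One bucket per customer; unknown customers are rejected."""

    def __init__(self, qps_limits: dict, clock=time.monotonic):
        self.buckets = {customer: TokenBucket(qps, qps, clock)
                        for customer, qps in qps_limits.items()}

    def allow(self, customer: str) -> bool:
        bucket = self.buckets.get(customer)
        return bucket is not None and bucket.allow()


throttle = Throttle({"customer-a": 100, "customer-b": 5})  # hypothetical limits
print(throttle.allow("customer-a"))
```

Keeping a separate bucket per customer means that one customer exhausting its quota does not affect the others.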
We provide a one-stop solution for AI developers through a unified console, solve all service governance problems on one platform, and improve the automation of operation and maintenance. Where one AI service used to require several people to maintain, one person can now maintain more than a dozen AI services.
This page shows the configuration pages for service routing, load balancing, and rate limiting.
This page shows some of our alarms at the interface level, as well as hardware alarms at the deployment level.
This is log retrieval, including real-time log functions.
This is the manual and automatic scaling operation page. Automatic scaling includes HPA at the CPU and memory levels, HPA based on response time, and scheduled HPA.
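The scaling rule behind metric-based HPA can be sketched as follows. The formula, desired = ceil(current replicas × current metric / target metric), is the one documented for the Kubernetes Horizontal Pod Autoscaler; the function name, the replica bounds, and the sample numbers are assumptions, and how the platform feeds P99 or QPS into it is not described in the text.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return the replica count suggested by the HPA scaling formula."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: avoid scaling back and forth.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical example: P99 latency is 300 ms against a 200 ms target.
print(hpa_desired_replicas(4, current_metric=300, target_metric=200))

# Hypothetical example: 50 QPS per replica against a 100 QPS target.
print(hpa_desired_replicas(4, current_metric=50, target_metric=100))
```

The same function works for any metric where a higher value means more load per replica, which is why response time and QPS can be used alongside CPU and memory.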
The organic combination of K8S and Spring Cloud
Finally, let’s talk about the organic combination of K8S and SpringCloud.
You can take a look at these two pictures. The picture on the left shows the path from our Spring Cloud data center to its routes. On the right is a diagram of a K8S service and its pods.
The two diagrams are structurally very similar. How do we take advantage of that? We bind each Spring Cloud Application to a K8S service: the LB address registered in Spring Cloud is actually the address of the K8S service. In this way, K8S is combined with Spring Cloud at the routing level, and this combination gives us the final effect.
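The binding described above can be pictured with a small in-memory registry. This is only a sketch of the idea, not Eureka's API: the `Registry` class, the service names, and the namespace are hypothetical, and the DNS name assumes the default `cluster.local` cluster domain.

```python
def k8s_service_address(service: str, namespace: str = "default", port: int = 80) -> str:
    """Build the in-cluster DNS address of a K8S Service."""
    return f"{service}.{namespace}.svc.cluster.local:{port}"


class Registry:
    """Stand-in for a registration center: application name -> address."""

    def __init__(self):
        self._apps = {}

    def register(self, app_name: str, address: str) -> None:
        # Names are stored case-insensitively (an illustrative choice).
        self._apps[app_name.upper()] = address

    def resolve(self, app_name: str) -> str:
        return self._apps[app_name.upper()]


registry = Registry()
# Register the K8S Service address instead of individual pod addresses.
registry.register("ocr-service", k8s_service_address("ocr", namespace="ai", port=8080))

# The gateway resolves the application name; K8S then balances across pods.
print(registry.resolve("OCR-SERVICE"))
```

Because only the Service address is registered, pods can come and go without any change on the registry side.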
Spring Cloud is a technology stack for Java, but the languages of AI services are diverse, including C++, Java, and even PHP.
To work across languages, we introduced sidecar technology: the AI service and the sidecar communicate through RPC, which shields the language differences.
The main functions of the sidecar are application service discovery and registration, route tracking, link tracking, and health checks.
Knowledge Base Team