Alibaba Cloud Assists TuSimple in Improving Performance and Model Iteration Acceleration

This article describes how TuSimple handled fluctuating demand for computing power with Elastic GPU Service and Alibaba Cloud’s suite of Kubernetes container services.

About TuSimple

Founded in 2015, TuSimple is an artificial intelligence company focused on the research, development, and application of L4 unmanned truck technology. Its trucks already drive unmanned in trunk logistics scenarios and semi-closed hub scenarios. The company's L4 driverless truck implements the core functions of autonomous driving, such as environment perception, positioning and navigation, and decision control. It can be applied to expressway freight transportation, port container terminal transportation, and similar scenarios.

The company completed its Series D financing round, totaling 215 million dollars, in September 2019. Accumulated financing from UPS, CDH Investments, and Mando China has exceeded 300 million dollars. With a latest market valuation of more than 1.2 billion dollars, TuSimple is a leader in the unmanned truck industry and one of the world's first unmanned truck unicorns.


Pain Points of TuSimple’s Legacy System

Low GPU Utilization

A single self-driving truck generates about 50 TB of data in just two weeks. TuSimple currently has more than 70 trucks on the road, which together generate a large amount of data every day. To make the trucks smarter, the company must accumulate more real-world datasets to train their target detection and object recognition capabilities.

With rapid business development and faster iterations, TuSimple's models are becoming more complex. Each model iteration requires large-scale GPU resources to be scheduled within a short time so that the model can be trained in a distributed manner. However, GPU servers are expensive to procure and complex to operate and maintain, so TuSimple would have to devote more effort to O&M. More importantly, TuSimple found that GPU utilization stayed low even as the number of GPUs in use grew.

Large Fluctuations in Computing Power Needed by Model Training

After each training iteration, TuSimple needs to test the optimized model. Testing on the road every time is costly and risky, and it cannot cover all kinds of extreme situations. Fortunately, TuSimple has a simulation platform that reproduces a variety of conditions, such as sunny, cloudy, rainy, foggy, hazy, and nighttime driving, to test how well the model handles them.

These test tasks follow the development rhythm: they are irregular, temporary, and short-lived, yet they demand computing power on a very large scale. If a large amount of computing power is purchased in subscription mode, it sits idle most of the time, yet it can still fall short when demand peaks. Simulation tasks then have to queue, which lowers developer efficiency and slows model iteration.

Alibaba Cloud’s GPU and Containerization Solutions

In theory, more GPUs deliver more overall computing power. In practice, as the number of machines grows, coordination between GPUs on different machines becomes harder and the utilization of each GPU card drops. As a result, the hardware cost can increase dozens of times while performance fails to scale linearly with it.

To solve this problem, Alibaba Cloud's Apsara AI Accelerator (AIACC) team performed deep optimizations at the underlying layer for TuSimple's scenarios, covering communication, computing, latency, and bandwidth, which improved training performance by nearly 60%. This greatly shortens TuSimple's model optimization time, accelerates model iteration, and raises the company's technical barrier to entry.
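The gap between added hardware and delivered performance is usually expressed as scaling efficiency: measured throughput divided by the throughput that perfect linear scaling would give. The following is a minimal sketch of that calculation. The function name and all throughput figures are hypothetical and chosen only for illustration; they are not TuSimple measurements.

```python
def scaling_efficiency(single_gpu_throughput: float,
                       n_gpus: int,
                       measured_throughput: float) -> float:
    """Return the fraction of ideal linear speedup that was actually achieved."""
    ideal_throughput = single_gpu_throughput * n_gpus
    return measured_throughput / ideal_throughput


# Hypothetical numbers (samples per second), for illustration only.
SINGLE_GPU = 100.0
MEASURED = {8: 720.0, 32: 2400.0, 64: 3840.0}

for n_gpus, throughput in MEASURED.items():
    efficiency = scaling_efficiency(SINGLE_GPU, n_gpus, throughput)
    speedup = throughput / SINGLE_GPU
    print(f"{n_gpus} GPUs: speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
```

With these made-up inputs, efficiency falls as the cluster grows, which is the pattern the paragraph above describes: cost rises with every card, while the return per card shrinks.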


Because TuSimple's overall business architecture is already containerized, the company can respond quickly to temporary peaks. With the Alibaba Cloud Serverless Kubernetes (ASK) container service, TuSimple can launch large-scale container clusters on Alibaba Cloud in seconds whenever tests are needed, instantly gaining massive computing power and shortening model testing time by 60%. The computing power is released promptly after testing to avoid wasting resources.

Alibaba Cloud ASK is an O&M-free serverless Kubernetes container service. It uses Elastic Container Instance (ECI) at the underlying layer as its container computing infrastructure and provides a highly elastic, low-cost, O&M-free serverless container runtime environment. This frees users from O&M and capacity planning for container clusters and significantly reduces TuSimple's O&M workload.

In addition, ASK is billed by the second, which suits bursty, highly concurrent, short-lived simulation tasks. For long-running training tasks, TuSimple uses subscription-based Alibaba Cloud Kubernetes (ACK). Combining ACK for long-running tasks with ASK for short-lived tasks both improves TuSimple's resource utilization and reduces cost.
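As a rough illustration of what a burst of simulation work can look like on a Kubernetes-compatible service, the sketch below builds a standard `batch/v1` Job manifest that runs many pods in parallel and cleans itself up after finishing. It uses only generic Kubernetes fields. The job name, image URL, pod count, and the `nvidia.com/gpu` resource name are assumptions for the example, and any ASK-specific or ECI-specific settings that a real deployment may need are not shown.

```python
import json


def simulation_job(name: str, image: str, parallelism: int,
                   gpus_per_pod: int = 1) -> dict:
    """Build a Kubernetes batch/v1 Job manifest for a parallel simulation run."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Run this many pods at once and require the same number to succeed.
            "parallelism": parallelism,
            "completions": parallelism,
            # Delete the finished Job (and its pods) after five minutes.
            "ttlSecondsAfterFinished": 300,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "simulator",
                        "image": image,
                        # Assumed GPU resource name; depends on the cluster setup.
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }],
                }
            },
        },
    }


# Hypothetical job name and image, for illustration only.
manifest = simulation_job("sim-batch-001",
                          "registry.example.com/simulator:latest",
                          parallelism=200)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`. On a serverless cluster there are no nodes to provision beforehand, so capacity exists only while the pods of the Job are running.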
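The trade-off between subscription and per-second billing can be checked with simple arithmetic. The sketch below compares the two for a bursty workload and computes the break-even point. All prices, GPU counts, and usage hours are hypothetical placeholders, not Alibaba Cloud list prices.

```python
def subscription_cost(gpus: int, monthly_price_per_gpu: float) -> float:
    """Monthly cost of reserving GPUs, whether or not they are busy."""
    return gpus * monthly_price_per_gpu


def per_second_cost(gpus: int, busy_seconds: float,
                    price_per_gpu_second: float) -> float:
    """Monthly cost when paying only for the seconds the GPUs actually run."""
    return gpus * busy_seconds * price_per_gpu_second


def break_even_hours(monthly_price_per_gpu: float,
                     price_per_gpu_second: float) -> float:
    """Busy hours per month above which a subscription becomes cheaper."""
    return monthly_price_per_gpu / (price_per_gpu_second * 3600)


# Hypothetical prices and usage, for illustration only.
GPUS = 100
MONTHLY_PRICE = 1000.0          # per GPU, subscription
PRICE_PER_SECOND = 3.0 / 3600   # per GPU, pay-as-you-go (3.0 per hour)
BUSY_HOURS = 20                 # simulation bursts per month

reserved = subscription_cost(GPUS, MONTHLY_PRICE)
on_demand = per_second_cost(GPUS, BUSY_HOURS * 3600, PRICE_PER_SECOND)
threshold = break_even_hours(MONTHLY_PRICE, PRICE_PER_SECOND)

print(f"subscription: {reserved:.0f}, per-second: {on_demand:.0f}")
print(f"break-even at about {threshold:.0f} busy hours per month")
```

Under these assumed numbers, short bursts fall far below the break-even point and favor per-second billing, while training jobs that keep GPUs busy most of the month favor a subscription. That is the reasoning behind pairing ASK with ACK.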


Benefits of Migrating to the Cloud

  • AIACC improves TuSimple's training performance by nearly 60%, greatly reducing model optimization time, accelerating model iteration, and raising the company's technical barrier to entry.
  • ASK shortens model testing time by 60% and releases the computing power promptly after completion to avoid wasting resources.
  • ASK frees users from container cluster O&M and capacity planning, reducing TuSimple's O&M workload.
  • Combining ACK for long-running tasks with ASK for short-lived tasks improves TuSimple's resource utilization and reduces cost.

Products Used in the Solution

Alibaba Cloud Kubernetes (ACK)

Alibaba Cloud Kubernetes (ACK) provides high-performance, scalable container management and supports lifecycle management for enterprise-level containerized applications. It integrates Alibaba Cloud virtualization, storage, network, and security capabilities and provides an optimal environment for running cloud-based containerized applications.

For more information about ACK, see https://www.alibabacloud.com/product/kubernetes

AIACC Acceleration Engine

The Apsara AI Acceleration (AIACC) engine from Alibaba Cloud Elastic GPU Service is one of the first acceleration engines in the industry to uniformly accelerate TensorFlow, MXNet, Caffe, PyTorch, and other mainstream deep learning frameworks. It topped four of the Stanford DAWNBench deep learning rankings for image recognition.

For more information about AIACC, see https://www.alibabacloud.com/help/doc-detail/163494.htm

Alibaba Cloud Serverless Kubernetes (ASK)

Alibaba Cloud Serverless Kubernetes (ASK) is a secure and reliable container service based on the Alibaba Cloud elastic computing architecture. It is fully compatible with the Kubernetes ecosystem. With ECI, you can create Kubernetes applications easily without having to manage or maintain clusters. You are billed based on the amount of CPU and memory resources used by applications, which allows you to focus on the application itself rather than the infrastructure that runs the application.

For more information about ASK, see https://www.alibabacloud.com/help/doc-detail/86366.htm

Elastic Container Instance (ECI)

Elastic Container Instance (ECI) provides a secure service for running serverless containers. It allows you to run containers from packaged container images without managing servers. You pay only for the resources that the containers consume.

For more information about ECI, see https://www.alibabacloud.com/product/elastic-container-instance

Alibaba Container Service