PAI-TensorFlow is a service-oriented product provided by Machine Learning Platform for AI (PAI). It aims to improve the efficiency of deep learning, optimize the kernel of native TensorFlow, and develop common tools. PAI-TensorFlow features distributed scheduling, global computing scheduling, online model prediction, and GPU mapping.

Background information

TensorFlow is an open source deep learning framework developed by Google. It supports multiple neural network models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. TensorFlow trains models efficiently on various types of data, such as video, image, and text. Its rich feature set and highly flexible APIs have drawn wide attention from the industry.

PAI-TensorFlow is provided by PAI to improve the efficiency of deep learning, with an optimized kernel of native TensorFlow and a set of common tools. It is fully compatible with native TensorFlow code and achieves high performance in industrial production scenarios. PAI-TensorFlow is deployed in some Alibaba Cloud services, such as PAI and E-MapReduce (EMR).


PAI-TensorFlow has the following features:
  • Service orientation

    Alibaba Cloud built MaxCompute, a big data computing service, on the Apsara system. MaxCompute is used by numerous enterprises and individual developers. PAI-TensorFlow lets you use the TensorFlow computing framework in MaxCompute. The API version of PAI-TensorFlow is the same as that of native TensorFlow. You can use the TensorFlow Training Script API to submit a task to the MaxCompute compute cluster.

  • Distributed scheduling

    PAI provides you with large amounts of computing resources. You can use GPU Quota to manage the resources. Based on the underlying distributed scheduling system, PAI-TensorFlow dynamically schedules tasks to different machines. When you submit a PAI-TensorFlow task, you do not need to request GPU hosts in advance. The required GPU resources are dynamically allocated and released.

  • Global computing scheduling

    When you use MaxCompute, you can submit SQL tasks and PAI-TensorFlow tasks in the same project at the same time. The global computing scheduling service of MaxCompute automatically schedules PAI-TensorFlow tasks to the appropriate GPU clusters. It also combines data preprocessing tasks that run on CPU clusters with model training tasks that run on GPU clusters.

  • Mapped GPUs

    PAI-TensorFlow assigns different operators to specified CPUs or GPUs. Because GPUs are mapped, you do not need to understand the GPU structure of the host. PAI-TensorFlow automatically maps the GPUs that a task requests to the process workspace. This way, the GPUs appear to the task as GPU:0, GPU:1, and so on.

  • Online model prediction

    PAI provides you with PAI-Elastic Algorithm Service (EAS) for online prediction. Models generated during PAI-TensorFlow training can be deployed in PAI-EAS. PAI-EAS covers a wide range of features, including dynamic scaling of models, rollover, A/B testing, high throughput, and low latency.
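
The GPU mapping described in the feature list can be pictured with a small conceptual sketch. It is not the PAI-TensorFlow implementation; the function name map_gpus and the physical GPU IDs 3 and 5 are hypothetical and serve only to show how a task might see consecutive logical names regardless of which physical GPUs the scheduler grants.

```python
from typing import Dict, List


def map_gpus(physical_ids: List[int]) -> Dict[str, int]:
    """Assign consecutive logical names (GPU:0, GPU:1, ...) to physical GPU IDs.

    Conceptual illustration only: the real mapping is performed by
    PAI-TensorFlow and is not exposed through a function like this one.
    """
    return {
        f"GPU:{logical}": physical
        for logical, physical in enumerate(physical_ids)
    }


# Hypothetical case: the scheduler granted physical GPUs 3 and 5 to the task.
mapping = map_gpus([3, 5])
for logical_name, physical_id in mapping.items():
    print(f"{logical_name} -> physical GPU {physical_id}")
```

Under this assumption, the task code only ever refers to GPU:0 and GPU:1, so the same script runs unchanged on hosts with different GPU layouts.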

Supported Python libraries

PAI-TensorFlow comes with common Python libraries preinstalled, such as NumPy and Six. You can import these libraries directly in a TensorFlow task.
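
If you are unsure whether a library is available in the task environment, a quick check before importing it can help. The following is a minimal sketch that uses only the Python standard library; the helper name available_libraries is hypothetical, and the names passed to it are the top-level import names of the libraries mentioned above.

```python
import importlib.util
from typing import Dict, List


def available_libraries(names: List[str]) -> Dict[str, bool]:
    """Report whether each top-level module name can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Check the libraries named in this section; results depend on the environment.
status = available_libraries(["numpy", "six"])
for name, found in status.items():
    print(f"{name}: {'available' if found else 'not found'}")
```

A library reported as available can then be imported in the training script in the usual way, for example with import numpy.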