FastGPU is a set of tools provided by Alibaba Cloud for the fast deployment of AI computing workloads. With its convenient interfaces and automation tools, you can deploy AI training and inference tasks on infrastructure as a service (IaaS) resources of Alibaba Cloud.

Introduction

FastGPU bridges your offline AI algorithms and the large pool of Alibaba Cloud GPU resources, and makes it easy to build AI computing tasks on Alibaba Cloud IaaS resources. With FastGPU, you do not need to deploy computing, storage, or network resources at the IaaS layer yourself.

FastGPU contains the following components:
  • The runtime component ncluster: provides interfaces to deploy offline AI training and inference scripts to Alibaba Cloud IaaS resources. For more information, see Use FastGPU SDK for Python.
  • The command-line component ecluster: provides command-line tools to manage the status of Alibaba Cloud AI computing tasks and the lifecycle of clusters. For more information, see Command reference.
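The runtime component follows a task-oriented pattern: you request a task backed by a cloud instance and then run commands on it. The sketch below illustrates that call shape only; `make_task`, `Task.upload`, and `Task.run` are hypothetical stand-ins backed by a local stub, not the official ncluster API.

```python
# Hypothetical stand-in for an ncluster-style runtime interface.
# The real FastGPU SDK provisions Alibaba Cloud resources; this stub
# only records calls to illustrate the programming model.

class Task:
    """Represents one provisioned instance on which commands run."""

    def __init__(self, name: str, instance_type: str):
        self.name = name
        self.instance_type = instance_type
        self.commands: list[str] = []

    def upload(self, local_path: str) -> None:
        # In a real SDK this would copy files to the remote instance.
        self.commands.append(f"upload {local_path}")

    def run(self, cmd: str) -> None:
        # In a real SDK this would execute the command remotely.
        self.commands.append(cmd)


def make_task(name: str, instance_type: str = "ecs.gn6v-c8g1.2xlarge") -> Task:
    """Request a task backed by a GPU instance (type name is illustrative)."""
    return Task(name, instance_type)


if __name__ == "__main__":
    task = make_task("train-demo")
    task.upload("train.py")                 # stage the training script
    task.run("python train.py --epochs 1")  # launch training remotely
    print(task.commands)
```

In this pattern, the script that drives provisioning runs on your development host, while the commands passed to `run` execute on the provisioned instances.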

Modules

The following figure shows the modules of FastGPU.
  • Bottom layer: the resource interaction layer, which accesses Alibaba Cloud resources by calling OpenAPI operations.
  • Intermediate layer: the Alibaba Cloud backend layer, which encapsulates IaaS-layer resources as objects while AI tasks run.
  • Upper layer: the user control layer, which maps AI tasks to Alibaba Cloud instance resources.

    You only need to call the user control layer to build IaaS-level AI computing tasks on Alibaba Cloud.
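The layering described above can be sketched as three cooperating classes. The class and method names here are assumptions made for exposition, not FastGPU internals; `RunInstances` is the real ECS OpenAPI operation name, but the call is stubbed.

```python
# Illustrative sketch of the three FastGPU layers (class names are
# assumptions for exposition, not FastGPU internals).

class OpenApiLayer:
    """Bottom layer: interacts with Alibaba Cloud via OpenAPI operations."""

    def call(self, operation: str, **params) -> dict:
        # A real implementation would sign and send an HTTP request.
        return {"operation": operation, "params": params, "status": "ok"}


class BackendLayer:
    """Intermediate layer: encapsulates IaaS resources as objects."""

    def __init__(self, api: OpenApiLayer):
        self.api = api

    def create_instance(self, instance_type: str) -> dict:
        return self.api.call("RunInstances", InstanceType=instance_type)


class UserControlLayer:
    """Upper layer: maps an AI task to instance resources."""

    def __init__(self, backend: BackendLayer):
        self.backend = backend

    def build_task(self, instance_type: str) -> dict:
        # The user interacts only with this layer.
        return self.backend.create_instance(instance_type)


if __name__ == "__main__":
    control = UserControlLayer(BackendLayer(OpenApiLayer()))
    result = control.build_task("ecs.gn6v-c8g1.2xlarge")
    print(result["operation"])
```

The design point is that each layer depends only on the one below it, so user code never touches OpenAPI calls directly.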

Process

The following example shows how to complete a training task by using FastGPU:
  1. When you start to use FastGPU:

    Upload the training dataset to Object Storage Service (OSS), and create an Elastic Compute Service (ECS) instance as the development host to store the training code.

  2. When FastGPU builds the computing task:
    1. Use FastGPU on the development host to deploy the cluster and create the resources required by the task, including computing resources such as CPUs and GPUs, storage resources such as cloud disks and NAS file systems, and interactive resources such as Tmux and TensorBoard.
    2. The distributed training task starts automatically, and you can monitor the training in real time through the interactive resources.
    3. Resources are automatically released after the distributed training task is complete.
  3. When the task is complete: The trained models and log files are stored on the cloud disks of the development host or in OSS, where you can view the task results.
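The three phases above can be summarized as a simple orchestration outline. The function names are hypothetical and every cloud interaction is stubbed, so the sketch only mirrors the shape of the workflow, not the FastGPU API.

```python
# Hypothetical outline of the FastGPU training workflow described above.
# Each phase is stubbed; in practice these steps would call FastGPU and
# Alibaba Cloud services.

def prepare(dataset: str, code: str) -> dict:
    """Phase 1: upload the dataset to OSS and stage code on the dev host."""
    return {"oss_dataset": f"oss://bucket/{dataset}", "code": code}


def train(env: dict) -> dict:
    """Phase 2: deploy the cluster, run distributed training, release it."""
    log = [
        "create cluster (CPUs/GPUs, cloud disks, NAS, Tmux, TensorBoard)",
        f"run {env['code']} on {env['oss_dataset']}",
        "release cluster",
    ]
    return {"model": "model.bin", "log": log}


def collect(result: dict) -> list[str]:
    """Phase 3: store models and logs on cloud disks or OSS for review."""
    return [result["model"]] + result["log"]


if __name__ == "__main__":
    artifacts = collect(train(prepare("imagenet", "train.py")))
    print(artifacts[0])
```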