This topic describes how to install FastGPU and use it to build training tasks. The examples use a client that runs 64-bit Ubuntu 18.04.

Prerequisites

Python 3.6 or later is installed on the client.
Note: To build AI computing tasks, you can install FastGPU on an ECS instance, an on-premises machine, or Alibaba Cloud Shell and use it as the client.
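
A quick way to confirm the prerequisite is to ask the interpreter for its own version. The following is a minimal sketch, not part of FastGPU; the helper name meets_minimum is hypothetical and exists only for illustration.

```python
import sys

# Minimum version stated in the prerequisites above.
MIN_VERSION = (3, 6)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`.

    Hypothetical helper for illustration; it is not a FastGPU API.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old for FastGPU"
    print("Python", sys.version.split()[0], status)
```

Run the script with the same interpreter that you plan to use for pip, so that the check reflects the environment in which FastGPU will be installed.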

Background information

FastGPU contains the following components:
  • The runtime component ncluster: provides interfaces to deploy offline AI training and inference scripts to Alibaba Cloud IaaS resources. For more information, see the description of the ncluster runtime component.
  • The command-line component ecluster: provides command-line tools to manage the status of Alibaba Cloud AI computing tasks and the lifecycle of clusters. For more information, see Description of ecluster.

Install FastGPU

  1. Download the FastGPU package to the client.
    wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/fastgpu/ncluster-1.0.8-py3-none-any.whl
  2. Install FastGPU.
    pip install ncluster-1.0.8-py3-none-any.whl
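
After the installation, you may want to confirm that the package is visible to the interpreter. The following is a minimal sketch that assumes the importable module is named ncluster, as the wheel file name suggests; that name is an assumption, and the helper is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name="ncluster"):
    """Return True if a top-level module named `module_name` can be located.

    The default name is an assumption taken from the wheel file name;
    this helper is illustrative and not part of FastGPU.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("ncluster found" if is_installed() else "ncluster not found")
```

The check only locates the module and does not import it, so it has no side effects on the client.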

Run the FastGPU demo

FastGPU provides the following training demos, which you can download from GitHub.
  • GTC-demo: gesture recognition training with PyTorch.
  • InsightFace: facial recognition training with MXNet.
  • Bert: news text classification training with TensorFlow.

The following procedure uses the BERT demo to show how to use FastGPU in Cloud Shell. The demo automatically creates an instance of the ecs.gn6v-c10g1.20xlarge instance type, which has eight V100 GPUs. Task deployment takes about 2.5 minutes and training takes about 11.5 minutes, for a total of about 14 minutes. The training precision is above 0.88.
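
The timing figures above can be checked with simple arithmetic. The following sketch uses only the two durations quoted in this topic; the variable names are illustrative.

```python
# Durations quoted in this topic, in minutes.
deploy_minutes = 2.5
training_minutes = 11.5

# Total wall-clock time for the demo run.
total_minutes = deploy_minutes + training_minutes

# Fraction of the total time that is spent on deployment rather than training.
deploy_share = deploy_minutes / total_minutes

print(f"total: {total_minutes} min, deployment share: {deploy_share:.0%}")
```

Deployment accounts for a bit less than one fifth of the run, so most of the billed time goes to training itself.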

  1. Open Cloud Shell.
    In this test, Cloud Shell runs 64-bit Ubuntu 18.04 and FastGPU is preinstalled, so you can directly prepare the project files and run the task.
  2. Prepare the project file.
    git clone https://github.com/aliyun/alibabacloud-aiacc-demo
  3. Go to the directory of the task script.
    cd alibabacloud-aiacc-demo/tensorflow/bert
  4. Run the task script.
    python train_news_classifier.py
    When the task runs, resources such as instances are created automatically. You may be prompted to confirm that you will be charged for these resources. Confirm as prompted to continue. (Figure: prompt-fee)
    Notice: After the task is complete, release the instances to avoid further charges.
    If the result shown in the following figure is returned, the script has run successfully. (Figure: training-complete)
  5. View the instances that were created for the task.
    ecluster ls
    (Figure: ecluster-ls)
  6. Log on to the instance to view the training logs.
    ecluster tmux task0.perseus-bert
    If the result shown in the following figure is returned, the training is complete. (Figure: log)
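
If you want to script the instance check from step 5, for example to remind yourself to release instances after a run, one option is to call ecluster ls from Python. The following is a minimal sketch: it uses only the ecluster ls command shown above, the helper name list_instances is hypothetical, and the output is returned unparsed because its format is not described in this topic.

```python
import shutil
import subprocess


def list_instances(command="ecluster"):
    """Run `<command> ls` and return its standard output as text.

    Returns None if the command is not found on PATH. Hypothetical helper
    for illustration; it only wraps the `ecluster ls` call from step 5.
    """
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, "ls"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; this spelling also works on Python 3.6
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    output = list_instances()
    print(output if output is not None else "ecluster is not installed on this client")
```

The helper avoids the capture_output argument so that it stays compatible with Python 3.6, the minimum version listed in the prerequisites.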