This topic describes how to install and use FastGPU to build training tasks on a client that runs 64-bit Ubuntu 18.04. FastGPU consists of the following components:
- The runtime component ncluster: provides interfaces to deploy offline AI training and inference scripts to Alibaba Cloud IaaS resources. For more information about the runtime component, see Description of the runtime component ncluster.
- The command line-based component ecluster: provides command line-based tools to manage the status of Alibaba Cloud AI computing tasks and the lifecycle of clusters. For more information about the command line-based component, see Description of ecluster.
- Download the FastGPU package to the client.
- Install FastGPU.
pip install ncluster-1.0.8-py3-none-any.whl
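After the installation, it can be useful to confirm that the package is visible to the Python interpreter that will run the training scripts. The following is a minimal sketch that uses only the Python standard library. It assumes that the wheel installs an importable module named `ncluster` (inferred from the wheel file name, not confirmed by this topic). The helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "ncluster" is an assumed module name derived from the wheel file name.
    status = "found" if is_installed("ncluster") else "not found"
    print(f"ncluster: {status}")
```

If the module is reported as not found, the wheel may have been installed for a different Python interpreter than the one that runs this check.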
Run the FastGPU demo
- GTC-demo: gesture recognition training by using PyTorch.
- InsightFace: facial recognition training by using MXNet.
- BERT: natural language processing training by using TensorFlow.
The following operations use the BERT model to show how to use FastGPU in Cloud Shell. The instance automatically created in the demo is of the ecs.gn6v-c10g1.20xlarge instance type that has eight V100 GPUs. The task deployment time is about 2.5 minutes, and the training duration is 11.5 minutes. Therefore, the total time is 14 minutes. The training precision is above 0.88.
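The timing figures above can be checked with a few lines of arithmetic, and the same numbers give a rough idea of what one demo run costs. The following is a sketch only. The hourly price is a hypothetical placeholder and not a real Alibaba Cloud price, and the calculation assumes simple pro-rata billing, which may differ from the actual billing rules.

```python
DEPLOY_MINUTES = 2.5   # task deployment time quoted in this topic
TRAIN_MINUTES = 11.5   # training duration quoted in this topic


def total_minutes(deploy: float, train: float) -> float:
    """Total wall-clock time of one run, in minutes."""
    return deploy + train


def estimated_cost(minutes: float, hourly_price: float) -> float:
    """Pro-rata cost estimate: fraction of an hour times the hourly price."""
    return minutes / 60.0 * hourly_price


if __name__ == "__main__":
    total = total_minutes(DEPLOY_MINUTES, TRAIN_MINUTES)
    print(f"Total time: {total} minutes")

    # HYPOTHETICAL hourly price, used only to illustrate the calculation.
    hypothetical_hourly_price = 30.0
    cost = estimated_cost(total, hypothetical_hourly_price)
    print(f"Estimated cost: {cost:.2f} (currency units, hypothetical)")
```

Replace the placeholder price with the actual price of the instance type in your region before you rely on the estimate.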
- Open Cloud Shell. In this test, 64-bit Ubuntu 18.04 is used in Cloud Shell, and FastGPU is automatically installed. You can directly prepare the project file and execute the task.
- Prepare the project file.
git clone https://github.com/aliyun/alibabacloud-aiacc-demo
- Go to the directory of the task script.
- Run the task script.
python train_news_classifier.py
When the task is running, resources such as instances are automatically created. You may be prompted that you will be charged for these resources. Confirm to continue as prompted.
Notice: After the task is complete, release the instances to avoid further charges.
If the result in the following figure is returned, the script has been executed.
- Check the instance created when the task is being run.
- Log on to the instance to view logs of the training process.
ecluster tmux task0.perseus-bert
If the result in the following figure is returned, the training is complete.
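If you script the log check, the `ecluster tmux` command can be assembled from a job name and a task index. The following is a minimal sketch. It assumes that sessions are named `task<index>.<job name>`, a pattern inferred only from the single example `task0.perseus-bert` in this topic. The helper name `tmux_command` is hypothetical, and the sketch only builds the command without running it.

```python
import shlex
from typing import List


def tmux_command(job_name: str, task_index: int = 0) -> List[str]:
    """Build the argument list for attaching to a task's tmux session.

    The "task<index>.<job name>" pattern is an assumption based on the
    example "task0.perseus-bert".
    """
    session = f"task{task_index}.{job_name}"
    return ["ecluster", "tmux", session]


if __name__ == "__main__":
    # Print the command as a shell-safe string instead of executing it.
    print(shlex.join(tmux_command("perseus-bert")))
```

The returned list can be passed to `subprocess.run` on a client where `ecluster` is installed.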