FastGPU is a tool provided by Alibaba Cloud to build AI computing tasks. This tool provides you with interfaces and command lines to build AI computing tasks on Alibaba Cloud IaaS resources. This topic describes how to install and use FastGPU and lists the runtime interfaces and command lines supported by FastGPU. Ubuntu 18.04 64-bit is used in the examples.

Prerequisites

Python 3.6 or later is installed on a client.
Note: You can use an ECS instance, an on-premises machine, or Alibaba Cloud Shell as the client on which you install FastGPU and build AI computing tasks.
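The Python prerequisite can be confirmed on the client with a short check such as the following. This is a minimal sketch; the helper name meets_python_requirement is hypothetical and is not part of FastGPU.

```python
import sys

# Minimum interpreter version stated in the prerequisites.
MIN_VERSION = (3, 6)


def meets_python_requirement(version_info=sys.version_info):
    """Return True if the given version is Python 3.6 or later."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if meets_python_requirement():
        print("Python version is sufficient for FastGPU.")
    else:
        print("Python 3.6 or later is required.")
```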

Background information

FastGPU connects your offline AI algorithms to the large pool of GPU resources available on Alibaba Cloud. It makes it easy to build AI computing tasks on Alibaba Cloud IaaS resources: you do not need to deploy computing, storage, or network resources on the IaaS layer yourself.

FastGPU contains the following components:
  • The runtime component ncluster: provides interfaces to easily deploy offline AI training and inference scripts to Alibaba Cloud IaaS resources. For more information about the runtime component, see the Runtime component description section.
  • The command line-based component ecluster: provides command line-based tools to manage the status of Alibaba Cloud AI computing tasks and the lifecycle of clusters. For more information about the command line-based component, see the Command line-based component description section.
The following figure shows the modules of FastGPU.
[Figure: FastGPU architecture]

Install FastGPU

  1. Download the FastGPU package on the client.
    wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/fastgpu/ncluster-1.0.8-py3-none-any.whl
  2. Install FastGPU.
    pip install ncluster-1.0.8-py3-none-any.whl
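After the installation, one way to confirm that the package is visible to the interpreter is to look up the module without importing it. The sketch below assumes only that the wheel installs a module named ncluster, as the import statement later in this topic suggests; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name="ncluster"):
    """Return True if a top-level module with this name is on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("ncluster is installed.")
    else:
        print("ncluster was not found. Check that pip used the same Python interpreter.")
```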

Run the FastGPU demo

This section describes how to use FastGPU to run a task. In this example, the task fine-tunes a BERT model and is started from Cloud Shell. The demo automatically creates an instance of the ecs.gn6v-c10g1.20xlarge instance type, which has eight V100 GPUs. Deploying the task takes about 2.5 minutes and training takes about 11.5 minutes, for a total of about 14 minutes. The training precision is above 0.88.

  1. Open Cloud Shell.
    In this example, Cloud Shell runs Ubuntu 18.04 64-bit and FastGPU is already installed, so you can prepare the project files and run the task directly.
  2. Prepare the project file.
    git clone https://github.com/aliyun/alibabacloud-aiacc-demo
  3. Go to the directory of the task script.
    cd alibabacloud-aiacc-demo/tensorflow/bert
  4. Run the task script.
    python train_news_classifier.py
    While the task is running, resources such as instances are automatically created. You may be prompted that you will be charged for these resources. Confirm as prompted to continue.
    [Figure: fee confirmation prompt]
    Notice: After the task is complete, release the instance to avoid further charges.
    If the result in the following figure is returned, the script has been executed.
    [Figure: training complete]
  5. View the instance that was created while the task was running.
    ecluster ls
    [Figure: output of ecluster ls]
  6. Log on to the instance to view logs of the training process.
    ecluster tmux task0.perseus-bert
    If the result in the following figure is returned, the training is complete.
    [Figure: training log]

Runtime component description

You can use the ncluster interface to easily deploy AI training and inference scripts to the cloud for computing. The ncluster interface provides the following features:
  • Obtain information such as the Alibaba Cloud AccessKey pair, default region, and default zone from environment variables.
    export ALIYUN_ACCESS_KEY_ID=L****      # Your Alibaba Cloud AccessKey ID
    export ALIYUN_ACCESS_KEY_SECRET=v****   # Your Alibaba Cloud AccessKey secret
    export ALIYUN_DEFAULT_REGION=cn-hangzhou  # The region of the resources you want to use
    export ALIYUN_DEFAULT_ZONE=cn-hangzhou-i  # The zone in that region that you want to use
  • ncluster is a set of Python libraries. You must import ncluster in your Python script to use the interface.
    import ncluster
  • Create resources required for the task or reuse existing resources.
    job = ncluster.make_job(name=args.name,
                            run_name=f"{args.name}-{args.machines}",
                            num_tasks=args.machines,
                            image_name=IMAGE_NAME,
                            instance_type=INSTANCE_TYPE)
    The following table describes parameters of ncluster.make_job.
    Parameter Description Example
    name The name of the job. 'perseus-bert'.
    run_name The name of the runtime environment, which is typically set to the job name followed by the number of instances. f"perseus-bert-1".
    num_tasks The number of instances to create. 1. A value of 1 indicates that one instance is created. The instance created in the preceding example is named task0.perseus-bert, which corresponds to the task perseus-bert.tasks[0].
    image_name The image of the instance. Both public and custom images are supported. 'ubuntu_18_04_64_20G_alibase_20190624.vhd'.
    instance_type The instance type of the instance to create. 'ecs.gn6v-c10g1.20xlarge'.
  • Run the task. You can run the task in a job or task mode. A job is a group of tasks.
    Note: A job and its tasks support the same API operations. An operation called on a job applies to all of its tasks, whereas an operation called on a task applies only to that task.
    Example:
    • Call an API operation for a job
      # Change to the perseus-bert directory on all instances that run the tasks of the job.
      job.run('cd perseus-bert')

      # Upload the perseus-bert folder from the current directory to the /root directory of all instances that run the tasks of the job.
      job.upload('perseus-bert')
    • Call an API operation for a task
      # Change to the perseus-bert directory on the instance that runs task0.
      job.tasks[0].run('cd perseus-bert')

      # Upload the perseus-bert folder from the current directory to the /root directory of the instance that runs task0.
      job.tasks[0].upload('perseus-bert')
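Because ncluster reads the account settings from the environment, a script can check that they are present before it calls ncluster.make_job. The following is a minimal sketch. It assumes that all four variables shown in the export example are needed. The helper name missing_fastgpu_vars is hypothetical and is not part of ncluster.

```python
import os

# Variable names taken from the export example in this topic.
REQUIRED_VARS = (
    "ALIYUN_ACCESS_KEY_ID",
    "ALIYUN_ACCESS_KEY_SECRET",
    "ALIYUN_DEFAULT_REGION",
    "ALIYUN_DEFAULT_ZONE",
)


def missing_fastgpu_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_fastgpu_vars()
    if missing:
        print("Set these variables before running the task:", ", ".join(missing))
    else:
        print("All FastGPU account variables are set.")
```

Passing a dictionary instead of relying on os.environ keeps the check easy to test without touching real credentials.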

Command line-based component description

You can use ecluster commands to manage the resource lifecycle and view logs of running processes. The following table describes commands supported by ecluster.
Command Description Example
export Sets the Alibaba Cloud account information that FastGPU reads. When you use FastGPU on an on-premises machine, you must set information such as the AccessKey pair, default region, and default zone.
  • export ALIYUN_ACCESS_KEY_ID=L****
  • export ALIYUN_ACCESS_KEY_SECRET=v****
  • export ALIYUN_DEFAULT_REGION=cn-hangzhou
  • export ALIYUN_DEFAULT_ZONE=cn-hangzhou-i
ecluster [help,-h,--help] Displays all ecluster commands. ecluster --help
ecluster {command} --help Displays help for a specific ecluster command. ecluster ls --help
ecluster create --config create.cfg Creates an instance based on a configuration file. The create.cfg file specifies the configuration environment of an instance. You must create a create.cfg file before you run this command. For more information about the create.cfg file, see the example below this table. ecluster create --config create.cfg
ecluster create --name {instance_name} --machines {instance_num} ... Creates an instance based on the parameters. ecluster create --name task0.ncluster-v100 --machines 1
ecluster ls Lists the automatically created instances. The following parameters of the instances are displayed:
  • name: the name of the instance.
  • hours_live: the time elapsed since the instance was created. Unit: hours.
  • instance_type: the instance type of the instance.
  • public_ip: the public IP address of the instance.
  • key/owner: the key pair or username of the instance.
  • private_ip: the internal IP address of the instance.
  • instance_id: the ID of the instance.
ecluster ls
ecluster ssh {instance_name} Logs on to a specific instance. ecluster ssh task0.ncluster-v100
ecluster tmux {instance_name} Connects to a running task. If no tmux sessions exist, SSH is used. ecluster tmux task0.ncluster-v100
ecluster stop {instance_name} Stops a specific instance.
  • ecluster stop task0.ncluster-v100: stops the instance that runs task0
  • ecluster stop {ncluster-v100}: stops all instances that run the tasks of the job
ecluster start {instance_name} Starts a specific instance.
  • ecluster start task0.ncluster-v100: starts the instance that runs task0
  • ecluster start {ncluster-v100}: starts all instances that run the tasks of the job
ecluster kill {instance_name} Releases a specific instance.
  • ecluster kill task0.ncluster-v100: releases the instance that runs task0
  • ecluster kill {ncluster-v100}: releases all instances that run the tasks of the job
ecluster mount {instance_name} Mounts a NAS file system to the /ncluster directory of a specific instance. ecluster mount task0.ncluster-v100
ecluster scp {source} {destination} Copies a file or directory securely. ecluster scp /local/path/to/upload task0.ncluster-v100:/remote/path/to/save
ecluster addip {instance_name} Adds the public IP address of an instance used in a specific task to a security group. ecluster addip task0.ncluster-v100
ecluster rename {old_name} {new_name} Renames a specific instance. ecluster rename task0.ncluster-v100 task1.ncluster-v100
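The examples in the preceding table address instances by names such as task0.ncluster-v100. The sketch below derives such names from a job name and builds the matching ecluster argument lists. The task{index}.{job name} pattern is an assumption inferred from the examples in this topic, not a documented rule, and the helper functions are hypothetical.

```python
def run_name(job_name, machines):
    """Build a run name the way the make_job example does: job name plus instance count."""
    return f"{job_name}-{machines}"


def instance_names(job_name, num_tasks):
    """Return the instance names assumed for a job (pattern inferred from the examples)."""
    return [f"task{index}.{job_name}" for index in range(num_tasks)]


def ecluster_args(action, target):
    """Return an argument list such as ['ecluster', 'stop', 'task0.ncluster-v100']."""
    return ["ecluster", action, target]


if __name__ == "__main__":
    # Print the stop command for every instance of a two-machine job.
    for name in instance_names("ncluster-v100", 2):
        print(" ".join(ecluster_args("stop", name)))
```

The argument lists can be passed to subprocess.run on a client where ecluster is installed.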
When you create an instance based on a configuration file, you can use the following sample as a template for the configuration file:
; config.ini

[ncluster]
; The name of the job to create.
name=ncluster-v100
; The number of machines you want to create.
machines=1
; The system disk size of the instances, in GB.
system_disk_size=300
; The data disk size of the instances, in GB.
data_disk_size=0
; The name of the system image to install on the instances.
image_name=ubuntu_18_04_64_20G_alibase_20190624.vhd
; The instance type you want to create on Alibaba Cloud.
instance_type=ecs.gn6v-c10g1.20xlarge
; The spot instance option. To buy spot instances, set it to True.
spot=False
; Confirms that the next operation will cost money. If set to True, the cost is confirmed by default.
confirm_cost=False
; If the job is only used to create instances, this can be set to True.
skip_setup=True
; NAS create/mount option. Setting it to True disables NAS mounting for the current job.
disable_nas=True
; The zone ID. Resources in this zone are used.
zone_id=cn-hangzhou-i
; The name of the VPC.
vpc_name=ncluster-vpc

[cmd]
install_script=pwd
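The sample above uses INI syntax, so it can be checked with the Python standard library before it is passed to ecluster create. The following is a minimal sketch. Whether ecluster itself parses the file with configparser is not stated in this topic, and the helper name load_cluster_config is hypothetical.

```python
import configparser

# A shortened copy of the sample configuration, used only for illustration.
SAMPLE = """\
[ncluster]
name=ncluster-v100
machines=1
system_disk_size=300
spot=False
disable_nas=True
zone_id=cn-hangzhou-i

[cmd]
install_script=pwd
"""


def load_cluster_config(text):
    """Parse the configuration text and return selected options with Python types."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["ncluster"]
    return {
        "name": section["name"],
        "machines": section.getint("machines"),
        "system_disk_size": section.getint("system_disk_size"),
        "spot": section.getboolean("spot"),
        "disable_nas": section.getboolean("disable_nas"),
        "zone_id": section["zone_id"],
        "install_script": parser["cmd"]["install_script"],
    }


if __name__ == "__main__":
    for key, value in load_cluster_config(SAMPLE).items():
        print(f"{key} = {value!r}")
```

Lines that begin with a semicolon are treated as comments by configparser, so the commented sample in this topic parses the same way.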