You can use FastGPU SDK for Python to integrate FastGPU into your AI training or inference scripts. This enables fast cloud deployment and resource management.

Prerequisites

Python 3.6 or later is installed on the client.

Note: The client on which you install FastGPU to build AI computing tasks can be an Elastic Compute Service (ECS) instance, an on-premises machine, or Alibaba Cloud Shell.

Prepare an environment

  1. Run the following command to install the FastGPU package:
    pip3 install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/fastgpu/fastgpu-1.1.5-py3-none-any.whl
  2. Configure the environment variables.
    Obtain the AccessKey pair of your Alibaba Cloud account, the ID of the default region, and the ID of the default zone. Then, run the following commands on the client (your ECS instance, your on-premises machine, or Alibaba Cloud Shell):
    export ALIYUN_ACCESS_KEY_ID=****          # Enter your AccessKey ID.
    export ALIYUN_ACCESS_KEY_SECRET=****      # Enter your AccessKey secret.
    export ALIYUN_DEFAULT_REGION=cn-hangzhou  # Enter the ID of the region.
    export ALIYUN_DEFAULT_ZONE=cn-hangzhou-i  # Optional. Enter the ID of the zone.
  3. Add the following statement to your Python code to import the FastGPU module:
    import fastgpu
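
Before the import runs, it can help to confirm that the variables from step 2 are set. The following is a minimal sketch, not part of FastGPU: the helper names missing_fastgpu_env and check_fastgpu_env are hypothetical, and treating the region as required is an assumption based on only the zone being marked optional above.

```python
import os

# Variable names taken from the export commands in step 2.
# Assumption: everything except the zone is required.
REQUIRED_VARS = (
    "ALIYUN_ACCESS_KEY_ID",
    "ALIYUN_ACCESS_KEY_SECRET",
    "ALIYUN_DEFAULT_REGION",
)
OPTIONAL_VARS = ("ALIYUN_DEFAULT_ZONE",)


def missing_fastgpu_env(env=None):
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_fastgpu_env(env=None):
    """Raise a RuntimeError that lists every missing required variable."""
    missing = missing_fastgpu_env(env)
    if missing:
        raise RuntimeError(
            "Set these environment variables first: " + ", ".join(missing)
        )
```

Calling check_fastgpu_env() at the top of a script, before import fastgpu, reports all missing variables at once.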

Create or obtain instances

The fastgpu.make_job method automatically creates a group of instances based on specified rules. If the specified group of instances already exists, the group of instances is returned.

job = fastgpu.make_job(
    name: str="",             # Required. The name of the instance cluster. 
    instance_type: str="",    # Required. The instance type. 
    num_tasks: int=0,         # The number of instances.
    install_script: str="",   # The initialization command.
    image_name: str="",       # The name of the image.
    image_type: str="",       # The type of the image.
    disk_size: int=500,       # The size of the data disk.
    spot: bool=False,         # Specifies whether to create preemptible instances.
    confirm_cost: bool=False, # Specifies whether to skip consumption warnings.
    install_cuda: bool=False, # Specifies whether to automatically install a GPU driver.
    mount_nas: bool=False     # Specifies whether to automatically mount an Apsara File Storage NAS file system.
)

The following table describes the parameters mentioned in the preceding example.

Parameter | Required | Description | Example
name | Yes | The name of the instance cluster. This parameter is empty by default, which indicates that instances are obtained from existing resources. | name="fastgpu_test"
instance_type | Yes | The instance type. You can run the fastgpu querygpu command to query GPU-accelerated instance types. For more information, see Instance families with GPU capabilities. | instance_type="ecs.gn6v-c8g1.2xlarge"
num_tasks | No | The number of instances. Default value: 1. | num_tasks=1
install_script | No | The script used to initialize the instances. This parameter is empty by default, which indicates that no command is run. | Start the SSH service after initialization: install_script="systemctl start sshd"
image_name | No | The name of the image used by the instances. This parameter is empty by default, which indicates that Alibaba Cloud Linux 2.1903 is used. You can run the fastgpu queryimage command to query images. | Specify a CentOS image: image_name="centos_8_5_x64_20G_alibase_202111129.vhd"
image_type | No | The type of the image used by the instances. You can set the image type to an OS type, such as aliyun, ubuntu, or centos, or to an OS version, such as ubuntu_18_04 or centos_7_9. You can also set the image type to AIACC. An AIACC image contains deep learning frameworks and AIACC-Training. For more information, see What is AIACC?. | image_type="ubuntu_16_04"; to use the latest AIACC image: image_type="AIACC"
disk_size | No | The size of the data disk. Default value: 500. Unit: GB. | disk_size=500
spot | No | Specifies whether to create preemptible instances. Default value: False. | spot=True
confirm_cost | No | Specifies whether to skip billing warnings. Default value: False, which indicates that billing warnings are not skipped and you must confirm the instance creation when a warning appears. | confirm_cost=True
install_cuda | No | Specifies whether to automatically install a GPU driver. Default value: False, which indicates that no GPU driver is automatically installed. | install_cuda=True
mount_nas | No | Specifies whether to automatically mount an Apsara File Storage NAS file system. Default value: False. For more information, see What is NAS?. | mount_nas=True
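
Because creating instances incurs charges, it can be useful to check the arguments locally before make_job is called. The following is a sketch under stated assumptions: build_job_kwargs is a hypothetical helper, not a FastGPU function, and the parameter names and types are copied from the table above. It does not check whether an instance type or image actually exists.

```python
# Parameter names and types as listed in the table above.
MAKE_JOB_PARAMS = {
    "name": str,
    "instance_type": str,
    "num_tasks": int,
    "install_script": str,
    "image_name": str,
    "image_type": str,
    "disk_size": int,
    "spot": bool,
    "confirm_cost": bool,
    "install_cuda": bool,
    "mount_nas": bool,
}


def build_job_kwargs(name, instance_type, **options):
    """Validate make_job arguments locally and return them as a dict."""
    kwargs = {"name": name, "instance_type": instance_type, **options}
    for key, value in kwargs.items():
        expected = MAKE_JOB_PARAMS.get(key)
        if expected is None:
            raise TypeError(f"unknown make_job parameter: {key}")
        # bool is a subclass of int, so reject True/False for int fields.
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            raise TypeError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )
    if not name or not instance_type:
        raise ValueError("name and instance_type must be non-empty")
    return kwargs


# Usage (requires FastGPU and valid credentials):
#   import fastgpu
#   job = fastgpu.make_job(**build_job_kwargs(
#       "fastgpu_test", "ecs.gn6v-c8g1.2xlarge", num_tasks=2))
```

A misspelled parameter name or a wrong value type then fails on the client, before any request is sent.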

Return value: A Job object that represents the instance cluster. A Job object can contain multiple tasks, and each task corresponds to one instance. To access a specific instance, access the corresponding task of the Job object, as shown in the following figure.

[Figure: a Job object that contains multiple tasks]
job = fastgpu.make_job(...) # Create a Job object.
job.run("ls -l")            # Run the ls -l command on the instance cluster.
job.tasks[0].run("ls -l")   # Run the ls -l command on an instance, such as task0.
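
The relationship between a Job object and its tasks can also be used to address the instances one by one. The following is a minimal sketch: run_on_each_task is a hypothetical helper, and it relies only on job.tasks and task.run as shown above. This document does not specify what run returns, so the sketch passes the value through unchanged.

```python
def run_on_each_task(job, cmd, **run_options):
    """Run cmd on every task of job, one task at a time.

    Returns a list of (task_index, result) pairs, where result is whatever
    task.run() returns; its type is not specified in this document.
    """
    results = []
    for index, task in enumerate(job.tasks):
        results.append((index, task.run(cmd, **run_options)))
    return results


# Usage (requires FastGPU and an existing Job object):
#   for index, result in run_on_each_task(job, "hostname"):
#       print(index, result)
```

Unlike job.run, which targets the whole cluster in one call, this keeps track of which instance produced which result.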

Example: Create a Job object named fastgpu_test that contains two tasks. You can access the created instances by accessing the tasks of the Job object. Sample code:

job = fastgpu.make_job(
    name="fastgpu_test",                   # The name of the instance cluster.
    num_tasks=2,                           # The number of instances. In this example, set this parameter to 2.
    instance_type="ecs.gn6v-c8g1.2xlarge", # The instance type.
    image_type="ubuntu_18_04",             # The type of the image used by the instances. In this example, set this parameter to ubuntu_18_04.
    disk_size=500,                         # The size of the data disk, such as 500 GB.
    confirm_cost=True,                     # Specifies whether to skip consumption warnings.
    spot=True,                             # Specifies whether to create preemptible instances.
    install_cuda=True,                     # Specifies whether to automatically install a GPU driver.
    mount_nas=True                         # Specifies whether to automatically mount an Apsara File Storage NAS file system.
)
task1 = job.tasks[0]
task2 = job.tasks[1]

Run commands

You can run a command on the whole instance cluster or an instance. After the command is run, the output is stored in the specified directory.

# Run a command on the whole instance cluster.
job.run(
    cmd,                       # The command that you want to run.
    sudo=False,                # Specifies whether to run the command with administrator privileges.
    non_blocking=False,        # Specifies whether to run the command in a non-blocking manner.
    ignore_errors=False,       # Specifies whether to ignore errors. By default, if an error occurs, the system throws an exception.
    max_wait_sec=365*24*3600,  # The maximum timeout period.
    show=False,                # Specifies whether to return the output after the command is run.
    show_realtime=False        # Specifies whether to display the output in real time.
)

# Run a command on an instance.
job.tasks[i].run(cmd, ...)

The following table describes the parameters mentioned in the preceding example.

Parameter | Description | Example
sudo | Specifies whether to run the command with administrator privileges. Default value: False, which indicates that the command is not run with administrator privileges. | sudo=True
non_blocking | Specifies whether to run the command in a non-blocking manner. Default value: False, which indicates that the system waits until the command finishes. | non_blocking=True
ignore_errors | Specifies whether to ignore errors. Default value: False, which indicates that if an error occurs, the system stops running the command and throws an exception. | ignore_errors=True
max_wait_sec | The maximum time to wait for the command to finish. Unit: seconds. Default value: 365*24*3600, which is one year. | Set the timeout period to 1 hour: max_wait_sec=3600
show | Specifies whether to return the output after the command is run. Default value: False. | show=True
show_realtime | Specifies whether to display the output in real time. Default value: False. | show_realtime=True

Sample code:

# Run the ls command on the whole instance cluster to query the files and folders in the working directory of each instance.
job.run("ls")
# Run the ls command on an instance to query the files and folders in the working directory of Instance i.
job.tasks[i].run("ls")
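
The parameters above can be combined to run a short setup sequence that stops at the first failure. The following is a sketch, not a FastGPU feature: run_steps is a hypothetical helper, and it relies on the documented behavior that job.run throws an exception on errors when ignore_errors=False.

```python
def run_steps(job, steps, max_wait_sec=3600):
    """Run commands in order on the whole cluster and stop at the first failure.

    steps is a list of (cmd, sudo) pairs. With ignore_errors=False, job.run()
    is documented to throw an exception on errors, so a failing step
    prevents the later steps from running.
    """
    completed = []
    for cmd, sudo in steps:
        job.run(
            cmd,
            sudo=sudo,
            ignore_errors=False,
            max_wait_sec=max_wait_sec,
            show_realtime=True,
        )
        completed.append(cmd)
    return completed


# Usage (requires FastGPU and an existing Job object):
#   run_steps(job, [("systemctl start sshd", True), ("ls", False)])
```

Setting max_wait_sec to a value far below the one-year default keeps a hung command from blocking the script indefinitely.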

Transfer files

  • Upload files to the whole instance cluster or an instance.
    # Upload files to the whole instance cluster.
    job.upload(local_fn: str, remote_fn: str="", dont_overwrite: bool=False)
    # Upload files to Instance i.
    job.tasks[i].upload(local_fn: str, remote_fn: str="", dont_overwrite: bool=False)
    Parameter | Required | Description | Example
    local_fn | Yes | The source path on your machine from which you want to upload files. | local_fn="/root/test_download.fn"
    remote_fn | No | The destination path to which you want to upload files. This parameter is empty by default, which indicates that files are uploaded to the same path as that specified by the local_fn parameter. | remote_fn="/root/test.txt"
    dont_overwrite | No | Specifies whether to keep existing files instead of overwriting them. Default value: False, which indicates that existing files are automatically overwritten. Set this parameter to True so that existing files are not overwritten. | dont_overwrite=True

  • Download files from the whole instance cluster or an instance to your machine.
    # Download files from the whole instance cluster.
    job.download(remote_fn, local_fn: str="")
    # Download files from Instance i.
    job.tasks[i].download(remote_fn, local_fn: str="")
    Important: If you download files from an instance cluster that contains more than two instances, file conflicts may occur. In this case, we recommend that you do not download files from the whole instance cluster.
    Parameter | Required | Description | Example
    remote_fn | Yes | The source path on the instance from which you want to download files. | remote_fn="/root/test.txt"
    local_fn | No | The destination path to which you want to download files. This parameter is empty by default, which indicates that files are downloaded to the same path as that specified by the remote_fn parameter. | local_fn="/root/test_download.fn"

Sample code: Upload a file to the whole instance cluster, and then download it from one instance to your machine.

# Upload the local file /root/test.txt to the same path, /root/test.txt, on all instances in the instance cluster.
job.upload("/root/test.txt")
# Download the file from Instance 0 to the current path of your machine.
job.tasks[0].download("/root/test.txt", "./test.txt")
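
To avoid the file conflicts mentioned above, files can be downloaded task by task into separate local directories. The following is a sketch under stated assumptions: download_from_each_task is a hypothetical helper, the task<i> directory layout is an arbitrary choice, and only task.download(remote_fn, local_fn) from the signatures above is used.

```python
import os
import posixpath


def download_from_each_task(job, remote_fn, local_dir):
    """Download remote_fn from every task into its own local subdirectory.

    Files land in <local_dir>/task<i>/<file name of remote_fn>, so copies
    from different instances do not overwrite each other.
    """
    local_paths = []
    for index, task in enumerate(job.tasks):
        target_dir = os.path.join(local_dir, f"task{index}")
        os.makedirs(target_dir, exist_ok=True)
        # Remote paths use forward slashes regardless of the client OS.
        local_fn = os.path.join(target_dir, posixpath.basename(remote_fn))
        task.download(remote_fn, local_fn)
        local_paths.append(local_fn)
    return local_paths


# Usage (requires FastGPU and an existing Job object):
#   download_from_each_task(job, "/root/test.txt", "./results")
```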

Stop instances

Stop the whole instance cluster or an instance.

# Stop all instances of the instance cluster. 
job.stop(
    keep=False, # Specifies whether to continue billing all instances of the instance cluster after they are stopped.
    force=False # Specifies whether to forcibly stop all instances.
)

# Stop Instance i.
job.tasks[i].stop(
    keep=False, # Specifies whether to continue billing the instance after it is stopped.
    force=False # Specifies whether to forcibly stop the instance.
)

Sample code:

job.stop(force=True, keep=True) # Forcibly stop all instances of the instance cluster and continue to bill the instances.

The following table describes the parameters mentioned in the preceding example.

Parameter | Description | Example
keep | Specifies whether to continue billing instances after they are stopped. Default value: False, which indicates that instances are not billed after they are stopped. | keep=True
force | Specifies whether to forcibly stop instances. Default value: False, which indicates that instances are not forcibly stopped. In this case, if the system fails to stop some instances, the process may become stuck. | force=True
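
If a script fails midway, the instances keep running unless stop is still called. One way to guard against this is a context manager. The following is a minimal sketch: stopped_on_exit is a hypothetical helper built on Python's contextlib, and it relies only on job.stop(keep=..., force=...) as shown above.

```python
from contextlib import contextmanager


@contextmanager
def stopped_on_exit(job, keep=False, force=False):
    """Yield job and call job.stop() afterwards, even if the body raises."""
    try:
        yield job
    finally:
        job.stop(keep=keep, force=force)


# Usage (requires FastGPU and an existing Job object):
#   with stopped_on_exit(job) as running_job:
#       running_job.run("ls")
```

Any exception raised inside the with block still propagates, but only after the instances have been asked to stop.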

Release instances

Permanently delete the whole instance cluster or an instance to release the resources occupied by instances.

Important: After an instance is released, its ID, public IP address, system disk, and data disks for which Release with Instance is enabled are also released and cannot be recovered. If the instance is associated with an elastic IP address (EIP), the EIP is automatically disassociated from the instance and retained. The data disks for which Release with Instance is not enabled are automatically detached from the instance and retained. Proceed with caution when you perform a release operation.
job.kill()          # Release all instances of the instance cluster.
job.tasks[i].kill() # Release an instance.

Sample code: Forcibly stop and release running instances.

# Forcibly release the whole instance cluster and all of its instances, including running instances.
job.kill(force=True)
# Release an instance that is stopped.
job.tasks[i].kill()

The following table describes the parameter mentioned in the preceding example.

Parameter | Description | Example
force | Specifies whether to forcibly stop running instances so that they can be released. Default value: False, which indicates that running instances cannot be released. | force=True
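
Because running instances cannot be released when force is False, a cleanup routine can stop the cluster first and release it afterwards. The following is a sketch under stated assumptions: stop_and_release is a hypothetical helper, and it assumes that job.stop returns only after the instances are stopped, which this document does not confirm.

```python
def stop_and_release(job, force=False):
    """Stop every instance in the cluster, then release the cluster.

    With force=False, running instances cannot be released according to the
    table above, so the instances are stopped before kill() is called.
    Releasing is permanent; see the Important note in this section.
    """
    job.stop(keep=False, force=force)
    job.kill(force=force)


# Usage (requires FastGPU and an existing Job object):
#   stop_and_release(job)
```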