cGPU can be used to isolate GPU resources so that multiple containers can share a single physical GPU. This topic describes how to install and use cGPU on GPU-accelerated instances.

Prerequisites

Before you install cGPU, make the following preparations:
  • Submit a ticket to obtain the download link to the cGPU installation package.
  • Make sure that the instance meets the following requirements:
    • The instance type is gn6i, gn6v, gn6e, gn5i, gn5, ebmgn6i, or ebmgn6e.
    • The operating system is CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, or Alibaba Cloud Linux.
    • NVIDIA driver 418.87.01 or later is installed.
    • Docker 19.03.5 or later is installed.
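
You can run the following commands on the instance as a quick sanity check of the NVIDIA driver and Docker versions. This check is optional, and the output depends on your environment:
    # nvidia-smi --query-gpu=driver_version --format=csv,noheader
    # docker --version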

Install cGPU

  1. Download and decompress the cGPU installation package.
  2. Check the files in the installation package.
    Note The files in the installation package may vary with the version of the installation package.
    # cd cgpu
    # ls
    cgpu-container-wrapper  cgpu-km.c  cgpu.o  cgpu-procfs.c  install.sh  Makefile  os-interface.c  README  uninstall.sh  upgrade.sh  version.h
  3. Install cGPU.
    sh install.sh
  4. Verify whether cGPU is installed.
    # lsmod | grep cgpu
    cgpu_km                71355  0

    If the cgpu_km module appears in the output, cGPU is installed.

Run cGPU

The following list describes the environment variables of cGPU. When you create a container, you can set these environment variables to control the computing power that the container can obtain by using cGPU.

  • CGPU_DISABLE
    Value type: Boolean
    Description: Specifies whether to enable cGPU. Valid values:
      • false: enables cGPU.
      • true: disables cGPU and uses the default NVIDIA container service.
    Example: None

  • ALIYUN_COM_GPU_MEM_DEV
    Value type: Integer
    Description: Specifies the total memory of each GPU on the GPU-accelerated instance, which depends on the instance type. Unit: GiB.
    Note The value of this variable must be an integer.
    Example: A GPU-accelerated instance of the ecs.gn6i-c4g1.xlarge instance type is equipped with one NVIDIA® Tesla® T4 graphics card. Run the nvidia-smi command on the instance to check the total memory. In this example, the check result is 15,109 MiB, so the value of this variable is set to 15 (GiB).

  • ALIYUN_COM_GPU_MEM_CONTAINER
    Value type: Integer
    Description: Specifies the size of the GPU memory to be used by the container. This variable is used together with ALIYUN_COM_GPU_MEM_DEV. If this variable is not specified or is set to 0, the default NVIDIA container service is used instead of cGPU.
    Example: On a GPU with a total memory of 15 GiB, set ALIYUN_COM_GPU_MEM_DEV to 15 and ALIYUN_COM_GPU_MEM_CONTAINER to 1. Then, 1 GiB of GPU memory is allocated to the container.

  • ALIYUN_COM_GPU_VISIBLE_DEVICES
    Value type: Integer or UUID
    Description: Specifies the GPUs to be used by the container.
    Example: On a GPU-accelerated instance configured with four GPUs, run the nvidia-smi -L command to query the device numbers and UUIDs of the GPUs. A response similar to the following one is returned:
      GPU 0: Tesla T4 (UUID: GPU-b084ae33-e244-0959-cd97-83****)
      GPU 1: Tesla T4 (UUID: GPU-3eb465ad-407c-4a23-0c5f-bb****)
      GPU 2: Tesla T4 (UUID: GPU-2fce61ea-2424-27ec-a2f1-8b****)
      GPU 3: Tesla T4 (UUID: GPU-22401369-db12-c6ce-fc48-d7****)
    Then, set the environment variable in one of the following ways:
      • Set ALIYUN_COM_GPU_VISIBLE_DEVICES to 0,1. The first and second GPUs are allocated to the container.
      • Set ALIYUN_COM_GPU_VISIBLE_DEVICES to GPU-b084ae33-e244-0959-cd97-83****,GPU-3eb465ad-407c-4a23-0c5f-bb****,GPU-2fce61ea-2424-27ec-a2f1-8b****. The three GPUs with the specified UUIDs are allocated to the container.

  • ALIYUN_COM_GPU_SCHD_WEIGHT
    Value type: Integer
    Description: Specifies the weight of the GPU computing power for the container. Valid values: 1 to min(max_inst, 16).
    Example: None
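
For example, the following command sketch shows how these variables can be combined when you create a container. The image name is a placeholder and the values are only illustrative for a four-GPU instance; adjust them to your own environment:
    docker run -d -t --gpus all --privileged --name cgpu_example \
      -e ALIYUN_COM_GPU_MEM_DEV=15 \
      -e ALIYUN_COM_GPU_MEM_CONTAINER=4 \
      -e ALIYUN_COM_GPU_VISIBLE_DEVICES=0,1 \
      -e ALIYUN_COM_GPU_SCHD_WEIGHT=2 \
      <your-image>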

The following example shows how cGPU is used to enable two containers to share one GPU on a GPU-accelerated instance of the ecs.gn6i-c4g1.xlarge instance type.

  1. Run the following commands to create containers and specify the memory to be used by the containers:
    docker run -d -t --gpus all --privileged --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --name gpu_test1 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_MEM_DEV=15 nvcr.io/nvidia/tensorflow:19.10-py3
    docker run -d -t --gpus all --privileged --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --name gpu_test2 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=8 -e ALIYUN_COM_GPU_MEM_DEV=15 nvcr.io/nvidia/tensorflow:19.10-py3
    Note In the preceding commands, the TensorFlow image nvcr.io/nvidia/tensorflow:19.10-py3 is used. Replace it with your own container image. For more information about how to use the TensorFlow image to build a TensorFlow deep learning framework, see Deploy an NGC environment on instances with GPU capabilities.
    In this example, the ALIYUN_COM_GPU_MEM_DEV environment variable specifies the total memory of the GPU, and the ALIYUN_COM_GPU_MEM_CONTAINER environment variable specifies the GPU memory to be used by each container. Two containers are created:
    • gpu_test1: 6 GiB of the GPU memory is allocated.
    • gpu_test2: 8 GiB of the GPU memory is allocated.
  2. Access a container.
    In this example, access gpu_test1.
    docker exec -it gpu_test1 bash
  3. Run the following command to query GPU information such as the memory:
    nvidia-smi
    The output shows that the GPU memory that the gpu_test1 container can use is 6,043 MiB.
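    To check only the memory values, you can also query them directly. This assumes that nvidia-smi is available inside the container image, as it is in the image used in this example:
    nvidia-smi --query-gpu=memory.total,memory.used --format=csv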

View procfs nodes

At runtime, cGPU generates and manages multiple procfs nodes in the /proc/cgpu_km directory. You can view and configure cGPU information based on the procfs nodes. The following section describes how each procfs node is used.

  1. Run the following command to view the nodes:
    # ls /proc/cgpu_km/
    0  default_memsize  inst_ctl  major  version
    The following table describes the nodes.
    Node Read/write type Description
    0 Read and write cGPU generates a directory for each GPU on the GPU-accelerated instance, and uses numbers such as 0, 1, and 2 as the directory names. In this example, a single GPU is available and the corresponding directory ID is 0.
    default_memsize Read and write The default memory size that is allocated to a newly created container if the ALIYUN_COM_GPU_MEM_CONTAINER parameter is not specified.
    inst_ctl Read and write The control node.
    major Read only The primary device number of the cGPU kernel driver.
    version Read only The cGPU version.
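    For example, the version and the default container memory size can be read directly from these nodes. The exact values depend on your cGPU version and configuration:
    # cat /proc/cgpu_km/version
    # cat /proc/cgpu_km/default_memsize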
  2. Run the following command to view the directory corresponding to the GPU.
    In this example, GPU 0 is used.
    # ls /proc/cgpu_km/0
    012b2edccd7a   0852a381c0cf   free_weight    max_inst    policy
    The following table describes the content of the directory.
    Node Read/write type Description
    Directory corresponding to the container Read and write cGPU generates a directory for each container that runs on the GPU-accelerated instance, and uses the container IDs as the directory names. You can run the docker ps command to query the created containers.
    free_weight Read only The remaining weight of GPU computing power that is available to new containers. If free_weight is 0, the weight of the GPU computing power for a newly created container is 0, and the container cannot obtain any GPU computing power or be used to run applications that require GPU computing power.
    max_inst Read and write The maximum number of containers. Valid values: 1 to 16.
    policy Read and write cGPU supports the following computing power scheduling policies:
    • 0: fair-share scheduling. Each container occupies a fixed time slice. The percentage of the time slice is 1/max_inst.
    • 1: preemptive scheduling. Each container occupies as many time slices as possible. The percentage of the time slices is 1/Number of containers.
    • 2: weight-based preemptive scheduling. When the ALIYUN_COM_GPU_SCHD_WEIGHT value is greater than 1, weight-based preemptive scheduling is used.

    You can change the scheduling policy by modifying the policy value. For more information, see Examples of computing power scheduling on a GPU.
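    For example, you can read the current policy of GPU 0 before you change it. The value that is returned corresponds to one of the policies listed above:
    # cat /proc/cgpu_km/0/policy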

  3. Run the following command to view the directory corresponding to a container:
    In this example, the 012b2edccd7a container is used.
    # ls /proc/cgpu_km/0/012b2edccd7a
    id  meminfo  memsize  weight
    The following table describes the content of the directory.
    Node Read/write type Description
    id Read only The ID of the container.
    memsize Read and write The GPU memory size allocated to the container. cGPU sets this value based on the ALIYUN_COM_GPU_MEM_CONTAINER parameter.
    meminfo Read only The remaining GPU memory in the container, the ID of the process that is using the GPU, and the GPU memory usage of the process. Example:
    Free: 6730809344
    PID: 19772 Mem: 200278016
    weight Read and write The weight of the maximum GPU computing power for the container. Default value: 1. The sum of the weights of all running containers must be less than or equal to max_inst.
After you have understood how procfs nodes are used, you can run commands on the GPU-accelerated instance to perform operations such as changing the scheduling policy and modifying the weight. The following table lists some sample commands.
Command Effect
echo 2 > /proc/cgpu_km/0/policy Changes the scheduling policy to weight-based preemptive scheduling.
cat /proc/cgpu_km/0/free_weight Returns the available weight on the GPU. If free_weight is set to 0, the weight of the computing power to be used by the new container is 0. The container cannot obtain any GPU computing power or be used to run applications that require GPU computing power.
cat /proc/cgpu_km/0/$dockerid/weight Returns the weight allocated to a specific container.
echo 4 > /proc/cgpu_km/0/$dockerid/weight Modifies the weight allocated to a specific container.
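
As a convenience, the preceding commands can be combined into a short loop that prints the weight of every container on GPU 0. This is only a sketch; it assumes that the directory names under /proc/cgpu_km/0 match the container IDs reported by docker ps, as described above:
    for d in /proc/cgpu_km/0/*/; do
      id=$(basename "$d")
      echo "container ${id}: weight $(cat "${d}weight")"
    done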

Upgrade cGPU

  1. Stop all running containers.
    docker stop $(docker ps -a | awk '{print $1}' | tail -n +2)
  2. Run the upgrade.sh script to upgrade cGPU to the latest version.
    sh upgrade.sh

Uninstall cGPU

  1. Stop all running containers.
    docker stop $(docker ps -a | awk '{print $1}' | tail -n +2)
  2. Run the uninstall.sh script to uninstall cGPU.
    sh uninstall.sh
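    To confirm that cGPU has been removed, you can check that the kernel module is no longer loaded. No output from the following command indicates that the module is gone:
    lsmod | grep cgpu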

Examples of computing power scheduling on a GPU

When cGPU loads the cgpu_km module, cGPU sets time slices (X ms) for each GPU based on the maximum number of containers (max_inst) to allocate the GPU computing power for the containers. In the examples, time slices such as Slice 1, Slice 2, and Slice N are used. The following examples demonstrate how the GPU computing power is scheduled by using different scheduling policies.
  • Fair-share scheduling
    Time slices are allocated when the containers are created. cGPU starts scheduling from Slice 1. The scheduling task is submitted to the physical GPU and executed in the container within the time slice (X ms). Then, cGPU moves to the next time slice. Each container obtains the same amount of computing power, which is 1/max_inst. (Figure: fair-share scheduling)
  • Preemptive scheduling

    Time slices are allocated when the containers are created. cGPU starts scheduling from Slice 1. However, if a container is not used or does not have GPU-enabled processes, cGPU skips scheduling and moves to the next time slice.

    Example:
    1. A single container Docker 1 is created and allocated with Slice 1. Two TensorFlow processes run in Docker 1. In this case, Docker 1 can obtain the computing power of the entire physical GPU.
    2. Another container Docker 2 is created and allocated with Slice 2. If no GPU-enabled processes exist in Docker 2, cGPU skips scheduling for Docker 2.
    3. When a process in Docker 2 has GPU enabled, both Slice 1 and Slice 2 are scheduled. Each of Docker 1 and Docker 2 obtains up to half of the computing power of the physical GPU. (Figure: preemptive scheduling)
  • Weight-based preemptive scheduling

    If ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1 during container creation, weight-based preemptive scheduling is used. cGPU divides the computing power of the physical GPU into max_inst portions based on the maximum number of containers (max_inst). If the ALIYUN_COM_GPU_SCHD_WEIGHT value is greater than 1, cGPU combines multiple time slices into one larger time slice and allocates it to the container.

    Example:
    • Docker 1: ALIYUN_COM_GPU_SCHD_WEIGHT=m
    • Docker 2: ALIYUN_COM_GPU_SCHD_WEIGHT=n
    Scheduling results:
    • If only Docker 1 is running, Docker 1 obtains the computing power of the entire physical GPU.
    • If both Docker 1 and Docker 2 are running, they obtain the computing power based on a theoretical ratio of m:n. Docker 2 consumes n time slices even if it does not have a GPU-enabled process. This is different from the case in preemptive scheduling.
      Note The running performance of the containers differs when m:n is set to 2:1 and 8:4. The number of time slices within one second when m:n is set to 2:1 is four times that when m:n is set to 8:4.
    (Figure: weight-based preemptive scheduling)
    Weight-based preemptive scheduling limits the theoretical maximum amount of GPU computing power that a container can obtain. However, for graphics cards such as NVIDIA® V100 that have strong computing power, a computing task may be completed within a single time slice if the GPU memory is not fully used. In this case, if m:n is set to 8:4, the GPU computing power becomes idle during the remaining time slices and the limit becomes ineffective. We recommend that you set appropriate weights based on the computing power of your GPUs. Examples:
    • If you use NVIDIA® V100 graphics cards, set m:n to 2:1 to reduce the weight and prevent the GPU computing power from becoming idle.
    • If you use NVIDIA® Tesla® T4 graphics cards, set m:n to 8:4 to increase the weight and ensure that sufficient GPU computing power is allocated.
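
    The following sketch shows how such a weight ratio might be applied in practice, using a 2:1 ratio as an example. The image name is a placeholder and the memory values are only illustrative; the policy value 2 enables weight-based preemptive scheduling as described in the procfs section:
    # Create two containers with weights 2 and 1 (m:n = 2:1).
    docker run -d -t --gpus all --privileged --name docker1 \
      -e ALIYUN_COM_GPU_MEM_DEV=15 -e ALIYUN_COM_GPU_MEM_CONTAINER=6 \
      -e ALIYUN_COM_GPU_SCHD_WEIGHT=2 <your-image>
    docker run -d -t --gpus all --privileged --name docker2 \
      -e ALIYUN_COM_GPU_MEM_DEV=15 -e ALIYUN_COM_GPU_MEM_CONTAINER=6 \
      -e ALIYUN_COM_GPU_SCHD_WEIGHT=1 <your-image>
    # Set the scheduling policy of GPU 0 to weight-based preemptive scheduling (policy 2).
    echo 2 > /proc/cgpu_km/0/policy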