cGPU allows you to isolate GPU resources so that multiple containers can share a single GPU. cGPU is provided as a component of Alibaba Cloud Container Service for Kubernetes (ACK). This topic describes how to install and use cGPU on a GPU-accelerated instance.
Prerequisites
Before you start, make sure that the GPU-accelerated instance meets the following requirements:
The instance belongs to one of the following instance families: gn7i, gn6i, gn6v, gn6e, gn5i, gn5, ebmgn7i, ebmgn6i, ebmgn7e, and ebmgn6e.
The operating system of the GPU-accelerated instance is CentOS, Ubuntu, or Alibaba Cloud Linux.
The instance is installed with an NVIDIA driver whose version is 418.87.01 or later.
The instance is installed with Docker 19.03.5 or later. You can verify the driver and Docker versions by running the commands shown below.
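The following commands are standard NVIDIA and Docker commands and are not specific to cGPU. The header of the nvidia-smi output shows the installed driver version.
nvidia-smi
docker --version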
Background information
If you use cGPU to isolate GPU resources, you cannot request GPU memory by using Unified Virtual Memory (UVM). This means that you cannot request GPU memory by calling the cudaMallocManaged() function of the Compute Unified Device Architecture (CUDA) API. Instead, request GPU memory by using other methods, such as the cudaMalloc() function. For more information, see Unified Memory for CUDA Beginners.
Install cGPU
We recommend that you install and use cGPU by using the Docker runtime environment of ACK, regardless of whether you are an enterprise user or an individual user. For more information, see Install and use ack-ai-installer and the GPU inspection tool.
Run cGPU
The following table describes the environment variables of cGPU. When you create a container, you can specify values for the environment variables to adjust the computing power that the container can obtain by using cGPU.
Environment variable | Type | Description | Example |
CGPU_DISABLE | Boolean | Specifies whether to disable cGPU. Valid values: false (default value): enables cGPU. true: disables cGPU and uses the default NVIDIA container service instead. | None. |
ALIYUN_COM_GPU_MEM_DEV | Integer | The total memory of each GPU on the GPU-accelerated instance. The value of this variable varies based on the instance type. Note: The value of this variable must be an integer. Unit: GiB. | A GPU-accelerated instance of the ecs.gn6i-c4g1.xlarge instance type is configured with an NVIDIA® Tesla® T4 GPU. If you run the nvidia-smi command on the instance, the command output shows that the total memory of the GPU is 15,109 MiB. In this case, set this variable to the rounded value 15. |
ALIYUN_COM_GPU_MEM_CONTAINER | Integer | The GPU memory that is allocated to the container. This variable is used together with ALIYUN_COM_GPU_MEM_DEV. If this variable is not specified or is set to 0, the default NVIDIA container service is used instead of cGPU. | If the total memory of the GPU is 15 GiB, set ALIYUN_COM_GPU_MEM_DEV to 15 and set ALIYUN_COM_GPU_MEM_CONTAINER to the amount of GPU memory, in GiB, that you want to allocate to the container. |
ALIYUN_COM_GPU_VISIBLE_DEVICES | Integer or UUID | The GPUs that are allocated to the container. | If you use a GPU-accelerated instance that is configured with four GPUs, run the nvidia-smi -L command to view the device numbers and UUIDs of the GPUs. Then, set this variable to the device numbers or UUIDs of the GPUs that you want to allocate to the container. |
ALIYUN_COM_GPU_SCHD_WEIGHT | Integer | The weight based on which the container obtains computing power. Valid values: 1 to max_inst. | None. |
ALIYUN_COM_GPU_HIGH_PRIO | Integer | Specifies whether the container has a high priority. Default value: 0. Valid values: 0: The container has a normal priority and obtains computing power based on the scheduling policy. 1: The container has a high priority and can preempt computing power up to the percentage specified by prio_ratio. We recommend that you configure at least one high-priority container for each GPU. This way, the computing power of GPUs is allocated to high-priority containers based on the scheduling policy that is specified by the policy parameter. | 0 |
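The following command sketches how these environment variables can be combined when you create a container. This sketch is for illustration only: the GPU selection, weight, priority, memory values, and image are assumptions that you must adapt to your instance.
docker run -d -t --gpus all --name gpu_high_prio -e ALIYUN_COM_GPU_VISIBLE_DEVICES=0 -e ALIYUN_COM_GPU_MEM_DEV=15 -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_SCHD_WEIGHT=2 -e ALIYUN_COM_GPU_HIGH_PRIO=1 nvcr.io/nvidia/tensorflow:19.10-py3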
The following example shows how to use cGPU to allow two containers to share a single GPU on a GPU-accelerated instance of the ecs.gn6i-c4g1.xlarge instance type.
Run the following commands to create containers and specify the GPU memory that is allocated to the containers:
docker run -d -t --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --name gpu_test1 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_MEM_DEV=15 nvcr.io/nvidia/tensorflow:19.10-py3
docker run -d -t --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --name gpu_test2 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=8 -e ALIYUN_COM_GPU_MEM_DEV=15 nvcr.io/nvidia/tensorflow:19.10-py3
Note: In the preceding commands, the TensorFlow image nvcr.io/nvidia/tensorflow:19.10-py3 is used. Replace the image with your own container image based on your business requirements. For more information about how to use the TensorFlow image to build a TensorFlow deep learning environment, see Deploy an NGC environment on a GPU-accelerated instance.
In this example, two variables are configured. ALIYUN_COM_GPU_MEM_CONTAINER specifies the GPU memory that is allocated to the containers. ALIYUN_COM_GPU_MEM_DEV specifies the total memory of the GPU. After you run the preceding commands, two containers are created:
gpu_test1: 6 GiB of the GPU memory is allocated.
gpu_test2: 8 GiB of the GPU memory is allocated.
Run the following command to view GPU information such as the GPU memory:
In this example, the gpu_test1 container is used.
docker exec -i gpu_test1 nvidia-smi
The following figure shows that the GPU memory that is allocated to the gpu_test1 container is 6,043 MiB.
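You can run the same command for the gpu_test2 container. In this case, the output is expected to show approximately 8 GiB of GPU memory.
docker exec -i gpu_test2 nvidia-smi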
View procfs nodes
At runtime, cGPU generates multiple proc filesystem (procfs) nodes in the /proc/cgpu_km directory and manages the nodes. You can view cGPU information and configure cGPU settings by using the procfs nodes. The following section describes how to use the procfs nodes.
Run the following command to view the information about the procfs nodes:
ls /proc/cgpu_km/
The following command output is returned:
0 default_memsize inst_ctl upgrade version
The following table describes the information about the procfs nodes.
Node | Read/Write type | Description |
0 | Read and write | cGPU generates a directory for each GPU on the GPU-accelerated instance. The directories are numbered starting from 0. In this example, a single GPU is used and its directory number is 0. |
default_memsize | Read and write | The memory that is allocated to a newly created container if ALIYUN_COM_GPU_MEM_CONTAINER is not specified. |
inst_ctl | Read and write | The control node. |
upgrade | Read and write | Controls hot upgrades of the cGPU service. |
version | Read-only | The version of cGPU. |
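For example, you can change the default memory that is allocated to containers for which ALIYUN_COM_GPU_MEM_CONTAINER is not specified by writing a value to the default_memsize node. This is a hedged example: the value 6 and the unit (GiB) are assumptions.
echo 6 > /proc/cgpu_km/default_memsize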
Run the following command to view the items in the directory of the GPU:
In this example, the directory items for the GPU in slot 0 are obtained.
ls /proc/cgpu_km/0
The following command output is returned:
012b2edccd7a 0852a381c0cf free_weight major max_inst policy prio_ratio
The following table describes the nodes of the directory.
Node | Read/Write type | Description |
Directory of the container | Read and write | cGPU generates a directory for each container that runs on the GPU-accelerated instance and uses the container ID as the directory name. You can run the docker ps command to query the containers that you created. |
free_weight | Read-only | The available weight of the GPU. If free_weight is 0, the weight based on which a newly created container obtains GPU computing power is 0. In this case, the newly created container cannot obtain GPU computing power and cannot be used to run applications that require GPU computing power. |
major | Read-only | The major device number of the cGPU kernel driver. Each major device number indicates a different device type. |
max_inst | Read and write | The maximum number of containers. Valid values: 1 to 25. |
policy | Read and write | The scheduling policy for computing power. Valid values: 0: fair-share scheduling. Each container occupies a fixed time slice, and the proportion of the time slice is 1/max_inst. 1: preemptive scheduling. Each container occupies as many time slices as possible, and the proportion of the time slices is 1/Number of running containers. 2: weight-based preemptive scheduling. This policy is used when ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1. 3: fixed scheduling. Computing power is scheduled at a fixed percentage. 4: soft scheduling. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner. 5: native scheduling. The built-in scheduling policy of the GPU driver. You can change the value of this node to adjust the scheduling policy. For more information, see the "Examples of computing power scheduling by using cGPU" section of this topic. |
prio_ratio | Read and write | The maximum percentage of computing power that a high-priority container can preempt in a scenario where online and offline workloads are mixed. Valid values: 20 to 99. |
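For example, in a scenario where online and offline workloads are mixed, you can allow high-priority containers to preempt up to 90% of the computing power of the GPU in slot 0 by writing to the prio_ratio node. The value 90 is an assumption for illustration; choose a value from 20 to 99 based on your workloads.
echo 90 > /proc/cgpu_km/0/prio_ratio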
Run the following command to view the directory of a container:
In this example, the 012b2edccd7a container is used.
ls /proc/cgpu_km/0/012b2edccd7a
The following command output is returned:
highprio id meminfo memsize weight
The following table describes the nodes of the directory.
Node | Read/Write type | Description |
highprio | Read and write | The priority of the container. Default value: 0. If ALIYUN_COM_GPU_HIGH_PRIO is set to 1, the maximum computing power that the container can preempt is specified by prio_ratio. Note: This takes effect only in scenarios where online and offline workloads are mixed. For more information, see the "README" in Install cGPU. |
id | Read-only | The ID of the container. |
memsize | Read and write | The GPU memory of the container. cGPU generates a value for this node based on the value of ALIYUN_COM_GPU_MEM_DEV. |
meminfo | Read-only | The information about the GPU memory, including the remaining GPU memory in the container, the ID of the process that is using the GPU, and the GPU memory usage of the process. Example output: Free: 6730809344 PID: 19772 Mem: 200278016 |
weight | Read and write | The weight based on which the container obtains the maximum GPU computing power. Default value: 1. The sum of the weights of all running containers cannot exceed the value of max_inst. |
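For example, you can query the remaining GPU memory of a container by reading its meminfo node, where $dockerid is the directory name of the container:
cat /proc/cgpu_km/0/$dockerid/meminfo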
You can run commands to perform operations on the GPU-accelerated instance. For example, you can change the scheduling policy and modify the weight. The following table describes sample commands.
Command | Description |
echo 2 > /proc/cgpu_km/0/policy | Changes the scheduling policy to weight-based preemptive scheduling. |
cat /proc/cgpu_km/0/free_weight | Queries the available weight of the GPU. If free_weight is 0, newly created containers cannot obtain GPU computing power. |
cat /proc/cgpu_km/0/$dockerid/weight | Queries the weight of a specified container. |
echo 4 > /proc/cgpu_km/0/$dockerid/weight | Changes the weight based on which the container obtains the GPU computing power. |
Update cGPU
You can perform cold or hot upgrades to upgrade the cGPU service.
Cold upgrade
If the cGPU service is disabled, we recommend that you perform a cold upgrade.
Run the following command to stop all running containers.
docker stop $(docker ps -a | awk '{ print $1}' | tail -n +2)
Run the upgrade.sh script to upgrade cGPU to the latest version.
sh upgrade.sh
Hot upgrade
If the cGPU service is enabled, we recommend that you perform a hot upgrade.
Uninstall cGPU
For more information about how to uninstall the cGPU service of an earlier version from a node, see Upgrade the cGPU version on a node by using a CLI.
Use cgpu-smi to monitor containers
You can use cgpu-smi to view the information about a container for which cGPU is used. The information includes the container ID, GPU utilization, computing power limit, GPU memory usage, and total allocated memory.
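For example, after cGPU is installed, you can run the following command on the GPU-accelerated instance to view the monitoring information. This assumes that the cgpu-smi tool is available in your PATH.
cgpu-smi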
cgpu-smi provides sample monitoring information for cGPU. When you deploy Kubernetes applications, you can refer to or use this sample monitoring information for custom development and integration.
The following figure shows the sample monitoring information.
Examples of computing power scheduling by using cGPU
When cGPU loads the cgpu_km module, cGPU sets time slices (X ms) for each GPU based on the maximum number of containers (max_inst) to allocate GPU computing power to the containers. In the following examples, Slice 1, Slice 2, and Slice N time slices are used. The following examples show how the GPU computing power is allocated by using different scheduling policies.
Fair-share scheduling (policy = 0)
When you create containers, cGPU allocates time slices to the containers. cGPU starts scheduling from Slice 1. The scheduling task is submitted to the physical GPU and executed in the container within a time slice (X ms). Then, cGPU moves to the next time slice. Each container obtains the same computing power, which is 1/max_inst. The following figure shows the details.
Preemptive scheduling (policy = 1)
When you create containers, cGPU allocates time slices to the containers. cGPU starts scheduling from Slice 1. However, if the container to which Slice 1 is allocated is not in use, or if no process in the container uses the GPU, cGPU skips scheduling within Slice 1 and moves to the next time slice.
Examples:
The container Docker 1 is created and Slice 1 is allocated to this container. Two TensorFlow processes run in Docker 1. In this case, Docker 1 can obtain the computing power of the entire physical GPU.
The container Docker 2 is created and Slice 2 is allocated to this container. If no process in Docker 2 uses the GPU, cGPU skips scheduling for Docker 2 within Slice 2.
When a process in Docker 2 starts to use the GPU, cGPU performs scheduling within both Slice 1 and Slice 2. In this case, Docker 1 and Docker 2 can each obtain up to half of the computing power of the physical GPU. The following figure shows the details.
Weight-based preemptive scheduling (policy = 2)
If ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1 when you create a container, weight-based preemptive scheduling is used. cGPU divides the computing power of the physical GPU into max_inst portions. When ALIYUN_COM_GPU_SCHD_WEIGHT is greater than 1, cGPU combines multiple time slices into a larger time slice and allocates the larger time slice to the container.
Examples:
Docker 1: ALIYUN_COM_GPU_SCHD_WEIGHT = m
Docker 2: ALIYUN_COM_GPU_SCHD_WEIGHT = n
Scheduling results:
If only Docker 1 is running, Docker 1 preempts the computing power of the entire physical GPU.
If Docker 1 and Docker 2 are running, the containers obtain computing power at a theoretical ratio of m:n. Unlike in preemptive scheduling, Docker 2 consumes its n time slices even if no process in Docker 2 uses the GPU.
Note: The running performance of the containers differs when m:n is set to 2:1 and when it is set to 8:4. The number of time slices scheduled within one second when m:n is set to 2:1 is four times the number of time slices scheduled within one second when m:n is set to 8:4.
Weight-based preemptive scheduling limits the theoretical maximum GPU computing power that a container can obtain. However, for GPUs that have high computing power, such as the NVIDIA® V100 GPU, a computing task may be completed within a single time slice even if only a small amount of GPU memory is used. In this case, if m:n is set to 8:4, the GPU computing power becomes idle during the remaining time slices, and the limit on the theoretical maximum GPU computing power does not take effect.
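The following commands sketch how two containers with weights of m = 2 and n = 1 could be created on the GPU in slot 0. The weights, memory values, container names, and image are assumptions for illustration.
echo 2 > /proc/cgpu_km/0/policy
docker run -d -t --gpus all --name docker1 -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_MEM_DEV=15 -e ALIYUN_COM_GPU_SCHD_WEIGHT=2 nvcr.io/nvidia/tensorflow:19.10-py3
docker run -d -t --gpus all --name docker2 -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_MEM_DEV=15 -e ALIYUN_COM_GPU_SCHD_WEIGHT=1 nvcr.io/nvidia/tensorflow:19.10-py3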
Fixed scheduling (policy = 3)
You can use ALIYUN_COM_GPU_SCHD_WEIGHT together with max_inst to fix the percentage of computing power that a container obtains.
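For example, the following sketch fixes the computing power of a container at 30% of the GPU in slot 0 by dividing the GPU into 10 portions and assigning a weight of 3 to the container. The values 10 and 3, the container name, and the memory values are assumptions for illustration.
echo 10 > /proc/cgpu_km/0/max_inst
echo 3 > /proc/cgpu_km/0/policy
docker run -d -t --gpus all --name gpu_fixed -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_MEM_DEV=15 -e ALIYUN_COM_GPU_SCHD_WEIGHT=3 nvcr.io/nvidia/tensorflow:19.10-py3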
Soft scheduling (policy = 4)
When you create containers, cGPU allocates time slices to the containers. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner. For more information, see the Preemptive scheduling (policy = 1) section of this topic.
Native scheduling (policy = 5)
You can use this policy to isolate only GPU memory. When this policy is used, computing power is scheduled based on the built-in scheduling methods of NVIDIA GPU drivers.
The scheduling policies for computing power are supported on all Alibaba Cloud heterogeneous GPU-accelerated instances and on the NVIDIA GPUs that are used by these instances, including Tesla P4, Tesla P100, Tesla T4, Tesla V100, and A10. In the following example, two containers that share a GPU-accelerated instance configured with an A10 GPU are tested. The computing power ratio of the containers is 1:2, and each container obtains 12 GB of GPU memory.
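A setup similar to the following commands can produce this test configuration. This is a sketch only: the ALIYUN_COM_GPU_MEM_DEV value of 24, the container names, and the image are assumptions.
echo 2 > /proc/cgpu_km/0/policy
docker run -d -t --gpus all --name docker1 -e ALIYUN_COM_GPU_MEM_CONTAINER=12 -e ALIYUN_COM_GPU_MEM_DEV=24 -e ALIYUN_COM_GPU_SCHD_WEIGHT=1 nvcr.io/nvidia/tensorflow:21.03-tf1-py3
docker run -d -t --gpus all --name docker2 -e ALIYUN_COM_GPU_MEM_CONTAINER=12 -e ALIYUN_COM_GPU_MEM_DEV=24 -e ALIYUN_COM_GPU_SCHD_WEIGHT=2 nvcr.io/nvidia/tensorflow:21.03-tf1-py3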
The following performance test data is provided for reference only.
Test 1: Compare the performance data at different batch_size values. In this scenario, the precision of the ResNet50 model that is trained by using the TensorFlow framework is FP16. The following table describes the test results.
Framework | Model | batch_size | Precision | Images per second (Docker 1) | Images per second (Docker 2) |
TensorFlow | ResNet50 | 16 | FP16 | 151 | 307 |
TensorFlow | ResNet50 | 32 | FP16 | 204 | 418 |
TensorFlow | ResNet50 | 64 | FP16 | 247 | 503 |
TensorFlow | ResNet50 | 128 | FP16 | 257 | 516 |
Test 2: Compare the performance data at different batch_size values. In this scenario, the ResNet50 model runs inference at FP16 precision by using the TensorRT framework. The following table describes the test results.
Framework | Model | batch_size | Precision | Images per second (Docker 1) | Images per second (Docker 2) |
TensorRT | ResNet50 | 1 | FP16 | 568.05 | 1132.08 |
TensorRT | ResNet50 | 2 | FP16 | 940.36 | 1884.12 |
TensorRT | ResNet50 | 4 | FP16 | 1304.03 | 2571.91 |
TensorRT | ResNet50 | 8 | FP16 | 1586.87 | 3055.66 |
TensorRT | ResNet50 | 16 | FP16 | 1783.91 | 3381.72 |
TensorRT | ResNet50 | 32 | FP16 | 1989.28 | 3695.88 |
TensorRT | ResNet50 | 64 | FP16 | 2105.81 | 3889.35 |
TensorRT | ResNet50 | 128 | FP16 | 2205.25 | 3901.94 |
In the preceding test results, the ratio of the performance data between Docker 1 and Docker 2 is close to 1:2 for both frameworks, which shows that the scheduling policy algorithm also supports GPUs of the Ampere architecture. In a similar test, in which the ResNet50 model was used for inference with the TensorFlow framework on T4 and P4 GPUs, the ratio of the performance data was also close to 1:2.
These results cover GPUs that provide different levels of computing power. You can use cGPU to specify the computing power that you want to allocate to each container. This prevents containers from preempting computing power and ensures that high-priority containers obtain more computing power. The scheduling policy algorithm for computing power supports NVIDIA GPU architectures such as Ampere, Volta, Turing, and Pascal. You can use the algorithm to create custom scheduling policies that suit your business scenarios.
Example of memory allocation of multiple GPUs when using cGPU
In this example, GPU 0 is allocated 3 GB, GPU 1 is allocated 4 GB, GPU 2 is allocated 5 GB, and GPU 3 is allocated 6 GB. The following code provides an example of memory allocation of multiple GPUs:
docker run -d -t --runtime=nvidia --name gpu_test0123 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=3,4,5,6 -e ALIYUN_COM_GPU_MEM_DEV=23 -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 nvcr.io/nvidia/tensorflow:21.03-tf1-py3
docker exec -i gpu_test0123 nvidia-smi
The following figure provides an example of the command output. You can view the memory details of the four GPUs.
You can use the ALIYUN_COM_GPU_MEM_CONTAINER parameter to specify how GPU memory is allocated across multiple GPUs. The following table describes the values of the ALIYUN_COM_GPU_MEM_CONTAINER parameter.
Parameter value | Description |
ALIYUN_COM_GPU_MEM_CONTAINER=3 | Indicates that the memory of all four GPUs is set to 3 GB. |
ALIYUN_COM_GPU_MEM_CONTAINER=3,1 | Indicates that the memory of the four GPUs is set to 3 GB, 1 GB, 1 GB, and 1 GB in sequence. |
ALIYUN_COM_GPU_MEM_CONTAINER=3,4,5,6 | Indicates that the memory of the four GPUs is set to 3 GB, 4 GB, 5 GB, and 6 GB in sequence. |
ALIYUN_COM_GPU_MEM_CONTAINER not specified | Indicates that the cGPU service is disabled. |
ALIYUN_COM_GPU_MEM_CONTAINER=0 | Indicates that the cGPU service is disabled. |
ALIYUN_COM_GPU_MEM_CONTAINER=1,0,0 | Indicates that the cGPU service is disabled. |