This topic describes how to disable the memory isolation capability of cGPU for a Container Service for Kubernetes (ACK) cluster by using a sample application.
Scenarios
This topic applies to ACK dedicated clusters and ACK Pro clusters that have cGPU enabled.
Prerequisites
The ack-cgpu component is installed in your cluster. For more information, see "Install ack-cgpu" and "Install and use ack-ai-installer and the GPU inspection tool".
Procedure
Verify that memory isolation is disabled
You can use one of the following methods to check whether the memory isolation capability of cGPU is disabled:
- Method 1: Run the following command to query the application log:
kubectl logs disable-cgpu-xxxx --tail=1
Expected output:
2020-08-25 08:14:54.927965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15024 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)
The returned log entry shows that the containerized application can use 15,024 MB of GPU memory, which is nearly the full capacity of the 16 GB GPU. This indicates that the memory isolation capability of cGPU is disabled. If the memory isolation capability of cGPU were enabled, the containerized application would discover only 3 GiB of GPU memory.
- Method 2: Run the following command to execute nvidia-smi in the container and view the amount of GPU memory that is available to the container:
kubectl exec disable-cgpu-xxxx -- nvidia-smi
Expected output:
Tue Aug 25 08:23:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output shows that the container sees the full GPU memory capacity of the host, 16,130 MiB, of which 15,453 MiB is in use. This indicates that the memory isolation capability of cGPU is disabled. If the memory isolation capability of cGPU were enabled, the container would see only 3 GiB of GPU memory.
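Both checks can be scripted if you need to repeat them across pods. The following is a minimal sketch in POSIX shell, not part of ack-cgpu. The helper names tf_device_mem_mb, smi_mem_mib, and isolation_state are hypothetical, and the 4096 MiB threshold is an assumption: the 3 GiB limit from this topic plus some headroom. It runs against the sample output shown in this topic rather than against a live cluster.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of ack-cgpu.

# Print the memory figure (MB) from a TensorFlow
# "Created TensorFlow device" log line read from stdin.
tf_device_mem_mb() {
  sed -n 's#.*Created TensorFlow device.* with \([0-9][0-9]*\) MB memory.*#\1#p' | head -n 1
}

# Print "<used> <total>" (MiB) from the Memory-Usage cell of
# nvidia-smi table output read from stdin (first GPU only).
smi_mem_mib() {
  sed -n 's#.*| *\([0-9][0-9]*\)MiB */ *\([0-9][0-9]*\)MiB *|.*#\1 \2#p' | head -n 1
}

# Print "disabled" if the visible memory ($1) exceeds the
# assumed isolation limit ($2), otherwise print "enabled".
isolation_state() {
  if [ "$1" -gt "$2" ]; then echo disabled; else echo enabled; fi
}

# Assumed threshold: 3 GiB limit plus headroom, in MiB.
LIMIT_MIB=4096

# Sample lines copied from the expected output in this topic.
tf_line='2020-08-25 08:14:54.927965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15024 MB memory)'
smi_line='| N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |'

tf_mb=$(printf '%s\n' "$tf_line" | tf_device_mem_mb)
smi=$(printf '%s\n' "$smi_line" | smi_mem_mib)
smi_total=${smi#* }   # keep the second field (total MiB)

echo "TensorFlow sees ${tf_mb} MB: isolation $(isolation_state "$tf_mb" "$LIMIT_MIB")"
echo "nvidia-smi total ${smi_total} MiB: isolation $(isolation_state "$smi_total" "$LIMIT_MIB")"
```

Against a live cluster, the same helpers could read from the commands in this topic, for example by piping the output of kubectl logs disable-cgpu-xxxx --tail=1 into tf_device_mem_mb, or the output of kubectl exec disable-cgpu-xxxx -- nvidia-smi into smi_mem_mib. Adjust the threshold to match the GPU memory that your pod actually requests.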