Configure GPU sharing without GPU memory isolation

Last Updated: Feb 18, 2024

You may require GPU sharing without GPU memory isolation in some scenarios. For example, some applications, such as Java applications, allow you to specify the maximum amount of GPU memory that they can use. If GPU memory isolation is enforced for such applications, exceptions may occur. To address this problem, you can disable GPU memory isolation for nodes that support GPU sharing. This topic describes how to configure GPU sharing without GPU memory isolation.

Prerequisites

Step 1: Create a node pool

Perform the following steps to create a node pool that has GPU memory isolation disabled.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  3. On the Node Pools page, click Create Node Pool. In the Create Node Pool dialog box, configure the parameters and click Confirm Order.

    For more information about the key parameters, see Create a node pool.
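
    After the node pool is created and its nodes are ready, you can optionally confirm from the kubectl CLI that GPU sharing is enabled on a node. The following is a minimal check, assuming that you have kubectl access to the cluster; replace <NODE_NAME> with the name of a node in the new node pool.

    # Check whether the node advertises the shared GPU memory resource.
    kubectl describe node <NODE_NAME> | grep aliyun.com/gpu-mem

    If GPU sharing is enabled on the node, the aliyun.com/gpu-mem extended resource appears under Capacity and Allocatable. This is the resource that the pod requests in Step 2.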

Step 2: Submit a job

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.

  3. On the Jobs page, click Create from YAML. In the code editor on the Create page, paste the following content and click Create.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-mnist-share
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-mnist-share
        spec:
          containers:
          - name: tensorflow-mnist-share
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                aliyun.com/gpu-mem: 4 # Request 4 GiB of GPU memory. 
            workingDir: /root
          restartPolicy: Never

    Code description:

    • The YAML content defines a TensorFlow job. The job creates one pod and the pod requests 4 GiB of GPU memory.

    • Setting aliyun.com/gpu-mem: 4 under resources.limits requests 4 GiB of GPU memory for the pod.
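
    If you prefer the kubectl CLI to the console, you can save the preceding YAML content to a file and submit the job directly. The following is a minimal sketch; the file name tensorflow-mnist-share.yaml is only an example.

    # Submit the job from the saved manifest.
    kubectl apply -f tensorflow-mnist-share.yaml

    # Confirm that the pod created by the job is running.
    kubectl get pods -l app=tensorflow-mnist-share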

Step 3: Verify the configuration

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Pods in the left-side navigation pane.

  3. On the Pods page, choose Terminal > tensorflow-mnist-share in the Actions column of the pod that you created in Step 2 to log on to the pod.

  4. Run the following command to query GPU memory information:

    nvidia-smi

    Expected output:

    Wed Jun 14 06:45:56 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

    The output shows that the pod can access the full 16,384 MiB of memory provided by the GPU, which is a V100 in this example. If GPU memory isolation were enabled, the value would equal the amount of memory requested by the pod, which is 4 GiB (4,096 MiB). Because the full GPU memory is displayed, GPU memory isolation is disabled, which indicates that the configuration takes effect.

    Because GPU memory is not isolated, the application must read its GPU memory allocation from the following environment variables and keep its usage within that amount.

    ALIYUN_COM_GPU_MEM_CONTAINER=4 # The GPU memory available for the pod. 
    ALIYUN_COM_GPU_MEM_DEV=16 # The total GPU memory provided by each GPU.

    If the application needs the ratio of the GPU memory that it can use to the total GPU memory provided by the GPU, you can calculate the ratio from the preceding environment variables:

    percentage = ALIYUN_COM_GPU_MEM_CONTAINER / ALIYUN_COM_GPU_MEM_DEV = 4 / 16 = 0.25
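
    You can also run the same checks with the kubectl CLI instead of the console terminal. The following is a minimal sketch, assuming that the sh shell is available in the container image; replace <POD_NAME> with the name of the pod created by the job in Step 2.

    # Find the pod created by the job.
    kubectl get pods -l app=tensorflow-mnist-share

    # Query GPU memory information from inside the pod.
    kubectl exec <POD_NAME> -- nvidia-smi

    # Print the GPU memory environment variables that the application reads.
    kubectl exec <POD_NAME> -- sh -c 'env | grep ALIYUN_COM_GPU_MEM'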
