
Container Service for Kubernetes: Adjust the minimum memory allocation unit for shared GPU scheduling

Last Updated: Apr 23, 2025

By default, shared GPU scheduling allocates GPU memory in units of 1 GiB: a pod that sets the aliyun.com/gpu-mem field to N requests N GiB of GPU memory. If you require finer-grained GPU memory allocation, you can reduce the minimum memory allocation unit. This topic describes how to change the minimum memory allocation unit to 128 MiB, after which a value of N requests N × 128 MiB of GPU memory.

Usage notes

  • A pod requests shared GPU resources if its aliyun.com/gpu-mem field is specified. If your cluster contains pods that request shared GPU resources, you must delete these pods before you change the minimum memory allocation unit. Otherwise, the scheduler's accounting of allocated GPU memory may become inconsistent. You can list these pods with the first command shown after this list.

  • You can adjust the minimum memory allocation unit only for nodes on which GPU sharing is enabled but memory isolation is disabled. These nodes have the ack.node.gpu.schedule=share label and can be listed with the second command shown after this list. Nodes that have the ack.node.gpu.schedule=cgpu label enable both GPU sharing and memory isolation. Due to a limit of the memory isolation module, at most 16 pods can share each GPU on such nodes, even if you change the minimum memory allocation unit to 128 MiB.

  • If you set the minimum memory allocation unit to 128 MiB, GPU-accelerated nodes cannot be automatically scaled out even if auto scaling is enabled for them. For example, if you set the aliyun.com/gpu-mem field of a pod to 32 (32 × 128 MiB = 4 GiB) and no node has enough idle GPU memory to meet the request, no new node is added and the pod remains in the Pending state.

  • If you use a cluster that is created before October 20, 2021, you must submit a ticket to restart the scheduler. The new minimum memory allocation unit takes effect only after the scheduler is restarted.
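
The following commands are a sketch of how to perform the checks described above from the command line. They assume that kubectl has access to your cluster and that jq is installed; the commands are illustrative and are not part of the shared GPU scheduling component.

# List all pods that request shared GPU memory (aliyun.com/gpu-mem).
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[]
         | select(any(.spec.containers[]; .resources.limits."aliyun.com/gpu-mem" != null))
         | .metadata.namespace + "/" + .metadata.name'

# List the nodes for which GPU sharing is enabled without memory isolation.
kubectl get nodes -l ack.node.gpu.schedule=share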

Adjust the minimum memory allocation unit

ack-ai-installer is not installed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. In the lower part of the page, click Deploy. On the page that appears, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling) and click Advanced.

  4. Add the gpuMemoryUnit: 128MiB parameter to the component configuration, and then click OK.

  5. After you complete the configuration, click Deploy Cloud-native AI Suite.

    Wait until the status of ack-ai-installer changes from Deploying to Deployed.
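
If you want to verify the deployment from the command line, the following check is a sketch. It assumes that the GPU sharing components deployed by ack-ai-installer run in the kube-system namespace and use the common gpushare name prefix; both assumptions may differ across component versions.

# List the GPU sharing components deployed by ack-ai-installer.
# The kube-system namespace and "gpushare" prefix are assumptions.
kubectl get pods -n kube-system | grep gpushare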

ack-ai-installer is installed

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, find ack-ai-installer in the component list and click Uninstall in the Actions column. In the Uninstall Component message, click Confirm.

  4. After ack-ai-installer is uninstalled, click Deploy in the Actions column. In the Parameters panel, add gpuMemoryUnit: 128MiB to the configuration.

  5. Click OK.

    Wait until the status of ack-ai-installer changes from Deploying to Deployed.
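
After the scheduler applies the new unit, the aliyun.com/gpu-mem resource reported by GPU-accelerated nodes should be expressed in 128 MiB units instead of GiB. The following spot check is a sketch; replace <NODE_NAME> with the name of a node for which GPU sharing is enabled.

# Inspect the shared GPU memory resource that a node reports.
# With a 128 MiB unit, a GPU that has 16 GiB of memory should report
# 128 units (16 GiB / 128 MiB) instead of 16.
kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable.aliyun\.com/gpu-mem}'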

Example

The following sample code shows how to request GPU memory for a pod. In this example, the aliyun.com/gpu-mem field is set to 16 and the minimum memory allocation unit is 128 MiB. Therefore, the pod requests a total of 16 × 128 MiB = 2 GiB of GPU memory.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: binpack
  labels:
    app: binpack
spec:
  replicas: 1
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # Define how the StatefulSet finds the pods it manages.
    matchLabels:
      app: binpack-1
  template: # The pod specifications. 
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - bash
        - gpushare/run.sh
        resources:
          limits:
            aliyun.com/gpu-mem: 16   # With the 128 MiB unit, 16 * 128 MiB = 2 GiB
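
To try the example, you can save the manifest to a file and apply it. The file name binpack.yaml is an assumption; the label selector in the second command matches the app: binpack-1 label that the pod template defines.

# Deploy the example, then check the status of the pod and the node it runs on.
kubectl apply -f binpack.yaml
kubectl get pods -l app=binpack-1 -o wide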