Shared GPU scheduling uses NVIDIA MPS (Multi-Process Service) as the underlying GPU isolation module. This enables multiple application pods to share a single GPU while ensuring GPU memory isolation between pods. This topic describes how to enable NVIDIA MPS isolation and integrate it with the shared GPU scheduling component.
Background information
MPI (Message Passing Interface) parallelizes work across CPU cores so that multiple compute tasks run concurrently and overall computation accelerates. However, when CUDA kernels are used to accelerate MPI processes, each individual MPI process may not submit enough work to keep the GPU busy. Individual MPI processes run faster, but overall GPU efficiency remains low, and GPU resources sit idle. In such cases, use NVIDIA MPS (Multi-Process Service). MPS enables multiple CUDA applications to run on a single NVIDIA GPU. It works well in multi-user environments or when many small tasks run simultaneously, and it improves both GPU utilization and application throughput.
MPS enables different applications to run concurrently on the same GPU device, improving GPU resource utilization across your cluster. MPS uses a client-server architecture and maintains binary compatibility, so you do not need major changes to your existing CUDA applications. MPS consists of three main components.
Control Daemon Process: Starts and stops the MPS Server and manages connections between clients and the MPS Server. This ensures that clients can connect to the MPS service to request and use GPU resources.
Client Runtime: Built into the CUDA driver library. You do not need major code changes to use MPS in your CUDA applications. When an application uses the CUDA driver to access the GPU, the Client Runtime handles communication with the MPS Server. This enables multiple applications to share the GPU safely and efficiently.
Server Process: Receives requests from different clients and uses efficient scheduling to run those requests on a single GPU device. This enables concurrency between clients.
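Outside Kubernetes, this client-server architecture is driven by the `nvidia-cuda-mps-control` utility that ships with the NVIDIA driver. The following sketch shows the bare-metal lifecycle on a single GPU host; the directory paths are illustrative assumptions, and in this topic's setup the ack-mps-control component performs these steps for you inside a container.

```shell
# Illustrative MPS lifecycle on a bare GPU host.
# (Not needed when ack-mps-control manages MPS for you.)
# The directory paths below are assumptions; any writable paths work.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps      # where clients find the control pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log   # Control Daemon and Server logs

nvidia-cuda-mps-control -d        # start the Control Daemon in background mode

# CUDA applications launched with the same CUDA_MPS_PIPE_DIRECTORY now attach
# to the MPS Server through the Client Runtime built into the CUDA driver library.

echo quit | nvidia-cuda-mps-control   # stop the Control Daemon (and the MPS Server)
```

This is why restarting the Control Daemon is disruptive: clients attached through the pipe directory lose their server connection and exit with errors.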
Important notes
In the NVIDIA MPS architecture, MPS Clients—your GPU applications that use MPS—must remain connected to the MPS Control Daemon. If the MPS Control Daemon restarts, these MPS Clients exit with errors.
In this example, the MPS Control Daemon runs as a container. A DaemonSet deploys one MPS Control Daemon pod on each GPU node. Here is what you need to know about the MPS Control Daemon pod.
Do not delete or restart the MPS Control Daemon pod. Deleting it makes GPU applications on that node unavailable. Run `kubectl get po -l app.aliyun.com/name=mps-control-daemon -A` to check the status of MPS Control Daemon pods in your cluster.
The container that runs the MPS Control Daemon requires the `privileged`, `hostIPC`, and `hostPID` permissions. These permissions carry potential security risks. Assess them carefully before you use this solution.
The MPS Control Daemon pod uses `priorityClassName: system-node-critical` to maintain high priority. This prevents the pod from being terminated when node resources run low, which would leave business applications unable to use the GPU. If node resources are low during deployment, the MPS Control Daemon may preempt lower-priority business pods and cause them to be evicted. Before you deploy the component, ensure that your nodes have sufficient CPU and memory.
When you request and use GPU resources for applications on GPU nodes managed in Container Service for Kubernetes (ACK) clusters, pay attention to the following items.
Do not run GPU-heavy applications directly on nodes.
Do not use tools such as `docker`, `podman`, or `nerdctl` to create containers that request GPU resources. For example, do not run `docker run --gpus all` or `docker run -e NVIDIA_VISIBLE_DEVICES=all` and then run GPU-heavy applications.
Do not add the `NVIDIA_VISIBLE_DEVICES=all` or `NVIDIA_VISIBLE_DEVICES=<GPU ID>` environment variable to the `env` section of the pod YAML file. Do not use the `NVIDIA_VISIBLE_DEVICES` environment variable to request GPU resources for pods that run GPU-heavy applications.
If the `NVIDIA_VISIBLE_DEVICES` environment variable is not specified in the pod YAML file, do not set `NVIDIA_VISIBLE_DEVICES=all` when you build container images and then run GPU-heavy applications.
Do not add `privileged: true` to the `securityContext` section of the pod YAML file and then run GPU-heavy applications.
The following potential risks may exist when you use the preceding methods to request GPU resources for your application:
If you use one of the preceding methods to request GPU resources on a node, the allocation is not recorded in the scheduler's device resource ledger, so the actual GPU resource allocation can differ from what the ledger records. The scheduler may then schedule additional pods that request GPU resources to that node. As a result, applications may compete for the same GPU, and some applications may fail to start because of insufficient GPU resources.
Using the preceding methods may also cause other unknown issues, such as the issues reported by the NVIDIA community.
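To keep the scheduler's device ledger consistent with the actual allocation, declare GPU resources through the scheduler instead of the approaches above. The following is a minimal sketch of a pod that requests GPU memory through the shared GPU scheduling resource `aliyun.com/gpu-mem` used later in this topic; the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-demo          # hypothetical pod name
spec:
  containers:
  - name: app
    image: <YOUR_GPU_IMAGE>   # placeholder: your CUDA application image
    resources:
      limits:
        aliyun.com/gpu-mem: 4 # request 4 GiB of GPU memory from the scheduler
```

Because the request goes through an extended resource, the scheduler accounts for it in its device ledger and avoids overcommitting the GPU.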
Applicable scope
You have created an ACK managed cluster Pro edition that runs Kubernetes 1.20 or later. If your cluster runs an earlier version, upgrade the cluster.
Procedure
Step 1: Install the MPS Control Daemon component
Log on to the ACK console. In the left navigation pane, choose Marketplace.
In the Marketplace, enter ack-mps-control in the search box. Click the component in the search results to open its installation page.
In the ack-mps-control installation interface, click Deploy, select the Cluster where you want to deploy components, and then click Next.
On the Create page, select the Chart Version. Click OK to complete the installation.
Important: Uninstalling or upgrading the MPS Control Daemon component ack-mps-control affects GPU applications that are already running on the node. These applications exit with errors. Perform these operations during off-peak hours.
The upgrade strategy is `OnDelete`. The system does not restart pods automatically. After the upgrade, manually delete the old pods in the ack-mps-control DaemonSet to complete the update. For details, see How do I upgrade the MPS Control Daemon component?
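With the `OnDelete` strategy, the upgrade takes effect only after you roll the DaemonSet pods yourself. A sketch with kubectl, assuming the pods run in the `kube-system` namespace and carry the `app.aliyun.com/name=mps-control-daemon` label shown earlier in this topic; do this during off-peak hours, because GPU applications running on the affected nodes exit with errors.

```shell
# Check the current MPS Control Daemon pods.
kubectl get po -l app.aliyun.com/name=mps-control-daemon -A

# Delete the old pods; the DaemonSet recreates them from the upgraded template.
kubectl -n kube-system delete po -l app.aliyun.com/name=mps-control-daemon
```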
Step 2: Install the shared GPU component
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, choose Applications > Cloud-native AI Suite.
On the Cloud-native AI Suite page, click Deploy.
On the Deploy Now page for the Cloud-native AI Suite, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).
At the bottom of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.
After the component installs successfully, find the installed shared GPU component ack-ai-installer in the component list on the Cloud-native AI Suite page.
Step 3: Enable GPU sharing scheduling and GPU memory isolation
On the Clusters page, click the name of your cluster. In the left navigation pane, choose Nodes > Node Pools.
On the Node Pools page, click Create Node Pool.
On the Create Node Pool page, configure the node pool settings. Click Confirm.
For details on other settings, see Create and manage node pools.
Desired number of nodes: Set the initial number of nodes in the node pool.
Note: After you create the node pool, you can add GPU nodes to it. When you add GPU nodes, set the instance type architecture to Elastic GPU Service. For details, see Add existing nodes or Create and manage node pools.
Node labels: Click the add icon next to Node Label, set Key to `ack.node.gpu.schedule`, and set Value to `mps`.
Important: You must label each GPU node with `ack.node.gpu.schedule=mps` for the MPS Control Daemon pod to be deployed on the node. If your cluster includes the shared GPU scheduling component, labeling a node with `ack.node.gpu.schedule=mps` enables both shared GPU scheduling and MPS isolation on that node.
After you add the shared GPU scheduling label, do not change the node's GPU scheduling label by using the `kubectl label nodes` command or the label management feature on the Nodes page in the console. Doing so can cause issues. For more information, see Enable scheduling.
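After the node pool is created and nodes are added, you can verify from the command line which nodes carry the label. These are read-only checks and do not modify the label.

```shell
# List nodes that have MPS-backed shared GPU scheduling enabled.
kubectl get nodes -l ack.node.gpu.schedule=mps

# Alternatively, show the label value for all nodes in an extra column.
kubectl get nodes -L ack.node.gpu.schedule
```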
Step 4: Deploy a sample application
Create a sample application using the following YAML file.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mps-sample
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: mps-sample
    spec:
      hostIPC: true # Required. Otherwise, the pod fails to start.
      hostPID: true # Optional. Added here only to help you see the effect of MPS.
      nodeSelector:
        # Replace <NODE_NAME> with the hostname of a GPU node that has the label
        # ack.node.gpu.schedule=mps. For example: cn-shanghai.192.0.2.109.
        kubernetes.io/hostname: <NODE_NAME>
      containers:
      - name: mps-sample
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 7 # Request 7 GiB of GPU memory for this pod.
        workingDir: /root
      restartPolicy: Never
```
Important: After you enable MPS on a node, GPU application pods on that node must set `hostIPC: true`. Otherwise, the pod fails to start.
Wait for the pod to reach the Running state. Then run the following command to check whether MPS is active.
```shell
kubectl exec -ti mps-sample-xxxxx -- nvidia-smi
```
Expected output:
```
Tue May 27 05:32:12 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla ****-****-****           On  | 00000000:00:09.0 Off |                    0 |
| N/A   33C    P0             55W /  300W |    345MiB / 16384MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     14732      C   nvidia-cuda-mps-server                       30MiB |
|    0   N/A  N/A    110312    M+C   python                                      312MiB |
+---------------------------------------------------------------------------------------+
```
The output shows that the `nvidia-smi` command lists the `nvidia-cuda-mps-server` process, whose process ID on the host is 14732. It also shows a Python process with process ID 110312 running under MPS (type `M+C`). This confirms that MPS is working.
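You can also confirm the scheduler-side view of the allocation. A sketch, assuming the node exposes the `aliyun.com/gpu-mem` extended resource shown in the sample manifest; replace `<NODE_NAME>` with the node that runs the sample pod.

```shell
# Show the node's total and allocated GPU memory as tracked by the scheduler.
kubectl describe node <NODE_NAME> | grep -i "gpu-mem"
```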
FAQ
How do I upgrade the MPS Control Daemon component?
Upgrading ack-mps-control v0.2.0 requires ack-ai-installer >= 1.13.1. Upgrade the MPS Control Daemon component in this order.
In the component list on the Cloud-native AI Suite page, upgrade the Helm version of the shared GPU scheduling component ack-ai-installer.
On the Helm page in the console, select the `kube-system` namespace and upgrade the Helm version of the ack-mps-control component.
The upgrade strategy is `OnDelete`. The system does not restart pods automatically. After the upgrade, manually delete the old pods in the ack-mps-control DaemonSet to complete the update.
For each node, mark the node as Unschedulable, drain it, and then delete the ack-mps-control pod on that node.
1. Set the node to Unschedulable and drain it.
2. Delete the ack-mps-control pod on that node, then confirm that the new pod runs normally.
3. After you upgrade ack-mps-control and confirm that the related pods are updated, manually delete the ack-ai-installer pod. The pod is recreated automatically.
4. After you confirm that both the ack-mps-control pod and the ack-ai-installer pod run normally on the target node, mark the node as schedulable again.
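The per-node steps above can be sketched with kubectl as follows. The drain flags and pod label are assumptions based on the label used earlier in this topic, so adapt them to your cluster.

```shell
NODE=<NODE_NAME>   # the GPU node being upgraded

# 1. Mark the node as Unschedulable and drain it.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# 2. Delete the old ack-mps-control pod on this node; the DaemonSet recreates it.
kubectl -n kube-system delete po -l app.aliyun.com/name=mps-control-daemon \
  --field-selector spec.nodeName="$NODE"

# 3. After confirming the new pod is Running, delete the ack-ai-installer pod
#    on this node so that it is rebuilt automatically.

# 4. Mark the node as schedulable again.
kubectl uncordon "$NODE"
```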