To run AI inference, high-performance computing, or other GPU workloads in Knative, configure your Knative Service to request GPU resources. You can assign a dedicated GPU to a service or enable GPU sharing so multiple pods split a single physical GPU.
Prerequisites
Before you begin, ensure that you have:
- Knative deployed in your ACK cluster. For more information, see Deploy Knative.
Configure a dedicated GPU
Add two fields to your Knative Service manifest:
- The `k8s.aliyun.com/eci-use-specs` annotation in `spec.template.metadata.annotations` specifies the GPU-accelerated ECS instance type.
- The `nvidia.com/gpu` resource limit in `spec.containers.resources.limits` specifies the number of GPUs the container requires. This field is required. If you omit it, the pod fails to start.
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      labels:
        app: helloworld-go
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn5i-c4g1.xlarge  # GPU-accelerated ECS instance type
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: '1'  # Number of GPUs required. Required field — omitting it causes the pod to fail at startup.
```
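Because omitting the `nvidia.com/gpu` limit causes the pod to fail at startup, it can help to lint a manifest before applying it. The following is a minimal sketch (not an official ACK or Knative tool) that checks a manifest dictionary for the two required fields described above; the `gpu_request_errors` helper is hypothetical.

```python
# Sketch: validate that a Knative Service manifest requests a dedicated GPU
# correctly. The manifest mirrors the example above; field paths follow the
# Knative Service schema (spec.template.metadata / spec.template.spec).

manifest = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "helloworld-go"},
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "k8s.aliyun.com/eci-use-specs": "ecs.gn5i-c4g1.xlarge"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "user-container",
                        "resources": {"limits": {"nvidia.com/gpu": "1"}},
                    }
                ]
            },
        }
    },
}


def gpu_request_errors(svc: dict) -> list:
    """Return problems that would prevent the GPU pod from starting."""
    errors = []
    template = svc["spec"]["template"]
    annotations = template["metadata"].get("annotations", {})
    if "k8s.aliyun.com/eci-use-specs" not in annotations:
        errors.append("missing k8s.aliyun.com/eci-use-specs annotation")
    for container in template["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" not in limits:
            # Required field: omitting it causes the pod to fail at startup.
            name = container.get("name", "?")
            errors.append(f"container {name} lacks nvidia.com/gpu limit")
    return errors


print(gpu_request_errors(manifest))  # → []
```

Running such a check in CI before `kubectl apply` catches the missing-limit mistake early instead of at pod startup.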
Supported GPU instance families
| Instance family | GPU chip | Example instance type |
|---|---|---|
| gn7i | NVIDIA A10 | ecs.gn7i-c8g1.2xlarge |
| gn7 | — | ecs.gn7-c12g1.3xlarge |
| gn6v | NVIDIA V100 | ecs.gn6v-c8g1.2xlarge |
| gn6e | NVIDIA V100 | ecs.gn6e-c12g1.3xlarge |
| gn6i | NVIDIA T4 | ecs.gn6i-c4g1.xlarge |
| gn5i | NVIDIA P4 | ecs.gn5i-c2g1.large |
| gn5 | NVIDIA P100 | ecs.gn5-c4g1.xlarge |
The gn5 instance family includes local disks. To mount local disks to elastic container instances, see Create an elastic container instance that has local disks attached.
For the full list of GPU-accelerated ECS instance types available in your region, see ECS instance types available for each region. For general information about instance families, see Overview of instance families.
GPU-accelerated elastic container instances support NVIDIA GPU driver version 460.73.01 and CUDA Toolkit version 11.2.
Enable GPU sharing
GPU sharing lets multiple pods share a single physical GPU by dividing its memory. Use GPU sharing for workloads such as lightweight inference services or development environments.
1. Enable GPU sharing on the nodes. For instructions, see Enable GPU sharing.
2. In your Knative Service manifest, set `aliyun.com/gpu-mem` under `spec.containers.resources.limits` to specify the GPU memory size (in GB) each container receives.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "100"  # Maximum number of pod replicas
        autoscaling.knative.dev/minScale: "0"    # Scale to zero when idle
    spec:
      containerConcurrency: 1  # Maximum concurrent requests per pod replica
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/hz-suoxing-test/test:helloworld-go
        name: user-container
        ports:
        - containerPort: 6666
          name: http1
          protocol: TCP
        resources:
          limits:
            aliyun.com/gpu-mem: "3"  # GPU memory allocated to this container, in GB
```
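The `gpu-mem` limit also bounds how far a shared GPU can scale out: the number of pods that fit on one physical card is roughly total GPU memory divided by the per-pod limit. A back-of-the-envelope sketch, assuming a 16 GB card (e.g., T4-class; substitute your GPU's actual memory):

```python
import math

# Assumptions (not from the manifest): a 16 GB GPU; adjust for your card.
TOTAL_GPU_MEM_GB = 16
POD_GPU_MEM_GB = 3   # matches aliyun.com/gpu-mem: "3" in the manifest
MAX_SCALE = 100      # matches autoscaling.knative.dev/maxScale: "100"

# How many pods bin-pack onto one physical GPU.
pods_per_gpu = TOTAL_GPU_MEM_GB // POD_GPU_MEM_GB

# How many physical GPUs the cluster needs if the service scales to maxScale.
gpus_at_max_scale = math.ceil(MAX_SCALE / pods_per_gpu)

print(pods_per_gpu, gpus_at_max_scale)  # → 5 20
```

Sizing `maxScale` against this capacity avoids scheduling pods that can never be placed because no GPU has enough free memory.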
What's next
- Best practices for deploying AI inference services in Knative — deploy AI models as inference services, configure autoscaling, and manage GPU resource allocation.
- GPU FAQ — solutions to common GPU issues.