To deploy GPU-accelerated workloads, such as AI inference or high-performance computing jobs, in Knative, specify GPU resources in the Knative Service configuration. You can also enable GPU sharing so that multiple pods share a single physical GPU, which reduces costs when workloads do not need dedicated GPU access.
Prerequisites
Before you begin, ensure that you have:
- Knative deployed in your cluster. For more information, see Deploy Knative.
Configure GPU resources
Add the `k8s.aliyun.com/eci-use-specs` annotation to `spec.template.metadata.annotations` to specify a GPU-accelerated Elastic Compute Service (ECS) instance type, and add the `nvidia.com/gpu` field to `spec.template.spec.containers.resources.limits` to specify the number of GPUs required. If you omit `nvidia.com/gpu`, the pod fails to start.
The following example configures a Knative Service with one GPU:
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      labels:
        app: helloworld-go
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn5i-c4g1.xlarge  # GPU-accelerated ECS instance type
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: '1'  # Number of GPUs required. Required field.
```
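Because a pod fails to start when `nvidia.com/gpu` is omitted, it can help to sanity-check a manifest before applying it. The following is a minimal sketch; the helper name `check_gpu_spec` is hypothetical and not part of any Knative or ACK tooling:

```python
# Sketch: verify that a Knative Service manifest (as a parsed dict) declares
# both the ECI instance-type annotation and the nvidia.com/gpu limit.
# The helper name and messages are illustrative, not a real API.

def check_gpu_spec(service: dict) -> list[str]:
    """Return a list of problems found in the Service manifest."""
    problems = []
    tmpl = service.get("spec", {}).get("template", {})
    annotations = tmpl.get("metadata", {}).get("annotations", {})
    if "k8s.aliyun.com/eci-use-specs" not in annotations:
        problems.append("missing k8s.aliyun.com/eci-use-specs annotation")
    containers = tmpl.get("spec", {}).get("containers", [])
    if not any("nvidia.com/gpu" in c.get("resources", {}).get("limits", {})
               for c in containers):
        # Without this limit the pod fails to start.
        problems.append("missing nvidia.com/gpu in resources.limits")
    return problems

service = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "helloworld-go"},
    "spec": {"template": {
        "metadata": {"annotations": {
            "k8s.aliyun.com/eci-use-specs": "ecs.gn5i-c4g1.xlarge"}},
        "spec": {"containers": [{
            "image": "registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56",
            "resources": {"limits": {"nvidia.com/gpu": "1"}}}]},
    }},
}
print(check_gpu_spec(service))  # []
```

A manifest missing either field would return the corresponding problem strings instead of an empty list.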
Supported GPU instance families
| Instance family | GPU model | Example instance type |
|---|---|---|
| gn7i | NVIDIA A10 | ecs.gn7i-c8g1.2xlarge |
| gn7 | — | ecs.gn7-c12g1.3xlarge |
| gn6v | NVIDIA V100 | ecs.gn6v-c8g1.2xlarge |
| gn6e | NVIDIA V100 | ecs.gn6e-c12g1.3xlarge |
| gn6i | NVIDIA T4 | ecs.gn6i-c4g1.xlarge |
| gn5i | NVIDIA P4 | ecs.gn5i-c2g1.large |
| gn5 | NVIDIA P100 | ecs.gn5-c4g1.xlarge |
The supported GPU driver version is NVIDIA 460.73.01 and the supported CUDA Toolkit version is 11.2. The gn5 instance family is equipped with local disks. For information on mounting local disks to elastic container instances (ECIs), see Create an elastic container instance that has local disks attached. For the full list of available instance types by region, see ECS instance types available for each region and Overview of instance families.
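The family-to-GPU mapping above can be captured in a small lookup table, for example to confirm which GPU model a given instance type provides. The data is copied from the table; the helper itself is only an illustration (gn7 is omitted because the table does not list its GPU model):

```python
# GPU model per supported instance family, copied from the table above.
GPU_BY_FAMILY = {
    "gn7i": "NVIDIA A10",
    "gn6v": "NVIDIA V100",
    "gn6e": "NVIDIA V100",
    "gn6i": "NVIDIA T4",
    "gn5i": "NVIDIA P4",
    "gn5": "NVIDIA P100",
}

def family_of(instance_type: str) -> str:
    """Extract the family from a type like 'ecs.gn6i-c4g1.xlarge'."""
    return instance_type.split(".")[1].split("-")[0]

print(GPU_BY_FAMILY[family_of("ecs.gn6i-c4g1.xlarge")])  # NVIDIA T4
```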
Enable GPU sharing
GPU sharing lets multiple pods share a single physical GPU, reducing costs when workloads do not require dedicated GPU access.
To enable GPU sharing:
- Enable GPU sharing for nodes. For more information, see Enable GPU sharing.
- Add the `aliyun.com/gpu-mem` field to `spec.template.spec.containers.resources.limits` in your Knative Service to specify the GPU memory size:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "100"
        autoscaling.knative.dev/minScale: "0"
    spec:
      containerConcurrency: 1
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/hz-suoxing-test/test:helloworld-go
        name: user-container
        ports:
        - containerPort: 6666
          name: http1
          protocol: TCP
        resources:
          limits:
            aliyun.com/gpu-mem: "3"  # Specify the GPU memory size.
```
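With GPU sharing, a rough capacity estimate is the card's total memory divided by each pod's `aliyun.com/gpu-mem` request. A sketch, assuming the request is expressed in GiB and using a 24 GiB card such as an NVIDIA A10 as an illustrative example:

```python
# Rough estimate: how many pods, each requesting `per_pod_gib` of GPU memory
# via aliyun.com/gpu-mem, fit on one physical GPU with `total_gib` of memory.
# Assumes gpu-mem is in GiB; the numbers below are illustrative only.

def pods_per_gpu(total_gib: int, per_pod_gib: int) -> int:
    return total_gib // per_pod_gib

# Pods requesting aliyun.com/gpu-mem: "3" on a 24 GiB card:
print(pods_per_gpu(24, 3))  # 8
```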
What's next
- Best practices for deploying AI inference services in Knative: deploy AI models as inference services, configure autoscaling, and allocate GPU resources flexibly.