To run AI inference, high-performance computing, or other GPU workloads in Knative, configure your Knative Service to request GPU resources. You can assign a dedicated GPU to a service or enable GPU sharing so multiple pods split a single physical GPU.
Prerequisites
Before you begin, ensure that you have:
- Knative deployed in your ACK cluster. For more information, see Deploy Knative.
Configure a dedicated GPU
Add two fields to your Knative Service manifest:
- The `k8s.aliyun.com/eci-use-specs` annotation in `spec.template.metadata.annotations` specifies the GPU-accelerated ECS instance type.
- The `nvidia.com/gpu` resource limit in `spec.containers.resources.limits` specifies the number of GPUs the container requires. This field is required. If you omit it, the pod fails to start.
```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      labels:
        app: helloworld-go
      annotations:
        k8s.aliyun.com/eci-use-specs: ecs.gn5i-c4g1.xlarge  # GPU-accelerated ECS instance type
    spec:
      containers:
      - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: '1'  # Number of GPUs required. Required field — omitting it causes the pod to fail at startup.
```
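Because omitting the `nvidia.com/gpu` limit causes the pod to fail at startup, it can help to lint a manifest before applying it. The following is a minimal sketch (not an official ACK or Knative tool) that checks a manifest dictionary for the two required fields described above; the `gpu_request_errors` helper is hypothetical.

```python
# Sketch: validate that a Knative Service manifest requests a dedicated GPU
# correctly. The manifest mirrors the example above; field paths follow the
# Knative Service schema (spec.template.metadata / spec.template.spec).

manifest = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "helloworld-go"},
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "k8s.aliyun.com/eci-use-specs": "ecs.gn5i-c4g1.xlarge"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "user-container",
                        "resources": {"limits": {"nvidia.com/gpu": "1"}},
                    }
                ]
            },
        }
    },
}


def gpu_request_errors(svc: dict) -> list:
    """Return problems that would prevent the GPU pod from starting."""
    errors = []
    template = svc["spec"]["template"]
    annotations = template["metadata"].get("annotations", {})
    if "k8s.aliyun.com/eci-use-specs" not in annotations:
        errors.append("missing k8s.aliyun.com/eci-use-specs annotation")
    for container in template["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" not in limits:
            # Required field: omitting it causes the pod to fail at startup.
            name = container.get("name", "?")
            errors.append(f"container {name} lacks nvidia.com/gpu limit")
    return errors


print(gpu_request_errors(manifest))  # → []
```

Running such a check in CI before `kubectl apply` catches the missing-limit mistake early instead of at pod startup.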
Supported GPU instance families
| Instance family | GPU chip | Example instance type |
|---|---|---|
| gn7i | NVIDIA A10 | ecs.gn7i-c8g1.2xlarge |
| gn7 | — | ecs.gn7-c12g1.3xlarge |
| gn6v | NVIDIA V100 | ecs.gn6v-c8g1.2xlarge |
| gn6e | NVIDIA V100 | ecs.gn6e-c12g1.3xlarge |
| gn6i | NVIDIA T4 | ecs.gn6i-c4g1.xlarge |
| gn5i | NVIDIA P4 | ecs.gn5i-c2g1.large |
| gn5 | NVIDIA P100 | ecs.gn5-c4g1.xlarge |
The gn5 instance family includes local disks. To mount local disks to elastic container instances, see Create an elastic container instance that has local disks attached.
For the full list of GPU-accelerated ECS instance types available in your region, see ECS instance types available for each region. For general information about instance families, see Overview of instance families.
GPU-accelerated elastic container instances support NVIDIA GPU driver version 460.73.01 and CUDA Toolkit version 11.2.
Enable GPU sharing
GPU sharing lets multiple pods share a single physical GPU by dividing its memory. Use GPU sharing for workloads such as lightweight inference services or development environments.
1. Enable GPU sharing on the nodes. For instructions, see Enable GPU sharing.
2. In your Knative Service manifest, set `aliyun.com/gpu-mem` under `spec.containers.resources.limits` to specify the GPU memory size (in GB) each container receives.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "100"  # Maximum number of pod replicas
        autoscaling.knative.dev/minScale: "0"    # Scale to zero when idle
    spec:
      containerConcurrency: 1  # Maximum concurrent requests per pod replica
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/hz-suoxing-test/test:helloworld-go
        name: user-container
        ports:
        - containerPort: 6666
          name: http1
          protocol: TCP
        resources:
          limits:
            aliyun.com/gpu-mem: "3"  # GPU memory allocated to this container, in GB
```
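The `gpu-mem` limit also bounds how far a shared GPU can scale out: the number of pods that fit on one physical card is roughly total GPU memory divided by the per-pod limit. A back-of-the-envelope sketch, assuming a 16 GB card (e.g., T4-class; substitute your GPU's actual memory):

```python
import math

# Assumptions (not from the manifest): a 16 GB GPU; adjust for your card.
TOTAL_GPU_MEM_GB = 16
POD_GPU_MEM_GB = 3   # matches aliyun.com/gpu-mem: "3" in the manifest
MAX_SCALE = 100      # matches autoscaling.knative.dev/maxScale: "100"

# How many pods bin-pack onto one physical GPU.
pods_per_gpu = TOTAL_GPU_MEM_GB // POD_GPU_MEM_GB

# How many physical GPUs the cluster needs if the service scales to maxScale.
gpus_at_max_scale = math.ceil(MAX_SCALE / pods_per_gpu)

print(pods_per_gpu, gpus_at_max_scale)  # → 5 20
```

Sizing `maxScale` against this capacity avoids scheduling pods that can never be placed because no GPU has enough free memory.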
What's next
- Best practices for deploying AI inference services in Knative — deploy AI models as inference services, configure autoscaling, and manage GPU resource allocation.
- GPU FAQ — solutions to common GPU issues.