Schedule pods on ACS using a virtual node - Container Service for Kubernetes

This topic describes how to schedule pods from an ACK One registered cluster to Alibaba Cloud Container Compute Service (ACS) computing power by using virtual nodes. Virtual nodes allow you to run workloads on ACS without provisioning or managing physical nodes.

How it works

Virtual nodes integrate ACS computing power into your ACK One registered cluster. After you install the ack-virtual-node component, a virtual node is created in your cluster. When you schedule a pod to the virtual node, ACS automatically provisions serverless compute resources for the pod. The pod runs in an isolated environment on ACS and can communicate with other pods in the cluster. Virtual nodes are ideal for workloads with variable or unpredictable traffic patterns because they scale on demand without requiring you to provision or manage physical nodes.

For more information about virtual nodes, see Overview of registered clusters.

Prerequisites

You have created an ACK One registered cluster and connected it to a Kubernetes cluster. Kubernetes 1.24 or later is required. For more information, see Create an ACK One registered cluster.
The ack-virtual-node component version 2.13.0 or later is installed in the registered cluster. For more information, see Install the ack-virtual-node component.

Configure RAM permissions for the ack-virtual-node component

onectl

Install onectl on your on-premises machine. For more information, see Use onectl to manage registered clusters.

Run the following command to configure RAM permissions for the ack-virtual-node component:

onectl ram-user grant --addon ack-virtual-node

Expected output:

Ram policy ack-one-registered-cluster-policy-ack-virtual-node granted to ram user ack-one-user-ce313528c3 successfully.

Console

Before installing the component, create a RAM user, grant it the necessary permissions, and create an AccessKey for it. You will use this AccessKey to create a Secret that allows the component to access cloud services.

Go to the RAM console.

(Optional) Create a custom policy. Use the following policy content.

Custom policy template

{
    "Version": "1",
    "Statement": [
        {
            "Action": [
                "vpc:DescribeVSwitches",
                "vpc:DescribeVpcs"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "eci:CreateContainerGroup",
                "eci:DeleteContainerGroup",
                "eci:DescribeContainerGroups",
                "eci:DescribeContainerGroupStatus",
                "eci:DescribeContainerGroupEvents",
                "eci:DescribeContainerLog",
                "eci:UpdateContainerGroup",
                "eci:UpdateContainerGroupByTemplate",
                "eci:CreateContainerGroupFromTemplate",
                "eci:RestartContainerGroup",
                "eci:ExportContainerGroupTemplate",
                "eci:DescribeContainerGroupMetric",
                "eci:DescribeMultiContainerGroupMetric",
                "eci:ExecContainerCommand",
                "eci:CreateImageCache",
                "eci:DescribeImageCaches",
                "eci:DeleteImageCache",
                "eci:DescribeContainerGroupMetaInfos",
                "eci:UpdateImageCache",
                "eci:RestartContainer",
                "eci:RestartContainers"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "acc:RecommendZones",
                "acc:DescribeZones",
                "acc:CreateInstance",
                "acc:UpdateInstance",
                "acc:DeleteInstance",
                "acc:RestartInstance",
                "acc:DescribeInstances",
                "acc:DescribeInstanceStatus",
                "acc:DescribeInstanceEvents",
                "acc:DescribeInstanceDetail",
                "acc:DescribeMultiInstanceMetric",
                "acc:DescribeContainerLog",
                "acc:ResizeInstanceVolume",
                "acc:CreateCustomResource",
                "acc:UpdateCustomResource",
                "acc:DeleteCustomResource",
                "acc:DescribeCustomResources",
                "acc:DescribeCustomResourceDetail",
                "acc:DescribeReservationMetrics"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
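
Before pasting a policy into the RAM console, it can help to confirm that the JSON parses and covers the expected services. The following Python sketch groups the allowed actions by service prefix. The summarize_policy helper is hypothetical and not part of any Alibaba Cloud tooling, and the sample is a shortened version of the template above.

```python
import json
from collections import Counter

def summarize_policy(policy_text: str) -> dict:
    """Count allowed actions per service prefix (the part before ':')."""
    policy = json.loads(policy_text)  # raises ValueError on malformed JSON
    counts = Counter()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):  # a single action may be given as a plain string
            actions = [actions]
        for action in actions:
            counts[action.split(":", 1)[0]] += 1
    return dict(counts)

# Shortened sample; in practice, read the full template from a local file.
sample = """
{
  "Version": "1",
  "Statement": [
    {"Action": ["vpc:DescribeVSwitches", "vpc:DescribeVpcs"], "Resource": "*", "Effect": "Allow"},
    {"Action": ["acc:CreateInstance", "acc:DeleteInstance", "acc:DescribeInstances"], "Resource": "*", "Effect": "Allow"}
  ]
}
"""
print(summarize_policy(sample))  # {'vpc': 2, 'acc': 3}
```

Run against the full template, the summary should list the vpc, eci, and acc prefixes.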

Grant permissions to the RAM user. You can attach the system policies AliyunECIFullAccess, AliyunVPCReadOnlyAccess, and AliyunAccFullAccess, or attach the custom policy you created.
Create an AccessKey for the RAM user.
Warning
Configure a network access policy as described in Network access control policies for AccessKeys to restrict AccessKey calls to trusted network environments. This enhances the security of your AccessKey.
Use the AccessKey to create a Secret named alibaba-addon-secret in the registered cluster.
```
kubectl -n kube-system create secret generic alibaba-addon-secret --from-literal='access-key-id=<your access key id>' --from-literal='access-key-secret=<your access key secret>'
```
Replace <your access key id> and <your access key secret> with the AccessKey ID and AccessKey secret that you obtained in the previous step.
When you install the ack-virtual-node component, it automatically uses this AccessKey to access the corresponding cloud services.
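
For reference, kubectl create secret generic produces an Opaque Secret whose values are stored base64-encoded under data. The following Python sketch builds an equivalent manifest, which may be useful if you manage the Secret declaratively. The build_addon_secret helper and the placeholder credentials are illustrative assumptions.

```python
import base64
import json

def build_addon_secret(access_key_id: str, access_key_secret: str) -> dict:
    """Return a Secret manifest equivalent to the kubectl create secret command."""
    def b64(value: str) -> str:
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {"name": "alibaba-addon-secret", "namespace": "kube-system"},
        "data": {
            "access-key-id": b64(access_key_id),
            "access-key-secret": b64(access_key_secret),
        },
    }

# Placeholder values only; never commit real AccessKey values to source control.
manifest = build_addon_secret("example-id", "example-secret")
print(json.dumps(manifest, indent=2))
```

Because kubectl apply -f accepts JSON as well as YAML, the printed manifest can be saved to a file and applied directly.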

Example: Use ACS CPU computing power

After you install or upgrade the ack-virtual-node component to version 2.13.0 or later, the component supports both ACS and Elastic Container Instance (ECI) computing power.

Note

When a pod is scheduled to a virtual node, ECI is used by default unless you specify ACS as the computing power type.
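
The selection rule in this note can be expressed as a small check on pod labels. This is a minimal sketch of the documented behavior, not the component's actual implementation. The compute_backend helper is hypothetical, and it assumes that the label value must be exactly the string "true", as in the examples in this topic.

```python
ACS_LABEL = "alibabacloud.com/acs"

def compute_backend(pod_labels: dict) -> str:
    """Return "acs" when the pod opts in through the ACS label, otherwise "eci"."""
    return "acs" if pod_labels.get(ACS_LABEL) == "true" else "eci"

print(compute_backend({"app": "nginx", ACS_LABEL: "true"}))  # acs
print(compute_backend({"app": "nginx"}))                     # eci
```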

Perform the following steps to use ACS CPU computing power with an ACK One registered cluster:

Update the security group configuration of the registered cluster.
On the Basic Information page of the cluster, click the ID of the Control Plane Security Group.

On the security group details page, click Add Rule. Configure the rule by using the following values.

Rule Type	Protocol	Port Range	Source IP Range	Description
Inbound	TCP	80	CIDR block of the IDC cluster, for example, 192.168.1.0/24.	Used in scenarios where an ACS endpoint is configured.
Inbound	TCP	443	CIDR block of the IDC cluster, for example, 192.168.1.0/24.	Used in scenarios where an ACS endpoint is configured.
Inbound	TCP	10250	CIDR block of the IDC cluster, for example, 192.168.1.0/24.	The port listened on by the serverless kubelet service.
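
To confirm that a given on-premises address is covered by the source range in these rules, a CIDR membership check such as the following can help. It is a sketch that uses the example range 192.168.1.0/24 from the table; replace it with the actual CIDR block of the IDC cluster. The is_allowed helper is hypothetical and does not query the security group itself.

```python
import ipaddress

# TCP ports opened by the inbound rules in the table.
ALLOWED_TCP_PORTS = {80, 443, 10250}

def is_allowed(source_ip: str, port: int, idc_cidr: str = "192.168.1.0/24") -> bool:
    """Return True if source_ip is inside idc_cidr and port is one of the opened ports."""
    network = ipaddress.ip_network(idc_cidr)
    return ipaddress.ip_address(source_ip) in network and port in ALLOWED_TCP_PORTS

print(is_allowed("192.168.1.25", 10250))  # True
print(is_allowed("192.168.2.25", 10250))  # False: outside the CIDR block
print(is_allowed("192.168.1.25", 8080))   # False: port not opened
```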

Create a Deployment that uses ACS computing power.

Create a file named nginx.yaml and copy the following content to the file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        alibabacloud.com/acs: "true" # Use ACS computing power.
        alibabacloud.com/compute-class: general-purpose # The compute class of the ACS pod.
        alibabacloud.com/compute-qos: default # The compute QoS class of the ACS pod.
    spec:
      containers:
      - name: nginx
        image: mirrors-ssl.aliyuncs.com/nginx:stable-alpine
        ports:
          - containerPort: 80
            protocol: TCP 
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2

Run the following command to create the nginx application:
```
kubectl apply -f nginx.yaml 
```

Run the following command to check the deployment status:

kubectl get pods -o wide

Expected output (simplified):

NAME                     READY   STATUS    RESTARTS   AGE     IP               NODE                            NOMINATED NODE   READINESS GATES
nginx-54bcbc9b66-****   1/1     Running   0          3m29s   192.168.XX.XXX   virtual-kubelet-cn-shanghai-l   <none>           <none>
nginx-54bcbc9b66-****   1/1     Running   0          3m29s   192.168.XX.XXX   virtual-kubelet-cn-shanghai-l   <none>           <none>

The output shows that both pods are scheduled to the virtual node, which carries the label type=virtual-kubelet.
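
The same check can be scripted against the output of kubectl get pods -o json. The sketch below assumes that virtual node names start with virtual-kubelet, as in the sample output. The helper name and the sample pod and node names are hypothetical.

```python
import json

def pods_off_virtual_node(pod_list: dict, prefix: str = "virtual-kubelet") -> list:
    """Return names of pods whose node name does not start with the given prefix."""
    return [
        item["metadata"]["name"]
        for item in pod_list["items"]
        if not item["spec"].get("nodeName", "").startswith(prefix)
    ]

# Trimmed sample of `kubectl get pods -o json`; pod and node names are made up.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "nginx-a"}, "spec": {"nodeName": "virtual-kubelet-cn-shanghai-l"}},
  {"metadata": {"name": "nginx-b"}, "spec": {"nodeName": "worker-1"}}
]}
""")
print(pods_off_virtual_node(sample))  # ['nginx-b']
```

An empty list means that every listed pod has been placed on a virtual node. Pods that are still pending have no node name and are also reported.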

Run the following command to view the details of an nginx pod:

kubectl describe pod nginx-54bcbc9b66-****

Expected output:

Annotations:  ProviderCreate: done
              alibabacloud.com/instance-id: acs-uf6008giwgjxlvn*****
              alibabacloud.com/pod-ephemeral-storage: 30Gi
              alibabacloud.com/pod-use-spec: 2-2Gi
              kubernetes.io/pod-stream-port: 10250
              network.alibabacloud.com/enable-dns-cache: false
              topology.kubernetes.io/region: cn-shanghai

If the output contains the annotation alibabacloud.com/instance-id with a value such as acs-uf6008giwgjxlvn*****, the pod is running as an ACS pod.
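
This annotation check can also be automated against the JSON form of the pod (kubectl get pod with -o json). The following is a minimal sketch. The is_acs_pod helper is hypothetical, and matching on the acs- prefix is an assumption inferred from the sample output rather than a documented guarantee.

```python
INSTANCE_ID_ANNOTATION = "alibabacloud.com/instance-id"

def is_acs_pod(pod: dict) -> bool:
    """Return True if the pod has an instance-id annotation that starts with acs-."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return annotations.get(INSTANCE_ID_ANNOTATION, "").startswith("acs-")

# Trimmed samples of pod JSON; the instance ID is made up.
acs_pod = {"metadata": {"annotations": {INSTANCE_ID_ANNOTATION: "acs-example123"}}}
plain_pod = {"metadata": {"name": "no-annotations"}}
print(is_acs_pod(acs_pod))    # True
print(is_acs_pod(plain_pod))  # False
```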

Example: Use ACS GPU computing power

ACS GPU computing power for ACK One registered clusters is in invitational preview. To enable this feature, submit a ticket.

Configure a GPU workload

After the feature is enabled, you can configure a GPU workload. The following example shows the required labels for scheduling pods to ACS GPU computing power:

...
      labels:
        # Declare the ACS GPU resource requirements in the labels.
        alibabacloud.com/compute-class: gpu     # Set to gpu for GPU computing power.
        alibabacloud.com/compute-qos: default   # The compute QoS class. The meaning is the same as for regular ACS computing power.
        alibabacloud.com/gpu-model-series: GN8IS  # The GPU model. Replace this with the actual model.
...

Note

For more information about ACS compute classes and QoS classes, see Mapping between computing types and computing power quality.
For available GPU models for the gpu-model-series label, see Specify a GPU model and driver version for an ACS GPU pod.
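
If you generate manifests programmatically, the ACS labels can be assembled in one place. The sketch below uses only the label keys that appear in this topic. The acs_pod_labels helper is hypothetical, and treating gpu-model-series as mandatory for the gpu compute class follows this topic's description of the required labels.

```python
def acs_pod_labels(compute_class: str, qos: str = "default", gpu_model_series=None) -> dict:
    """Build the pod template labels that select ACS computing power."""
    if compute_class == "gpu" and gpu_model_series is None:
        raise ValueError("the gpu compute class needs a gpu-model-series value")
    labels = {
        "alibabacloud.com/acs": "true",
        "alibabacloud.com/compute-class": compute_class,
        "alibabacloud.com/compute-qos": qos,
    }
    if gpu_model_series is not None:
        labels["alibabacloud.com/gpu-model-series"] = gpu_model_series
    return labels

# "example-model" is a placeholder; replace it with an actual GPU model.
for key, value in acs_pod_labels("gpu", gpu_model_series="example-model").items():
    print(f"{key}: {value}")
```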

Create a GPU workload using the following sample YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dep-node-selector-demo
  labels:
    app: node-selector-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: node-selector-demo
  template:
    metadata:
      labels:
        app: node-selector-demo
        # ACS properties
        alibabacloud.com/acs: "true" # Configure to use ACS computing power.
        alibabacloud.com/compute-class: gpu
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: example-model  # The GPU card model. Replace it as needed, for example, T4.
    spec:
      containers:
      - name: node-selector-demo
        image: registry-cn-hangzhou.ack.aliyuncs.com/acs/stress:v1.0.4
        command:
        - "sleep"
        - "1000h"
        resources:
          limits:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: "1"

Run the following command to check the running status of the GPU workload:

kubectl get pod dep-node-selector-demo-9cdf7bbf9-s**** -oyaml

Expected output:

    phase: Running

    resources:
      limits:
        #other resources
        nvidia.com/gpu: "1"
      requests:
        #other resources
        nvidia.com/gpu: "1"
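
To check the GPU request without reading the full YAML, the pod can be retrieved as JSON (-o json) and inspected with a short script. The following is a sketch under the assumption that the device resource name is nvidia.com/gpu, as in the sample. The gpu_requests helper and the sample pod are hypothetical.

```python
def gpu_requests(pod: dict, resource_name: str = "nvidia.com/gpu") -> dict:
    """Map each container name to its requested quantity of resource_name.

    Containers that do not request the resource are omitted.
    """
    result = {}
    for container in pod["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        if resource_name in requests:
            result[container["name"]] = requests[resource_name]
    return result

# Trimmed sample of pod JSON; names are made up.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "node-selector-demo",
             "resources": {"requests": {"cpu": "1", "nvidia.com/gpu": "1"}}},
            {"name": "sidecar", "resources": {"requests": {"cpu": "100m"}}},
        ]
    },
    "status": {"phase": "Running"},
}
print(sample_pod["status"]["phase"])  # Running
print(gpu_requests(sample_pod))       # {'node-selector-demo': '1'}
```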

Example: Use ACS GPU HPN computing power

The process for using ACS GPU HPN computing power is similar to using ACS CPU computing power, but has the following requirements:

You must purchase a GPU-HPN capacity reservation in advance and associate it with the cluster.
ACS GPU HPN computing power requires upgrading the ack-virtual-node component. The required component version is currently in invitational preview. To enable this feature, submit a ticket.

Configure a GPU HPN workload

To use ACS GPU HPN computing power, configure the following labels in your pod specification:

...     
labels:
  # Declare the ACS GPU resource requirements in the label.
  alibabacloud.com/compute-class: gpu-hpn     # Set to the gpu-hpn type.
  alibabacloud.com/compute-qos: default    # The computing QoS type. The meaning is the same as for regular ACS computing power.
  alibabacloud.com/acs: "true"           # The label to configure the use of ACS computing power.
...

Note

For more information about the relationship between ACS compute types and computing power quality, see Relationship between compute types and computing power quality.
For more information about other parameters of ACS pods, see ACS Pod.
Nodes of the ACS GPU HPN type can schedule only pods of the gpu-hpn compute class. The GPU resource requirement can be omitted from the pod resource declaration. These nodes cannot schedule pods of other compute classes or pods that do not have a compute class declared.

Use a Kubernetes nodeSelector to schedule pods to GPU HPN nodes. For example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dep-node-selector-demo
  labels:
    app: node-selector-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: node-selector-demo
  template:
    metadata:
      labels:
        app: node-selector-demo
        # ACS properties
        alibabacloud.com/compute-class: gpu-hpn
        alibabacloud.com/compute-qos: default
        alibabacloud.com/acs: "true" 
    spec:
      # Specify the gpu-hpn reserved node label
      nodeSelector:
        alibabacloud.com/node-type: reserved
      containers:
      - name: node-selector-demo
        image: registry-cn-hangzhou.ack.aliyuncs.com/acs/stress:v1.0.4
        command:
        - "sleep"
        - "1000h"
        resources:
          limits:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: "1" # Enter the corresponding resource name based on the actual card model.
          requests:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: "1" # Enter the corresponding resource name based on the actual card model.

Important

For pods of the ACS GPU HPN type, take note of the following field configurations:

Specify the compute class: alibabacloud.com/compute-class: gpu-hpn.
Specify the reserved node label: alibabacloud.com/node-type: reserved.
For the device resource names in the requests and limits fields of the resource specification, use the resource name that matches the actual device model, such as nvidia.com/gpu for NVIDIA GPUs.
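
The label and nodeSelector requirements above lend themselves to an automated check before a manifest is applied. The sketch below validates a pod template given as a Python dictionary. The check_hpn_pod_template helper is hypothetical. It deliberately does not validate the device resource name, because that name depends on the actual card model.

```python
def check_hpn_pod_template(template: dict) -> list:
    """Return a list of problems found in a GPU HPN pod template (empty if none)."""
    problems = []
    labels = template.get("metadata", {}).get("labels", {})
    node_selector = template.get("spec", {}).get("nodeSelector", {})
    if labels.get("alibabacloud.com/acs") != "true":
        problems.append('label alibabacloud.com/acs must be "true"')
    if labels.get("alibabacloud.com/compute-class") != "gpu-hpn":
        problems.append("label alibabacloud.com/compute-class must be gpu-hpn")
    if node_selector.get("alibabacloud.com/node-type") != "reserved":
        problems.append("nodeSelector alibabacloud.com/node-type must be reserved")
    return problems

# Template mirroring the Deployment example in this section.
good = {
    "metadata": {"labels": {
        "app": "node-selector-demo",
        "alibabacloud.com/acs": "true",
        "alibabacloud.com/compute-class": "gpu-hpn",
        "alibabacloud.com/compute-qos": "default",
    }},
    "spec": {"nodeSelector": {"alibabacloud.com/node-type": "reserved"}},
}
print(check_hpn_pod_template(good))  # []
```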

Run the following command to check the running status of the GPU HPN workload:

kubectl get pod dep-node-selector-demo-9cdf7bbf9-s**** -oyaml

The following output is a snippet of the key information:

    phase: Running

    resources:
      limits:
        #other resources
        nvidia.com/gpu: "1"
      requests:
        #other resources
        nvidia.com/gpu: "1"