This topic describes how to add a LINGJUN node pool to an ACK Managed Cluster Pro Edition.
Introduction to LINGJUN node pools
A LINGJUN node pool in an ACK Managed Cluster Pro Edition has a one-to-one mapping with a node group of the Intelligent Computing LINGJUN Service (LINGJUN bare metal cluster). This means that a node group of a LINGJUN Cluster corresponds to a single LINGJUN node pool of an ACK Managed Cluster Pro Edition, and a Node Lingjun instance can belong to only one LINGJUN node pool. By dividing nodes into LINGJUN node pools, you can apply different management policies to the Node Lingjun instances within an ACK Managed Cluster Pro Edition.
ACK Managed Cluster Pro Edition manages Node Lingjun instances through LINGJUN node pools. It supports node pool lifecycle management and the batch addition and removal of nodes, and it provides operations and maintenance (O&M) capabilities that are almost identical to those of ECS node pools. These capabilities include node configuration, node O&M, application scheduling to specified node pools, monitoring and diagnostics, and automated O&M.
To provide enhanced cloud-native AI capabilities for Node Lingjun instances, you can install the cloud-native AI suite. LINGJUN node pools support topology-aware scheduling for multiple GPUs. They provide shared GPU scheduling and isolation using a GPU container virtualization solution. For tasks such as AI and High-Performance Computing (HPC), they support scheduling policies such as Gang, Capacity, and Binpack. They also support dataset orchestration and access acceleration.
The LINGJUN node pool feature for ACK Managed Cluster Pro Edition is enabled through a whitelist. To use this feature, contact the Container Service team through your solution architect (SA).
Billing description
When you use a LINGJUN node pool in an ACK Managed Cluster Pro Edition, the total cost consists of three parts: cluster management fees, LINGJUN node management fees, and cloud product resource fees.
Prerequisites
Before you create a LINGJUN node pool for an ACK Managed Cluster Pro Edition, the following prerequisites must be met:
Create a basic LINGJUN Cluster of the Lite type and scale out nodes in a LINGJUN node group. For more information, see Create a cluster.
Create an ACK Managed Cluster Pro Edition that meets the following conditions:
The ACK Managed Cluster Pro Edition and the LINGJUN bare metal cluster are in the same region and VPC.
The ACK Managed Cluster Pro Edition is version 1.31 or later. Only IPv4 single-stack clusters are supported. IPv4/IPv6 dual-stack clusters are not supported. To upgrade the cluster, see Manually upgrade a cluster.
The network plugin is Terway. Different Node Lingjun instance types require different Terway versions. You must upgrade the terway-controlplane and terway-eniip components to the latest versions.
The ack-rdma-device-plugin component is installed.
When you use a LINGJUN node pool, you must retain ECS nodes to deploy some ACK control plane components. We recommend that you use three or more ECS nodes to ensure high availability (HA).
Important: To prevent system component pods from being scheduled to LINGJUN nodes and consuming resources, nodes in a LINGJUN node pool have the following label and taint by default. If you want to run a pod on a LINGJUN node, you can add a toleration for this taint or delete the taint after you upgrade components. However, do not delete the default label.
Label: alibabacloud.com/lingjun-worker: true
Taint: Key: node-role.alibabacloud.com/lingjun, Effect: NoSchedule
LINGJUN node pools support only Node Lingjun instances with an operating system (OS) kernel version of 5.10 or later.
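As an illustration of the toleration approach, the following Python sketch builds a pod manifest that tolerates the default LINGJUN taint and selects nodes that carry the default label. The label key, taint key, and effect come from this topic. The function name, pod name, and image are hypothetical placeholders, and the use of the Exists operator is an assumption made because this topic does not state a taint value.

```python
import json

# Key names taken from the default label and taint described in this topic.
LINGJUN_LABEL_KEY = "alibabacloud.com/lingjun-worker"
LINGJUN_TAINT_KEY = "node-role.alibabacloud.com/lingjun"


def build_lingjun_pod(name: str, image: str) -> dict:
    """Return a pod manifest that tolerates the LINGJUN taint and targets LINGJUN nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Schedule only onto nodes that carry the default LINGJUN label.
            "nodeSelector": {LINGJUN_LABEL_KEY: "true"},
            # Assumption: "Exists" matches the taint regardless of its value.
            "tolerations": [
                {
                    "key": LINGJUN_TAINT_KEY,
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    pod = build_lingjun_pod("demo", "registry.example.com/demo:latest")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.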
Entry points
On the Node Pools page, you can create, edit, delete, and view the node pools in your cluster.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, go to the Node Pools page.
Create a LINGJUN node pool
You can configure the node pool in the console. The configuration includes basic, network, and storage settings. Note that some configuration items, especially those related to node pool availability and networking, cannot be changed after the node pool is created. Creating a node pool does not affect the nodes or services in other existing node pools.
On the Node Pools page, click ... > Create LINGJUN Node Pool. In the Create LINGJUN Node Pool dialog box, complete the configurations and associate an existing LINGJUN Cluster and LINGJUN group.
After the node pool is created, you can modify its configuration items on the Edit Node Pool page. The following table indicates whether a configuration item can be modified after the node pool is created.
LINGJUN node pools currently support storing container runtime data only on the system disk.
For Node Lingjun instances that use LINGJUN Connection, you must submit a request to be added to the whitelist for the ACK VPD CNI component. Before you create the LINGJUN node pool, install the ACK VPD CNI component on the Component Management page. When you create a LINGJUN node pool for a node group that uses LINGJUN Connection, ACK automatically adds the CIDR block of the LINGJUN group to the cluster security group and allows inbound access. ACK also automatically adds the alibabacloud.com/lingjun-network-type: vpd label to the node pool. Do not delete this label.
Add existing Node Lingjun instances
To add Node Lingjun instances from a LINGJUN group to an ACK cluster as worker nodes, or to re-add removed worker nodes, you can add them in batches from the associated group to the LINGJUN node pool in the ACK console. After the nodes are added, you can manage them at the node pool level.
Adding existing Node Lingjun instances does not replace their operating systems, system disks, or data disks, and does not affect the data stored on them. The Node Lingjun instances that you want to add must belong to the LINGJUN group that is associated with the node pool and must not have been added to the node pool.
Log on to the ACK console. In the navigation pane on the left, choose Clusters.
On the Clusters page, click the name of the target cluster. In the navigation pane on the left, choose Node > Node Pools.
On the Node Pools page, click ⋮ > Add Existing Node.
Note: After the Node Lingjun instances are successfully added, ACK Managed Cluster Pro Edition automatically adds the corresponding tags to them. You can view these tags in the Intelligent Computing LINGJUN console.
ack.aliyun.com: The ID of the ACK Managed Cluster Pro Edition that manages the Node Lingjun instances.
ack.alibabacloud.com/nodepool-id: The ID of the LINGJUN node pool that manages the Node Lingjun instances.
Remove Node Lingjun instances
Node Lingjun instances that are added to a node pool are not released when you delete the ACK cluster or the LINGJUN node pool. The instances are not automatically removed from the LINGJUN group by scaling in. You must monitor the billing status of your Node Lingjun instances to avoid extra charges.
Removing a Node Lingjun instance only removes it from the LINGJUN node pool. It does not remove the node from the LINGJUN group. For more management operations on Node Lingjun instances and groups, go to the Intelligent Computing LINGJUN console.
Use the RDMA feature
To enable Remote Direct Memory Access (RDMA) communication for Node Lingjun instances, navigate to the details page of the target cluster in the console. In the navigation pane on the left, choose Operations > Add-ons and manually install the ack-rdma-device-plugin component.
The network modes available for pods depend on the IP version of the computing network of the LINGJUN bare metal cluster that is associated with the LINGJUN node pool.
Computing network IP version | Supported RDMA network modes | Configuration description |
IPv4 | Only | Pods support RDMA communication only in |
IPv6 |
|
|
For more information, see Use RDMA networks on Node Lingjun instances for pods.
Use Terway exclusive ENI mode
When you use Terway, LINGJUN node pools support only the exclusive elastic network interface (ENI) network mode and require Terway v1.14.4 or later. If your Terway component version is earlier than v1.14.4, upgrade the terway-eniip component as described in Upgrade components.
When you create a LINGJUN node pool, ACK automatically adds the k8s.aliyun.com/exclusive-mode-eni-type: eniOnly label to the node pool to enable exclusive ENI mode. Do not delete this label. For more information, see Configure exclusive ENI network mode for a node pool.
If your LINGJUN node pool does not have this label, it uses the shared ENI network mode.
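The label rule above can be expressed as a small check. This is a minimal sketch, assuming that you already have the labels of a node pool or node as a Python dict (for example, parsed from the JSON output of kubectl). The function name is hypothetical, and treating any value other than eniOnly as shared mode is an assumption, because this topic only describes the cases where the label is present with that value or absent.

```python
# Label key and value taken from this topic.
EXCLUSIVE_ENI_LABEL = "k8s.aliyun.com/exclusive-mode-eni-type"
EXCLUSIVE_ENI_VALUE = "eniOnly"


def eni_mode(labels: dict) -> str:
    """Return "exclusive" if the exclusive ENI label is set to eniOnly, else "shared"."""
    if labels.get(EXCLUSIVE_ENI_LABEL) == EXCLUSIVE_ENI_VALUE:
        return "exclusive"
    # Assumption: a missing label or any other value means shared ENI mode.
    return "shared"


if __name__ == "__main__":
    with_label = {EXCLUSIVE_ENI_LABEL: "eniOnly"}
    without_label = {"alibabacloud.com/lingjun-worker": "true"}
    print(eni_mode(with_label))
    print(eni_mode(without_label))
```

A check like this can help you find node pools that were created before the label was applied automatically and therefore still run in shared ENI mode.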
When a Node Lingjun instance uses the shared ENI mode for VPC network communication, pod network failures may occasionally occur. You can restart the pod to temporarily restore the service. To completely resolve this issue, upgrade the Terway component to the latest version during off-peak hours. Then, recreate the LINGJUN node pool in exclusive ENI mode and add the Node Lingjun instances to the new node pool.
Upgrade components
When you create an ACK Managed Cluster Pro Edition, the latest component versions are used by default. When you create a LINGJUN node pool in an existing ACK Managed Cluster Pro Edition, you must upgrade the following components to the specified versions. To upgrade the components, navigate to the details page of the target cluster in the console and choose Operations > Add-ons in the navigation pane on the left.
Component Name | Minimum Version Requirement |
v1.31 | |
v1.14.4 | |
v1.11.3.5-5321daf49-aliyun | |
v1.11.4-aliyun.2 | |
v0.2.1 | |
v0.16.1.0-gea4d02f-aliyun | |
v1.8.4 | |
v1.1.31 | |
v2.1.6 | |
v1.32.2 | |
v1.32.2 | |
v0.2.10 | |
ack-ai-installer (Applications > Cloud-native AI Suite Installation) | v1.12.2 |
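Before you create a LINGJUN node pool in an existing cluster, it can be useful to compare installed component versions against the minimum versions. The following Python sketch shows one way to do that comparison. The two minimum versions in the example come from this topic (terway-eniip v1.14.4 and ack-ai-installer v1.12.2). The installed versions and function names are hypothetical. As a simplification, the sketch compares only the leading numeric part of a version string and ignores build suffixes such as -aliyun.2.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract the leading dotted numeric part, e.g. "v1.11.4-aliyun.2" -> (1, 11, 4)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that 1.14 and 1.14.0 compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Minimum versions from this topic; installed versions are hypothetical.
    minimums = {"terway-eniip": "v1.14.4", "ack-ai-installer": "v1.12.2"}
    installed = {"terway-eniip": "v1.13.9", "ack-ai-installer": "v1.12.2"}
    for name, minimum in minimums.items():
        status = "ok" if meets_minimum(installed[name], minimum) else "upgrade required"
        print(f"{name}: {installed[name]} (minimum {minimum}) -> {status}")
```

Because build suffixes are ignored, two versions that differ only in the suffix are treated as equal. Confirm such cases on the Add-ons page.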
Related operations
Use shared GPU scheduling.
To use shared GPU scheduling on Node Lingjun instances in an ACK Managed Cluster Pro Edition and enable GPU sharing and isolation, you must first install the ack-ai-installer component of the cloud-native AI suite. For more information, see Use shared GPU scheduling.
Enable the Binpack scheduling policy.
When you run model training jobs in a LINGJUN node pool, you can enable the Binpack policy for pod scheduling. This policy prioritizes scheduling pods to the same machine to reduce cross-machine communication latency during training. For more information about how to enable binpack in the Kube Scheduler component, see Custom parameters of kube-scheduler.
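To illustrate the idea behind Binpack, the following Python sketch picks the node that will be most utilized after a pod is placed on it, so that pods fill one machine before spilling over to the next. This is a toy model under stated assumptions, not the Kube Scheduler implementation: it considers a single resource (GPUs), and the function name, node names, and capacities are hypothetical.

```python
from typing import Dict, Optional, Tuple


def pick_node_binpack(nodes: Dict[str, Tuple[int, int]], request: int) -> Optional[str]:
    """Pick a node for a pod that requests `request` GPUs.

    `nodes` maps a node name to (total_gpus, free_gpus). The node with the
    highest utilization after placement wins. Ties go to the first name in
    sorted order. Returns None if no node has enough free GPUs.
    """
    best_node, best_score = None, -1.0
    for name in sorted(nodes):
        total, free = nodes[name]
        if total <= 0 or free < request:
            continue  # The pod does not fit on this node.
        score = (total - free + request) / total  # Utilization after placement.
        if score > best_score:
            best_node, best_score = name, score
    return best_node


if __name__ == "__main__":
    # Hypothetical cluster: three 8-GPU nodes with different free capacity.
    cluster = {"node-a": (8, 8), "node-b": (8, 3), "node-c": (8, 1)}
    print(pick_node_binpack(cluster, request=2))
```

In this example the pod lands on node-b: node-c cannot fit the request, and node-b ends up fuller than node-a would. A spread-style policy would instead prefer the empty node-a.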
Use topology-aware scheduling in a LINGJUN node pool.
To use topology-aware scheduling in a LINGJUN node pool, you must install Kube Scheduler and upgrade it to v1.31 or later. For more information, see Use topology-aware scheduling.
FAQ
Node remains in Not Ready state after repair
Background: A Node Lingjun instance is taken offline for repair due to a hardware issue. After the repair is complete, the node's status is still Not Ready in the ACK cluster.
Cause: During offline repair, the Node Lingjun instance is replaced, and the data on its local disks is not retained. This clears the data of node components such as kubelet and the containerd container runtime, which causes the node to enter an abnormal state.
Solution: After the repair is complete, manually remove the node from the node pool and then re-add it using the Add Existing Node feature.