A Lingjun cluster is a collection of high-performance Lingjun compute nodes equipped with the Lingjun optimization suite. Each Lingjun node corresponds to a GPU compute server that you can use to deploy heterogeneous computing services. This topic describes how to manage Lingjun clusters and nodes, including viewing their information and scaling them out.
Manage Lingjun clusters
A Lingjun cluster can be in one of the following states:
Initialization failed: The cluster failed to initialize. To view details about the failed task, see O&M Task Center.
Initializing: The system is configuring the Lingjun network and initializing the Lingjun compute nodes.
Running: Scale-out, scale-in, node reinstallation, and node restart are available only when the cluster is in this state.
Important: You can run scale-out, scale-in, reinstall, and restart tasks in parallel if they target different Lingjun compute nodes.
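The state rules above can be sketched as a small pre-flight check. This is a minimal illustration in Python, not part of any Lingjun SDK: the names ClusterState, is_operation_allowed, and can_run_in_parallel are hypothetical, and the enum includes the Scaling out and Scaling in statuses that the console shows during scaling tasks.

```python
from enum import Enum


class ClusterState(Enum):
    """Cluster statuses named in this topic (hypothetical enum)."""
    INITIALIZING = "Initializing"
    INITIALIZATION_FAILED = "Initialization failed"
    RUNNING = "Running"
    SCALING_OUT = "Scaling out"
    SCALING_IN = "Scaling in"


# Operations that require the cluster to be in the Running state.
GATED_OPERATIONS = {"scale_out", "scale_in", "reinstall", "restart"}


def is_operation_allowed(state, operation):
    """Return True if the operation may start in the given cluster state."""
    if operation not in GATED_OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    return state is ClusterState.RUNNING


def can_run_in_parallel(nodes_a, nodes_b):
    """Two tasks may run in parallel only if they target different nodes."""
    return set(nodes_a).isdisjoint(nodes_b)
```

For example, is_operation_allowed(ClusterState.INITIALIZING, "restart") returns False, and two tasks that both include the same node ID fail the can_run_in_parallel check.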
Cluster information
Log on to the Intelligent Computing Lingjun console.
In the left-side navigation pane, choose Resources and Nodes > Cluster Management.
Click Details next to the cluster ID to open the Cluster Details page.
View basic information about the cluster, such as its name, the number of node groups, and creation details.
View cluster details on the Node Group, Monitoring and Alerting, Basic Metrics, RDMA, and GPU tabs.
Scale out cluster
When you scale out a cluster, you must deploy a CPFS client on the new GPU nodes and add them to the CPFS cluster.
You must also add tags to the new nodes.
Log on to the Intelligent Computing Lingjun console.
In the left-side navigation pane, choose Resources and Nodes > Cluster Management.
Click Scale out next to the target cluster ID.
In the Original Group Details: area, click Scale out next to the name of the corresponding node group.
In the dialog box that appears, enter a node name prefix, and then enter and confirm the logon password.
Select the checkboxes for the unused node instances or purchase new nodes, and then click Yes.
In the Detailed configurations for scale-out area, click Confirm Submission.
Return to the Cluster Management page. The cluster status changes to Scaling out. Wait for the process to complete.
Scale in cluster
Scaling in a cluster removes nodes and then reinstalls their operating systems, which erases all local data. Before you proceed, make sure that you have backed up any required data from these nodes.
When you scale in a cluster, the nodes are removed from the associated CPFS cluster.
Log on to the Intelligent Computing Lingjun console.
In the left-side navigation pane, choose Resources and Nodes > Cluster Management.
Click Scale-in next to the cluster ID.
In the Original Group Details: area, select the checkboxes for the nodes that you want to remove, and then click Batch Remove from Cluster.
In the area titled "The following information displays the detailed configurations for scale-down:", click Confirm Submission.
On the Confirm scale-in page, enter DELETE in the text box, and then click Yes.
Return to the Cluster Management page. The cluster status changes to Scaling in. Wait for the process to complete.
Delete cluster
Before you delete a cluster, you must first remove all its nodes by scaling it in.
Deleting a cluster does not delete the associated CPFS cluster.
Log on to the Intelligent Computing Lingjun console.
In the left-side navigation pane, choose Resources and Nodes > Cluster Management.
Click the Cluster ID/Name of the cluster that you want to delete. On the Cluster Details page, click Delete in the upper-right corner.
In the dialog box that appears, click OK to delete the cluster.
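The ordering constraint for deletion (back up data, scale in all nodes, then delete) can be sketched as a simple planning helper. This is an illustrative Python sketch only: plan_cluster_deletion and the step names are hypothetical and do not correspond to a Lingjun API. The actual operations are performed in the console as described above.

```python
def plan_cluster_deletion(node_ids):
    """Return the ordered steps required before a cluster can be deleted.

    node_ids: IDs of the nodes still in the cluster (hypothetical input).
    """
    steps = []
    remaining = sorted(set(node_ids))
    if remaining:
        # Scale-in reinstalls the OS on removed nodes and erases local data,
        # so a backup step comes first.
        steps.append(("back_up_local_data", remaining))
        # All nodes must be removed by scaling in before deletion.
        steps.append(("scale_in", remaining))
    # The associated CPFS cluster is not deleted by this step.
    steps.append(("delete_cluster", []))
    return steps
```

An empty cluster yields only the delete_cluster step. A cluster that still has nodes yields the backup and scale-in steps first.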
Create a node group
You can create a node group for a Lingjun cluster in two ways:
Create a node group when you create the cluster. For more information, see Configure clusters and node groups.
Create a node group for an existing cluster.
Log on to the Intelligent Computing Lingjun console.
In the left-side navigation pane, choose Resources and Nodes > Cluster Management.
Click the target Cluster ID/Name.
Click the Node Group tab.
Click Create Group. Enter the group name, default node type, and other information.
(Optional) After the node group is created, you can edit its name or delete it.
Manage Lingjun nodes
A Lingjun compute node can perform only one operation at a time. These operations include cluster scale-out, cluster scale-in, node reinstallation, and node restart.
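The one-operation-per-node rule can be modeled as a per-node lock. The following Python sketch is an assumption-laden illustration, not the service's implementation: NodeOperationTracker and NodeBusyError are hypothetical names, and the tracker only mirrors the rule in client-side bookkeeping.

```python
class NodeBusyError(RuntimeError):
    """Raised when a node already has an operation in progress."""


class NodeOperationTracker:
    """Tracks at most one active operation per node (hypothetical helper)."""

    def __init__(self):
        self._active = {}  # node ID -> name of the operation in progress

    def active_operation(self, node_id):
        """Return the operation running on the node, or None if it is idle."""
        return self._active.get(node_id)

    def start(self, operation, node_ids):
        """Mark the operation as running on all nodes, or on none of them."""
        busy = sorted(n for n in node_ids if n in self._active)
        if busy:
            raise NodeBusyError(f"nodes busy: {', '.join(busy)}")
        for node_id in node_ids:
            self._active[node_id] = operation

    def finish(self, node_ids):
        """Mark the nodes as idle again."""
        for node_id in node_ids:
            self._active.pop(node_id, None)
```

With this model, a restart on one node and a reinstall on a different node can be tracked at the same time, while a second operation that includes a busy node is rejected as a whole.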
Purchase new nodes
Log on to the Intelligent Computing Lingjun console.
In the navigation pane on the left, choose Resources and Nodes > Node Management.
On the Node Management page, click Purchase Node.
Follow the on-screen instructions to purchase new nodes.
View node details
Log on to the Intelligent Computing Lingjun console.
In the navigation pane on the left, choose Resources and Nodes > Node Management to go to the Node Management page.
Click the All tab to view all nodes.
View basic node information, such as Node ID/Name, Image Name, and Zone.
To search for nodes, select criteria such as Image Name, Zone, or IP Address from the drop-down list, and then enter a keyword in the search box.
Click the Unused tab to view unused nodes. View basic information about the nodes, such as Node Model and Resource Group.
Log on to a node
Log on to the Intelligent Computing Lingjun console.
In the navigation pane on the left, choose Resources and Nodes > Node Management.
In the Actions column of the target node, click Remote Logon.
The logon username is root.
The logon password is the logon password of the cluster. For more information, see Cluster and group configurations.
Reinstall a node
Reinstalling a node deletes its data. Proceed with caution.
A node can be reinstalled only when the Lingjun cluster is in the Running state.
Reinstalling a node involves removing the old node from the CPFS cluster and then adding the new node information to the cluster.
Reinstall a node in the following situations:
To redeploy services.
To change the operating system version.
For O&M purposes.
Procedure
Log on to the Intelligent Computing Lingjun console.
In the navigation pane on the left, choose Resources and Nodes > Node Management.
On the Node Management page, click Reinstall for the target instance ID. In the dialog box that appears, select an image version, change the node name, enter and confirm the root password for the node, and then click Reinstall.
Restart a node
Restarting a node may affect business continuity.
A node can be restarted only when the Lingjun cluster is in the Running state.
Restart a node in the following situations:
To deploy new applications or services.
To modify system configurations.
For O&M purposes.
Procedure
Log on to the Intelligent Computing Lingjun console.
In the navigation pane on the left, choose Resources and Nodes > Node Management.
On the Node Management page, click Restart for the target instance ID.