Container Service for Kubernetes (ACK) provides the Slurm on Kubernetes solution and the ack-slurm-operator component. You can use them to deploy and manage the Simple Linux Utility for Resource Management (Slurm) scheduling system in ACK clusters in a convenient and efficient manner for high performance computing (HPC) and large-scale AI and machine learning (ML) workloads.
Introduction to Slurm
Slurm is a powerful open source platform for cluster resource management and job scheduling. It is designed to optimize the performance and efficiency of supercomputers and large compute clusters. Its key components work together to ensure high efficiency and flexibility of the system. The following figure shows how Slurm works.
slurmctld: the Slurm control daemon. As the brain of Slurm, the slurmctld monitors system resources, schedules jobs, and manages the cluster status. To enhance the reliability of the system, you can configure a secondary slurmctld to prevent service interruptions if the primary slurmctld fails. This ensures the high availability of the system.
slurmd: the Slurm node daemon. The slurmd runs on each compute node, receives instructions from the slurmctld, and manages jobs on that node, including starting and executing jobs, reporting job status, and preparing for new job assignments. The slurmd serves as the interface through which Slurm directly communicates with computing resources and schedules jobs.
slurmdbd: the Slurm database daemon. The slurmdbd is an optional component but it is essential for long-term management and auditing of large clusters because it maintains a centralized database to store job history and accounting information. The slurmdbd can aggregate data across multiple Slurm-managed clusters to simplify and improve the efficiency of data management.
SlurmCLI: SlurmCLI provides the following commands to facilitate job management and system monitoring (a brief usage example follows the list):
scontrol: used to manage clusters and control cluster configurations.
squeue: used to query the status of jobs in the queue.
srun: used to submit and manage jobs.
sbatch: used to submit batch job scripts. This command helps you schedule and manage computing resources.
sinfo: used to query the overall status of a cluster, including the availability of nodes.
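For reference, a typical interactive session with these commands might look like the following. The script name and job ID are illustrative.

```bash
sinfo                 # Check which partitions and nodes are available.
sbatch job.sh         # Submit a batch script; Slurm prints the assigned job ID.
squeue                # Watch the job while it waits in the queue or runs.
scontrol show job 1   # Inspect the details of job 1 after it is scheduled.
```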
Introduction to Slurm on ACK
The Slurm Operator uses the SlurmCluster custom resource (CR) to define the configuration required for managing Slurm clusters and to handle control plane management, which simplifies the deployment and maintenance of Slurm-managed clusters. The following figure shows the architecture of Slurm on ACK. A cluster administrator can deploy and manage a Slurm-managed cluster by using the SlurmCluster. The Slurm Operator creates Slurm control components in the cluster based on the SlurmCluster. A Slurm configuration file can be mounted to a control component by using a shared volume or a ConfigMap.
Prerequisites
An ACK cluster that runs Kubernetes 1.22 or later is created, and the cluster contains one GPU-accelerated node. For more information, see Create an ACK cluster with GPU-accelerated nodes and Update clusters.
Step 1: Install the ack-slurm-operator component
Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
On the Marketplace page, search for the ack-slurm-operator component and click the component. On the details page of the ack-slurm-operator component, click Deploy in the upper-right corner. In the Deploy panel, configure the parameters for the component.
You need to only specify the Cluster parameter. Use the default settings for other parameters.
After you configure the parameters, click OK.
Step 2: Create a Slurm-managed cluster
Manually create a Slurm-managed cluster
Create a Secret for MUNGE (MUNGE Uid 'N' Gid Emporium)-based authentication in the ACK cluster.
Run the following command to create a key by using OpenSSL. This key is used for MUNGE-based authentication.
```bash
openssl rand -base64 512 | tr -d '\r\n'
```
Run the following command to create a Secret. This Secret is used to store the key created in the previous step.
```bash
kubectl create secret generic <$MungeKeyName> --from-literal=munge.key=<$MungeKey>
```
Replace `<$MungeKeyName>` with the custom name of your key, such as `mungekey`. Replace `<$MungeKey>` with the key string that is generated in the previous step.
After you perform the preceding steps, you can configure or associate the Secret with the Slurm-managed cluster to obtain and use the key for MUNGE-based authentication.
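For example, you can combine the two commands as follows. The Secret name slurm-munge-key is illustrative.

```bash
# Generate a random MUNGE key and store it in a Kubernetes Secret in one step.
MUNGE_KEY=$(openssl rand -base64 512 | tr -d '\r\n')
kubectl create secret generic slurm-munge-key --from-literal=munge.key="$MUNGE_KEY"
```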
Run the following command to create a ConfigMap for the Slurm-managed cluster.
In this example, the following ConfigMap is mounted to a pod by specifying the slurmConfPath parameter in a CR. This ensures that the pod configuration can be automatically restored to the expected state even if the pod is recreated.
The data parameter in the following sample code specifies a sample ConfigMap. To generate a ConfigMap, we recommend that you use the Easy Configurator or Full Configurator tool.
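The following command is a minimal illustrative sketch of such a ConfigMap and how to apply it. The slurm.conf entries are placeholders, not the generated configuration; produce the real content with the configurator tools.

```bash
# Illustrative sketch only: create a ConfigMap named slurm-test that stores a slurm.conf file.
# The slurm.conf entries below are placeholders; generate the real content with the
# Easy Configurator or Full Configurator tool.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-test
data:
  slurm.conf: |
    ClusterName=slurm-job-demo
    AuthType=auth/munge
    # ... scheduler, node, and partition definitions generated by the configurator ...
EOF
```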
Expected output:
```
configmap/slurm-test created
```
The expected output indicates that the ConfigMap is created.
Submit the SlurmCluster CR.
Create a file named slurmcluster.yaml and copy the following content to the file. Sample code:
Note: In this example, an Ubuntu image that contains Compute Unified Device Architecture (CUDA) 11.4 and Slurm 23.06 is used. The image contains the component that is developed by Alibaba Cloud for auto scaling of on-cloud nodes. You can also create and upload a custom image if needed.
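As a rough orientation, a SlurmCluster CR of this shape might look like the following sketch. Only apiVersion, kind, slurmConfPath, mungeConfPath, headGroupSpec, and workerGroupSpecs appear in this topic; other field names (such as replicas and groupName), the mount paths, and the pod template layout are assumptions and may differ from the actual CRD schema.

```yaml
# Illustrative sketch only; verify the exact schema against the ack-slurm-operator CRD.
apiVersion: kai.alibabacloud.com/v1
kind: SlurmCluster
metadata:
  name: slurm-job-demo
spec:
  slurmConfPath: /etc/slurm-config/     # Must match the mount target of the Slurm ConfigMap.
  mungeConfPath: /etc/munge-config/     # Must match the mount target of the MUNGE Secret.
  headGroupSpec:
    template:
      spec:
        containers:
        - name: slurm
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
          # volumeMounts for the Slurm ConfigMap and MUNGE Secret go here.
  workerGroupSpecs:
  - groupName: cpu        # Assumed field name for the worker group.
    replicas: 2           # Assumed field name for the number of workers.
    template:
      spec:
        containers:
        - name: slurm
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
  - groupName: cpu1
    replicas: 2
    template:
      spec:
        containers:
        - name: slurm
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
```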
A Slurm-managed cluster that has one head node and four worker nodes is created based on the preceding SlurmCluster CR. The Slurm-managed cluster runs as a pod in the ACK cluster. The values of the mungeConfPath and slurmConfPath parameters in the SlurmCluster CR must be the same as the mount targets that are specified in the headGroupSpec and workerGroupSpecs parameters.
Run the following command to deploy the slurmcluster.yaml file to the cluster:
```bash
kubectl apply -f slurmcluster.yaml
```
Expected output:
```
slurmcluster.kai.alibabacloud.com/slurm-job-demo created
```
Run the following command to check whether the created Slurm-managed cluster runs as expected:
```bash
kubectl get slurmcluster
```
Expected output:
```
NAME             AVAILABLE WORKERS   STATUS   AGE
slurm-job-demo   5                   ready    14m
```
The output indicates that the Slurm-managed cluster is deployed and its five nodes are ready.
Run the following command to check whether the pods in the Slurm-managed cluster named slurm-job-demo are in the Running state:
```bash
kubectl get pod
```
Expected output:
```
NAME                           READY   STATUS    RESTARTS   AGE
slurm-job-demo-head-x9sgs      1/1     Running   0          14m
slurm-job-demo-worker-cpu-0    1/1     Running   0          14m
slurm-job-demo-worker-cpu-1    1/1     Running   0          14m
slurm-job-demo-worker-cpu1-0   1/1     Running   0          14m
slurm-job-demo-worker-cpu1-1   1/1     Running   0          14m
```
The output indicates that the head node and four worker nodes run as expected in the Slurm-managed cluster.
Create a Slurm-managed cluster by using Helm
To quickly install and manage a Slurm-managed cluster and flexibly modify its configurations, you can use Helm to install the SlurmCluster chart provided by Alibaba Cloud. Download the Helm chart for Slurm-managed clusters from charts-incubator (the chart repository of Alibaba Cloud). After you configure the parameters, Helm creates resources such as the role-based access control (RBAC) resources, ConfigMap, Secret, and the Slurm-managed cluster.
The Helm chart contains the configurations of the following resources.
| Resource type | Resource name | Description |
| --- | --- | --- |
| ConfigMap | {{ .Values.slurmConfigs.configMapName }} | When the .Values.slurmConfigs.createConfigsByConfigMap parameter is set to True, the ConfigMap is created and used to store user-defined Slurm configurations. The ConfigMap is mounted to the path specified by the .Values.slurmConfigs.slurmConfigPathInPod parameter. The specified path is also rendered to the .Spec.SlurmConfPath parameter of the Slurm-managed cluster and the startup commands of the pod. When the pod is started, the ConfigMap is copied to the /etc/slurm/ path and access to the ConfigMap is limited. |
| ServiceAccount | {{ .Release.Namespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the configurations of the Slurm-managed cluster. The Slurm-managed cluster can use this resource to enable auto scaling of on-cloud nodes. |
| Role | {{ .Release.Namespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the configurations of the Slurm-managed cluster. The Slurm-managed cluster can use this resource to enable auto scaling of on-cloud nodes. |
| RoleBinding | {{ .Release.Namespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the configurations of the Slurm-managed cluster. The Slurm-managed cluster can use this resource to enable auto scaling of on-cloud nodes. |
| Role | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the Secrets in the SlurmOperator namespace. When Slurm and Kubernetes are deployed on the same batch of physical servers, the Slurm-managed cluster can use this resource to renew tokens. |
| RoleBinding | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the Secrets in the SlurmOperator namespace. When Slurm and Kubernetes are deployed on the same batch of physical servers, the Slurm-managed cluster can use this resource to renew tokens. |
| Secret | {{ .Values.mungeConfigs.secretName }} | This resource is used by Slurm components for authentication when they communicate with each other. When the .Values.mungeConfigs.createConfigsBySecret parameter is set to True, this resource is automatically created and contains the following content: "munge.key"={{ .Values.mungeConfigs.content }}. In this case, the .Values.mungeConfigs.mungeConfigPathInPod parameter is rendered as the .Spec.MungeConfPath parameter and then as the mount path of the resource in the pod. The startup commands of the pod initialize /etc/munge/munge.key based on the mount path. |
| SlurmCluster | | The rendered Slurm-managed cluster. |
The following table describes the relevant parameters.
| Parameter | Example | Description |
| --- | --- | --- |
| clusterName | "" | The cluster name. The cluster name is used to generate Secrets and roles. The value must be the same as the cluster name specified in other Slurm configuration files. |
| headNodeConfig | N/A | This parameter is required. It specifies the configurations of the slurmctld pod. |
| workerNodesConfig | N/A | The configurations of the slurmd pods. |
| workerNodesConfig.deleteSelfBeforeSuspend | true | If you set the value to true, a preStop hook is automatically added to the worker pod. The preStop hook drains the node and marks it as unschedulable before the node is removed. |
| slurmdbdConfigs | N/A | The configurations of the slurmdbd pod. If you leave this parameter empty, no pod is created to run slurmdbd. |
| slurmrestdConfigs | N/A | The configurations of the slurmrestd pod. If you leave this parameter empty, no pod is created to run slurmrestd. |
| headNodeConfig.hostNetwork, slurmdbdConfigs.hostNetwork, slurmrestdConfigs.hostNetwork, workerNodesConfig.workerGroups[].hostNetwork | false | Rendered as the hostNetwork parameter of the corresponding pod. |
| headNodeConfig.setHostnameAsFQDN, slurmdbdConfigs.setHostnameAsFQDN, slurmrestdConfigs.setHostnameAsFQDN, workerNodesConfig.workerGroups[].setHostnameAsFQDN | false | Rendered as the setHostnameAsFQDN parameter of the corresponding pod. |
| headNodeConfig.nodeSelector, slurmdbdConfigs.nodeSelector, slurmrestdConfigs.nodeSelector, workerNodesConfig.workerGroups[].nodeSelector | | Rendered as the nodeSelector parameter of the corresponding pod. |
| headNodeConfig.tolerations, slurmdbdConfigs.tolerations, slurmrestdConfigs.tolerations, workerNodesConfig.workerGroups[].tolerations | | Rendered as the tolerations of the corresponding pod. |
| headNodeConfig.affinity, slurmdbdConfigs.affinity, slurmrestdConfigs.affinity, workerNodesConfig.workerGroups[].affinity | | Rendered as the affinity rules of the corresponding pod. |
| headNodeConfig.resources, slurmdbdConfigs.resources, slurmrestdConfigs.resources, workerNodesConfig.workerGroups[].resources | | Rendered as the resources of the primary container in the corresponding pod. The resource limit of the primary container in a worker pod is also rendered as the resource limit of the Slurm node. |
| headNodeConfig.image, slurmdbdConfigs.image, slurmrestdConfigs.image, workerNodesConfig.workerGroups[].image | "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59" | Rendered as the image of the corresponding pod. You can also build a custom image by using the framework/slurm/building-slurm-image directory of the AliyunContainerService/ai-models-on-ack repository on GitHub. |
| headNodeConfig.imagePullSecrets, slurmdbdConfigs.imagePullSecrets, slurmrestdConfigs.imagePullSecrets, workerNodesConfig.workerGroups[].imagePullSecrets | | Rendered as the Secrets used to pull the image of the corresponding pod. |
| headNodeConfig.podSecurityContext, slurmdbdConfigs.podSecurityContext, slurmrestdConfigs.podSecurityContext, workerNodesConfig.workerGroups[].podSecurityContext | | Rendered as the security context of the corresponding pod. |
| headNodeConfig.securityContext, slurmdbdConfigs.securityContext, slurmrestdConfigs.securityContext, workerNodesConfig.workerGroups[].securityContext | | Rendered as the security context of the primary container in the corresponding pod. |
| headNodeConfig.volumeMounts, slurmdbdConfigs.volumeMounts, slurmrestdConfigs.volumeMounts, workerNodesConfig.workerGroups[].volumeMounts | N/A | Rendered as the volume mount configurations of the primary container in the corresponding pod. |
| headNodeConfig.volumes, slurmdbdConfigs.volumes, slurmrestdConfigs.volumes, workerNodesConfig.workerGroups[].volumes | N/A | Rendered as the volumes mounted to the corresponding pod. |
| slurmConfigs.slurmConfigPathInPod | "" | The mount path of the Slurm configuration file in the pod. If the Slurm configuration file is mounted to the pod as a volume, you must set the value to the path to which the slurm.conf file is mounted. The startup commands of the pod copy the file in the mount path to the /etc/slurm/ path and limit access to the file. |
| slurmConfigs.createConfigsByConfigMap | true | Specifies whether to automatically create a ConfigMap to store the Slurm configurations. |
| slurmConfigs.configMapName | "" | The name of the ConfigMap that stores the Slurm configurations. |
| slurmConfigs.filesInConfigMap | "" | The content of the ConfigMap that is automatically created to store the Slurm configurations. |
| mungeConfigs.mungeConfigPathInPod | N/A | The mount path of the MUNGE configuration file in the pod. If the MUNGE configuration file is mounted to the pod as a volume, you must set the value to the path to which the munge.key file is mounted. The startup commands of the pod copy the file in the mount path to the /etc/munge/ path and limit access to the file. |
| mungeConfigs.createConfigsBySecret | N/A | Specifies whether to automatically create a Secret to store the MUNGE configurations. |
| mungeConfigs.secretName | N/A | The name of the Secret that stores the MUNGE configurations. |
| mungeConfigs.content | N/A | The content of the Secret that is automatically created to store the MUNGE configurations. |
For more information about slurmConfigs.filesInConfigMap, see Slurm System Configuration Tool (schedmd.com).
If you modify the slurmConfigs.filesInConfigMap parameter after the pod is created, you must recreate the pod to make the modification take effect. In this case, we recommend that you check whether the parameter is modified as required before you recreate the pod.
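For orientation, a values.yaml excerpt that wires these parameters together might look like the following sketch. The nesting follows the parameter names in the preceding table, but the exact structure of the chart's default values.yaml may differ; treat the layout and example values as assumptions.

```yaml
# Illustrative excerpt only; compare with the values.yaml shipped in the chart before use.
clusterName: "slurm-job-demo"
headNodeConfig:
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
workerNodesConfig:
  deleteSelfBeforeSuspend: true
  workerGroups:
  - workerCount: 2        # Number of worker replicas in this group.
    image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
slurmConfigs:
  createConfigsByConfigMap: true
  configMapName: "slurm-config"
  slurmConfigPathInPod: "/etc/slurm-config/"
  filesInConfigMap:
    slurm.conf: |
      # slurm.conf content generated with the Slurm configurator tools.
mungeConfigs:
  createConfigsBySecret: true
  secretName: "slurm-munge-key"
  mungeConfigPathInPod: "/etc/munge-config/"
  content: "<base64-encoded MUNGE key>"
```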
Perform the following operations to install the Helm chart:
Run the following command to add the Helm repository provided by Alibaba Cloud to your local Helm client:
```bash
helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/
```
This operation allows you to access various charts provided by Alibaba Cloud, including the chart for Slurm-managed clusters.
Run the following command to pull and decompress the Helm chart:
```bash
helm pull aliyun/ack-slurm-cluster --untar=true
```
This operation creates a directory named ack-slurm-cluster in the current directory. The ack-slurm-cluster directory contains all the files and templates of the chart.
Run the following commands to modify the chart parameters in the values.yaml file.
The values.yaml file contains the default configurations of the chart. You can modify this file to modify the parameter settings, such as Slurm configurations, resource requests and limits, and storage, based on your business requirements.
```bash
cd ack-slurm-cluster
vi values.yaml
```
Use Helm to install the chart.
```bash
cd ..
helm install my-slurm-cluster ack-slurm-cluster # Replace my-slurm-cluster with the actual value.
```
This operation deploys the Slurm-managed cluster.
Check whether the Slurm-managed cluster is deployed.
After the deployment is complete, you can use the kubectl tool provided by Kubernetes to check the deployment status. Make sure that the Slurm-managed cluster runs as expected.
```bash
kubectl get pods -l app.kubernetes.io/name=slurm-cluster
```
Step 3: Log on to the Slurm-managed cluster
Log on as a Kubernetes cluster administrator
An administrator of a Kubernetes cluster has the permissions to manage the Kubernetes cluster. A Slurm-managed cluster runs as a pod in the Kubernetes cluster. Therefore, the Kubernetes cluster administrator can use kubectl to log on to any pod of any Slurm-managed cluster in the Kubernetes cluster, and has the permissions of the root user of the Slurm-managed cluster by default.
Run the following command to log on to a pod of the Slurm-managed cluster:
```bash
# Replace slurm-job-demo-head-x9sgs with the name of the head pod in your cluster.
kubectl exec -it slurm-job-demo-head-x9sgs -- bash
```
Log on as a regular user of the Slurm-managed cluster
The administrators or regular users of a Slurm-managed cluster may not have the permissions to run the kubectl exec command. If you use a Slurm-managed cluster as an administrator or a regular user, you need to log on to the Slurm-managed cluster by using SSH.
You can use an external IP address of a Service to log on to the head pod for long-term connections and scalability. This solution is suitable for scenarios in which long-term and stable connections are required. You can access the Slurm-managed cluster from anywhere within the internal network by using a Server Load Balancer (SLB) instance and its associated external IP address.
You can use the port-forward command to forward requests for a temporary period. This solution is suitable for short-term O&M or debugging because it relies on the continuous execution of the kubectl port-forward command.
Log on to the head pod by using an external IP address of a Service
Create a LoadBalancer Service to expose internal services in the cluster for external access. For more information, see Use an existing SLB instance to expose an application or Use an automatically created SLB instance to expose an application.
The LoadBalancer Service must use an internal-facing Classic Load Balancer (CLB) instance.
You need to add the `kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1` and `kai.alibabacloud.com/slurm-node-type: head` labels to the Service so that the Service can route incoming requests to the expected pod.
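A hedged sketch of such a Service is shown below. The Service name, the internal CLB annotation, and the port mapping are assumptions; depending on how the head pod is labeled in your cluster, you may also need a selector that matches it.

```yaml
# Illustrative sketch only.
apiVersion: v1
kind: Service
metadata:
  name: slurm-head-ssh
  labels:
    kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1   # Replace with your Slurm cluster name.
    kai.alibabacloud.com/slurm-node-type: head
  annotations:
    # Use an internal-facing CLB instance, as required above.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: intranet
spec:
  type: LoadBalancer
  ports:
  - name: ssh
    port: 22        # SSH port of the head pod.
    targetPort: 22
```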
Run the following command to obtain the external IP address of the LoadBalancer Service:
```bash
kubectl get svc
```
Run the following command to log on to the head pod associated with the Service by using SSH:
```bash
# Replace $YOURUSER with the actual username of the head pod and $EXTERNAL_IP with the external IP address of the Service.
ssh $YOURUSER@$EXTERNAL_IP
```
Forward requests by using the port-forward command
To use the port-forward command to forward requests, you must save the KubeConfig file of the Kubernetes cluster to your local host. This may cause security risks. We recommend that you do not use this method in production environments.
Run the following command to enable a port of the local host for request forwarding and map the port of the local host to port 22 of the head pod in which the slurmctld runs in the cluster. By default, SSH uses port 22.
```bash
# Replace $NAMESPACE, $CLUSTERNAME, and $LOCALPORT with the actual values.
kubectl port-forward -n $NAMESPACE svc/$CLUSTERNAME $LOCALPORT:22
```
Run the following command while the port-forward command is running. All users on the current host can log on to the cluster and submit jobs.
```bash
# Replace $YOURUSER with the username that you want to use to log on to the head pod.
ssh -p $LOCALPORT $YOURUSER@localhost
```
Step 4: Use the Slurm-managed cluster
The following sections describe how to synchronize users across nodes, share logs across nodes, and perform auto scaling for the Slurm-managed cluster.
Synchronize users across nodes
Slurm does not provide a centralized user authentication service. When you run the sbatch command to submit jobs to the Slurm-managed cluster, the jobs may fail to be executed if the account of the user who submits the jobs does not exist on the node that is selected to execute the jobs. To resolve this issue, you can configure Lightweight Directory Access Protocol (LDAP) for the Slurm-managed cluster to manage users. LDAP serves as a centralized backend service for authentication. This way, Slurm can authenticate user identities based on this service. Perform the following operations:
Create a file named ldap.yaml and copy the following content to the file to create a basic LDAP instance that stores and manages user information.
The ldap.yaml file defines an LDAP backend pod and its associated Service. The pod contains an LDAP container, and the Service exposes the LDAP service to make the LDAP service accessible within the network.
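As an illustration, a manifest that produces the Deployment, Service, and Secret shown in the expected output below might look like the following sketch. The osixia/openldap image, its environment variables, and the password value are assumptions; replace them with the image and credentials you actually use.

```yaml
# Illustrative sketch only.
apiVersion: v1
kind: Secret
metadata:
  name: ldap-secret
type: Opaque
stringData:
  LDAP_ADMIN_PASSWORD: "ChangeMe123!"   # Replace with a strong password.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ldap
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ldap
  template:
    metadata:
      labels:
        app: ldap
    spec:
      containers:
      - name: openldap
        image: osixia/openldap:1.5.0     # Assumed LDAP server image.
        ports:
        - containerPort: 389
        env:
        - name: LDAP_ORGANISATION
          value: "example"
        - name: LDAP_DOMAIN
          value: "example.org"           # Matches dc=example,dc=org used later in this topic.
        - name: LDAP_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: ldap-secret
              key: LDAP_ADMIN_PASSWORD
---
apiVersion: v1
kind: Service
metadata:
  name: ldap-service
spec:
  selector:
    app: ldap
  ports:
  - port: 389
    targetPort: 389
```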
Run the following command to deploy the LDAP backend Service:
```bash
kubectl apply -f ldap.yaml
```
Expected output:
```
deployment.apps/ldap created
service/ldap-service created
secret/ldap-secret created
```
Optional. Create a file named phpldapadmin.yaml and copy the following content to the file to deploy an LDAP frontend pod and its associated Service. The LDAP frontend pod and its associated Service are used to configure frontend interfaces to improve management efficiency.
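As an illustration, the frontend manifest might look like the following sketch. The osixia/phpldapadmin image and its environment variables are assumptions; any LDAP web frontend that can reach ldap-service works in the same way.

```yaml
# Illustrative sketch only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phpldapadmin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phpldapadmin
  template:
    metadata:
      labels:
        app: phpldapadmin
    spec:
      containers:
      - name: phpldapadmin
        image: osixia/phpldapadmin:0.9.0       # Assumed frontend image.
        ports:
        - containerPort: 80
        env:
        - name: PHPLDAPADMIN_LDAP_HOSTS
          value: "ldap-service"                # Points to the LDAP backend Service created above.
        - name: PHPLDAPADMIN_HTTPS
          value: "false"
---
apiVersion: v1
kind: Service
metadata:
  name: phpldapadmin-service
spec:
  selector:
    app: phpldapadmin
  ports:
  - port: 80
    targetPort: 80
```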
Run the following command to deploy the LDAP frontend Service:
```bash
kubectl apply -f phpldapadmin.yaml
```
Log on to a pod in the Slurm-managed cluster based on Step 3 and run the following commands to install the LDAP client package:
```bash
apt update
apt install libnss-ldapd
```
After the libnss-ldapd package is installed, configure the network authentication service for the Slurm-managed cluster in the pod.
Run the following commands to install the Vim package for editing scripts and files:
```bash
apt update
apt install vim
```
Modify the following parameters in the /etc/ldap/ldap.conf file to configure the LDAP client:
```
...
BASE    dc=example,dc=org     # Replace the value with the distinguished name of the root node in the LDAP directory structure.
URI     ldap://ldap-service   # Replace the value with the uniform resource identifier (URI) of your LDAP server.
...
```
Modify the following parameters in the /etc/nslcd.conf file to define the connection to the LDAP server:
```
...
uri ldap://ldap-service    # Replace the value with the URI of your LDAP server.
base dc=example,dc=org     # Specify this parameter based on your LDAP directory structure.
...
tls_cacertfile /etc/ssl/certs/ca-certificates.crt    # Specify the path to the certificate authority (CA) certificate file that is used to verify the certificate of the LDAP server.
...
```
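After you update the two configuration files, restart the LDAP name service daemon and verify the setup. The following commands are a sketch that assumes nslcd is managed with the service command in the Ubuntu-based image.

```bash
service nslcd restart   # Reload the LDAP client settings.
getent passwd           # LDAP accounts should now appear in the user database.
```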
Share and access logs
By default, the job logs that are generated when you use the sbatch command are directly stored on the node that executes the jobs. This can be inconvenient for viewing logs. To view logs with ease, you can create a File Storage NAS (NAS) file system to store all job logs in accessible directories. This way, even if computing jobs are executed on different nodes, the logs that are generated for the jobs can be centrally collected and stored. This facilitates log management. Perform the following operations:
Create a NAS file system to store and share the logs of each node. For more information, see Create a file system.
Log on to the ACK console, and create a persistent volume (PV) and a persistent volume claim (PVC) for the NAS file system. For more information, see Mount a statically provisioned NAS volume.
Modify the SlurmCluster CR.
Configure the volumeMounts and volumes parameters in the headGroupSpec and workerGroupSpecs parameters to reference the created PVC and mount the PVC to the /home directory. Example:
```yaml
headGroupSpec:
  ...
  # Specify /home as the mount target.
  volumeMounts:
  - mountPath: /home
    name: test                # The name of the volume that references the PVC.
  volumes:                    # Add the definition of the PVC.
  - name: test                # The value of this parameter must be the same as that of the name parameter in the volumeMounts parameter.
    persistentVolumeClaim:
      claimName: test         # Replace the value with the name of the PVC.
  ...
workerGroupSpecs:
  # ... Repeat the preceding volumes and volumeMounts configuration for each workerGroupSpec parameter.
```
Run the following command to deploy the SlurmCluster CR.
Important: If the SlurmCluster CR fails to be deployed, run the `kubectl delete slurmcluster slurm-job-demo` command to delete the CR and redeploy it.
```bash
kubectl apply -f slurmcluster.yaml
```
After the SlurmCluster CR is deployed, worker nodes can share the NAS file system.
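To confirm that the /home directory is shared, you can write a file from one worker pod and read it from another. The pod names below are illustrative.

```bash
kubectl exec -it slurm-job-demo-worker-cpu-0 -- touch /home/nas-share-test
kubectl exec -it slurm-job-demo-worker-cpu-1 -- ls -l /home/nas-share-test
```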
Perform auto scaling for the Slurm-managed cluster
The root path of the default Slurm image contains executable files and scripts such as slurm-resume.sh, slurm-suspend.sh, and slurmctld-copilot. They are used to interact with the slurmctld to scale the Slurm-managed cluster.
Auto scaling for Slurm clusters based on on-cloud nodes
Local nodes: the physical compute nodes that are directly connected to the slurmctld.
On-cloud nodes: the logical nodes. Logical nodes are VM instances that can be created and destroyed on demand by cloud service providers.
Auto scaling for Slurm on ACK
Procedure
Configure auto scaling permissions. If you installed the Slurm-managed cluster by using Helm, the auto scaling permissions are automatically configured for the slurmctld pod, and you can skip this step.
The head pod requires permissions to access and update the SlurmCluster CR for auto scaling. Therefore, we recommend that you use RBAC to grant the required permissions to the head pod when you use the auto scaling feature. You can perform the following steps to configure permissions:
First, you must create the ServiceAccount, Role, and RoleBinding required by the slurmctld pod. Example: Set the name of the Slurm-managed cluster to slurm-job-demo and the namespace to default. Create a file named rbac.yaml and copy the following content to the file:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: slurm-job-demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: slurm-job-demo
rules:
- apiGroups: ["kai.alibabacloud.com"]
  resources: ["slurmclusters"]
  verbs: ["get", "watch", "list", "update", "patch"]
  resourceNames: ["slurm-job-demo"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: slurm-job-demo
subjects:
- kind: ServiceAccount
  name: slurm-job-demo
roleRef:
  kind: Role
  name: slurm-job-demo
  apiGroup: rbac.authorization.k8s.io
```
Run the `kubectl apply -f rbac.yaml` command to submit the resource list.
Grant permissions to the slurmctld pod. Run the `kubectl edit slurmcluster slurm-job-demo` command to modify the Slurm-managed cluster. Set the Spec.Slurmctld.Template.Spec.ServiceAccountName parameter to the ServiceAccount you created.
```yaml
apiVersion: kai.alibabacloud.com/v1
kind: SlurmCluster
...
spec:
  slurmctld:
    template:
      spec:
        serviceAccountName: slurm-job-demo
...
```
Rebuild the StatefulSet that manages the slurmctld pod to apply the changes you just made. Run the `kubectl get sts slurm-job-demo` command to find the StatefulSet that manages the slurmctld pod. Run the `kubectl delete sts slurm-job-demo` command to delete this StatefulSet. The Slurm Operator rebuilds the StatefulSet and applies the new configurations.
Configure the auto scaling file /etc/slurm/slurm.conf.
Manage ConfigMaps by using a shared volume
```
# The following parameters are required if you use on-cloud nodes.
# The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
SuspendTimeout=600
ResumeTimeout=600
# The interval at which the node is automatically suspended when no job runs on the node.
SuspendTime=600
# Set the number of nodes that can be scaled per minute.
ResumeRate=1
SuspendRate=1
# You must set the value of the NodeName parameter in the ${cluster_name}-worker-${group_name}- format. You must specify the amount of resources for the node in this line. Otherwise, the slurmctld pod
# considers that the node has only one vCPU. Make sure that the resources that you specified on the on-cloud nodes are the same as those declared in the workerGroupSpec parameter. Otherwise, resources may be wasted.
NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
# The following configurations are fixed. Keep them unchanged.
CommunicationParameters=NoAddrCache
ReconfigFlags=KeepPowerSaveSettings
SuspendProgram="/slurm-suspend.sh"
ResumeProgram="/slurm-resume.sh"
```
Manually manage ConfigMaps.
If slurm.conf is stored in a ConfigMap named slurm-config, you can run the `kubectl edit configmap slurm-config` command to add the following configurations to the ConfigMap:
```yaml
slurm.conf: |
  ...
  # The following parameters are required if you use on-cloud nodes.
  # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
  SuspendTimeout=600
  ResumeTimeout=600
  # The interval at which the node is automatically suspended when no job runs on the node.
  SuspendTime=600
  # Set the number of nodes that can be scaled per minute.
  ResumeRate=1
  SuspendRate=1
  # You must set the value of the NodeName parameter in the ${cluster_name}-worker-${group_name}- format. You must specify the amount of resources for the node in this line. Otherwise, the slurmctld pod
  # considers that the node has only one vCPU. Make sure that the resources that you specified on the on-cloud nodes are the same as those declared in the workerGroupSpec parameter. Otherwise, resources may be wasted.
  NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
  # The following configurations are fixed. Keep them unchanged.
  CommunicationParameters=NoAddrCache
  ReconfigFlags=KeepPowerSaveSettings
  SuspendProgram="/slurm-suspend.sh"
  ResumeProgram="/slurm-resume.sh"
```
Use Helm to manage ConfigMaps
Add the following ConfigMap to the values.yaml file:
```yaml
slurm.conf: |
  ...
  # The following parameters are required if you use on-cloud nodes.
  # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
  SuspendTimeout=600
  ResumeTimeout=600
  # The interval at which the node is automatically suspended when no job runs on the node.
  SuspendTime=600
  # Set the number of nodes that can be scaled per minute.
  ResumeRate=1
  SuspendRate=1
  # You must set the value of the NodeName parameter in the ${cluster_name}-worker-${group_name}- format. You must specify the amount of resources for the node in this line. Otherwise, the slurmctld pod
  # considers that the node has only one vCPU. Make sure that the resources that you specified on the on-cloud nodes are the same as those declared in the workerGroupSpec parameter. Otherwise, resources may be wasted.
  NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
  # The following configurations are fixed. Keep them unchanged.
  CommunicationParameters=NoAddrCache
  ReconfigFlags=KeepPowerSaveSettings
  SuspendProgram="/slurm-suspend.sh"
  ResumeProgram="/slurm-resume.sh"
```
Run the `helm upgrade` command to update the Slurm configuration.
Apply the new configuration
If the name of the Slurm-managed cluster is slurm-job-demo, run the `kubectl delete sts slurm-job-demo` command to apply the new configuration to the slurmctld pod.
Change the number of worker node replicas to 0 in the slurmcluster.yaml file so that you can view node scaling activities in subsequent steps.
Manual management
Set the name of the Slurm-managed cluster to slurm-job-demo. Run the `kubectl edit slurmcluster slurm-job-demo` command to change the value of workerCount to 0 in the Slurm-managed cluster. This operation changes the number of worker node replicas to 0.
Manage by using Helm
In the values.yaml file, change the .Values.workerGroup[].workerCount parameter to 0. Then, run the `helm upgrade slurm-job-demo .` command to update the current Helm chart. This operation sets the number of worker node replicas to 0.
Submit a job by using the sbatch command.
Run the following command to create a Shell script:
```bash
cat << EOF > cloudnodedemo.sh
```
Enter the following content after the command prompt:
```
> #!/bin/bash
> srun hostname
> EOF
```
Run the following command to check whether the content of the script is correct:
```bash
cat cloudnodedemo.sh
```
Expected output:
```
#!/bin/bash
srun hostname
```
The output indicates that the script content is correct.
Run the following command to submit the script to the Slurm-managed cluster for processing:
```bash
sbatch cloudnodedemo.sh
```
Expected output:
```
Submitted batch job 1
```
The output indicates that the job is submitted and assigned a job ID.
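While the on-cloud node is being provisioned, the job typically waits in the queue. You can monitor it with squeue; the output shown in the comments is illustrative.

```bash
squeue
# JOBID PARTITION     NAME  USER  ST  TIME  NODES NODELIST(REASON)
#     1     debug cloudnod  root  CF  0:00      1 slurm-job-demo-worker-cpu-0
```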
View the cluster scaling results.
Run the following command to view the scaling logs of the Slurm-managed cluster:
```bash
cat /var/log/slurm-resume.log
```
Expected output:
```
namespace: default
cluster: slurm-demo
resume called, args [slurm-demo-worker-cpu-0]
slurm cluster metadata: default slurm-demo
get SlurmCluster CR slurm-demo succeed
hostlists: [slurm-demo-worker-cpu-0]
resume node slurm-demo-worker-cpu-0
resume worker -cpu-0
resume node -cpu-0 end
```
The output indicates that the Slurm-managed cluster automatically adds one compute node based on the workloads to execute the submitted job.
Run the following command to view the pods in the cluster:
```bash
kubectl get pod
```
Expected output:
```
NAME                      READY   STATUS    RESTARTS   AGE
slurm-demo-head-9hn67     1/1     Running   0          21m
slurm-demo-worker-cpu-0   1/1     Running   0          43s
```
The output indicates that the slurm-demo-worker-cpu-0 pod is added to the cluster. In this case, the cluster is scaled out when the job is submitted.
Run the following command to view the nodes in the cluster:
```bash
sinfo
```
Expected output:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite   10     idle~  slurm-job-demo-worker-cpu-[2-10]
debug*    up     infinite   1      idle   slurm-job-demo-worker-cpu-[0-1]
```
The output indicates that the slurm-demo-worker-cpu-0 node is the newly started node and another 10 on-cloud nodes are available for scale-out.
Run the following command to view the information about the executed job:
```bash
scontrol show job 1
```
Expected output:
```
JobId=1 JobName=cloudnodedemo.sh
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2024-05-28T11:37:36 EligibleTime=2024-05-28T11:37:36
   AccrueTime=2024-05-28T11:37:36
   StartTime=2024-05-28T11:37:36 EndTime=2024-05-28T11:37:36 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-28T11:37:36 Scheduler=Main
   Partition=debug AllocNode:Sid=slurm-job-demo:93
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=slurm-job-demo-worker-cpu-0
   BatchHost=slurm-job-demo-worker-cpu-0
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=1M,node=1,billing=1
   AllocTRES=cpu=1,mem=1M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=//cloudnodedemo.sh
   WorkDir=/
   StdErr=//slurm-1.out
   StdIn=/dev/null
   StdOut=//slurm-1.out
   Power=
```
In the output, NodeList=slurm-job-demo-worker-cpu-0 indicates that the job is executed on the newly added node.
After a period of time, run the following command to view the node scale-in results:
```bash
sinfo
```
Expected output:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite   11     idle~  slurm-demo-worker-cpu-[0-10]
```
The output indicates that the number of nodes available for scale-out becomes 11. This indicates that the automatic scale-in is complete.