Container Service for Kubernetes (ACK) provides the Slurm on Kubernetes solution and the ack-slurm-operator component. You can use them to deploy and manage the Simple Linux Utility for Resource Management (Slurm) scheduling system in ACK clusters in a convenient and efficient manner for high performance computing (HPC) and large-scale AI and machine learning (ML) workloads.
Introduction to Slurm
Slurm is a powerful open source platform for cluster resource management and job scheduling. It is designed to optimize the performance and efficiency of supercomputers and large compute clusters. Its key components work together to ensure high efficiency and flexibility of the system. The following figure shows how Slurm works.
slurmctld: the Slurm control daemon. As the brain of Slurm, the slurmctld monitors system resources, schedules jobs, and manages the cluster status. To enhance the reliability of the system, you can configure a secondary slurmctld to prevent service interruptions if the primary slurmctld fails. This ensures the high availability of the system.
slurmd: the Slurm node daemon. The slurmd runs on each compute node, receives instructions from the slurmctld, and manages jobs on that node, including starting and executing jobs, reporting job status, and preparing for new job assignments. The slurmd serves as the interface through which Slurm directly communicates with computing resources and schedules jobs.
slurmdbd: the Slurm database daemon. The slurmdbd is an optional component but it is essential for long-term management and auditing of large clusters because it maintains a centralized database to store job history and accounting information. The slurmdbd can aggregate data across multiple Slurm-managed clusters to simplify and improve the efficiency of data management.
SlurmCLI: SlurmCLI provides the following commands to facilitate job management and system monitoring (a brief usage example follows the list):
scontrol: used to manage clusters and control cluster configurations.
squeue: used to query the status of jobs in the queue.
srun: used to submit and manage jobs.
sbatch: used to submit batch job scripts. This command helps you schedule and manage computing resources.
sinfo: used to query the overall status of a cluster, including the availability of nodes.
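For reference, a typical interactive session with these commands might look like the following. The script name and job ID are illustrative.

```bash
sinfo                 # Check which partitions and nodes are available.
sbatch job.sh         # Submit a batch script; Slurm prints the assigned job ID.
squeue                # Watch the job while it waits in the queue or runs.
scontrol show job 1   # Inspect the details of job 1 after it is scheduled.
```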
Introduction to Slurm on ACK
The Slurm Operator uses the SlurmCluster custom resource (CR) to define the configuration required for managing Slurm clusters and to handle control plane management, which simplifies the deployment and maintenance of Slurm-managed clusters. The following figure shows the architecture of Slurm on ACK. A cluster administrator can deploy and manage a Slurm-managed cluster by using the SlurmCluster. The Slurm Operator creates Slurm control components in the cluster based on the SlurmCluster. A Slurm configuration file can be mounted to a control component by using a shared volume or a ConfigMap.
Prerequisites
An ACK cluster that runs Kubernetes 1.22 or later is created, and the cluster contains one GPU-accelerated node. For more information, see Create an ACK cluster with GPU-accelerated nodes and Update clusters.
Step 1: Install the ack-slurm-operator component
Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
On the Marketplace page, search for the ack-slurm-operator component and click the component. On the details page of the ack-slurm-operator component, click Deploy in the upper-right corner. In the Deploy panel, configure the parameters for the component.
You need to only specify the Cluster parameter. Use the default settings for other parameters.
After you configure the parameters, click OK.
Step 2: Create a Slurm-managed cluster
Manually create a Slurm-managed cluster
Create a Secret for MUNGE (MUNGE Uid 'N' Gid Emporium)-based authentication in the ACK cluster.
Run the following command to create a key by using OpenSSL. This key is used for MUNGE-based authentication.
```bash
openssl rand -base64 512 | tr -d '\r\n'
```
Run the following command to create a Secret. This Secret is used to store the key created in the previous step.
```bash
kubectl create secret generic <$MungeKeyName> --from-literal=munge.key=<$MungeKey>
```
Replace `<$MungeKeyName>` with the custom name of your key, such as `mungekey`. Replace `<$MungeKey>` with the key string that is generated in the previous step.
After you perform the preceding steps, you can configure or associate the Secret with the Slurm-managed cluster to obtain and use the key for MUNGE-based authentication.
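For example, you can combine the two commands as follows. The Secret name slurm-munge-key is illustrative.

```bash
# Generate a random MUNGE key and store it in a Kubernetes Secret in one step.
MUNGE_KEY=$(openssl rand -base64 512 | tr -d '\r\n')
kubectl create secret generic slurm-munge-key --from-literal=munge.key="$MUNGE_KEY"
```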
Run the following command to create a ConfigMap for the Slurm-managed cluster.
In this example, the following ConfigMap is mounted to a pod by specifying the slurmConfPath parameter in a CR. This ensures that the pod configuration can be automatically restored to the expected state even if the pod is recreated.
The data parameter in the following sample code specifies a sample ConfigMap. To generate a ConfigMap, we recommend that you use the Easy Configurator or Full Configurator tool.
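The following command is a minimal illustrative sketch of such a ConfigMap and how to apply it. The slurm.conf entries are placeholders, not the generated configuration; produce the real content with the configurator tools.

```bash
# Illustrative sketch only: create a ConfigMap named slurm-test that stores a slurm.conf file.
# The slurm.conf entries below are placeholders; generate the real content with the
# Easy Configurator or Full Configurator tool.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-test
data:
  slurm.conf: |
    ClusterName=slurm-job-demo
    AuthType=auth/munge
    # ... scheduler, node, and partition definitions generated by the configurator ...
EOF
```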
Expected output:
```
configmap/slurm-test created
```
The expected output indicates that the ConfigMap is created.
Submit the SlurmCluster CR.
Create a file named slurmcluster.yaml and copy the following content to the file. Sample code:
Note: In this example, an Ubuntu image that contains Compute Unified Device Architecture (CUDA) 11.4 and Slurm 23.06 is used. The image contains the component that is developed by Alibaba Cloud for auto scaling of on-cloud nodes. You can also create and upload a custom image if needed.
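As a rough orientation, a SlurmCluster CR of this shape might look like the following sketch. Only apiVersion, kind, slurmConfPath, mungeConfPath, headGroupSpec, and workerGroupSpecs appear in this topic; other field names (such as replicas and groupName), the mount paths, and the pod template layout are assumptions and may differ from the actual CRD schema.

```yaml
# Illustrative sketch only; verify the exact schema against the ack-slurm-operator CRD.
apiVersion: kai.alibabacloud.com/v1
kind: SlurmCluster
metadata:
  name: slurm-job-demo
spec:
  slurmConfPath: /etc/slurm-config/     # Must match the mount target of the Slurm ConfigMap.
  mungeConfPath: /etc/munge-config/     # Must match the mount target of the MUNGE Secret.
  headGroupSpec:
    template:
      spec:
        containers:
        - name: slurm
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
          # volumeMounts for the Slurm ConfigMap and MUNGE Secret go here.
  workerGroupSpecs:
  - groupName: cpu        # Assumed field name for the worker group.
    replicas: 2           # Assumed field name for the number of workers.
    template:
      spec:
        containers:
        - name: slurm
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
  - groupName: cpu1
    replicas: 2
    template:
      spec:
        containers:
        - name: slurm
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
```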
A Slurm-managed cluster that has one head node and four worker nodes is created based on the preceding SlurmCluster CR. The Slurm-managed cluster runs as a pod in the ACK cluster. The values of the mungeConfPath and slurmConfPath parameters in the SlurmCluster CR must be the same as the mount targets that are specified in the headGroupSpec and workerGroupSpecs parameters.
Run the following command to deploy the slurmcluster.yaml file to the cluster:
```bash
kubectl apply -f slurmcluster.yaml
```
Expected output:
```
slurmcluster.kai.alibabacloud.com/slurm-job-demo created
```
Run the following command to check whether the created Slurm-managed cluster runs as expected:
```bash
kubectl get slurmcluster
```
Expected output:
```
NAME             AVAILABLE WORKERS   STATUS   AGE
slurm-job-demo   5                   ready    14m
```
The output indicates that the Slurm-managed cluster is deployed and its five nodes are ready.
Run the following command to check whether the pods in the Slurm-managed cluster named slurm-job-demo are in the Running state:
```bash
kubectl get pod
```
Expected output:
```
NAME                           READY   STATUS    RESTARTS   AGE
slurm-job-demo-head-x9sgs      1/1     Running   0          14m
slurm-job-demo-worker-cpu-0    1/1     Running   0          14m
slurm-job-demo-worker-cpu-1    1/1     Running   0          14m
slurm-job-demo-worker-cpu1-0   1/1     Running   0          14m
slurm-job-demo-worker-cpu1-1   1/1     Running   0          14m
```
The output indicates that the head node and four worker nodes run as expected in the Slurm-managed cluster.
Create a Slurm-managed cluster by using Helm
To quickly install and manage a Slurm-managed cluster and flexibly modify its configurations, you can use Helm to install the SlurmCluster chart provided by Alibaba Cloud. Download the Helm chart for Slurm-managed clusters from charts-incubator (the chart repository of Alibaba Cloud). After you configure the parameters, Helm creates resources such as the role-based access control (RBAC) resources, ConfigMap, Secret, and the Slurm-managed cluster.
The Helm chart contains the configurations of the following resources.
| Resource type | Resource name | Description |
| --- | --- | --- |
| ConfigMap | {{ .Values.slurmConfigs.configMapName }} | When the .Values.slurmConfigs.createConfigsByConfigMap parameter is set to True, the ConfigMap is created and used to store user-defined Slurm configurations. The ConfigMap is mounted to the path specified by the .Values.slurmConfigs.slurmConfigPathInPod parameter. The specified path is also rendered to the .Spec.SlurmConfPath parameter of the Slurm-managed cluster and the startup commands of the pod. When the pod is started, the ConfigMap is copied to the /etc/slurm/ path and access to the ConfigMap is limited. |
| ServiceAccount | {{ .Release.Namespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the configurations of the Slurm-managed cluster. The Slurm-managed cluster can use this resource to enable auto scaling of on-cloud nodes. |
| Role | {{ .Release.Namespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the configurations of the Slurm-managed cluster. The Slurm-managed cluster can use this resource to enable auto scaling of on-cloud nodes. |
| RoleBinding | {{ .Release.Namespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the configurations of the Slurm-managed cluster. The Slurm-managed cluster can use this resource to enable auto scaling of on-cloud nodes. |
| Role | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the Secrets in the SlurmOperator namespace. When Slurm and Kubernetes are deployed on the same batch of physical servers, the Slurm-managed cluster can use this resource to renew tokens. |
| RoleBinding | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | This resource allows the slurmctld pod to modify the Secrets in the SlurmOperator namespace. When Slurm and Kubernetes are deployed on the same batch of physical servers, the Slurm-managed cluster can use this resource to renew tokens. |
| Secret | {{ .Values.mungeConfigs.secretName }} | This resource is used by Slurm components for authentication when they communicate with each other. When the .Values.mungeConfigs.createConfigsBySecret parameter is set to True, this resource is automatically created and contains the following content: "munge.key"={{ .Values.mungeConfigs.content }}. In this case, the .Values.mungeConfigs.mungeConfigPathInPod parameter is rendered as the .Spec.MungeConfPath parameter and then as the mount path of the resource in the pod. The startup commands of the pod initialize /etc/munge/munge.key based on the mount path. |
| SlurmCluster | | The rendered Slurm-managed cluster. |
The following table describes the relevant parameters.
| Parameter | Example | Description |
| --- | --- | --- |
| clusterName | "" | The cluster name. The cluster name is used to generate Secrets and roles. The value must be the same as the cluster name specified in other Slurm configuration files. |
| headNodeConfig | N/A | This parameter is required. It specifies the configurations of the slurmctld pod. |
| workerNodesConfig | N/A | The configurations of the slurmd pods. |
| workerNodesConfig.deleteSelfBeforeSuspend | true | If you set the value to true, a preStop hook is automatically added to the worker pod. The preStop hook drains the node and marks it as unschedulable before the node is removed. |
| slurmdbdConfigs | N/A | The configurations of the slurmdbd pod. If you leave this parameter empty, no pod is created to run slurmdbd. |
| slurmrestdConfigs | N/A | The configurations of the slurmrestd pod. If you leave this parameter empty, no pod is created to run slurmrestd. |
| headNodeConfig.hostNetwork, slurmdbdConfigs.hostNetwork, slurmrestdConfigs.hostNetwork, workerNodesConfig.workerGroups[].hostNetwork | false | Rendered as the hostNetwork parameter of the corresponding pod. |
| headNodeConfig.setHostnameAsFQDN, slurmdbdConfigs.setHostnameAsFQDN, slurmrestdConfigs.setHostnameAsFQDN, workerNodesConfig.workerGroups[].setHostnameAsFQDN | false | Rendered as the setHostnameAsFQDN parameter of the corresponding pod. |
| headNodeConfig.nodeSelector, slurmdbdConfigs.nodeSelector, slurmrestdConfigs.nodeSelector, workerNodesConfig.workerGroups[].nodeSelector | | Rendered as the nodeSelector parameter of the corresponding pod. |
| headNodeConfig.tolerations, slurmdbdConfigs.tolerations, slurmrestdConfigs.tolerations, workerNodesConfig.workerGroups[].tolerations | | Rendered as the tolerations of the corresponding pod. |
| headNodeConfig.affinity, slurmdbdConfigs.affinity, slurmrestdConfigs.affinity, workerNodesConfig.workerGroups[].affinity | | Rendered as the affinity rules of the corresponding pod. |
| headNodeConfig.resources, slurmdbdConfigs.resources, slurmrestdConfigs.resources, workerNodesConfig.workerGroups[].resources | | Rendered as the resources of the primary container in the corresponding pod. The resource limit of the primary container in a worker pod is also rendered as the resource limit of the Slurm node. |
| headNodeConfig.image, slurmdbdConfigs.image, slurmrestdConfigs.image, workerNodesConfig.workerGroups[].image | "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59" | Rendered as the image of the corresponding pod. You can also build a custom image by using the framework/slurm/building-slurm-image directory of the AliyunContainerService/ai-models-on-ack repository on GitHub. |
| headNodeConfig.imagePullSecrets, slurmdbdConfigs.imagePullSecrets, slurmrestdConfigs.imagePullSecrets, workerNodesConfig.workerGroups[].imagePullSecrets | | Rendered as the Secrets used to pull the image of the corresponding pod. |
| headNodeConfig.podSecurityContext, slurmdbdConfigs.podSecurityContext, slurmrestdConfigs.podSecurityContext, workerNodesConfig.workerGroups[].podSecurityContext | | Rendered as the security context of the corresponding pod. |
| headNodeConfig.securityContext, slurmdbdConfigs.securityContext, slurmrestdConfigs.securityContext, workerNodesConfig.workerGroups[].securityContext | | Rendered as the security context of the primary container in the corresponding pod. |
| headNodeConfig.volumeMounts, slurmdbdConfigs.volumeMounts, slurmrestdConfigs.volumeMounts, workerNodesConfig.workerGroups[].volumeMounts | N/A | Rendered as the volume mount configurations of the primary container in the corresponding pod. |
| headNodeConfig.volumes, slurmdbdConfigs.volumes, slurmrestdConfigs.volumes, workerNodesConfig.workerGroups[].volumes | N/A | Rendered as the volumes mounted to the corresponding pod. |
| slurmConfigs.slurmConfigPathInPod | "" | The mount path of the Slurm configuration file in the pod. If the Slurm configuration file is mounted to the pod as a volume, you must set the value to the path to which the slurm.conf file is mounted. The startup commands of the pod copy the file in the mount path to the /etc/slurm/ path and limit access to the file. |
| slurmConfigs.createConfigsByConfigMap | true | Specifies whether to automatically create a ConfigMap to store the Slurm configurations. |
| slurmConfigs.configMapName | "" | The name of the ConfigMap that stores the Slurm configurations. |
| slurmConfigs.filesInConfigMap | "" | The content of the ConfigMap that is automatically created to store the Slurm configurations. |
| mungeConfigs.mungeConfigPathInPod | N/A | The mount path of the MUNGE configuration file in the pod. If the MUNGE configuration file is mounted to the pod as a volume, you must set the value to the path to which the munge.key file is mounted. The startup commands of the pod copy the file in the mount path to the /etc/munge/ path and limit access to the file. |
| mungeConfigs.createConfigsBySecret | N/A | Specifies whether to automatically create a Secret to store the MUNGE configurations. |
| mungeConfigs.secretName | N/A | The name of the Secret that stores the MUNGE configurations. |
| mungeConfigs.content | N/A | The content of the Secret that is automatically created to store the MUNGE configurations. |
For more information about slurmConfigs.filesInConfigMap, see Slurm System Configuration Tool (schedmd.com).
If you modify the slurmConfigs.filesInConfigMap parameter after the pod is created, you must recreate the pod to make the modification take effect. In this case, we recommend that you check whether the parameter is modified as required before you recreate the pod.
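For orientation, a values.yaml excerpt that wires these parameters together might look like the following sketch. The nesting follows the parameter names in the preceding table, but the exact structure of the chart's default values.yaml may differ; treat the layout and example values as assumptions.

```yaml
# Illustrative excerpt only; compare with the values.yaml shipped in the chart before use.
clusterName: "slurm-job-demo"
headNodeConfig:
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
workerNodesConfig:
  deleteSelfBeforeSuspend: true
  workerGroups:
  - workerCount: 2        # Number of worker replicas in this group.
    image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
slurmConfigs:
  createConfigsByConfigMap: true
  configMapName: "slurm-config"
  slurmConfigPathInPod: "/etc/slurm-config/"
  filesInConfigMap:
    slurm.conf: |
      # slurm.conf content generated with the Slurm configurator tools.
mungeConfigs:
  createConfigsBySecret: true
  secretName: "slurm-munge-key"
  mungeConfigPathInPod: "/etc/munge-config/"
  content: "<base64-encoded MUNGE key>"
```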
Perform the following operations to install the Helm chart:
Run the following command to add the Helm repository provided by Alibaba Cloud to your local Helm client:
```bash
helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/
```
This operation allows you to access various charts provided by Alibaba Cloud, including the chart for Slurm-managed clusters.
Run the following command to pull and decompress the Helm chart:
```bash
helm pull aliyun/ack-slurm-cluster --untar=true
```
This operation creates a directory named ack-slurm-cluster in the current directory. The ack-slurm-cluster directory contains all the files and templates of the chart.
Run the following commands to modify the chart parameters in the values.yaml file.
The values.yaml file contains the default configurations of the chart. You can modify this file to modify the parameter settings, such as Slurm configurations, resource requests and limits, and storage, based on your business requirements.
```bash
cd ack-slurm-cluster
vi values.yaml
```
Use Helm to install the chart.
```bash
cd ..
helm install my-slurm-cluster ack-slurm-cluster # Replace my-slurm-cluster with the actual value.
```
This operation deploys the Slurm-managed cluster.
Check whether the Slurm-managed cluster is deployed.
After the deployment is complete, you can use the kubectl tool provided by Kubernetes to check the deployment status. Make sure that the Slurm-managed cluster runs as expected.
```bash
kubectl get pods -l app.kubernetes.io/name=slurm-cluster
```
Step 3: Log on to the Slurm-managed cluster
Log on as a Kubernetes cluster administrator
An administrator of a Kubernetes cluster has the permissions to manage the Kubernetes cluster. A Slurm-managed cluster runs as a pod in the Kubernetes cluster. Therefore, the Kubernetes cluster administrator can use kubectl to log on to any pod of any Slurm-managed cluster in the Kubernetes cluster, and has the permissions of the root user of the Slurm-managed cluster by default.
Run the following command to log on to a pod of the Slurm-managed cluster:
```bash
# Replace slurm-job-demo-head-x9sgs with the name of the head pod in your cluster.
kubectl exec -it slurm-job-demo-head-x9sgs -- bash
```
Log on as a regular user of the Slurm-managed cluster
The administrators or regular users of a Slurm-managed cluster may not have the permissions to run the kubectl exec command. If you use a Slurm-managed cluster as an administrator or a regular user, you need to log on to the Slurm-managed cluster by using SSH.
You can use an external IP address of a Service to log on to the head pod for long-term connections and scalability. This solution is suitable for scenarios in which long-term and stable connections are required. You can access the Slurm-managed cluster from anywhere within the internal network by using a Server Load Balancer (SLB) instance and its associated external IP address.
You can use the port-forward command to forward requests for a temporary period. This solution is suitable for short-term O&M or debugging because it relies on the continuous execution of the kubectl port-forward command.
Log on to the head pod by using an external IP address of a Service
Create a LoadBalancer Service to expose internal services in the cluster for external access. For more information, see Use an existing SLB instance to expose an application or Use an automatically created SLB instance to expose an application.
The LoadBalancer Service must use an internal-facing Classic Load Balancer (CLB) instance.
You need to add the `kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1` and `kai.alibabacloud.com/slurm-node-type: head` labels to the Service so that the Service can route incoming requests to the expected pod.
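A hedged sketch of such a Service is shown below. The Service name, the internal CLB annotation, and the port mapping are assumptions; depending on how the head pod is labeled in your cluster, you may also need a selector that matches it.

```yaml
# Illustrative sketch only.
apiVersion: v1
kind: Service
metadata:
  name: slurm-head-ssh
  labels:
    kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1   # Replace with your Slurm cluster name.
    kai.alibabacloud.com/slurm-node-type: head
  annotations:
    # Use an internal-facing CLB instance, as required above.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: intranet
spec:
  type: LoadBalancer
  ports:
  - name: ssh
    port: 22        # SSH port of the head pod.
    targetPort: 22
```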
Run the following command to obtain the external IP address of the LoadBalancer Service:
```bash
kubectl get svc
```
Run the following command to log on to the head pod associated with the Service by using SSH:
```bash
# Replace $YOURUSER with the actual username of the head pod and $EXTERNAL_IP with the external IP address of the Service.
ssh $YOURUSER@$EXTERNAL_IP
```
Forward requests by using the port-forward command
To use the port-forward command to forward requests, you must save the KubeConfig file of the Kubernetes cluster to your local host. This may cause security risks. We recommend that you do not use this method in production environments.
Run the following command to enable a port of the local host for request forwarding and map the port of the local host to port 22 of the head pod in which the slurmctld runs in the cluster. By default, SSH uses port 22.
```bash
# Replace $NAMESPACE, $CLUSTERNAME, and $LOCALPORT with the actual values.
kubectl port-forward -n $NAMESPACE svc/$CLUSTERNAME $LOCALPORT:22
```
Run the following command while the port-forward command is running. All users on the current host can log on to the cluster and submit jobs.
```bash
# Replace $YOURUSER with the username that you want to use to log on to the head pod.
ssh -p $LOCALPORT $YOURUSER@localhost
```
Step 4: Use the Slurm-managed cluster
The following sections describe how to synchronize users across nodes, share logs across nodes, and perform auto scaling for the Slurm-managed cluster.
Synchronize users across nodes
Slurm does not provide a centralized user authentication service. When you run the sbatch command to submit jobs to the Slurm-managed cluster, the jobs may fail to be executed if the account of the user who submits the jobs does not exist on the node that is selected to execute the jobs. To resolve this issue, you can configure Lightweight Directory Access Protocol (LDAP) for the Slurm-managed cluster to manage users. LDAP serves as a centralized backend service for authentication. This way, Slurm can authenticate user identities based on this service. Perform the following operations:
Create a file named ldap.yaml and copy the following content to the file to create a basic LDAP instance that stores and manages user information.
The ldap.yaml file defines an LDAP backend pod and its associated Service. The pod contains an LDAP container, and the Service exposes the LDAP service to make the LDAP service accessible within the network.
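As an illustration, a manifest that produces the Deployment, Service, and Secret shown in the expected output below might look like the following sketch. The osixia/openldap image, its environment variables, and the password value are assumptions; replace them with the image and credentials you actually use.

```yaml
# Illustrative sketch only.
apiVersion: v1
kind: Secret
metadata:
  name: ldap-secret
type: Opaque
stringData:
  LDAP_ADMIN_PASSWORD: "ChangeMe123!"   # Replace with a strong password.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ldap
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ldap
  template:
    metadata:
      labels:
        app: ldap
    spec:
      containers:
      - name: openldap
        image: osixia/openldap:1.5.0     # Assumed LDAP server image.
        ports:
        - containerPort: 389
        env:
        - name: LDAP_ORGANISATION
          value: "example"
        - name: LDAP_DOMAIN
          value: "example.org"           # Matches dc=example,dc=org used later in this topic.
        - name: LDAP_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: ldap-secret
              key: LDAP_ADMIN_PASSWORD
---
apiVersion: v1
kind: Service
metadata:
  name: ldap-service
spec:
  selector:
    app: ldap
  ports:
  - port: 389
    targetPort: 389
```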
Run the following command to deploy the LDAP backend Service:
```bash
kubectl apply -f ldap.yaml
```
Expected output:
```
deployment.apps/ldap created
service/ldap-service created
secret/ldap-secret created
```
Optional. Create a file named phpldapadmin.yaml and copy the following content to the file to deploy an LDAP frontend pod and its associated Service. The LDAP frontend pod and its associated Service are used to configure frontend interfaces to improve management efficiency.
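As an illustration, the frontend manifest might look like the following sketch. The osixia/phpldapadmin image and its environment variables are assumptions; any LDAP web frontend that can reach ldap-service works in the same way.

```yaml
# Illustrative sketch only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phpldapadmin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phpldapadmin
  template:
    metadata:
      labels:
        app: phpldapadmin
    spec:
      containers:
      - name: phpldapadmin
        image: osixia/phpldapadmin:0.9.0       # Assumed frontend image.
        ports:
        - containerPort: 80
        env:
        - name: PHPLDAPADMIN_LDAP_HOSTS
          value: "ldap-service"                # Points to the LDAP backend Service created above.
        - name: PHPLDAPADMIN_HTTPS
          value: "false"
---
apiVersion: v1
kind: Service
metadata:
  name: phpldapadmin-service
spec:
  selector:
    app: phpldapadmin
  ports:
  - port: 80
    targetPort: 80
```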
Run the following command to deploy the LDAP frontend Service:
```bash
kubectl apply -f phpldapadmin.yaml
```
Log on to a pod in the Slurm-managed cluster based on Step 3 and run the following commands to install the LDAP client package:
```bash
apt update
apt install libnss-ldapd
```
After the libnss-ldapd package is installed, configure the network authentication service for the Slurm-managed cluster in the pod.
Run the following commands to install the Vim package for editing scripts and files:
```bash
apt update
apt install vim
```
Modify the following parameters in the /etc/ldap/ldap.conf file to configure the LDAP client:
```
...
BASE    dc=example,dc=org     # Replace the value with the distinguished name of the root node in the LDAP directory structure.
URI     ldap://ldap-service   # Replace the value with the uniform resource identifier (URI) of your LDAP server.
...
```
Modify the following parameters in the /etc/nslcd.conf file to define the connection to the LDAP server:
```
...
uri ldap://ldap-service    # Replace the value with the URI of your LDAP server.
base dc=example,dc=org     # Specify this parameter based on your LDAP directory structure.
...
tls_cacertfile /etc/ssl/certs/ca-certificates.crt    # Specify the path to the certificate authority (CA) certificate file that is used to verify the certificate of the LDAP server.
...
```
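After you update the two configuration files, restart the LDAP name service daemon and verify the setup. The following commands are a sketch that assumes nslcd is managed with the service command in the Ubuntu-based image.

```bash
service nslcd restart   # Reload the LDAP client settings.
getent passwd           # LDAP accounts should now appear in the user database.
```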
Share and access logs
By default, the job logs that are generated when you use the sbatch command are directly stored on the node that executes the jobs. This can be inconvenient for viewing logs. To view logs with ease, you can create a File Storage NAS (NAS) file system to store all job logs in accessible directories. This way, even if computing jobs are executed on different nodes, the logs that are generated for the jobs can be centrally collected and stored. This facilitates log management. Perform the following operations:
Create a NAS file system to store and share the logs of each node. For more information, see Create a file system.
Log on to the ACK console, and create a persistent volume (PV) and a persistent volume claim (PVC) for the NAS file system. For more information, see Mount a statically provisioned NAS volume.
Modify the SlurmCluster CR.
Configure the volumeMounts and volumes parameters in the headGroupSpec and workerGroupSpecs parameters to reference the created PVC and mount the PVC to the /home directory. Example:
```yaml
headGroupSpec:
  ...
  # Specify /home as the mount target.
  volumeMounts:
  - mountPath: /home
    name: test                # The name of the volume that references the PVC.
  volumes:                    # Add the definition of the PVC.
  - name: test                # The value of this parameter must be the same as that of the name parameter in the volumeMounts parameter.
    persistentVolumeClaim:
      claimName: test         # Replace the value with the name of the PVC.
  ...
workerGroupSpecs:
  # ... Repeat the preceding volumes and volumeMounts configuration for each workerGroupSpec parameter.
```
Run the following command to deploy the SlurmCluster CR.
Important: If the SlurmCluster CR fails to be deployed, run the `kubectl delete slurmcluster slurm-job-demo` command to delete the CR and redeploy it.
```bash
kubectl apply -f slurmcluster.yaml
```
After the SlurmCluster CR is deployed, worker nodes can share the NAS file system.
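To confirm that the /home directory is shared, you can write a file from one worker pod and read it from another. The pod names below are illustrative.

```bash
kubectl exec -it slurm-job-demo-worker-cpu-0 -- touch /home/nas-share-test
kubectl exec -it slurm-job-demo-worker-cpu-1 -- ls -l /home/nas-share-test
```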
Perform auto scaling for the Slurm-managed cluster
The root path of the default Slurm image contains executable files and scripts such as slurm-resume.sh, slurm-suspend.sh, and slurmctld-copilot. They are used to interact with the slurmctld to scale the Slurm-managed cluster.
Auto scaling for Slurm clusters based on on-cloud nodes
Local nodes: the physical compute nodes that are directly connected to the slurmctld.
On-cloud nodes: the logical nodes. Logical nodes are VM instances that can be created and destroyed on demand by cloud service providers.
Auto scaling for Slurm on ACK
Procedure
Configure auto scaling permissions. If you installed the Slurm-managed cluster by using Helm, the auto scaling permissions are automatically configured for the slurmctld pod, and you can skip this step.
The head pod requires permissions to access and update the SlurmCluster CR for auto scaling. Therefore, we recommend that you use RBAC to grant the required permissions to the head pod when you use the auto scaling feature. You can perform the following steps to configure permissions:
First, you must create the ServiceAccount, Role, and RoleBinding required by the slurmctld pod. Example: Set the name of the Slurm-managed cluster to slurm-job-demo and the namespace to default. Create a file named rbac.yaml and copy the following content to the file:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: slurm-job-demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: slurm-job-demo
rules:
- apiGroups: ["kai.alibabacloud.com"]
  resources: ["slurmclusters"]
  verbs: ["get", "watch", "list", "update", "patch"]
  resourceNames: ["slurm-job-demo"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: slurm-job-demo
subjects:
- kind: ServiceAccount
  name: slurm-job-demo
roleRef:
  kind: Role
  name: slurm-job-demo
  apiGroup: rbac.authorization.k8s.io
```
Run the `kubectl apply -f rbac.yaml` command to submit the resource list.
Grant permissions to the slurmctld pod. Run the `kubectl edit slurmcluster slurm-job-demo` command to modify the Slurm-managed cluster. Set the Spec.Slurmctld.Template.Spec.ServiceAccountName parameter to the ServiceAccount you created.
```yaml
apiVersion: kai.alibabacloud.com/v1
kind: SlurmCluster
...
spec:
  slurmctld:
    template:
      spec:
        serviceAccountName: slurm-job-demo
...
```
Rebuild the StatefulSet that manages the slurmctld pod to apply the changes you just made. Run the `kubectl get sts slurm-job-demo` command to find the StatefulSet that manages the slurmctld pod. Run the `kubectl delete sts slurm-job-demo` command to delete this StatefulSet. The Slurm Operator rebuilds the StatefulSet and applies the new configurations.
Configure the auto scaling file /etc/slurm/slurm.conf.
Manage ConfigMaps by using a shared volume
```
# The following parameters are required if you use on-cloud nodes.
# The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
SuspendTimeout=600
ResumeTimeout=600
# The interval at which the node is automatically suspended when no job runs on the node.
SuspendTime=600
# Set the number of nodes that can be scaled per minute.
ResumeRate=1
SuspendRate=1
# You must set the value of the NodeName parameter in the ${cluster_name}-worker-${group_name}- format. You must specify the amount of resources for the node in this line. Otherwise, the slurmctld pod
# considers that the node has only one vCPU. Make sure that the resources that you specified on the on-cloud nodes are the same as those declared in the workerGroupSpec parameter. Otherwise, resources may be wasted.
NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
# The following configurations are fixed. Keep them unchanged.
CommunicationParameters=NoAddrCache
ReconfigFlags=KeepPowerSaveSettings
SuspendProgram="/slurm-suspend.sh"
ResumeProgram="/slurm-resume.sh"
```
Manually manage ConfigMaps.
If slurm.conf is stored in a ConfigMap named slurm-config, you can run the `kubectl edit configmap slurm-config` command to add the following configurations to the ConfigMap:
```yaml
slurm.conf: |
  ...
  # The following parameters are required if you use on-cloud nodes.
  # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
  SuspendTimeout=600
  ResumeTimeout=600
  # The interval at which the node is automatically suspended when no job runs on the node.
  SuspendTime=600
  # Set the number of nodes that can be scaled per minute.
  ResumeRate=1
  SuspendRate=1
  # You must set the value of the NodeName parameter in the ${cluster_name}-worker-${group_name}- format. You must specify the amount of resources for the node in this line. Otherwise, the slurmctld pod
  # considers that the node has only one vCPU. Make sure that the resources that you specified on the on-cloud nodes are the same as those declared in the workerGroupSpec parameter. Otherwise, resources may be wasted.
  NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
  # The following configurations are fixed. Keep them unchanged.
  CommunicationParameters=NoAddrCache
  ReconfigFlags=KeepPowerSaveSettings
  SuspendProgram="/slurm-suspend.sh"
  ResumeProgram="/slurm-resume.sh"
```
Use Helm to manage ConfigMaps
Add the following ConfigMap to the values.yaml file:
```yaml
slurm.conf: |
  ...
  # The following parameters are required if you use on-cloud nodes.
  # The SuspendProgram and ResumeProgram features are developed by Alibaba Cloud.
  SuspendTimeout=600
  ResumeTimeout=600
  # The interval at which the node is automatically suspended when no job runs on the node.
  SuspendTime=600
  # Set the number of nodes that can be scaled per minute.
  ResumeRate=1
  SuspendRate=1
  # You must set the value of the NodeName parameter in the ${cluster_name}-worker-${group_name}- format. You must specify the amount of resources for the node in this line. Otherwise, the slurmctld pod
  # considers that the node has only one vCPU. Make sure that the resources that you specified on the on-cloud nodes are the same as those declared in the workerGroupSpec parameter. Otherwise, resources may be wasted.
  NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
  # The following configurations are fixed. Keep them unchanged.
  CommunicationParameters=NoAddrCache
  ReconfigFlags=KeepPowerSaveSettings
  SuspendProgram="/slurm-suspend.sh"
  ResumeProgram="/slurm-resume.sh"
```
Run the `helm upgrade` command to update the Slurm configuration.
Apply the new configuration
If the name of the Slurm-managed cluster is slurm-job-demo, run the `kubectl delete sts slurm-job-demo` command to apply the new configuration to the slurmctld pod.
Change the number of worker node replicas to 0 in the slurmcluster.yaml file so that you can view node scaling activities in subsequent steps.
Manual management
Set the name of the Slurm-managed cluster to slurm-job-demo. Run the `kubectl edit slurmcluster slurm-job-demo` command to change the value of workerCount to 0 in the Slurm-managed cluster. This operation changes the number of worker node replicas to 0.
Manage by using Helm
In the values.yaml file, change the .Values.workerGroup[].workerCount parameter to 0. Then, run the `helm upgrade slurm-job-demo .` command to update the current Helm chart. This operation sets the number of worker node replicas to 0.
Submit a job by using the sbatch command.
Run the following command to create a Shell script:
```bash
cat << EOF > cloudnodedemo.sh
```
Enter the following content after the command prompt:
```
> #!/bin/bash
> srun hostname
> EOF
```
Run the following command to check whether the content of the script is correct:
```bash
cat cloudnodedemo.sh
```
Expected output:
```
#!/bin/bash
srun hostname
```
The output indicates that the script content is correct.
Run the following command to submit the script to the Slurm-managed cluster for processing:
```bash
sbatch cloudnodedemo.sh
```
Expected output:
```
Submitted batch job 1
```
The output indicates that the job is submitted and assigned a job ID.
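While the on-cloud node is being provisioned, the job typically waits in the queue. You can monitor it with squeue; the output shown in the comments is illustrative.

```bash
squeue
# JOBID PARTITION     NAME  USER  ST  TIME  NODES NODELIST(REASON)
#     1     debug cloudnod  root  CF  0:00      1 slurm-job-demo-worker-cpu-0
```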
View the cluster scaling results.
Run the following command to view the scaling logs of the Slurm-managed cluster:
```bash
cat /var/log/slurm-resume.log
```
Expected output:
```
namespace: default
cluster: slurm-demo
resume called, args [slurm-demo-worker-cpu-0]
slurm cluster metadata: default slurm-demo
get SlurmCluster CR slurm-demo succeed
hostlists: [slurm-demo-worker-cpu-0]
resume node slurm-demo-worker-cpu-0
resume worker -cpu-0
resume node -cpu-0 end
```
The output indicates that the Slurm-managed cluster automatically adds one compute node based on the workloads to execute the submitted job.
Run the following command to view the pods in the cluster:
```bash
kubectl get pod
```
Expected output:
```
NAME                      READY   STATUS    RESTARTS   AGE
slurm-demo-head-9hn67     1/1     Running   0          21m
slurm-demo-worker-cpu-0   1/1     Running   0          43s
```
The output indicates that the slurm-demo-worker-cpu-0 pod is added to the cluster. In this case, the cluster is scaled out when the job is submitted.
Run the following command to view the nodes in the cluster:
```bash
sinfo
```
Expected output:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite   10     idle~  slurm-job-demo-worker-cpu-[2-10]
debug*    up     infinite   1      idle   slurm-job-demo-worker-cpu-[0-1]
```
The output indicates that the slurm-demo-worker-cpu-0 node is the newly started node and another 10 on-cloud nodes are available for scale-out.
Run the following command to view the information about the executed job:
```bash
scontrol show job 1
```
Expected output:
```
JobId=1 JobName=cloudnodedemo.sh
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2024-05-28T11:37:36 EligibleTime=2024-05-28T11:37:36
   AccrueTime=2024-05-28T11:37:36
   StartTime=2024-05-28T11:37:36 EndTime=2024-05-28T11:37:36 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-28T11:37:36 Scheduler=Main
   Partition=debug AllocNode:Sid=slurm-job-demo:93
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=slurm-job-demo-worker-cpu-0
   BatchHost=slurm-job-demo-worker-cpu-0
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=1M,node=1,billing=1
   AllocTRES=cpu=1,mem=1M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=//cloudnodedemo.sh
   WorkDir=/
   StdErr=//slurm-1.out
   StdIn=/dev/null
   StdOut=//slurm-1.out
   Power=
```
In the output, NodeList=slurm-job-demo-worker-cpu-0 indicates that the job is executed on the newly added node.
After a period of time, run the following command to view the node scale-in results:
```bash
sinfo
```
Expected output:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*    up     infinite   11     idle~  slurm-demo-worker-cpu-[0-10]
```
The output indicates that the number of nodes available for scale-out becomes 11. This indicates that the automatic scale-in is complete.