All Products
Search
Document Center

Container Service for Kubernetes:Deploy containerized Slurm on ACK

Last Updated:Mar 25, 2026

Container Service for Kubernetes (ACK) provides the Slurm on Kubernetes solution and the ack-slurm-operator component, enabling you to efficiently deploy and manage the Slurm (Simple Linux Utility for Resource Management) scheduling system on ACK clusters for high performance computing (HPC) and large-scale AI/ML workloads.

Slurm

Slurm is a powerful, open-source platform for cluster resource management and job scheduling, specifically designed to optimize the performance and efficiency of supercomputers and large compute clusters. Its core components work together to ensure efficient system operation and flexible management. The following figure shows how Slurm works.

image
  • slurmctld (Slurm Control Daemon): As the central controller for Slurm, slurmctld monitors system resources, schedules jobs, and manages the overall state of the cluster. For high availability, you can configure a standby slurmctld to prevent service interruptions if the primary controller fails.

  • slurmd (Slurm Node Daemon): Deployed on each compute node, the slurmd daemon receives instructions from slurmctld and executes jobs. It starts and runs jobs, reports job status, and prepares to accept new work. It acts as a direct interface to compute resources and is fundamental to job execution.

  • slurmdbd (Slurm Database Daemon): Although an optional component, slurmdbd is critical for the long-term management and auditing of large-scale clusters. It maintains a centralized database to store job history and accounting information. It also supports data aggregation across multiple Slurm-managed clusters, which improves data management efficiency.

  • SlurmCLI: Provides a suite of command-line tools to facilitate job management and system monitoring:

    • scontrol: Provides detailed control over cluster management and configuration.

    • squeue: Queries the status of the job queue.

    • srun: Submits and manages jobs.

    • sbatch: Submits a batch job.

    • sinfo: Displays the overall state of the cluster, including node availability.

Slurm on ACK

The Slurm Operator uses the SlurmCluster CustomResource (CR) to simplify the deployment and operation of Slurm clusters. This approach simplifies managing the configuration files and control plane, reducing overall complexity. The following figure shows the architecture of Slurm on ACK. A cluster administrator can deploy and manage a Slurm cluster simply by creating a SlurmCluster object. The Slurm Operator then automatically creates the corresponding Slurm control plane components. You can mount Slurm configuration files to these components by using shared storage or a ConfigMap.

image

Prerequisites

You need an ACK cluster that runs Kubernetes 1.22 or later and contains at least one GPU-accelerated node. For more information, see Add GPU-accelerated nodes to a cluster and Update clusters.

Step 1: Install the ack-slurm-operator

  1. Log on to the ACK console. In the left navigation pane, click Marketplace > Marketplace.

  2. On the Marketplace page, search for and click the ack-slurm-operator card. On the ack-slurm-operator details page, click Deploy and follow the prompts to configure the component.

    You only need to select a target cluster. Keep all other parameters at their default settings.

  3. Click OK.

Step 2: Create a SlurmCluster

Create manually

  1. Create a Secret in your ACK cluster for MUNGE-based authentication.

    1. Run the following command to generate a key for MUNGE-based authentication using the OpenSSL tool.

      openssl rand -base64 512 | tr -d '\r\n'
    2. Run the following command to create a Secret that stores the MUNGE key you generated.

      kubectl create secret generic <$MungeKeyName> --from-literal=munge.key=<$MungeKey>
      • Replace <$MungeKeyName> with a custom name for your key, such as mungekey.

      • Replace <$MungeKey> with the key string that you generated in the previous step.

    Once created, you can configure the SlurmCluster resource to use this Secret for MUNGE-based authentication.

  2. Run the following command to create the ConfigMap that the SlurmCluster resource requires.

    In this example, specifying slurmConfPath in the Custom Resource (CR) mounts the ConfigMap to the pods. This ensures the configuration is automatically restored if a pod is recreated.

    The data parameter in the code is a sample configuration file. To generate a configuration file, you can use the Easy Configurator or Full Configurator tools.

    Command details

    kubectl create -f - << EOF
    apiVersion: v1
    data:
      slurm.conf: |
        ProctrackType=proctrack/linuxproc
        ReturnToService=1
        SlurmctldPidFile=/var/run/slurmctld.pid
        SlurmctldPort=6817
        SlurmdPidFile=/var/run/slurmd.pid
        SlurmdPort=6818
        SlurmdSpoolDir=/var/spool/slurmd
        SlurmUser=root # test2
        StateSaveLocation=/var/spool/slurmctld
        TaskPlugin=task/none
        InactiveLimit=0
        KillWait=30
        MinJobAge=300
        SlurmctldTimeout=120
        SlurmdTimeout=300
        Waittime=0
        SchedulerType=sched/builtin
        SelectType=select/cons_tres
        JobCompType=jobcomp/none
        JobAcctGatherFrequency=30
        SlurmctldDebug=info
        SlurmctldLogFile=/var/log/slurmctld.log
        SlurmdDebug=info
        SlurmdLogFile=/var/log/slurmd.log
        TreeWidth=65533
        MaxNodeCount=10000
        PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
    
        ClusterName=slurm-job-demo
        # SlurmctldHost should be set to the name of the SlurmCluster resource with a -0 suffix.
        # For a high-availability deployment, you can use the following configuration.
        # The number of entries depends on the number of slurmctld replicas.
        # SlurmctldHost=slurm-job-demo-0
        # SlurmctldHost=slurm-job-demo-1
        SlurmctldHost=slurm-job-demo-0
    kind: ConfigMap
    metadata:
      name: slurm-test
      namespace: default
    EOF

    Expected output:

    configmap/slurm-test created

    This output indicates that the ConfigMap was created successfully.

  3. Submit the SlurmCluster CR.

    1. Create a file named slurmcluster.yaml and copy the following content into it.

      Note

      The example uses an Ubuntu-based image that includes CUDA 11.4, Slurm 23.06, and a self-developed component for the auto scaling of Cloud Nodes. To use a custom image, you must create and upload it yourself.

      YAML example

      # This is a Kubernetes configuration file for deploying a Slurm-managed cluster on Alibaba Cloud ACK by using a Kai Custom Resource Definition (CRD).
      apiVersion: kai.alibabacloud.com/v1
      kind: SlurmCluster
      metadata:
        name: slurm-job-demo # The name of the cluster.
        namespace: default # The namespace where the cluster is deployed.
      spec:
        mungeConfPath: /var/munge # The configuration file path for the MUNGE service.
        slurmConfPath: /var/slurm # The configuration file path for the Slurm service.
        slurmctld: # Specifications for the head node (control plane node). A StatefulSet is created to manage the head node.
          template:
            metadata: {}
            spec:
              containers:
              - image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm-cuda:23.06-aliyun-cuda-11.4
                imagePullPolicy: Always
                name: slurmctld
                ports:
                - containerPort: 8080
                  protocol: TCP
                resources:
                  requests:
                    cpu: "1"
                    memory: 1Gi
                volumeMounts:
                - mountPath: /var/slurm # The volume mount for the Slurm configuration file.
                  name: config-slurm-test
                - mountPath: /var/munge # The volume mount for the MUNGE key file.
                  name: secret-slurm-test 
              volumes:
              - configMap:
                  name: slurm-test
                name: config-slurm-test
              - name: secret-slurm-test
                secret:
                  secretName: slurm-test
        workerGroupSpecs: # Specifications for the worker nodes. Two groups are defined here: cpu and cpu1.
        - groupName: cpu
          replicas: 2
          template:
            metadata: {}
            spec:
              containers:
              - env:
                - name: NVIDIA_REQUIRE_CUDA
                image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm-cuda:23.06-aliyun-cuda-11.4
                imagePullPolicy: Always
                name: slurmd
                resources:
                  requests:
                    cpu: "1"
                    memory: 1Gi
                volumeMounts:
                - mountPath: /var/slurm
                  name: config-slurm-test
                - mountPath: /var/munge
                  name: secret-slurm-test
              volumes:
              - configMap:
                  name: slurm-test
                name: config-slurm-test
              - name: secret-slurm-test
                secret:
                  secretName: slurm-test
        - groupName: cpu1 # The second worker node group. It is similar to the first one, but you can adjust resources or configurations as needed.
          replicas: 2
          template:
            metadata: {}
            spec:
              containers:
              - env:
                - name: NVIDIA_REQUIRE_CUDA
                image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm-cuda:23.06-aliyun-cuda-11.4
                imagePullPolicy: Always
                name: slurmd
                resources:
                  requests:
                    cpu: "1"
                    memory: 1Gi
                securityContext: # The security context is configured to allow the container to run in privileged mode.
                  privileged: true
                volumeMounts:
                - mountPath: /var/slurm
                  name: config-slurm-test
                - mountPath: /var/munge
                  name: secret-slurm-test
              volumes:
              - configMap:
                  name: slurm-test
                name: config-slurm-test
              - name: secret-slurm-test
                secret:
                  secretName: slurm-test

      The preceding SlurmCluster CR creates a Slurm-managed cluster with one head node and four worker nodes. These nodes run as pods in the ACK cluster. Note that the mungeConfPath and slurmConfPath specified in the SlurmCluster CR must match the mount paths defined in the slurmctld and workerGroupSpecs templates.

    2. Run the following command to deploy slurmcluster.yaml to the cluster:

      kubectl apply -f slurmcluster.yaml

      Expected output:

      slurmcluster.kai.alibabacloud.com/slurm-job-demo created
    3. Run the following command to check the status of the SlurmCluster.

      kubectl get slurmcluster

      Expected output:

      NAME             AVAILABLE WORKERS   STATUS   AGE
      slurm-job-demo   5                   ready    14m

      The output shows that the Slurm-managed cluster is deployed and all 5 nodes are in the Ready state.

    4. Run the following command to verify that the pods for the slurm-job-demo Slurm-managed cluster are running.

      kubectl get pod

      Expected output:

      NAME                                          READY   STATUS      RESTARTS     AGE
      slurm-job-demo-head-x9sgs                     1/1     Running     0            14m
      slurm-job-demo-worker-cpu-0                   1/1     Running     0            14m
      slurm-job-demo-worker-cpu-1                   1/1     Running     0            14m
      slurm-job-demo-worker-cpu1-0                  1/1     Running     0            14m
      slurm-job-demo-worker-cpu1-1                  1/1     Running     0            14m

      The output shows that the one head node and four worker nodes in the Slurm-managed cluster are running properly.

Create using Helm

You can use the Helm package manager to quickly deploy a Slurm-managed cluster. The SlurmCluster chart, provided by Alibaba Cloud, simplifies installation, management, and configuration. After you download and configure the chart from the Alibaba Cloud chart repository, Helm creates the required resources for you, such as RBAC permissions, a ConfigMap, a Secret, and the SlurmCluster CR.

The Helm chart includes the following resources:

Resource type

Resource name

Description

ConfigMap

{{ .Values.slurmConfigs.configMapName }}

This ConfigMap, created when .Values.slurmConfigs.createConfigsByConfigMap is true, stores the Slurm configuration file. The file is mounted to the path specified by .Values.slurmConfigs.slurmConfigPathInPod. This path value is then used in the .Spec.SlurmConfPath field of the SlurmCluster CR and in the pod's startup command. On startup, the pod copies the file to the /etc/slurm/ directory and sets the correct permissions.

ServiceAccount

{{ .Release.Namespace }}/{{ .Values.clusterName }}

Allows the slurmctld pod to modify the SlurmCluster CR. This enables auto scaling with the Cloud Node feature.

Role

{{ .Release.Namespace }}/{{ .Values.clusterName }}

Allows the slurmctld pod to modify the SlurmCluster CR. This enables auto scaling with the Cloud Node feature.

RoleBinding

{{ .Release.Namespace }}/{{ .Values.clusterName }}

Allows the slurmctld pod to modify the SlurmCluster CR. This enables auto scaling with the Cloud Node feature.

Role

{{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }}

Allows the slurmctld pod to modify Secrets in the SlurmOperator namespace. This is required for token updates in a mixed deployment scenario.

RoleBinding

{{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }}

Allows the slurmctld pod to modify Secrets in the SlurmOperator namespace. This is required for token updates in a mixed deployment scenario.

Secret

{{ .Values.mungeConfigs.secretName }}

This Secret authenticates communications between Slurm components. If .Values.mungeConfigs.createConfigsBySecret is true, Helm creates this Secret with the content "munge.key"={{ .Values.mungeConfigs.content }}. Its path is rendered to the .Spec.MungeConfPath field and used as a volume mount path in the pod. The pod's startup command then uses this path to initialize the /etc/munge/munge.key file.

SlurmCluster

The rendered SlurmCluster CR.

The following table describes the relevant parameters.

Parameter

Sample value

Description

clusterName

""

The name of the cluster. It is used to generate resources such as Secrets and Roles. The name must match ClusterName in the Slurm configuration file.

headNodeConfig

None

Required. Defines the pod configuration for slurmctld.

workerNodesConfig

None

Defines the pod configuration for slurmd.

workerNodesConfig.deleteSelfBeforeSuspend

true

When set to true, a preStop hook is added to the worker pod. This hook automatically drains the node and marks it as down before suspension.

slurmdbdConfigs

None

Defines the pod configuration for slurmdbd. If this parameter is omitted, the slurmdbd pod is not created.

slurmrestdConfigs

None

Defines the pod configuration for slurmrestd. If this parameter is omitted, the slurmrestd pod is not created.

headNodeConfig.hostNetwork

slurmdbdConfigs.hostNetwork

slurmrestdConfigs.hostNetwork

workerNodesConfig.workerGroups[].hostNetwork

false

Sets the hostNetwork field for the corresponding pod(s).

headNodeConfig.setHostnameAsFQDN

slurmdbdConfigs.setHostnameAsFQDN

slurmrestdConfigs.setHostnameAsFQDN

workerNodesConfig.workerGroups[].setHostnameAsFQDN

false

Sets the setHostnameAsFQDN field for the corresponding pod(s).

headNodeConfig.nodeSelector

slurmdbdConfigs.nodeSelector

slurmrestdConfigs.nodeSelector

workerNodesConfig.workerGroups[].nodeSelector

nodeSelector:
  example: example

Sets the nodeSelector field for the corresponding pod(s).

headNodeConfig.tolerations

slurmdbdConfigs.tolerations

slurmrestdConfigs.tolerations

workerNodesConfig.workerGroups[].tolerations

tolerations:
- key:
  value:
  operator:

Sets the tolerations field for the corresponding pod(s).

headNodeConfig.affinity

slurmdbdConfigs.affinity

slurmrestdConfigs.affinity

workerNodesConfig.workerGroups[].affinity

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - zone-a
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: another-node-label-key
          operator: In
          values:
          - another-node-label-value

Sets the affinity field for the corresponding pod(s).

headNodeConfig.resources

slurmdbdConfigs.resources

slurmrestdConfigs.resources

workerNodesConfig.workerGroups[].resources

resources:
  requests:
    cpu: 1
  limits:
    cpu: 1

Specifies the resources for the main container. The resource limits of the main container in a worker pod determine the resource capacity of the corresponding Slurm node.

headNodeConfig.image

slurmdbdConfigs.image

slurmrestdConfigs.image

workerNodesConfig.workerGroups[].image

"registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"

Sets the container image for the main container in the corresponding pod(s). To use a custom image, see ai-models-on-ack/framework/slurm/building-slurm-image at main · AliyunContainerService/ai-models-on-ack (github.com).

headNodeConfig.imagePullSecrets

slurmdbdConfigs.imagePullSecrets

slurmrestdConfigs.imagePullSecrets

workerNodesConfig.workerGroups[].imagePullSecrets

imagePullSecrets:
- name: example

Sets the image pull secret for the corresponding pod(s).

headNodeConfig.podSecurityContext

slurmdbdConfigs.podSecurityContext

slurmrestdConfigs.podSecurityContext

workerNodesConfig.workerGroups[].podSecurityContext

podSecurityContext:
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
  supplementalGroups: [4000]

Sets the podSecurityContext for the corresponding pod(s).

headNodeConfig.securityContext

slurmdbdConfigs.securityContext

slurmrestdConfigs.securityContext

workerNodesConfig.workerGroups[].securityContext

securityContext:
  allowPrivilegeEscalation: false

Sets the security context for the main container of the corresponding pod(s).

headNodeConfig.volumeMounts

slurmdbdConfigs.volumeMounts

slurmrestdConfigs.volumeMounts

workerNodesConfig.workerGroups[].volumeMounts

None

Sets the volume mounts for the main container of the corresponding pod(s).

headNodeConfig.volumes

slurmdbdConfigs.volumes

slurmrestdConfigs.volumes

workerNodesConfig.workerGroups[].volumes

None

Sets the volumes for the corresponding pod(s).

slurmConfigs.slurmConfigPathInPod

""

The mount path for Slurm configurations within the pod. When Slurm configuration files are mounted into the pod by using a volume, you must use this parameter to declare the location of slurm.conf. The pod's startup command copies files from this path to /etc/slurm/ and sets the required permissions.

slurmConfigs.createConfigsByConfigMap

true

Specifies whether to automatically create a ConfigMap to store Slurm configuration files.

slurmConfigs.configMapName

""

The name of the ConfigMap that stores the Slurm configuration files.

slurmConfigs.filesInConfigMap

""

The content of the configuration files when the ConfigMap is automatically created.

mungeConfigs.mungeConfigPathInPod

None

The mount path for MUNGE configurations within the pod. When the MUNGE configuration file is mounted into the pod by using a volume, you must use this parameter to declare the location of munge.key. The pod's startup command copies the file from this path to /etc/munge/ and sets the required permissions.

mungeConfigs.createConfigsBySecret

None

Specifies whether to automatically create a Secret to store the MUNGE configuration file.

mungeConfigs.secretName

None

The name of the Secret when it is automatically created.

mungeConfigs.content

None

The content of the MUNGE configuration file when the Secret is automatically created.

For more information about the content of slurmConfigs.filesInConfigMap, see Slurm System Configuration Tool (schedmd.com).

Important

If you modify slurmConfigs.filesInConfigMap after the pods have started, you must recreate the pods for the changes to take effect. Therefore, confirm the file content before installation.

Follow these steps to install the chart:

  1. Run the following command to add the Alibaba Cloud chart repository to your local Helm client.

    helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/

    This command allows you to access various charts provided by Alibaba Cloud, including the Slurm chart.

  2. Run the following command to pull and untar the Helm chart.

    helm pull aliyun/ack-slurm-cluster --untar=true

    This operation creates a directory named ack-slurm-cluster in the current directory. This directory contains all the files and templates of the Chart.

  3. Run the following commands to modify the chart parameters in the values.yaml file.

    The values.yaml file contains the default configuration for the chart. You can edit this file to modify parameters such as the Slurm configuration, resource requests and limits, and storage options.

    cd ack-slurm-cluster
    vi values.yaml
  4. Run the following commands to install the chart.

    cd ..
    helm install my-slurm-cluster ack-slurm-cluster # You can replace my-slurm-cluster with a custom release name.

    This command deploys the Slurm-managed cluster.

  5. Verify the deployment

    After the deployment is complete, you can use the Kubernetes command-line tool kubectl to check the deployment status and ensure that the Slurm Cluster has started successfully and is running correctly.

    kubectl get pods -l app.kubernetes.io/name=slurm-cluster

Step 3: Log on to the Slurm cluster

For Kubernetes cluster administrators

Kubernetes cluster administrators have full operational permissions on the cluster. Because a Slurm-managed cluster runs as pods within the Kubernetes cluster, an administrator can use the kubectl command-line tool to log on to any of its pods. This access automatically grants them root permissions within the Slurm-managed cluster.

Run the following command to log on to any pod of the Slurm-managed cluster.

# Replace slurm-job-demo-xxxxx with the name of a specific pod in your cluster.
kubectl exec -it slurm-job-demo-xxxxx -- bash

For regular Slurm cluster users

Administrators or regular users of a Slurm-managed cluster might not have the permissions to run the kubectl exec command. In this case, you must log on to the Slurm-managed cluster by using SSH.

  • Using a Service's external IP address to log on to a head pod provides a persistent, scalable solution for long-term, stable access. This method uses a load balancer, allowing you to access the Slurm-managed cluster from any location within your internal network.

  • Using port forwarding is a temporary solution for short-term operations or debugging because it requires the kubectl port-forward command to run continuously.

Use an external IP

  1. Create a Service of the LoadBalancer type to forward traffic and expose internal services. For more information, see Use an existing Server Load Balancer instance to expose an application or Expose an application by using an automatically created LoadBalancer Service.

    • The Service must use an internal-facing Classic Load Balancer (CLB) instance.

    • You must add the kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1 and kai.alibabacloud.com/slurm-node-type: head labels to the Service to ensure it routes incoming requests to the correct pod.

  2. Run the following command to obtain the external IP address of the LoadBalancer type Service.

    kubectl get svc
  3. Run the following command to log on to the corresponding head pod by using SSH.

    # Replace $YOURUSER with the username in the pod and $EXTERNAL_IP with the external IP address obtained from the Service.
    ssh $YOURUSER@$EXTERNAL_IP

Use port forwarding

Warning

To use the port-forward method, you must save the KubeConfig file of the Kubernetes cluster to your local machine. This poses a security risk. Do not use this method in a production environment.

  1. Run the following command on your local machine to start port forwarding. This command maps the local port $LOCALPORT to port 22 (the default SSH port) of the slurmctld pod in the cluster.

    # Replace $NAMESPACE, $CLUSTERNAME, and $LOCALPORT with their actual values.
    kubectl port-forward -n $NAMESPACE svc/$CLUSTERNAME $LOCALPORT:22
  2. While the port-forward command is running, any user on the local machine can run the following command to log on to the cluster and submit jobs.

    # $YOURUSER is the username to use when logging on to the pod.
    ssh -p $LOCALPORT $YOURUSER@localhost

Step 4: Use SlurmCluster

This section describes how to configure user synchronization, shared logging, and auto scaling for your SlurmCluster.

User synchronization across nodes

Slurm does not provide a built-in service for centralized user authentication. When you submit a job to a SlurmCluster by using the sbatch command, the job may fail if the user account does not exist on the target node. To resolve this, you can configure Lightweight Directory Access Protocol (LDAP) as a centralized authentication backend for your SlurmCluster. This allows Slurm to verify user identities through the LDAP service. Perform the following steps:

  1. Create a file named ldap.yaml with the following content. This configuration deploys a basic LDAP service instance for storing and managing user information.

    The ldap.yaml file defines a Pod to run the LDAP service and a Service to expose it on the network.

    LDAP backend Pod and Service

    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      namespace: default
      name: ldap
      labels:
        app: ldap
    spec:
      selector:
        matchLabels:
          app: ldap
      revisionHistoryLimit: 10
      template:
        metadata:
          labels:
            app: ldap
        spec:
          securityContext:
            seLinuxOptions: {}
          imagePullSecrets: []
          restartPolicy: Always
          initContainers: []
          containers:
            - image: 'osixia/openldap:1.4.0'
              imagePullPolicy: IfNotPresent
              name: ldap
              volumeMounts:
                - name: openldap-data
                  mountPath: /var/lib/ldap
                  subPath: data
                - name: openldap-data
                  mountPath: /etc/ldap/slapd.d
                  subPath: config
                - name: openldap-data
                  mountPath: /container/service/slapd/assets/certs
                  subPath: certs
                - name: secret-volume
                  mountPath: /container/environment/01-custom
                - name: container-run
                  mountPath: /container/run
              args:
                - '--copy-service'
              resources:
                limits:
                requests:
              env: []
              readinessProbe:
                tcpSocket:
                  port: openldap
                initialDelaySeconds: 20
                timeoutSeconds: 1
                periodSeconds: 10
                successThreshold: 1
                failureThreshold: 10
              livenessProbe:
                tcpSocket:
                  port: openldap
                initialDelaySeconds: 20
                timeoutSeconds: 1
                periodSeconds: 10
                successThreshold: 1
                failureThreshold: 10
              lifecycle: {}
              ports:
                - name: openldap
                  containerPort: 389
                  protocol: TCP
                - name: ssl-ldap-port
                  containerPort: 636
                  protocol: TCP
          volumes:
            - name: openldap-data
              emptyDir: {}
            - name: secret-volume
              secret:
                secretName: ldap-secret
                defaultMode: 420
                items: []
            - name: container-run
              emptyDir: {}
          dnsPolicy: ClusterFirst
          dnsConfig: {}
          terminationGracePeriodSeconds: 30
      progressDeadlineSeconds: 600
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 25%
          maxSurge: 25%
      replicas: 1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations: {}
      labels:
        app: ldap
      name: ldap-service
      namespace: default
    spec:
      ports:
        - name: openldap
          port: 389
          protocol: TCP
          targetPort: openldap
        - name: ssl-ldap-port
          port: 636
          protocol: TCP
          targetPort: ssl-ldap-port
      selector:
        app: ldap
      sessionAffinity: None
      type: ClusterIP
    ---
    metadata:
      name: ldap-secret
      namespace: default
      annotations: {}
    data:
      env.startup.yaml: >-
        IyBUaGlzIGlzIHRoZSBkZWZhdWx0IGltYWdlIHN0YXJ0dXAgY29uZmlndXJhdGlvbiBmaWxlCiMgdGhpcyBmaWxlIGRlZmluZSBlbnZpcm9ubWVudCB2YXJpYWJsZXMgdXNlZCBkdXJpbmcgdGhlIGNvbnRhaW5lciAqKmZpcnN0IHN0YXJ0KiogaW4gKipzdGFydHVwIGZpbGVzKiouCgojIFRoaXMgZmlsZSBpcyBkZWxldGVkIHJpZ2h0IGFmdGVyIHN0YXJ0dXAgZmlsZXMgYXJlIHByb2Nlc3NlZCBmb3IgdGhlIGZpcnN0IHRpbWUsCiMgYWZ0ZXIgdGhhdCBhbGwgdGhlc2UgdmFsdWVzIHdpbGwgbm90IGJlIGF2YWlsYWJsZSBpbiB0aGUgY29udGFpbmVyIGVudmlyb25tZW50LgojIFRoaXMgaGVscHMgdG8ga2VlcCB5b3VyIGNvbnRhaW5lciBjb25maWd1cmF0aW9uIHNlY3JldC4KIyBtb3JlIGluZm9ybWF0aW9uIDogaHR0cHM6Ly9naXRodWIuY29tL29zaXhpYS9kb2NrZXItbGlnaHQtYmFzZWltYWdlCgojIFJlcXVpcmVkIGFuZCB1c2VkIGZvciBuZXcgbGRhcCBzZXJ2ZXIgb25seQpMREFQX09SR0FOSVNBVElPTjogRXhhbXBsZSBJbmMuCkxEQVBfRE9NQUlOOiBleGFtcGxlLm9yZwpMREFQX0JBU0VfRE46ICNpZiBlbXB0eSBhdXRvbWF0aWNhbGx5IHNldCBmcm9tIExEQVBfRE9NQUlOCgpMREFQX0FETUlOX1BBU1NXT1JEOiBhZG1pbgpMREFQX0NPTkZJR19QQVNTV09SRDogY29uZmlnCgpMREFQX1JFQURPTkxZX1VTRVI6IGZhbHNlCkxEQVBfUkVBRE9OTFlfVVNFUl9VU0VSTkFNRTogcmVhZG9ubHkKTERBUF9SRUFET05MWV9VU0VSX1BBU1NXT1JEOiByZWFkb25seQoKIyBCYWNrZW5kCkxEQVBfQkFDS0VORDogaGRiCgojIFRscwpMREFQX1RMUzogdHJ1ZQpMREFQX1RMU19DUlRfRklMRU5BTUU6IGxkYXAuY3J0CkxEQVBfVExTX0tFWV9GSUxFTkFNRTogbGRhcC5rZXkKTERBUF9UTFNfQ0FfQ1JUX0ZJTEVOQU1FOiBjYS5jcnQKCkxEQVBfVExTX0VORk9SQ0U6IGZhbHNlCkxEQVBfVExTX0NJUEhFUl9TVUlURTogU0VDVVJFMjU2Oi1WRVJTLVNTTDMuMApMREFQX1RMU19QUk9UT0NPTF9NSU46IDMuMQpMREFQX1RMU19WRVJJRllfQ0xJRU5UOiBkZW1hbmQKCiMgUmVwbGljYXRpb24KTERBUF9SRVBMSUNBVElPTjogZmFsc2UKIyB2YXJpYWJsZXMgJExEQVBfQkFTRV9ETiwgJExEQVBfQURNSU5fUEFTU1dPUkQsICRMREFQX0NPTkZJR19QQVNTV09SRAojIGFyZSBhdXRvbWF0aWNhbHkgcmVwbGFjZWQgYXQgcnVuIHRpbWUKCiMgaWYgeW91IHdhbnQgdG8gYWRkIHJlcGxpY2F0aW9uIHRvIGFuIGV4aXN0aW5nIGxkYXAKIyBhZGFwdCBMREFQX1JFUExJQ0FUSU9OX0NPTkZJR19TWU5DUFJPViBhbmQgTERBUF9SRVBMSUNBVElPTl9EQl9TWU5DUFJPViB0byB5b3VyIGNvbmZpZ3VyYXRpb24KIyBhdm9pZCB1c2luZyAkTERBUF9CQVNFX0ROLCAkTERBUF9BRE1JTl9QQVNTV09SRCBhbmQgJExEQVBfQ09ORklHX1BBU1NXT1JEIHZhcmlhYmxlcwpMREFQX1JFUExJQ0FUSU9OX0NPTkZJR19TWU5DUFJPVjogYmluZGRuPSJjbj1hZG1pbixjbj1jb25maWciIGJpbmRtZXRob2Q9c2ltcGxlIGNyZWRlbnRpYWxzPSRMREFQX0NPTkZJR19QQVNTV09SRCBzZWFyY2hiYXNlPSJjbj1jb25maWciIHR5cGU9cmVmcmVzaEFuZFBlcnNpc3QgcmV0cnk9IjYwICsiIHRpbWVvdXQ9MSBzdGFydHRscz1jcml0aWNhbApMREFQX1JFUExJQ0FUSU9OX0RCX1NZTkNQUk9WOiBiaW5kZG49ImNuPWFkbWluLCRMREFQX0JBU0VfRE4iIGJpbmRtZXRob2Q9c2ltcGxlIGNyZWRlbnRpYWxzPSRMREFQX0FETUlOX1BBU1NXT1JEIHNlYXJjaGJhc2U9IiRMREFQX0JBU0VfRE4iIHR5cGU9cmVmcmVzaEFuZFBlcnNpc3QgaW50ZXJ2YWw9MDA6MDA6MDA6MTAgcmV0cnk9IjYwICsiIHRpbWVvdXQ9MSBzdGFydHRscz1jcml0aWNhbApMREFQX1JFUExJQ0FUSU9OX0hPU1RTOgogIC0gbGRhcDovL2xkYXAuZXhhbXBsZS5vcmcgIyBUaGUgb3JkZXIgbXVzdCBiZSB0aGUgc2FtZSBvbiBhbGwgbGRhcCBzZXJ2ZXJzCiAgLSBsZGFwOi8vbGRhcDIuZXhhbXBsZS5vcmcKCgojIFJlbW92ZSBjb25maWcgYWZ0ZXIgc2V0dXAKTERBUF9SRU1PVkVfQ09ORklHX0FGVEVSX1NFVFVQOiB0cnVlCgojIGNmc3NsIGVudmlyb25tZW50IHZhcmlhYmxlcyBwcmVmaXgKTERBUF9DRlNTTF9QUkVGSVg6IGxkYXAgIyBjZnNzbC1oZWxwZXIgZmlyc3Qgc2VhcmNoIGNvbmZpZyBmcm9tIExEQVBfQ0ZTU0xfKiB2YXJpYWJsZXMsIGJlZm9yZSBDRlNTTF8qIHZhcmlhYmxlcy4K
      env.yaml: >-
        IyBUaGlzIGlzIHRoZSBkZWZhdWx0IGltYWdlIGNvbmZpZ3VyYXRpb24gZmlsZQojIFRoZXNlIHZhbHVlcyB3aWxsIHBlcnNpc3RzIGluIGNvbnRhaW5lciBlbnZpcm9ubWVudC4KCiPCoEFsbCBlbnZpcm9ubWVudCB2YXJpYWJsZXMgdXNlZCBhZnRlciB0aGUgY29udGFpbmVyIGZpcnN0IHN0YXJ0CiMgbXVzdCBiZSBkZWZpbmVkIGhlcmUuCiMgbW9yZSBpbmZvcm1hdGlvbiA6IGh0dHBzOi8vZ2l0aHViLmNvbS9vc2l4aWEvZG9ja2VyLWxpZ2h0LWJhc2VpbWFnZQoKIyBHZW5lcmFsIGNvbnRhaW5lciBjb25maWd1cmF0aW9uCiMgc2VlIHRhYmxlIDUuMSBpbiBodHRwOi8vd3d3Lm9wZW5sZGFwLm9yZy9kb2MvYWRtaW4yNC9zbGFwZGNvbmYyLmh0bWwgZm9yIHRoZSBhdmFpbGFibGUgbG9nIGxldmVscy4KTERBUF9MT0dfTEVWRUw6IDI1Ngo=
    type: Opaque
    kind: Secret
    apiVersion: v1
    
  2. Run the following command to deploy the LDAP backend service:

    kubectl apply -f ldap.yaml

    Expected output:

    deployment.apps/ldap created
    service/ldap-service created
    secret/ldap-secret created
  3. (Optional) To improve management efficiency, you can deploy a frontend interface. Create a file named phpldapadmin.yaml with the following content to deploy a frontend Pod and Service.

    LDAP frontend Pod and Service

    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      namespace: default
      name: phpldapadmin
      labels:
        io.kompose.service: phpldapadmin
    spec:
      selector:
        matchLabels:
          io.kompose.service: phpldapadmin
      revisionHistoryLimit: 10
      template:
        metadata:
          labels:
            io.kompose.service: phpldapadmin
        spec:
          securityContext:
            seLinuxOptions: {}
          imagePullSecrets: []
          restartPolicy: Always
          initContainers: []
          containers:
            - image: 'osixia/phpldapadmin:0.9.0'
              imagePullPolicy: Always
              name: phpldapadmin
              volumeMounts: []
              resources:
                limits:
                requests:
              env:
                - name: PHPLDAPADMIN_HTTPS
                  value: 'false'
                - name: PHPLDAPADMIN_LDAP_HOSTS
                  value: ldap-service
              lifecycle: {}
              ports:
                - containerPort: 80
                  protocol: TCP
          volumes: []
          dnsPolicy: ClusterFirst
          dnsConfig: {}
          terminationGracePeriodSeconds: 30
      progressDeadlineSeconds: 600
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 25%
          maxSurge: 25%
      replicas: 1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      namespace: default
      name: phpldapadmin
      annotations:
        k8s.kuboard.cn/workload: phpldapadmin
      labels:
        io.kompose.service: phpldapadmin
    spec:
      selector:
        io.kompose.service: phpldapadmin
      type: ClusterIP
      ports:
        - port: 8080
          targetPort: 80
          protocol: TCP
          name: '8080'
          nodePort: 0
      sessionAffinity: None

    Run the following command to deploy the LDAP frontend service:

    kubectl apply -f phpldapadmin.yaml
  4. Log on to a Pod in the SlurmCluster as described in Step 3. Then, run the following commands to install the LDAP client package:

    apt update
    apt install libnss-ldapd
  5. After the libnss-ldapd package is installed, configure the network authentication service for the SlurmCluster from within the Pod.

    1. Run the following commands to install the Vim package for editing scripts and files:

      apt update
      apt install vim
    2. Modify the following parameters in the /etc/ldap/ldap.conf file to configure the LDAP client:

      ...
      BASE	dc=example,dc=org # Replace this with the base DN of your LDAP directory.
      URI	ldap://ldap-service # Replace this with the address of your LDAP server.
      ...
    3. Modify the following parameters in the /etc/nslcd.conf file to define the connection to the LDAP server:

      ...
      uri ldap://ldap-service # Replace this with the address of your LDAP server.
      base dc=example,dc=org # Set this based on your LDAP directory structure.
      ...
      tls_cacertfile /etc/ssl/certs/ca-certificates.crt # Specifies the path to the CA certificate file used to verify the LDAP server certificate.
      ...

Log sharing and access

By default, job logs generated by sbatch are stored directly on the node where the job runs, which can make viewing logs inconvenient. To centralize log access, you can create a NAS file system to store all job logs. This collects logs from all nodes in a single location, simplifying management. Perform the following steps:

  1. Create a NAS file system to store and share logs from all nodes. For more information, see Create a file system.

  2. Log on to the ACK console and create a Persistent Volume (PV) and a Persistent Volume Claim (PVC) for the NAS file system. For more information, see Use a statically provisioned NAS volume.

  3. Modify the SlurmCluster CR.

    Add the volumeMounts and volumes parameters to headGroupSpec and each workerGroupSpec to reference the created PVC and mount it to the /home directory. The following is an example:

    headGroupSpec:
    ...
    # Add a volume mount for /home.
      volumeMounts:
      - mountPath: /home
        name: test  # The name of the volume that references the PVC.
      volumes:
    # Add the PVC definition.
      - name: test  # This must match the name in volumeMounts.
        persistentVolumeClaim:
          claimName: test  # Replace this with the name of your PVC.
    ...
    workerGroupSpecs:
      # ... Repeat the preceding volume and volumeMounts configuration for each workerGroupSpec.
  4. Run the following command to apply the changes to the SlurmCluster CR:

    Important

    If the SlurmCluster CR resource fails to deploy, run the kubectl delete slurmcluster slurm-job-demo command to delete the CR resource, and then deploy it again.

    kubectl  apply -f slurmcluster.yaml

    After the deployment, all worker nodes share the same file system.

Auto scaling

The root directory of the default Slurm image includes executable files and scripts such as slurm-resume.sh, slurm-suspend.sh, and slurmctld-copilot. These components interact with slurmctld to manage cluster scaling.

Slurm auto scaling with cloud nodes

  • local node: A physical compute node that is directly connected to the cluster manager.

  • cloud node: A logical node that represents a VM instance that can be created and terminated on demand by a cloud provider.

image

Auto scaling in Slurm on ACK

image

Procedure

  1. Configure permissions for auto scaling. If you installed the cluster by using Helm, these permissions are created automatically for slurmctld, and you can skip this step.

    Auto scaling requires the head Pod to have permissions to access and update the SlurmCluster CR. Use RBAC to grant the head Pod the necessary permissions.

    First, you need to create the ServiceAccount, Role, and RoleBinding required by slurmctld. Assume that the Name of your SlurmCluster is slurm-job-demo and the Namespace is default. Save the following content to a file named rbac.yaml:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: slurm-job-demo
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: slurm-job-demo
    rules:
    - apiGroups: ["kai.alibabacloud.com"]
      resources: ["slurmclusters"]
      verbs: ["get", "watch", "list", "update", "patch"]
      resourceNames: ["slurm-job-demo"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: slurm-job-demo
    subjects:
    - kind: ServiceAccount
      name: slurm-job-demo
    roleRef:
      kind: Role
      name: slurm-job-demo
      apiGroup: rbac.authorization.k8s.io

    After you save the file, run kubectl apply -f rbac.yaml to apply the manifest.

    Second, assign these permissions to the slurmctld Pod. Run kubectl edit slurmcluster slurm-job-demo to edit the SlurmCluster, and set .spec.slurmctld.template.spec.serviceAccountName to the ServiceAccount that you just created.

    apiVersion: kai.alibabacloud.com/v1
    kind: SlurmCluster
    ...
    spec:
      slurmctld:
        template:
          spec:
            serviceAccountName: slurm-job-demo
    ...

    Then, recreate the StatefulSet that manages slurmctld to apply the preceding changes. You can view the StatefulSet that currently manages the slurmctld pod by running kubectl get sts slurm-job-demo and delete it by running kubectl delete sts slurm-job-demo. The slurmoperator will then recreate the StatefulSet and apply the new configuration.

  2. Configure auto scaling in the /etc/slurm/slurm.conf file.

    Shared file system

    # The following settings are required when using cloud nodes. 
    # SuspendProgram and ResumeProgram are custom-developed features.
    SuspendTimeout=600
    ResumeTimeout=600
    # The interval after which an idle node is automatically suspended. 
    SuspendTime=600
    # The number of nodes that can be scaled out or in per minute. 
    ResumeRate=1
    SuspendRate=1
    # The NodeName format must be ${cluster_name}-worker-${group_name}-. You must declare the node's resources in this line.
    # Otherwise, slurmctld treats the node as having only 1 CPU core.
    # To avoid resource waste, ensure that the resources declared here match the resources specified in the workerGroup.
    NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
    # The following settings are fixed and should not be changed.
    CommunicationParameters=NoAddrCache
    ReconfigFlags=KeepPowerSaveSettings
    SuspendProgram="/slurm-suspend.sh"
    ResumeProgram="/slurm-resume.sh"

    ConfigMap

    If slurm.conf is stored in the slurm-config ConfigMap, you can run kubectl edit slurm-config to add the following configuration:

    slurm.conf:
    ...
      # The following settings are required when using cloud nodes. 
      # SuspendProgram and ResumeProgram are custom-developed features.
      SuspendTimeout=600
      ResumeTimeout=600
      # The interval after which an idle node is automatically suspended. 
      SuspendTime=600
      # The number of nodes that can be scaled out or in per minute. 
      ResumeRate=1
      SuspendRate=1
      # The NodeName format must be ${cluster_name}-worker-${group_name}-. You must declare the node's resources in this line.
      # Otherwise, slurmctld treats the node as having only 1 CPU core.
      # To avoid resource waste, ensure that the resources declared here match the resources specified in the workerGroup.
      NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
      # The following settings are fixed and should not be changed.
      CommunicationParameters=NoAddrCache
      ReconfigFlags=KeepPowerSaveSettings
      SuspendProgram="/slurm-suspend.sh"
      ResumeProgram="/slurm-resume.sh"

    Helm

    1. Modify the values.yaml file and add the following configuration:

      slurm.conf:
      ...
        # The following settings are required when using cloud nodes. 
        # SuspendProgram and ResumeProgram are custom-developed features.
        SuspendTimeout=600
        ResumeTimeout=600
        # The interval after which an idle node is automatically suspended. 
        SuspendTime=600
        # The number of nodes that can be scaled out or in per minute. 
        ResumeRate=1
        SuspendRate=1
        # The NodeName format must be ${cluster_name}-worker-${group_name}-. You must declare the node's resources in this line.
        # Otherwise, slurmctld treats the node as having only 1 CPU core.
        # To avoid resource waste, ensure that the resources declared here match the resources specified in the workerGroup.
        NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
        # The following settings are fixed and should not be changed.
        CommunicationParameters=NoAddrCache
        ReconfigFlags=KeepPowerSaveSettings
        SuspendProgram="/slurm-suspend.sh"
        ResumeProgram="/slurm-resume.sh"
    2. Run the helm upgrade command to update the current Slurm configuration.

  3. Apply the new configuration.

    Assuming your SlurmCluster is named slurm-job-demo, you can run kubectl delete sts slurm-job-demo to apply the new configuration to the slurmctld Pod.

  4. Set the replica count for worker nodes to 0. This allows you to observe the auto scaling process from the beginning.

    Manual

    Assuming the submitted SlurmCluster is named slurm-job-demo, run kubectl edit slurmcluster slurm-job-demo and change workerCount in the workerGroup to 0. This sets the replica count for worker nodes to 0.

    Helm

    In values.yaml, set .Values.workerGroup[].workerCount to 0. Then, run helm upgrade slurm-job-demo . to update the current Helm chart and set the worker replica count to 0.

  5. Submit an sbatch job.

    1. Run the following command to create a shell script:

      cat << EOF > cloudnodedemo.sh

      Enter the following content at the prompt:

      #!/bin/bash
      srun hostname
      EOF
    2. Run the following command to verify the content of the script:

      cat cloudnodedemo.sh

      Expected output:

        #!/bin/bash
        srun hostname

      The script output is correct.

    3. Run the following command to submit the script to the SlurmCluster.

      sbatch cloudnodedemo.sh

      Expected output:

      Submitted batch job 1

      The output indicates that the job was successfully submitted and assigned a job ID.

  6. View the cluster scaling status.

    1. Run the following command to view the SlurmCluster scaling logs.

      cat /var/log/slurm-resume.log

      Expected output:

       namespace: default cluster: slurm-demo
        resume called, args [slurm-demo-worker-cpu-0]
        slurm cluster metadata: default slurm-demo
        get SlurmCluster CR slurm-demo succeed
        hostlists: [slurm-demo-worker-cpu-0]
        resume node slurm-demo-worker-cpu-0
        resume worker -cpu-0
        resume node -cpu-0 end

      The log output shows that the SlurmCluster automatically added a compute node to meet the job demand.

    2. Run the following command to view the status of the Pods in the cluster.

      kubectl get pod

      Expected output:

      NAME                                          READY   STATUS    RESTARTS        AGE
      slurm-demo-head-9hn67                         1/1     Running   0               21m
      slurm-demo-worker-cpu-0                       1/1     Running   0               43s

      The output shows that slurm-demo-worker-cpu-0 is the new Pod in the cluster. This indicates that submitting the job triggered the cluster to scale out.

    3. Run the following command to view the cluster node information.

      sinfo

      Expected output:

      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
      debug*       up   infinite      10  idle~ slurm-job-demo-worker-cpu-[2-10]
      debug*       up   infinite      1   idle slurm-job-demo-worker-cpu-[0-1]

      The output shows that slurm-demo-worker-cpu-0 is the newly launched node. In Cloud Code, 10 additional nodes, from 1 to 10, are available for scaling out.

    4. Run the following command to view information about the job that just ran.

      scontrol show job 1

      Expected output:

      JobId=1 JobName=cloudnodedemo.sh
         UserId=root(0) GroupId=root(0) MCS_label=N/A
         Priority=4294901757 Nice=0 Account=(null) QOS=(null)
         JobState=COMPLETED Reason=None Dependency=(null)
         Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
         RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
         SubmitTime=2024-05-28T11:37:36 EligibleTime=2024-05-28T11:37:36
         AccrueTime=2024-05-28T11:37:36
         StartTime=2024-05-28T11:37:36 EndTime=2024-05-28T11:37:36 Deadline=N/A
         SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-28T11:37:36 Scheduler=Main
         Partition=debug AllocNode:Sid=slurm-job-demo:93
         ReqNodeList=(null) ExcNodeList=(null)
         NodeList=slurm-job-demo-worker-cpu-0
         BatchHost=slurm-job-demo-worker-cpu-0
         NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
         ReqTRES=cpu=1,mem=1M,node=1,billing=1
         AllocTRES=cpu=1,mem=1M,node=1,billing=1
         Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
         MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
         Features=(null) DelayBoot=00:00:00
         OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
         Command=//cloudnodedemo.sh
         WorkDir=/
         StdErr=//slurm-1.out
         StdIn=/dev/null
         StdOut=//slurm-1.out
         Power=

      In the output, NodeList=slurm-demo-worker-cpu-0 indicates that the job ran on the newly added node.

    5. After a while, run the following command to view the node scale-in information.

      sinfo

      Expected output:

      PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
      debug*       up   infinite     11  idle~ slurm-demo-worker-cpu-[0-10]

      The output shows that the available nodes for scaling out are again nodes 0 through 10, for a total of 11 nodes. This indicates that the automatic scale-in is complete.