Container Service for Kubernetes (ACK) provides the Slurm on Kubernetes solution and the ack-slurm-operator component, enabling you to efficiently deploy and manage the Slurm (Simple Linux Utility for Resource Management) scheduling system on ACK clusters for high performance computing (HPC) and large-scale AI/ML workloads.
Slurm
Slurm is a powerful, open-source platform for cluster resource management and job scheduling, specifically designed to optimize the performance and efficiency of supercomputers and large compute clusters. Its core components work together to ensure efficient system operation and flexible management. The following figure shows how Slurm works.
- slurmctld (Slurm Control Daemon): As the central controller for Slurm, slurmctld monitors system resources, schedules jobs, and manages the overall state of the cluster. For high availability, you can configure a standby slurmctld to prevent service interruptions if the primary controller fails.
- slurmd (Slurm Node Daemon): Deployed on each compute node, the slurmd daemon receives instructions from slurmctld and executes jobs. It starts and runs jobs, reports job status, and prepares to accept new work. It acts as a direct interface to compute resources and is fundamental to job execution.
- slurmdbd (Slurm Database Daemon): Although an optional component, slurmdbd is critical for the long-term management and auditing of large-scale clusters. It maintains a centralized database to store job history and accounting information. It also supports data aggregation across multiple Slurm-managed clusters, which improves data management efficiency.
- SlurmCLI: Provides a suite of command-line tools to facilitate job management and system monitoring:
  - scontrol: Provides detailed control over cluster management and configuration.
  - squeue: Queries the status of the job queue.
  - srun: Submits and manages jobs.
  - sbatch: Submits a batch job.
  - sinfo: Displays the overall state of the cluster, including node availability.
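A typical interaction with these tools can be sketched with a small batch script. The script name, job name, and node count below are illustrative examples, not values from this topic:

```shell
# Write a hypothetical batch script; #SBATCH lines are directives read by sbatch.
cat << 'EOF' > hello.sbatch
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello-%j.out
srun hostname
EOF
# On a live Slurm cluster you would then run:
#   sbatch hello.sbatch   # submit the job
#   squeue                # watch the queue
#   sinfo                 # check node states
grep -c '^#SBATCH' hello.sbatch   # → 3
```

The `--output=hello-%j.out` directive writes the job's output to a file whose name includes the job ID, which squeue and scontrol also report.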
Slurm on ACK
The Slurm Operator uses the SlurmCluster CustomResource (CR) to simplify the deployment and operation of Slurm clusters. This approach simplifies managing the configuration files and control plane, reducing overall complexity. The following figure shows the architecture of Slurm on ACK. A cluster administrator can deploy and manage a Slurm cluster simply by creating a SlurmCluster object. The Slurm Operator then automatically creates the corresponding Slurm control plane components. You can mount Slurm configuration files to these components by using shared storage or a ConfigMap.
Prerequisites
You need an ACK cluster that runs Kubernetes 1.22 or later and contains at least one GPU-accelerated node. For more information, see Add GPU-accelerated nodes to a cluster and Update clusters.
Step 1: Install the ack-slurm-operator
- Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
- On the Marketplace page, search for ack-slurm-operator and click the ack-slurm-operator card. On the ack-slurm-operator details page, click Deploy and follow the prompts to configure the component.
  You only need to select a target cluster. Keep all other parameters at their default settings.
- Click OK.
Step 2: Create a SlurmCluster
Create manually
- Create a Secret in your ACK cluster for MUNGE-based authentication.

  - Run the following command to generate a key for MUNGE-based authentication using the OpenSSL tool:

    ```shell
    openssl rand -base64 512 | tr -d '\r\n'
    ```

  - Run the following command to create a Secret that stores the MUNGE key you generated:

    ```shell
    kubectl create secret generic <$MungeKeyName> --from-literal=munge.key=<$MungeKey>
    ```

    - Replace <$MungeKeyName> with a custom name for your key, such as mungekey.
    - Replace <$MungeKey> with the key string that you generated in the previous step.

  Once the Secret is created, you can configure the SlurmCluster resource to use it for MUNGE-based authentication.
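The two commands above can be combined in one shell session. The Secret name mungekey below is an example; the kubectl line is shown commented out because it requires access to your cluster:

```shell
# Generate a 512-byte random key and strip newlines, as described above.
MUNGE_KEY=$(openssl rand -base64 512 | tr -d '\r\n')

# Base64 of 512 bytes is always 684 characters once newlines are removed.
echo "${#MUNGE_KEY}"   # → 684

# Store it in a Secret (run against your ACK cluster; the name is an example):
# kubectl create secret generic mungekey --from-literal=munge.key="$MUNGE_KEY"
```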
- Run the following command to create the ConfigMap that the SlurmCluster resource requires.
  In this example, specifying slurmConfPath in the Custom Resource (CR) mounts the ConfigMap to the pods. This ensures that the configuration is automatically restored if a pod is recreated. The data parameter in the code is a sample configuration file. To generate a configuration file, you can use the Easy Configurator or Full Configurator tools.
  Expected output:

  ```
  configmap/slurm-test created
  ```

  This output indicates that the ConfigMap was created successfully.
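The manifest applied in this step has roughly the following shape. This is a hedged sketch: the ConfigMap name slurm-test matches the expected output, while the slurm.conf settings shown are illustrative and must be replaced with a configuration generated for your cluster:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: slurm-test   # Matches "configmap/slurm-test created" above.
data:
  slurm.conf: |
    # Illustrative minimal settings; generate a real file with the
    # Easy Configurator or Full Configurator tools.
    ClusterName=slurm-job-demo
    SlurmctldHost=slurm-job-demo-head
    PartitionName=debug Nodes=ALL Default=YES State=UP
```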
- Submit the SlurmCluster CR.

  - Create a file named slurmcluster.yaml and copy the following content into it.
    Note: The example uses an Ubuntu-based image that includes CUDA 11.4, Slurm 23.06, and a self-developed component for the auto scaling of cloud nodes. To use a custom image, you must create and upload it yourself.
    The preceding SlurmCluster CR creates a Slurm-managed cluster with one head node and four worker nodes. These nodes run as pods in the ACK cluster. Note that the mungeConfPath and slurmConfPath specified in the SlurmCluster CR must match the mount paths defined in the slurmctld and workerGroupSpecs templates.
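As an orientation aid, a skeleton of such a CR might look as follows. The apiVersion and kind come from the outputs later in this topic; all other field names and values are assumptions based on the descriptions above and may differ in your ack-slurm-operator version:

```yaml
apiVersion: kai.alibabacloud.com/v1
kind: SlurmCluster
metadata:
  name: slurm-job-demo
spec:
  slurmConfPath: /etc/slurm   # Must match the mount path in the templates (assumed field name).
  mungeConfPath: /etc/munge   # Must match the mount path in the templates (assumed field name).
  slurmctld:
    template:
      spec:
        containers:
          - name: slurmctld
            image: registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59
  workerGroupSpecs:           # One group per worker type; two groups of two give four workers.
    - groupName: cpu
      replicas: 2
    - groupName: cpu1
      replicas: 2
```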
  - Run the following command to deploy slurmcluster.yaml to the cluster:

    ```shell
    kubectl apply -f slurmcluster.yaml
    ```

    Expected output:

    ```
    slurmcluster.kai.alibabacloud.com/slurm-job-demo created
    ```

  - Run the following command to check the status of the SlurmCluster:

    ```shell
    kubectl get slurmcluster
    ```

    Expected output:

    ```
    NAME             AVAILABLE WORKERS   STATUS   AGE
    slurm-job-demo   5                   ready    14m
    ```

    The output shows that the Slurm-managed cluster is deployed and all 5 nodes are in the Ready state.

  - Run the following command to verify that the pods for the slurm-job-demo Slurm-managed cluster are running:

    ```shell
    kubectl get pod
    ```

    Expected output:

    ```
    NAME                           READY   STATUS    RESTARTS   AGE
    slurm-job-demo-head-x9sgs      1/1     Running   0          14m
    slurm-job-demo-worker-cpu-0    1/1     Running   0          14m
    slurm-job-demo-worker-cpu-1    1/1     Running   0          14m
    slurm-job-demo-worker-cpu1-0   1/1     Running   0          14m
    slurm-job-demo-worker-cpu1-1   1/1     Running   0          14m
    ```

    The output shows that the one head node and four worker nodes in the Slurm-managed cluster are running properly.
Create using Helm
You can use the Helm package manager to quickly deploy a Slurm-managed cluster. The SlurmCluster chart, provided by Alibaba Cloud, simplifies installation, management, and configuration. After you download and configure the chart from the Alibaba Cloud chart repository, Helm creates the required resources for you, such as RBAC permissions, a ConfigMap, a Secret, and the SlurmCluster CR.
The Helm chart includes the following resources:
| Resource type | Resource name | Description |
| --- | --- | --- |
| ConfigMap | {{ .Values.slurmConfigs.configMapName }} | Stores the Slurm configuration files. This ConfigMap is created when slurmConfigs.createConfigsByConfigMap is set to true. |
| ServiceAccount | {{ .Release.Namespace }}/{{ .Values.clusterName }} | The ServiceAccount used by the slurmctld pod to access the Kubernetes API. |
| Role | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Allows the slurmctld pod to access and update the SlurmCluster CR in the release namespace. |
| RoleBinding | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Binds the preceding Role to the ServiceAccount. |
| Role | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | Allows the slurmctld pod to access resources in the Slurm Operator namespace. |
| RoleBinding | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | Binds the preceding Role to the ServiceAccount. |
| Secret | {{ .Values.mungeConfigs.secretName }} | Authenticates communications between Slurm components. This Secret is created when mungeConfigs.createConfigsBySecret is set to true. |
| SlurmCluster | | The rendered SlurmCluster CR. |
The following table describes the relevant parameters.
| Parameter | Sample value | Description |
| --- | --- | --- |
| clusterName | "" | The name of the cluster. It is used to generate resources such as Secrets and Roles, and must match the ClusterName set in the Slurm configuration files. |
| headNodeConfig | None | Required. Defines the pod configuration for the slurmctld (head) node. |
| workerNodesConfig | None | Defines the pod configuration for the slurmd (worker) nodes. |
| workerNodesConfig.deleteSelfBeforeSuspend | true | When set to true, a worker node removes itself from the Slurm cluster before it is suspended during a scale-in. |
| slurmdbdConfigs | None | Defines the pod configuration for slurmdbd. |
| slurmrestdConfigs | None | Defines the pod configuration for slurmrestd. |
| headNodeConfig.hostNetwork<br>slurmdbdConfigs.hostNetwork<br>slurmrestdConfigs.hostNetwork<br>workerNodesConfig.workerGroups[].hostNetwork | false | Sets the hostNetwork field of the corresponding pod(s). |
| headNodeConfig.setHostnameAsFQDN<br>slurmdbdConfigs.setHostnameAsFQDN<br>slurmrestdConfigs.setHostnameAsFQDN<br>workerNodesConfig.workerGroups[].setHostnameAsFQDN | false | Sets the setHostnameAsFQDN field of the corresponding pod(s). |
| headNodeConfig.nodeSelector<br>slurmdbdConfigs.nodeSelector<br>slurmrestdConfigs.nodeSelector<br>workerNodesConfig.workerGroups[].nodeSelector | None | Sets the nodeSelector of the corresponding pod(s). |
| headNodeConfig.tolerations<br>slurmdbdConfigs.tolerations<br>slurmrestdConfigs.tolerations<br>workerNodesConfig.workerGroups[].tolerations | None | Sets the tolerations of the corresponding pod(s). |
| headNodeConfig.affinity<br>slurmdbdConfigs.affinity<br>slurmrestdConfigs.affinity<br>workerNodesConfig.workerGroups[].affinity | None | Sets the affinity of the corresponding pod(s). |
| headNodeConfig.resources<br>slurmdbdConfigs.resources<br>slurmrestdConfigs.resources<br>workerNodesConfig.workerGroups[].resources | None | Specifies the resources for the main container. The resource limits of the main container in a worker pod determine the resource capacity of the corresponding Slurm node. |
| headNodeConfig.image<br>slurmdbdConfigs.image<br>slurmrestdConfigs.image<br>workerNodesConfig.workerGroups[].image | "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59" | Sets the container image for the main container in the corresponding pod(s). To use a custom image, see ai-models-on-ack/framework/slurm/building-slurm-image at main · AliyunContainerService/ai-models-on-ack (github.com). |
| headNodeConfig.imagePullSecrets<br>slurmdbdConfigs.imagePullSecrets<br>slurmrestdConfigs.imagePullSecrets<br>workerNodesConfig.workerGroups[].imagePullSecrets | None | Sets the image pull secret for the corresponding pod(s). |
| headNodeConfig.podSecurityContext<br>slurmdbdConfigs.podSecurityContext<br>slurmrestdConfigs.podSecurityContext<br>workerNodesConfig.workerGroups[].podSecurityContext | None | Sets the pod-level security context of the corresponding pod(s). |
| headNodeConfig.securityContext<br>slurmdbdConfigs.securityContext<br>slurmrestdConfigs.securityContext<br>workerNodesConfig.workerGroups[].securityContext | None | Sets the security context for the main container of the corresponding pod(s). |
| headNodeConfig.volumeMounts<br>slurmdbdConfigs.volumeMounts<br>slurmrestdConfigs.volumeMounts<br>workerNodesConfig.workerGroups[].volumeMounts | None | Sets the volume mounts for the main container of the corresponding pod(s). |
| headNodeConfig.volumes<br>slurmdbdConfigs.volumes<br>slurmrestdConfigs.volumes<br>workerNodesConfig.workerGroups[].volumes | None | Sets the volumes for the corresponding pod(s). |
| slurmConfigs.slurmConfigPathInPod | "" | The mount path for Slurm configurations within the pod. When Slurm configuration files are mounted into the pod by using a volume, you must use this parameter to declare the location of slurm.conf. |
| slurmConfigs.createConfigsByConfigMap | true | Specifies whether to automatically create a ConfigMap to store Slurm configuration files. |
| slurmConfigs.configMapName | "" | The name of the ConfigMap that stores the Slurm configuration files. |
| slurmConfigs.filesInConfigMap | "" | The content of the configuration files when the ConfigMap is automatically created. |
| mungeConfigs.mungeConfigPathInPod | None | The mount path for MUNGE configurations within the pod. When the MUNGE configuration file is mounted into the pod by using a volume, you must use this parameter to declare the location of munge.key. |
| mungeConfigs.createConfigsBySecret | None | Specifies whether to automatically create a Secret to store the MUNGE configuration file. |
| mungeConfigs.secretName | None | The name of the Secret when it is automatically created. |
| mungeConfigs.content | None | The content of the MUNGE configuration file when the Secret is automatically created. |
For more information about the content of slurmConfigs.filesInConfigMap, see Slurm System Configuration Tool (schedmd.com).
If you modify slurmConfigs.filesInConfigMap after the pods have started, you must recreate the pods for the changes to take effect. Therefore, confirm the file content before installation.
Follow these steps to install the chart:
- Run the following command to add the Alibaba Cloud chart repository to your local Helm client:

  ```shell
  helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/
  ```

  This command allows you to access various charts provided by Alibaba Cloud, including the Slurm chart.

- Run the following command to pull and untar the Helm chart:

  ```shell
  helm pull aliyun/ack-slurm-cluster --untar=true
  ```

  This operation creates a directory named ack-slurm-cluster in the current directory. This directory contains all the files and templates of the chart.

- Modify the chart parameters in the values.yaml file.
  The values.yaml file contains the default configuration for the chart. You can edit this file to modify parameters such as the Slurm configuration, resource requests and limits, and storage options.

  ```shell
  cd ack-slurm-cluster
  vi values.yaml
  ```

- Run the following commands to install the chart:

  ```shell
  cd ..
  helm install my-slurm-cluster ack-slurm-cluster # Replace my-slurm-cluster with a custom release name.
  ```

  This command deploys the Slurm-managed cluster.

- Verify the deployment.
  After the deployment is complete, use the Kubernetes command-line tool kubectl to check the deployment status and confirm that the Slurm cluster started successfully and is running correctly:

  ```shell
  kubectl get pods -l app.kubernetes.io/name=slurm-cluster
  ```
Step 3: Log on to the Slurm cluster
For Kubernetes cluster administrators
Kubernetes cluster administrators have full operational permissions on the cluster. Because a Slurm-managed cluster runs as pods within the Kubernetes cluster, an administrator can use the kubectl command-line tool to log on to any of its pods. This access automatically grants them root permissions within the Slurm-managed cluster.
Run the following command to log on to any pod of the Slurm-managed cluster:

```shell
# Replace slurm-job-demo-xxxxx with the name of a specific pod in your cluster.
kubectl exec -it slurm-job-demo-xxxxx -- bash
```
For regular Slurm cluster users
Administrators or regular users of a Slurm-managed cluster might not have the permissions to run the kubectl exec command. In this case, you must log on to the Slurm-managed cluster by using SSH.
- Using a Service's external IP address to log on to a head pod provides a persistent, scalable solution for long-term, stable access. This method uses a load balancer, allowing you to access the Slurm-managed cluster from any location within your internal network.
- Using port forwarding is a temporary solution for short-term operations or debugging because it requires the kubectl port-forward command to run continuously.
Use an external IP
- Create a Service of the LoadBalancer type to forward traffic and expose internal services. For more information, see Use an existing Server Load Balancer instance to expose an application or Expose an application by using an automatically created LoadBalancer Service.

  - The Service must use an internal-facing Classic Load Balancer (CLB) instance.
  - You must add the kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1 and kai.alibabacloud.com/slurm-node-type: head labels to the Service to ensure that it routes incoming requests to the correct pod.
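Under the constraints above, the Service manifest might look like the following sketch. The Service name, port, and the use of the two labels as the pod selector are assumptions; the annotation shown is the standard ACK annotation for requesting an internal-facing load balancer:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: slurm-head-ssh   # Example name.
  annotations:
    # Request an internal-facing CLB instance.
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "intranet"
spec:
  type: LoadBalancer
  selector:
    kai.alibabacloud.com/slurm-cluster: ack-slurm-cluster-1
    kai.alibabacloud.com/slurm-node-type: head
  ports:
    - name: ssh
      port: 22
      targetPort: 22
```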
- Run the following command to obtain the external IP address of the LoadBalancer type Service:

  ```shell
  kubectl get svc
  ```

- Run the following command to log on to the corresponding head pod by using SSH:

  ```shell
  # Replace $YOURUSER with the username in the pod and $EXTERNAL_IP with the external IP address obtained from the Service.
  ssh $YOURUSER@$EXTERNAL_IP
  ```
Use port forwarding
To use the port-forward method, you must save the KubeConfig file of the Kubernetes cluster to your local machine. This poses a security risk. Do not use this method in a production environment.
- Run the following command on your local machine to start port forwarding. This command maps the local port $LOCALPORT to port 22 (the default SSH port) of the slurmctld pod in the cluster.

  ```shell
  # Replace $NAMESPACE, $CLUSTERNAME, and $LOCALPORT with their actual values.
  kubectl port-forward -n $NAMESPACE svc/$CLUSTERNAME $LOCALPORT:22
  ```

- While the port-forward command is running, any user on the local machine can run the following command to log on to the cluster and submit jobs:

  ```shell
  # $YOURUSER is the username to use when logging on to the pod.
  ssh -p $LOCALPORT $YOURUSER@localhost
  ```
Step 4: Use SlurmCluster
This section describes how to configure user synchronization, shared logging, and auto scaling for your SlurmCluster.
User synchronization across nodes
Slurm does not provide a built-in service for centralized user authentication. When you submit a job to a SlurmCluster by using the sbatch command, the job may fail if the user account does not exist on the target node. To resolve this, you can configure Lightweight Directory Access Protocol (LDAP) as a centralized authentication backend for your SlurmCluster. This allows Slurm to verify user identities through the LDAP service. Perform the following steps:
- Create a file named ldap.yaml with the following content. This configuration deploys a basic LDAP service instance for storing and managing user information.
  The ldap.yaml file defines a Deployment to run the LDAP service and a Service to expose it on the network.

- Run the following command to deploy the LDAP backend service:

  ```shell
  kubectl apply -f ldap.yaml
  ```

  Expected output:

  ```
  deployment.apps/ldap created
  service/ldap-service created
  secret/ldap-secret created
  ```

- (Optional) To improve management efficiency, you can deploy a frontend interface. Create a file named phpldapadmin.yaml with the following content to deploy a frontend Pod and Service.
  Run the following command to deploy the LDAP frontend service:

  ```shell
  kubectl apply -f phpldapadmin.yaml
  ```

- Log on to a Pod in the SlurmCluster as described in Step 3. Then, run the following commands to install the LDAP client package:

  ```shell
  apt update
  apt install libnss-ldapd
  ```

- After the libnss-ldapd package is installed, configure the network authentication service for the SlurmCluster from within the Pod.

  - Run the following commands to install the Vim package for editing scripts and files:

    ```shell
    apt update
    apt install vim
    ```

  - Modify the following parameters in the /etc/ldap/ldap.conf file to configure the LDAP client:

    ```
    ...
    BASE dc=example,dc=org    # Replace this with the base DN of your LDAP directory.
    URI ldap://ldap-service   # Replace this with the address of your LDAP server.
    ...
    ```

  - Modify the following parameters in the /etc/nslcd.conf file to define the connection to the LDAP server:

    ```
    ...
    uri ldap://ldap-service   # Replace this with the address of your LDAP server.
    base dc=example,dc=org    # Set this based on your LDAP directory structure.
    ...
    tls_cacertfile /etc/ssl/certs/ca-certificates.crt  # The path to the CA certificate file used to verify the LDAP server certificate.
    ...
    ```
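Once the client is configured, the LDAP directory needs user entries for Slurm to resolve. The following is a hypothetical posixAccount entry under the dc=example,dc=org base used above; the user name, IDs, and organizational unit are examples, and you would add the entry with a tool such as ldapadd or the phpLDAPadmin frontend:

```
dn: uid=testuser,ou=People,dc=example,dc=org
objectClass: posixAccount
objectClass: inetOrgPerson
uid: testuser
cn: Test User
sn: User
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/testuser
loginShell: /bin/bash
```

After the entry exists, you can verify that the account resolves inside a SlurmCluster pod by running getent passwd testuser.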
Log sharing and access
By default, job logs generated by sbatch are stored directly on the node where the job runs, which can make viewing logs inconvenient. To centralize log access, you can create a NAS file system to store all job logs. This collects logs from all nodes in a single location, simplifying management. Perform the following steps:
- Create a NAS file system to store and share logs from all nodes. For more information, see Create a file system.
- Log on to the ACK console and create a Persistent Volume (PV) and a Persistent Volume Claim (PVC) for the NAS file system. For more information, see Use a statically provisioned NAS volume.
- Modify the SlurmCluster CR.
  Add the volumeMounts and volumes parameters to headGroupSpec and each workerGroupSpec to reference the created PVC and mount it to the /home directory. The following is an example:

  ```yaml
  headGroupSpec:
    ...
    # Add a volume mount for /home.
    volumeMounts:
      - mountPath: /home
        name: test            # The name of the volume that references the PVC.
    volumes:                  # Add the PVC definition.
      - name: test            # This must match the name in volumeMounts.
        persistentVolumeClaim:
          claimName: test     # Replace this with the name of your PVC.
    ...
  workerGroupSpecs:
    # Repeat the preceding volumeMounts and volumes configuration for each workerGroupSpec.
  ```

- Run the following command to apply the changes to the SlurmCluster CR:

  ```shell
  kubectl apply -f slurmcluster.yaml
  ```

  Important: If the SlurmCluster CR fails to deploy, run the kubectl delete slurmcluster slurm-job-demo command to delete the CR, and then deploy it again.

  After the deployment, all worker nodes share the same file system.
Auto scaling
The root directory of the default Slurm image includes executable files and scripts such as slurm-resume.sh, slurm-suspend.sh, and slurmctld-copilot. These components interact with slurmctld to manage cluster scaling.
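These scripts ship in the image and should not normally be edited, but their contract is simple: slurmctld invokes the configured resume and suspend programs with the affected node list as an argument. The following is a hypothetical sketch of that contract, not the shipped slurm-resume.sh:

```shell
# Hypothetical resume hook: slurmctld passes the node list (for example
# "slurm-job-demo-worker-cpu-[0-2]") as the first argument.
resume_hook() {
  local nodelist="$1"
  local log="${2:-/var/log/slurm-resume.log}"
  echo "resume called, args [${nodelist}]" >> "$log"
  # The real script then patches the SlurmCluster CR so that the
  # Slurm Operator creates the matching worker pods.
}

# Demo invocation writing to a local file instead of /var/log:
resume_hook "slurm-job-demo-worker-cpu-[0-2]" ./resume-demo.log
cat ./resume-demo.log
```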
Slurm auto scaling with cloud nodes
- local node: A physical compute node that is directly connected to the cluster manager.
- cloud node: A logical node that represents a VM instance that can be created and terminated on demand by a cloud provider.
Auto scaling in Slurm on ACK
Procedure
- Configure permissions for auto scaling. If you installed the cluster by using Helm, these permissions are created automatically for slurmctld, and you can skip this step.
  Auto scaling requires the head Pod to have permissions to access and update the SlurmCluster CR. Use RBAC to grant the head Pod the necessary permissions.
  First, create the ServiceAccount, Role, and RoleBinding required by slurmctld. Assume that the name of your SlurmCluster is slurm-job-demo and the namespace is default. Save the following content to a file named rbac.yaml:

  ```yaml
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: slurm-job-demo
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: slurm-job-demo
  rules:
    - apiGroups: ["kai.alibabacloud.com"]
      resources: ["slurmclusters"]
      verbs: ["get", "watch", "list", "update", "patch"]
      resourceNames: ["slurm-job-demo"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: slurm-job-demo
  subjects:
    - kind: ServiceAccount
      name: slurm-job-demo
  roleRef:
    kind: Role
    name: slurm-job-demo
    apiGroup: rbac.authorization.k8s.io
  ```

  After you save the file, run kubectl apply -f rbac.yaml to apply the manifest.
  Next, assign these permissions to the slurmctld Pod. Run kubectl edit slurmcluster slurm-job-demo to edit the SlurmCluster, and set .spec.slurmctld.template.spec.serviceAccountName to the ServiceAccount that you just created:

  ```yaml
  apiVersion: kai.alibabacloud.com/v1
  kind: SlurmCluster
  ...
  spec:
    slurmctld:
      template:
        spec:
          serviceAccountName: slurm-job-demo
  ...
  ```

  Then, recreate the StatefulSet that manages slurmctld to apply the preceding changes. Run kubectl get sts slurm-job-demo to view the StatefulSet that currently manages the slurmctld pod, and kubectl delete sts slurm-job-demo to delete it. The Slurm Operator then recreates the StatefulSet and applies the new configuration.
- Configure auto scaling in the /etc/slurm/slurm.conf file.

  Shared file system

  Add the following settings to slurm.conf:

  ```
  # The following settings are required when using cloud nodes.
  # SuspendProgram and ResumeProgram are custom-developed features.
  SuspendTimeout=600
  ResumeTimeout=600
  # The interval after which an idle node is automatically suspended.
  SuspendTime=600
  # The number of nodes that can be scaled out or in per minute.
  ResumeRate=1
  SuspendRate=1
  # The NodeName format must be ${cluster_name}-worker-${group_name}-. You must declare the node's resources in this line.
  # Otherwise, slurmctld treats the node as having only 1 CPU core.
  # To avoid resource waste, ensure that the resources declared here match the resources specified in the workerGroup.
  NodeName=slurm-job-demo-worker-cpu-[0-10] Feature=cloud State=CLOUD
  # The following settings are fixed and should not be changed.
  CommunicationParameters=NoAddrCache
  ReconfigFlags=KeepPowerSaveSettings
  SuspendProgram="/slurm-suspend.sh"
  ResumeProgram="/slurm-resume.sh"
  ```

  ConfigMap

  If slurm.conf is stored in the slurm-config ConfigMap, run kubectl edit configmap slurm-config and add the same settings to the slurm.conf key.

  Helm

  - Modify the values.yaml file and add the same settings to the slurm.conf content.
  - Run the helm upgrade command to update the current Slurm configuration.
- Apply the new configuration.
  Assuming that your SlurmCluster is named slurm-job-demo, you can run kubectl delete sts slurm-job-demo to apply the new configuration to the slurmctld Pod.
- Set the replica count for worker nodes to 0. This allows you to observe the auto scaling process from the beginning.

  Manual

  Assuming that the submitted SlurmCluster is named slurm-job-demo, run kubectl edit slurmcluster slurm-job-demo and change workerCount in the workerGroup to 0. This sets the replica count for worker nodes to 0.

  Helm

  In values.yaml, set .Values.workerGroup[].workerCount to 0. Then, run helm upgrade slurm-job-demo . to update the current Helm chart and set the worker replica count to 0.
- Submit an sbatch job.

  - Run the following command to create a shell script:

    ```shell
    cat << EOF > cloudnodedemo.sh
    ```

    Enter the following content at the prompt:

    ```
    #!/bin/bash
    srun hostname
    EOF
    ```

  - Run the following command to verify the content of the script:

    ```shell
    cat cloudnodedemo.sh
    ```

    Expected output:

    ```
    #!/bin/bash
    srun hostname
    ```

    The script content is correct.

  - Run the following command to submit the script to the SlurmCluster:

    ```shell
    sbatch cloudnodedemo.sh
    ```

    Expected output:

    ```
    Submitted batch job 1
    ```

    The output indicates that the job was successfully submitted and assigned a job ID.
- View the cluster scaling status.

  - Run the following command to view the SlurmCluster scaling logs:

    ```shell
    cat /var/log/slurm-resume.log
    ```

    Expected output:

    ```
    namespace: default
    cluster: slurm-demo
    resume called, args [slurm-demo-worker-cpu-0]
    slurm cluster metadata: default slurm-demo
    get SlurmCluster CR slurm-demo succeed
    hostlists: [slurm-demo-worker-cpu-0]
    resume node slurm-demo-worker-cpu-0
    resume worker -cpu-0
    resume node -cpu-0 end
    ```

    The log output shows that the SlurmCluster automatically added a compute node to meet the job demand.

  - Run the following command to view the status of the Pods in the cluster:

    ```shell
    kubectl get pod
    ```

    Expected output:

    ```
    NAME                      READY   STATUS    RESTARTS   AGE
    slurm-demo-head-9hn67     1/1     Running   0          21m
    slurm-demo-worker-cpu-0   1/1     Running   0          43s
    ```

    The output shows that slurm-demo-worker-cpu-0 is the new Pod in the cluster. This indicates that submitting the job triggered the cluster to scale out.

  - Run the following command to view the cluster node information:

    ```shell
    sinfo
    ```

    Expected output:

    ```
    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    debug*    up     infinite   10     idle~  slurm-job-demo-worker-cpu-[2-10]
    debug*    up     infinite   1      idle   slurm-job-demo-worker-cpu-[0-1]
    ```

    The output shows that slurm-demo-worker-cpu-0 is the newly launched node. The nodes in the idle~ state are powered-down cloud nodes that remain available for future scale-outs.

  - Run the following command to view information about the job that just ran:

    ```shell
    scontrol show job 1
    ```

    Expected output:

    ```
    JobId=1 JobName=cloudnodedemo.sh
       UserId=root(0) GroupId=root(0) MCS_label=N/A
       Priority=4294901757 Nice=0 Account=(null) QOS=(null)
       JobState=COMPLETED Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
       SubmitTime=2024-05-28T11:37:36 EligibleTime=2024-05-28T11:37:36
       AccrueTime=2024-05-28T11:37:36
       StartTime=2024-05-28T11:37:36 EndTime=2024-05-28T11:37:36 Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-28T11:37:36 Scheduler=Main
       Partition=debug AllocNode:Sid=slurm-job-demo:93
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=slurm-job-demo-worker-cpu-0
       BatchHost=slurm-job-demo-worker-cpu-0
       NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       ReqTRES=cpu=1,mem=1M,node=1,billing=1
       AllocTRES=cpu=1,mem=1M,node=1,billing=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=//cloudnodedemo.sh
       WorkDir=/
       StdErr=//slurm-1.out
       StdIn=/dev/null
       StdOut=//slurm-1.out
       Power=
    ```

    In the output, NodeList=slurm-job-demo-worker-cpu-0 indicates that the job ran on the newly added node.

  - After a while, run the following command to view the node scale-in information:

    ```shell
    sinfo
    ```

    Expected output:

    ```
    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    debug*    up     infinite   11     idle~  slurm-demo-worker-cpu-[0-10]
    ```

    The output shows that all 11 nodes, 0 through 10, are back in the powered-down idle~ state and available for scale-out. This indicates that the automatic scale-in is complete.