
Container Service for Kubernetes:Colocated scheduling for Slurm HPC and Kubernetes

Last Updated:Mar 27, 2026

When you run Slurm High Performance Computing (HPC) jobs and Kubernetes workloads on the same ACK cluster using static allocation, each Slurm Pod permanently reserves fixed resources — even when idle. Those idle resources are unavailable to Kubernetes, which causes resource fragmentation. Changing a Slurm Pod's resource specs also requires deleting and recreating it, making workload migration difficult when resource demand fluctuates.

The ack-slurm-operator colocated scheduling solution solves this by letting Slurm jobs and Kubernetes Pods share the same physical nodes dynamically. A SlurmCopilot component running in the Kubernetes cluster coordinates resource allocation with Slurm in real time, so both schedulers use the same physical capacity without conflicting over allocations.

How it works

The following figure shows how colocated scheduling works for Slurm HPC and Kubernetes workloads.

Architecture diagram

Key components

| Component | Description |
| --- | --- |
| SlurmOperator | Launches a containerized Slurm cluster. Worker Pods run exclusively on dedicated cluster nodes; other Slurm system components can be scheduled on any node. |
| SlurmCopilot | Communicates with Slurmctld using a cluster token for resource coordination. When an AdmissionCheck is added to a GenericNode, SlurmCopilot updates the available resources for that node in Slurmctld. After successfully modifying the resources, it writes the status back to the GenericNode and notifies the ACK scheduler to complete scheduling. |
| Slurmctld | The central manager daemon for Slurm. Monitors cluster resources and jobs, schedules jobs, and allocates resources. Supports a backup Slurmctld for high availability. |
| GenericNode | A custom resource that acts as a resource ledger between Kubernetes and Slurm. Before the ACK scheduler places a Pod on a node, it adds an AdmissionCheck to the GenericNode to request resource confirmation from Slurm. |
| Slurmd | The Slurm node daemon. Runs on each compute node, executes jobs, and reports node and job status to Slurmctld. |
| Slurmdbd | The Slurm database daemon for job accounting. Optional: you can store accounting data in files instead. |
| Slurmrestd | The Slurm REST API daemon. Optional: you can interact with Slurm using CLI tools instead. |
By default, when Slurmctld starts, a JWT token is auto-generated and written to a Kubernetes Secret via kubectl. To override this, use a custom startup script or revoke the Secret update permission, then manually update the token in the ack-slurm-jwt-token Secret in the ack-slurm-operator namespace. In the Data field, use the cluster name as the key and the Base64-encoded token (base64 --wrap=0) as the value.
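For example, if the cluster name is slurm-test (an assumption; substitute your own cluster name and token), the manually maintained Secret would look like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ack-slurm-jwt-token
  namespace: ack-slurm-operator
data:
  # Key: the cluster name. Value: the Base64-encoded token, produced with
  #   echo -n "<token>" | base64 --wrap=0
  slurm-test: <Base64-encoded-token>
```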

Static allocation vs. colocated scheduling

The following table compares the two approaches.

| Static allocation | Colocated scheduling |
| --- | --- |
| Static allocation diagram | Colocated scheduling diagram |
SlurmCopilot communicates with Slurm through the Slurm REST API (OpenAPI), so colocated scheduling also works with non-containerized Slurm clusters. See Extend colocated scheduling to non-containerized clusters for details.

Prerequisites

Before you begin, ensure that you have an ACK cluster and that you can connect to it using kubectl.

Install ack-slurm-operator

  1. Log on to the ACK console and click the name of your cluster to open the cluster details page.

  2. Follow the steps shown below to install ack-slurm-operator. Leave the Application Name and Namespace fields blank. After clicking Next, a Confirm dialog box appears. Click Yes to use the default application name (ack-slurm-operator) and namespace (ack-slurm-operator).

    ack-slurm-operator installation step

  3. Set the Chart Version to the latest version, set enableCopilot to true, and set watchNamespace to default (or a custom namespace). Click OK to complete installation.

    ack-slurm-operator configuration

  4. (Optional) To update ack-slurm-operator later: on the Cluster Information page, click the Applications > Helm tab, find ack-slurm-operator, and click Update.

    ack-slurm-operator update

Install and configure ack-slurm-cluster

Use Helm to deploy the SlurmCluster chart from Alibaba Cloud. The chart creates all required resources — RBAC, ConfigMaps, Secrets, and the SlurmCluster resource — from a single values.yaml file.

Resources and parameters

The chart creates the following resources.

| Resource type | Resource name | Purpose |
| --- | --- | --- |
| ConfigMap | {{ .Values.slurmConfigs.configMapName }} | Stores user-defined Slurm configuration files. Created when createConfigsByConfigMap=true. Mounted to .Values.slurmConfigs.slurmConfigPathInPod and copied to /etc/slurm/ on Pod startup. |
| ServiceAccount | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Allows the Slurmctld Pod to modify the SlurmCluster resource for auto scaling with the CloudNode feature. |
| Role | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Same purpose as the ServiceAccount above. |
| RoleBinding | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Same purpose as the ServiceAccount above. |
| Role | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | Allows the Slurmctld Pod to modify Secrets in the slurm-operator namespace for token updates during hybrid deployment. |
| RoleBinding | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | Same purpose as the Role above. |
| Secret | {{ .Values.mungeConfigs.secretName }} | Used for authentication between Slurm components. Created when createConfigsBySecret=true, with content "munge.key"={{ .Values.mungeConfigs.content }}. |
| SlurmCluster | Custom | The rendered SlurmCluster resource. |

The following table describes the key parameters.

| Parameter | Example value | Description |
| --- | --- | --- |
| clusterName | | The cluster name, used to generate resources such as Secrets and Roles. Must match ClusterName in the Slurm configuration files. |
| headNodeConfig | | Required. Configurations for the Slurmctld Pod. |
| workerNodesConfig | | Configurations for the Slurmd Pod. |
| workerNodesConfig.deleteSelfBeforeSuspend | true | When true, adds a preStop hook to the worker Pod that drains the node and marks it offline before termination. |
| slurmdbdConfigs | | Configurations for the Slurmdbd Pod. If not specified, no Slurmdbd Pod is created. |
| slurmrestdConfigs | | Configurations for the Slurmrestd Pod. If not specified, no Slurmrestd Pod is created. |
| headNodeConfig.hostNetwork<br>slurmdbdConfigs.hostNetwork<br>slurmrestdConfigs.hostNetwork<br>workerNodesConfig.workerGroups[].hostNetwork | false | Specifies hostNetwork for the respective Pod. |
| headNodeConfig.setHostnameAsFQDN<br>slurmdbdConfigs.setHostnameAsFQDN<br>slurmrestdConfigs.setHostnameAsFQDN<br>workerNodesConfig.workerGroups[].setHostnameAsFQDN | false | Specifies setHostnameAsFQDN for the respective Pod. |
| headNodeConfig.nodeSelector<br>slurmdbdConfigs.nodeSelector<br>slurmrestdConfigs.nodeSelector<br>workerNodesConfig.workerGroups[].nodeSelector | nodeSelector:<br> example: example | Specifies nodeSelector for the respective Pod. |
| headNodeConfig.tolerations<br>slurmdbdConfigs.tolerations<br>slurmrestdConfigs.tolerations<br>workerNodesConfig.workerGroups[].tolerations | tolerations:<br>- key:<br> value:<br> operator: | Specifies tolerations for the respective Pod. |
| headNodeConfig.affinity<br>slurmdbdConfigs.affinity<br>slurmrestdConfigs.affinity<br>workerNodesConfig.workerGroups[].affinity | affinity:<br> nodeAffinity:<br> requiredDuringSchedulingIgnoredDuringExecution:<br> nodeSelectorTerms:<br> - matchExpressions:<br> - key: topology.kubernetes.io/zone<br> operator: In<br> values:<br> - zone-a | Specifies affinity for the respective Pod. |
| headNodeConfig.resources<br>slurmdbdConfigs.resources<br>slurmrestdConfigs.resources<br>workerNodesConfig.workerGroups[].resources | resources:<br> requests:<br> cpu: 1<br> limits:<br> cpu: 1 | Specifies resource requests and limits for the primary container. The resource limits of worker Pod containers define the resource upper limits for the Slurm node. |
| headNodeConfig.image<br>slurmdbdConfigs.image<br>slurmrestdConfigs.image<br>workerNodesConfig.workerGroups[].image | registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59 | The container image. To build a custom image, see the Alibaba Cloud open source repository. |
| headNodeConfig.imagePullSecrets<br>slurmdbdConfigs.imagePullSecrets<br>slurmrestdConfigs.imagePullSecrets<br>workerNodesConfig.workerGroups[].imagePullSecrets | imagePullSecrets:<br>- name: example | The image pull secret for the respective Pod. |
| headNodeConfig.podSecurityContext<br>slurmdbdConfigs.podSecurityContext<br>slurmrestdConfigs.podSecurityContext<br>workerNodesConfig.workerGroups[].podSecurityContext | podSecurityContext:<br> runAsUser: 1000<br> runAsGroup: 3000<br> fsGroup: 2000<br> supplementalGroups: [4000] | Specifies podSecurityContext for the respective Pod. |
| headNodeConfig.securityContext<br>slurmdbdConfigs.securityContext<br>slurmrestdConfigs.securityContext<br>workerNodesConfig.workerGroups[].securityContext | securityContext:<br> allowPrivilegeEscalation: false | Specifies securityContext for the primary container. |
| headNodeConfig.volumeMounts<br>slurmdbdConfigs.volumeMounts<br>slurmrestdConfigs.volumeMounts<br>workerNodesConfig.workerGroups[].volumeMounts | | Volume mounts for the primary container. |
| headNodeConfig.volumes<br>slurmdbdConfigs.volumes<br>slurmrestdConfigs.volumes<br>workerNodesConfig.workerGroups[].volumes | | Volumes for the respective Pod. |
| slurmConfigs.slurmConfigPathInPod | | The mount path for Slurm configuration files inside the Pod. When configuration files are mounted via a volume, declare the location of slurm.conf here. On startup, the Pod copies files from this path to /etc/slurm/ and sets permissions. |
| slurmConfigs.createConfigsByConfigMap | true | When true, automatically creates a ConfigMap to store Slurm configuration files. |
| slurmConfigs.configMapName | | The name of the ConfigMap that stores Slurm configuration files. |
| slurmConfigs.filesInConfigMap | | The content of configuration files when auto-creating a ConfigMap. For configuration reference, see the Slurm System Configuration Tool. |
| mungeConfigs.mungeConfigPathInPod | | The mount path for MUNGE configuration files inside the Pod. Declare the location of munge.key here when mounting via a volume. On startup, the Pod copies files to /etc/munge/ and sets permissions. |
| mungeConfigs.createConfigsBySecret | | When true, automatically creates a Secret for MUNGE configuration files. |
| mungeConfigs.secretName | | The name of the Secret for MUNGE configuration files. |
| mungeConfigs.content | | The content of the MUNGE configuration file when auto-creating a Secret. |
Important

Changes to slurmConfigs.filesInConfigMap after a Pod has started require you to recreate the Pod. Confirm the file content before starting the Pod.

Step 1: Pull the Helm chart

  1. Add the Alibaba Cloud Helm repository.

    helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/
  2. Pull and unpack the chart. This creates an ack-slurm-cluster directory containing all chart files and templates.

    helm pull aliyun/ack-slurm-cluster --untar=true
  3. Open values.yaml to configure the chart parameters.

    cd ack-slurm-cluster
    vi values.yaml

Step 2: Generate and import a JWT key

SlurmCopilot and Slurmrestd use JSON Web Token (JWT) authentication. Generate a JSON Web Key (JWK) and import it into the cluster before installing the chart.

The JWT authentication plug-in uses JWKs (RFC7517) to sign and verify tokens. The private key signs tokens; the public key is configured in the plug-in to verify them.
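The token mechanics can be sketched with the standard library alone. This is an illustrative sketch of HS256 signing and verification, not Slurm's implementation; the function names are made up for the example:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64 URL-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, key: bytes) -> str:
    """Sign a payload with HMAC-SHA256 and return a compact JWT."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_hs256(token: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    expected = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

token = sign_hs256({"sub": "root"}, b"secret")
print(verify_hs256(token, b"secret"))  # True
```

The same private/public split applies to the RSA and ECDSA algorithms listed below; only HMAC uses a single shared key.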

Generate a JWK using an online tool such as mkjwk.org or generate one yourself. The following is a valid JWK example:

{
  "kty": "RSA",
  "e": "AQAB",
  "kid": "O9fpdhrViq2zaaaBEWZITz",
  "use": "sig",
  "alg": "RS256",
  "n": "qSVxcknOm0uCq5vGsOmaorPDzHUubBmZZ4UXj-9do7w9X1uKFXAnqfto4TepSNuYU2bA_-tzSLAGBsR-BqvT6w9SjxakeiyQpVmexxnDw5WZwpWenUAcYrfSPEoNU-0hAQwFYgqZwJQMN8ptxkd0170PFauwACOx4Hfr-9FPGy8NCoIO4MfLXzJ3mJ7xqgIZp3NIOGXz-GIAbCf13ii7kSStpYqN3L_zzpvXUAos1FJ9IPXRV84tIZpFVh2lmRh0h8ImK-vI42dwlD_hOIzayL1Xno2R0T-d5AwTSdnep7g-Fwu8-sj4cCRWq3bd61Zs2QOJ8iustH0vSRMYdP5oYQ"
}
This example is in JSON format. Convert to YAML if configuring the plug-in via YAML. Only configure the public key in the JWT authentication plug-in. Store the private key securely.

The plug-in supports the following signing algorithms.

| Signing algorithm | Supported alg values |
| --- | --- |
| RSASSA-PKCS1-v1_5 with SHA-2 | RS256, RS384, RS512 |
| Elliptic Curve (ECDSA) with SHA-2 | ES256, ES384, ES512 |
| HMAC using SHA-2 | HS256, HS384, HS512 |
Important

For HS256, HS384, or HS512 keys, the key must be Base64 URL-encoded. If you get an Invalid Signature error, verify the key format matches the format used to generate the token.
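A key in that format can be produced with standard tools. This is one possible approach, assuming openssl is available; any cryptographically random 256-bit value works:

```shell
# Generate 32 random bytes, Base64-encode them, then convert to the
# URL-safe alphabet (+/ become -_) and strip the = padding.
openssl rand -base64 32 | tr '+/' '-_' | tr -d '='
```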

Import the JWK into the cluster.

kubectl create configmap jwt --from-literal=jwt_hs256.key='<Your-JWK>'

Step 3: Configure Slurm

Edit slurmConfigs.filesInConfigMap in values.yaml to declare the Generic Resource Scheduling (GRES) configuration, database address, and authentication settings.

slurmConfigs:
  ...
  filesInConfigMap:
    gres.conf: |
      # Used by SlurmCopilot to sync Kubernetes allocated resources to Slurm.
      Name=k8scpu Flags=CountOnly
      Name=k8smemory Flags=CountOnly
    slurmdbd.conf: |
      # Log path. Must match the path used for verification.
      LogFile=/var/log/slurmdbd.log
      # JWT authentication is required when using Slurmrestd.
      AuthAltTypes=auth/jwt
      # Slurmdbd uses this key to authenticate tokens. Mount the key into the Pod as shown below.
      AuthAltParameters=jwt_key=/var/jwt/jwt_hs256.key
      AuthType=auth/munge
      SlurmUser=slurm
      # Set MySQL database account information.
      StoragePass=
      StorageHost=
      StorageType=accounting_storage/mysql
      StorageUser=root
      StoragePort=3306
    slurm.conf: |
      # Sets k8scpu and k8smemory extended resource properties when a node joins the cluster,
      # preventing the node from being set to DOWN state.
      NodeFeaturesPlugins=node_features/k8s_resources
      # Automatically adds k8scpu and k8smemory extended resources when submitting Slurm jobs.
      JobSubmitPlugins=k8s_resource_completion
      # JWT authentication is required when using Slurmrestd.
      AuthAltTypes=auth/jwt
      # Slurmctld uses this key to generate tokens. Mount the key into the Pod as shown below.
      AuthAltParameters=jwt_key=/var/jwt/jwt_hs256.key
      # Used by SlurmCopilot to sync Kubernetes allocated resources to Slurm.
      GresTypes=k8scpu,k8smemory
      # Enter ${slurmClusterName}-slurmdbd.
      AccountingStorageHost=
      AccountingStoragePort=6819
      AccountingStorageType=accounting_storage/slurmdbd
      # MySQL database settings for the JobComp plug-in.
      JobCompHost=
      JobCompLoc=/var/log/slurm/slurm_jobcomp.log
      JobCompPass=
      JobCompPort=3306
      JobCompType=jobcomp/mysql
      JobCompUser=root
      # High availability configuration.
      SlurmctldHost=

Configure the Slurmctld, Slurmdbd, and Slurmrestd Pods to mount the JWT key.

...
headNodeConfig:
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
  # Mount the JWT key into Slurm. The mount path must match the path in the configuration above.
  volumes:
  - configMap:
      defaultMode: 444
      name: jwt
    name: config-jwt
  volumeMounts:
  - mountPath: /var/jwt
    name: config-jwt
slurmdbdConfigs:
  nodeSelector: {}
  tolerations: []
  affinity: {}
  resources: {}
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
  imagePullSecrets: []
  # if .slurmConfigs.createConfigsByConfigMap is true, slurmConfPath, volume, and volumeMounts
  # are automatically set. This also applies to mungeConfigs.createConfigsBySecret.
  volumes:
  - configMap:
      defaultMode: 444
      name: jwt
    name: config-jwt
  volumeMounts:
  - mountPath: /var/jwt
    name: config-jwt

slurmrestdConfigs:
  nodeSelector: {}
  tolerations: []
  affinity: {}
  resources: {}
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
  imagePullSecrets: []
  volumes:
  - configMap:
      defaultMode: 444
      name: jwt
    name: config-jwt
  volumeMounts:
  - mountPath: /var/jwt
    name: config-jwt

Step 4: Install the chart

Install the chart. If ack-slurm-cluster is already installed, run helm upgrade instead. After an upgrade, manually delete the existing Pods and the Slurmctld StatefulSet to apply configuration changes.

cd ..
helm install my-slurm-cluster ack-slurm-cluster

Verify installation

  1. Confirm the chart is deployed.

    helm list

    Expected output:

    NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
    ack-slurm-cluster       default         1               2024-07-19 14:47:58.126357 +0800 CST    deployed        ack-slurm-cluster-2.0.0 2.0.0
  2. Confirm all Pods are running.

    kubectl get pod

    Expected output shows one worker Pod and three control plane Pods running.

    NAME                          READY   STATUS    RESTARTS   AGE
    slurm-test-slurmctld-dlncz    1/1     Running   0          3h49m
    slurm-test-slurmdbd-8f75r     1/1     Running   0          3h49m
    slurm-test-slurmrestd-mjdzt   1/1     Running   0          3h49m
    slurm-test-worker-cpu-0       1/1     Running   0          166m
  3. Confirm Slurmdbd started correctly.

    kubectl exec slurm-test-slurmdbd-8f75r -- cat /var/log/slurmdbd.log | head

    Expected output:

    [2024-07-22T19:52:55.727] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 8.0.34
    [2024-07-22T19:52:55.737] error: Database settings not recommended values: innodb_lock_wait_timeout
    [2024-07-22T19:52:56.089] slurmdbd version 23.02.7 started

Build a generic Slurm image

If you need additional dependencies in Slurm, build a custom image. The registry-cn-beijing.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59 image includes all packages required by the examples in this document.

Include the following plug-ins:

  • kubectl and node_features/k8s_resources (required)

  • job_submit/k8s_resource_completion (optional, required for auto-filling GRES resources)

About the `k8s_resources` plug-in

By default, Slurmctld checks a node's GRES resource usage when its Slurmd sends a _slurm_rpc_node_registration request. If Slurmctld detects any GRES change, it marks the node as INVAL, which prevents new tasks from being scheduled and requires the node to rejoin the cluster. The k8s_resources plug-in prevents this: when a node's ActivateFeature is updated, the plug-in sets the k8scpu and k8smemory resources to 0 and sets the node_feature flag for both to true, allowing the node to bypass the GRES check.

The plug-in and Dockerfile source code are available in the Alibaba Cloud open source repository.

Example Dockerfile

FROM nvidia/cuda:11.4.3-cudnn8-devel-ubuntu20.04 as exporterBuilder
ENV TZ=Asia/Shanghai
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update && apt install -y golang git munge libhttp-parser-dev libjson-c-dev libyaml-dev libjwt-dev libgtk2.0-dev libreadline-dev libpmix-dev libmysqlclient-dev libhwloc-dev openmpi-bin openmpi-common libopenmpi-dev rpm libmunge-dev libmunge2 libpam-dev perl python3 systemd lua5.3 libnvidia-ml-dev libhdf5-dev
# Download the source code before building the image
COPY ./slurm-23.02.7.tar.bz2 ./slurm-23.02.7.tar.bz2
RUN tar -xaf slurm-23.02.7.tar.bz2
COPY ./node_features/k8s_resources ./slurm-23.02.7/src/plugins/node_features/k8s_resources
RUN sed -i '/"src\/plugins\/node_features\/Makefile") CONFIG_FILES="\$CONFIG_FILES src\/plugins\/node_features\/Makefile" ;;/ a "    src/plugins/node_features/k8s_resources/Makefile") CONFIG_FILES="\$CONFIG_FILES src/plugins/node_features/k8s_resources/Makefile" ;;' ./slurm-23.02.7/configure
RUN awk '/^ac_config_files="\$ac_config_files/ && !found { print; print "ac_config_files=\"$ac_config_files src/plugins/node_features/k8s_resources/Makefile\""; found=1; next } { print }' ./slurm-23.02.7/configure > ./slurm-23.02.7/configure.new && mv ./slurm-23.02.7/configure.new ./slurm-23.02.7/configure && chmod +x ./slurm-23.02.7/configure
RUN cat ./slurm-23.02.7/configure
RUN sed -i '/^SUBDIRS =/ s/$/ k8s_resources/' ./slurm-23.02.7/src/plugins/node_features/Makefile && \
sed -i '/^SUBDIRS =/ s/$/ k8s_resources/' ./slurm-23.02.7/src/plugins/node_features/Makefile.in && \
sed -i '/^SUBDIRS =/ s/$/ k8s_resources/' ./slurm-23.02.7/src/plugins/node_features/Makefile.am
RUN cd slurm-23.02.7 && ./configure --prefix=/usr/ --sysconfdir=/etc/slurm && make

FROM nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04
ENV TZ=Asia/Shanghai
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt update
RUN apt install -y munge libhttp-parser-dev libjson-c-dev libyaml-dev libjwt-dev libgtk2.0-dev libreadline-dev libpmix-dev libmysqlclient-dev libhwloc-dev openmpi-bin openmpi-common libopenmpi-dev rpm libmunge-dev libmunge2 libpam-dev perl python3 systemd lua5.3 inotify-tools openssh-server pip libnvidia-ml-dev libhdf5-dev
COPY --from=exporterBuilder /slurm-23.02.7 /slurm-23.02.7
RUN cd slurm-23.02.7 && make install && cd ../ && rm -rf /slurm-23.02.7
RUN apt remove libnvidia-ml-dev libnvidia-compute-545 -y; apt autoremove -y ; ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
COPY ./sh ./
RUN mkdir /etc/slurm
RUN chmod +x create-users.sh munge-inisitalization.sh slurm-initialization.sh slurm-suspend.sh slurm-resume.sh slurmd slurmctld slurmdbd slurmrestd
RUN touch /var/log/slurm-resume.log /var/log/slurm-suspend.log ; chmod 777 /var/log/slurm-resume.log /var/log/slurm-suspend.log
RUN mv slurmd /etc/init.d/slurmd && mv slurmdbd /etc/init.d/slurmdbd && mv slurmctld /etc/init.d/slurmctld
RUN ./create-users.sh && ./munge-inisitalization.sh && ./slurm-initialization.sh
RUN rm ./create-users.sh ./munge-inisitalization.sh ./slurm-initialization.sh
ENV NVIDIA_VISIBLE_DEVICES=
RUN apt-get update && apt-get upgrade -y && rm -rf /var/cache/apt/

Verify colocated scheduling

  1. Check the GenericNode resource to view resource allocation across both Slurm and Kubernetes.

    kubectl get genericnode

    Expected output:

    NAME                    CLUSTERNAME   ALIAS                     TYPE    ALLOCATEDRESOURCES
    cn-hongkong.10.1.0.19                 slurm-test-worker-cpu-0   Slurm   [{"allocated":{"cpu":"0","memory":"0"},"type":"Slurm"},{"allocated":{"cpu":"1735m","memory":"2393Mi"},"type":"Kubernetes"}]
  2. Submit a Slurm job and scale a Kubernetes Deployment to observe how allocations are reflected on the GenericNode.

    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl exec slurm-test-slurmctld-dlncz -- nohup srun --cpus-per-task=3 --mem=4000 --gres=k8scpu:3,k8smemory:4000 sleep inf &
    [1] 4132674
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl scale deployment nginx-deployment-basic --replicas 2
    deployment.apps/nginx-deployment-basic scaled
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl get genericnode
    NAME                    CLUSTERNAME   ALIAS                     TYPE    ALLOCATEDRESOURCES
    cn-hongkong.10.1.0.19                 slurm-test-worker-cpu-0   Slurm   [{"allocated":{"cpu":"3","memory":"4000Mi"},"type":"Slurm"},{"allocated":{"cpu":"2735m","memory":"3417Mi"},"type":"Kubernetes"}]
  3. Submit a second Slurm job. Because all resources are now allocated, the job enters Pending (PD) state.

    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl exec slurm-test-slurmctld-dlncz -- nohup srun --cpus-per-task=3 --mem=4000 sleep inf &
    [2] 4133454
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# srun: job 2 queued and waiting for resources
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl exec slurm-test-slurmctld-dlncz -- squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         2     debug    sleep     root PD       0:00      1 (Resources)
         1     debug    sleep     root  R       2:34      1 slurm-test-worker-cpu-0
The srun command in step 3 omits the --gres flag because the job_submit/k8s_resource_completion plug-in, loaded at cluster startup, automatically calculates and fills in the GRES amounts from the CPU and memory requests. If this plug-in is not enabled, add --gres=k8scpu:3,k8smemory:4000 manually.

Slurm job script examples

When submitting Slurm jobs without the auto-fill plug-in, calculate the required GRES manually.

`srun` and `sbatch` parameters

| Parameter | Description |
| --- | --- |
| --tres-per-task | Specifies the trackable resources (TRES) required per task. |
| --gres | Specifies the generic resources (GRES) required for the job. |
| --nodes / -N | Number of nodes to allocate. |
| --ntasks-per-node / --tasks-per-node | Number of tasks per node. |
| --cpus-per-task | Number of vCPUs per task. |
| --time / -t | Maximum run time. |
| --job-name / -J | Job name. |

Slurm resource calculation example

Calculating GRES resources

Use the following formulas for a single node:

  • Total vCPUs per node = (Tasks per node) x (CPUs per task)

  • Total memory per node = (Tasks per node) x (Memory per task)

Example: With 2 nodes, 4 tasks per node, and 2 CPUs per task: 4 x 2 = 8 vCPUs per node.
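The formulas above can be captured in a small helper. The function name and the 1000 MB-per-task figure are illustrative, not part of Slurm:

```python
def gres_for_node(tasks_per_node: int, cpus_per_task: int, mem_per_task_mb: int) -> str:
    """Compute per-node totals and the matching --gres string for k8scpu/k8smemory."""
    total_cpus = tasks_per_node * cpus_per_task      # total vCPUs per node
    total_mem = tasks_per_node * mem_per_task_mb     # total memory per node (MB)
    return f"--gres=k8scpu:{total_cpus},k8smemory:{total_mem}"

# 4 tasks per node x 2 CPUs per task = 8 vCPUs per node
print(gres_for_node(4, 2, 1000))  # → --gres=k8scpu:8,k8smemory:4000
```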

Using the auto-fill plug-in

The job_submit/k8s_resource_completion plug-in automatically populates --gres from CPU and memory requests. When using this plug-in:

  • Specify task count using -n or --ntasks (required)

  • Request GPU resources using --gpus-per-task, not --gpus or --gpus-per-socket

A code sample is available for compiling the plug-in.

Example job script

#!/bin/bash
#SBATCH --job-name=test_job                   # Job name
#SBATCH --nodes=2                             # Number of nodes required
#SBATCH --ntasks-per-node=4                   # Number of tasks per node
#SBATCH --cpus-per-task=2                     # Number of vCPUs per task
#SBATCH --time=01:00:00                       # Maximum run time
#SBATCH --output=job_output_%j.txt            # Standard output file
#SBATCH --error=job_error_%j.txt              # Standard error file

srun my_program

Alternatively, pass parameters on the command line:

sbatch --nodes=2 --ntasks-per-node=4 --cpus-per-task=2 --time=01:00:00 --job-name=test_job my_job_script.sh

Extend colocated scheduling to non-containerized clusters

SlurmCopilot communicates with Slurm via the Slurm REST API (OpenAPI), so it also supports Slurm clusters that are not running inside containers.

In a non-containerized scenario, create the following Kubernetes resources manually in addition to the JWT token described earlier.

  1. Create a Service for each Slurm cluster. SlurmCopilot retrieves Service information from the cluster and sends OpenAPI requests to ${.metadata.name}.${.metadata.namespace}.svc.cluster.local:${.spec.ports[0].port}. The Service name must be ${slurmCluster}-slurmrestd, where ${slurmCluster} matches the value specified in the GenericNode.

    apiVersion: v1
    kind: Service
    metadata:
      name: slurm-slurmrestd
      namespace: default
    spec:
      ports:
      - name: slurmrestd
        port: 8080
        protocol: TCP
        targetPort: 8080
  2. Create a DNS record for each Slurm cluster that resolves ${.metadata.name}.${.metadata.namespace}.svc.cluster.local:${.spec.ports[0].port} to the address of the Slurmrestd process.

  3. Create GenericNode resources for Slurm nodes. GenericNode provides SlurmCopilot with an alias mapping to a node in the Slurm cluster. The GenericNode name must match the Kubernetes node name, .spec.alias must match the node name in Slurm, and the labels kai.alibabacloud.com/cluster-name and kai.alibabacloud.com/cluster-namespace must match the Service.

    apiVersion: kai.alibabacloud.com/v1alpha1
    kind: GenericNode
    metadata:
      labels:
        kai.alibabacloud.com/cluster-name: slurm-test
        kai.alibabacloud.com/cluster-namespace: default
      name: cn-hongkong.10.1.0.19
    spec:
      alias: slurm-test-worker-cpu-0
      type: Slurm
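For the DNS record in step 2, one common approach (an assumption, not the only option) is to back the selector-less Service with a manual Endpoints object that points at the Slurmrestd host, so the in-cluster DNS name resolves without a separate record. The IP address below is an example value:

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  # Must match the Service name and namespace from step 1.
  name: slurm-slurmrestd
  namespace: default
subsets:
- addresses:
  - ip: 192.168.0.10   # Address of the non-containerized Slurmrestd process
  ports:
  - name: slurmrestd
    port: 8080
    protocol: TCP
```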

Summary

With colocated scheduling, you can use Slurm to schedule HPC jobs and Kubernetes to orchestrate containerized workloads on the same cluster. This solution lets you leverage the Kubernetes ecosystem and services, including Helm charts, CI/CD pipelines, and monitoring tools. You can use a unified platform to submit, schedule, and manage both HPC jobs and containerized workloads, consolidating these workloads into a single cluster to use hardware resources more efficiently.