
Container Service for Kubernetes:Colocated scheduling for Slurm HPC and Kubernetes

Last Updated:Mar 27, 2026

When you run Slurm High Performance Computing (HPC) jobs and Kubernetes workloads on the same ACK cluster using static allocation, each Slurm Pod permanently reserves fixed resources — even when idle. Those idle resources are unavailable to Kubernetes, which causes resource fragmentation. Changing a Slurm Pod's resource specs also requires deleting and recreating it, making workload migration difficult when resource demand fluctuates.

The ack-slurm-operator colocated scheduling solution solves this by letting Slurm jobs and Kubernetes Pods share the same physical nodes dynamically. A SlurmCopilot component running in the Kubernetes cluster coordinates resource allocation with Slurm in real time, so both schedulers use the same physical capacity without conflicting over allocations.

How it works

The following figure shows how colocated scheduling works for Slurm HPC and Kubernetes workloads.

Architecture diagram

Key components

| Component | Description |
| --- | --- |
| SlurmOperator | Launches a containerized Slurm cluster. Worker Pods run exclusively on dedicated cluster nodes; other Slurm system components can be scheduled on any node. |
| SlurmCopilot | Communicates with Slurmctld using a cluster token for resource coordination. When an AdmissionCheck is added to a GenericNode, SlurmCopilot updates the available resources for that node in Slurmctld. After successfully modifying the resources, it writes the status back to the GenericNode and notifies the ACK scheduler to complete scheduling. |
| Slurmctld | The central manager daemon for Slurm. Monitors cluster resources and jobs, schedules jobs, and allocates resources. Supports a backup Slurmctld for high availability. |
| GenericNode | A custom resource that acts as a resource ledger between Kubernetes and Slurm. Before the ACK scheduler places a Pod on a node, it adds an AdmissionCheck to the GenericNode to request resource confirmation from Slurm. |
| Slurmd | The Slurm node daemon. Runs on each compute node, executes jobs, and reports node and job status to Slurmctld. |
| Slurmdbd | The Slurm database daemon for job accounting. Optional: you can store accounting data in files instead. |
| Slurmrestd | The Slurm REST API daemon. Optional: you can interact with Slurm using CLI tools instead. |
By default, when Slurmctld starts, a JWT token is auto-generated and written to a Kubernetes Secret via kubectl. To override this, use a custom startup script or revoke the Secret update permission, then manually update the token in the ack-slurm-jwt-token Secret in the ack-slurm-operator namespace. In the Data field, use the cluster name as the key and the Base64-encoded token (base64 --wrap=0) as the value.
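For example, if the cluster name is slurm-test (an assumption; substitute your own cluster name and token), the manually maintained Secret would look like this:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ack-slurm-jwt-token
  namespace: ack-slurm-operator
data:
  # Key: the cluster name. Value: the Base64-encoded token, produced with
  #   echo -n "<token>" | base64 --wrap=0
  slurm-test: <Base64-encoded-token>
```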

Static allocation vs. colocated scheduling

The following table compares the two approaches.

| Static allocation | Colocated scheduling |
| --- | --- |
| Static allocation diagram | Colocated scheduling diagram |
SlurmCopilot communicates with Slurm through the Slurm REST API (OpenAPI), so colocated scheduling also works with non-containerized Slurm clusters. See Extend colocated scheduling to non-containerized clusters for details.

Prerequisites

Before you begin, ensure that you have an ACK cluster and that you can connect to it using kubectl.

Install ack-slurm-operator

  1. Log on to the ACK console and click the name of your cluster to open the cluster details page.

  2. Follow the steps shown below to install ack-slurm-operator. Leave the Application Name and Namespace fields blank. After clicking Next, a Confirm dialog box appears. Click Yes to use the default application name (ack-slurm-operator) and namespace (ack-slurm-operator).

    ack-slurm-operator installation step

  3. Set the Chart Version to the latest version, set enableCopilot to true, and set watchNamespace to default (or a custom namespace). Click OK to complete installation.

    ack-slurm-operator configuration

  4. (Optional) To update ack-slurm-operator later: on the Cluster Information page, click the Applications > Helm tab, find ack-slurm-operator, and click Update.

    ack-slurm-operator update

Install and configure ack-slurm-cluster

Use Helm to deploy the SlurmCluster chart from Alibaba Cloud. The chart creates all required resources — RBAC, ConfigMaps, Secrets, and the SlurmCluster resource — from a single values.yaml file.

Resources and parameters

The chart creates the following resources.

| Resource type | Resource name | Purpose |
| --- | --- | --- |
| ConfigMap | {{ .Values.slurmConfigs.configMapName }} | Stores user-defined Slurm configuration files. Created when createConfigsByConfigMap=true. Mounted to .Values.slurmConfigs.slurmConfigPathInPod and copied to /etc/slurm/ on Pod startup. |
| ServiceAccount | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Allows the Slurmctld Pod to modify the SlurmCluster resource for auto scaling with the CloudNode feature. |
| Role | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Same purpose as the ServiceAccount above. |
| RoleBinding | {{ .Release.Namespace }}/{{ .Values.clusterName }} | Same purpose as the ServiceAccount above. |
| Role | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | Allows the Slurmctld Pod to modify Secrets in the slurm-operator namespace for token updates during hybrid deployment. |
| RoleBinding | {{ .Values.slurmOperatorNamespace }}/{{ .Values.clusterName }} | Same purpose as the Role above. |
| Secret | {{ .Values.mungeConfigs.secretName }} | Used for authentication between Slurm components. Created when createConfigsBySecret=true, with content "munge.key"={{ .Values.mungeConfigs.content }}. |
| SlurmCluster | Custom | The rendered SlurmCluster resource. |

The following table describes the key parameters.

| Parameter | Example value | Description |
| --- | --- | --- |
| clusterName | | The cluster name, used to generate resources such as Secrets and Roles. Must match ClusterName in the Slurm configuration files. |
| headNodeConfig | | Required. Configurations for the Slurmctld Pod. |
| workerNodesConfig | | Configurations for the Slurmd Pod. |
| workerNodesConfig.deleteSelfBeforeSuspend | true | When true, adds a preStop hook to the worker Pod that drains the node and marks it offline before termination. |
| slurmdbdConfigs | | Configurations for the Slurmdbd Pod. If not specified, no Slurmdbd Pod is created. |
| slurmrestdConfigs | | Configurations for the Slurmrestd Pod. If not specified, no Slurmrestd Pod is created. |
| headNodeConfig.hostNetwork<br>slurmdbdConfigs.hostNetwork<br>slurmrestdConfigs.hostNetwork<br>workerNodesConfig.workerGroups[].hostNetwork | false | Specifies hostNetwork for the respective Pod. |
| headNodeConfig.setHostnameAsFQDN<br>slurmdbdConfigs.setHostnameAsFQDN<br>slurmrestdConfigs.setHostnameAsFQDN<br>workerNodesConfig.workerGroups[].setHostnameAsFQDN | false | Specifies setHostnameAsFQDN for the respective Pod. |
| headNodeConfig.nodeSelector<br>slurmdbdConfigs.nodeSelector<br>slurmrestdConfigs.nodeSelector<br>workerNodesConfig.workerGroups[].nodeSelector | nodeSelector:<br> example: example | Specifies nodeSelector for the respective Pod. |
| headNodeConfig.tolerations<br>slurmdbdConfigs.tolerations<br>slurmrestdConfigs.tolerations<br>workerNodesConfig.workerGroups[].tolerations | tolerations:<br>- key:<br> value:<br> operator: | Specifies tolerations for the respective Pod. |
| headNodeConfig.affinity<br>slurmdbdConfigs.affinity<br>slurmrestdConfigs.affinity<br>workerNodesConfig.workerGroups[].affinity | affinity:<br> nodeAffinity:<br> requiredDuringSchedulingIgnoredDuringExecution:<br> nodeSelectorTerms:<br> - matchExpressions:<br> - key: topology.kubernetes.io/zone<br> operator: In<br> values:<br> - zone-a | Specifies affinity for the respective Pod. |
| headNodeConfig.resources<br>slurmdbdConfigs.resources<br>slurmrestdConfigs.resources<br>workerNodesConfig.workerGroups[].resources | resources:<br> requests:<br> cpu: 1<br> limits:<br> cpu: 1 | Specifies resource requests and limits for the primary container. The resource limits of worker Pod containers define the resource upper limits for the Slurm node. |
| headNodeConfig.image<br>slurmdbdConfigs.image<br>slurmrestdConfigs.image<br>workerNodesConfig.workerGroups[].image | registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59 | The container image. To build a custom image, see the Alibaba Cloud open source repository. |
| headNodeConfig.imagePullSecrets<br>slurmdbdConfigs.imagePullSecrets<br>slurmrestdConfigs.imagePullSecrets<br>workerNodesConfig.workerGroups[].imagePullSecrets | imagePullSecrets:<br>- name: example | The image pull secret for the respective Pod. |
| headNodeConfig.podSecurityContext<br>slurmdbdConfigs.podSecurityContext<br>slurmrestdConfigs.podSecurityContext<br>workerNodesConfig.workerGroups[].podSecurityContext | podSecurityContext:<br> runAsUser: 1000<br> runAsGroup: 3000<br> fsGroup: 2000<br> supplementalGroups: [4000] | Specifies podSecurityContext for the respective Pod. |
| headNodeConfig.securityContext<br>slurmdbdConfigs.securityContext<br>slurmrestdConfigs.securityContext<br>workerNodesConfig.workerGroups[].securityContext | securityContext:<br> allowPrivilegeEscalation: false | Specifies securityContext for the primary container. |
| headNodeConfig.volumeMounts<br>slurmdbdConfigs.volumeMounts<br>slurmrestdConfigs.volumeMounts<br>workerNodesConfig.workerGroups[].volumeMounts | | Volume mounts for the primary container. |
| headNodeConfig.volumes<br>slurmdbdConfigs.volumes<br>slurmrestdConfigs.volumes<br>workerNodesConfig.workerGroups[].volumes | | Volumes for the respective Pod. |
| slurmConfigs.slurmConfigPathInPod | | The mount path for Slurm configuration files inside the Pod. When configuration files are mounted via a volume, declare the location of slurm.conf here. On startup, the Pod copies files from this path to /etc/slurm/ and sets permissions. |
| slurmConfigs.createConfigsByConfigMap | true | When true, automatically creates a ConfigMap to store Slurm configuration files. |
| slurmConfigs.configMapName | | The name of the ConfigMap that stores Slurm configuration files. |
| slurmConfigs.filesInConfigMap | | The content of configuration files when auto-creating a ConfigMap. For configuration reference, see the Slurm System Configuration Tool. |
| mungeConfigs.mungeConfigPathInPod | | The mount path for MUNGE configuration files inside the Pod. Declare the location of munge.key here when mounting via a volume. On startup, the Pod copies files to /etc/munge/ and sets permissions. |
| mungeConfigs.createConfigsBySecret | | When true, automatically creates a Secret for MUNGE configuration files. |
| mungeConfigs.secretName | | The name of the Secret for MUNGE configuration files. |
| mungeConfigs.content | | The content of the MUNGE configuration file when auto-creating a Secret. |
Important

Changes to slurmConfigs.filesInConfigMap after a Pod has started require you to recreate the Pod. Confirm the file content before starting the Pod.

Step 1: Pull the Helm chart

  1. Add the Alibaba Cloud Helm repository.

    helm repo add aliyun https://aliacs-app-catalog.oss-cn-hangzhou.aliyuncs.com/charts-incubator/
  2. Pull and unpack the chart. This creates an ack-slurm-cluster directory containing all chart files and templates.

    helm pull aliyun/ack-slurm-cluster --untar=true
  3. Open values.yaml to configure the chart parameters.

    cd ack-slurm-cluster
    vi values.yaml

Step 2: Generate and import a JWT key

SlurmCopilot and Slurmrestd use JSON Web Token (JWT) authentication. Generate a JSON Web Key (JWK) and import it into the cluster before installing the chart.

The JWT authentication plug-in uses JWKs (RFC7517) to sign and verify tokens. The private key signs tokens; the public key is configured in the plug-in to verify them.
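The token mechanics can be sketched with the standard library alone. This is an illustrative sketch of HS256 signing and verification, not Slurm's implementation; the function names are made up for the example:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64 URL-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, key: bytes) -> str:
    """Sign a payload with HMAC-SHA256 and return a compact JWT."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

def verify_hs256(token: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    expected = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

token = sign_hs256({"sub": "root"}, b"secret")
print(verify_hs256(token, b"secret"))  # True
```

The same private/public split applies to the RSA and ECDSA algorithms listed below; only HMAC uses a single shared key.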

Generate a JWK using an online tool such as mkjwk.org or generate one yourself. The following is a valid JWK example:

{
  "kty": "RSA",
  "e": "AQAB",
  "kid": "O9fpdhrViq2zaaaBEWZITz",
  "use": "sig",
  "alg": "RS256",
  "n": "qSVxcknOm0uCq5vGsOmaorPDzHUubBmZZ4UXj-9do7w9X1uKFXAnqfto4TepSNuYU2bA_-tzSLAGBsR-BqvT6w9SjxakeiyQpVmexxnDw5WZwpWenUAcYrfSPEoNU-0hAQwFYgqZwJQMN8ptxkd0170PFauwACOx4Hfr-9FPGy8NCoIO4MfLXzJ3mJ7xqgIZp3NIOGXz-GIAbCf13ii7kSStpYqN3L_zzpvXUAos1FJ9IPXRV84tIZpFVh2lmRh0h8ImK-vI42dwlD_hOIzayL1Xno2R0T-d5AwTSdnep7g-Fwu8-sj4cCRWq3bd61Zs2QOJ8iustH0vSRMYdP5oYQ"
}
This example is in JSON format. Convert to YAML if configuring the plug-in via YAML. Only configure the public key in the JWT authentication plug-in. Store the private key securely.

The plug-in supports the following signing algorithms.

| Signing algorithm | Supported alg values |
| --- | --- |
| RSASSA-PKCS1-v1_5 with SHA-2 | RS256, RS384, RS512 |
| Elliptic Curve (ECDSA) with SHA-2 | ES256, ES384, ES512 |
| HMAC using SHA-2 | HS256, HS384, HS512 |
Important

For HS256, HS384, or HS512 keys, the key must be Base64 URL-encoded. If you get an Invalid Signature error, verify the key format matches the format used to generate the token.
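A key in that format can be produced with standard tools. This is one possible approach, assuming openssl is available; any cryptographically random 256-bit value works:

```shell
# Generate 32 random bytes, Base64-encode them, then convert to the
# URL-safe alphabet (+/ become -_) and strip the = padding.
openssl rand -base64 32 | tr '+/' '-_' | tr -d '='
```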

Import the JWK into the cluster.

kubectl create configmap jwt --from-literal=jwt_hs256.key='<Your-JWK>'

Step 3: Configure Slurm

Edit slurmConfigs.filesInConfigMap in values.yaml to declare the Generic Resource Scheduling (GRES) configuration, database address, and authentication settings.

slurmConfigs:
  ...
  filesInConfigMap:
    gres.conf: |
      # Used by SlurmCopilot to sync Kubernetes allocated resources to Slurm.
      Name=k8scpu Flags=CountOnly
      Name=k8smemory Flags=CountOnly
    slurmdbd.conf: |
      # Log path. Must match the path used for verification.
      LogFile=/var/log/slurmdbd.log
      # JWT authentication is required when using Slurmrestd.
      AuthAltTypes=auth/jwt
      # Slurmdbd uses this key to authenticate tokens. Mount the key into the Pod as shown below.
      AuthAltParameters=jwt_key=/var/jwt/jwt_hs256.key
      AuthType=auth/munge
      SlurmUser=slurm
      # Set MySQL database account information.
      StoragePass=
      StorageHost=
      StorageType=accounting_storage/mysql
      StorageUser=root
      StoragePort=3306
    slurm.conf: |
      # Sets k8scpu and k8smemory extended resource properties when a node joins the cluster,
      # preventing the node from being set to DOWN state.
      NodeFeaturesPlugins=node_features/k8s_resources
      # Automatically adds k8scpu and k8smemory extended resources when submitting Slurm jobs.
      JobSubmitPlugins=k8s_resource_completion
      # JWT authentication is required when using Slurmrestd.
      AuthAltTypes=auth/jwt
      # Slurmctld uses this key to generate tokens. Mount the key into the Pod as shown below.
      AuthAltParameters=jwt_key=/var/jwt/jwt_hs256.key
      # Used by SlurmCopilot to sync Kubernetes allocated resources to Slurm.
      GresTypes=k8scpu,k8smemory
      # Enter ${slurmClusterName}-slurmdbd.
      AccountingStorageHost=
      AccountingStoragePort=6819
      AccountingStorageType=accounting_storage/slurmdbd
      # MySQL database settings for the JobComp plug-in.
      JobCompHost=
      JobCompLoc=/var/log/slurm/slurm_jobcomp.log
      JobCompPass=
      JobCompPort=3306
      JobCompType=jobcomp/mysql
      JobCompUser=root
      # High availability configuration.
      SlurmctldHost=

Configure the Slurmctld, Slurmdbd, and Slurmrestd Pods to mount the JWT key.

...
headNodeConfig:
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
  # Mount the JWT key into Slurm. The mount path must match the path in the configuration above.
  volumes:
  - configMap:
      defaultMode: 444
      name: jwt
    name: config-jwt
  volumeMounts:
  - mountPath: /var/jwt
    name: config-jwt
slurmdbdConfigs:
  nodeSelector: {}
  tolerations: []
  affinity: {}
  resources: {}
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
  imagePullSecrets: []
  # if .slurmConfigs.createConfigsByConfigMap is true, slurmConfPath, volume, and volumeMounts
  # are automatically set. This also applies to mungeConfigs.createConfigsBySecret.
  volumes:
  - configMap:
      defaultMode: 444
      name: jwt
    name: config-jwt
  volumeMounts:
  - mountPath: /var/jwt
    name: config-jwt

slurmrestdConfigs:
  nodeSelector: {}
  tolerations: []
  affinity: {}
  resources: {}
  image: "registry-cn-hangzhou.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59"
  imagePullSecrets: []
  volumes:
  - configMap:
      defaultMode: 444
      name: jwt
    name: config-jwt
  volumeMounts:
  - mountPath: /var/jwt
    name: config-jwt

Step 4: Install the chart

Install the chart. If ack-slurm-cluster is already installed, run helm upgrade instead. After an upgrade, manually delete the existing Pods and the Slurmctld StatefulSet to apply configuration changes.

cd ..
helm install my-slurm-cluster ack-slurm-cluster

Verify installation

  1. Confirm the chart is deployed.

    helm list

    Expected output:

    NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
    ack-slurm-cluster       default         1               2024-07-19 14:47:58.126357 +0800 CST    deployed        ack-slurm-cluster-2.0.0 2.0.0
  2. Confirm all Pods are running.

    kubectl get pod

    Expected output shows one worker Pod and three control plane Pods running.

    NAME                          READY   STATUS    RESTARTS   AGE
    slurm-test-slurmctld-dlncz    1/1     Running   0          3h49m
    slurm-test-slurmdbd-8f75r     1/1     Running   0          3h49m
    slurm-test-slurmrestd-mjdzt   1/1     Running   0          3h49m
    slurm-test-worker-cpu-0       1/1     Running   0          166m
  3. Confirm Slurmdbd started correctly.

    kubectl exec slurm-test-slurmdbd-8f75r -- cat /var/log/slurmdbd.log | head

    Expected output:

    [2024-07-22T19:52:55.727] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 8.0.34
    [2024-07-22T19:52:55.737] error: Database settings not recommended values: innodb_lock_wait_timeout
    [2024-07-22T19:52:56.089] slurmdbd version 23.02.7 started

Build a generic Slurm image

If you need additional dependencies in Slurm, build a custom image. The registry-cn-beijing.ack.aliyuncs.com/acs/slurm:23.06-1.6-aliyun-49259f59 image includes all packages required by the examples in this document.

Include the following plug-ins:

  • kubectl and node_features/k8s_resources (required)

  • job_submit/k8s_resource_completion (optional, required for auto-filling GRES resources)

About the `k8s_resources` plug-in

By default, Slurmctld checks a node's GRES resource usage when its Slurmd sends a _slurm_rpc_node_registration request. If Slurmctld detects any GRES change, it marks the node as INVAL, which prevents new tasks from being scheduled and requires the node to rejoin the cluster. The k8s_resources plug-in prevents this: when a node's ActivateFeature is updated, the plug-in sets the k8scpu and k8smemory resources to 0 and sets the node_feature flag for both to true, allowing the node to bypass the GRES check.

The plug-in and Dockerfile source code are available in the Alibaba Cloud open source repository.

Example Dockerfile

FROM nvidia/cuda:11.4.3-cudnn8-devel-ubuntu20.04 as exporterBuilder
ENV TZ=Asia/Shanghai
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update && apt install -y golang git munge libhttp-parser-dev libjson-c-dev libyaml-dev libjwt-dev libgtk2.0-dev libreadline-dev libpmix-dev libmysqlclient-dev libhwloc-dev openmpi-bin openmpi-common libopenmpi-dev rpm libmunge-dev libmunge2 libpam-dev perl python3 systemd lua5.3 libnvidia-ml-dev libhdf5-dev
# Download the source code before building the image
COPY ./slurm-23.02.7.tar.bz2 ./slurm-23.02.7.tar.bz2
RUN tar -xaf slurm-23.02.7.tar.bz2
COPY ./node_features/k8s_resources ./slurm-23.02.7/src/plugins/node_features/k8s_resources
RUN sed -i '/"src\/plugins\/node_features\/Makefile") CONFIG_FILES="\$CONFIG_FILES src\/plugins\/node_features\/Makefile" ;;/ a "    src/plugins/node_features/k8s_resources/Makefile") CONFIG_FILES="\$CONFIG_FILES src/plugins/node_features/k8s_resources/Makefile" ;;' ./slurm-23.02.7/configure
RUN awk '/^ac_config_files="\$ac_config_files/ && !found { print; print "ac_config_files=\"$ac_config_files src/plugins/node_features/k8s_resources/Makefile\""; found=1; next } { print }' ./slurm-23.02.7/configure > ./slurm-23.02.7/configure.new && mv ./slurm-23.02.7/configure.new ./slurm-23.02.7/configure && chmod +x ./slurm-23.02.7/configure
RUN cat ./slurm-23.02.7/configure
RUN sed -i '/^SUBDIRS =/ s/$/ k8s_resources/' ./slurm-23.02.7/src/plugins/node_features/Makefile && \
sed -i '/^SUBDIRS =/ s/$/ k8s_resources/' ./slurm-23.02.7/src/plugins/node_features/Makefile.in && \
sed -i '/^SUBDIRS =/ s/$/ k8s_resources/' ./slurm-23.02.7/src/plugins/node_features/Makefile.am
RUN cd slurm-23.02.7 && ./configure --prefix=/usr/ --sysconfdir=/etc/slurm && make

FROM nvidia/cuda:11.4.3-cudnn8-runtime-ubuntu20.04
ENV TZ=Asia/Shanghai
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt update
RUN apt install -y munge libhttp-parser-dev libjson-c-dev libyaml-dev libjwt-dev libgtk2.0-dev libreadline-dev libpmix-dev libmysqlclient-dev libhwloc-dev openmpi-bin openmpi-common libopenmpi-dev rpm libmunge-dev libmunge2 libpam-dev perl python3 systemd lua5.3 inotify-tools openssh-server pip libnvidia-ml-dev libhdf5-dev
COPY --from=exporterBuilder /slurm-23.02.7 /slurm-23.02.7
RUN cd slurm-23.02.7 && make install && cd ../ && rm -rf /slurm-23.02.7
RUN apt remove libnvidia-ml-dev libnvidia-compute-545 -y; apt autoremove -y ; ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
COPY ./sh ./
RUN mkdir /etc/slurm
RUN chmod +x create-users.sh munge-inisitalization.sh slurm-initialization.sh slurm-suspend.sh slurm-resume.sh slurmd slurmctld slurmdbd slurmrestd
RUN touch /var/log/slurm-resume.log /var/log/slurm-suspend.log ; chmod 777 /var/log/slurm-resume.log /var/log/slurm-suspend.log
RUN mv slurmd /etc/init.d/slurmd && mv slurmdbd /etc/init.d/slurmdbd && mv slurmctld /etc/init.d/slurmctld
RUN ./create-users.sh && ./munge-inisitalization.sh && ./slurm-initialization.sh
RUN rm ./create-users.sh ./munge-inisitalization.sh ./slurm-initialization.sh
ENV NVIDIA_VISIBLE_DEVICES=
RUN apt-get update && apt-get upgrade -y && rm -rf /var/cache/apt/

Verify colocated scheduling

  1. Check the GenericNode resource to view resource allocation across both Slurm and Kubernetes.

    kubectl get genericnode

    Expected output:

    NAME                    CLUSTERNAME   ALIAS                     TYPE    ALLOCATEDRESOURCES
    cn-hongkong.10.1.0.19                 slurm-test-worker-cpu-0   Slurm   [{"allocated":{"cpu":"0","memory":"0"},"type":"Slurm"},{"allocated":{"cpu":"1735m","memory":"2393Mi"},"type":"Kubernetes"}]
  2. Submit a Slurm job and scale a Kubernetes Deployment to observe how allocations are reflected on the GenericNode.

    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl exec slurm-test-slurmctld-dlncz -- nohup srun --cpus-per-task=3 --mem=4000 --gres=k8scpu:3,k8smemory:4000 sleep inf &
    [1] 4132674
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl scale deployment nginx-deployment-basic --replicas 2
    deployment.apps/nginx-deployment-basic scaled
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl get genericnode
    NAME                    CLUSTERNAME   ALIAS                     TYPE    ALLOCATEDRESOURCES
    cn-hongkong.10.1.0.19                 slurm-test-worker-cpu-0   Slurm   [{"allocated":{"cpu":"3","memory":"4000Mi"},"type":"Slurm"},{"allocated":{"cpu":"2735m","memory":"3417Mi"},"type":"Kubernetes"}]
  3. Submit a second Slurm job. Because all resources are now allocated, the job enters Pending (PD) state.

    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl exec slurm-test-slurmctld-dlncz -- nohup srun --cpus-per-task=3 --mem=4000 sleep inf &
    [2] 4133454
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# srun: job 2 queued and waiting for resources
    
    [root@iZj6c1wf3c25dbynbna3qgZ ~]# kubectl exec slurm-test-slurmctld-dlncz -- squeue
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         2     debug    sleep     root PD       0:00      1 (Resources)
         1     debug    sleep     root  R       2:34      1 slurm-test-worker-cpu-0
The srun command in step 3 omits the --gres flag because the job_submit/k8s_resource_completion plug-in, loaded at cluster startup, automatically calculates and fills in the GRES amounts from the CPU and memory requests. If this plug-in is not enabled, add --gres=k8scpu:3,k8smemory:4000 manually.

Slurm job script examples

When submitting Slurm jobs without the auto-fill plug-in, calculate the required GRES manually.

`srun` and `sbatch` parameters

| Parameter | Description |
| --- | --- |
| --tres-per-task | Specifies the trackable resources (TRES) required per task. |
| --gres | Specifies the generic resources (GRES) required for the job. |
| --nodes / -N | Number of nodes to allocate. |
| --ntasks-per-node / --tasks-per-node | Number of tasks per node. |
| --cpus-per-task | Number of vCPUs per task. |
| --time / -t | Maximum run time. |
| --job-name / -J | Job name. |

Slurm resource calculation example

Calculating GRES resources

Use the following formulas for a single node:

  • Total vCPUs per node = (Tasks per node) x (CPUs per task)

  • Total memory per node = (Tasks per node) x (Memory per task)

Example: With 2 nodes, 4 tasks per node, and 2 CPUs per task: 4 x 2 = 8 vCPUs per node.
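The formulas above can be captured in a small helper. The function name and the 1000 MB-per-task figure are illustrative, not part of Slurm:

```python
def gres_for_node(tasks_per_node: int, cpus_per_task: int, mem_per_task_mb: int) -> str:
    """Compute per-node totals and the matching --gres string for k8scpu/k8smemory."""
    total_cpus = tasks_per_node * cpus_per_task      # total vCPUs per node
    total_mem = tasks_per_node * mem_per_task_mb     # total memory per node (MB)
    return f"--gres=k8scpu:{total_cpus},k8smemory:{total_mem}"

# 4 tasks per node x 2 CPUs per task = 8 vCPUs per node
print(gres_for_node(4, 2, 1000))  # → --gres=k8scpu:8,k8smemory:4000
```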

Using the auto-fill plug-in

The job_submit/k8s_resource_completion plug-in automatically populates --gres from CPU and memory requests. When using this plug-in:

  • Specify task count using -n or --ntasks (required)

  • Request GPU resources using --gpus-per-task, not --gpus or --gpus-per-socket

A code sample is available for compiling the plug-in.

Example job script

#!/bin/bash
#SBATCH --job-name=test_job                   # Job name
#SBATCH --nodes=2                             # Number of nodes required
#SBATCH --ntasks-per-node=4                   # Number of tasks per node
#SBATCH --cpus-per-task=2                     # Number of vCPUs per task
#SBATCH --time=01:00:00                       # Maximum run time
#SBATCH --output=job_output_%j.txt            # Standard output file
#SBATCH --error=job_error_%j.txt              # Standard error file

srun my_program

Alternatively, pass parameters on the command line:

sbatch --nodes=2 --ntasks-per-node=4 --cpus-per-task=2 --time=01:00:00 --job-name=test_job my_job_script.sh

Extend colocated scheduling to non-containerized clusters

SlurmCopilot communicates with Slurm via the Slurm REST API (OpenAPI), so it also supports Slurm clusters that are not running inside containers.

In a non-containerized scenario, create the following Kubernetes resources manually in addition to the JWT token described earlier.

  1. Create a Service for each Slurm cluster. SlurmCopilot retrieves Service information from the cluster and sends OpenAPI requests to ${.metadata.name}.${.metadata.namespace}.svc.cluster.local:${.spec.ports[0].port}. The Service name must be ${slurmCluster}-slurmrestd, where ${slurmCluster} matches the value specified in the GenericNode.

    apiVersion: v1
    kind: Service
    metadata:
      name: slurm-slurmrestd
      namespace: default
    spec:
      ports:
      - name: slurmrestd
        port: 8080
        protocol: TCP
        targetPort: 8080
  2. Create a DNS record for each Slurm cluster that resolves ${.metadata.name}.${.metadata.namespace}.svc.cluster.local:${.spec.ports[0].port} to the address of the Slurmrestd process.

  3. Create GenericNode resources for Slurm nodes. GenericNode provides SlurmCopilot with an alias mapping to a node in the Slurm cluster. The GenericNode name must match the Kubernetes node name, .spec.alias must match the node name in Slurm, and the labels kai.alibabacloud.com/cluster-name and kai.alibabacloud.com/cluster-namespace must match the Service.

    apiVersion: kai.alibabacloud.com/v1alpha1
    kind: GenericNode
    metadata:
      labels:
        kai.alibabacloud.com/cluster-name: slurm-test
        kai.alibabacloud.com/cluster-namespace: default
      name: cn-hongkong.10.1.0.19
    spec:
      alias: slurm-test-worker-cpu-0
      type: Slurm
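For the DNS record in step 2, one common approach (an assumption, not the only option) is to back the selector-less Service with a manual Endpoints object that points at the Slurmrestd host, so the in-cluster DNS name resolves without a separate record. The IP address below is an example value:

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  # Must match the Service name and namespace from step 1.
  name: slurm-slurmrestd
  namespace: default
subsets:
- addresses:
  - ip: 192.168.0.10   # Address of the non-containerized Slurmrestd process
  ports:
  - name: slurmrestd
    port: 8080
    protocol: TCP
```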

Summary

With colocated scheduling, you can use Slurm to schedule HPC jobs and Kubernetes to orchestrate containerized workloads on the same cluster. This solution lets you leverage the Kubernetes ecosystem and services, including Helm charts, CI/CD pipelines, and monitoring tools. You can use a unified platform to submit, schedule, and manage both HPC jobs and containerized workloads, consolidating these workloads into a single cluster to use hardware resources more efficiently.