Queues are essential configuration items in task scheduling: they enable effective resource management and allocation, optimized job scheduling, improved system utilization, and support for diverse job requirements. Proper queue configuration ensures that high-priority tasks receive the resources they need first while keeping overall resource utilization high. This topic describes how to implement appropriate queue configuration strategies in a Slurm environment so that as many tasks as possible are processed when jobs are submitted or job states change, achieving optimal performance.
1. Core Slurm features
Resource allocation: Allocates CPU/memory/GPU resources as needed, avoiding conflicts and waste.
Job scheduling: Dynamically schedules job queues, executes according to priority, and monitors task status throughout.
Priority control: High-priority queue tasks are scheduled first.
Monitoring tools: Monitors resource usage and job status through scontrol/sacct.
Customization support: Multiple queues adapt to different requirements (such as CPU-intensive, memory-intensive, or GPU tasks).
System optimization: Improves resource utilization, reduces idle time, and increases computational efficiency.
This topic is based on Slurm version 24.05. Other versions may have differences.
2. Slurm queue types
Slurm tasks are executed in priority order. If a partition has unschedulable tasks, the tasks behind them are held back. High-priority tasks can preempt resources from low-priority tasks; preempted tasks can be canceled, requeued, or suspended. If backfill scheduling is enabled (the default), the system periodically, based on the bf_interval cycle, calculates whether low-priority tasks can run without delaying high-priority tasks. This requires occupying entire machines and may trigger machine-wide preemption. Scheduling behavior is configured through SchedulerType (the sched/backfill plugin by default) and the detailed SchedulerParameters options in slurm.conf. For the specific parameters, see the official documentation.
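As an illustration, a minimal scheduler stanza in slurm.conf might look like the following; the bf_interval value shown is the documented default and is only an example.
# Scheduler plugin selection in slurm.conf
SchedulerType=sched/backfill
# bf_interval controls how often (in seconds) the backfill scheduler runs; 30 is the default
SchedulerParameters=bf_interval=30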
During scheduling, all tasks are integrated into a single list, and their execution order is determined through different priority algorithms. Slurm supports the following two queue types:
First in, first out (FIFO) queue, where tasks are sorted based on the order of their submission time.
Multifactor queue, a more advanced queuing mechanism that is enabled by default and calculates job priority from multiple weighted factors.
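Before changing anything, you can check which priority plugin the cluster is currently using, for example:
# Show the active priority plugin (priority/basic = FIFO, priority/multifactor = multifactor)
scontrol show config | grep PriorityType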
2.1 First in, first out queue
With the priority/basic plugin, Slurm sorts jobs in FIFO order. Priority scheduling is configured in slurm.conf; you can switch the priority scheme by modifying the PriorityType parameter.
# 1. Find and edit the slurm.conf file
sudo nano /etc/slurm-llnl/slurm.conf
# 2. Set the priority plugin to priority/basic to schedule jobs in first-in, first-out order
PriorityType=priority/basic
You are advised to back up the original slurm.conf file before making changes, so you can restore it if problems occur. Additionally, for any major change in a production environment, you are advised to test it thoroughly in a test environment first.
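Depending on your deployment, a change to PriorityType generally requires restarting the controller; the systemd service name below assumes a standard installation.
# Restart the controller so the new priority plugin takes effect
sudo systemctl restart slurmctld
# Verify the active plugin
scontrol show config | grep PriorityType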
2.2 Multifactor job queue
Slurm multifactor scheduling determines task priority through a weighted combination of the following factors: how long the job has been waiting (age), fair-share usage (allocated vs. consumed resources), job size, user-set nice values, partition, TRES (Trackable Resources) types, and Quality of Service (QoS). For weight allocation and the detailed calculation logic, see the multifactor priority configuration instructions.
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (priority_job_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
Slurm job priority is calculated using the following weighted factors:
Base value: site_factor (custom score).
Job waiting time weight: The longer the job has waited, the higher the score (PriorityWeightAge × age_factor).
Association weight: Resource usage fairness of the user group/account (PriorityWeightAssoc × assoc_factor).
Fair-share weight: Adjusts the score based on the share of resources already consumed (PriorityWeightFairshare × fair-share_factor).
Job size weight: Priority of small vs. large jobs (PriorityWeightJobSize × job_size_factor).
Partition weight: Partition priority (PriorityWeightPartition × priority_job_factor).
QoS weight: Quality of service level (PriorityWeightQOS × QOS_factor).
Resource weight: Weighting by resource type (CPU, GPU, and so on).
Nice downgrade: - nice_factor (a higher nice value means a lower priority).
You can achieve fair and efficient task scheduling by dynamically adjusting weight parameters.
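To see how the configured weights play out on a live cluster, the sprio tool can display both the weights and the per-factor breakdown of pending jobs, for example:
# Print the configured priority weights
sprio -w
# Show the per-factor priority components of pending jobs
sprio -l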
Typical application examples:
Quickly complete small jobs: Set PriorityFavorSmall=YES together with PriorityWeightJobSize, lowering the relative priority of large jobs so that small jobs are scheduled faster.
Guarantee critical users/groups: Ensure jobs from important teams run first through PriorityWeightAssoc and the fair-share factor.
Resource starvation protection: Configure PriorityWeightFairshare=2000, significantly increasing the priority of jobs from users with low resource usage.
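A minimal sketch of how these adjustments could appear in slurm.conf is shown below; the weight values are illustrative assumptions, not recommendations.
# Use the multifactor priority plugin
PriorityType=priority/multifactor
# Favor small jobs when computing the job-size factor
PriorityFavorSmall=YES
PriorityWeightJobSize=1000
# Give association and fair-share usage a strong influence (example values)
PriorityWeightAssoc=1000
PriorityWeightFairshare=2000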
Examples: Setting up multifactor job priority
Customizing partition priority
Slurm can divide machines by organizational structure, limiting tasks to run only in their assigned resource pools. Tasks are classified as urgent (preempting low-priority tasks) or non-urgent (executing quickly but not blocking urgent tasks). When tasks approach deadlines and need to be marked as urgent, Slurm cannot automatically adjust this. Manual migration to high-priority queues is required.
You can create two partitions that point to the same node pool (one for urgent and one for non-urgent tasks) and dynamically adjust priorities by switching the partition a task belongs to. This improves resource utilization, supports flexible scheduling of dynamic workloads, and reduces the management burden on operations personnel in complex job environments. You can refer to the following steps for setup.
First, enable the preemption function in the cluster and set the preemption type to preempt/partition_prio.
# 1. Find and edit the slurm.conf file
sudo nano /etc/slurm-llnl/slurm.conf
# 2. Enable preemption and specify a preemption strategy based on partition priority
PreemptType=preempt/partition_prio
# 3. Behavior when a job is preempted. CANCEL terminates the job; SUSPEND pauses it until resources are available again. Choose based on your requirements.
PreemptMode=SUSPEND # or CANCEL
Important: You are advised to back up the original slurm.conf file before making changes, so you can restore it if problems occur. Additionally, for any major change in a production environment, you are advised to test it thoroughly in a test environment first.
You can add a high-priority partition to the cluster by using the following command, or by adding a new partition record in slurm.conf.
scontrol create partition=hipri PriorityTier=2 nodes=ALL
After that, you can achieve job preemption by submitting jobs to the hipri partition or by changing a job to the high-priority partition. The following is an example of job submission.
# 1. Add a high-priority partition in the Slurm cluster.
root@slurm-test-0:/# scontrol create partition=hipri PriorityTier=2 nodes=ALL
# 2. View current cluster partitions.
root@slurm-test-0:/# scontrol show partition
# Result.
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=slurm-test-worker-cpu-0
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=4 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=4,mem=6401M,node=1,billing=4
   ResumeTimeout=GLOBAL SuspendTimeout=GLOBAL SuspendTime=GLOBAL PowerDownOnIdle=NO
PartitionName=hipri
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=slurm-test-worker-cpu-0
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=0 TotalNodes=0 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=(null)
   ResumeTimeout=GLOBAL SuspendTimeout=GLOBAL SuspendTime=GLOBAL PowerDownOnIdle=NO
# Submit 4 consecutive tasks.
root@slurm-test-0:/# srun sleep 1d &
root@slurm-test-0:/# srun sleep 1d &
root@slurm-test-0:/# srun sleep 1d &
root@slurm-test-0:/# srun sleep 1d &
# Check the current cluster status.
root@slurm-test-0:/# squeue
# The cluster currently has 4 running tasks.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 debug sleep root R 0:03 1 slurm-test-worker-cpu-0
2 debug sleep root R 0:04 1 slurm-test-worker-cpu-0
3 debug sleep root R 0:04 1 slurm-test-worker-cpu-0
1 debug sleep root R 0:05 1 slurm-test-worker-cpu-0
# Submit a task to the high-priority partition.
root@slurm-test-0:/# srun --partition=hipri sleep 1d &
root@slurm-test-0:/# squeue
# The ST (status) of task 4 has changed from R to S, and the status of task 5 has changed to R, indicating that task 4 has been suspended.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 debug sleep root R 1:06 1 slurm-test-worker-cpu-0
3 debug sleep root R 1:06 1 slurm-test-worker-cpu-0
1 debug sleep root R 1:07 1 slurm-test-worker-cpu-0
4 debug sleep root S 0:59 1 slurm-test-worker-cpu-0
5 hipri sleep root R 0:06 1 slurm-test-worker-cpu-0
# Submit a low-priority task.
root@slurm-test-0:/# srun sleep 1d &
# Update the task to high priority.
root@slurm-test-0:/# scontrol update jobid=6 partition=hipri
root@slurm-test-0:/# squeue
# Tasks 1 and 2 have been suspended. Tasks in the same partition share execution time, so tasks 1, 2, 3, and 4 execute in a time-sharing manner.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 debug sleep root R 3:21 1 slurm-test-worker-cpu-0
3 debug sleep root R 3:33 1 slurm-test-worker-cpu-0
2 debug sleep root S 3:41 1 slurm-test-worker-cpu-0
1 debug sleep root S 4:01 1 slurm-test-worker-cpu-0
6 hipri sleep root R 0:03 1 slurm-test-worker-cpu-0
5 hipri sleep root R 3:33 1 slurm-test-worker-cpu-0
Customizing QoS (quality of service) priority
To use QoS-based preemption, Slurm needs both a high-priority and a low-priority QoS (a normal QoS with priority 0 exists by default); the high-priority QoS is created with sacctmgr. Preemption must also be enabled in slurm.conf (for example, PreemptType=preempt/qos). Note: if PreemptType=SUSPEND,GANG semantics are used, preempted low-priority tasks coexist with the high-priority tasks in a time-sharing manner rather than being completely interrupted. Configuring QoS requires the sacctmgr tool. Below is a common command for creating a high-priority QoS.
sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10
preempt=normal: Specifies that the high QoS can preempt tasks running with the normal QoS.
preemptmode=gang,suspend:
Gang mode: A preempting task must fully acquire its resources before it starts executing.
Suspend mode: Preempted tasks are suspended rather than terminated, releasing resources for the preemptor to use, and resume execution when the preempting task finishes.
priority=10: The default priority base score for high QoS tasks is 10 (a higher value means a higher priority).
Enabling preemption in slurm.conf involves the PreemptType and PreemptMode parameters. Additionally, when configuring a partition, you need to add OverSubscribe=FORCE:1 at the end of its configuration line.
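A sketch of those slurm.conf settings, assuming QoS-based preemption with suspend/gang semantics and a partition named debug, might look as follows:
# Use QoS as the basis for preemption decisions
PreemptType=preempt/qos
# Preempted jobs are suspended and gang-scheduled rather than cancelled
PreemptMode=SUSPEND,GANG
# The partition must allow oversubscription so a suspended job and its preemptor can share nodes
PartitionName=debug Nodes=ALL Default=YES OverSubscribe=FORCE:1 State=UP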
Here is an example of using different QoS for task preemption management:
# View the current QoS.
root@slurm-test-0:/# sacctmgr show qos format=name
Name
----------
normal
# Create high-priority QoS.
root@slurm-test-0:/# sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10
Adding QOS(s)
high
Settings
Description = high
Preempt = normal
PreemptMode = GANG,SUSPEND
Priority = 10
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
# View current QoS.
root@slurm-test-0:/# sacctmgr show qos format=name,priority,preempt
Name Priority Preempt
---------- ---------- ----------
normal 0
high 10 normal
# The content of test.sh is as follows.
# #!/bin/bash
# srun sleep 10m
# Submit five consecutive tasks.
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 4
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 5
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 6
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 7
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 8
root@slurm-test-0:/# squeue # Task 8 is in Pending status
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 debug test.sh root PD 0:00 1 (Resources)
7 debug test.sh root R 0:03 1 slurm-test-worker-cpu-0
5 debug test.sh root R 0:15 1 slurm-test-worker-cpu-0
6 debug test.sh root R 0:15 1 slurm-test-worker-cpu-0
4 debug test.sh root R 0:18 1 slurm-test-worker-cpu-0
root@slurm-test-0:/# sbatch --qos=high test.sh # Submit a task to high-priority QOS
Submitted batch job 9
root@slurm-test-0:/# squeue # High-priority QoS begins execution, sharing resources with other tasks in a time-sharing manner
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 debug test.sh root PD 0:00 1 (Resources)
7 debug test.sh root R 0:26 1 slurm-test-worker-cpu-0
5 debug test.sh root R 0:38 1 slurm-test-worker-cpu-0
6 debug test.sh root R 0:38 1 slurm-test-worker-cpu-0
4 debug test.sh root R 0:41 1 slurm-test-worker-cpu-0
9 debug test.sh root S 0:00 1 slurm-test-worker-cpu-0
Customizing job size priority
Job size priority is determined by PriorityWeightJobSize, used here together with PriorityWeightAge.
Job Size Factor
Non-urgent tasks need to use cluster resources efficiently without exceeding their deadlines. When task execution time is unknown, backfill scheduling cannot work effectively. In that case, you can adopt the following strategies:
Prioritize scheduling small tasks to reduce head-of-line blocking.
Increase the priority of large tasks based on their queue time to prevent starvation.
Allow large tasks approaching their deadlines to preempt resources from small tasks (the small tasks are suspended until the large tasks complete).
By implementing these measures, you can maximize cluster resource utilization while ensuring critical tasks complete on time and balancing different types of tasks.
The following configuration needs to be added in slurm.conf (only the relevant settings are shown here; other configurations in slurm.conf are not affected):
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0
Job Waiting Time Factor
After setting up job size priority, submission waiting time becomes the second factor. Slurm calculates the job size score based on the ratio of requested resources to total cluster resources. If PriorityFavorSmall=YES is enabled, the score formula is: score = (1 - resource ratio) × PriorityWeightJobSize. For example, when a cluster has 4 CPU cores available:
A task requesting 1 core scores (1 - 1/4) × PriorityWeightJobSize = 0.75 × 1000 = 750 with the configuration above.
A task requesting all 4 cores scores 0 (it fully occupies the resources).
AgeFactor priority calculation:
Tasks exceeding PriorityMaxAge directly receive the full PriorityWeightAge points.
Other tasks score in proportion to how long they have been queued since submission. For example, with PriorityWeightAge=1000 and PriorityMaxAge=1-0 (one day), each minute of waiting adds approximately 0.69 points, accumulating to the full 1000 points after 24 hours.
Backfill scheduling recommendation: If task execution time can be estimated, you are advised to keep the default backfill scheduling enabled (or manually configure SchedulerType=sched/backfill), allowing the scheduler to fit small tasks into idle periods before large tasks start. Combined with the large-task priority mechanism and deadline-approaching preemption described above, this balances resource utilization and fairness.
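For example, submitting with an explicit wall-clock estimate makes a job eligible for backfill; test.sh is the same example script used earlier.
# A job with a time limit can be backfilled into idle windows ahead of larger jobs
sbatch --time=00:30:00 test.sh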