Queues are essential configuration items in task scheduling: they enable effective resource management and allocation, optimized job scheduling, improved system utilization, and support for diverse job requirements. Proper queue configuration ensures that high-priority tasks receive the resources they need first while keeping overall resource utilization high. This topic describes how to implement appropriate queue configuration strategies in a Slurm environment so that as many tasks as possible are processed when jobs are submitted or job states change, achieving optimal performance.
1. Core Slurm features
Resource allocation: Allocates CPU/memory/GPU resources as needed, avoiding conflicts and waste.
Job scheduling: Dynamically schedules job queues, executes according to priority, and monitors task status throughout.
Priority control: High-priority queue tasks are scheduled first.
Monitoring tools: Monitors resource usage and job status through scontrol/sacct.
Customization support: Multiple queues adapt to different requirements (such as CPU-intensive, memory-intensive, or GPU tasks).
System optimization: Improves resource utilization, reduces idle time, and increases computational efficiency.
This topic is based on Slurm version 24.05. Other versions may have differences.
2. Slurm queue types
Slurm tasks are executed in priority order. If a partition has unschedulable tasks, the tasks behind them are held back. High-priority tasks can preempt resources from low-priority tasks; preempted tasks can be canceled, requeued, or suspended. If backfill scheduling is enabled (the default), the system periodically, based on the bf_interval cycle, calculates whether low-priority tasks can run without delaying high-priority tasks. This requires occupying entire machines and may trigger machine-wide preemption. Scheduling behavior is configured through SchedulerType (the sched/backfill plugin by default) and the detailed SchedulerParameters options in slurm.conf. For the specific parameters, see the official documentation.
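As an illustration, a minimal scheduler stanza in slurm.conf might look like the following; the bf_interval value shown is the documented default and is only an example.
# Scheduler plugin selection in slurm.conf
SchedulerType=sched/backfill
# bf_interval controls how often (in seconds) the backfill scheduler runs; 30 is the default
SchedulerParameters=bf_interval=30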
During scheduling, all tasks are integrated into a single list, and their execution order is determined through different priority algorithms. Slurm supports the following two queue types:
First in, first out (FIFO) queue, where tasks are sorted based on the order of their submission time.
Multifactor queue, a more advanced queuing mechanism that is enabled by default and calculates job priority from multiple weighted factors.
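Before changing anything, you can check which priority plugin the cluster is currently using, for example:
# Show the active priority plugin (priority/basic = FIFO, priority/multifactor = multifactor)
scontrol show config | grep PriorityType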
2.1 First in, first out queue
With the priority/basic plugin, Slurm sorts jobs in FIFO order. Priority scheduling is configured in slurm.conf; you can switch the priority scheme by modifying the PriorityType parameter.
# 1. Find and edit the slurm.conf file
sudo nano /etc/slurm-llnl/slurm.conf
# 2. Set the priority plugin to priority/basic to schedule jobs in first-in, first-out order
PriorityType=priority/basic
You are advised to back up the original slurm.conf file before making changes, so you can restore it if problems occur. Additionally, for any major change in a production environment, you are advised to test it thoroughly in a test environment first.
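Depending on your deployment, a change to PriorityType generally requires restarting the controller; the systemd service name below assumes a standard installation.
# Restart the controller so the new priority plugin takes effect
sudo systemctl restart slurmctld
# Verify the active plugin
scontrol show config | grep PriorityType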
2.2 Multifactor job queue
Slurm multifactor scheduling determines task priority through a weighted combination of the following factors: how long the job has been waiting (age), fair-share usage (allocated vs. consumed resources), job size, user-set nice values, partition, TRES (Trackable Resources) types, and Quality of Service (QoS). For weight allocation and the detailed calculation logic, see the multifactor priority configuration instructions.
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (priority_job_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
Slurm job priority is calculated using the following weighted factors:
Base value: site_factor (custom score).
Job waiting time weight: The longer the job has waited, the higher the score (PriorityWeightAge × age_factor).
Association weight: Resource usage fairness of the user group/account (PriorityWeightAssoc × assoc_factor).
Fair-share weight: Adjusts the score based on the share of resources already consumed (PriorityWeightFairshare × fair-share_factor).
Job size weight: Priority of small vs. large jobs (PriorityWeightJobSize × job_size_factor).
Partition weight: Partition priority (PriorityWeightPartition × priority_job_factor).
QoS weight: Quality of service level (PriorityWeightQOS × QOS_factor).
Resource weight: Weighting by resource type (CPU, GPU, and so on).
Nice downgrade: - nice_factor (a higher nice value means a lower priority).
You can achieve fair and efficient task scheduling by dynamically adjusting weight parameters.
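To see how the configured weights play out on a live cluster, the sprio tool can display both the weights and the per-factor breakdown of pending jobs, for example:
# Print the configured priority weights
sprio -w
# Show the per-factor priority components of pending jobs
sprio -l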
Typical application examples:
Quickly complete small jobs: Set PriorityFavorSmall=YES together with PriorityWeightJobSize, lowering the relative priority of large jobs so that small jobs are scheduled faster.
Guarantee critical users/groups: Ensure jobs from important teams run first through PriorityWeightAssoc and the fair-share factor.
Resource starvation protection: Configure PriorityWeightFairshare=2000, significantly increasing the priority of jobs from users with low resource usage.
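A minimal sketch of how these adjustments could appear in slurm.conf is shown below; the weight values are illustrative assumptions, not recommendations.
# Use the multifactor priority plugin
PriorityType=priority/multifactor
# Favor small jobs when computing the job-size factor
PriorityFavorSmall=YES
PriorityWeightJobSize=1000
# Give association and fair-share usage a strong influence (example values)
PriorityWeightAssoc=1000
PriorityWeightFairshare=2000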
Examples: Setting up multifactor job priority
Customizing partition priority
Slurm can divide machines by organizational structure, limiting tasks to run only in their assigned resource pools. Tasks are classified as urgent (preempting low-priority tasks) or non-urgent (executing quickly but not blocking urgent tasks). When tasks approach deadlines and need to be marked as urgent, Slurm cannot automatically adjust this. Manual migration to high-priority queues is required.
You can create two partitions that point to the same node pool (one for urgent and one for non-urgent tasks) and dynamically adjust priorities by switching the partition a task belongs to. This improves resource utilization, supports flexible scheduling of dynamic workloads, and reduces the management burden on operations personnel in complex job environments. You can refer to the following steps for setup.
First, enable the preemption function in the cluster and set the preemption type to preempt/partition_prio.
# 1. Find and edit the slurm.conf file
sudo nano /etc/slurm-llnl/slurm.conf
# 2. Enable preemption and specify a preemption strategy based on partition priority
PreemptType=preempt/partition_prio
# 3. Behavior when a job is preempted. CANCEL terminates the job; SUSPEND pauses it until resources are available again. Choose based on your requirements.
PreemptMode=SUSPEND # or CANCEL
Important: You are advised to back up the original slurm.conf file before making changes, so you can restore it if problems occur. Additionally, for any major change in a production environment, you are advised to test it thoroughly in a test environment first.
You can add a high-priority partition to the cluster by using the following command, or by adding a new partition record in slurm.conf.
scontrol create partition=hipri PriorityTier=2 nodes=ALL
After that, you can achieve job preemption by submitting jobs to the hipri partition or by changing a job to the high-priority partition. The following is an example of job submission.
# 1. Add a high-priority partition in the Slurm cluster.
root@slurm-test-0:/# scontrol create partition=hipri PriorityTier=2 nodes=ALL
# 2. View current cluster partitions.
root@slurm-test-0:/# scontrol show partition
# Result.
PartitionName=debug
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=slurm-test-worker-cpu-0
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=4 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=4,mem=6401M,node=1,billing=4
   ResumeTimeout=GLOBAL SuspendTimeout=GLOBAL SuspendTime=GLOBAL PowerDownOnIdle=NO
PartitionName=hipri
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=slurm-test-worker-cpu-0
   PriorityJobFactor=1 PriorityTier=2 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=0 TotalNodes=0 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=(null)
   ResumeTimeout=GLOBAL SuspendTimeout=GLOBAL SuspendTime=GLOBAL PowerDownOnIdle=NO
# Submit 4 consecutive tasks.
root@slurm-test-0:/# srun sleep 1d &
root@slurm-test-0:/# srun sleep 1d &
root@slurm-test-0:/# srun sleep 1d &
root@slurm-test-0:/# srun sleep 1d &
# Check the current cluster status.
root@slurm-test-0:/# squeue
# The cluster currently has 4 running tasks.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 debug sleep root R 0:03 1 slurm-test-worker-cpu-0
2 debug sleep root R 0:04 1 slurm-test-worker-cpu-0
3 debug sleep root R 0:04 1 slurm-test-worker-cpu-0
1 debug sleep root R 0:05 1 slurm-test-worker-cpu-0
# Submit a task to the high-priority partition.
root@slurm-test-0:/# srun --partition=hipri sleep 1d &
root@slurm-test-0:/# squeue
# The ST (status) of task 4 has changed from R to S, and the status of task 5 has changed to R, indicating that task 4 has been suspended.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 debug sleep root R 1:06 1 slurm-test-worker-cpu-0
3 debug sleep root R 1:06 1 slurm-test-worker-cpu-0
1 debug sleep root R 1:07 1 slurm-test-worker-cpu-0
4 debug sleep root S 0:59 1 slurm-test-worker-cpu-0
5 hipri sleep root R 0:06 1 slurm-test-worker-cpu-0
# Submit a low-priority task.
root@slurm-test-0:/# srun sleep 1d &
# Update the task to high priority.
root@slurm-test-0:/# scontrol update jobid=6 partition=hipri
root@slurm-test-0:/# squeue
# Tasks 1 and 2 have been suspended. Tasks in the same partition share execution time, so tasks 1, 2, 3, and 4 execute in a time-sharing manner.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 debug sleep root R 3:21 1 slurm-test-worker-cpu-0
3 debug sleep root R 3:33 1 slurm-test-worker-cpu-0
2 debug sleep root S 3:41 1 slurm-test-worker-cpu-0
1 debug sleep root S 4:01 1 slurm-test-worker-cpu-0
6 hipri sleep root R 0:03 1 slurm-test-worker-cpu-0
5 hipri sleep root R 3:33 1 slurm-test-worker-cpu-0
Customizing QoS (quality of service) priority
To use QoS-based preemption, Slurm needs both a high-priority and a low-priority QoS (a normal QoS with priority 0 exists by default); the high-priority QoS is created with sacctmgr. Preemption must also be enabled in slurm.conf (for example, PreemptType=preempt/qos). Note: if PreemptType=SUSPEND,GANG semantics are used, preempted low-priority tasks coexist with the high-priority tasks in a time-sharing manner rather than being completely interrupted. Configuring QoS requires the sacctmgr tool. Below is a common command for creating a high-priority QoS.
sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10
preempt=normal: Specifies that the high QoS can preempt tasks running with the normal QoS.
preemptmode=gang,suspend:
Gang mode: A preempting task must fully acquire its resources before it starts executing.
Suspend mode: Preempted tasks are suspended rather than terminated, releasing resources for the preemptor to use, and resume execution when the preempting task finishes.
priority=10: The default priority base score for high QoS tasks is 10 (a higher value means a higher priority).
Enabling preemption in slurm.conf involves the PreemptType and PreemptMode parameters. Additionally, when configuring a partition, you need to add OverSubscribe=FORCE:1 at the end of its configuration line.
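A sketch of those slurm.conf settings, assuming QoS-based preemption with suspend/gang semantics and a partition named debug, might look as follows:
# Use QoS as the basis for preemption decisions
PreemptType=preempt/qos
# Preempted jobs are suspended and gang-scheduled rather than cancelled
PreemptMode=SUSPEND,GANG
# The partition must allow oversubscription so a suspended job and its preemptor can share nodes
PartitionName=debug Nodes=ALL Default=YES OverSubscribe=FORCE:1 State=UP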
Here is an example of using different QoS for task preemption management:
# View the current QoS.
root@slurm-test-0:/# sacctmgr show qos format=name
Name
----------
normal
# Create high-priority QoS.
root@slurm-test-0:/# sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10
Adding QOS(s)
high
Settings
Description = high
Preempt = normal
PreemptMode = GANG,SUSPEND
Priority = 10
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
# View current QoS.
root@slurm-test-0:/# sacctmgr show qos format=name,priority,preempt
Name Priority Preempt
---------- ---------- ----------
normal 0
high 10 normal
# The content of test.sh is as follows.
# #!/bin/bash
# srun sleep 10m
# Submit five consecutive tasks.
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 4
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 5
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 6
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 7
root@slurm-test-0:/# sbatch test.sh
Submitted batch job 8
root@slurm-test-0:/# squeue # Task 8 is in Pending status
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 debug test.sh root PD 0:00 1 (Resources)
7 debug test.sh root R 0:03 1 slurm-test-worker-cpu-0
5 debug test.sh root R 0:15 1 slurm-test-worker-cpu-0
6 debug test.sh root R 0:15 1 slurm-test-worker-cpu-0
4 debug test.sh root R 0:18 1 slurm-test-worker-cpu-0
root@slurm-test-0:/# sbatch --qos=high test.sh # Submit a task to high-priority QOS
Submitted batch job 9
root@slurm-test-0:/# squeue # High-priority QoS begins execution, sharing resources with other tasks in a time-sharing manner
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 debug test.sh root PD 0:00 1 (Resources)
7 debug test.sh root R 0:26 1 slurm-test-worker-cpu-0
5 debug test.sh root R 0:38 1 slurm-test-worker-cpu-0
6 debug test.sh root R 0:38 1 slurm-test-worker-cpu-0
4 debug test.sh root R 0:41 1 slurm-test-worker-cpu-0
9 debug test.sh root S 0:00 1 slurm-test-worker-cpu-0
Customizing job size priority
Job size priority is determined by PriorityWeightJobSize, used here together with PriorityWeightAge.
Job Size Factor
Non-urgent tasks need to use cluster resources efficiently without exceeding their deadlines. When task execution time is unknown, backfill scheduling cannot work effectively. In that case, you can adopt the following strategies:
Prioritize scheduling small tasks to reduce head-of-line blocking.
Increase the priority of large tasks based on their queue time to prevent starvation.
Allow large tasks approaching their deadlines to preempt resources from small tasks (the small tasks are suspended until the large tasks complete).
By implementing these measures, you can maximize cluster resource utilization while ensuring critical tasks complete on time and balancing different types of tasks.
The following configuration needs to be added in slurm.conf (only the relevant settings are shown here; other configurations in slurm.conf are not affected):
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0
Job Waiting Time Factor
After setting up job size priority, submission waiting time becomes the second factor. Slurm calculates the job size score based on the ratio of requested resources to total cluster resources. If PriorityFavorSmall=YES is enabled, the score formula is: score = (1 - resource ratio) × PriorityWeightJobSize. For example, when a cluster has 4 CPU cores available:
A task requesting 1 core scores (1 - 1/4) × PriorityWeightJobSize = 0.75 × 1000 = 750 with the configuration above.
A task requesting all 4 cores scores 0 (it fully occupies the resources).
AgeFactor priority calculation:
Tasks exceeding PriorityMaxAge directly receive the full PriorityWeightAge points.
Other tasks score in proportion to how long they have been queued since submission. For example, with PriorityWeightAge=1000 and PriorityMaxAge=1-0 (one day), each minute of waiting adds approximately 0.69 points, accumulating to the full 1000 points after 24 hours.
Backfill scheduling recommendation: If task execution time can be estimated, you are advised to keep the default backfill scheduling enabled (or manually configure SchedulerType=sched/backfill), allowing the scheduler to fit small tasks into idle periods before large tasks start. Combined with the large-task priority mechanism and deadline-approaching preemption described above, this balances resource utilization and fairness.
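For example, submitting with an explicit wall-clock estimate makes a job eligible for backfill; test.sh is the same example script used earlier.
# A job with a time limit can be backfilled into idle windows ahead of larger jobs
sbatch --time=00:30:00 test.sh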