Slurm schedules jobs based on priority, so queue configuration directly determines which workloads get resources first. This topic covers three strategies—partition priority, Quality of Service (QoS)-based preemption, and job size priority—so you can choose the approach that fits your cluster.
This topic is based on Slurm version 24.05. Other versions may differ.
Choose a preemption strategy
Before diving into configuration, use this table to select the right strategy for your use case:
| Strategy | Key parameter | Best for |
|---|---|---|
| Partition priority | PreemptType=preempt/partition_prio | Separating urgent from non-urgent workloads into distinct partitions |
| QoS preemption | PreemptType=preempt/qos | Fine-grained priority control across users or job classes |
| Job size priority | PriorityFavorSmall=YES | Mixed small/large job clusters where runtimes are unpredictable |
All three strategies require the following base configuration in slurm.conf:
| Parameter | Recommended value | Description |
|---|---|---|
| SelectType | select/cons_tres | Resource allocation strategy. Slurm cluster workers use the dynamic node feature, so only select/cons_tres is supported. |
| SelectTypeParameters | CR_Core | Controls resource allocation details passed to the SelectType plugin. |
| SchedulerType | sched/backfill | Scheduling algorithm. Backfill scheduling is enabled by default. |
| PriorityType | priority/multifactor | Priority calculation method. |
| PreemptMode | SUSPEND,GANG | Preemption behavior. When preemption is enabled with select/cons_tres, only SUSPEND and GANG are supported. |
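As a quick sanity check on the table above, the following sketch reports which of the recommended base parameters are absent from a configuration file. The helper name `check_base_conf` is hypothetical (it is not a Slurm command), and the sketch assumes each parameter is written as an exact `Key=Value` line with the same capitalization as the table. A parameter that is missing from the file may still be in effect through a Slurm default.

```shell
# Sketch: list recommended base parameters that are not present in a slurm.conf file.
# check_base_conf is a hypothetical helper, not part of Slurm.
check_base_conf() {
  local conf="$1" missing=0 pair
  for pair in \
    "SelectType=select/cons_tres" \
    "SelectTypeParameters=CR_Core" \
    "SchedulerType=sched/backfill" \
    "PriorityType=priority/multifactor" \
    "PreemptMode=SUSPEND,GANG"; do
    # -x matches the whole line, -F treats the pair as a literal string
    if ! grep -qxF "$pair" "$conf"; then
      echo "missing: $pair"
      missing=1
    fi
  done
  # Exit status 0 when every pair was found, 1 otherwise
  return "$missing"
}
```

For example, `check_base_conf /etc/slurm-llnl/slurm.conf` prints one `missing:` line per parameter that is not set to the recommended value and returns a non-zero status if any are missing.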
How Slurm evaluates and schedules jobs
Understanding how the scheduler selects jobs helps you predict the effect of your configuration.
Scheduling behavior is controlled by SchedulerType (default: sched/backfill) and fine-tuned through SchedulerParameters in slurm.conf. For the full parameter reference, see the Slurm scheduling configuration guide.
Queue types
Slurm supports two priority mechanisms:
| Queue type | How it works | When to use |
|---|---|---|
| First In, First Out (FIFO) | Jobs run in submission order (priority/basic) | Simple clusters where submission order is sufficient |
| Multifactor | Job priority is a weighted sum of multiple factors | Clusters with diverse workloads or fairness requirements |
Multifactor scheduling is enabled by default. If your cluster runs uniform workloads with no preemption requirements, FIFO may be sufficient.
FIFO queue
Set PriorityType=priority/basic in slurm.conf to enable FIFO scheduling:
# Back up slurm.conf before editing
sudo cp /etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurm.conf.bak
sudo nano /etc/slurm-llnl/slurm.conf
# Add or update:
PriorityType=priority/basic
Back up slurm.conf before making changes. Test configuration changes in a non-production environment before applying them to production.
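The backup-then-edit routine above can also be scripted. The sketch below is one possible way to do it, not a Slurm tool: `set_conf_key` is a hypothetical helper that copies the file to `<file>.bak` and then adds or updates a single `Key=Value` line. It assumes one parameter per line, exact key capitalization, and values that contain no `|` or `&` characters.

```shell
# Sketch: back up a config file, then add or update one Key=Value line.
# set_conf_key is a hypothetical helper, not part of Slurm.
set_conf_key() {
  local conf="$1" key="$2" value="$3"
  cp "$conf" "$conf.bak"    # keep a copy of the previous version
  if grep -q "^${key}=" "$conf.bak"; then
    # Key already present: rewrite that line, reading from the backup.
    # "|" is the sed delimiter because Slurm values often contain "/".
    sed "s|^${key}=.*|${key}=${value}|" "$conf.bak" > "$conf"
  else
    # Key absent: append it to the end of the file
    echo "${key}=${value}" >> "$conf"
  fi
}
```

Run against the real file, this would look like `sudo bash -c '. ./helpers.sh; set_conf_key /etc/slurm-llnl/slurm.conf PriorityType priority/basic'`, where `helpers.sh` is an assumed file name holding the function.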
Multifactor queue
Multifactor scheduling computes each job's priority as a weighted sum:
Job_priority =
site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (priority_job_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...)
- nice_factor
Each weight controls a different scheduling behavior:
| Factor | Weight parameter | Effect |
|---|---|---|
| Waiting time | PriorityWeightAge | Longer-waiting jobs score higher |
| Association | PriorityWeightAssoc | Adjusts score by user group or account |
| Fair-share | PriorityWeightFairshare | Boosts jobs from users who have consumed fewer resources |
| Job size | PriorityWeightJobSize | Favors small or large jobs depending on configuration |
| Partition | PriorityWeightPartition | Raises score for jobs in designated partitions |
| QoS | PriorityWeightQOS | Applies Quality of Service level scoring |
| Trackable RESources (TRES) | TRES_weight_<type> | Weights specific resource types (CPU, GPU, and others) |
| Nice | — | Subtracts from priority; a higher nice value lowers priority |
For full weight configuration details, see the multifactor priority plugin documentation.
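To make the weighted sum concrete, here is a minimal sketch that evaluates a reduced version of the formula for one job. It covers only the age, fair-share, and job size terms plus the nice value, and treats the site, association, partition, QoS, and TRES terms as zero. The helper name `job_priority` and the sample factor values are illustrative assumptions; each factor is taken to be a number between 0.0 and 1.0.

```shell
# Sketch: weighted sum of three priority terms minus the nice value.
# job_priority is a hypothetical helper; arguments are
#   W_AGE F_AGE W_FAIRSHARE F_FAIRSHARE W_JOBSIZE F_JOBSIZE NICE
job_priority() {
  awk -v wa="$1" -v fa="$2" -v wf="$3" -v ff="$4" \
      -v ws="$5" -v fs="$6" -v nice="$7" \
      'BEGIN { printf "%.0f\n", wa * fa + wf * ff + ws * fs - nice }'
}

# 1000*0.5 + 2000*0.25 + 1000*0.75 - 0
job_priority 1000 0.5 2000 0.25 1000 0.75 0     # prints 1750
# Same job with a nice value of 100
job_priority 1000 0.5 2000 0.25 1000 0.75 100   # prints 1650
```

Raising one weight relative to the others makes that factor dominate the ordering, which is the lever the configurations in this topic rely on.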
Common configurations:
- Prioritize small jobs: Set PriorityFavorSmall=YES together with a positive PriorityWeightJobSize so that small jobs score higher than large ones, which reduces head-of-line blocking.
- Protect critical teams: Tune PriorityWeightAssoc and PriorityWeightFairshare so that jobs from important accounts run first.
- Prevent starvation: Set PriorityWeightFairshare=2000 to significantly boost jobs from users with low historical resource usage.
Configure partition priority preemption
Partition priority preemption lets you create two partitions over the same node pool: one for urgent jobs and one for non-urgent jobs. Submit a job to the high-priority partition, or move a pending job there, to preempt lower-priority work. With the PreemptMode=SUSPEND,GANG setting used in this topic, preempted jobs are suspended and resume when the high-priority job completes.
Prerequisites
Before you begin, ensure that you have:
- A running Slurm cluster on ACK with a worker node pool
- Root or sudo access to edit slurm.conf
- A non-production environment to validate changes before applying them to production
Set up partition-based preemption
1. Edit slurm.conf to enable preemption:

   Important: Back up slurm.conf before editing. Test changes in a non-production environment first.

       sudo nano /etc/slurm-llnl/slurm.conf

   Add or update the following parameters:

       # Preemption based on partition priority tier
       PreemptType=preempt/partition_prio
       # Behavior when a job is preempted:
       # suspend: pauses the job until resources are available again
       # cancel: terminates the job immediately
       PreemptMode=SUSPEND,GANG

2. Create a high-priority partition:

       scontrol create partition=hipri PriorityTier=2 nodes=ALL

   Alternatively, add the partition directly to slurm.conf.

3. Verify both partitions are visible:

       scontrol show partition

   Expected output (key fields shown):

       PartitionName=debug PriorityTier=1 OverSubscribe=FORCE:1 PreemptMode=GANG,SUSPEND State=UP TotalCPUs=4 TotalNodes=1
       PartitionName=hipri PriorityTier=2 OverSubscribe=NO PreemptMode=GANG,SUSPEND State=UP TotalCPUs=0 TotalNodes=0
Trigger preemption
Submit a job to the high-priority partition to preempt a running low-priority job:
# Submit four jobs to the default partition
srun sleep 1d &
srun sleep 1d &
srun sleep 1d &
srun sleep 1d &
# Confirm all four are running
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 debug sleep root R 0:03 1 slurm-test-worker-cpu-0
3 debug sleep root R 0:04 1 slurm-test-worker-cpu-0
2 debug sleep root R 0:04 1 slurm-test-worker-cpu-0
1 debug sleep root R 0:05 1 slurm-test-worker-cpu-0
# Submit a job to the high-priority partition
srun --partition=hipri sleep 1d &
squeue
Job 4 is now suspended (ST changes from R to S), and job 5 is running:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 hipri sleep root R 0:06 1 slurm-test-worker-cpu-0
4 debug sleep root S 0:59 1 slurm-test-worker-cpu-0
3 debug sleep root R 1:06 1 slurm-test-worker-cpu-0
2 debug sleep root R 1:06 1 slurm-test-worker-cpu-0
1 debug sleep root R 1:07 1 slurm-test-worker-cpu-0
To escalate a pending job without resubmitting it, update its partition:
scontrol update jobid=6 partition=hipri
squeue
Jobs 1 and 2 are now suspended because jobs in the same partition (hipri) share execution time via gang scheduling:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6 hipri sleep root R 0:03 1 slurm-test-worker-cpu-0
5 hipri sleep root R 3:33 1 slurm-test-worker-cpu-0
4 debug sleep root R 3:21 1 slurm-test-worker-cpu-0
3 debug sleep root R 3:33 1 slurm-test-worker-cpu-0
2 debug sleep root S 3:41 1 slurm-test-worker-cpu-0
1 debug sleep root S 4:01 1 slurm-test-worker-cpu-0
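When many jobs are queued, reading the ST column by eye gets tedious. The sketch below tallies jobs per partition and state from squeue output supplied on standard input. `count_states` is a hypothetical helper, and it assumes the default column order shown in the listings above (partition in column 2, state in column 5) with a header on the first line.

```shell
# Sketch: count jobs per partition and state from squeue output on stdin.
# count_states is a hypothetical helper, not part of Slurm.
count_states() {
  # NR > 1 skips the header; the key combines partition ($2) and state ($5)
  awk 'NR > 1 { n[$2 " " $5]++ } END { for (k in n) print k, n[k] }' | sort
}
```

On a live cluster this would be used as `squeue | count_states`. Fed the first preemption listing above (job 5 running in hipri, job 4 suspended), it prints `debug R 3`, `debug S 1`, and `hipri R 1`, one per line.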
Configure QoS-based preemption
QoS preemption assigns priority levels to jobs regardless of which partition they run in. A job submitted with a high-priority QoS can preempt jobs with a lower-priority QoS. When PreemptMode=SUSPEND,GANG, preempted jobs are suspended and share resources in time-sharing mode—they are not terminated.
By default, Slurm includes a normal QoS with priority 0.
Prerequisites
Before you begin, ensure that you have:
- A running Slurm cluster on ACK
- Root or sudo access to edit slurm.conf and run sacctmgr
- The base preemption parameters already set in slurm.conf (see Choose a preemption strategy)
Set up QoS-based preemption
1. Enable QoS preemption in slurm.conf:

       sudo nano /etc/slurm-llnl/slurm.conf

   Add or update:

       PreemptType=preempt/qos
       PreemptMode=SUSPEND,GANG

   Also add OverSubscribe=FORCE:1 to each affected partition definition in slurm.conf.

2. Create a high-priority QoS:

       sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10

   When prompted, enter y to confirm. This is a normal sacctmgr confirmation, not an error. Parameter meanings:

   | Parameter | Value | Effect |
   |---|---|---|
   | preempt | normal | Allows high QoS jobs to preempt normal QoS jobs |
   | preemptmode | gang,suspend | Gang: jobs that share the same resources take turns through time slicing. Suspend: preempted jobs pause and resume after the preempting job finishes. |
   | priority | 10 | Base priority score for high QoS jobs (higher = higher priority) |

3. Verify the QoS configuration:

       sacctmgr show qos format=name,priority,preempt

   Expected output:

             Name   Priority    Preempt
       ---------- ---------- ----------
           normal          0
             high         10     normal
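The verification step can also be automated. The following sketch reads QoS records in `name|priority|preempt` form on standard input and succeeds only if the first QoS lists the second in its preempt field. `qos_can_preempt` is a hypothetical helper. The pipe-separated input is assumed to come from sacctmgr's `--parsable2` and `--noheader` options, and the preempt field is treated as a comma-separated list.

```shell
# Sketch: exit 0 if QoS $1 is allowed to preempt QoS $2, else exit 1.
# Input on stdin: one "name|priority|preempt" record per line.
# qos_can_preempt is a hypothetical helper, not part of Slurm.
qos_can_preempt() {
  awk -F'|' -v h="$1" -v l="$2" '
    $1 == h {
      # The preempt field may name several QoS levels separated by commas
      n = split($3, targets, ",")
      for (i = 1; i <= n; i++) if (targets[i] == l) found = 1
    }
    END { exit(found ? 0 : 1) }'
}
```

A typical call would be `sacctmgr --noheader --parsable2 show qos format=name,priority,preempt | qos_can_preempt high normal && echo "high can preempt normal"`.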
Trigger QoS preemption
# Submit five jobs with normal QoS
sbatch test.sh # job 4
sbatch test.sh # job 5
sbatch test.sh # job 6
sbatch test.sh # job 7
sbatch test.sh # job 8
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 debug test.sh root PD 0:00 1 (Resources)
7 debug test.sh root R 0:03 1 slurm-test-worker-cpu-0
6 debug test.sh root R 0:15 1 slurm-test-worker-cpu-0
5 debug test.sh root R 0:15 1 slurm-test-worker-cpu-0
4 debug test.sh root R 0:18 1 slurm-test-worker-cpu-0
# Submit a high-priority QoS job
sbatch --qos=high test.sh # job 9
squeue
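Job 9 is scheduled immediately instead of waiting behind pending job 8, and shares resources with the other jobs in time-sharing mode. The snapshot below was taken while job 9 was in a suspended (S) time slice: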
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9 debug test.sh root S 0:00 1 slurm-test-worker-cpu-0
8 debug test.sh root PD 0:00 1 (Resources)
7 debug test.sh root R 0:26 1 slurm-test-worker-cpu-0
6 debug test.sh root R 0:38 1 slurm-test-worker-cpu-0
5 debug test.sh root R 0:38 1 slurm-test-worker-cpu-0
4 debug test.sh root R 0:41 1 slurm-test-worker-cpu-0
Configure job size priority
Job size priority is useful when you cannot predict job runtimes and backfill scheduling cannot determine which gaps to fill. The strategy prioritizes small jobs by default (reducing head-of-line blocking), then uses waiting time to prevent large jobs from starving.
How scoring works
When PriorityFavorSmall=YES, the job size score is:
size_score = (1 - requested_resources / total_cluster_resources) x PriorityWeightJobSize
Example with a 4-CPU cluster and PriorityWeightJobSize=1000:
| Job size | Score calculation | Score |
|---|---|---|
| 1 CPU | (1 - 1/4) x 1000 | 750 |
| 4 CPUs | (1 - 4/4) x 1000 | 0 |
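The table rows can be reproduced with a few lines of shell. This is a sketch of the formula as stated above, not of Slurm's internal implementation. `size_score` is a hypothetical helper name, and the 2-CPU case is an extra data point that does not appear in the table.

```shell
# Sketch: size_score REQUESTED TOTAL WEIGHT
# Evaluates (1 - requested/total) * weight, rounded to a whole number.
# size_score is a hypothetical helper, not part of Slurm.
size_score() {
  awk -v r="$1" -v t="$2" -v w="$3" \
    'BEGIN { printf "%.0f\n", (1 - r / t) * w }'
}

size_score 1 4 1000   # prints 750
size_score 2 4 1000   # prints 500
size_score 4 4 1000   # prints 0
```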
The age factor accumulates over time. With PriorityWeightAge=1000 and PriorityMaxAge=1-0 (1 day):
- Each minute of waiting adds approximately 0.69 points (1,000 points spread over 1,440 minutes).
- After 24 hours, the job receives the full 1,000 age points.
- Jobs that have waited PriorityMaxAge or longer receive the full PriorityWeightAge score, and the age contribution stops growing.
This prevents large jobs from waiting indefinitely: as waiting time grows, the age factor eventually outweighs the size penalty.
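A short calculation shows when that crossover happens for the example values. The sketch below assumes the age factor grows linearly up to PriorityMaxAge and that all other priority terms are equal between the two jobs. Both `age_points` and `minutes_to_reach` are hypothetical helper names.

```shell
# Sketch: age_points MINUTES_WAITED MAX_AGE_MINUTES WEIGHT
# Linear growth, capped at the full weight once MAX_AGE_MINUTES is reached.
age_points() {
  awk -v m="$1" -v max="$2" -v w="$3" \
    'BEGIN { f = m / max; if (f > 1) f = 1; printf "%.0f\n", f * w }'
}

# Sketch: minutes_to_reach POINTS MAX_AGE_MINUTES WEIGHT
# Waiting time needed to accumulate POINTS age points.
minutes_to_reach() {
  awk -v p="$1" -v max="$2" -v w="$3" \
    'BEGIN { printf "%.0f\n", p * max / w }'
}

age_points 60 1440 1000          # 1 hour: prints 42
age_points 1440 1440 1000        # 24 hours: prints 1000
age_points 2880 1440 1000        # 48 hours: prints 1000 (capped)
minutes_to_reach 750 1440 1000   # prints 1080
```

Under these assumptions, a 4-CPU job (size score 0) matches the 750-point size score of a freshly submitted 1-CPU job after 1,080 minutes, or 18 hours, of waiting.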
Set up job size priority
Add the following to slurm.conf:
PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0
If job runtimes can be estimated, also enable backfill scheduling (it is on by default):
SchedulerType=sched/backfill
Backfill scheduling fills idle time before large jobs with small jobs that can complete without delaying the larger ones.
What's next
- Slurm multifactor priority plugin: full weight configuration reference
- Slurm scheduling configuration: backfill and scheduler parameters