Container Service for Kubernetes:Set Slurm queue priorities

Last Updated:Mar 26, 2026

Slurm schedules jobs based on priority, so queue configuration directly determines which workloads get resources first. This topic covers three strategies—partition priority, Quality of Service (QoS)-based preemption, and job size priority—so you can choose the approach that fits your cluster.

Note

This topic is based on Slurm version 24.05. Other versions may differ.

Choose a preemption strategy

Before diving into configuration, use this table to select the right strategy for your use case:

| Strategy | Key parameter | Best for |
| --- | --- | --- |
| Partition priority | PreemptType=preempt/partition_prio | Separating urgent from non-urgent workloads into distinct partitions |
| QoS preemption | PreemptType=preempt/qos | Fine-grained priority control across users or job classes |
| Job size priority | PriorityFavorSmall=YES | Mixed small/large job clusters where runtimes are unpredictable |

All three strategies require the following base configuration in slurm.conf:

| Parameter | Recommended value | Description |
| --- | --- | --- |
| SelectType | select/cons_tres | Resource allocation strategy. Slurm cluster workers use the dynamic node feature, so only select/cons_tres is supported. |
| SelectTypeParameters | CR_Core | Controls resource allocation details passed to the SelectType plugin. |
| SchedulerType | sched/backfill | Scheduling algorithm. Backfill scheduling is enabled by default. |
| PriorityType | priority/multifactor | Priority calculation method. |
| PreemptMode | SUSPEND,GANG | Preemption behavior. When preemption is enabled with select/cons_tres, only SUSPEND and GANG are supported. |
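
The base configuration above can be checked before you continue. The following is a minimal sketch, not an official tool: the function name check_base_config and its output format are assumptions made for illustration. It expects each parameter on its own line, in exact case, with no trailing comment, and it reports a parameter that is simply left at its default as unset.

```shell
# Compare a slurm.conf file against the recommended base values.
# check_base_config is a hypothetical helper, not part of Slurm.
check_base_config() {
  conf_file=$1
  status=0
  for expected in \
    "SelectType=select/cons_tres" \
    "SelectTypeParameters=CR_Core" \
    "SchedulerType=sched/backfill" \
    "PriorityType=priority/multifactor" \
    "PreemptMode=SUSPEND,GANG"
  do
    key=${expected%%=*}
    # Use the last uncommented assignment of this key, if there is one.
    actual=$(grep -E "^[[:space:]]*${key}=" "$conf_file" | tail -n 1 | tr -d '[:space:]')
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH ${key}: found '${actual:-<unset>}', expected '${expected}'"
      status=1
    fi
  done
  return $status
}
```

For example, check_base_config /etc/slurm-llnl/slurm.conf prints one MISMATCH line per parameter that differs and returns a non-zero status if any are found.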

How Slurm evaluates and schedules jobs

Understanding how the scheduler selects jobs helps you predict the effect of your configuration.

Scheduling behavior is controlled by SchedulerType (default: sched/backfill) and fine-tuned through SchedulerParameters in slurm.conf. For the full parameter reference, see the Slurm scheduling configuration guide.

Queue types

Slurm supports two priority mechanisms:

| Queue type | How it works | When to use |
| --- | --- | --- |
| First In, First Out (FIFO) | Jobs run in submission order (priority/basic) | Simple clusters where submission order is sufficient |
| Multifactor | Job priority is a weighted sum of multiple factors | Clusters with diverse workloads or fairness requirements |

Multifactor scheduling is enabled by default. If your cluster runs uniform workloads with no preemption requirements, FIFO may be sufficient.

FIFO queue

Set PriorityType=priority/basic in slurm.conf to enable FIFO scheduling:

# Back up slurm.conf before editing
sudo cp /etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurm.conf.bak

sudo nano /etc/slurm-llnl/slurm.conf
# Add or update:
PriorityType=priority/basic
Important

Back up slurm.conf before making changes. Test configuration changes in a non-production environment before applying them to production.
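
After the controller has picked up the change, you can confirm which priority plugin is active. The sketch below assumes that scontrol show config prints each setting as a padded "Name = value" line; the helper name priority_type_from_config is hypothetical.

```shell
# Extract the PriorityType value from `scontrol show config` output on stdin.
# priority_type_from_config is a hypothetical helper, not part of Slurm.
priority_type_from_config() {
  awk -F'=' '$1 ~ /^PriorityType[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Query the live controller only when scontrol is installed on this host.
if command -v scontrol >/dev/null 2>&1; then
  scontrol show config | priority_type_from_config
fi
```

If the FIFO setting is active, the helper prints priority/basic.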

Multifactor queue

Multifactor scheduling computes each job's priority as a weighted sum:

Job_priority =
  site_factor +
  (PriorityWeightAge)       * (age_factor) +
  (PriorityWeightAssoc)     * (assoc_factor) +
  (PriorityWeightFairshare) * (fair-share_factor) +
  (PriorityWeightJobSize)   * (job_size_factor) +
  (PriorityWeightPartition) * (priority_job_factor) +
  (PriorityWeightQOS)       * (QOS_factor) +
  SUM(TRES_weight_cpu * TRES_factor_cpu,
      TRES_weight_<type> * TRES_factor_<type>,
      ...)
  - nice_factor

Each weight controls a different scheduling behavior:

| Factor | Weight parameter | Effect |
| --- | --- | --- |
| Waiting time | PriorityWeightAge | Longer-waiting jobs score higher |
| Association | PriorityWeightAssoc | Adjusts score by user group or account |
| Fair-share | PriorityWeightFairshare | Boosts jobs from users who have consumed fewer resources |
| Job size | PriorityWeightJobSize | Favors small or large jobs depending on configuration |
| Partition | PriorityWeightPartition | Raises score for jobs in designated partitions |
| QoS | PriorityWeightQOS | Applies Quality of Service level scoring |
| Trackable RESources (TRES) | TRES_weight_<type> | Weights specific resource types (CPU, GPU, and others) |
| Nice | (none) | Subtracts from priority; a higher nice value lowers priority |

For full weight configuration details, see the multifactor priority plugin documentation.
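
To make the weighted sum concrete, the following sketch evaluates a reduced form of the formula with only the age, fair-share, and job size terms plus the nice value. The helper name job_priority and the sample factor values are assumptions chosen for illustration, and the result is truncated to an integer for display.

```shell
# Reduced multifactor sum: age, fair-share, and job size terms minus nice.
# Arguments: age weight, age factor, fair-share weight, fair-share factor,
#            job size weight, job size factor, nice value.
# job_priority is a hypothetical helper, not part of Slurm.
job_priority() {
  awk -v wa="$1" -v fa="$2" -v wf="$3" -v ff="$4" -v wj="$5" -v fj="$6" -v nice="$7" \
    'BEGIN { printf "%d\n", wa * fa + wf * ff + wj * fj - nice }'
}

# All weights 1000; factors 0.5 (age), 0.25 (fair-share), 0.75 (job size).
job_priority 1000 0.5 1000 0.25 1000 0.75 0     # prints 1500
# The same job submitted with a nice value of 100.
job_priority 1000 0.5 1000 0.25 1000 0.75 100   # prints 1400
```

On a live cluster, sprio -l shows the per-factor contributions that Slurm computed for pending jobs, which is the authoritative view.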

Common configurations:

  • Prioritize small jobs: Set PriorityFavorSmall=YES together with a positive PriorityWeightJobSize to rank small jobs above large ones and reduce head-of-line blocking.

  • Protect critical teams: Tune PriorityWeightAssoc and PriorityWeightFairshare so that jobs from important accounts score higher.

  • Prevent starvation: Set PriorityWeightFairshare=2000 to significantly boost jobs from users with low historical resource usage.

Configure partition priority preemption

Partition priority preemption lets you create two partitions over the same node pool: one for urgent jobs and one for non-urgent jobs. Submit a job to the high-priority partition, or move a pending job there, to preempt lower-priority work. Preempted jobs are suspended or canceled depending on your PreemptMode setting; suspended jobs resume when the high-priority job completes.

Prerequisites

Before you begin, ensure that you have:

  • A running Slurm cluster on ACK with a worker node pool

  • Root or sudo access to edit slurm.conf

  • A non-production environment to validate changes before applying them to production

Set up partition-based preemption

  1. Edit slurm.conf to enable preemption:

    Important

    Back up slurm.conf before editing. Test changes in a non-production environment first.

    sudo nano /etc/slurm-llnl/slurm.conf

    Add or update the following parameters:

    # Preemption based on partition priority tier
    PreemptType=preempt/partition_prio
    
    # Behavior when a job is preempted:
    # SUSPEND pauses the preempted job so that it can resume later
    # GANG time-slices jobs that share the same resources
    PreemptMode=SUSPEND,GANG
  2. Create a high-priority partition:

    scontrol create partition=hipri PriorityTier=2 nodes=ALL

    Alternatively, add the partition directly to slurm.conf.

  3. Verify both partitions are visible:

    scontrol show partition

    Expected output (key fields shown):

    PartitionName=debug
       PriorityTier=1 OverSubscribe=FORCE:1
       PreemptMode=GANG,SUSPEND
       State=UP TotalCPUs=4 TotalNodes=1
    
    PartitionName=hipri
       PriorityTier=2 OverSubscribe=NO
       PreemptMode=GANG,SUSPEND
       State=UP TotalCPUs=0 TotalNodes=0
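
To check the result of the steps above without reading the full output, you can extract only the priority tiers. This is a sketch that assumes the key=value layout shown in the expected output; the helper name partition_tiers is hypothetical.

```shell
# Print "<partition> <PriorityTier>" for each partition found in
# `scontrol show partition` output read from stdin.
# partition_tiers is a hypothetical helper, not part of Slurm.
partition_tiers() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "PartitionName") name = kv[2]
        if (kv[1] == "PriorityTier")  print name, kv[2]
      }
    }'
}

# Query the live controller only when scontrol is installed on this host.
if command -v scontrol >/dev/null 2>&1; then
  scontrol show partition | partition_tiers
fi
```

For the output shown above, the helper prints "debug 1" and "hipri 2". Preemption by partition requires hipri to have the higher tier.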

Trigger preemption

Submit a job to the high-priority partition to preempt a running low-priority job:

# Submit four jobs to the default partition
srun sleep 1d &
srun sleep 1d &
srun sleep 1d &
srun sleep 1d &

# Confirm all four are running
squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    4     debug    sleep     root  R       0:03      1 slurm-test-worker-cpu-0
    3     debug    sleep     root  R       0:04      1 slurm-test-worker-cpu-0
    2     debug    sleep     root  R       0:04      1 slurm-test-worker-cpu-0
    1     debug    sleep     root  R       0:05      1 slurm-test-worker-cpu-0
# Submit a job to the high-priority partition
srun --partition=hipri sleep 1d &
squeue

Job 4 is now suspended (ST changes from R to S), and job 5 is running:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    5     hipri    sleep     root  R       0:06      1 slurm-test-worker-cpu-0
    4     debug    sleep     root  S       0:59      1 slurm-test-worker-cpu-0
    3     debug    sleep     root  R       1:06      1 slurm-test-worker-cpu-0
    2     debug    sleep     root  R       1:06      1 slurm-test-worker-cpu-0
    1     debug    sleep     root  R       1:07      1 slurm-test-worker-cpu-0

To escalate a pending job without resubmitting it, update its partition:

scontrol update jobid=6 partition=hipri
squeue

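Jobs 5 and 6 now both run in hipri, so two debug jobs are suspended to make room for them. In this output, jobs 1 and 2 are suspended and job 4 has resumed, because with PreemptMode=SUSPEND,GANG the scheduler can change which lower-priority jobs are suspended: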

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    6     hipri    sleep     root  R       0:03      1 slurm-test-worker-cpu-0
    5     hipri    sleep     root  R       3:33      1 slurm-test-worker-cpu-0
    4     debug    sleep     root  R       3:21      1 slurm-test-worker-cpu-0
    3     debug    sleep     root  R       3:33      1 slurm-test-worker-cpu-0
    2     debug    sleep     root  S       3:41      1 slurm-test-worker-cpu-0
    1     debug    sleep     root  S       4:01      1 slurm-test-worker-cpu-0
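
When the queue is long, it can help to list only the jobs that were preempted. The sketch below reads squeue output in the default column layout shown above, where ST is the fifth column; the helper name suspended_jobs is hypothetical, and the parsing assumes that job names and user names contain no spaces.

```shell
# Print the job ID of every suspended job in default `squeue` output on stdin.
# suspended_jobs is a hypothetical helper, not part of Slurm.
suspended_jobs() {
  # Skip the header row, then match rows whose ST column is S.
  awk 'NR > 1 && $5 == "S" { print $1 }'
}

# Query the live queue only when squeue is installed on this host.
if command -v squeue >/dev/null 2>&1; then
  squeue | suspended_jobs
fi
```

For the last output above, the helper prints 2 and 1. Running squeue with the --states=SUSPENDED filter gives a similar view directly from Slurm.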

Configure QoS-based preemption

QoS preemption assigns priority levels to jobs regardless of which partition they run in. A job submitted with a high-priority QoS can preempt jobs with a lower-priority QoS. When PreemptMode=SUSPEND,GANG, preempted jobs are suspended and share resources in time-sharing mode—they are not terminated.

By default, Slurm includes a normal QoS with priority 0.

Prerequisites

Before you begin, ensure that you have:

  • A running Slurm cluster on ACK

  • Root or sudo access to edit slurm.conf and run sacctmgr

  • The base preemption parameters already set in slurm.conf (see Choose a preemption strategy)

Set up QoS-based preemption

  1. Enable QoS preemption in slurm.conf:

    sudo nano /etc/slurm-llnl/slurm.conf

    Add or update:

    PreemptType=preempt/qos
    PreemptMode=SUSPEND,GANG

    Also add OverSubscribe=FORCE:1 to each affected partition definition in slurm.conf.

  2. Create a high-priority QoS:

    sacctmgr add qos high preempt=normal preemptmode=gang,suspend priority=10

    When prompted, enter y to confirm. This is a normal sacctmgr confirmation—it is not an error. Parameter meanings:

    | Parameter | Value | Effect |
    | --- | --- | --- |
    | preempt | normal | Allows high QoS jobs to preempt normal QoS jobs |
    | preemptmode | gang,suspend | Gang: jobs that share the same resources take turns in time slices. Suspend: preempted jobs pause and resume after the preempting job finishes. |
    | priority | 10 | Base priority score for high QoS jobs (higher = higher priority) |
  3. Verify the QoS configuration:

    sacctmgr show qos format=name,priority,preempt

    Expected output:

          Name   Priority    Preempt
    ---------- ---------- ----------
        normal          0
          high         10     normal
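
The verification step above can also be scripted. This sketch assumes the pipe-delimited layout that sacctmgr produces with the -n (no header) and -P (parsable) options, for example "high|10|normal"; the helper name qos_can_preempt is hypothetical.

```shell
# Succeed if QoS $1 lists QoS $2 in its Preempt column.
# Input: output of `sacctmgr -nP show qos format=name,priority,preempt` on stdin.
# qos_can_preempt is a hypothetical helper, not part of Slurm.
qos_can_preempt() {
  awk -F'|' -v hi="$1" -v lo="$2" '
    $1 == hi {
      n = split($3, targets, ",")
      for (i = 1; i <= n; i++) if (targets[i] == lo) found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# Query the accounting database only when sacctmgr is installed on this host.
if command -v sacctmgr >/dev/null 2>&1; then
  sacctmgr -nP show qos format=name,priority,preempt | qos_can_preempt high normal \
    && echo "high can preempt normal"
fi
```

In scripts, sacctmgr accepts the -i option to commit changes without the confirmation prompt described in step 2.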

Trigger QoS preemption

# Submit five jobs with normal QoS
sbatch test.sh   # job 4
sbatch test.sh   # job 5
sbatch test.sh   # job 6
sbatch test.sh   # job 7
sbatch test.sh   # job 8
squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    8     debug  test.sh     root PD       0:00      1 (Resources)
    7     debug  test.sh     root  R       0:03      1 slurm-test-worker-cpu-0
    6     debug  test.sh     root  R       0:15      1 slurm-test-worker-cpu-0
    5     debug  test.sh     root  R       0:15      1 slurm-test-worker-cpu-0
    4     debug  test.sh     root  R       0:18      1 slurm-test-worker-cpu-0
# Submit a high-priority QoS job
sbatch --qos=high test.sh   # job 9
squeue

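Job 9 is allocated resources immediately instead of queuing behind job 8, and shares them with the lower-priority jobs in time-sharing mode. In this snapshot job 9 shows state S because gang scheduling alternates the jobs between running and suspended time slices: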

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    9     debug  test.sh     root  S       0:00      1 slurm-test-worker-cpu-0
    8     debug  test.sh     root PD       0:00      1 (Resources)
    7     debug  test.sh     root  R       0:26      1 slurm-test-worker-cpu-0
    6     debug  test.sh     root  R       0:38      1 slurm-test-worker-cpu-0
    5     debug  test.sh     root  R       0:38      1 slurm-test-worker-cpu-0
    4     debug  test.sh     root  R       0:41      1 slurm-test-worker-cpu-0
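
The contents of test.sh are not shown in this topic. If you need a stand-in to reproduce the steps, the sketch below writes a hypothetical batch script that holds one CPU for a day, mirroring the sleep 1d jobs used in the partition example. The helper name write_test_script and the resource requests are assumptions.

```shell
# Write a placeholder batch script to the given path (default: test.sh).
# write_test_script and the script body are hypothetical stand-ins for
# the test.sh used above, whose real contents are not shown.
write_test_script() {
  script_path=${1:-test.sh}
  cat > "$script_path" <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Placeholder workload: hold one CPU for a day.
sleep 1d
EOF
  chmod +x "$script_path"
}
```

For example, run write_test_script test.sh and then submit the result with sbatch --qos=high test.sh.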

Configure job size priority

Job size priority is useful when you cannot predict job runtimes and backfill scheduling cannot determine which gaps to fill. The strategy ranks small jobs first, which reduces head-of-line blocking, and uses waiting time to keep large jobs from starving.

How scoring works

When PriorityFavorSmall=YES, the job size score is:

size_score = (1 - requested_resources / total_cluster_resources) * PriorityWeightJobSize

Example with a 4-CPU cluster and PriorityWeightJobSize=1000:

| Job size | Score calculation | Score |
| --- | --- | --- |
| 1 CPU | (1 - 1/4) * 1000 | 750 |
| 4 CPUs | (1 - 4/4) * 1000 | 0 |
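
The table above can be reproduced with a few lines of shell arithmetic. This is a sketch of the formula as stated in this topic, not of Slurm's internal code; the helper name size_score is hypothetical, and the result is truncated to an integer.

```shell
# Job size score for PriorityFavorSmall=YES, following the formula above.
# Arguments: requested CPUs, total cluster CPUs, PriorityWeightJobSize.
# size_score is a hypothetical helper, not part of Slurm.
size_score() {
  awk -v req="$1" -v total="$2" -v w="$3" \
    'BEGIN { printf "%d\n", (1 - req / total) * w }'
}

size_score 1 4 1000   # prints 750
size_score 2 4 1000   # prints 500
size_score 4 4 1000   # prints 0
```

The 2-CPU case is an extra data point: a job that requests half the cluster receives half of PriorityWeightJobSize.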

The age factor accumulates over time. With PriorityWeightAge=1000 and PriorityMaxAge=1-0 (1 day):

  • Each minute of waiting adds approximately 0.69 points.

  • After 24 hours, the job receives the full 1,000 age points.

  • Jobs waiting longer than PriorityMaxAge stay capped at the full PriorityWeightAge score.

This prevents large jobs from waiting indefinitely: as waiting time grows, the age factor eventually outweighs the size penalty.
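
The age figures above follow from a linear ramp that is capped at PriorityMaxAge. The sketch below applies that ramp with PriorityMaxAge expressed in minutes (1 day = 1440); the helper name age_points is hypothetical, and the model ignores every factor other than age.

```shell
# Age points after a given wait, capped at the full weight.
# Arguments: minutes waited, PriorityWeightAge, PriorityMaxAge in minutes.
# age_points is a hypothetical helper, not part of Slurm.
age_points() {
  awk -v m="$1" -v w="$2" -v max="$3" \
    'BEGIN { f = m / max; if (f > 1) f = 1; printf "%.2f\n", f * w }'
}

age_points 1 1000 1440      # prints 0.69
age_points 1080 1000 1440   # prints 750.00
age_points 1440 1000 1440   # prints 1000.00
age_points 2880 1000 1440   # prints 1000.00 (capped)
```

With the example weights, a 4-CPU job that has waited 1080 minutes (18 hours) has accumulated 750 age points, which equals the size score of a newly submitted 1-CPU job in the 4-CPU cluster example.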

Set up job size priority

Add the following to slurm.conf:

PriorityFavorSmall=YES
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

If job runtimes can be estimated, also enable backfill scheduling (it is on by default):

SchedulerType=sched/backfill

Backfill scheduling fills idle time before large jobs with small jobs that can complete without delaying the larger ones.
