Elastic High Performance Computing: Best practices for managed PowerSaving Slurm clusters

Last Updated: Mar 04, 2026

A managed PowerSaving Slurm cluster pre-registers virtual nodes for each queue. It automatically creates ECS instances when you submit jobs and automatically releases them after jobs finish. You pay for compute resources only while jobs run.

Use cases

HPC workloads often exhibit clear peaks and valleys. You need large amounts of computing power during batch job submissions but require almost no resources during idle periods. Static clusters continue running nodes even when idle, incurring ongoing costs. PowerSaving mode ties compute resources to the job lifecycle. The scheduler pre-registers a set of virtual nodes. It automatically creates ECS instances when you submit jobs and automatically releases them after an idle timeout. You pay for compute resources only while jobs run.

How PowerSaving works

The Slurm scheduler pre-registers a group of virtual nodes (in the idle~ state) for each queue. These nodes appear in the scheduler but do not consume ECS resources. After you submit a job, the scheduler creates an ECS instance (alloc# state). When the instance is ready, it runs the job (alloc state). After the job finishes, the node starts an idle timer. If the idle time exceeds the threshold (1 minute by default), the system releases the instance and returns the node to the idle~ state.

Node state identifiers and transitions:

| Symbol | State | Meaning |
| --- | --- | --- |
| ~ | POWERED_DOWN | The node is powered off. No ECS resources are allocated. The node can accept job scheduling. |
| # | POWERING_UP | The system is creating an ECS instance and waiting for it to become ready. |
| (no suffix) | POWER_UP | The ECS instance is ready. The node is running normally. |
| ! | POWER_DOWN | The node has received a power-down signal and begins the release process. This state lasts for a very short time. |
| % | POWERING_DOWN | The system is releasing the ECS instance. |

Typical transition path: idle~ → alloc# → alloc → idle → idle! → idle% → idle~
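For example, while one node is powering up for a newly submitted job, sinfo might report a mix of states like the following (illustrative output; the node names match the testq queue configured later in this topic):

    PARTITION AVAIL  TIMELIMIT  NODES  STATE   NODELIST
    testq*       up   infinite      1  alloc#  testq001
    testq*       up   infinite     19  idle~   testq[002-020]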

For more information, see the official Slurm Power Saving documentation.
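The mechanism is standard Slurm power saving. On a managed cluster, E-HPC maintains this configuration for you; the following slurm.conf parameters are shown only to sketch how the mechanism maps to the behavior above, and the values and hook paths are illustrative rather than the cluster's actual settings:

SuspendTime=60                       # Idle seconds before power-down (the 1-minute threshold)
SuspendProgram=/opt/ehpc/suspend.sh  # Hook that releases the ECS instance (hypothetical path)
ResumeProgram=/opt/ehpc/resume.sh    # Hook that creates the ECS instance (hypothetical path)
ResumeTimeout=600                    # Seconds to wait for a node to become ready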

Create a managed Slurm cluster

Cluster creation takes about 10 to 15 minutes. Configure the settings listed below, and fill in other parameters as needed.

  1. Create a Standard Edition cluster with the following settings.

    | Configuration item | Example value |
    | --- | --- |
    | Series | Managed |
    | Deployment mode | Public cloud cluster |
    | Cluster type | SLURM |
    | Maximum number of nodes in the cluster | 1000 (set based on your peak workload) |
    | Number of nodes in the queue | Skip this step when creating the cluster. Configure the PowerSaving queue separately after the cluster is ready. |
    | Shared file storage | By default, mount /home and /opt as shared storage directories. |
    | Logon node | Instance type: ecs.c7.large (2 vCPU / 4 GiB). Image: CentOS 7.6 64-bit. Enable an EIP as needed. |

  2. Go to the Cluster Details page.

    1. Log on to the E-HPC console.

    2. In the top navigation bar, select a region.

    3. In the left-side navigation pane, click Cluster.

    4. On the Cluster List page, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, click User Management.

  4. You must submit jobs as a non-root user. On the User Management page, click Add user.

  5. In the dialog box that appears, configure user information. Use usertest as an example.

    • Username: usertest.

    • User permission: Select Ordinary permission group. This group is suitable for users who submit and debug jobs.

    • Password and Confirm password: Set the logon password.
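After the user is created, you can confirm the account from the logon node. This is only a quick sanity check; usertest matches the example above:

id usertest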

Configure a PowerSaving queue

After the cluster is ready, configure a PowerSaving queue to define elastic scaling behavior.

  1. On the Cluster List page, click the name of your target cluster.

  2. In the navigation pane on the left, choose Nodes and queues > Queues.

  3. Click Create queue and fill in the parameters using the following settings.

Basic settings

| Configuration item | Example value and description |
| --- | --- |
| Queue name | testq |
| Queue auto scaling | Enable. After enabling, the queue automatically creates and releases nodes based on job load. |
| Number of nodes in the queue | 0–20. The system pre-registers 20 virtual nodes and assigns hostnames and IP addresses. Elastic scaling selects nodes from this pool. |

Node configuration

| Configuration item | Example value and description |
| --- | --- |
| Node interconnection | VPC network |
| Virtual switch | Select the vSwitch for the queue (for example, vsw-bp*******************). The system uses this vSwitch to determine the zone and pre-create nodes. |
| Instance type group | General-purpose g7 / ecs.g7.large / CentOS 7.6 64-bit / 40 GiB ESSD PL0 system disk. If you specify multiple types, only the first one applies. Choose available types based on inventory in your zone. |

Auto scaling

| Configuration item | Example value and description |
| --- | --- |
| Scaling policy | Capacity-first policy. The system prioritizes zones in order and tries to meet capacity requirements. |
| Hostname prefix | testq. The system appends ordinal numbers to this prefix to generate hostnames (for example, testq001, testq002). |

Submit a job and verify auto scaling

After you configure the queue, submit a CPU stress test job to verify auto scaling.

Connect to the cluster and check node status

  1. On the Cluster List page, click the name of your target cluster.

  2. In the options bar in the upper-right corner, select Remote connection.

  3. After you open the Workbench page, run the sinfo command. Confirm that nodes testq[001-020] are in the idle~ state, as shown in the sample output below.
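    The output should look similar to the following (exact formatting varies with the Slurm version):

    PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    testq*       up   infinite     20  idle~  testq[001-020]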

Install test tools and submit a job

  1. Install the stress tool in the /opt shared directory so all compute nodes can access it.

cd /tmp
wget https://fossies.org/linux/privat/stress-1.0.7.tar.gz
tar -xzf stress-1.0.7.tar.gz
cd stress-1.0.7/
./configure --prefix=/opt/stress
make && make install
  2. Confirm the installation succeeded.

/opt/stress/bin/stress --version
  3. Switch to the usertest user and go to the home directory.

su usertest
cd ~
  4. Create a job script named cpu_stress.slurm. Request 1 node and 2 CPU cores from the testq queue. Run the stress test for 10 minutes.

cat > cpu_stress.slurm << EOF
#!/bin/bash
#SBATCH --job-name=cpu_stress_test
#SBATCH --partition=testq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:10:00
#SBATCH --output=cpu_stress_%j.out
#SBATCH --error=cpu_stress_%j.err

echo "=== CPU Stress Test ==="
echo "Job ID: \$SLURM_JOB_ID"
echo "Node: \$(hostname)"
echo "Requested CPUs: \$SLURM_CPUS_PER_TASK"
echo "Start time: \$(date)"

# Use the Slurm CPU allocation when set; otherwise fall back to all visible cores
NUM_CPUS=\${SLURM_CPUS_PER_TASK:-\$(nproc)}

if command -v /opt/stress/bin/stress >/dev/null 2>&1; then
    # stress takes --timeout in seconds; Slurm does not export SLURM_TIME_LIMIT
    # by default, so the 600-second (10-minute) fallback normally applies
    /opt/stress/bin/stress --cpu \$NUM_CPUS --timeout \${SLURM_TIME_LIMIT:-600}
else
    echo "ERROR: stress is not installed"
    exit 1
fi

echo "End time: \$(date)"
echo "=== CPU Stress Test Completed ==="
EOF
  5. Submit the job to the testq queue.

sbatch -p testq cpu_stress.slurm
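
sbatch prints the ID of the submitted job, for example:

Submitted batch job 1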

Observe auto scale-out and scale-in

After you submit the job, run sinfo and squeue to watch node state changes.
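To follow both views as they refresh, you can run a watch loop (an illustrative helper; watch is included in CentOS):

watch -n 10 'sinfo -p testq; echo; squeue -p testq'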

  1. Scale-out phase: sinfo shows the node in the alloc# state, meaning the ECS instance is being created. In the console, the corresponding testq001 node shows as initializing.

  2. Running phase: The node changes to the alloc state. squeue shows the job in the R (Running) state. In the console, the CPU usage for testq001 is high.

  3. Scale-in phase: After the job finishes, the node is automatically released if it stays idle for more than 1 minute. Job output is saved in cpu_stress_<job_id>.out.

    === CPU Stress Test ===
    Job ID: 1
    Node: testq001
    Requested CPUs: 2
    Start time: Tue Feb 10 19:30:24 CST 2026
    stress: info: [14051] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd

    After the scale-in completes, testq001 returns to the idle~ state and the ECS instance is released.
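To review the finished job afterward, you can query Slurm's accounting records. This assumes job ID 1 from the sample output above and that job accounting is enabled on the cluster:

sacct -j 1 --format=JobID,JobName,Partition,State,Elapsed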

Limits

  • Region limits: Supported only in Hangzhou, Beijing, Heyuan, and Shanghai.

  • Maximum nodes per cluster: 1000.

  • Each queue supports only a single zone and a single, homogeneous instance type.

  • Queues do not support exception nodes, hostname prefix changes, hyper-threading, or eRDMA.

  • Maximum nodes per queue: 500.

  • Manual scale-out and scale-in are not supported.

  • srun and MPI jobs are not supported. Only sbatch batch jobs are supported.