A managed PowerSaving Slurm cluster pre-registers virtual nodes for each queue. It automatically creates ECS instances when you submit jobs and automatically releases them after jobs finish. You pay for compute resources only while jobs run.
Use cases
HPC workloads often exhibit clear peaks and valleys. You need large amounts of computing power during batch job submissions but require almost no resources during idle periods. Static clusters continue running nodes even when idle, incurring ongoing costs. PowerSaving mode ties compute resources to the job lifecycle. The scheduler pre-registers a set of virtual nodes. It automatically creates ECS instances when you submit jobs and automatically releases them after an idle timeout. You pay for compute resources only while jobs run.
How PowerSaving works
The Slurm scheduler pre-registers a group of virtual nodes (in the idle~ state) for each queue. These nodes appear in the scheduler but do not consume ECS resources. After you submit a job, the scheduler creates an ECS instance (alloc# state). When the instance is ready, it runs the job (alloc state). After the job finishes, the node starts an idle timer. If the idle time exceeds the threshold (1 minute by default), the system releases the instance and returns the node to the idle~ state.
Node state identifiers and transitions:
Symbol | State | Meaning |
~ | POWERED_DOWN | The node is powered off. No ECS resources are allocated. The node can accept job scheduling. |
# | POWERING_UP | The system is creating an ECS instance and waiting for it to become ready. |
(no suffix) | POWER_UP | The ECS instance is ready. The node is running normally. |
! | POWER_DOWN | The node receives a power-down signal and begins the release process. This state lasts for a very short time. |
% | POWERING_DOWN | The system is releasing the ECS instance. |
Typical transition path: idle~ → alloc# → alloc → idle → idle! → idle% → idle~
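The suffix-to-state mapping in the table above can be sketched as a small helper, for example when inspecting `sinfo` output programmatically. This is an illustrative sketch; the function name and the assumption that a node state string ends with at most one suffix character are not part of the product.

```python
# Map the sinfo state suffix from the table above to its power state.
# Assumption: states look like "idle~", "alloc#", "idle!", "idle%",
# or have no suffix when the node is powered up and running normally.
SUFFIX_TO_POWER_STATE = {
    "~": "POWERED_DOWN",   # no ECS resources allocated; can accept jobs
    "#": "POWERING_UP",    # ECS instance is being created
    "!": "POWER_DOWN",     # power-down signal received; very short-lived
    "%": "POWERING_DOWN",  # ECS instance is being released
}

def parse_node_state(state):
    """Split a sinfo state string into (base state, power state)."""
    suffix = state[-1]
    if suffix in SUFFIX_TO_POWER_STATE:
        return state[:-1], SUFFIX_TO_POWER_STATE[suffix]
    return state, "POWER_UP"  # no suffix: instance ready, node running

print(parse_node_state("idle~"))   # ('idle', 'POWERED_DOWN')
print(parse_node_state("alloc#"))  # ('alloc', 'POWERING_UP')
print(parse_node_state("alloc"))   # ('alloc', 'POWER_UP')
```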
For more information, see the official Slurm Power Saving documentation.
Create a managed Slurm cluster
Create a Standard Edition cluster with the following settings. Cluster creation takes about 10 to 15 minutes. Fill in the other parameters as needed.
Configuration item | Example value |
Series | Managed |
Deployment mode | Public cloud cluster |
Cluster type | SLURM |
Maximum number of nodes in the cluster | 1000 (set based on your peak workload) |
Number of nodes in the queue | Skip this step when creating the cluster. Configure the PowerSaving queue separately after the cluster is ready. |
Shared file storage | By default, mount /home and /opt as shared storage directories. |
Logon node | Instance type: ecs.c7.large (2 vCPU / 4 GiB). Image: CentOS 7.6 64-bit. Enable EIP as needed. |
Go to the Cluster Details page.
Log on to the E-HPC console.
In the left part of the top navigation bar, select a region.
In the left-side navigation pane, click Cluster.
On the Cluster List page, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, click User Management.
You must submit jobs as a non-root user. On the User Management page, click Add User.
In the dialog box that appears, configure user information. Use usertest as an example.
Username: usertest.
User permission: Select Ordinary permission group. This group is suitable for users who submit and debug jobs.
Password and Confirm password: Set the logon password.
Configure a PowerSaving queue
After the cluster is ready, configure a PowerSaving queue to define elastic scaling behavior.
On the Clusters list page, click the name of your target cluster.
In the navigation pane on the left, choose .
Click Create queue and fill in the parameters using the following settings.
Basic settings
Configuration item | Example value and description |
Queue name | testq |
Queue auto scaling | Enable. After enabling, the queue automatically creates and releases nodes based on job load. |
Number of nodes in the queue | 0–20. The system pre-registers 20 virtual nodes and assigns hostnames and IP addresses. Elastic scaling selects nodes from this pool. |
Node configuration
Configuration item | Example value and description |
Node interconnection | VPC network |
Virtual switch | Select the vSwitch for the queue (for example, vsw-bp*******************). The system uses this to determine the zone and pre-create nodes. |
Instance type group | General-purpose g7 / ecs.g7.large / CentOS 7.6 64-bit / 40 GiB ESSD PL0 system disk. If you specify multiple types, only the first one applies. Choose available types based on inventory in your zone. |
Auto scaling
Configuration item | Example value and description |
Scaling policy | Capacity-first policy. The system prioritizes zones in order and tries to meet capacity requirements. |
Hostname prefix | testq. The system generates node hostnames by appending ordinal numbers to this prefix (for example, testq001, testq002). |
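As an illustration of how the hostname prefix produces the pre-registered node names, a minimal sketch, assuming three-digit zero-padded ordinals as in the examples above (the helper name is illustrative):

```python
def node_hostnames(prefix, count, width=3):
    """Generate zero-padded node hostnames from a queue's hostname prefix."""
    return [f"{prefix}{i:0{width}d}" for i in range(1, count + 1)]

# A queue named testq with 20 pre-registered virtual nodes:
names = node_hostnames("testq", 20)
print(names[0], names[-1])  # testq001 testq020
```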
Submit a job and verify auto scaling
After you configure the queue, submit a CPU stress test job to verify auto scaling.
Connect to the cluster and check node status
On the Clusters list page, click the name of your target cluster.
In the options bar in the upper-right corner, select Remote connection.
After you open the Workbench page, run the `sinfo` command. Confirm that nodes testq[001-020] are in the idle~ state.
Install test tools and submit a job
Install the stress tool in the /opt shared directory so all compute nodes can access it.

```shell
cd /tmp
wget https://fossies.org/linux/privat/stress-1.0.7.tar.gz
tar -xzf stress-1.0.7.tar.gz
cd stress-1.0.7/
./configure --prefix=/opt/stress
make && make install
```

Confirm that the installation succeeded.

```shell
/opt/stress/bin/stress --version
```

Switch to the usertest user and go to the home directory.

```shell
su usertest
cd ~
```

Create a job script named cpu_stress.slurm. The script requests 1 node and 2 CPU cores from the testq queue and runs the stress test for 10 minutes.
```shell
cat > cpu_stress.slurm << EOF
#!/bin/bash
#SBATCH --job-name=cpu_stress_test
#SBATCH --partition=testq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:10:00
#SBATCH --output=cpu_stress_%j.out
#SBATCH --error=cpu_stress_%j.err

echo "=== CPU Stress Test ==="
echo "Job ID: \$SLURM_JOB_ID"
echo "Node: \$(hostname)"
echo "Requested CPUs: \$SLURM_CPUS_PER_TASK"
echo "Start time: \$(date)"

NUM_CPUS=\${SLURM_CPUS_PER_TASK:-\$(nproc)}
if command -v /opt/stress/bin/stress >/dev/null 2>&1; then
    # SLURM_TIME_LIMIT may be unset in the job environment;
    # fall back to 600 seconds (the 00:10:00 time limit).
    /opt/stress/bin/stress --cpu \$NUM_CPUS --timeout \${SLURM_TIME_LIMIT:-600}
else
    echo "ERROR: stress is not installed"
    exit 1
fi

echo "End time: \$(date)"
echo "=== CPU Stress Test Completed ==="
EOF
```

Submit the job to the testq queue.
```shell
sbatch -p testq cpu_stress.slurm
```

Observe auto scale-out and scale-in
After you submit the job, run `sinfo` and `squeue` to watch node state changes.
Scale-out phase: `sinfo` shows the node in the alloc# state, which means the ECS instance is being created. In the console, the corresponding testq001 node shows as initializing.

Running phase: The node changes to the alloc state, and `squeue` shows the job in the R (Running) state. In the console, the CPU usage of testq001 is high.

Scale-in phase: After the job finishes and the node stays idle for more than 1 minute, the system automatically releases it. Job output is saved in cpu_stress_<job_id>.out:

```
=== CPU Stress Test ===
Job ID: 1
Node: testq001
Requested CPUs: 2
Start time: Tue Feb 10 19:30:24 CST 2026
Using stress (fallback)
stress: info: [14051] dispatching hogs: 2 cpu, 0 io, 0 vm, 0 hdd
```

After the node is released, testq001 returns to the idle~ state and its ECS instance no longer exists.
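To watch the scale-out and scale-in phases from a script rather than by eye, the per-node output of `sinfo -h -o "%n %t"` can be summarized by power state. This is a sketch: the sample text below is illustrative, and in a real cluster you would feed the function the actual command output (for example via `subprocess`).

```python
from collections import Counter

# Suffix-to-power-state mapping from the node state table earlier in this topic.
SUFFIXES = {"~": "POWERED_DOWN", "#": "POWERING_UP",
            "!": "POWER_DOWN", "%": "POWERING_DOWN"}

def summarize(sinfo_output):
    """Count nodes per power state from `sinfo -h -o "%n %t"` style lines."""
    counts = Counter()
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        counts[SUFFIXES.get(state[-1], "POWER_UP")] += 1
    return dict(counts)

# Illustrative snapshot taken mid scale-out:
sample = """\
testq001 alloc
testq002 alloc#
testq003 idle~
testq004 idle~
"""
print(summarize(sample))  # {'POWER_UP': 1, 'POWERING_UP': 1, 'POWERED_DOWN': 2}
```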
Limits
Region limits: Supported only in Hangzhou, Beijing, Heyuan, and Shanghai.
Maximum nodes per cluster: 1000.
Queues support only a single zone and a single, homogeneous instance type.
Queues do not support exception nodes, hostname prefix changes, hyper-threading, or eRDMA.
Maximum nodes per queue: 500.
Manual scale-out and scale-in are not supported.
srun and MPI jobs are not supported. Only sbatch batch jobs are supported.