Container Service for Kubernetes: Using Horovod for Elastic Training on Kubernetes

Last Updated: Mar 26, 2026

Traditional distributed training jobs fix the worker count at submission time. Elastic Horovod removes this constraint: it lets you scale workers up or down during a running job without restarting it or restoring from a checkpoint. Use this feature when:

  • Your cluster includes preemptible instances that may be reclaimed at any time

  • You want to expand into idle GPU capacity as it becomes available

  • You need to reduce training costs by releasing underused workers mid-job

Prerequisites

Before you begin, ensure that you have:

  • The cloud-native AI suite deployed in your ACK cluster, with Elastic Training and Arena selected during deployment. For details, see Deploy the cloud-native AI suite.

  • A training script that uses Horovod's elastic mode as its distributed training framework (a minimal sketch of such a script follows this list)

  • The Arena client installed. For details, see Configure the Arena client.
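
The training script must use Horovod's elastic API so that workers can join or leave without restarting the job. The following is a minimal sketch of such a script in TensorFlow 2, loosely modeled on Horovod's elastic examples; the actual /examples/elastic/tensorflow2_mnist_elastic.py script shipped in the sample image may differ in detail.

# Minimal sketch of an elastic Horovod training script (TensorFlow 2).
# Illustrative only; the packaged tensorflow2_mnist_elastic.py may differ.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = (tf.data.Dataset.from_tensor_slices(
               (tf.cast(x_train[..., tf.newaxis] / 255.0, tf.float32),
                tf.cast(y_train, tf.int64)))
           .repeat().shuffle(10000).batch(128))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.optimizers.Adam(0.001 * hvd.size())

@tf.function
def train_step(images, labels, allreduce=True):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    if allreduce:
        # Average gradients across all current workers.
        tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# hvd.elastic.run re-synchronizes the state and restarts this function
# whenever workers are added or removed.
@hvd.elastic.run
def train(state):
    for batch, (images, labels) in enumerate(dataset.skip(state.batch).take(10000)):
        loss = train_step(images, labels)
        if batch % 10 == 0 and hvd.rank() == 0:
            print('Step #%d\tLoss: %.6f' % (batch, loss))
        state.batch = batch
        if batch % 100 == 0:
            state.commit()  # persist state so a lost worker cannot corrupt it

# Run one step first (without allreduce) so the model and optimizer
# variables exist before they are captured in the elastic state.
images, labels = next(iter(dataset))
train_step(images, labels, allreduce=False)

state = hvd.elastic.TensorFlowKerasState(model, optimizer, batch=0)
train(state)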

Submit an elastic training job

Run the following command to submit an elastic training job:

arena submit etjob \
    --name=elastic-training \
    --gpus=1 \
    --workers=3 \
    --max-workers=9 \
    --min-workers=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
    --working-dir=/examples \
    "horovodrun \
    -np \$((\${workers}*\${gpus})) \
    --min-np \$((\${minWorkers}*\${gpus})) \
    --max-np \$((\${maxWorkers}*\${gpus})) \
    --host-discovery-script /etc/edl/discover_hosts.sh \
    python /examples/elastic/tensorflow2_mnist_elastic.py
    "

The horovodrun wrapper manages the elastic training process. Arena exposes the values of --workers, --min-workers, --max-workers, and --gpus to the container as the environment variables ${workers}, ${minWorkers}, ${maxWorkers}, and ${gpus}, which the command uses to compute the -np, --min-np, and --max-np values.
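
For example, with the values used above (3 initial workers, 1 GPU each, a minimum of 1 worker, and a maximum of 9 workers), the command started in the launcher resolves to something like:

horovodrun -np 3 --min-np 1 --max-np 9 \
    --host-discovery-script /etc/edl/discover_hosts.sh \
    python /examples/elastic/tensorflow2_mnist_elastic.py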

The host discovery script at /etc/edl/discover_hosts.sh is created by the et-operator component.
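
Horovod periodically runs this script to discover the current set of hosts; the script prints one <hostname>:<slots> line per available worker. With the two initial workers in this example, its output would look similar to the following (the exact host names depend on the generated pod names):

elastic-training-worker-0:1
elastic-training-worker-1:1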

Parameter                 Description
---------                 -----------
--name                    Name of the training job. Must be globally unique.
--gpus                    Number of GPUs allocated to each worker.
--workers                 Initial number of workers for the training job.
--max-workers             Maximum number of workers for the training job.
--min-workers             Minimum number of workers for the training job.
--image                   Container image used to run the job.
--working-dir             Directory in which the command runs inside the container.
-np                       Number of processes to start, computed as ${workers}*${gpus}.
--max-np                  Maximum number of processes, computed as ${maxWorkers}*${gpus}.
--min-np                  Minimum number of processes, computed as ${minWorkers}*${gpus}.
--host-discovery-script   Path to the host discovery script. The et-operator component creates this script at /etc/edl/discover_hosts.sh.

The expected output is similar to:

configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status

Verify the running job

Run the following command to check job status:

arena get elastic-training

The expected output is similar to:

Name:        elastic-training
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     ETJOB
Duration:    13s

Instances:
  NAME                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                       ------   ---  --------  --------------  ----
  elastic-training-launcher  Running  13s  true      0               cn-huhehaote.192.168.0.173
  elastic-training-worker-0  Running  13s  false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-1  Running  13s  false     1               cn-huhehaote.192.168.0.174

To check training progress, view the latest log lines:

arena logs elastic-training --tail 10

The expected output is similar to:

[0]<stdout>:Step #340    Loss: 0.047924
[1]<stdout>:Step #340    Loss: 0.116303
[0]<stdout>:Step #350    Loss: 0.068762
[1]<stdout>:Step #350    Loss: 0.040847
[0]<stdout>:Step #360    Loss: 0.057501
[1]<stdout>:Step #360    Loss: 0.111952
[0]<stdout>:Step #370    Loss: 0.085895
[1]<stdout>:Step #370    Loss: 0.075529
[0]<stdout>:Step #380    Loss: 0.063450
[1]<stdout>:Step #380    Loss: 0.054253

Scale out workers

Run the following command to add workers to the running job:

arena scaleout etjob --name="elastic-training" --count=1 --timeout=10m

Parameter    Description
---------    -----------
--name       Name of the training job to scale.
--count      Number of workers to add.
--timeout    Timeout period of the scale-out operation. If workers are not created before the timeout period ends, the scheduler rolls back the scale-out operation.

The expected output is similar to:

configmap/elastic-training-1609914643-scaleout created
configmap/elastic-training-1609914643-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1609914643 created
INFO[0003] The scaleout job elastic-training-1609914643 has been submitted successfully
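
As the output shows, the scale-out request is recorded as a scaleout.kai.alibabacloud.com custom resource. If you have kubectl access to the cluster, you can also list these objects directly, for example:

kubectl get scaleout.kai.alibabacloud.com -n default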

To confirm the new worker is running, check the job status:

arena get elastic-training

The expected output is similar to:

Name:        elastic-training
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     ETJOB
Duration:    3m

Instances:
  NAME                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                       ------   ---  --------  --------------  ----
  elastic-training-launcher  Running  3m   true      0               cn-huhehaote.192.168.0.173
  elastic-training-worker-0  Running  3m   false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-1  Running  3m   false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-2  Running  1m   false     1               cn-huhehaote.192.168.0.173

elastic-training-worker-2 is now active. The logs will show output from all three workers (indices [0], [1], and [2]):

arena logs elastic-training --tail 10

The expected output is similar to:

[1]<stdout>:Step #1670    Loss: 0.131210
[2]<stdout>:Step #1680    Loss: 0.020876
[0]<stdout>:Step #1680    Loss: 0.030605
[1]<stdout>:Step #1680    Loss: 0.074515
[2]<stdout>:Step #1690    Loss: 0.029105
[0]<stdout>:Step #1690    Loss: 0.015216
[1]<stdout>:Step #1690    Loss: 0.022670
[0]<stdout>:Step #1700    Loss: 0.105407
[1]<stdout>:Step #1700    Loss: 0.037623
[2]<stdout>:Step #1700    Loss: 0.032874

Scale in workers

Run the following command to remove workers from the running job:

arena scalein etjob --name="elastic-training" --count=1 --timeout=10m

Parameter    Description
---------    -----------
--name       Name of the training job to scale.
--count      Number of workers to remove.
--timeout    Timeout period of the scale-in operation.

The expected output is similar to:

configmap/elastic-training-1609914720-scalein created
configmap/elastic-training-1609914720-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1609914720 created
INFO[0002] The scalein job elastic-training-1609914720 has been submitted successfully
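
Similarly, the scale-in request is recorded as a scalein.kai.alibabacloud.com custom resource. If you have kubectl access, you can also confirm at the pod level that the removed worker is gone, for example:

kubectl get pods -n default | grep elastic-training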

To confirm the worker was removed, check the job status:

arena get elastic-training

The expected output is similar to:

Name:        elastic-training
Status:      RUNNING
Namespace:   default
Priority:    N/A
Trainer:     ETJOB
Duration:    3m

Instances:
  NAME                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                       ------   ---  --------  --------------  ----
  elastic-training-launcher  Running  3m   true      0               cn-huhehaote.192.168.0.173
  elastic-training-worker-0  Running  3m   false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-1  Running  3m   false     1               cn-huhehaote.192.168.0.174

elastic-training-worker-2 is no longer listed. The logs will show output from two workers only:

arena logs elastic-training --tail 10

The expected output is similar to:

[1]<stdout>:Step #2180    Loss: 0.001739
[0]<stdout>:Step #2180    Loss: 0.004853
[0]<stdout>:Step #2190    Loss: 0.000846
[1]<stdout>:Step #2190    Loss: 0.007900
[0]<stdout>:Step #2200    Loss: 0.039376
[1]<stdout>:Step #2200    Loss: 0.024672
[0]<stdout>:Step #2210    Loss: 0.012985
[1]<stdout>:Step #2210    Loss: 0.010956
[0]<stdout>:Step #2220    Loss: 0.009604
[1]<stdout>:Step #2220    Loss: 0.002531