Traditional distributed training jobs fix the worker count at submission time. Elastic Horovod removes this constraint: it lets you scale workers up or down during a running job without restarting it or restoring from a checkpoint. Use this feature when:
- Your cluster includes preemptible instances that may be reclaimed at any time
- You want to expand into idle GPU capacity as it becomes available
- You need to reduce training costs by releasing underused workers mid-job
Prerequisites
Before you begin, ensure that you have:
- The cloud-native AI suite deployed in your ACK cluster, with Elastic Training and Arena selected during deployment. For details, see Deploy the cloud-native AI suite.
- A training script that uses Horovod as the distributed training framework.
- The Arena client installed. For details, see Configure the Arena client.
Submit an elastic training job
Run the following command to submit an elastic training job:
```
arena submit etjob \
    --name=elastic-training \
    --gpus=1 \
    --workers=3 \
    --max-workers=9 \
    --min-workers=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
    --working-dir=/examples \
    "horovodrun \
    -np \$((\${workers}*\${gpus})) \
    --min-np \$((\${minWorkers}*\${gpus})) \
    --max-np \$((\${maxWorkers}*\${gpus})) \
    --host-discovery-script /etc/edl/discover_hosts.sh \
    python /examples/elastic/tensorflow2_mnist_elastic.py
    "
```
The horovodrun wrapper manages the elastic training process. Arena writes the --workers, --gpus, --min-workers, and --max-workers values to the environment variables workers, gpus, minWorkers, and maxWorkers in the container, and the escaped shell expressions expand them into the -np, --min-np, and --max-np flags when the command runs.
The host discovery script at /etc/edl/discover_hosts.sh is created by the et-operator component. horovodrun invokes it periodically to learn which hosts are currently available; the script prints one host per line, optionally suffixed with the number of slots on that host.
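For reference, horovodrun can only resize jobs whose training script uses Horovod's elastic API. Below is a minimal sketch of that pattern, loosely modeled on the tensorflow2_mnist_elastic.py example used above; model, optimizer, num_batches, and train_one_batch are placeholders for your own objects, not part of the example:

```python
import horovod.tensorflow as hvd

hvd.init()

# The State object holds everything that must survive a change in the
# worker set; Horovod synchronizes it to workers that join later.
# model and optimizer are defined elsewhere (placeholders).
state = hvd.elastic.TensorFlowKerasState(model, optimizer, batch=0)

@hvd.elastic.run
def train(state):
    # After any scale event, the loop resumes from the last commit.
    for state.batch in range(state.batch, num_batches):
        train_one_batch(model, optimizer, state.batch)  # placeholder step
        state.commit()  # snapshot that workers can roll back to

train(state)
```

The @hvd.elastic.run decorator is what lets horovodrun pause, resynchronize, and resume the function as workers join or leave.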
| Parameter | Description |
|---|---|
| --name | Name of the training job. Must be globally unique. |
| --gpus | Number of GPUs allocated to each worker. |
| --workers | Initial number of workers for the training job. |
| --max-workers | Maximum number of workers for the training job. |
| --min-workers | Minimum number of workers for the training job. |
| --image | Container image used to run the job. |
| --working-dir | Directory in which the command runs inside the container. |
| -np | Number of processes to launch. Computed from ${workers}*${gpus}. |
| --max-np | Maximum number of processes. Computed from ${maxWorkers}*${gpus}. |
| --min-np | Minimum number of processes. Computed from ${minWorkers}*${gpus}. |
| --host-discovery-script | Path to the host discovery script. The et-operator component creates this script at /etc/edl/discover_hosts.sh. |
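With the values above, Arena exports workers=3, gpus=1, minWorkers=1, and maxWorkers=9 into the launcher container, so the quoted command expands to `horovodrun -np 3 --min-np 1 --max-np 9 ...`: training starts with 3 processes, and horovodrun can continue with as few as 1 or as many as 9 as hosts come and go.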
The expected output is similar to:
```
configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
```
Verify the running job
Run the following command to check job status:
```
arena get elastic-training
```
The expected output is similar to:
```
Name:       elastic-training
Status:     RUNNING
Namespace:  default
Priority:   N/A
Trainer:    ETJOB
Duration:   13s

Instances:
  NAME                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                       ------   ---  --------  --------------  ----
  elastic-training-launcher  Running  13s  true      0               cn-huhehaote.192.168.0.173
  elastic-training-worker-0  Running  13s  false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-1  Running  13s  false     1               cn-huhehaote.192.168.0.174
```
To check training progress, view the latest log lines (the [0] and [1] prefixes are Horovod rank indices):
```
arena logs elastic-training --tail 10
```
The expected output is similar to:
```
[0]<stdout>:Step #340 Loss: 0.047924
[1]<stdout>:Step #340 Loss: 0.116303
[0]<stdout>:Step #350 Loss: 0.068762
[1]<stdout>:Step #350 Loss: 0.040847
[0]<stdout>:Step #360 Loss: 0.057501
[1]<stdout>:Step #360 Loss: 0.111952
[0]<stdout>:Step #370 Loss: 0.085895
[1]<stdout>:Step #370 Loss: 0.075529
[0]<stdout>:Step #380 Loss: 0.063450
[1]<stdout>:Step #380 Loss: 0.054253
```
Scale out workers
Run the following command to add workers to the running job:
```
arena scaleout etjob --name="elastic-training" --count=1 --timeout=10m
```
| Parameter | Description |
|---|---|
| --name | Name of the training job to scale. |
| --count | Number of workers to add. |
| --timeout | Timeout period of the scale-out operation. If the workers are not created before the timeout ends, the scheduler rolls back the scale-out operation. |
The expected output is similar to:
```
configmap/elastic-training-1609914643-scaleout created
configmap/elastic-training-1609914643-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1609914643 created
INFO[0003] The scaleout job elastic-training-1609914643 has been submitted successfully
```
To confirm the new worker is running, check the job status:
```
arena get elastic-training
```
The expected output is similar to:
```
Name:       elastic-training
Status:     RUNNING
Namespace:  default
Priority:   N/A
Trainer:    ETJOB
Duration:   3m

Instances:
  NAME                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                       ------   ---  --------  --------------  ----
  elastic-training-launcher  Running  3m   true      0               cn-huhehaote.192.168.0.173
  elastic-training-worker-0  Running  3m   false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-1  Running  3m   false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-2  Running  1m   false     1               cn-huhehaote.192.168.0.173
```
elastic-training-worker-2 is now active. To confirm that all three workers (ranks [0], [1], and [2]) appear in the output, view the latest log lines:
```
arena logs elastic-training --tail 10
```
The expected output is similar to:
```
[1]<stdout>:Step #1670 Loss: 0.131210
[2]<stdout>:Step #1680 Loss: 0.020876
[0]<stdout>:Step #1680 Loss: 0.030605
[1]<stdout>:Step #1680 Loss: 0.074515
[2]<stdout>:Step #1690 Loss: 0.029105
[0]<stdout>:Step #1690 Loss: 0.015216
[1]<stdout>:Step #1690 Loss: 0.022670
[0]<stdout>:Step #1700 Loss: 0.105407
[1]<stdout>:Step #1700 Loss: 0.037623
[2]<stdout>:Step #1700 Loss: 0.032874
```
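Behind the scenes, horovodrun pauses the decorated training function when the worker set changes, rebuilds the communication ring, synchronizes the State object to the new worker, and resumes the loop. Hyperparameters that depend on the world size can be adjusted in a reset callback. Here is a hedged sketch in the style of the Horovod elastic examples; optimizer, base_lr, and state come from the training script and are placeholders here:

```python
import horovod.tensorflow as hvd

def on_state_reset():
    # Runs each time Horovod re-forms the ring after a scale event.
    # Rescale size-dependent hyperparameters here, e.g. a learning
    # rate that scales linearly with the current worker count.
    optimizer.lr.assign(base_lr * hvd.size())  # optimizer, base_lr: placeholders

state.register_reset_callbacks([on_state_reset])
```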
Scale in workers
Run the following command to remove workers from the running job:
```
arena scalein etjob --name="elastic-training" --count=1 --timeout=10m
```
| Parameter | Description |
|---|---|
| --name | Name of the training job to scale. |
| --count | Number of workers to remove. |
| --timeout | Timeout period of the scale-in operation. |
The expected output is similar to:
```
configmap/elastic-training-1609914720-scalein created
configmap/elastic-training-1609914720-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1609914720 created
INFO[0002] The scalein job elastic-training-1609914720 has been submitted successfully
```
To confirm the worker was removed, check the job status:
```
arena get elastic-training
```
The expected output is similar to:
```
Name:       elastic-training
Status:     RUNNING
Namespace:  default
Priority:   N/A
Trainer:    ETJOB
Duration:   3m

Instances:
  NAME                       STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                       ------   ---  --------  --------------  ----
  elastic-training-launcher  Running  3m   true      0               cn-huhehaote.192.168.0.173
  elastic-training-worker-0  Running  3m   false     1               cn-huhehaote.192.168.0.174
  elastic-training-worker-1  Running  3m   false     1               cn-huhehaote.192.168.0.174
```
elastic-training-worker-2 is no longer listed. To confirm that only two workers (ranks [0] and [1]) are producing output, view the latest log lines:
```
arena logs elastic-training --tail 10
```
The expected output is similar to:
```
[1]<stdout>:Step #2180 Loss: 0.001739
[0]<stdout>:Step #2180 Loss: 0.004853
[0]<stdout>:Step #2190 Loss: 0.000846
[1]<stdout>:Step #2190 Loss: 0.007900
[0]<stdout>:Step #2200 Loss: 0.039376
[1]<stdout>:Step #2200 Loss: 0.024672
[0]<stdout>:Step #2210 Loss: 0.012985
[1]<stdout>:Step #2210 Loss: 0.010956
[0]<stdout>:Step #2220 Loss: 0.009604
[1]<stdout>:Step #2220 Loss: 0.002531
```
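One practical note on lost work: state.commit() marks the snapshot a job rolls back to if a worker disappears without warning, for example when a preemptible instance is reclaimed. A coordinated scale-in like the one above is announced through the host discovery script, but an abrupt loss replays everything since the last commit, so the commit interval trades synchronization overhead against replayed work. A minimal sketch, with commit_every as a hypothetical tuning knob:

```python
# Inside the @hvd.elastic.run-decorated training function:
for state.batch in range(state.batch, num_batches):
    train_one_batch(model, optimizer, state.batch)  # placeholder step
    if state.batch % commit_every == 0:
        # Committing every batch minimizes replayed work but adds a
        # synchronization barrier; longer intervals amortize the cost.
        state.commit()
```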