Due to the natural advantages of cloud computing in resource cost and elastic scaling, more and more customers are willing to build AI systems on the cloud. Cloud-native technologies such as containers and Kubernetes have become the shortest path to unlocking the value of the cloud, and building AI platforms on Kubernetes in the cloud has become a trend.
When training complex models or processing large volumes of data, the computing power of a single machine cannot meet the requirements. With distributed training frameworks such as Alibaba's AIACC or the community's Horovod, you only need to modify a few lines of code to expand a standalone training task into a distributed one. On Kubernetes, the Kubeflow community's tf-operator supports the TensorFlow PS mode, and mpi-operator supports Horovod's MPI allreduce mode.
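For instance, once a script has been instrumented with Horovod, the same code can be scaled across machines from the command line. A minimal sketch, assuming a Horovod-instrumented train.py and two 4-GPU hosts (hostnames and slot counts are placeholders):

```shell
# Launch 8 training processes over two 4-GPU hosts.
# host1/host2 are placeholder hostnames in "hostname:slots" form.
horovodrun -np 8 -H host1:4,host2:4 python train.py
```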
Kubernetes and cloud computing provide agility and scalability. With components such as the Cluster Autoscaler, we can set elastic policies for training tasks and use the elasticity of Kubernetes to create resources on demand and reduce idle GPU devices. However, this scaling mode falls slightly short for offline tasks such as training:
- Fault tolerance is not supported. When some workers fail because of device errors, the entire task must be stopped and restarted.
- Training tasks generally take a long time, occupy a large amount of computing power, and lack elasticity. When resources are insufficient, they cannot be released for other services on demand unless the task is terminated.
- Training tasks take a long time, do not support dynamic worker configuration, and cannot safely use preemptible instances to maximize cost performance on the cloud.
Attaching elasticity to training tasks is the key to improving their cost performance. Recently, distributed frameworks such as Horovod have gradually added support for Elastic Training, that is, the ability to dynamically scale the training workers out or in while a job runs, without interrupting the training task. Only a few modifications to the code are required; see https://horovod.readthedocs.io/en/stable/elastic_include.html and the Elastic Horovod design document for the implementation principles.
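A minimal sketch of how such an elastic job is launched; the flags mirror the TrainingJob example later in this article, and the script path assumes the Horovod examples image:

```shell
# Start with 2 workers; keep training while the worker count stays
# within [1, 9]. Horovod re-runs the discovery script to pick up
# workers that join or leave, without restarting the job.
horovodrun -np 2 --min-np 1 --max-np 9 \
  --host-discovery-script /etc/edl/discover_hosts.sh \
  python /examples/elastic/tensorflow2_mnist_elastic.py
```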
In mpi-operator, all workers involved in training are designed and maintained as static resources. Supporting the elastic training mode increases the flexibility of tasks, but also brings challenges to the operations layer, such as:
- You must use horovodrun, provided by Horovod, as the entry point. In Horovod, the launcher logs in to the workers through SSH, so the SSH tunnel between the launcher and the workers must be opened.
- The Elastic Driver module, which is responsible for computing elasticity, obtains the latest worker topology through the specified discover_host script and pulls up or stops worker instances accordingly. When the workers change, the return value of the discover_host script must be updated (see the sketch after this list).
- In preemptible-instance or tiered-pricing scenarios, it is sometimes necessary to scale in specified workers. Native Kubernetes orchestration objects such as Deployment and StatefulSet cannot scale in specific pods.
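The contract for the discovery script itself is simple: print the currently available workers, one hostname:slots entry per line, per the Horovod convention. A minimal hand-written sketch with placeholder worker names:

```shell
#!/bin/sh
# Print one "hostname:slots" line per currently available worker.
# Horovod's Elastic Driver re-runs this script to learn the latest
# topology whenever workers join or leave.
echo "worker-0:1"
echo "worker-1:1"
```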
To solve the above problems, we designed and developed et-operator. It provides a TrainingJob CRD to describe the training task, plus ScaleOut and ScaleIn CRDs to describe scale-out and scale-in operations. The combination of these makes our training tasks more flexible.
TrainingJob Controller has the following features:
- Maintain the creation/deletion lifecycle of the TrainingJob and manage its sub-resources.
- Perform scale-out and scale-in operations.
- Fault tolerance: when a worker is evicted, a new worker is created and added to the training.
TrainingJob sub-resources are created in the following order:
- Create the key pair required for the SSH connection, and create a Secret for it.
- Create the workers, including a Service and a Pod for each, and mount the Secret's public key.
- Create a ConfigMap, which contains the discover_host script and the hostfile.
- Create the launcher and mount the ConfigMap. Since the hostfile will be modified as the topology changes, the hostfile is first copied from the ConfigMap to a separate directory by an initContainer.
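After a TrainingJob named elastic-training is submitted, a quick way to observe these sub-resources is to filter by the job name (a rough check; exact resource names may differ between versions):

```shell
# List the SSH secret, hostfile configmap, services, and pods
# generated for the TrainingJob.
kubectl get secret,configmap,svc,po | grep elastic-training
```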
TrainingJob configuration is divided into a Launcher and a Worker. By default, et-operator mounts the discover_host script into the Launcher at /etc/edl/discover_hosts.sh, and this path is passed to horovodrun in the entry command through the --host-discovery-script parameter. In the Worker settings, the allowed range of worker replicas is specified by maxReplicas / minReplicas:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - sh
            - -c
            - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
            image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
    worker:
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
          - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
status:
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
  - elastic-training-worker-2
  - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4
```
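The mounted discovery script reflects the worker topology at any moment. Assuming the reported hostnames match the worker pod names shown in the status above (an assumption, since the script content is generated by et-operator), running it inside the launcher should print one entry per active worker:

```shell
# Execute the generated discovery script inside the launcher pod.
kubectl exec elastic-training-launcher -- sh /etc/edl/discover_hosts.sh
# Expected output (assumed), one "hostname:slots" line per worker:
#   elastic-training-worker-0:1
#   elastic-training-worker-1:1
```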
In addition to the TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs to scale training tasks out and in. When a ScaleOut CR is submitted, the ScaleOutController triggers a Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, it finds the TrainingJob that the scaler targets and sets the TrainingJob as the CR's OwnerReference.
```yaml
- apiVersion: kai.alibabacloud.com/v1alpha1
  kind: ScaleOut
  metadata:
    creationTimestamp: "2020-11-04T13:54:26Z"
    name: scaleout-ptfnk
    namespace: default
    ownerReferences:
    - apiVersion: kai.alibabacloud.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: TrainingJob
      name: elastic-training   # Points to the TrainingJob to be scaled out
      uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
  spec:
    selector:
      name: elastic-training
    toAdd:
      count: 2
```
When the TrainingJobController detects an update to a ScaleOut CR belonging to one of its TrainingJobs, it triggers that TrainingJob's Reconcile. The reconcile traverses and filters the ScaleOut and ScaleIn CRs whose OwnerReference points to the TrainingJob, and decides which scaling operation to execute based on their creation time and status time. While a scaling operation is in progress, the TrainingJob status records the scaler currently being executed:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  # ... Launcher and Worker spec
status:
  currentScaler: ScaleIn:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
```
Deploy et-operator:

```shell
mkdir -p $(go env GOPATH)/src/github.com/aliyunContainerService
cd $(go env GOPATH)/src/github.com/aliyunContainerService
git clone https://github.com/aliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml
```
Check that the CRDs are installed:
```shell
# kubectl get crd
NAME                                CREATED AT
scaleins.kai.alibabacloud.com       2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com      2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com   2020-11-11T11:16:13Z
```
Check the running status of the controller. By default, it is installed in the kube-ai namespace:
```shell
# kubectl -n kube-ai get po
NAME                                              READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s
```
Run the prepared example:
```shell
kubectl apply -f examples/training_job.yaml
```
Check the running status:
```shell
# kubectl get trainingjob
NAME               PHASE     AGE
elastic-training   Running   77s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          7s
elastic-training-worker-0   1/1     Running   0          10s
elastic-training-worker-1   1/1     Running   0          9s
```
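To follow training progress and watch workers join or leave, tail the launcher logs:

```shell
kubectl logs -f elastic-training-launcher
```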
To scale in training task workers, specify the workers to remove through the spec.toDelete.count or spec.toDelete.podNames field of a ScaleIn CR. When count is set, the workers to remove are chosen by index, from high to low:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1
```
If you want to scale in a specific worker, configure podNames instead:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1
```
Run the scale-in example to remove one worker:
```shell
kubectl create -f examples/scale_in_count.yaml
```
Check the scale-in execution status and the training task:
```shell
# kubectl get scalein
NAME                   PHASE            AGE
scalein-sample-t8jxd   ScaleSucceeded   11s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          47s
elastic-training-worker-0   1/1     Running   0          50s
```
To scale out a training task, specify the number of workers to add through the spec.toAdd.count field of a ScaleOut CR:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2
```
```shell
kubectl create -f examples/scale_out.yaml
```
Check the scale-out execution status and the training task:
```shell
# kubectl get scaleout
NAME                              PHASE            AGE
elastic-training-scaleout-9dtmw   ScaleSucceeded   30s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          2m5s
elastic-training-worker-0   1/1     Running   0          2m8s
elastic-training-worker-1   1/1     Running   0          40s
elastic-training-worker-2   1/1     Running   0          40s
```
et-operator provides a set of training and scaling CRDs and controllers that let us easily run elastic distributed training on Kubernetes. It supports the delivery of distributed training tasks and, through integration with the distributed framework, can dynamically scale the workers participating in a training task out or in while the task is running. This allows training tasks to combine elasticity with preemptible instances and make better use of the elastic, cost-effective resources on the cloud.