Elastic Training Operator: Run an Elastic Deep Learning Training Task on Kubernetes

2020-12-17


Background

Due to the natural advantages of cloud computing in resource cost and elastic scaling, more and more customers are willing to build AI systems on the cloud. Cloud-native technologies such as containers and Kubernetes have become the shortest path to unlocking cloud value, and building AI platforms on Kubernetes in the cloud has become a trend.

When a model is complex or the data volume is large, the computing power of a single machine cannot meet the requirements. With Alibaba's AiACC or the community's Horovod and other distributed training frameworks, you only need to modify a few lines of code to extend a standalone training task into a distributed training task. In Kubernetes, the Kubeflow community's tf-operator supports the TensorFlow PS mode, and mpi-operator supports Horovod's MPI allreduce mode.
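
For illustration, a minimal MPIJob that runs Horovod in allreduce mode might look roughly like the sketch below. This is a hedged sketch, assuming the kubeflow.org/v1 MPIJob API from mpi-operator; the image name and script path are placeholders, not values from this article.

# Hypothetical minimal MPIJob for Horovod allreduce training; the API version,
# image, and script path are illustrative assumptions.
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-mnist
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: horovod/horovod:latest   # placeholder image
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - python
            - /examples/tensorflow2_mnist.py   # placeholder script path
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: horovod/horovod:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: "1"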

Current situation

Kubernetes and cloud computing provide agility and scalability. We can use components such as cluster-autoscaler to set elastic policies for training tasks and leverage the elasticity of Kubernetes to create resources on demand, reducing idle GPU devices. However, this scaling mode falls somewhat short for offline tasks such as training:

  • Fault tolerance is not supported. When some workers fail because of device problems, the entire task must be stopped and restarted.
  • Training tasks usually take a long time, occupy a large amount of computing power, and lack elasticity. When resources are insufficient, they cannot be released for other services on demand unless the task is terminated.
  • Because training tasks take a long time and do not support dynamic worker configuration, they cannot safely use preemptible instances to maximize cost effectiveness on the cloud.

Attaching elasticity to training tasks is the key to improving their cost effectiveness. Recently, distributed frameworks such as Horovod have gradually added support for Elastic Training, that is, the ability to dynamically scale the training workers out or in while a training task is running, without interrupting it. Only a few code modifications are required. For more information about how Horovod implements elastic training, see https://horovod.readthedocs.io/en/stable/elastic_include.html and the Elastic Horovod design document.

In mpi-operator, all workers participating in training are designed and maintained as static resources. Supporting the elastic training mode increases the flexibility of tasks, but it also brings challenges to the operations layer.

Solution

To solve the above problems, we designed and developed et-operator. It provides a TrainingJob CRD that describes the training task, and ScaleOut and ScaleIn CRDs that describe scale-out and scale-in operations. Combining these resources makes our training tasks more elastic.

Design

TrainingJob Controller has the following features:

  • Maintain the creation/deletion lifecycle of a TrainingJob and manage its sub-resources.
  • Perform scale-out and scale-in operations.
  • Fault tolerance: when a worker is evicted, a new worker is created and added to the training.

Resource creation

TrainingJob sub-resources are created in the following order:

  • Create the SSH key pair required for connections between pods, and store it in a Secret.
  • Create the workers, including a Service and Pod for each, and mount the Secret's public key.
  • Create a ConfigMap containing the discover_host script and the hostfile (a sketch of this ConfigMap follows the list).
  • Create the launcher and mount the ConfigMap. Because the hostfile is modified as the topology changes, it is copied from the ConfigMap to a separate directory by an initContainer.
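
As an illustration, the generated ConfigMap roughly contains the discover_hosts script and the hostfile. The resource name, key names, file paths, and contents below are assumptions for illustration, not the controller's exact output.

# Hypothetical sketch of the ConfigMap created by et-operator; names, keys,
# paths, and contents are illustrative assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: elastic-training-configmap   # assumed name
  namespace: default
data:
  discover_hosts.sh: |
    #!/bin/sh
    # Print the current worker list so horovodrun can pick up topology changes.
    cat /etc/edl/hostfile   # assumed path of the copied hostfile
  hostfile: |
    elastic-training-worker-0:1
    elastic-training-worker-1:1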

A TrainingJob configuration is divided into Launcher and Worker. By default, et-operator mounts the discover_host script into the Launcher at /etc/edl/discover_hosts.sh, and the entry command passes this path to horovodrun through the --host-discovery-script parameter. In the Worker settings, maxReplicas / minReplicas specify the allowed range for the number of worker replicas.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - sh
            - -c
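            # --min-np/--max-np set the elastic worker range; --host-discovery-script
            # points to the script et-operator mounts at /etc/edl/discover_hosts.sh.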
            - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script
              /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
            image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
    worker:
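      # Worker replicas may vary elastically between minReplicas and maxReplicas.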
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
          - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
status:
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
  - elastic-training-worker-2
  - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4

Worker scale-out/scale-in

In addition to TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs to scale training tasks out and in. When a ScaleOut CR is issued, the ScaleOutController triggers a Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, it finds the TrainingJob targeted by the scaler and sets it as the OwnerReference of the CR.

- apiVersion: kai.alibabacloud.com/v1alpha1
  kind: ScaleOut
  metadata:
    creationTimestamp: "2020-11-04T13:54:26Z"
    name: scaleout-ptfnk
    namespace: default
    ownerReferences:
    - apiVersion: kai.alibabacloud.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: TrainingJob
      name: elastic-training # points to the TrainingJob to be scaled out
      uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
  spec:
    selector:
      name: elastic-training
    toAdd:
      count: 2

The TrainingJobController watches for updates to the ScaleOut and ScaleIn CRs that belong to the TrainingJob and triggers a TrainingJob reconcile, which traverses and filters the ScaleOut and ScaleIn CRs whose OwnerReference points to the TrainingJob. Which scale-out or scale-in to execute is determined by their creation time and status time.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  # ... Launcher and Worker spec
status:
  currentScaler: ScaleIn:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1

Run

Install et-operator

mkdir -p $(go env GOPATH)/src/github.com/aliyunContainerService
cd $(go env GOPATH)/src/github.com/aliyunContainerService
git clone https://github.com/aliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml 

Check the installation of the CRDs:

# kubectl get crd
NAME                                    CREATED AT
scaleins.kai.alibabacloud.com           2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com          2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com       2020-11-11T11:16:13Z

Check the running status of the controller. By default, the controller is installed in the kube-ai namespace.

# kubectl -n kube-ai get po
NAME                                         READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s

Run TrainingJob

Run the prepared example:

kubectl apply -f examples/training_job.yaml

Check the running status:

# kubectl get trainingjob
NAME                          PHASE     AGE
elastic-training              Running   77s

# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          7s
elastic-training-worker-0                 1/1     Running            0          10s
elastic-training-worker-1                 1/1     Running            0          9s

Scale in the training task workers

When performing a scale-in, you can specify the workers to scale in through the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR. When count is used, the workers to remove are selected by index from high to low.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
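    # With count, workers are removed starting from the highest index.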
    count: 1

If you want to scale in specific workers, you can configure podNames:

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1

Run a scale-in example to scale in one worker:

kubectl create -f examples/scale_in_count.yaml

Check the scale-in execution status and the training task:

# kubectl get scalein
NAME                                     PHASE            AGE
scalein-sample-t8jxd                     ScaleSucceeded   11s

# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          47s
elastic-training-worker-0                 1/1     Running            0          50s

Scale out the training task

In the ScaleOut CR, the spec.toAdd.count field specifies the number of workers to scale out.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2

Run the scale-out example:

kubectl create -f examples/scale_out.yaml

Check the scale-out execution status and the training task:

# kubectl get scaleout
NAME                                     PHASE            AGE
elastic-training-scaleout-9dtmw          ScaleSucceeded   30s
# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          2m5s
elastic-training-worker-0                 1/1     Running            0          2m8s
elastic-training-worker-1                 1/1     Running            0          40s
elastic-training-worker-2                 1/1     Running            0          40s

Summary

et-operator provides a set of training and scaling CRDs and controllers that let us easily run elastic distributed training on Kubernetes. It supports distributing training tasks and, working with the distributed framework, dynamically scales the workers participating in a training task out and in while the task is running. This allows you to combine elastic training tasks with preemptible instances to make better use of the elasticity and cost effectiveness of cloud resources.
