Due to the natural advantages of cloud computing in resource cost and elastic scaling, more and more customers are willing to build AI systems on the cloud. Cloud-native technologies such as containers and Kubernetes have become the shortest path to unlocking the value of the cloud, and building AI platforms on Kubernetes in the cloud has become a trend.
When training complex models or processing large volumes of data, the computing power of a single machine cannot meet the requirements. With distributed training frameworks such as Alibaba's AIACC or the community's Horovod, you only need to modify a few lines of code to expand a standalone training task into a distributed one. On Kubernetes, the Kubeflow community's tf-operator supports the TensorFlow PS mode, and mpi-operator supports Horovod's MPI allreduce mode.
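For instance, once a script has been instrumented with Horovod, the same code can be scaled across machines from the command line. A minimal sketch, assuming a Horovod-instrumented train.py and two 4-GPU hosts (hostnames and slot counts are placeholders):

```shell
# Launch 8 training processes over two 4-GPU hosts.
# host1/host2 are placeholder hostnames in "hostname:slots" form.
horovodrun -np 8 -H host1:4,host2:4 python train.py
```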
Kubernetes and cloud computing provide agility and scalability. With components such as the Cluster Autoscaler, we can set elastic policies for training tasks and use the elasticity of Kubernetes to create resources on demand and reduce idle GPU devices. However, this scaling mode falls slightly short for offline tasks such as training:
- Fault tolerance is not supported. When some workers fail because of device errors, the entire task must be stopped and restarted.
- Training tasks generally take a long time, occupy a large amount of computing power, and lack elasticity. When resources are insufficient, they cannot be released for other services on demand unless the task is terminated.
- Training tasks take a long time, do not support dynamic worker configuration, and cannot safely use preemptible instances to maximize cost performance on the cloud.
Attaching elasticity to training tasks is the key to improving their cost performance. Recently, distributed frameworks such as Horovod have gradually added support for Elastic Training, that is, the ability to dynamically scale the training workers out or in while a job runs, without interrupting the training task. Only a few modifications to the code are required; see https://horovod.readthedocs.io/en/stable/elastic_include.html and the Elastic Horovod design document for the implementation principles.
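A minimal sketch of how such an elastic job is launched; the flags mirror the TrainingJob example later in this article, and the script path assumes the Horovod examples image:

```shell
# Start with 2 workers; keep training while the worker count stays
# within [1, 9]. Horovod re-runs the discovery script to pick up
# workers that join or leave, without restarting the job.
horovodrun -np 2 --min-np 1 --max-np 9 \
  --host-discovery-script /etc/edl/discover_hosts.sh \
  python /examples/elastic/tensorflow2_mnist_elastic.py
```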
In mpi-operator, all workers involved in training are designed and maintained as static resources. Supporting the elastic training mode increases the flexibility of tasks, but also brings challenges to the operations layer, such as:
- You must use horovodrun, provided by Horovod, as the entry point. In Horovod, the launcher logs in to the workers through SSH, so the SSH tunnel between the launcher and the workers must be opened.
- The Elastic Driver module, which is responsible for computing elasticity, obtains the latest worker topology through the specified discover_host script and pulls up or stops worker instances accordingly. When the workers change, the return value of the discover_host script must be updated (see the sketch after this list).
- In preemptible-instance or tiered-pricing scenarios, it is sometimes necessary to scale in specified workers. Native Kubernetes orchestration objects such as Deployment and StatefulSet cannot scale in specific pods.
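The contract for the discovery script itself is simple: print the currently available workers, one hostname:slots entry per line, per the Horovod convention. A minimal hand-written sketch with placeholder worker names:

```shell
#!/bin/sh
# Print one "hostname:slots" line per currently available worker.
# Horovod's Elastic Driver re-runs this script to learn the latest
# topology whenever workers join or leave.
echo "worker-0:1"
echo "worker-1:1"
```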
To solve the above problems, we designed and developed et-operator. It provides a TrainingJob CRD to describe the training task, plus ScaleOut and ScaleIn CRDs to describe scale-out and scale-in operations. The combination of these makes our training tasks more flexible.
TrainingJob Controller has the following features:
- Maintain the creation/deletion lifecycle of the TrainingJob and manage its sub-resources.
- Perform scale-out and scale-in operations.
- Fault tolerance: when a worker is evicted, a new worker is created and added to the training.
TrainingJob sub-resources are created in the following order:
- Create the key pair required for the SSH connection, and create a Secret for it.
- Create the workers, including a Service and a Pod for each, and mount the Secret's public key.
- Create a ConfigMap, which contains the discover_host script and the hostfile.
- Create the launcher and mount the ConfigMap. Since the hostfile will be modified as the topology changes, the hostfile is first copied from the ConfigMap to a separate directory by an initContainer.
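After a TrainingJob named elastic-training is submitted, a quick way to observe these sub-resources is to filter by the job name (a rough check; exact resource names may differ between versions):

```shell
# List the SSH secret, hostfile configmap, services, and pods
# generated for the TrainingJob.
kubectl get secret,configmap,svc,po | grep elastic-training
```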
TrainingJob configuration is divided into a Launcher and a Worker. By default, et-operator mounts the discover_host script into the Launcher at /etc/edl/discover_hosts.sh, and this path is passed to horovodrun in the entry command through the --host-discovery-script parameter. In the Worker settings, the allowed range of worker replicas is specified by maxReplicas / minReplicas:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - sh
            - -c
            - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
            image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
    worker:
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
          - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
status:
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
  - elastic-training-worker-2
  - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4
```
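The mounted discovery script reflects the worker topology at any moment. Assuming the reported hostnames match the worker pod names shown in the status above (an assumption, since the script content is generated by et-operator), running it inside the launcher should print one entry per active worker:

```shell
# Execute the generated discovery script inside the launcher pod.
kubectl exec elastic-training-launcher -- sh /etc/edl/discover_hosts.sh
# Expected output (assumed), one "hostname:slots" line per worker:
#   elastic-training-worker-0:1
#   elastic-training-worker-1:1
```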
In addition to the TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs to scale training tasks out and in. When a ScaleOut CR is submitted, the ScaleOutController triggers a Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, it finds the TrainingJob that the scaler targets and sets the TrainingJob as the CR's OwnerReference.
```yaml
- apiVersion: kai.alibabacloud.com/v1alpha1
  kind: ScaleOut
  metadata:
    creationTimestamp: "2020-11-04T13:54:26Z"
    name: scaleout-ptfnk
    namespace: default
    ownerReferences:
    - apiVersion: kai.alibabacloud.com/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: TrainingJob
      name: elastic-training   # Points to the TrainingJob to be scaled out
      uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
  spec:
    selector:
      name: elastic-training
    toAdd:
      count: 2
```
When the TrainingJobController detects an update to a ScaleOut CR belonging to one of its TrainingJobs, it triggers that TrainingJob's Reconcile. The reconcile traverses and filters the ScaleOut and ScaleIn CRs whose OwnerReference points to the TrainingJob, and decides which scaling operation to execute based on their creation time and status time. While a scaling operation is in progress, the TrainingJob status records the scaler currently being executed:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  # ... Launcher and Worker spec
status:
  currentScaler: ScaleIn:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
```
Deploy et-operator:

```shell
mkdir -p $(go env GOPATH)/src/github.com/aliyunContainerService
cd $(go env GOPATH)/src/github.com/aliyunContainerService
git clone https://github.com/aliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml
```
Check that the CRDs are installed:
```shell
# kubectl get crd
NAME                                CREATED AT
scaleins.kai.alibabacloud.com       2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com      2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com   2020-11-11T11:16:13Z
```
Check the running status of the controller. By default, it is installed in the kube-ai namespace:
```shell
# kubectl -n kube-ai get po
NAME                                              READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s
```
Run the prepared example:
```shell
kubectl apply -f examples/training_job.yaml
```
Check the running status:
```shell
# kubectl get trainingjob
NAME               PHASE     AGE
elastic-training   Running   77s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          7s
elastic-training-worker-0   1/1     Running   0          10s
elastic-training-worker-1   1/1     Running   0          9s
```
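To follow training progress and watch workers join or leave, tail the launcher logs:

```shell
kubectl logs -f elastic-training-launcher
```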
To scale in training task workers, specify the workers to remove through the spec.toDelete.count or spec.toDelete.podNames field of a ScaleIn CR. When count is set, the workers to remove are chosen by index, from high to low:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1
```
If you want to scale in a specific worker, configure podNames instead:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1
```
Run the scale-in example to remove one worker:
```shell
kubectl create -f examples/scale_in_count.yaml
```
Check the scale-in execution status and the training task:
```shell
# kubectl get scalein
NAME                   PHASE            AGE
scalein-sample-t8jxd   ScaleSucceeded   11s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          47s
elastic-training-worker-0   1/1     Running   0          50s
```
To scale out a training task, specify the number of workers to add through the spec.toAdd.count field of a ScaleOut CR:
```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2
```
```shell
kubectl create -f examples/scale_out.yaml
```
Check the scale-out execution status and the training task:
```shell
# kubectl get scaleout
NAME                              PHASE            AGE
elastic-training-scaleout-9dtmw   ScaleSucceeded   30s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          2m5s
elastic-training-worker-0   1/1     Running   0          2m8s
elastic-training-worker-1   1/1     Running   0          40s
elastic-training-worker-2   1/1     Running   0          40s
```
et-operator provides a set of training and scaling CRDs and controllers that let us easily run elastic distributed training on Kubernetes. It supports the delivery of distributed training tasks and, through integration with the distributed framework, can dynamically scale the workers participating in a training task out or in while the task is running. This allows training tasks to combine elasticity with preemptible instances and make better use of the elastic, cost-effective resources on the cloud.