Kubeflow provides Custom Resource Definitions (CRDs), such as TFJob and PyTorchJob.
You can use these CRDs to run distributed training jobs on Kubernetes clusters. This
way, you can focus on model development without managing distributed training logic
or cluster operations and maintenance (O&M). Data Science clusters provide stable
computing capabilities and rich machine learning frameworks, including TensorFlow and PyTorch.
Prerequisites
- An E-MapReduce (EMR) Data Science cluster is created, and Kubeflow is selected from
the optional services when you create the cluster. For more information, see Create a cluster.
- The dsdemo code is downloaded. To obtain the dsdemo code, join the DingTalk group
numbered 32497587.
Background information
In this topic, the TensorFlow Estimator API, the TensorFlow Keras API, and PyTorch are used as examples for model training.
Make preparations
- Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
- Upload the downloaded dsdemo code package to a directory of the master node of your
cluster and decompress the code package.
In this example, the code package is uploaded to the /root/dsdemo directory and decompressed in that directory. You can specify a directory
based on your business requirements.
Call the Estimator API
- Run the following command to access the estimator-API directory:
cd /root/dsdemo/kubeflow_samples/training/tf/estimator-API
- Make a training image.
- Run the following command to make a training image:
make
- Run the following command to make a training image and push the image to the image
repository:
make push
- Modify the distributed_tfjob.yaml file based on your business requirements.
The following code shows the content of the distributed_tfjob.yaml file. You need to change the value of the image field in the code.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "distributed-training"
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata:
          annotations:
            scheduling.k8s.io/group-name: "distributed-training"
        spec:
          containers:
            - name: tensorflow
              image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/distributed_worker:0.1.4
              volumeMounts:
                - mountPath: /train
                  name: training
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: strategy-estimator-volume
- Run the following command for the change to take effect:
kubectl apply -f distributed_tfjob.yaml
- Modify the pvc.yaml configuration file based on your business requirements.
The following code shows the content of the
pvc.yaml configuration file:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-estimator-volume
  labels:
    app: strategy-estimator-volume
spec:
  storageClassName: "nfs-client"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
Note You can increase the value of the storage
parameter. For example, you can set this parameter to 50Gi.
- Run the following command for the change to take effect:
kubectl apply -f pvc.yaml
- View the training status.
- Run the following command to query information about pods:
kubectl get pods
Information similar to the following output is returned:
NAME READY STATUS RESTARTS AGE
distributed-training-worker-0 0/1 Completed 0 22h
distributed-training-worker-1 0/1 Completed 0 22h
distributed-training-worker-2 0/1 Completed 0 22h
nfs-client-provisioner-5cb8b7cf76-k2z4d 1/1 Running 0 25h
Note If the status of all pods is Completed, the training ends.
- Run the following command to view log information:
kubectl logs distributed-training-worker-0
- View the model export file in the checkpoint format.
- Run the following command to access the default_storage_class directory:
cd /mnt/disk1/k8s_pv/default_storage_class
- Run the ll command to view the model directories in the default_storage_class directory. Information similar to the following output is returned:
total 28
drwxrwxrwx 2 root root 4096 Jul 8 16:56 anonymous-workspace-ds-notebook2-pvc-09789bbd-aa40-4381-af63-ebcab57c584f
drwxrwxrwx 7 root root 4096 Jul 8 18:09 anonymous-workspace-ds-notebook-pvc-ae4db112-9491-4449-854a-7b0e67264703
drwxrwxrwx 5 root root 4096 Jul 8 16:27 default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
drwxrwxrwx 3 root root 4096 Jul 8 13:47 kubeflow-katib-mysql-pvc-9a3ebbfe-2952-4eaa-8a83-36e42d620cac
drwxrwxrwx 3 root root 4096 Jul 8 13:47 kubeflow-metadata-mysql-pvc-5778e165-8a08-4536-8e36-9185144d3d6d
drwxrwxrwx 3 root root 4096 Jul 8 13:47 kubeflow-minio-pvc-pvc-10015211-fc40-460e-b723-a5c220112d08
drwxrwxrwx 6 polkitd ssh_keys 4096 Jul 8 13:54 kubeflow-mysql-pv-claim-pvc-5136b864-c868-4951-9b34-f6c5f06c2d6f
- Run the following command to access a specific model directory:
cd default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
Note Replace default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
with an actual directory.
- Run the following command to access the master directory:
cd master
- Run the ll command to view the subdirectories of the master directory. Information similar to the following output is returned:
total 4
drwxr-xr-x 3 root root 4096 Jul 8 16:27 1625732821
- Run the following command to access a specific directory in the master directory:
cd 1625732821
Note Replace 1625732821
with an actual directory.
- Run the ll command to view the files in the directory. Information similar to the following output is returned:
total 1788
-rw-r--r-- 1 root root 130 Jul 8 16:27 checkpoint
-rw-r--r-- 1 root root 838550 Jul 8 16:27 events.out.tfevents.1625732823.distributed-training-worker-0
-rw-r--r-- 1 root root 523304 Jul 8 16:27 graph.pbtxt
drwxr-xr-x 2 root root 4096 Jul 8 16:27 keras
-rw-r--r-- 1 root root 780 Jul 8 16:27 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 root root 249 Jul 8 16:27 model.ckpt-0.index
-rw-r--r-- 1 root root 218354 Jul 8 16:27 model.ckpt-0.meta
-rw-r--r-- 1 root root 780 Jul 8 16:27 model.ckpt-3200.data-00000-of-00001
-rw-r--r-- 1 root root 249 Jul 8 16:27 model.ckpt-3200.index
-rw-r--r-- 1 root root 218354 Jul 8 16:27 model.ckpt-3200.meta
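The checkpoint files above follow the standard TensorFlow naming scheme model.ckpt-[step].*. If you need to locate the newest checkpoint programmatically, the following is a minimal sketch that works only from the file names shown above; the helper name latest_checkpoint_prefix is hypothetical, not part of any API:

```python
import re

def latest_checkpoint_prefix(filenames):
    """Return the model.ckpt-<step> prefix with the highest step number."""
    steps = [
        int(m.group(1))
        for f in filenames
        if (m := re.fullmatch(r"model\.ckpt-(\d+)\.index", f))
    ]
    return f"model.ckpt-{max(steps)}" if steps else None

# File names from the listing above:
files = [
    "checkpoint", "graph.pbtxt",
    "model.ckpt-0.index", "model.ckpt-0.meta", "model.ckpt-0.data-00000-of-00001",
    "model.ckpt-3200.index", "model.ckpt-3200.meta", "model.ckpt-3200.data-00000-of-00001",
]
print(latest_checkpoint_prefix(files))  # → model.ckpt-3200
```

In a TensorFlow program, tf.train.latest_checkpoint reads the checkpoint metadata file in the directory and returns the same prefix.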
Call the Keras API
Notice This method requires GPU resources, and the creation of the training image also
requires GPU resources. Therefore, when you create a Data Science cluster, you must
select a GPU instance type.
- Run the following command to access the keras-API directory:
cd /root/dsdemo/kubeflow_samples/training/tf/keras-API
- Make a training image.
- Run the following command to make a training image:
make
- Run the following command to make a training image and push the image to the image
repository:
make push
- Modify the image name in the multi_worker_tfjob.yaml training configuration file that is stored in the keras-API directory based on your business requirements.
The following code shows the content of the multi_worker_tfjob.yaml file:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/multi_worker_strategy:0.1.1
              volumeMounts:
                - mountPath: /train
                  name: training
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: strategy-volume
- Run the following command to make the modification take effect:
kubectl apply -f multi_worker_tfjob.yaml
- Modify the pvc.yaml configuration file.
The following code shows the content of the
pvc.yaml configuration file:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-volume
  labels:
    app: strategy-volume
spec:
  storageClassName: "nfs-client"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
Note You can increase the value of the storage
parameter. For example, you can set this parameter to 50Gi.
- Run the following command for the change to take effect:
kubectl apply -f pvc.yaml
- View the training status.
- Run the following command to query information about pods:
kubectl get pods
Information similar to the following output is returned:
NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 Pending 0 5m33s
multi-worker-worker-1 0/1 Pending 0 5m33s
multi-worker-worker-2 0/1 Pending 0 5m33s
nfs-client-provisioner-5cb8b7cf76-k2z4d 1/1 Running 0 25h
Note If the status of all pods is Completed, the training ends.
- Run the following command to view log information:
kubectl logs multi-worker-worker-0
- Run the following command to view the model export file in the checkpoint format:
ll /mnt/disk1/k8s_pv/default_storage_class/default-strategy-volume-pvc-aa16c081-cd94-4565-bfde-daee90ee25a4/master/1625553118/
Note Replace default-strategy-volume-pvc-aa16c081-cd94-4565-bfde-daee90ee25a4 and 1625553118 with actual directories.
total 1788
-rw-r--r-- 1 root root 130 Jul 6 14:32 checkpoint
-rw-r--r-- 1 root root 838550 Jul 6 14:32 events.out.tfevents.1625553120.distributed-training-worker-0
-rw-r--r-- 1 root root 523405 Jul 6 14:32 graph.pbtxt
drwxr-xr-x 2 root root 4096 Jul 6 14:31 keras
-rw-r--r-- 1 root root 780 Jul 6 14:32 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 root root 249 Jul 6 14:32 model.ckpt-0.index
-rw-r--r-- 1 root root 218354 Jul 6 14:32 model.ckpt-0.meta
-rw-r--r-- 1 root root 780 Jul 6 14:32 model.ckpt-3200.data-00000-of-00001
-rw-r--r-- 1 root root 249 Jul 6 14:32 model.ckpt-3200.index
-rw-r--r-- 1 root root 218354 Jul 6 14:32 model.ckpt-3200.meta
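For multi-worker Keras training, the Kubeflow training operator injects a TF_CONFIG environment variable into each worker pod, and tf.distribute.MultiWorkerMirroredStrategy reads it to discover the cluster. The following sketch shows the shape of that variable for the 3-replica job above; the host names and port 2222 are illustrative assumptions, not values from a real cluster:

```python
import json
import os

# Illustrative TF_CONFIG for the 3-replica TFJob above. In a real pod the
# Kubeflow training operator sets this variable; the host names and the
# port below are assumptions for this sketch.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "worker": [
            "multi-worker-worker-0:2222",
            "multi-worker-worker-1:2222",
            "multi-worker-worker-2:2222",
        ]
    },
    "task": {"type": "worker", "index": 0},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
is_chief = tf_config["task"]["index"] == 0  # worker 0 typically writes checkpoints
print(num_workers, is_chief)
```

The training script itself does not need to parse TF_CONFIG by hand; instantiating MultiWorkerMirroredStrategy inside the container picks it up automatically.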
Use PyTorch
- Run the following command to access the mnist directory:
cd /root/dsdemo/kubeflow_samples/training/pytorch/mnist
- Make a training image.
- Run the following command to make a training image:
make
- Run the following command to make a training image and push the image to the image
repository:
make push
- Modify the image name in the training configuration file v1/pytorch_job_mnist_gloo.yaml based on your business requirements. The following code shows the content of the file:
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/pytorch-dist-mnist-test:0.1.5
              args: ["--backend", "gloo"]
              volumeMounts:
                - mountPath: /train
                  name: training
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: strategy-pytorch-volume
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/pytorch-dist-mnist-test:0.1.5
              args: ["--backend", "gloo"]
              volumeMounts:
                - mountPath: /train
                  name: training
              # Comment out the below resources to use the CPU.
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: strategy-pytorch-volume
- Modify the pvc.yaml configuration file.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-pytorch-volume
  labels:
    app: strategy-pytorch-volume
spec:
  storageClassName: "nfs-client"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
Note You can increase the value of the storage
parameter. For example, you can set this parameter to 50Gi.
- Apply for a persistent volume claim (PVC) and submit a training job.
kubectl apply -f pvc.yaml
kubectl apply -f ./v1/pytorch_job_mnist_gloo.yaml
- View the training status.
- Run the following command to query information about pods:
kubectl get pods
Information similar to the following output is returned:
NAME READY STATUS RESTARTS AGE
pytorch-dist-mnist-gloo-worker-0 0/1 Pending 0 11m
Note If the status of all pods is Completed, the training ends.
- Run the following command to view log information:
kubectl logs pytorch-dist-mnist-gloo-master-0
- View the exported model file. Run the following command to access the model directory:
cd /mnt/disk1/k8s_pv/default_storage_class/default-strategy-pytorch-volume-pvc-633b12ab-5f63-4e63-97b9-24c2b3de733c/
Note Replace default-strategy-pytorch-volume-pvc-633b12ab-5f63-4e63-97b9-24c2b3de733c
with an actual directory.
Then run the ll command. Information similar to the following output is returned:
total 1688
-rw-r--r-- 1 root root 1725813 Jul 6 15:35 mnist_cnn.pt
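Inside each replica, the PyTorch training operator injects the standard torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE), which the mnist script uses together with the --backend gloo argument. The following is a minimal sketch of how a training script reads them; the default values below are illustrative assumptions for a 1-master, 1-worker job, not output from a real cluster:

```python
import os

# Illustrative defaults for a 1-master, 1-worker PyTorchJob; in a real pod
# the PyTorch training operator sets these variables, so the defaults are
# ignored there.
os.environ.setdefault("MASTER_ADDR", "pytorch-dist-mnist-gloo-master-0")
os.environ.setdefault("MASTER_PORT", "23456")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "2")

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# A real training script would then initialize the process group:
# torch.distributed.init_process_group(backend="gloo", rank=rank, world_size=world_size)
print(rank, world_size)
```

Because the operator handles this wiring, the same script runs unchanged whether the job has one worker or many; only the replicas fields in the PyTorchJob need to change.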