Kubeflow provides Custom Resource Definitions (CRDs) such as TFJob and PyTorchJob, which you can use to run distributed training jobs on Kubernetes clusters. This way, you can focus on model development without managing distributed code logic or cluster O&M. Data Science clusters provide stable computing capabilities and a rich set of machine learning frameworks, including TensorFlow and PyTorch.

Prerequisites

  • An E-MapReduce (EMR) Data Science cluster is created, and Kubeflow is selected from the optional services when you create the cluster. For more information, see Create a cluster.
  • The dsdemo code is downloaded. To obtain the dsdemo code, join the DingTalk group numbered 32497587.

Background information

In this topic, the following methods are used to provide examples for model training: calling the Estimator API, calling the Keras API, and using PyTorch.

Make preparations

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Upload the downloaded dsdemo code package to a directory on the master node of your cluster and decompress the code package.
    In this example, the code package is uploaded to the /root/dsdemo directory and decompressed in that directory. You can specify a different directory based on your business requirements.

Call the Estimator API

  1. Run the following command to access the estimator-API directory:
    cd /root/dsdemo/kubeflow_samples/training/tf/estimator-API
  2. Make a training image.
    • Run the following command to make a training image:
      make
    • Run the following command to make a training image and push the image to the image repository:
      make push
  3. Modify the distributed_tfjob.yaml file based on your business requirements.
    The following code shows the content of the distributed_tfjob.yaml file. You need to change the value of image in the code. For reference, a hypothetical sketch of the training script that such an image typically packages is provided after this procedure.
    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "distributed-training"
    spec:
      cleanPodPolicy: None
      tfReplicaSpecs:
        Worker:
          replicas: 3
          restartPolicy: Never
          template:
            metadata:
              annotations:
                scheduling.k8s.io/group-name: "distributed-training"
            spec:
              containers:
                - name: tensorflow
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/distributed_worker:0.1.4
                  volumeMounts:
                    - mountPath: /train
                      name: training
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-estimator-volume
  4. Run the following command to submit the training job:
    kubectl apply -f distributed_tfjob.yaml
  5. Modify the pvc.yaml configuration file based on your business requirements.
    The following code shows the content of the pvc.yaml configuration file:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: strategy-estimator-volume
      labels:
        app: strategy-estimator-volume
    spec:
      storageClassName: "nfs-client"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    Note You can increase the value of the storage parameter. For example, you can set this parameter to 50Gi.
  6. Run the following command to create the PVC:
    kubectl apply -f pvc.yaml
  7. View the training status.
    1. Run the following command to query information about pods:
      kubectl get pods
      Information similar to the following output is returned:
      NAME                                      READY   STATUS      RESTARTS   AGE
      distributed-training-worker-0             0/1     Completed   0          22h
      distributed-training-worker-1             0/1     Completed   0          22h
      distributed-training-worker-2             0/1     Completed   0          22h
      nfs-client-provisioner-5cb8b7cf76-k2z4d   1/1     Running     0          25h
      Note If the status of all pods is Completed, the training ends.
    2. Run the following command to view log information:
      kubectl logs distributed-training-worker-0
  8. View the model export file in the checkpoint format.
    1. Run the following command to access the default_storage_class directory:
      cd /mnt/disk1/k8s_pv/default_storage_class
    2. Run the ll command to list the model directories in the default_storage_class directory.
      total 28
      drwxrwxrwx 2 root    root     4096 Jul  8 16:56 anonymous-workspace-ds-notebook2-pvc-09789bbd-aa40-4381-af63-ebcab57c584f
      drwxrwxrwx 7 root    root     4096 Jul  8 18:09 anonymous-workspace-ds-notebook-pvc-ae4db112-9491-4449-854a-7b0e67264703
      drwxrwxrwx 5 root    root     4096 Jul  8 16:27 default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
      drwxrwxrwx 3 root    root     4096 Jul  8 13:47 kubeflow-katib-mysql-pvc-9a3ebbfe-2952-4eaa-8a83-36e42d620cac
      drwxrwxrwx 3 root    root     4096 Jul  8 13:47 kubeflow-metadata-mysql-pvc-5778e165-8a08-4536-8e36-9185144d3d6d
      drwxrwxrwx 3 root    root     4096 Jul  8 13:47 kubeflow-minio-pvc-pvc-10015211-fc40-460e-b723-a5c220112d08
      drwxrwxrwx 6 polkitd ssh_keys 4096 Jul  8 13:54 kubeflow-mysql-pv-claim-pvc-5136b864-c868-4951-9b34-f6c5f06c2d6f
                                      
    3. Run the following command to access a specific model directory:
      cd default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
      Note Replace default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03 with an actual directory.
    4. Run the following command to access the master directory:
      cd master
    5. Run the ll command to view subdirectories of the preceding directory:
      total 4
      drwxr-xr-x 3 root root 4096 Jul  8 16:27 1625732821
    6. Run the following command to access a specific directory in the master directory:
      cd 1625732821
      Note Replace 1625732821 with an actual directory.
    7. Run the ll command to view the files in the directory:
      total 1788
      -rw-r--r-- 1 root root    130 Jul  8 16:27 checkpoint
      -rw-r--r-- 1 root root 838550 Jul  8 16:27 events.out.tfevents.1625732823.distributed-training-worker-0
      -rw-r--r-- 1 root root 523304 Jul  8 16:27 graph.pbtxt
      drwxr-xr-x 2 root root   4096 Jul  8 16:27 keras
      -rw-r--r-- 1 root root    780 Jul  8 16:27 model.ckpt-0.data-00000-of-00001
      -rw-r--r-- 1 root root    249 Jul  8 16:27 model.ckpt-0.index
      -rw-r--r-- 1 root root 218354 Jul  8 16:27 model.ckpt-0.meta
      -rw-r--r-- 1 root root    780 Jul  8 16:27 model.ckpt-3200.data-00000-of-00001
      -rw-r--r-- 1 root root    249 Jul  8 16:27 model.ckpt-3200.index
      -rw-r--r-- 1 root root 218354 Jul  8 16:27 model.ckpt-3200.meta
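
For reference, the following is a minimal, hypothetical sketch of the Estimator-style training script that an image such as distributed_worker typically packages. It assumes that the TFJob operator injects the TF_CONFIG environment variable into each worker and that checkpoints are written under the /train mount path defined in distributed_tfjob.yaml, which is why the checkpoint files appear in the PVC directory in the preceding step. The model, dataset, and directory names are placeholders, not the actual dsdemo code.

    import os
    import time

    import tensorflow as tf

    def input_fn():
        # Tiny synthetic dataset as a placeholder for the real input pipeline.
        features = tf.random.uniform([1000, 10])
        labels = tf.cast(tf.reduce_sum(features, axis=1) > 5.0, tf.int32)
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
        return dataset.repeat().batch(32)

    def model_fn(features, labels, mode):
        logits = tf.keras.layers.Dense(2)(features)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.compat.v1.train.AdamOptimizer()
            train_op = optimizer.minimize(
                loss, global_step=tf.compat.v1.train.get_or_create_global_step())
            return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
        return tf.estimator.EstimatorSpec(mode, loss=loss)

    # MultiWorkerMirroredStrategy reads the TF_CONFIG variable that the TFJob
    # operator sets for each replica.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    # /train is the PVC mount path in distributed_tfjob.yaml. The timestamped
    # subdirectory mirrors the layout shown in the preceding step.
    model_dir = os.path.join("/train", "master", str(int(time.time())))
    config = tf.estimator.RunConfig(train_distribute=strategy, model_dir=model_dir)
    estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

    tf.estimator.train_and_evaluate(
        estimator,
        tf.estimator.TrainSpec(input_fn=input_fn, max_steps=3200),
        tf.estimator.EvalSpec(input_fn=input_fn, steps=10))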

Call the Keras API

Notice This method requires GPU resources, and building the training image also requires GPU resources. Therefore, you must select a GPU configuration when you create the Data Science cluster.
  1. Run the following command to access the keras-API directory:
    cd /root/dsdemo/kubeflow_samples/training/tf/keras-API
  2. Make a training image.
    • Run the following command to make a training image:
      make
    • Run the following command to make a training image and push the image to the image repository:
      make push
  3. Modify the image name in the multi_worker_tfjob.yaml training configuration file in the keras-API directory based on your business requirements.
    A hypothetical sketch of the training script that the image packages is provided after this procedure. The following code shows the content of the multi_worker_tfjob.yaml file:
    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: multi-worker
    spec:
      cleanPodPolicy: None
      tfReplicaSpecs:
        Worker:
          replicas: 3
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/multi_worker_strategy:0.1.1
                  volumeMounts:
                    - mountPath: /train
                      name: training
                  resources:
                    limits:
                      nvidia.com/gpu: 1
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-volume
  4. Run the following command to submit the training job:
    kubectl apply -f multi_worker_tfjob.yaml
  5. Modify the pvc.yaml configuration file.
    The following code shows the content of the pvc.yaml configuration file:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: strategy-volume
      labels:
        app: strategy-volume
    spec:
      storageClassName: "nfs-client"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    Note You can increase the value of the storage parameter. For example, you can set this parameter to 50Gi.
  6. Run the following command to create the PVC:
    kubectl apply -f pvc.yaml
  7. View the training status.
    1. Run the following command to query information about pods:
      kubectl get pods
      Information similar to the following output is returned:
      NAME                                      READY   STATUS      RESTARTS   AGE
      multi-worker-worker-0                     0/1     Pending     0          5m33s
      multi-worker-worker-1                     0/1     Pending     0          5m33s
      multi-worker-worker-2                     0/1     Pending     0          5m33s
      nfs-client-provisioner-5cb8b7cf76-k2z4d   1/1     Running     0          25h
      Note If the status of all pods is Completed, the training ends.
    2. Run the following command to view log information:
      kubectl logs multi-worker-worker-0
  8. Run the following command to view the model export file in the checkpoint format:
    ll /mnt/disk1/k8s_pv/default_storage_class/default-strategy-estimator-volume-pvc-aa16c081-cd94-4565-bfde-daee90ee25a4/master/1625553118/
    Note Replace default-strategy-estimator-volume-pvc-aa16c081-cd94-4565-bfde-daee90ee25a4 and 1625553118 with actual directories.
    total 1788
    -rw-r--r-- 1 root root    130 Jul  6 14:32 checkpoint
    -rw-r--r-- 1 root root 838550 Jul  6 14:32 events.out.tfevents.1625553120.distributed-training-worker-0
    -rw-r--r-- 1 root root 523405 Jul  6 14:32 graph.pbtxt
    drwxr-xr-x 2 root root   4096 Jul  6 14:31 keras
    -rw-r--r-- 1 root root    780 Jul  6 14:32 model.ckpt-0.data-00000-of-00001
    -rw-r--r-- 1 root root    249 Jul  6 14:32 model.ckpt-0.index
    -rw-r--r-- 1 root root 218354 Jul  6 14:32 model.ckpt-0.meta
    -rw-r--r-- 1 root root    780 Jul  6 14:32 model.ckpt-3200.data-00000-of-00001
    -rw-r--r-- 1 root root    249 Jul  6 14:32 model.ckpt-3200.index
    -rw-r--r-- 1 root root 218354 Jul  6 14:32 model.ckpt-3200.meta
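
For reference, the following is a minimal, hypothetical sketch of the multi-worker Keras training script that an image such as multi_worker_strategy typically packages. It assumes that the TFJob operator injects the TF_CONFIG environment variable into each worker and that checkpoints are written to the /train mount path defined in multi_worker_tfjob.yaml. The model and dataset are placeholders, not the actual dsdemo code.

    import tensorflow as tf

    # MultiWorkerMirroredStrategy reads the TF_CONFIG variable that the TFJob
    # operator sets for each replica.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    def make_dataset():
        # Synthetic MNIST-like data as a placeholder for the real input pipeline.
        images = tf.random.uniform([1024, 28, 28, 1])
        labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
        dataset = tf.data.Dataset.from_tensor_slices((images, labels))
        return dataset.shuffle(1024).repeat().batch(64)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
            metrics=["accuracy"])

    # Write checkpoints to the mounted PVC so that they remain available after
    # the worker pods complete.
    callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath="/train/checkpoint")]
    model.fit(make_dataset(), epochs=3, steps_per_epoch=70, callbacks=callbacks)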

Use PyTorch

  1. Run the following command to access the mnist directory:
    cd /root/dsdemo/kubeflow_samples/training/pytorch/mnist
  2. Make a training image.
    • Run the following command to make a training image:
      make
    • Run the following command to make a training image and push the image to the image repository:
      make push
  3. Modify the image name in the v1/pytorch_job_mnist_gloo.yaml training configuration file based on your business requirements. A hypothetical sketch of the training script that the image packages is provided after this procedure. The following code shows the content of the file:
    apiVersion: "kubeflow.org/v1"
    kind: "PyTorchJob"
    metadata:
      name: "pytorch-dist-mnist-gloo"
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/pytorch-dist-mnist-test:0.1.5
                  args: ["--backend", "gloo"]
                  volumeMounts:
                    - mountPath: /train
                      name: training
                  # Comment out the below resources to use the CPU.
                  resources: 
                    limits:
                      nvidia.com/gpu: 1
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-pytorch-volume
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers: 
                - name: pytorch
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/pytorch-dist-mnist-test:0.1.5
                  args: ["--backend", "gloo"]
                  volumeMounts:
                    - mountPath: /train
                      name: training
                  # Comment out the below resources to use the CPU.
                  resources: 
                    limits:
                      nvidia.com/gpu: 1
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-pytorch-volume
  4. Modify the pvc.yaml configuration file.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: strategy-pytorch-volume
      labels:
        app: strategy-pytorch-volume
    spec:
      storageClassName: "nfs-client"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    Note You can increase the value of the storage parameter. For example, you can set this parameter to 50Gi.
  5. Run the following commands to create the persistent volume claim (PVC) and submit the training job:
    kubectl apply -f pvc.yaml
    kubectl apply -f ./v1/pytorch_job_mnist_gloo.yaml
  6. View the training status.
    1. Run the following command to query information about pods:
      kubectl get pods
      Information similar to the following output is returned:
      NAME                                      READY   STATUS      RESTARTS   AGE
      pytorch-dist-mnist-gloo-worker-0          0/1     Pending     0          11m
      Note If the status of all pods is Completed, the training ends.
    2. Run the following command to view log information:
      kubectl logs pytorch-dist-mnist-gloo-master-0
  7. Run the following command to go to the directory that contains the exported model file:
    cd /mnt/disk1/k8s_pv/default_storage_class/default-strategy-pytorch-volume-pvc-633b12ab-5f63-4e63-97b9-24c2b3de733c/
    Note Replace default-strategy-pytorch-volume-pvc-633b12ab-5f63-4e63-97b9-24c2b3de733c with an actual directory.
    Run the ll command. Information similar to the following output is returned:
    total 1688
    -rw-r--r-- 1 root root 1725813 Jul  6 15:35 mnist_cnn.pt
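
For reference, the following is a minimal, hypothetical sketch of the distributed training entry point that an image such as pytorch-dist-mnist-test typically packages. It assumes that the PyTorchJob operator sets the MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables, that the --backend argument in the YAML file selects the communication backend, and that the exported model is saved to the /train mount path, which is why mnist_cnn.pt appears in the PVC directory above. The model and data pipeline are placeholders, not the actual dsdemo code.

    import argparse

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--backend", default="gloo")
        args = parser.parse_args()

        # The PyTorchJob operator sets MASTER_ADDR, MASTER_PORT, RANK, and
        # WORLD_SIZE, so the default env:// initialization method works here.
        dist.init_process_group(backend=args.backend)

        # Simple CPU model as a placeholder for the CNN in the dsdemo example.
        model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
        model = nn.parallel.DistributedDataParallel(model)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        # Synthetic stand-in for the MNIST data pipeline.
        for step in range(100):
            data = torch.randn(64, 1, 28, 28)
            target = torch.randint(0, 10, (64,))
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            optimizer.step()

        # Only rank 0 writes the exported model to the mounted PVC.
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), "/train/mnist_cnn.pt")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()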