Kubeflow provides Custom Resource Definitions (CRDs) such as TFJob and PyTorchJob, which you can use to run distributed training jobs on Kubernetes clusters. This way, you can focus on model development without managing distributed code logic or cluster O&M. Data Science clusters provide stable computing capabilities and a rich set of machine learning frameworks, including TensorFlow and PyTorch.

Prerequisites

  • An E-MapReduce (EMR) Data Science cluster is created, and Kubeflow is selected from the optional services when you create the cluster. For more information, see Create a cluster.
  • The dsdemo code is downloaded. To obtain the dsdemo code, join the DingTalk group numbered 32497587.

Background information

In this topic, the following methods are used to provide examples for model training: calling the Estimator API, calling the Keras API, and using PyTorch.

Make preparations

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Upload the downloaded dsdemo code package to a directory on the master node of your cluster and decompress the code package.
    In this example, the code package is uploaded to the /root/dsdemo directory and decompressed in that directory. You can specify a different directory based on your business requirements.

Call the Estimator API

  1. Run the following command to access the estimator-API directory:
    cd /root/dsdemo/kubeflow_samples/training/tf/estimator-API
  2. Make a training image.
    • Run the following command to make a training image:
      make
    • Run the following command to make a training image and push the image to the image repository:
      make push
  3. Modify the distributed_tfjob.yaml file based on your business requirements.
    The following code shows the content of the distributed_tfjob.yaml file. You need to change the value of image in the code. For reference, a hypothetical sketch of the training script that such an image typically packages is provided after this procedure.
    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "distributed-training"
    spec:
      cleanPodPolicy: None
      tfReplicaSpecs:
        Worker:
          replicas: 3
          restartPolicy: Never
          template:
            metadata:
              annotations:
                scheduling.k8s.io/group-name: "distributed-training"
            spec:
              containers:
                - name: tensorflow
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/distributed_worker:0.1.4
                  volumeMounts:
                    - mountPath: /train
                      name: training
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-estimator-volume
  4. Run the following command to submit the training job:
    kubectl apply -f distributed_tfjob.yaml
  5. Modify the pvc.yaml configuration file based on your business requirements.
    The following code shows the content of the pvc.yaml configuration file:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: strategy-estimator-volume
      labels:
        app: strategy-estimator-volume
    spec:
      storageClassName: "nfs-client"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    Note You can increase the value of the storage parameter. For example, you can set this parameter to 50Gi.
  6. Run the following command to create the PVC:
    kubectl apply -f pvc.yaml
  7. View the training status.
    1. Run the following command to query information about pods:
      kubectl get pods
      Information similar to the following output is returned:
      NAME                                      READY   STATUS      RESTARTS   AGE
      distributed-training-worker-0             0/1     Completed   0          22h
      distributed-training-worker-1             0/1     Completed   0          22h
      distributed-training-worker-2             0/1     Completed   0          22h
      nfs-client-provisioner-5cb8b7cf76-k2z4d   1/1     Running     0          25h
      Note If the status of all pods is Completed, the training ends.
    2. Run the following command to view log information:
      kubectl logs distributed-training-worker-0
  8. View the model export file in the checkpoint format.
    1. Run the following command to access the default_storage_class directory:
      cd /mnt/disk1/k8s_pv/default_storage_class
    2. Run the ll command to list the model directories in the default_storage_class directory.
      total 28
      drwxrwxrwx 2 root    root     4096 Jul  8 16:56 anonymous-workspace-ds-notebook2-pvc-09789bbd-aa40-4381-af63-ebcab57c584f
      drwxrwxrwx 7 root    root     4096 Jul  8 18:09 anonymous-workspace-ds-notebook-pvc-ae4db112-9491-4449-854a-7b0e67264703
      drwxrwxrwx 5 root    root     4096 Jul  8 16:27 default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
      drwxrwxrwx 3 root    root     4096 Jul  8 13:47 kubeflow-katib-mysql-pvc-9a3ebbfe-2952-4eaa-8a83-36e42d620cac
      drwxrwxrwx 3 root    root     4096 Jul  8 13:47 kubeflow-metadata-mysql-pvc-5778e165-8a08-4536-8e36-9185144d3d6d
      drwxrwxrwx 3 root    root     4096 Jul  8 13:47 kubeflow-minio-pvc-pvc-10015211-fc40-460e-b723-a5c220112d08
      drwxrwxrwx 6 polkitd ssh_keys 4096 Jul  8 13:54 kubeflow-mysql-pv-claim-pvc-5136b864-c868-4951-9b34-f6c5f06c2d6f
                                      
    3. Run the following command to access a specific model directory:
      cd default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03
      Note Replace default-strategy-estimator-volume-pvc-6cd2979e-f925-45b7-8c25-ce67e12e2f03 with an actual directory.
    4. Run the following command to access the master directory:
      cd master
    5. Run the ll command to view subdirectories of the preceding directory:
      total 4
      drwxr-xr-x 3 root root 4096 Jul  8 16:27 1625732821
    6. Run the following command to access a specific directory in the master directory:
      cd 1625732821
      Note Replace 1625732821 with an actual directory.
    7. Run the ll command to view the files in the directory:
      total 1788
      -rw-r--r-- 1 root root    130 Jul  8 16:27 checkpoint
      -rw-r--r-- 1 root root 838550 Jul  8 16:27 events.out.tfevents.1625732823.distributed-training-worker-0
      -rw-r--r-- 1 root root 523304 Jul  8 16:27 graph.pbtxt
      drwxr-xr-x 2 root root   4096 Jul  8 16:27 keras
      -rw-r--r-- 1 root root    780 Jul  8 16:27 model.ckpt-0.data-00000-of-00001
      -rw-r--r-- 1 root root    249 Jul  8 16:27 model.ckpt-0.index
      -rw-r--r-- 1 root root 218354 Jul  8 16:27 model.ckpt-0.meta
      -rw-r--r-- 1 root root    780 Jul  8 16:27 model.ckpt-3200.data-00000-of-00001
      -rw-r--r-- 1 root root    249 Jul  8 16:27 model.ckpt-3200.index
      -rw-r--r-- 1 root root 218354 Jul  8 16:27 model.ckpt-3200.meta
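
For reference, the following is a minimal, hypothetical sketch of the Estimator-style training script that an image such as distributed_worker typically packages. It assumes that the TFJob operator injects the TF_CONFIG environment variable into each worker and that checkpoints are written under the /train mount path defined in distributed_tfjob.yaml, which is why the checkpoint files appear in the PVC directory in the preceding step. The model, dataset, and directory names are placeholders, not the actual dsdemo code.

    import os
    import time

    import tensorflow as tf

    def input_fn():
        # Tiny synthetic dataset as a placeholder for the real input pipeline.
        features = tf.random.uniform([1000, 10])
        labels = tf.cast(tf.reduce_sum(features, axis=1) > 5.0, tf.int32)
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
        return dataset.repeat().batch(32)

    def model_fn(features, labels, mode):
        logits = tf.keras.layers.Dense(2)(features)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
        if mode == tf.estimator.ModeKeys.TRAIN:
            optimizer = tf.compat.v1.train.AdamOptimizer()
            train_op = optimizer.minimize(
                loss, global_step=tf.compat.v1.train.get_or_create_global_step())
            return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
        return tf.estimator.EstimatorSpec(mode, loss=loss)

    # MultiWorkerMirroredStrategy reads the TF_CONFIG variable that the TFJob
    # operator sets for each replica.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    # /train is the PVC mount path in distributed_tfjob.yaml. The timestamped
    # subdirectory mirrors the layout shown in the preceding step.
    model_dir = os.path.join("/train", "master", str(int(time.time())))
    config = tf.estimator.RunConfig(train_distribute=strategy, model_dir=model_dir)
    estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

    tf.estimator.train_and_evaluate(
        estimator,
        tf.estimator.TrainSpec(input_fn=input_fn, max_steps=3200),
        tf.estimator.EvalSpec(input_fn=input_fn, steps=10))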

Call the Keras API

Notice This method requires GPU resources, and building the training image also requires GPU resources. Therefore, you must select a GPU configuration when you create the Data Science cluster.
  1. Run the following command to access the keras-API directory:
    cd /root/dsdemo/kubeflow_samples/training/tf/keras-API
  2. Make a training image.
    • Run the following command to make a training image:
      make
    • Run the following command to make a training image and push the image to the image repository:
      make push
  3. Modify the image name in the multi_worker_tfjob.yaml training configuration file in the keras-API directory based on your business requirements.
    A hypothetical sketch of the training script that the image packages is provided after this procedure. The following code shows the content of the multi_worker_tfjob.yaml file:
    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: multi-worker
    spec:
      cleanPodPolicy: None
      tfReplicaSpecs:
        Worker:
          replicas: 3
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/multi_worker_strategy:0.1.1
                  volumeMounts:
                    - mountPath: /train
                      name: training
                  resources:
                    limits:
                      nvidia.com/gpu: 1
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-volume
  4. Run the following command to submit the training job:
    kubectl apply -f multi_worker_tfjob.yaml
  5. Modify the pvc.yaml configuration file.
    The following code shows the content of the pvc.yaml configuration file:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: strategy-volume
      labels:
        app: strategy-volume
    spec:
      storageClassName: "nfs-client"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    Note You can increase the value of the storage parameter. For example, you can set this parameter to 50Gi.
  6. Run the following command to create the PVC:
    kubectl apply -f pvc.yaml
  7. View the training status.
    1. Run the following command to query information about pods:
      kubectl get pods
      Information similar to the following output is returned:
      NAME                                      READY   STATUS      RESTARTS   AGE
      multi-worker-worker-0                     0/1     Pending     0          5m33s
      multi-worker-worker-1                     0/1     Pending     0          5m33s
      multi-worker-worker-2                     0/1     Pending     0          5m33s
      nfs-client-provisioner-5cb8b7cf76-k2z4d   1/1     Running     0          25h
      Note If the status of all pods is Completed, the training ends.
    2. Run the following command to view log information:
      kubectl logs multi-worker-worker-0
  8. Run the following command to view the model export file in the checkpoint format:
    ll /mnt/disk1/k8s_pv/default_storage_class/default-strategy-estimator-volume-pvc-aa16c081-cd94-4565-bfde-daee90ee25a4/master/1625553118/
    Note Replace default-strategy-estimator-volume-pvc-aa16c081-cd94-4565-bfde-daee90ee25a4 and 1625553118 with actual directories.
    total 1788
    -rw-r--r-- 1 root root    130 Jul  6 14:32 checkpoint
    -rw-r--r-- 1 root root 838550 Jul  6 14:32 events.out.tfevents.1625553120.distributed-training-worker-0
    -rw-r--r-- 1 root root 523405 Jul  6 14:32 graph.pbtxt
    drwxr-xr-x 2 root root   4096 Jul  6 14:31 keras
    -rw-r--r-- 1 root root    780 Jul  6 14:32 model.ckpt-0.data-00000-of-00001
    -rw-r--r-- 1 root root    249 Jul  6 14:32 model.ckpt-0.index
    -rw-r--r-- 1 root root 218354 Jul  6 14:32 model.ckpt-0.meta
    -rw-r--r-- 1 root root    780 Jul  6 14:32 model.ckpt-3200.data-00000-of-00001
    -rw-r--r-- 1 root root    249 Jul  6 14:32 model.ckpt-3200.index
    -rw-r--r-- 1 root root 218354 Jul  6 14:32 model.ckpt-3200.meta
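
For reference, the following is a minimal, hypothetical sketch of the multi-worker Keras training script that an image such as multi_worker_strategy typically packages. It assumes that the TFJob operator injects the TF_CONFIG environment variable into each worker and that checkpoints are written to the /train mount path defined in multi_worker_tfjob.yaml. The model and dataset are placeholders, not the actual dsdemo code.

    import tensorflow as tf

    # MultiWorkerMirroredStrategy reads the TF_CONFIG variable that the TFJob
    # operator sets for each replica.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    def make_dataset():
        # Synthetic MNIST-like data as a placeholder for the real input pipeline.
        images = tf.random.uniform([1024, 28, 28, 1])
        labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
        dataset = tf.data.Dataset.from_tensor_slices((images, labels))
        return dataset.shuffle(1024).repeat().batch(64)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
            metrics=["accuracy"])

    # Write checkpoints to the mounted PVC so that they remain available after
    # the worker pods complete.
    callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath="/train/checkpoint")]
    model.fit(make_dataset(), epochs=3, steps_per_epoch=70, callbacks=callbacks)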

Use PyTorch

  1. Run the following command to access the mnist directory:
    cd /root/dsdemo/kubeflow_samples/training/pytorch/mnist
  2. Make a training image.
    • Run the following command to make a training image:
      make
    • Run the following command to make a training image and push the image to the image repository:
      make push
  3. Modify the image name in the v1/pytorch_job_mnist_gloo.yaml training configuration file based on your business requirements. A hypothetical sketch of the training script that the image packages is provided after this procedure. The following code shows the content of the file:
    apiVersion: "kubeflow.org/v1"
    kind: "PyTorchJob"
    metadata:
      name: "pytorch-dist-mnist-gloo"
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/pytorch-dist-mnist-test:0.1.5
                  args: ["--backend", "gloo"]
                  volumeMounts:
                    - mountPath: /train
                      name: training
                  # Comment out the below resources to use the CPU.
                  resources: 
                    limits:
                      nvidia.com/gpu: 1
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-pytorch-volume
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers: 
                - name: pytorch
                  image: datascience-registry.cn-beijing.cr.aliyuncs.com/kubeflow-examples/pytorch-dist-mnist-test:0.1.5
                  args: ["--backend", "gloo"]
                  volumeMounts:
                    - mountPath: /train
                      name: training
                  # Comment out the below resources to use the CPU.
                  resources: 
                    limits:
                      nvidia.com/gpu: 1
              volumes:
                - name: training
                  persistentVolumeClaim:
                    claimName: strategy-pytorch-volume
  4. Modify the pvc.yaml configuration file.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: strategy-pytorch-volume
      labels:
        app: strategy-pytorch-volume
    spec:
      storageClassName: "nfs-client"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
    Note You can increase the value of the storage parameter. For example, you can set this parameter to 50Gi.
  5. Run the following commands to create the persistent volume claim (PVC) and submit the training job:
    kubectl apply -f pvc.yaml
    kubectl apply -f ./v1/pytorch_job_mnist_gloo.yaml
  6. View the training status.
    1. Run the following command to query information about pods:
      kubectl get pods
      Information similar to the following output is returned:
      NAME                                      READY   STATUS      RESTARTS   AGE
      pytorch-dist-mnist-gloo-worker-0          0/1     Pending     0          11m
      Note If the status of all pods is Completed, the training ends.
    2. Run the following command to view log information:
      kubectl logs pytorch-dist-mnist-gloo-master-0
  7. Run the following command to go to the directory that contains the exported model file:
    cd /mnt/disk1/k8s_pv/default_storage_class/default-strategy-pytorch-volume-pvc-633b12ab-5f63-4e63-97b9-24c2b3de733c/
    Note Replace default-strategy-pytorch-volume-pvc-633b12ab-5f63-4e63-97b9-24c2b3de733c with an actual directory.
    Run the ll command. Information similar to the following output is returned:
    total 1688
    -rw-r--r-- 1 root root 1725813 Jul  6 15:35 mnist_cnn.pt
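
For reference, the following is a minimal, hypothetical sketch of the distributed training entry point that an image such as pytorch-dist-mnist-test typically packages. It assumes that the PyTorchJob operator sets the MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables, that the --backend argument in the YAML file selects the communication backend, and that the exported model is saved to the /train mount path, which is why mnist_cnn.pt appears in the PVC directory above. The model and data pipeline are placeholders, not the actual dsdemo code.

    import argparse

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--backend", default="gloo")
        args = parser.parse_args()

        # The PyTorchJob operator sets MASTER_ADDR, MASTER_PORT, RANK, and
        # WORLD_SIZE, so the default env:// initialization method works here.
        dist.init_process_group(backend=args.backend)

        # Simple CPU model as a placeholder for the CNN in the dsdemo example.
        model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
        model = nn.parallel.DistributedDataParallel(model)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        # Synthetic stand-in for the MNIST data pipeline.
        for step in range(100):
            data = torch.randn(64, 1, 28, 28)
            target = torch.randint(0, 10, (64,))
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            optimizer.step()

        # Only rank 0 writes the exported model to the mounted PVC.
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), "/train/mnist_cnn.pt")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()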