Deep Learning Containers (DLC) allows you to train models in a Container Service for Kubernetes (ACK) cluster by using Arena. Before you submit a deep learning job, you must install the Arena client and configure the kubeconfig file of the cluster.

Background information

Arena is a Kubernetes-based command-line tool for machine learning. You can use Arena to submit, monitor, and manage deep learning jobs.

Step 1: Install the client

  1. Visit the Arena website and download the Arena installation package. For macOS, download arena-installer-xxx-xxx-darwin-amd64.tar.gz. For Linux, download arena-installer-xxx-xxx-linux-amd64.tar.gz.
  2. Run the following commands to install the client:
    tar -xvf arena-installer-xxx-xxx.tar.gz
    cd arena-installer
    sudo ./install.sh
    Replace arena-installer-xxx-xxx.tar.gz with the actual name of the package.
  3. Run the following command to check whether the client is installed:
    arena version

Step 2: Configure the kubeconfig file

If you want to connect to a remote ACK cluster and submit a job, you must include the configurations of the ACK cluster in $HOME/.kube/config.

  1. Log on to the Machine Learning Platform for AI (PAI) console.
  2. In the left-side navigation pane, choose Model Training > Deep Learning Model Training.
  3. On the Deep Learning Model Training page, click the name of your ACK cluster in the ACK Cluster Name column.
  4. On the Connection Information tab of the Cluster Information page, click Copy.
  5. Create the $HOME/.kube/config file on your on-premises machine and paste the copied configurations into the file.
    vim $HOME/.kube/config
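The kubeconfig setup above can be sketched as a short shell snippet that prepares the directory and backs up any existing configuration before you paste the copied content. This is a minimal sketch and assumes the default kubeconfig location:

```shell
# Create the kubeconfig directory if it does not already exist.
mkdir -p "$HOME/.kube"

# Back up any existing kubeconfig so the pasted configuration
# does not silently overwrite a working one.
if [ -f "$HOME/.kube/config" ]; then
    cp "$HOME/.kube/config" "$HOME/.kube/config.bak"
fi

# Then open the file and paste the configuration copied from the console:
# vim "$HOME/.kube/config"
```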

Step 3: Submit a model training job

  • Submit a TensorFlow job:
    1. Run one of the following commands to submit a TensorFlow job.
      # Method 1: Submit a TensorFlow job. 
      arena submit tfjob paraname
      
      # Method 2: Submit a TensorFlow job. 
      arena submit tf paraname
      Replace paraname with the actual parameters. You can run the arena submit tfjob --help command to query all supported parameters. The following parameters are required:
      • --name: the name of the job.
      • --image: the image that is used to launch pods for training deep learning models. The image must be supported by DLC. You can select a public image or a custom image based on the region, deep learning framework, Python version, and resource type of your DLC cluster. For more information, see Images.
      • --data: the directory where the source data is stored. The path must be in the format of Name of the persistent volume claim (PVC):Directory.
    2. You can use one of the following methods to view the log of a job:
      • Run the following command to view the log of a job:
        arena logs yourTaskName
        Replace yourTaskName with the name of the job.
      • For more information about how to view logs of training jobs in DLC Dashboard, see Manage jobs in DLC Dashboard.
  • Submit a PyTorch job:
    Run the following command to submit a PyTorch job.
    arena submit pytorch \
    --namespace=pai-dlc-system --name=ddptest \
    --gpus=8 --workers=2 \
    --image=registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:1.5-gpu-py3 \
    --data=pai-hangzhou-cpfs-pvc:/mnt/luci-cpfs/ \
    --working-dir=/mnt/luci-cpfs/luci-hangzhou/yanhao/centernet/ \
    "bash experiments/ctdet_coco_ddp.sh"
    Description of parameters in the preceding code:
    • --namespace: the DLC namespace.
    • --name: the name of the job.
    • --gpus: the number of graphics processing units (GPUs) that are allocated to each worker node.
    • --workers: the number of worker nodes.
    • --image: the name of the image to be used. We recommend that you use the registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:1.5-gpu-py3 image.
    • --data: the directory where the source data is stored. The path must be in the format of Name of the persistent volume claim (PVC):Directory.
    • --working-dir: the working directory where the program is executed.
    • experiments/ctdet_coco_ddp.sh: the script to be executed. Replace experiments/ctdet_coco_ddp.sh with the actual name of the script.
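The --data values above follow the "Name of the PVC:Directory" format. As a quick sanity check before you submit a job, the two parts can be split with plain shell parameter expansion. The following sketch uses the PVC name from the PyTorch example above:

```shell
# A --data value has the format "PVC name:Directory", for example:
DATA="pai-hangzhou-cpfs-pvc:/mnt/luci-cpfs/"

# Split it at the first colon into the PVC name and the mount directory.
PVC="${DATA%%:*}"   # pai-hangzhou-cpfs-pvc
DIR="${DATA#*:}"    # /mnt/luci-cpfs/

echo "PVC: $PVC"
echo "Directory: $DIR"
```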

Step 4: Manage jobs

  • You can run the following command to view the statuses of jobs:
    arena list -n pai-dlc-system
  • You can run the following command to view the logs of a job:
    arena logs -f ddptest -n pai-dlc-system
  • You can run the following command to delete a job:
    arena delete ddptest -n pai-dlc-system

Examples

  • Standalone job
    arena submit tf \
    --name=pai-deeplearning-test-oss \
    --image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --data=pai-deeplearning-oss:/training_dir/ \
    "python /training_dir/code/main.py --max_steps=10000 --data_dir=/training_dir/data/"
    • --name: the name of the job. After you submit the job, you can run the arena logs ${name} command to view the log of the job.
    • --image: the name of the image to be used. Example: registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2.
    • --data: the directory where the source data is stored. Example: pai-deeplearning-oss:/training_dir. pai-deeplearning-oss represents the PVC that is created for the ACK cluster. /training_dir represents a directory on a pod. Make sure that the specified PVC is mounted to /training_dir.
    • python /training_dir/code/main.py --max_steps=10000 --data_dir=/training_dir/data/: the command to be executed by the pod. /training_dir/code/ represents the directory where the source code is stored in the Object Storage Service (OSS) bucket. --max_steps and --data_dir correspond to the FLAGS.max_steps and FLAGS.data_dir parameters in main.py.
  • Distributed job
    arena submit tf \
    --name=pai-deeplearning-dist-test-nas \
    --workers=2 \
    --worker-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --ps=1 \
    --ps-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --data=pai-deeplearning-nas:/training_dir/ \
    "python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/"
    • --name: the name of the job. After you submit the job, you can run the arena logs ${name} command to view the log of the job.
    • --workers: the number of worker nodes.
    • --worker-image: the image for worker nodes.
    • --ps: the number of parameter server (PS) nodes.
    • --ps-image: the image for PS nodes.
    • --data: the directory where the source data is stored. Example: pai-deeplearning-nas:/training_dir/. pai-deeplearning-nas represents the PVC created for the ACK cluster. /training_dir represents a directory on a pod. Make sure that the specified PVC is mounted to /training_dir.
    • python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/: the command to be executed by the pod. /training_dir/code/dist-main.py represents the training script stored in Apsara File Storage NAS. --max_steps and --data_dir correspond to the FLAGS.max_steps and FLAGS.data_dir parameters in dist-main.py.
  • Distributed job (GPU-accelerated)
    arena submit tf \
    --name=pai-deeplearning-gpu-dist-test-oss \
    --gpus=1 \
    --workers=2 \
    --worker-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-gpu-py2 \
    --ps=1 \
    --ps-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --data=pai-deeplearning-oss:/training_dir/ \
    "python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/"
    • --name: the name of the job. After you submit the job, you can run the arena logs ${name} command to view the log of the job.
    • --gpus: the number of GPUs allocated to each worker node. The value of this parameter cannot be greater than the number of GPU-accelerated nodes in your ACK cluster.
    • --workers: the number of worker nodes.
    • --worker-image: the GPU image for worker nodes.
    • --ps: the number of PS nodes.
    • --ps-image: the CPU image for PS nodes.
    • --data: the directory where the source data is stored. Example: pai-deeplearning-oss:/training_dir/. pai-deeplearning-oss represents the PVC created for the ACK cluster. /training_dir represents a directory on a pod. Make sure that the specified PVC is mounted to /training_dir.
    • python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/: the command to be executed by the pod. /training_dir/code/dist-main.py represents the training script stored in the OSS bucket. --max_steps and --data_dir correspond to the FLAGS.max_steps and FLAGS.data_dir parameters in dist-main.py.
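Note that --gpus is a per-worker setting, so the GPU-accelerated example above requests gpus x workers devices in total. A small shell sketch using the values from that example:

```shell
# --gpus applies to each worker, so the cluster must provide
# gpus * workers devices in total for the job to be scheduled.
GPUS_PER_WORKER=1   # --gpus=1
WORKERS=2           # --workers=2
TOTAL_GPUS=$(( GPUS_PER_WORKER * WORKERS ))
echo "Total GPUs requested: $TOTAL_GPUS"
```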