Arena is a Kubernetes-based command-line tool for submitting and managing deep learning jobs. Deep Learning Containers (DLC) allows you to use Alibaba Cloud Container Service for Kubernetes (ACK) clusters to train models with Arena.

Step 1: Install the client

  1. Log on to the Arena website, and download the Arena installation package. For macOS, download arena-installer-xxx-xxx-darwin-amd64.tar.gz. For Linux, download arena-installer-xxx-xxx-linux-amd64.tar.gz.
  2. Run the following command to install the client:
    tar -xvf arena-installer-xxx-xxx.tar.gz
    cd arena-installer
    sudo ./install.sh
    Replace arena-installer-xxx-xxx.tar.gz with the actual name of the package.
  3. Run the following command to check whether the client is installed:
    arena version

Step 2: Configure KubeConfig

If you want to remotely connect to an ACK cluster and submit a job, you must include the KubeConfig of the ACK cluster in $HOME/.kube/config.

  1. Log on to the Machine Learning Platform for AI console.
  2. In the left-side navigation pane, choose Model Training > DLC-Cloud-native Deep Learning Model Training.
  3. On the DLC page, click the ID of your ACK cluster in the ACK Cluster ID/Name column.
  4. On the Connection Information tab of the Cluster Information page, click Copy.
  5. Create a local .kube/config file, and paste the copied configuration into the file.
    vim $HOME/.kube/config
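The pasted content follows the standard kubeconfig layout. The skeleton below is only a sketch of that layout; every angle-bracketed value is a placeholder that is filled in by the configuration you copied from the console, not something to type in by hand:

```yaml
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: https://<cluster-endpoint>          # API server address of the ACK cluster
    certificate-authority-data: <base64-ca>     # cluster CA certificate
  name: <cluster-name>
contexts:
- context:
    cluster: <cluster-name>
    user: <user-name>
  name: <context-name>
current-context: <context-name>                 # context used by arena and kubectl
users:
- name: <user-name>
  user:
    client-certificate-data: <base64-cert>      # client credentials for the cluster
    client-key-data: <base64-key>
```

If the file is valid, commands such as kubectl get nodes succeed against the ACK cluster, and Arena can submit jobs to it.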

Step 3: Submit a TensorFlow job

  1. Submit a TensorFlow job.
    You can run either of the following commands:
    • # Submit a TensorFlow job.
      arena submit tfjob paraname
    • # Submit a TensorFlow job.
      arena submit tf paraname
    Replace paraname with the desired parameters. You can run the arena submit tfjob --help command to query all supported parameters. The following parameters are required:
    • --name: the name of the job.
    • --image: the Docker image that is used to launch pods for training deep learning models. The image must be supported by DLC. You can select a public image or a custom image based on the region, deep learning framework, Python version, and resource type of your DLC cluster. For more information, see Images.
    • --data: the directory where the source data is stored. The path must be in the format of PVC name:directory.
  2. You can use one of the following methods to view the log of a job:
    • Run the following command:
      arena logs yourTaskName
      Replace yourTaskName with the actual name of the job.
    • For more information about how to view logs of training jobs in DLC Dashboard, see Manage jobs in DLC Dashboard.
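The three required flags map directly onto the final command line. The following Python sketch illustrates that mapping with a hypothetical helper (build_submit_command is not part of Arena; it only assembles the argument list shown in the examples below):

```python
import shlex

def build_submit_command(name, image, data, extra_args=None):
    """Assemble an `arena submit tfjob` command line from the required flags.

    name  -- job name (--name)
    image -- Docker image used to launch the training pods (--image)
    data  -- source data location in `PVC name:directory` format (--data)
    """
    cmd = [
        "arena", "submit", "tfjob",
        "--name=" + name,
        "--image=" + image,
        "--data=" + data,
    ]
    # Any additional flags, plus the quoted training command, go at the end.
    cmd.extend(extra_args or [])
    return cmd

if __name__ == "__main__":
    cmd = build_submit_command(
        name="my-tf-job",
        image="registry.example.com/tf-training:latest",  # placeholder image
        data="my-pvc:/training_dir/",
        extra_args=["python /training_dir/code/main.py"],
    )
    print(shlex.join(cmd))
```

Running the sketch prints a command line of the same shape as the examples in the next section.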

Examples

  • Standalone job
    arena submit tf \
    --name=pai-deeplearning-test-oss \
    --image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --data=pai-deeplearning-oss:/training_dir/ \
    "python /training_dir/code/main.py --max_steps=10000 --data_dir=/training_dir/data/"
    • --name: the name of the job. After you submit the job, you can run the arena logs ${name} command to view the log of the job.
    • --image: the path of the image. Example: registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2.
    • --data: the directory where the source data is stored. Example: pai-deeplearning-oss:/training_dir. pai-deeplearning-oss represents the persistent volume claim (PVC) created for the ACK cluster. /training_dir represents a directory on a pod. Make sure that the specified PVC is mounted to /training_dir.
    • python /training_dir/code/main.py --max_steps=10000 --data_dir=/training_dir/data/: the command to be executed by the pod. /training_dir/code/ represents the directory where the source code is stored in the Object Storage Service (OSS) bucket. --max_steps and --data_dir correspond to the FLAGS.max_steps and FLAGS.data_dir parameters in main.py.
  • Distributed job
    arena submit tf \
    --name=pai-deeplearning-dist-test-nas \
    --workers=2 \
    --worker-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --ps=1 \
    --ps-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --data=pai-deeplearning-nas:/training_dir/ \
    "python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/"
    • --name: the name of the job. After you submit the job, you can run the arena logs ${name} command to view the log of the job.
    • --workers: the number of worker nodes.
    • --worker-image: the path of the image for worker nodes.
    • --ps: the number of PS nodes.
    • --ps-image: the path of the image for PS nodes.
    • --data: the directory where the source data is stored. Example: pai-deeplearning-nas:/training_dir/. pai-deeplearning-nas represents the PVC created for the ACK cluster. /training_dir represents a directory on a pod. Make sure that the specified PVC is mounted to /training_dir.
    • python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/: the command to be executed by the pod. /training_dir/code/dist-main.py represents the path of the source code in the NAS file system. --max_steps and --data_dir correspond to the FLAGS.max_steps and FLAGS.data_dir parameters in dist-main.py.
  • Distributed job (GPU-accelerated)
    arena submit tf \
    --name=pai-deeplearning-gpu-dist-test-oss \
    --gpus=1 \
    --workers=2 \
    --worker-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-gpu-py2 \
    --ps=1 \
    --ps-image=registry.cn-shanghai.aliyuncs.com/pai-dlc/pai-tensorflow-training:1.12-cpu-py2 \
    --data=pai-deeplearning-oss:/training_dir/ \
    "python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/"
    • --name: the name of the job. After you submit the job, you can run the arena logs ${name} command to view the log of the job.
    • --gpus: the number of GPUs allocated to each worker node. The value of this parameter must not be greater than the number of GPU-accelerated nodes in your ACK cluster.
    • --workers: the number of worker nodes.
    • --worker-image: the path of the GPU image for worker nodes.
    • --ps: the number of PS nodes.
    • --ps-image: the path of the CPU image for PS nodes.
    • --data: the directory where the source data is stored. Example: pai-deeplearning-oss:/training_dir/. pai-deeplearning-oss represents the PVC created for the ACK cluster. /training_dir represents a directory on a pod. Make sure that the specified PVC is mounted to /training_dir.
    • python /training_dir/code/dist-main.py --max_steps=10000 --data_dir=/training_dir/data/: the command to be executed by the pod. /training_dir/code/dist-main.py represents the path of the source code in the OSS bucket. --max_steps and --data_dir correspond to the FLAGS.max_steps and FLAGS.data_dir parameters in dist-main.py.
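In all three examples, the training script reads max_steps and data_dir from command-line flags. The original scripts most likely define these with tf.app.flags; the sketch below reproduces the same flag-to-variable mapping using only the standard library (argparse), so it can be read without a TensorFlow installation:

```python
import argparse

# Stand-in for the FLAGS definitions in main.py / dist-main.py.
# The real scripts likely use tf.app.flags; argparse behaves the same
# way for these two flags.
parser = argparse.ArgumentParser(description="Minimal flag-parsing sketch")
parser.add_argument("--max_steps", type=int, default=1000,
                    help="number of training steps to run")
parser.add_argument("--data_dir", type=str, default="/tmp/data",
                    help="directory that holds the training data")

def parse_flags(argv=None):
    """Parse the flags exactly as they appear in the arena submit command."""
    return parser.parse_args(argv)

if __name__ == "__main__":
    FLAGS = parse_flags()
    print("max_steps =", FLAGS.max_steps)
    print("data_dir  =", FLAGS.data_dir)
```

With the flags from the examples, python sketch.py --max_steps=10000 --data_dir=/training_dir/data/ yields FLAGS.max_steps == 10000 and FLAGS.data_dir == "/training_dir/data/".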