Arena is a lightweight client for managing Kubernetes-based machine learning tasks. It streamlines data preparation, model development, model training, and model prediction across the complete machine learning lifecycle, which improves the efficiency of data scientists. Arena is also deeply integrated with the basic services of Alibaba Cloud. It supports GPU sharing and Cloud Paralleled File System (CPFS), and it can run deep learning frameworks optimized by Alibaba Cloud. This helps maximize the performance and utilization of the heterogeneous computing resources provided by Alibaba Cloud.
Prerequisites
- A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK managed cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
- Nodes in the cluster can access the Internet.
Step 1: Install ack-arena
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and choose in the Actions column.
- Find ack-arena and click Install.
Step 2: Configure the Arena client
If you use a dedicated Kubernetes cluster, log on to a master node in the cluster by using SSH and run the arena command. For more information about how to log on to a node by using SSH, see Connect to the master nodes of a dedicated Kubernetes cluster by using SSH.
If you use a managed Kubernetes cluster, you must install the Arena client on your on-premises machine, such as a PC that runs macOS, because a managed Kubernetes cluster does not contain master nodes. Before you install the Arena client, make sure that the kubeconfig file is saved as $HOME/.kube/config. For more information, see Connect to ACK clusters by using kubectl. Then, perform the following steps to install and configure the Arena client:
- Run the kubectl get nodes command to check whether the configurations in the kubeconfig file are correct.
- Download the Arena client.
- Decompress the package.
- To install the Arena client on Linux, run the following command to decompress the package:
tar -xvf arena-installer-0.8.6-a2bec8c-linux-amd64.tar.gz
- To install the Arena client on macOS, run the following command to decompress the package:
tar -xvf arena-installer-0.8.6-a2bec8c-darwin-amd64.tar.gz
- Run the following command to install the Arena client:
cd arena-installer
bash install.sh --only-binary
- Optional: Install bash-completion. The auto completion feature of bash-completion can automatically fill in partially typed commands.
- Run the following command to install bash-completion on CentOS or Alibaba Cloud Linux 2:
yum install bash-completion -y
- Run the following command to install bash-completion on Debian or Ubuntu:
apt-get install bash-completion
- Run the following command to install bash-completion on macOS:
brew install bash-completion@2
- Run the following commands to add the auto completion feature to the profile file. Then, you can press Tab in a CLI to automatically complete a partially typed command.
echo "source <(arena completion bash)" >> ~/.bashrc
chmod u+x ~/.bashrc
- On macOS, run the following command to load the bash-completion script installed by Homebrew:
echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
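The platform-specific choices in the steps above can be wrapped in small shell helpers. The following is a minimal sketch, not part of the Arena installer. The function names arena_archive_name and kubeconfig_present are hypothetical, the version string is copied from the archive names above, and only the amd64 packages are assumed.

```shell
# Version string copied from the archive names used in the steps above.
ARENA_VERSION="0.8.6-a2bec8c"

# Print the installer archive name for a kernel name as reported by `uname -s`.
# Hypothetical helper; only the two amd64 packages named above are covered.
arena_archive_name() {
  case "$1" in
    Linux)  echo "arena-installer-${ARENA_VERSION}-linux-amd64.tar.gz" ;;
    Darwin) echo "arena-installer-${ARENA_VERSION}-darwin-amd64.tar.gz" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Succeed only if a non-empty kubeconfig file exists at the given path.
# Defaults to $HOME/.kube/config, the location the steps above expect.
kubeconfig_present() {
  [ -s "${1:-$HOME/.kube/config}" ]
}
```

With these defined, a command such as kubeconfig_present && tar -xvf "$(arena_archive_name "$(uname -s)")" would decompress the matching package only when a kubeconfig file is in place. The archive must already be downloaded to the current directory.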
Step 3: Test whether Arena works as expected
You can perform the following steps to check whether Arena works as expected:
- Run the following command to query the available GPU resources in the cluster:
arena top node
The output shows information about the nodes and GPUs. This indicates that Arena works as expected.
NAME                        IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.192.1xx.x.xx7  192.1xx.x.xx7  <none>  ready   8           0
cn-huhehaote.192.1xx.x.xx8  192.168.0.118  <none>  ready   8           0
cn-huhehaote.192.1xx.x.xx9  192.168.0.119  <none>  ready   8           0
cn-huhehaote.192.1xx.x.xx0  192.168.0.120  <none>  ready   8           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 0/32 (0%)
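To read the GPU totals in a script rather than by eye, the node rows can be summed with awk. This is a rough sketch that assumes the six-column layout of the sample output above, with GPU(Total) in column 5 and GPU(Allocated) in column 6. Other Arena versions may print different columns. The function name gpu_summary is hypothetical.

```shell
# Sum the GPU(Total) and GPU(Allocated) columns of `arena top node` output
# read from stdin and print "allocated/total".
# Assumes the six-column layout of the sample output; the header, separator,
# and summary lines are skipped because columns 5 and 6 are not plain integers.
gpu_summary() {
  awk 'NF == 6 && $5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
         total += $5
         used  += $6
       }
       END { printf "%d/%d\n", used, total }'
}
```

For example, arena top node | gpu_summary would print 0/32 for the sample output above.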
- Use Arena to submit a training job:
arena submit tf \
    --name=firstjob \
    --gpus=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tf-mnist-standalone:gpu \
    "python /app/main.py"
The output shows that the job is submitted:
configmap/firstjob-tfjob created
configmap/firstjob-tfjob labeled
tfjob.kubeflow.org/firstjob created
INFO The Job firstjob has been submitted successfully
INFO You can run `arena get firstjob --type tfjob` to check the job status
- Run the arena list command to query all jobs.
Expected output:
NAME      STATUS   TRAINER  AGE  NODE
firstjob  RUNNING  TFJOB    5s   192.1xx.x.xxx
- Run the following command to query the state of the submitted job:
arena get firstjob
Expected output:
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 52s

NAME      STATUS     TRAINER  AGE  INSTANCE          NODE
firstjob  SUCCEEDED  TFJOB    14m  firstjob-chief-0  192.168.0.118
- Run the following command to query the log of the job:
arena logs --tail=10 firstjob
Expected output:
Accuracy at step 910: 0.9694
Accuracy at step 920: 0.9687
Accuracy at step 930: 0.9676
Accuracy at step 940: 0.9678
Accuracy at step 950: 0.9704
Accuracy at step 960: 0.9692
Accuracy at step 970: 0.9721
Accuracy at step 980: 0.9696
Accuracy at step 990: 0.9675
Adding run metadata for 999
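The manual status check above can also be turned into a polling loop. The following is a minimal sketch under stated assumptions: it treats the first STATUS: line of the arena get output as the job status, as in the sample output above, and it stops on SUCCEEDED or FAILED. The function names job_status and wait_for_job and the ARENA_GET_CMD override are hypothetical and are not part of Arena.

```shell
# Print the job-level status from `arena get` output read on stdin.
# Assumes the first line whose first field is "STATUS:" carries the status.
job_status() {
  awk '$1 == "STATUS:" { print $2; exit }'
}

# Poll a job until it reaches SUCCEEDED or FAILED, or until the attempts run out.
# Usage: wait_for_job <job-name> [attempts] [interval-seconds]
# ARENA_GET_CMD is a hypothetical hook for substituting the status command,
# for example in tests without a cluster; by default `arena get` is called.
wait_for_job() {
  local name="$1" attempts="${2:-60}" interval="${3:-10}"
  local i=0 status
  while [ "$i" -lt "$attempts" ]; do
    status="$(${ARENA_GET_CMD:-arena get} "$name" | job_status)"
    case "$status" in
      SUCCEEDED) echo "$status"; return 0 ;;
      FAILED)    echo "$status"; return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "TIMEOUT"
  return 2
}
```

For example, wait_for_job firstjob 30 10 && arena logs --tail=10 firstjob would print the last log lines only after the job has succeeded, checking every 10 seconds for up to 30 attempts.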