Arena is a lightweight CLI for managing machine learning tasks on Kubernetes. It covers the full ML workflow—data preparation, model development, training, and prediction—improving the work efficiency of data scientists. Arena runs in deep learning frameworks optimized by Alibaba Cloud and integrates with Alibaba Cloud services to maximize the performance and utilization of heterogeneous computing resources through GPU sharing and Cloud Paralleled File System (CPFS) support.
Prerequisites
Before you begin, make sure you have:
-
A Container Service for Kubernetes (ACK) cluster with GPU-accelerated nodes. See Create an ACK cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
-
Internet access from the cluster nodes.
-
The Arena component installed. See Deploy the cloud-native AI suite.
Step 1: Connect to the cluster
Connection method depends on your cluster type.
ACK managed clusters
ACK managed clusters have no master nodes. Install the Arena client on your local machine (for example, a macOS computer) and connect via kubeconfig. See Connect to a cluster using kubectl.
ACK dedicated clusters
Use SSH to log in to a master node, then run Arena commands directly on that node. See Use SSH to connect to master nodes of an ACK dedicated cluster.
Run kubectl get nodes to confirm your kubeconfig is configured correctly.
Step 2: Install the Arena client
Choose the package that matches your operating system and processor architecture, then run the commands to download, extract, and install.
Linux/amd64
# Download the Arena installation package.
wget https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/arena/arena-installer-0.12.0-linux-amd64.tar.gz
# Extract the package.
tar -zxvf arena-installer-0.12.0-linux-amd64.tar.gz
# Install Arena.
cd arena-installer-0.12.0-linux-amd64
bash install.sh --only-binary
Linux/arm64
# Download the Arena installation package.
wget https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/arena/arena-installer-0.12.0-linux-arm64.tar.gz
# Extract the package.
tar -zxvf arena-installer-0.12.0-linux-arm64.tar.gz
# Install Arena.
cd arena-installer-0.12.0-linux-arm64
bash install.sh --only-binary
macOS/amd64
# Download the Arena installation package.
wget https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/arena/arena-installer-0.12.0-darwin-amd64.tar.gz
# Extract the package.
tar -zxvf arena-installer-0.12.0-darwin-amd64.tar.gz
# Install Arena.
cd arena-installer-0.12.0-darwin-amd64
bash install.sh --only-binary
macOS/arm64
# Download the Arena installation package.
wget https://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/arena/arena-installer-0.12.0-darwin-arm64.tar.gz
# Extract the package.
tar -zxvf arena-installer-0.12.0-darwin-arm64.tar.gz
# Install Arena.
cd arena-installer-0.12.0-darwin-arm64
bash install.sh --only-binary
Step 3: (Optional) Enable shell auto-completion
Shell auto-completion lets you press Tab to complete partially typed Arena commands.
Install the completion package
CentOS or Linux
sudo yum install bash-completion -y
Debian or Ubuntu
sudo apt-get install bash-completion
macOS
brew install bash-completion@2
Enable auto-completion in your shell profile
bash (Linux)
echo "source <(arena completion bash)" >> ~/.bashrc
chmod u+x ~/.bashrc
bash (macOS)
echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
Step 4: Verify the installation
Run the following commands to confirm Arena is working.
-
Check available GPU resources in the cluster:
arena top nodeThe output lists nodes and their GPU capacity. For example:
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated) cn-huhehaote.192.168.X.XXX 192.168.0.117 <none> ready 8 0 cn-huhehaote.192.168.X.XXX 192.168.0.118 <none> ready 8 0 cn-huhehaote.192.168.X.XXX 192.168.0.119 <none> ready 8 0 cn-huhehaote.192.169.X.XXX 192.168.0.120 <none> ready 8 0 ----------------------------------------------------------------------------------------- Allocated/Total GPUs In Cluster: 0/32 (0%) -
Submit a test training job:
arena submit tf \ --name=firstjob \ --gpus=1 \ --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tf-mnist-standalone:gpu \ "python /app/main.py"Expected output:
configmap/firstjob-tfjob created configmap/firstjob-tfjob labeled tfjob.kubeflow.org/firstjob created INFO[0001] The Job firstjob has been submitted successfully INFO[0001] You can run `arena get firstjob --type tfjob` to check the job status -
List all jobs:
arena listExpected output:
NAME STATUS TRAINER AGE NODE firstjob RUNNING TFJOB 5s 192.168.X.XXX -
Check the job status:
arena get firstjobExpected output:
STATUS: SUCCEEDED NAMESPACE: default PRIORITY: N/A TRAINING DURATION: 52s NAME STATUS TRAINER AGE INSTANCE NODE firstjob SUCCEEDED TFJOB 14m firstjob-chief-0 192.168.X.XXX -
View the job logs:
arena logs --tail=10 firstjobExpected output:
Accuracy at step 910: 0.9694 Accuracy at step 920: 0.9687 Accuracy at step 930: 0.9676 Accuracy at step 940: 0.9678 Accuracy at step 950: 0.9704 Accuracy at step 960: 0.9692 Accuracy at step 970: 0.9721 Accuracy at step 980: 0.9696 Accuracy at step 990: 0.9675 Adding run metadata for 999The job completed successfully.