
Container Service for Kubernetes:Cloud-native AI suite developer guide

Last Updated: Mar 26, 2026

This guide walks through an end-to-end deep learning workflow on Container Service for Kubernetes (ACK) using the open-source Fashion-MNIST dataset — from dataset preparation and model development through standalone and distributed training, training acceleration, model evaluation, and inference deployment.

Background

The cloud-native AI component set is a collection of components deployed independently via Helm charts. It supports two roles:

  • Administrators manage users and permissions, allocate cluster resources, configure storage, manage datasets, and monitor resource utilization.

  • Developers submit jobs and use cluster resources. Developers must be created by an administrator and granted permissions before they can use tools such as Arena or Jupyter Notebook.

The following table describes each component and its role in the workflow:

Component Role
AI Dashboard Admin control plane — manage datasets and monitor resources
AI Developer Console Developer portal — create notebooks, submit jobs, and manage models
Arena CLI for submitting and monitoring training and inference jobs
Fluid Data caching layer — accelerates dataset reads for training jobs
AI job scheduler GPU topology-aware scheduling — reduces distributed training time

Prerequisites

Before you begin, make sure the following are in place.

Cluster (completed by an administrator): an ACK cluster with the cloud-native AI suite installed.

The AI Console (AI Dashboard and AI Developer Console) was rolled out via a whitelist starting January 22, 2025. Existing deployments before this date are unaffected. If you are not whitelisted for a new installation, configure AI Console via the open-source community. See Open-source AI Console.

Dataset and credentials:

  • The Fashion-MNIST dataset downloaded and uploaded to an Object Storage Service (OSS) bucket. See Upload objects.

  • The address, username, and password of the Git repository that stores the training code.

Tooling: the Arena client and kubectl, available from a terminal in a Jupyter Notebook created in AI Developer Console.

Test environment

The cluster used in this guide has the following nodes:

Host name IP Role GPUs vCPUs Memory
cn-beijing.192.168.0.13 192.168.0.13 Jump server 1 8 30580004 KiB
cn-beijing.192.168.0.16 192.168.0.16 Worker 1 8 30580004 KiB
cn-beijing.192.168.0.17 192.168.0.17 Worker 1 8 30580004 KiB
cn-beijing.192.168.0.240 192.168.0.240 Worker 1 8 30580004 KiB
cn-beijing.192.168.0.239 192.168.0.239 Worker 1 8 30580004 KiB

Submit Arena commands from a Jupyter Notebook terminal, not from the jump server directly.

What this guide covers

Step Task Role
Step 1: Create a user and allocate resources Create a user and allocate resources Admin
Step 2: Create a dataset Create and accelerate a dataset Admin
Step 3: Develop a model Develop a model in Jupyter Notebook Developer
Step 4: Train the model Submit standalone and distributed training jobs Developer
Step 5: Manage the model Register the trained model Developer
Step 6: Evaluate the model Evaluate the model Developer
Step 7: Deploy the model as an inference service Deploy an inference service Developer

Step 1: Create a user and allocate resources

Role: Admin

Before developers can submit jobs, the administrator must create a developer account in AI Dashboard, grant it the required permissions, and allocate cluster resources to it.

Step 2: Create a dataset

Role: Admin

Add the Fashion-MNIST dataset

Create a persistent volume (PV) and persistent volume claim (PVC) to mount the OSS bucket that stores the Fashion-MNIST dataset.

  1. Create a file named fashion-mnist.yaml with the following content. Replace AKID and AKSECRET with your OSS access credentials.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: fashion-demo-pv
    spec:
      accessModes:
      - ReadWriteMany
      capacity:
        storage: 10Gi
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeAttributes:
          bucket: fashion-mnist
          otherOpts: ""
          url: oss-cn-beijing.aliyuncs.com
          akId: "AKID"
          akSecret: "AKSECRET"
        volumeHandle: fashion-demo-pv
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: fashion-demo-pvc
      namespace: demo-ns
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: fashion-demo-pv
      storageClassName: oss
      volumeMode: Filesystem
      volumeName: fashion-demo-pv
  2. Apply the manifest:

    kubectl create -f fashion-mnist.yaml
  3. Verify that the PV and PVC are in the Bound state. Check the PV:

    kubectl get pv fashion-demo-pv

    Expected output:

    NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   REASON   AGE
    fashion-demo-pv   10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc   oss                     8h

    Check the PVC:

    kubectl get pvc fashion-demo-pvc -n demo-ns

    Expected output:

    NAME               STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    fashion-demo-pvc   Bound    fashion-demo-pv   10Gi       RWX            oss            8h

    Both resources should show Bound.

Accelerate the dataset

Accelerate the dataset with Fluid via AI Dashboard so that training jobs read data from a local cache rather than from OSS directly.

  1. Access AI Dashboard as an administrator.

  2. In the left-side navigation pane, choose Dataset > Dataset List.

  3. Find the dataset and click Accelerate in the Operator column.

    Accelerate the dataset

Step 3: Develop a model

Role: Developer

Use Jupyter Notebook to develop and test the model, then submit training code to a Git repository.

(Optional) Build a custom image

AI Developer Console provides built-in TensorFlow and PyTorch images. To use a custom image instead:

  1. Create a dockerfile with the following content:

    FROM tensorflow/tensorflow:1.15.5-gpu
    USER root
    RUN pip install jupyter && \
        pip install ipywidgets && \
        jupyter nbextension enable --py widgetsnbextension && \
        pip install jupyterlab && jupyter serverextension enable --py jupyterlab
    EXPOSE 8888
    CMD ["sh", "-c", "jupyter-lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX} --ServerApp.authenticate_prometheus=False"]

    For limits on custom images, see Create and use notebooks.

  2. Build the image:

    docker build -f dockerfile .

    Expected output (abbreviated):

    Sending build context to Docker daemon  9.216kB
    Step 1/5 : FROM tensorflow/tensorflow:1.15.5-gpu
     ---> 73be11373498
    ...
    Successfully built 3692f04626d5
  3. Tag and push the image to your container registry:

    docker tag ${IMAGE_ID} registry-vpc.cn-beijing.aliyuncs.com/${DOCKER_REPO}/jupyter:fashion-mnist-20210802a
    docker push registry-vpc.cn-beijing.aliyuncs.com/${DOCKER_REPO}/jupyter:fashion-mnist-20210802a
  4. Create a Secret to pull the image from the container registry. See Create a Secret based on existing Docker credentials.

    kubectl create secret docker-registry regcred \
      --docker-server=<your-registry-server> \
      --docker-username=<username> \
      --docker-password=<password> \
      --docker-email=<your-email>
  5. Create a Jupyter Notebook in AI Developer Console using the custom image. See Create and use notebooks.

    Create a Jupyter notebook

Develop and test the model

  1. Log on to AI Developer Console.

  2. In the left-side navigation pane, click Notebook.

  3. On the Notebook page, click the notebook in the Running state.

  4. Open a CLI launcher and verify the dataset is mounted:

    pwd
    /root/data
    ls -alh

    Expected output:

    total 30M
    drwx------ 1 root root    0 Jan  1  1970 .
    drwx------ 1 root root 4.0K Aug  2 04:15 ..
    drwxr-xr-x 1 root root    0 Aug  1 14:16 saved_model
    -rw-r----- 1 root root 4.3M Aug  1 01:53 t10k-images-idx3-ubyte.gz
    -rw-r----- 1 root root 5.1K Aug  1 01:53 t10k-labels-idx1-ubyte.gz
    -rw-r----- 1 root root  26M Aug  1 01:54 train-images-idx3-ubyte.gz
    -rw-r----- 1 root root  29K Aug  1 01:53 train-labels-idx1-ubyte.gz
  5. Create a notebook cell with the following training code. Set dataset_path to the mounted dataset directory and model_path to the output directory.

    Important

    Replace dataset_path and model_path with the actual paths in your cluster.

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    
    import os
    import gzip
    import numpy as np
    import tensorflow as tf
    from tensorflow import keras
    print('TensorFlow version: {}'.format(tf.__version__))
    dataset_path = "/root/data/"
    model_path = "./model/"
    model_version =  "v1"
    
    def load_data():
        files = [
            'train-labels-idx1-ubyte.gz',
            'train-images-idx3-ubyte.gz',
            't10k-labels-idx1-ubyte.gz',
            't10k-images-idx3-ubyte.gz'
        ]
        paths = []
        for fname in files:
            paths.append(os.path.join(dataset_path, fname))
        with gzip.open(paths[0], 'rb') as labelpath:
            y_train = np.frombuffer(labelpath.read(), np.uint8, offset=8)
        with gzip.open(paths[1], 'rb') as imgpath:
            x_train = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
        with gzip.open(paths[2], 'rb') as labelpath:
            y_test = np.frombuffer(labelpath.read(), np.uint8, offset=8)
        with gzip.open(paths[3], 'rb') as imgpath:
            x_test = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
        return (x_train, y_train),(x_test, y_test)
    
    def train():
        (train_images, train_labels), (test_images, test_labels) = load_data()
    
        # Normalize pixel values to [0.0, 1.0]
        train_images = train_images / 255.0
        test_images = test_images / 255.0
    
        # Reshape for CNN input
        train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
        test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)
    
        class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
        print('\ntrain_images.shape: {}, of {}'.format(train_images.shape, train_images.dtype))
        print('test_images.shape: {}, of {}'.format(test_images.shape, test_images.dtype))
    
        model = keras.Sequential([
        keras.layers.Conv2D(input_shape=(28,28,1), filters=8, kernel_size=3,
                            strides=2, activation='relu', name='Conv1'),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation=tf.nn.softmax, name='Softmax')
        ])
        model.summary()
        epochs = 5
        model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
        logdir = "/training_logs"
        tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
        model.fit(train_images,
            train_labels,
            epochs=epochs,
            callbacks=[tensorboard_callback],
        )
        test_loss, test_acc = model.evaluate(test_images, test_labels)
        print('\nTest accuracy: {}'.format(test_acc))
        export_path = os.path.join(model_path, model_version)
        print('export_path = {}\n'.format(export_path))
        tf.keras.models.save_model(
            model,
            export_path,
            overwrite=True,
            include_optimizer=True,
            save_format=None,
            signatures=None,
            options=None
        )
        print('\nSaved model success')

    if __name__ == '__main__':
        train()
  6. Click the Execute icon icon to run the cell. Expected output (5 epochs, test accuracy ~86.7%):

    TensorFlow version: 1.15.5
    
    train_images.shape: (60000, 28, 28, 1), of float64
    test_images.shape: (10000, 28, 28, 1), of float64
    Model: "sequential_2"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    Conv1 (Conv2D)               (None, 13, 13, 8)         80
    _________________________________________________________________
    flatten_2 (Flatten)          (None, 1352)              0
    _________________________________________________________________
    Softmax (Dense)              (None, 10)                13530
    =================================================================
    Total params: 13,610
    Trainable params: 13,610
    Non-trainable params: 0
    _________________________________________________________________
    Train on 60000 samples
    Epoch 1/5
    60000/60000 [==============================] - 3s 57us/sample - loss: 0.5452 - acc: 0.8102
    Epoch 2/5
    60000/60000 [==============================] - 3s 52us/sample - loss: 0.4103 - acc: 0.8555
    Epoch 3/5
    60000/60000 [==============================] - 3s 55us/sample - loss: 0.3750 - acc: 0.8681
    Epoch 4/5
    60000/60000 [==============================] - 3s 55us/sample - loss: 0.3524 - acc: 0.8757
    Epoch 5/5
    60000/60000 [==============================] - 3s 53us/sample - loss: 0.3368 - acc: 0.8798
    10000/10000 [==============================] - 0s 37us/sample - loss: 0.3770 - acc: 0.8673
    
    Test accuracy: 0.8672999739646912
    export_path = ./model/v1
    
    Saved model success
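The magic offsets in load_data (8 for label files, 16 for image files) come from the IDX format headers: a label file starts with a magic number and an item count, while an image file additionally carries the row and column counts. A minimal sketch that builds a tiny gzipped IDX pair in memory with synthetic contents and parses it with the same frombuffer logic:

```python
import gzip
import io
import struct

import numpy as np

# Build a tiny gzipped IDX label file (8-byte header) and image file
# (16-byte header) in memory, then parse them the same way load_data() does.
n, rows, cols = 3, 28, 28

# Header fields are big-endian: magic number, item count (+ rows, cols for images).
labels_gz = gzip.compress(struct.pack(">II", 0x00000801, n) + bytes([9, 0, 5]))
images_gz = gzip.compress(
    struct.pack(">IIII", 0x00000803, n, rows, cols) + bytes(n * rows * cols))

# Same parsing logic as the notebook cell: skip the header, then frombuffer.
with gzip.open(io.BytesIO(labels_gz), "rb") as f:
    y = np.frombuffer(f.read(), np.uint8, offset=8)
with gzip.open(io.BytesIO(images_gz), "rb") as f:
    x = np.frombuffer(f.read(), np.uint8, offset=16).reshape(len(y), rows, cols)

print(y.tolist())   # [9, 0, 5]
print(x.shape)      # (3, 28, 28)
```

The same offsets work for the real Fashion-MNIST files, whose only difference is the item count.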

Push code to a Git repository

  1. Install Git:

    apt-get update
    apt-get install git
  2. Configure Git credentials:

    git config --global credential.helper store
    git pull ${YOUR_GIT_REPO}
  3. Push the code:

    git push origin fashion-test

    Expected output:

    Total 0 (delta 0), reused 0 (delta 0)
    To codeup.aliyun.com:60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git
     * [new branch]      fashion-test -> fashion-test

Submit a training job via the Arena SDK

Instead of running training inside the notebook, you can use the Arena SDK to submit a TFJob to the cluster.

  1. Install the SDK dependency:

    !pip install coloredlogs
  2. Run the following code in a notebook cell. Replace the Git repository URL and credentials with your own values.

    • namespace: The job is submitted to the demo-ns namespace.

    • with_sync_source: The Git repository URL.

    • with_envs: The Git repository username and password.

    import os
    import sys
    import time
    from arenasdk.client.client import ArenaClient
    from arenasdk.enums.types import *
    from arenasdk.exceptions.arena_exception import *
    from arenasdk.training.tensorflow_job_builder import *
    from arenasdk.logger.logger import LoggerBuilder
    
    def main():
        print("start to test arena-python-sdk")
        # Submit the job to the demo-ns namespace
        client = ArenaClient("","demo-ns","info","arena-system")
        print("create ArenaClient succeed.")
        print("start to create tfjob")
        job_name = "arena-sdk-distributed-test"
        job_type = TrainingJobType.TFTrainingJob
        try:
            job =  TensorflowJobBuilder().with_name(job_name)\
                .witch_workers(1)\
                .with_gpus(1)\
                .witch_worker_image("tensorflow/tensorflow:1.5.0-devel-gpu")\
                .witch_ps_image("tensorflow/tensorflow:1.5.0-devel")\
                .witch_ps_count(1)\
                .with_datas({"fashion-demo-pvc":"/data"})\
                .enable_tensorboard()\
                .with_sync_mode("git")\
                .with_sync_source("https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git")\
                .with_envs({\
                    "GIT_SYNC_USERNAME":"USERNAME", \
                    "GIT_SYNC_PASSWORD":"PASSWORD",\
                    "TEST_TMPDIR":"/",\
                })\
                .with_command("python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py").build()
            if client.training().get(job_name, job_type):
                print("the job {} has been created, to delete it".format(job_name))
                client.training().delete(job_name, job_type)
                time.sleep(3)
    
            output = client.training().submit(job)
            print(output)
    
            count = 0
            while True:
                if count > 160:
                    raise Exception("timeout for waiting job to be running")
                jobInfo = client.training().get(job_name,job_type)
                if jobInfo.get_status() == TrainingJobStatus.TrainingJobPending:
                    print("job status is PENDING,waiting...")
                    count = count + 1
                    time.sleep(5)
                    continue
                print("current status is {} of job {}".format(jobInfo.get_status().value,job_name))
                break
            logger = LoggerBuilder().with_accepter(sys.stdout).with_follow().with_since("5m")
            print(str(jobInfo))
        except ArenaException as e:
            print(e)
    
    main()


  3. Click the Execute icon icon to submit the job. When the job reaches RUNNING state, the output includes job details:

    current status is RUNNING of job arena-sdk-distributed-test
    {
        "allocated_gpus": 1,
        "chief_name": "arena-sdk-distributed-test-worker-0",
        "duration": "185s",
        "name": "arena-sdk-distributed-test",
        "namespace": "demo-ns",
        "request_gpus": 1,
        "tensorboard": "http://192.168.5.6:31068",
        "type": "tfjob"
    }
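The status loop in the SDK example is a generic poll-with-timeout pattern: query the job, keep waiting while it is PENDING, and give up after a fixed number of polls. A standalone sketch of the same pattern, where wait_until and the states iterator are illustrative stand-ins for client.training().get(...):

```python
import time

def wait_until(get_status, interval=0.01, max_polls=160):
    """Poll get_status() until it returns a non-PENDING status;
    raise TimeoutError if max_polls is exceeded."""
    for _ in range(max_polls):
        status = get_status()
        if status != "PENDING":
            return status
        time.sleep(interval)
    raise TimeoutError("timeout waiting for job to leave PENDING")

# Stand-in for client.training().get(...): PENDING twice, then RUNNING.
states = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_until(lambda: next(states)))  # RUNNING
```

The SDK example uses a 5-second interval and 160 polls, i.e. it waits up to roughly 13 minutes for the job to be scheduled.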

Step 4: Train the model

Role: Developer

The following four examples cover standalone training, distributed training, Fluid-accelerated training, and topology-aware GPU scheduling.

Example 1: Standalone TensorFlow training job

Method 1: Arena CLI

arena \
  submit \
  tfjob \
  -n ns1 \
  --name=fashion-mnist-arena \
  --data=fashion-mnist-jackwg-pvc:/root/data/ \
  --env=DATASET_PATH=/root/data/ \
  --env=MODEL_PATH=/root/saved_model \
  --env=MODEL_VERSION=1 \
  --env=GIT_SYNC_USERNAME=<GIT_USERNAME> \
  --env=GIT_SYNC_PASSWORD=<GIT_PASSWORD> \
  --sync-mode=git \
  --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
  --image="tensorflow/tensorflow:2.2.2-gpu" \
  "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"

Method 2: AI Developer Console

  1. Configure the data source. See Configure a dataset.

    Parameter Example Required
    Name fashion-demo Yes
    Namespace demo-ns Yes
    PersistentVolumeClaim fashion-demo-pvc Yes
    Local Directory /root/data No

    Configure a dataset

  2. Configure the source code repository. See Configure a source code repository.

    Parameter Example Required
    Name fashion-git Yes
    Git Repository https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git Yes
    Default Branch master No
    Local Directory /root/ No
    Git user Your Git username No
    Git secret Your Git password No

    Configure the source code

  3. Submit the job. See Submit a TensorFlow training job. For the Arena CLI equivalent, see Use Arena to submit a TensorFlow training job. Key parameters for this example:

    Parameter Value
    Job Name fashion-tf-ui
    Job Type TF Stand-alone
    Namespace demo-ns
    Data Configuration fashion-demo
    Code Configuration fashion-git
    Code branch master
    Execution Command "export DATASET_PATH=/root/data/ && export MODEL_PATH=/root/saved_model && export MODEL_VERSION=1 && python /root/code/tensorflow-fashion-mnist-sample/train.py"
    Instances Count 1 (default)
    Image tensorflow/tensorflow:2.2.2-gpu
    CPU (Cores) 4 (default)
    Memory (GB) 8 (default)

    Submit a standalone training job

  4. View the job log. In the left-side navigation pane, click Job List, click the job name, then on the Instances tab click Log in the Operator column. The log shows 5 training epochs with a final test accuracy of approximately 87.3%:

    Epoch 5/5
    1875/1875 [==============================] - 3s 2ms/step - loss: 0.3351 - accuracy: 0.8816
    313/313 [==============================] - 0s 1ms/step - loss: 0.3595 - accuracy: 0.8733
    
    Test accuracy: 0.8733000159263611
    export_path = /root/saved_model/1
    
    Saved model success
  5. View training metrics on TensorBoard. Get the TensorBoard Service IP:

    kubectl get svc -n demo-ns

    Expected output:

    NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
    tf-dist-arena-tensorboard   NodePort    172.16.XX.XX     <none>        6006:32226/TCP        80m

    Forward the port to your local machine:

    kubectl port-forward svc/tf-dist-arena-tensorboard -n demo-ns 6006:6006

    Open http://localhost:6006/ in your browser.

    Tensorboard

Example 2: Distributed TensorFlow training job

Method 1: Arena CLI

arena submit tf \
    -n demo-ns \
    --name=tf-dist-arena \
    --working-dir=/root/ \
    --data fashion-demo-pvc:/data \
    --env=TEST_TMPDIR=/ \
    --env=GIT_SYNC_USERNAME=${GIT_USERNAME} \
    --env=GIT_SYNC_PASSWORD=${GIT_PASSWORD} \
    --env=GIT_SYNC_BRANCH=master \
    --gpus=1 \
    --workers=2 \
    --worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
    --sync-mode=git \
    --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
    --ps=1 \
    --ps-image=tensorflow/tensorflow:1.5.0-devel \
    --tensorboard \
    "python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py --log_dir=/training_logs"

After the job starts, access TensorBoard the same way as in Example 1:

  1. Get the Service IP: kubectl get svc -n demo-ns

  2. Forward the port: kubectl port-forward svc/tf-dist-arena-tensorboard -n demo-ns 6006:6006

  3. Open http://localhost:6006/ in your browser.

    View data on TensorBoard

Method 2: AI Developer Console

Reuse the data source (fashion-demo) and source code (fashion-git) configured in Example 1. Key differences in the job configuration:

Submit a distributed TensorFlow training job
Parameter Value
Job Name fashion-ps-ui
Job Type TF Distributed
Namespace demo-ns
Execution Command "export TEST_TMPDIR=/root/ && python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py --log_dir=/training_logs"
Image (Worker tab) tensorflow/tensorflow:1.5.0-devel-gpu
Image (PS tab) tensorflow/tensorflow:1.5.0-devel

For Arena CLI reference, see Use Arena to submit a TensorFlow training job.

Example 3: Fluid-accelerated training job

Fluid caches the OSS dataset locally on cluster nodes, reducing training time from 3 minutes to 33 seconds — a 5.5x speedup — with no code changes.

If you already accelerated the dataset in Step 2, skip the acceleration step. Otherwise, see Create an accelerated dataset based on OSS.

Submit a training job that reads from the accelerated PVC (fashion-demo-pvc-acc):

arena \
  submit \
  tfjob \
  -n demo-ns \
  --name=fashion-mnist-fluid \
  --data=fashion-demo-pvc-acc:/root/data/ \
  --env=DATASET_PATH=/root/data/fashion-demo-pvc-acc \
  --env=MODEL_PATH=/root/saved_model \
  --env=MODEL_VERSION=1 \
  --env=GIT_SYNC_USERNAME=${GIT_USERNAME} \
  --env=GIT_SYNC_PASSWORD=${GIT_PASSWORD} \
  --sync-mode=git \
  --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
  --image="tensorflow/tensorflow:2.2.2-gpu" \
  "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"

The key difference from a regular job: --data=fashion-demo-pvc-acc:/root/data/ points to the Fluid-accelerated PVC, and DATASET_PATH includes the PVC name as a subdirectory.

Compare both jobs after they complete:

arena list -n demo-ns

Expected output:

NAME                 STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
fashion-mnist-fluid  SUCCEEDED  TFJOB    33s       0               N/A             192.168.5.7
fashion-mnist-arena  SUCCEEDED  TFJOB    3m        0               N/A             192.168.5.8

Both jobs run the same code on the same node. The Fluid-accelerated job completes in 33 seconds vs. 3 minutes for the regular job.
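The 5.5x figure follows directly from the DURATION column. A quick check, where duration_to_seconds is an illustrative helper (an assumption about the format of arena's duration strings, which use suffixes such as s, m, and h):

```python
def duration_to_seconds(d):
    """Convert an arena DURATION value such as '33s' or '3m' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(d[:-1]) * units[d[-1]]

fluid = duration_to_seconds("33s")    # fashion-mnist-fluid
regular = duration_to_seconds("3m")   # fashion-mnist-arena
print(round(regular / fluid, 1))      # 5.5
```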

Example 4: Topology-aware GPU scheduling

Topology-aware scheduling reduces training time from 120 seconds to 44 seconds, and increases throughput from 225.50 to 1,006.44 images/sec. The AI job scheduler achieves this by optimizing GPU placement based on hardware topology — NVLink and PCIe Switch interconnects, and non-uniform memory access (NUMA) topology.

Submit a job without topology-aware scheduling:

arena submit mpi \
  --name=tensorflow-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"

Submit a job with topology-aware scheduling:

Add the ack.node.gpu.schedule=topology label to the target node:

kubectl label node cn-beijing.192.168.XX.XX ack.node.gpu.schedule=topology --overwrite

Submit the job with --gputopology=true:

arena submit mpi \
  --name=tensorflow-topo-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"

Compare results:

arena list -n demo-ns

Expected output:

NAME                             STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
tensorflow-topo-4-vgg16          SUCCEEDED  MPIJOB   44s       4               N/A             192.168.4.XX1
tensorflow-4-vgg16-image-warned  SUCCEEDED  MPIJOB   2m        4               N/A             192.168.4.XX0

Get throughput for the topology-aware job:

arena logs tensorflow-topo-4-vgg16 -n demo-ns
total images/sec: 1006.44

Get throughput for the baseline job:

arena logs tensorflow-4-vgg16-image-warned -n demo-ns
total images/sec: 225.50

The following table summarizes the results:

Training job Throughput per GPU (images/sec) Total throughput (images/sec) Duration (s)
Topology-aware scheduling enabled 251.7 1006.44 44
Topology-aware scheduling disabled 56.4 225.50 120
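The reported gains are internally consistent; a quick check from the throughput and duration figures logged above:

```python
# Throughput (images/sec) and duration (s) reported by the two jobs above.
topo_throughput, base_throughput = 1006.44, 225.50
topo_duration, base_duration = 44, 120

print(round(topo_throughput / base_throughput, 2))  # 4.46x higher throughput
print(round(base_duration / topo_duration, 2))      # 2.73x shorter run
```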

To restore regular GPU scheduling on the node, reset the label to its default value:

kubectl label node cn-beijing.192.168.XX.XX0 ack.node.gpu.schedule=default --overwrite

For more information, see GPU topology-aware scheduling and Enable topology-aware CPU scheduling.

Step 5: Manage the model

Role: Developer

Register the trained model in AI Developer Console to track versions and trigger evaluations.

  1. Log on to AI Developer Console.

  2. In the left-side navigation pane, click Model Manage.

  3. Click Create Model.

  4. In the Create dialog box, set the following fields:

    • Model Name: fashion-mnist-demo

    • Model Version: v1

    • Job Name: tf-single

  5. Click OK. The model appears in the list.

    Create a model

To evaluate the model immediately, click New Model Evaluate in the Operation column.

Step 6: Evaluate the model

Role: Developer

Submit an evaluation job that loads the model checkpoint, runs it against the test dataset, and stores metrics in MySQL. You can then compare metrics across model versions in AI Developer Console.

Submit a training job that exports a checkpoint

arena \
  submit \
  tfjob \
  -n demo-ns \
  --name=fashion-mnist-arena-ckpt \
  --data=fashion-demo-pvc:/root/data/ \
  --env=DATASET_PATH=/root/data/ \
  --env=MODEL_PATH=/root/data/saved_model \
  --env=MODEL_VERSION=1 \
  --env=GIT_SYNC_USERNAME=${GIT_USERNAME} \
  --env=GIT_SYNC_PASSWORD=${GIT_PASSWORD} \
  --env=OUTPUT_CHECKPOINT=1 \
  --sync-mode=git \
  --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
  --image="tensorflow/tensorflow:2.2.2-gpu" \
  "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"

Build the evaluation image

In the kubeai-sdk directory, build and push the evaluation image:

docker build . -t ${DOCKER_REGISTRY}:fashion-mnist
docker push ${DOCKER_REGISTRY}:fashion-mnist

Submit the evaluation job

  1. Get the MySQL Service IP:

    kubectl get svc -n kube-ai ack-mysql

    Expected output:

    NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
    ack-mysql   ClusterIP   172.16.XX.XX    <none>        3306/TCP   28h
  2. Submit the evaluation job using the CLUSTER-IP from the previous step as MYSQL_HOST:

    arena evaluate model \
     --namespace=demo-ns \
     --loglevel=debug \
     --name=evaluate-job \
     --image=${DOCKER_REGISTRY}:fashion-mnist \
     --env=ENABLE_MYSQL=True \
     --env=MYSQL_HOST=172.16.XX.XX \
     --env=MYSQL_PORT=3306 \
     --env=MYSQL_USERNAME=kubeai \
     --env=MYSQL_PASSWORD=kubeai@ACK \
     --data=fashion-demo-pvc:/data \
     --model-name=1 \
     --model-path=/data/saved_model/ \
     --dataset-path=/data/ \
     --metrics-path=/data/output \
     "python /kubeai/evaluate.py"

Compare evaluation results

  1. In the left-side navigation pane of AI Developer Console, click Model Manage.

    Model evaluation list

  2. In the Job List section, click an evaluation job name to view its metrics.

    Evaluation job metrics

  3. Select multiple evaluation jobs to compare their metrics side by side.

    Compare the metrics of different evaluation jobs

Step 7: Deploy the model as an inference service

Role: Developer

Deploy the trained Fashion-MNIST model as a TensorFlow Serving inference service. Arena supports multiple serving frameworks including Triton and Seldon. See Arena serve guide for the full list.

The model is stored in fashion-demo-pvc from Step 2. To use a different storage type, create a PVC for that storage type first.

Deploy the inference service

arena serve tensorflow \
  --loglevel=debug \
  --namespace=demo-ns \
  --name=fashion-mnist \
  --model-name=1  \
  --gpus=1  \
  --image=tensorflow/serving:1.15.0-gpu \
  --data=fashion-demo-pvc:/data \
  --model-path=/data/saved_model/ \
  --version-policy=latest

Verify the service

arena serve list -n demo-ns

Expected output:

NAME           TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS         PORTS                   GPU
fashion-mnist  Tensorflow  202111031203  1        1          172.16.XX.XX    GRPC:8500,RESTFUL:8501  1

The service exposes two ports: gRPC on 8500 and REST on 8501. Use the ADDRESS and PORTS values to send requests from within the cluster.

Send inference requests

Use the Jupyter Notebook from Step 3 as a client. Set server_ip to the address from the previous step and server_http_port to 8501.

import os
import gzip
import numpy as np
import requests
import json

server_ip = "172.16.XX.XX"       # Replace with the ADDRESS from arena serve list
server_http_port = 8501

dataset_dir = "/root/data/"

def load_data():
    files = [
        'train-labels-idx1-ubyte.gz',
        'train-images-idx3-ubyte.gz',
        't10k-labels-idx1-ubyte.gz',
        't10k-images-idx3-ubyte.gz'
    ]

    paths = []
    for fname in files:
        paths.append(os.path.join(dataset_dir, fname))

    with gzip.open(paths[0], 'rb') as labelpath:
        y_train = np.frombuffer(labelpath.read(), np.uint8, offset=8)
    with gzip.open(paths[1], 'rb') as imgpath:
        x_train = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
    with gzip.open(paths[2], 'rb') as labelpath:
        y_test = np.frombuffer(labelpath.read(), np.uint8, offset=8)
    with gzip.open(paths[3], 'rb') as imgpath:
        x_test = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)

    return (x_train, y_train), (x_test, y_test)

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

(train_images, train_labels), (test_images, test_labels) = load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0

# Reshape for model input
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)

print('\ntrain_images.shape: {}, of {}'.format(train_images.shape, train_images.dtype))
print('test_images.shape: {}, of {}'.format(test_images.shape, test_images.dtype))

def request_model(data):
    headers = {"content-type": "application/json"}
    json_response = requests.post('http://{}:{}/v1/models/1:predict'.format(server_ip, server_http_port), data=data, headers=headers)
    print('=======response:', json_response, json_response.text)
    predictions = json.loads(json_response.text)['predictions']

    print('The model thought this was a {} (class {}), and it was actually a {} (class {})'.format(
        class_names[np.argmax(predictions[0])], np.argmax(predictions[0]),
        class_names[test_labels[0]], test_labels[0]))

data = json.dumps({"signature_name": "serving_default", "instances": test_images[0:3].tolist()})
print('Data: {} ... {}'.format(data[:50], data[len(data)-52:]))
request_model(data)

Click the Execute icon icon. Expected output:

train_images.shape: (60000, 28, 28, 1), of float64
test_images.shape: (10000, 28, 28, 1), of float64
Data: {"signature_name": "serving_default", "instances": ...  [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]]]}
=======response: <Response [200]> {
    "predictions": [[7.42696e-07, 6.91237556e-09, 2.66364452e-07, 2.27735413e-07, 4.0373439e-07, 0.00490919966, 7.27086217e-06, 0.0316713452, 0.0010733594, 0.962337255], ...]
}
The model thought this was a Ankle boot (class 9), and it was actually a Ankle boot (class 9)
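Each row of the predictions field is the model's softmax output, one probability per class; the reported class is simply the argmax. A self-contained check using the first prediction row from the response above:

```python
import numpy as np

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# First row of "predictions" from the response above: one probability per class.
prediction = [7.42696e-07, 6.91237556e-09, 2.66364452e-07, 2.27735413e-07,
              4.0373439e-07, 0.00490919966, 7.27086217e-06, 0.0316713452,
              0.0010733594, 0.962337255]

idx = int(np.argmax(prediction))
print(class_names[idx], idx)  # Ankle boot 9
```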

FAQ

How do I install software in the Jupyter Notebook console?

Run apt-get install <software-name> from a terminal in the notebook.

How do I fix garbled characters in the Jupyter Notebook console?

Update /etc/locale with the following content and reopen the terminal:

LC_CTYPE="da_DK.UTF-8"
LC_NUMERIC="da_DK.UTF-8"
LC_TIME="da_DK.UTF-8"
LC_COLLATE="da_DK.UTF-8"
LC_MONETARY="da_DK.UTF-8"
LC_MESSAGES="da_DK.UTF-8"
LC_PAPER="da_DK.UTF-8"
LC_NAME="da_DK.UTF-8"
LC_ADDRESS="da_DK.UTF-8"
LC_TELEPHONE="da_DK.UTF-8"
LC_MEASUREMENT="da_DK.UTF-8"
LC_IDENTIFICATION="da_DK.UTF-8"
LC_ALL=

What's next