This topic describes how to develop an AI algorithm by using the cloud-native AI component set and the open source Fashion-MNIST dataset. The process includes model development, model training and optimization, model management, model evaluation, and model deployment.

Background information

The cloud-native AI component set includes components that can be independently deployed by using Helm charts. You can use these components to accelerate AI projects.

The cloud-native AI component set is suitable for two types of roles: administrators and developers.
  • Administrators manage users and permissions, allocate cluster resources, configure external storage, manage datasets, and monitor resource utilization by using dashboards.
  • Developers use cluster resources and submit jobs. An administrator must create each developer account and grant it permissions before the developer can work with tools such as the CLI, web UI, or Jupyter Notebook.

Prerequisites

The following operations are completed by an administrator:

  • A Container Service for Kubernetes (ACK) cluster is created. For more information, see Create an ACK managed cluster.
    • The disk size of each node in the cluster is at least 300 GB.
    • If you require optimal data acceleration, use four Elastic Compute Service (ECS) instances that each provide eight V100 GPUs.
    • If you require optimal topology awareness, use two ECS instances that each provide two V100 GPUs.
  • All components in the cloud-native AI component set are installed in the cluster. For more information, see Deploy the cloud-native AI component set.
  • AI Dashboard is ready for use. For more information about how to configure AI Dashboard, see Access AI Dashboard.
  • AI Developer Console is ready for use. For more information about how to configure AI Developer Console, see Access the AI development console.
  • The Fashion-MNIST dataset is downloaded and uploaded to an Object Storage Service (OSS) bucket. For more information about how to upload the dataset to an OSS bucket, see Upload objects.
  • The address, username, and password of the Git repository that stores the test code are obtained.
  • A kubectl client is connected to the cluster. For more information, see Connect to ACK clusters by using kubectl.
  • Arena is installed. For more information, see Install Arena.

Test environment

In this example, an AI model is developed, trained, accelerated, managed, evaluated, and deployed by using the cloud-native AI component set and the open source Fashion-MNIST dataset.

You must create a terminal in Jupyter Notebook or use a jump server in the cluster to submit Arena commands. We recommend that you create a terminal in Jupyter Notebook.

The following table describes the nodes in the cluster.
Host name                  IP              Role         Number of GPUs  Number of vCPUs  Memory
cn-beijing.192.168.0.13    192.168.0.13    Jump server  1               8                30580004 KiB
cn-beijing.192.168.0.16    192.168.0.16    Worker       1               8                30580004 KiB
cn-beijing.192.168.0.17    192.168.0.17    Worker       1               8                30580004 KiB
cn-beijing.192.168.0.240   192.168.0.240   Worker       1               8                30580004 KiB
cn-beijing.192.168.0.239   192.168.0.239   Worker       1               8                30580004 KiB

Experiment objectives

This topic aims to achieve the following objectives:
  • Manage datasets.
  • Use Jupyter Notebook to set up the development environment.
  • Submit standalone training jobs.
  • Submit distributed training jobs.
  • Use Fluid to accelerate training jobs.
  • Use the cybernetes scheduler to accelerate training jobs.
  • Manage models.
  • Evaluate models.
  • Deploy an inference service.

Step 1: Create a user and allocate resources

Developers must obtain the following information and resources from the administrator:
  • The username and password of a user. For more information about how to create a user, see Manage users.
  • Resource quotas. For more information about how to allocate resource quotas, see Manage elastic quota groups.
  • The endpoint of AI Developer Console if developers want to submit jobs by using AI Developer Console. For more information about how to access AI Developer Console, see Access the AI development console.
  • The kubeconfig file that is used to log on to the cluster if developers want to submit jobs by using Arena. For more information about how to obtain the kubeconfig file that is used to log on to a cluster, see Step 2: Select a type of cluster credentials.

Step 2: Prepare a dataset

The administrator must prepare a dataset. In this example, the Fashion-MNIST dataset is used.

a: Add the Fashion-MNIST dataset

  1. Use the following YAML template to create a fashion-mnist.yaml file:
    In this example, a persistent volume (PV) and a persistent volume claim (PVC) are created to mount the OSS bucket that stores the Fashion-MNIST dataset.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: fashion-demo-pv
    spec:
      accessModes:
      - ReadWriteMany
      capacity:
        storage: 10Gi
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeAttributes:
          bucket: fashion-mnist
          otherOpts: ""
          url: oss-cn-beijing.aliyuncs.com
          akId: "AKID"
          akSecret: "AKSECRET"
        volumeHandle: fashion-demo-pv
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: fashion-demo-pvc
      namespace: demo-ns
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: fashion-demo-pv
      storageClassName: oss
      volumeMode: Filesystem
      volumeName: fashion-demo-pv
  2. Run the following command to create the PV and PVC from the fashion-mnist.yaml file:
    kubectl create -f fashion-mnist.yaml
  3. Check the status of the created PV and PVC.
    • Run the following command to check the status of the created PV:
      kubectl get pv fashion-demo-pv

      Expected output:

      NAME                   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS   REASON   AGE
      fashion-mnist-jackwg   10Gi       RWX            Retain           Bound    ns1/fashion-mnist-jackwg-pvc   oss                     8h
    • Run the following command to check the status of the created PVC:
      kubectl get pvc fashion-demo-pvc -n demo-ns
      Expected output:
      NAME                       STATUS   VOLUME                 CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      fashion-mnist-jackwg-pvc   Bound    fashion-mnist-jackwg   10Gi       RWX            oss            8h
    The output shows that both the PV and PVC are in the Bound state. The sample output uses the names fashion-mnist-jackwg, fashion-mnist-jackwg-pvc, and ns1. Your output shows the names that are defined in the fashion-mnist.yaml file.
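
The Bound check can also be scripted. The following is a minimal sketch, assuming that you capture the JSON printed by kubectl get pvc fashion-demo-pvc -n demo-ns -o json. The helper name is_bound and the trimmed sample object are illustrative and are not part of the original procedure.

```python
import json


def is_bound(resource: dict) -> bool:
    """Return True if a PV or PVC object reports the Bound phase."""
    # PersistentVolume and PersistentVolumeClaim objects both expose status.phase.
    return resource.get("status", {}).get("phase") == "Bound"


# Hypothetical, trimmed-down sample of the JSON that kubectl returns.
sample = json.loads("""
{
  "kind": "PersistentVolumeClaim",
  "metadata": {"name": "fashion-demo-pvc", "namespace": "demo-ns"},
  "spec": {"volumeName": "fashion-demo-pv"},
  "status": {"phase": "Bound"}
}
""")

print(is_bound(sample))  # True
```

A claim that is still waiting for a matching volume reports the Pending phase, for which the helper returns False.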

b: Accelerate the dataset

The administrator must accelerate the dataset by using AI Dashboard.

  1. Access AI Dashboard as an administrator.
  2. In the left-side navigation pane of AI Dashboard, choose Dataset > Dataset List.
  3. On the Dataset List page, find the dataset and click Accelerate in the Operator column.
    The following figure shows the accelerated dataset.
    (Figure: Accelerate the dataset)

Step 3: Develop a model

This step describes how to use Jupyter Notebook to set up the development environment. Procedure:
  1. Use a custom image to create a Jupyter notebook (optional).
  2. Use the Jupyter notebook to develop and test a model.
  3. Use the Jupyter notebook to submit code to a Git repository.
  4. Use the Arena SDK to submit a training job.

a (optional): Use a custom image to create a Jupyter notebook

AI Developer Console provides various versions of images that support TensorFlow and PyTorch for you to create Jupyter notebooks. You can also use a custom image to meet your requirements.

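  1. Use the following Dockerfile template to create a file named dockerfile.
    For more information about the limits on custom images, see Create and use a Jupyter notebook.
    The heredoc delimiter is quoted so that the shell does not expand ${NB_PREFIX} when it writes the file.
    cat <<'EOF' > dockerfile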
    FROM tensorflow/tensorflow:1.15.5-gpu
    USER root
    RUN pip install jupyter && \
        pip install ipywidgets && \
        jupyter nbextension enable --py widgetsnbextension && \
        pip install jupyterlab && jupyter serverextension enable --py jupyterlab
    EXPOSE 8888
    #USER jovyan
    CMD ["sh", "-c", "jupyter-lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX} --ServerApp.authenticate_prometheus=False"]
    EOF
  2. Run the following command to build an image from the Dockerfile:
    docker build -f dockerfile .
    Expected output:
    Sending build context to Docker daemon  9.216kB
    Step 1/5 : FROM tensorflow/tensorflow:1.15.5-gpu
     ---> 73be11373498
    Step 2/5 : USER root
     ---> Using cache
     ---> 7ee21dc7e42e
    Step 3/5 : RUN pip install jupyter &&     pip install ipywidgets &&     jupyter nbextension enable --py widgetsnbextension &&     pip install jupyterlab && jupyter serverextension enable --py jupyterlab
     ---> Using cache
     ---> 23bc51c5e16d
    Step 4/5 : EXPOSE 8888
     ---> Using cache
     ---> 76a55822ddae
    Step 5/5 : CMD ["sh", "-c", "jupyter-lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX} --ServerApp.authenticate_prometheus=False"]
     ---> Using cache
     ---> 3692f04626d5
    Successfully built 3692f04626d5
  3. Run the following commands to push the image to your Docker image repository:
    docker tag ${IMAGE_ID} registry-vpc.cn-beijing.aliyuncs.com/${DOCKER_REPO}/jupyter:fashion-mnist-20210802a
    docker push registry-vpc.cn-beijing.aliyuncs.com/${DOCKER_REPO}/jupyter:fashion-mnist-20210802a
  4. Create a Secret that is used to pull the image from the Docker image repository:
    For more information, see Create a Secret based on existing Docker credentials.
    kubectl create secret docker-registry regcred \
      --docker-server=<Your registry server> \
      --docker-username=<Username> \
      --docker-password=<Password> \
      --docker-email=<Your email address>
  5. Create a Jupyter notebook by using AI Developer Console.
    For more information about how to create a Jupyter notebook, see Create and use a Jupyter notebook.
    The following figure shows the parameters that are used to configure a Jupyter notebook.
    (Figure: Create a Jupyter notebook)
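
For reference, kubectl create secret docker-registry stores the credentials in a Secret of the kubernetes.io/dockerconfigjson type. The following sketch shows how that payload is laid out. The function name and the sample credentials are placeholders chosen for illustration, not values from this topic.

```python
import base64
import json


def docker_config_json(server: str, username: str, password: str, email: str) -> str:
    """Build the JSON document stored under the .dockerconfigjson key."""
    # The auth field is the Base64 encoding of "username:password".
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {
        "auths": {
            server: {
                "username": username,
                "password": password,
                "email": email,
                "auth": auth,
            }
        }
    }
    return json.dumps(config)


# Placeholder credentials; replace them with the values for your registry.
payload = docker_config_json(
    "registry-vpc.cn-beijing.aliyuncs.com", "demo-user", "demo-pass", "demo@example.com"
)
print(payload)
```

Because the payload is only Base64-encoded and not encrypted, anyone who can read the Secret can recover the password.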

b: Use the Jupyter notebook to develop and test a model

  1. Access AI Developer Console.
  2. In the left-side navigation pane of AI Developer Console, click Notebook.
  3. On the Notebook page, click the notebook that is in the Running state.
  4. Create a CLI launcher and verify that the dataset is mounted.
    pwd
    /root/data
    ls -alh
    Expected output:
    total 30M
    drwx------ 1 root root    0 Jan  1  1970 .
    drwx------ 1 root root 4.0K Aug  2 04:15 ..
    drwxr-xr-x 1 root root    0 Aug  1 14:16 saved_model
    -rw-r----- 1 root root 4.3M Aug  1 01:53 t10k-images-idx3-ubyte.gz
    -rw-r----- 1 root root 5.1K Aug  1 01:53 t10k-labels-idx1-ubyte.gz
    -rw-r----- 1 root root  26M Aug  1 01:54 train-images-idx3-ubyte.gz
    -rw-r----- 1 root root  29K Aug  1 01:53 train-labels-idx1-ubyte.gz
  5. Create a Jupyter notebook that is used to train a model based on the Fashion-MNIST dataset. The following code block is used to initialize the notebook:
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    
    import os
    import gzip
    import numpy as np
    import tensorflow as tf
    from tensorflow import keras
    print('TensorFlow version: {}'.format(tf.__version__))
    dataset_path = "/root/data/"
    model_path = "./model/"
    model_version =  "v1"
    
    def load_data():
        files = [
            'train-labels-idx1-ubyte.gz',
            'train-images-idx3-ubyte.gz',
            't10k-labels-idx1-ubyte.gz',
            't10k-images-idx3-ubyte.gz'
        ]
        paths = []
        for fname in files:
            paths.append(os.path.join(dataset_path, fname))
        with gzip.open(paths[0], 'rb') as labelpath:
            y_train = np.frombuffer(labelpath.read(), np.uint8, offset=8)
        with gzip.open(paths[1], 'rb') as imgpath:
            x_train = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
        with gzip.open(paths[2], 'rb') as labelpath:
            y_test = np.frombuffer(labelpath.read(), np.uint8, offset=8)
        with gzip.open(paths[3], 'rb') as imgpath:
            x_test = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
        return (x_train, y_train),(x_test, y_test)
    
    def train():
        (train_images, train_labels), (test_images, test_labels) = load_data()
    
        # scale the values to 0.0 to 1.0
        train_images = train_images / 255.0
        test_images = test_images / 255.0
    
        # reshape for feeding into the model
        train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
        test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)
    
        class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
        print('\ntrain_images.shape: {}, of {}'.format(train_images.shape, train_images.dtype))
        print('test_images.shape: {}, of {}'.format(test_images.shape, test_images.dtype))
    
        model = keras.Sequential([
        keras.layers.Conv2D(input_shape=(28,28,1), filters=8, kernel_size=3,
                            strides=2, activation='relu', name='Conv1'),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation=tf.nn.softmax, name='Softmax')
        ])
        model.summary()
        testing = False
        epochs = 5
        model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
        logdir = "/training_logs"
        tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
        model.fit(train_images,
            train_labels,
            epochs=epochs,
            callbacks=[tensorboard_callback],
        )
        test_loss, test_acc = model.evaluate(test_images, test_labels)
        print('\nTest accuracy: {}'.format(test_acc))
        export_path = os.path.join(model_path, model_version)
        print('export_path = {}\n'.format(export_path))
        tf.keras.models.save_model(
            model,
            export_path,
            overwrite=True,
            include_optimizer=True,
            save_format=None,
            signatures=None,
            options=None
        )
        print('\nSaved model success')
    if __name__ == '__main__':
        train()
    Notice: Replace dataset_path with the path of the source data and model_path with the path in which you want to save the model. This allows the notebook to access the dataset that is mounted to the cluster.
  6. Click the Execute icon on the notebook.
    Expected output:
    TensorFlow version: 1.15.5
    
    train_images.shape: (60000, 28, 28, 1), of float64
    test_images.shape: (10000, 28, 28, 1), of float64
    Model: "sequential_2"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    Conv1 (Conv2D)               (None, 13, 13, 8)         80
    _________________________________________________________________
    flatten_2 (Flatten)          (None, 1352)              0
    _________________________________________________________________
    Softmax (Dense)              (None, 10)                13530
    =================================================================
    Total params: 13,610
    Trainable params: 13,610
    Non-trainable params: 0
    _________________________________________________________________
    Train on 60000 samples
    Epoch 1/5
    60000/60000 [==============================] - 3s 57us/sample - loss: 0.5452 - acc: 0.8102
    Epoch 2/5
    60000/60000 [==============================] - 3s 52us/sample - loss: 0.4103 - acc: 0.8555
    Epoch 3/5
    60000/60000 [==============================] - 3s 55us/sample - loss: 0.3750 - acc: 0.8681
    Epoch 4/5
    60000/60000 [==============================] - 3s 55us/sample - loss: 0.3524 - acc: 0.8757
    Epoch 5/5
    60000/60000 [==============================] - 3s 53us/sample - loss: 0.3368 - acc: 0.8798
    10000/10000 [==============================] - 0s 37us/sample - loss: 0.3770 - acc: 0.8673
    
    Test accuracy: 0.8672999739646912
    export_path = ./model/v1
    Saved model success
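
The offset values in load_data follow the IDX file layout that MNIST-style datasets use. A label file starts with an 8-byte header (magic number 2049 and the item count), and an image file starts with a 16-byte header (magic number 2051, the item count, the number of rows, and the number of columns). All header fields are big-endian 32-bit integers. The following sketch checks these headers on synthetic in-memory files instead of the real dataset. read_idx_header is a hypothetical helper and is not part of the sample repository.

```python
import gzip
import struct


def read_idx_header(raw: bytes) -> tuple:
    """Return (magic, dimensions) for the bytes of a decompressed IDX file."""
    # The lowest byte of the magic number holds the number of dimensions.
    (magic,) = struct.unpack(">I", raw[:4])
    ndim = magic & 0xFF
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return magic, dims


# Synthetic image file: two 28x28 images filled with zero pixels.
images = struct.pack(">IIII", 2051, 2, 28, 28) + bytes(2 * 28 * 28)
# Synthetic label file: two labels.
labels = struct.pack(">II", 2049, 2) + bytes([9, 0])

for blob in (gzip.compress(images), gzip.compress(labels)):
    magic, dims = read_idx_header(gzip.decompress(blob))
    header_size = 4 + 4 * len(dims)
    print(magic, dims, header_size)
# 2051 (2, 28, 28) 16
# 2049 (2,) 8
```

The printed header sizes, 16 and 8, are the offsets that load_data passes to np.frombuffer for image files and label files.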

c: Use the Jupyter notebook to submit code to a Git repository

After the notebook is created, you can use the notebook to submit code to a Git repository.

  1. Run the following commands to install Git:
    apt-get update
    apt-get install git
  2. Run the following commands to initialize Git and save the username and password to the notebook:
    git config --global credential.helper store
    git pull ${YOUR_GIT_REPO}
  3. Run the following command to push code to a Git repository:
    git push origin fashion-test
    Expected output:
    Total 0 (delta 0), reused 0 (delta 0)
    To codeup.aliyun.com:60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git
     * [new branch]      fashion-test -> fashion-test
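
Note that the store credential helper keeps the username and password unencrypted, by default in the ~/.git-credentials file, with one URL per line. The following sketch shows what such an entry looks like. The function name and the sample credentials are hypothetical.

```python
from urllib.parse import quote


def credential_line(host: str, username: str, password: str) -> str:
    """Format one entry in the URL form that the store helper uses."""
    # Reserved characters such as '@' and ':' are percent-encoded.
    return f"https://{quote(username, safe='')}:{quote(password, safe='')}@{host}"


print(credential_line("codeup.aliyun.com", "demo-user", "p@ss:word"))
# https://demo-user:p%40ss%3Aword@codeup.aliyun.com
```

Because the file is plaintext, anyone who can read the notebook's home directory can read the password. Consider a repository token with limited permissions if your Git service supports one.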

d: Use the Arena SDK to submit a training job

  1. Install the dependency for the Arena SDK.
    !pip install coloredlogs
  2. Use the following code to create a Python file for initialization:
    import os
    import sys
    import time
    from arenasdk.client.client import ArenaClient
    from arenasdk.enums.types import *
    from arenasdk.exceptions.arena_exception import *
    from arenasdk.training.tensorflow_job_builder import *
    from arenasdk.logger.logger import LoggerBuilder
    
    def main():
        print("start to test arena-python-sdk")
        client = ArenaClient("","demo-ns","info","arena-system") # The training job is submitted to the demo-ns namespace. 
        print("create ArenaClient succeed.")
        print("start to create tfjob")
        job_name = "arena-sdk-distributed-test"
        job_type = TrainingJobType.TFTrainingJob
        try:
            # Build the training job. The chain is wrapped in parentheses so that
            # comments can sit next to the options without breaking line continuation.
            job = (
                TensorflowJobBuilder().with_name(job_name)
                .witch_workers(1)
                .with_gpus(1)
                .witch_worker_image("tensorflow/tensorflow:1.5.0-devel-gpu")
                .witch_ps_image("tensorflow/tensorflow:1.5.0-devel")
                .witch_ps_count(1)
                .with_datas({"fashion-demo-pvc": "/data"})
                .enable_tensorboard()
                .with_sync_mode("git")
                # The address of the Git repository.
                .with_sync_source("https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git")
                .with_envs({
                    "GIT_SYNC_USERNAME": "USERNAME",  # The username of the Git repository.
                    "GIT_SYNC_PASSWORD": "PASSWORD",  # The password of the Git repository.
                    "TEST_TMPDIR": "/",
                })
                .with_command("python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py")
                .build()
            )
            # If a job with the same name already exists, delete it before resubmitting.
            if client.training().get(job_name, job_type):
                print("the job {} has been created, to delete it".format(job_name))
                client.training().delete(job_name, job_type)
                time.sleep(3)
    
            output = client.training().submit(job)
            print(output)
    
            count = 0
            # waiting training job to be running
            while True:
                if count > 160:
                    raise Exception("timeout for waiting job to be running")
                jobInfo = client.training().get(job_name,job_type)
                if jobInfo.get_status() == TrainingJobStatus.TrainingJobPending:
                    print("job status is PENDING,waiting...")
                    count = count + 1
                    time.sleep(5)
                    continue
                print("current status is {} of job {}".format(jobInfo.get_status().value,job_name))
                break
            # get the training job logs
            logger = LoggerBuilder().with_accepter(sys.stdout).with_follow().with_since("5m")
            #jobInfo.get_instances()[0].get_logs(logger)
            # display the training job information
            print(str(jobInfo))
            # delete the training job
            #client.training().delete(job_name, job_type)
        except ArenaException as e:
            print(e)
    
    main()
    • namespace: In this example, the training job is submitted to the demo-ns namespace.
    • with_sync_source: The address of the Git repository.
    • with_envs: The username and password of the Git repository.
  3. Click the Execute icon on the notebook.
    Expected output:
    2021-11-02/08:57:28 DEBUG util.py[line:19] - execute command: [arena get --namespace=demo-ns --arena-namespace=arena-system --loglevel=info arena-sdk-distributed-test --type=tfjob -o json]
    2021-11-02/08:57:28 DEBUG util.py[line:19] - execute command: [arena submit --namespace=demo-ns --arena-namespace=arena-system --loglevel=info tfjob --name=arena-sdk-distributed-test --workers=1 --gpus=1 --worker-image=tensorflow/tensorflow:1.5.0-devel-gpu --ps-image=tensorflow/tensorflow:1.5.0-devel --ps=1 --data=fashion-demo-pvc:/data --tensorboard --sync-mode=git --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git --env=GIT_SYNC_USERNAME=USERNAME --env=GIT_SYNC_PASSWORD=PASSWORD --env=TEST_TMPDIR=/ python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py]
    start to test arena-python-sdk
    create ArenaClient succeed.
    start to create tfjob
    2021-11-02/08:57:29 DEBUG util.py[line:19] - execute command: [arena get --namespace=demo-ns --arena-namespace=arena-system --loglevel=info arena-sdk-distributed-test --type=tfjob -o json]
    service/arena-sdk-distributed-test-tensorboard created
    deployment.apps/arena-sdk-distributed-test-tensorboard created
    tfjob.kubeflow.org/arena-sdk-distributed-test created
    
    job status is PENDING,waiting...
    2021-11-02/09:00:34 DEBUG util.py[line:19] - execute command: [arena get --namespace=demo-ns --arena-namespace=arena-system --loglevel=info arena-sdk-distributed-test --type=tfjob -o json]
    current status is RUNNING of job arena-sdk-distributed-test
    {
        "allocated_gpus": 1,
        "chief_name": "arena-sdk-distributed-test-worker-0",
        "duration": "185s",
        "instances": [
            {
                "age": "13s",
                "gpu_metrics": [],
                "is_chief": false,
                "name": "arena-sdk-distributed-test-ps-0",
                "node_ip": "192.168.5.8",
                "node_name": "cn-beijing.192.168.5.8",
                "owner": "arena-sdk-distributed-test",
                "owner_type": "tfjob",
                "request_gpus": 0,
                "status": "Running"
            },
            {
                "age": "13s",
                "gpu_metrics": [],
                "is_chief": true,
                "name": "arena-sdk-distributed-test-worker-0",
                "node_ip": "192.168.5.8",
                "node_name": "cn-beijing.192.168.5.8",
                "owner": "arena-sdk-distributed-test",
                "owner_type": "tfjob",
                "request_gpus": 1,
                "status": "Running"
            }
        ],
        "name": "arena-sdk-distributed-test",
        "namespace": "demo-ns",
        "priority": "N/A",
        "request_gpus": 1,
        "tensorboard": "http://192.168.5.6:31068",
        "type": "tfjob"
    }
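
The waiting loop in the script polls the job status every 5 seconds and gives up after 161 polls, which is about 13 minutes. The following standalone sketch isolates that pattern so that it can be tested without a cluster. It does not use the Arena SDK. The function name, the plain string statuses, and the injectable sleep parameter are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_while_pending(get_status: Callable[[], str],
                       max_polls: int = 160,
                       interval_seconds: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it returns something other than "PENDING"."""
    # range(max_polls + 1) mirrors the original loop, which raises only
    # after its counter exceeds 160.
    for _ in range(max_polls + 1):
        status = get_status()
        if status != "PENDING":
            return status
        sleep(interval_seconds)
    raise TimeoutError("timeout for waiting job to be running")


# Simulated status sequence; the no-op sleep keeps the example instant.
statuses = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_while_pending(lambda: next(statuses), sleep=lambda seconds: None))  # RUNNING
```

In the script above, get_status would correspond to reading jobInfo.get_status() after each call to client.training().get(job_name, job_type).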

Step 4: Train a model

Perform the following steps to submit a standalone TensorFlow training job, a distributed TensorFlow training job, a Fluid-accelerated training job, and a cybernetes-accelerated training job.

Submit a standalone TensorFlow training job

After you develop a model by using the notebook and save the model, you can use Arena or AI Developer Console to submit a training job.

Method 1: Use Arena to submit a standalone TensorFlow training job

arena \
  submit \
  tfjob \
  -n demo-ns \
  --name=fashion-mnist-arena \
  --data=fashion-demo-pvc:/root/data/ \
  --env=DATASET_PATH=/root/data/ \
  --env=MODEL_PATH=/root/saved_model \
  --env=MODEL_VERSION=1 \
  --env=GIT_SYNC_USERNAME=<GIT_USERNAME> \
  --env=GIT_SYNC_PASSWORD=<GIT_PASSWORD> \
  --sync-mode=git \
  --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
  --image="tensorflow/tensorflow:2.2.2-gpu" \
  "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"

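If you prefer to keep the Git credentials out of your shell history, you can assemble the same command in Python and read the credentials from environment variables. The following is a minimal sketch under that assumption. The function name and the GIT_USERNAME and GIT_PASSWORD variable names are hypothetical. The namespace and PVC name follow the YAML file in Step 2, and the flags are the ones used in the preceding command.

```python
import os
import shlex


def build_submit_command(namespace: str, pvc: str, env: dict) -> list:
    """Assemble the argument list for a standalone arena submit tfjob call."""
    args = [
        "arena", "submit", "tfjob",
        "-n", namespace,
        "--name=fashion-mnist-arena",
        f"--data={pvc}:/root/data/",
    ]
    args += [f"--env={key}={value}" for key, value in env.items()]
    args += [
        "--sync-mode=git",
        "--sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git",
        "--image=tensorflow/tensorflow:2.2.2-gpu",
        "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs",
    ]
    return args


env = {
    "DATASET_PATH": "/root/data/",
    "MODEL_PATH": "/root/saved_model",
    "MODEL_VERSION": "1",
    # GIT_USERNAME and GIT_PASSWORD are hypothetical variable names.
    "GIT_SYNC_USERNAME": os.environ.get("GIT_USERNAME", "<GIT_USERNAME>"),
    "GIT_SYNC_PASSWORD": os.environ.get("GIT_PASSWORD", "<GIT_PASSWORD>"),
}
command = build_submit_command("demo-ns", "fashion-demo-pvc", env)
print(shlex.join(command))
```

To submit the job, you could pass the list to subprocess.run(command, check=True) on a machine where Arena is installed.
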
Method 2: Use AI Developer Console to submit a standalone TensorFlow training job

  1. Configure the data source. For more information, see Configure a dataset.
    (Figure: Configure a dataset)

    The following table describes some parameters.

    Parameter              Example           Required
    Name                   fashion-demo      Yes
    Namespaces             demo-ns           Yes
    PersistentVolumeClaim  fashion-demo-pvc  Yes
    Local Directory        /root/data        No
  2. Configure the source code. For more information, see Configure a source code repository.
    (Figure: Configure the source code)
    Parameter        Example                                                                                  Required
    Name             fashion-git                                                                              Yes
    Git Repository   https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git   Yes
    Default Branch   master                                                                                   No
    Local Directory  /root/                                                                                   No
    Git user         The username of your private Git repository.                                             No
    Git secret       The password of your private Git repository.                                             No
  3. Submit a standalone TensorFlow training job. For more information, see Submit a TensorFlow job.
    After you configure the job parameters, click Submit. The training job appears in the job list. The following figure shows the job parameters, and the table describes them.
    (Figure: Submit a standalone TensorFlow training job)
    Parameter           Description
    Job Name            In this example, fashion-tf-ui is used.
    Job Type            In this example, TF Stand-alone is selected.
    Namespace           In this example, demo-ns is selected. You must select the namespace to which the dataset belongs.
    Data Configuration  In this example, fashion-demo is selected. You must select the data source that you configured in substep 1.
    Code Configuration  In this example, fashion-git is selected. You must select the source code that you configured in substep 2.
    Code branch         In this example, master is specified.
    Execution Command   In this example, the following command is specified: "export DATASET_PATH=/root/data/ &&export MODEL_PATH=/root/saved_model &&export MODEL_VERSION=1 &&python /root/code/tensorflow-fashion-mnist-sample/train.py".
    Private Git         To use a private Git repository, you must first specify the username and password of the private Git repository.
    Instances Count     Default value: 1.
    Image               In this example, tensorflow/tensorflow:2.2.2-gpu is specified.
    Image Pull Secrets  To pull images from a private image repository, you must first create a Secret.
    CPU (Cores)         Default value: 4.
    Memory (GB)         Default value: 8.

    For more information about Arena commands, see Use Arena to submit a TensorFlow training job.

  4. After you submit the job, check the job log.
    1. In the left-side navigation pane of AI Developer Console, click Job List.
    2. On the Job List page, click the name of the job that you submitted.
    3. On the details page, click the Instances tab. Find the instance that you want to view and click Log in the Operator column.
      Example:
      train_images.shape: (60000, 28, 28, 1), of float64
      test_images.shape: (10000, 28, 28, 1), of float64
      Model: "sequential"
      _________________________________________________________________
      Layer (type)                 Output Shape              Param #
      =================================================================
      Conv1 (Conv2D)               (None, 13, 13, 8)         80
      _________________________________________________________________
      flatten (Flatten)            (None, 1352)              0
      _________________________________________________________________
      Softmax (Dense)              (None, 10)                13530
      =================================================================
      Total params: 13,610
      Trainable params: 13,610
      Non-trainable params: 0
      _________________________________________________________________
      Epoch 1/5
      2021-08-01 14:21:17.532237: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI_ERROR_INVALID_PARAMETER
      2021-08-01 14:21:17.532390: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 0 callback api events and 0 activity events.
      2021-08-01 14:21:17.533535: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: /training_logs/train/plugins/profile/2021_08_01_14_21_17
      2021-08-01 14:21:17.533928: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to /training_logs/train/plugins/profile/2021_08_01_14_21_17/fashion-mnist-arena-chief-0.trace.json.gz
      2021-08-01 14:21:17.534251: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0 ms
      
      2021-08-01 14:21:17.534961: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: /training_logs/train/plugins/profile/2021_08_01_14_21_17Dumped tool data for overview_page.pb to /training_logs/train/plugins/profile/2021_08_01_14_21_17/fashion-mnist-arena-chief-0.overview_page.pb
      Dumped tool data for input_pipeline.pb to /training_logs/train/plugins/profile/2021_08_01_14_21_17/fashion-mnist-arena-chief-0.input_pipeline.pb
      Dumped tool data for tensorflow_stats.pb to /training_logs/train/plugins/profile/2021_08_01_14_21_17/fashion-mnist-arena-chief-0.tensorflow_stats.pb
      Dumped tool data for kernel_stats.pb to /training_logs/train/plugins/profile/2021_08_01_14_21_17/fashion-mnist-arena-chief-0.kernel_stats.pb
      
      1875/1875 [==============================] - 3s 2ms/step - loss: 0.5399 - accuracy: 0.8116
      Epoch 2/5
      1875/1875 [==============================] - 3s 2ms/step - loss: 0.4076 - accuracy: 0.8573
      Epoch 3/5
      1875/1875 [==============================] - 3s 2ms/step - loss: 0.3727 - accuracy: 0.8694
      Epoch 4/5
      1875/1875 [==============================] - 3s 2ms/step - loss: 0.3512 - accuracy: 0.8769
      Epoch 5/5
      1875/1875 [==============================] - 3s 2ms/step - loss: 0.3351 - accuracy: 0.8816
      313/313 [==============================] - 0s 1ms/step - loss: 0.3595 - accuracy: 0.8733
      2021-08-01 14:21:34.820089: W tensorflow/python/util/util.cc:329] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
      WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
      Instructions for updating:
      If using Keras pass *_constraint arguments to layers.
      
      Test accuracy: 0.8733000159263611
      export_path = /root/saved_model/1
      
      
      Saved model success
  5. View data on TensorBoard.
    You can use the kubectl port-forward command to map a local port to the TensorBoard Service. Perform the following steps:
    1. Run the following command to query the IP address of the TensorBoard Service:
      kubectl get svc -n demo-ns
      Expected output:
      NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
      tf-dist-arena-tensorboard   NodePort    172.16.XX.XX     <none>        6006:32226/TCP        80m
    2. Run the following command to map a local port to the TensorBoard Service:
      kubectl port-forward svc/tf-dist-arena-tensorboard -n demo-ns 6006:6006
      Expected output:
      Forwarding from 127.0.0.1:6006 -> 6006
      Forwarding from [::1]:6006 -> 6006
      Handling connection for 6006
      Handling connection for 6006
    3. Enter http://localhost:6006/ in the address bar of your browser to view data on TensorBoard.
      Tensorboard
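
The Service lookup above can also be scripted. The following Python snippet is a minimal sketch, not part of the component set: it assumes that you have captured the text output of kubectl get svc, and the helper name parse_service_ports is hypothetical. It reads the PORT(S) column to find the Service port that you pass to kubectl port-forward.

```python
def parse_service_ports(kubectl_output, service_name):
    """Return (service_port, node_port, protocol) for the named Service.

    Hypothetical helper: parses the plain-text table printed by
    `kubectl get svc`; node_port is None for Services without a NodePort.
    """
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    ports_index = lines[0].split().index("PORT(S)")
    for line in lines[1:]:
        fields = line.split()
        if fields[0] != service_name:
            continue
        # An entry looks like "6006:32226/TCP"; only the first entry is used.
        first_entry = fields[ports_index].split(",")[0]
        ports, protocol = first_entry.split("/")
        service_port, _, node_port = ports.partition(":")
        node_port = int(node_port) if node_port else None
        return int(service_port), node_port, protocol
    raise KeyError(service_name)


sample = """\
NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
tf-dist-arena-tensorboard   NodePort    172.16.XX.XX     <none>        6006:32226/TCP        80m
"""

service_port, node_port, protocol = parse_service_ports(sample, "tf-dist-arena-tensorboard")
print("kubectl port-forward svc/tf-dist-arena-tensorboard -n demo-ns "
      f"{service_port}:{service_port}")
```

The printed command matches the port-forward command used in the preceding step.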

Submit a distributed TensorFlow training job

Method 1: Use Arena to submit a distributed TensorFlow training job

  1. Run the following command to submit a distributed TensorFlow training job by using Arena:
    arena submit tf \
        -n demo-ns \
        --name=tf-dist-arena \
        --working-dir=/root/ \
        --data fashion-mnist-pvc:/data \
        --env=TEST_TMPDIR=/ \
        --env=GIT_SYNC_USERNAME=kubeai \
        --env=GIT_SYNC_PASSWORD=kubeai@ACK123 \
        --env=GIT_SYNC_BRANCH=master \
        --gpus=1 \
        --workers=2 \
        --worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
        --sync-mode=git \
        --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
        --ps=1 \
        --ps-image=tensorflow/tensorflow:1.5.0-devel \
        --tensorboard \
        "python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py --log_dir=/training_logs"
  2. Run the following command to query the IP address of the TensorBoard Service:
    kubectl get svc -n demo-ns
    Expected output:
    NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                 AGE
    tf-dist-arena-tensorboard   NodePort    172.16.204.248   <none>        6006:32226/TCP          80m
  3. Run the following command to map a local port to the TensorBoard Service so that you can view TensorBoard from your local machine:
    kubectl port-forward svc/tf-dist-arena-tensorboard -n demo-ns 6006:6006
    Expected output:
    Forwarding from 127.0.0.1:6006 -> 6006
    Forwarding from [::1]:6006 -> 6006
    Handling connection for 6006
    Handling connection for 6006
  4. Enter http://localhost:6006/ in the address bar of your browser to view data on TensorBoard.
    View data on TensorBoard
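
If you submit jobs from a script, you may prefer not to hard-code Git credentials in the command. The following Python snippet is a minimal sketch under that assumption: it assembles the same arena submit tf arguments as the command in Method 1, but reads the credentials from the GIT_USERNAME and GIT_PASSWORD environment variables. The function name build_arena_tf_command and the placeholder fallback values are hypothetical.

```python
import os
import shlex


def build_arena_tf_command(name, namespace, workers, ps, git_user, git_password):
    """Return the argument list for the distributed TensorFlow job shown above."""
    return [
        "arena", "submit", "tf",
        "-n", namespace,
        f"--name={name}",
        "--working-dir=/root/",
        "--data", "fashion-mnist-pvc:/data",
        "--env=TEST_TMPDIR=/",
        f"--env=GIT_SYNC_USERNAME={git_user}",
        f"--env=GIT_SYNC_PASSWORD={git_password}",
        "--env=GIT_SYNC_BRANCH=master",
        "--gpus=1",
        f"--workers={workers}",
        "--worker-image=tensorflow/tensorflow:1.5.0-devel-gpu",
        "--sync-mode=git",
        "--sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/"
        "tensorflow-fashion-mnist-sample.git",
        f"--ps={ps}",
        "--ps-image=tensorflow/tensorflow:1.5.0-devel",
        "--tensorboard",
        "python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py "
        "--log_dir=/training_logs",
    ]


command = build_arena_tf_command(
    name="tf-dist-arena",
    namespace="demo-ns",
    workers=2,
    ps=1,
    # Placeholder fallbacks are for illustration only.
    git_user=os.environ.get("GIT_USERNAME", "your-git-username"),
    git_password=os.environ.get("GIT_PASSWORD", "your-git-password"),
)
# Print a shell-quoted form for review before running it.
print(shlex.join(command))
```

To run the job, you can pass the list to subprocess.run(command, check=True) on a machine where Arena is installed.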

Method 2: Use AI Developer Console to submit a distributed TensorFlow training job

  1. Configure the data source. For more information, see Configure a dataset.
    In this example, the data source configuration is the same as that used in 1.
  2. Configure the source code. For more information, see Configure a source code repository.
    In this example, the source code configuration is the same as that used in 2.
  3. Submit a distributed TensorFlow training job. For more information, see Submit a TensorFlow job.
    After you configure the job parameters, click Submit. The training job appears in the job list. The following figure shows the job parameters.
    Submit a distributed TensorFlow training job
    Parameter           Description
    Job Name            In this example, fashion-ps-ui is used.
    Job Type            In this example, TF Distributed is selected.
    Namespace           In this example, demo-ns is selected. You must select the namespace to which the dataset belongs.
    Data Configuration  In this example, fashion-demo is selected. You must select the data source that you configured in Step 1.
    Code Configuration  In this example, fashion-git is selected. You must select the source code that you configured in Step 2.
    Execution Command   In this example, the following command is specified: "export TEST_TMPDIR=/root/ && python code/tensorflow-fashion-mnist-sample/tf-distributed-mnist.py --log_dir=/training_logs".
    Image               • On the Worker tab in the Resources section, set Image to tensorflow/tensorflow:1.5.0-devel-gpu.
                        • On the PS tab in the Resources section, set Image to tensorflow/tensorflow:1.5.0-devel.

    For more information about Arena commands, see Use Arena to submit a TensorFlow training job.

  4. View data on TensorBoard. For more information, see Steps 2 to 4 in Method 1: Use Arena to submit a distributed TensorFlow training job.

Submit a Fluid-accelerated training job

In this example, the dataset is accelerated on AI Dashboard and a training job that uses the accelerated dataset is submitted. The result shows that the time required to complete the training job is reduced. Procedure:
  1. The administrator accelerates the dataset on AI Dashboard.
  2. A developer uses Arena to submit a training job that uses the accelerated dataset.
  3. Use Arena to query the time that is required to complete the training job.
  1. Accelerate the dataset.
    If you have accelerated fashion-demo-pvc in Step 2: Prepare a dataset, skip this step. For more information about how to accelerate a dataset, see Create an accelerated dataset based on OSS.
  2. Submit a training job that uses the accelerated dataset.
    A developer submits a training job that uses the accelerated dataset to the demo-ns namespace. The configuration of a job that uses the accelerated dataset and the configuration of a job that uses a regular dataset differ in the following parameter settings:
    • --data: the accelerated PVC and its mount path. In this example, fashion-demo-pvc-acc is mounted to /root/data/.
    • --env=DATASET_PATH: the path of the dataset in the container. The path consists of the mount path that is specified in --data, which is /root/data/ in this example, and the name of the PVC, which is fashion-demo-pvc-acc in this example.
    arena \
      submit \
      tfjob \
      -n demo-ns \
      --name=fashion-mnist-fluid \
      --data=fashion-demo-pvc-acc:/root/data/ \
      --env=DATASET_PATH=/root/data/fashion-demo-pvc-acc \
      --env=MODEL_PATH=/root/saved_model \
      --env=MODEL_VERSION=1 \
      --env=GIT_SYNC_USERNAME=${GIT_USERNAME} \
      --env=GIT_SYNC_PASSWORD=${GIT_PASSWORD} \
      --sync-mode=git \
      --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
      --image="tensorflow/tensorflow:2.2.2-gpu" \
      "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"
  3. Run the following command to compare the time that is required to complete the two training jobs:
    arena list -n demo-ns

    Expected output:

    NAME                 STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
    fashion-mnist-fluid  SUCCEEDED  TFJOB    33s       0               N/A             192.168.5.7
    fashion-mnist-arena  SUCCEEDED  TFJOB    3m        0               N/A             192.168.5.8

    The output of the arena list command shows that 33 seconds is required to complete the Fluid-accelerated training job, whereas 3 minutes is required to complete the training job that uses a regular dataset. Both jobs run with the same code and on the same node.
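
The comparison can be made explicit with a short calculation. The following Python snippet is a minimal sketch that parses the DURATION column of the arena list output shown above and computes the speedup. The helper names are hypothetical, and the handling of compound durations such as 1h5m is an assumption about the output format. Because the listed durations are rounded, the result is approximate.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(text):
    """Convert a duration such as '33s' or '3m' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def job_durations(arena_list_output):
    """Map each job name to its duration in seconds."""
    lines = [line for line in arena_list_output.splitlines() if line.strip()]
    header = lines[0].split()
    name_index = header.index("NAME")
    duration_index = header.index("DURATION")
    rows = (line.split() for line in lines[1:])
    return {row[name_index]: duration_to_seconds(row[duration_index]) for row in rows}


sample = """\
NAME                 STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
fashion-mnist-fluid  SUCCEEDED  TFJOB    33s       0               N/A             192.168.5.7
fashion-mnist-arena  SUCCEEDED  TFJOB    3m        0               N/A             192.168.5.8
"""

durations = job_durations(sample)
speedup = durations["fashion-mnist-arena"] / durations["fashion-mnist-fluid"]
print(f"Fluid-accelerated job is about {speedup:.1f}x faster")
```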

Use cybernetes to accelerate a training job

ACK provides the cybernetes scheduler that is optimized for AI and big data computing. cybernetes supports gang scheduling, capacity scheduling, and topology-aware scheduling. In this example, a training job that has topology-aware GPU scheduling enabled is used.

To ensure high performance for AI workloads, cybernetes uses an optimal scheduling solution based on the topological information about heterogeneous resources on nodes. The information includes how GPUs communicate with each other by using NVLink and PCIe switches, and the non-uniform memory access (NUMA) topology of CPUs. For more information about topology-aware GPU scheduling, see Overview of topology-aware GPU scheduling. For more information about topology-aware CPU scheduling, see Topology-aware CPU scheduling.

Perform the following steps to submit a training job that has topology-aware GPU scheduling enabled and a training job that has topology-aware GPU scheduling disabled. Then, compare the time that is required to complete the jobs.

  1. Run the following command to submit a training job that has topology-aware GPU scheduling disabled:
    arena submit mpi \
      --name=tensorflow-4-vgg16 \
      --gpus=1 \
      --workers=4 \
      --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
      "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
  2. Submit a training job that has topology-aware GPU scheduling enabled.
    You must add a label to the node on which you want to run the job. In this example, the cn-beijing.192.168.XX.XX node is used. Replace the node with the actual node that is used.
    kubectl label node cn-beijing.192.168.XX.XX ack.node.gpu.schedule=topology --overwrite
    Run the following command to submit a training job that is configured with --gputopology=true, which is used to enable topology-aware GPU scheduling.
    arena submit mpi \
      --name=tensorflow-topo-4-vgg16 \
      --gpus=1 \
      --workers=4 \
      --gputopology=true \
      --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \
      "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
  3. Compare the time that is required to complete the training jobs.
    1. Run the following command to compare the time that is required to complete the two training jobs:
      arena list -n demo-ns

      Expected output:

      NAME                             STATUS     TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
      tensorflow-topo-4-vgg16          SUCCEEDED  MPIJOB   44s       4               N/A             192.168.4.XX1
      tensorflow-4-vgg16-image-warned  SUCCEEDED  MPIJOB   2m        4               N/A             192.168.4.XX0
    2. Run the following command to query the training throughput of the job that has topology-aware GPU scheduling enabled:
      arena logs tensorflow-topo-4-vgg16 -n demo-ns
      Expected output:
      100 images/sec: 251.7 +/- 0.1 (jitter = 1.2)  7.262
      ----------------------------------------------------------------
      total images/sec: 1006.44
    3. Run the following command to query the training throughput of the job that has topology-aware GPU scheduling disabled:
      arena logs tensorflow-4-vgg16-image-warned -n demo-ns
      Expected output:
      100 images/sec: 56.4 +/- 0.2 (jitter = 1.5)  7.261
      ----------------------------------------------------------------
      total images/sec: 225.50
The following table compares the results of the two jobs.
Training job                             Throughput per GPU (images/sec)  Total throughput (images/sec)  Duration (s)
Topology-aware GPU scheduling enabled    251.7                            1006.44                        44
Topology-aware GPU scheduling disabled   56.4                             225.50                         120
After topology-aware GPU scheduling is enabled on a node, regular GPU scheduling cannot be used on that node. To resume regular GPU scheduling, run the following command to change the node label:
kubectl label node cn-beijing.192.168.XX.XX0 ack.node.gpu.schedule=default --overwrite
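
The throughput figures can be extracted from the benchmark logs and compared programmatically. The following Python snippet is a minimal sketch that assumes the log format shown in the arena logs output above. The helper name total_images_per_sec is hypothetical. The snippet parses the total for the tensorflow-topo-4-vgg16 job and compares it with the total reported for the tensorflow-4-vgg16-image-warned job.

```python
import re


def total_images_per_sec(log_text):
    """Return the value on the 'total images/sec' line of a benchmark log."""
    match = re.search(r"total images/sec:\s*([0-9.]+)", log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


# Log excerpt of the tensorflow-topo-4-vgg16 job, copied from the output above.
topo_log = """\
100 images/sec: 251.7 +/- 0.1 (jitter = 1.2)  7.262
----------------------------------------------------------------
total images/sec: 1006.44
"""

workers = 4                      # --workers=4, each worker with --gpus=1
topo_total = total_images_per_sec(topo_log)
other_total = 225.50             # total reported for tensorflow-4-vgg16-image-warned

per_gpu = topo_total / workers
ratio = topo_total / other_total
print(f"per-GPU throughput: {per_gpu:.2f} images/sec")
print(f"throughput ratio between the two jobs: {ratio:.2f}")
```

The per-GPU value computed from the total is close to the 251.7 images/sec reported on the per-step log line.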

Step 5: Manage the model

  1. Open AI Developer Console. For more information, see Access the AI development console.
  2. In the left-side navigation pane of AI Developer Console, click Model Manage.
  3. On the Model Manage page, click Create Model.
  4. In the Create dialog box, set Model Name, Model Version, and Job Name.
    In this example, Model Name is set to fashion-mnist-demo, Model Version is set to v1, and Job Name is set to tf-single.
  5. Click OK. The model appears on the page.
    Create a model

    If you want to evaluate the model, click New Model Evaluate in the Operation column.

Step 6: Evaluate the model

After you install the cloud-native AI component set, you can use Arena or AI Developer Console to submit an evaluation job. In this example, an evaluation job is submitted to evaluate the checkpoint of the model that is trained based on the Fashion-MNIST dataset. Procedure:
  1. Use Arena to submit a training job that exports a checkpoint.
  2. Use Arena to submit an evaluation job.
  3. Use AI Developer Console to compare the evaluation results of different models.
  1. Submit a training job that exports a checkpoint.
    Run the following command to use Arena to submit a training job that exports a checkpoint to fashion-demo-pvc:
    # You can change the demo-ns namespace based on your business requirements.
    # GIT_USERNAME and GIT_PASSWORD are the username and password of your Git repository.
    arena \
      submit \
      tfjob \
      -n demo-ns \
      --name=fashion-mnist-arena-ckpt \
      --data=fashion-demo-pvc:/root/data/ \
      --env=DATASET_PATH=/root/data/ \
      --env=MODEL_PATH=/root/data/saved_model \
      --env=MODEL_VERSION=1 \
      --env=GIT_SYNC_USERNAME=${GIT_USERNAME} \
      --env=GIT_SYNC_PASSWORD=${GIT_PASSWORD} \
      --env=OUTPUT_CHECKPOINT=1 \
      --sync-mode=git \
      --sync-source=https://codeup.aliyun.com/60b4cf5c66bba1c04b442e49/tensorflow-fashion-mnist-sample.git \
      --image="tensorflow/tensorflow:2.2.2-gpu" \
      "python /root/code/tensorflow-fashion-mnist-sample/train.py --log_dir=/training_logs"
  2. Submit an evaluation job.
    1. Build an image that is used to deploy the job.
      Obtain the code for model evaluation. Then, run the following commands in the kubeai-sdk directory to build and push an image:
      docker build . -t ${DOCKER_REGISTRY}:fashion-mnist
      docker push ${DOCKER_REGISTRY}:fashion-mnist
    2. Run the following command to query the Service that provides access to MySQL:
      kubectl get svc -n kube-ai ack-mysql
      Expected output:
      NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
      ack-mysql   ClusterIP   172.16.XX.XX    <none>        3306/TCP   28h
    3. Run the following command to submit an evaluation job by using Arena:
      arena evaluate model \
       --namespace=demo-ns \
       --loglevel=debug \
       --name=evaluate-job \
       --image=registry.cn-beijing.aliyuncs.com/kube-ai/kubeai-sdk-demo:fashion-minist \
       --env=ENABLE_MYSQL=True \
       --env=MYSQL_HOST=172.16.77.227 \
       --env=MYSQL_PORT=3306 \
       --env=MYSQL_USERNAME=kubeai \
       --env=MYSQL_PASSWORD=kubeai@ACK \
       --data=fashion-demo-pvc:/data \
       --model-name=1 \
       --model-path=/data/saved_model/ \
       --dataset-path=/data/ \
       --metrics-path=/data/output \
       "python /kubeai/evaluate.py"
      Note: Set MYSQL_HOST and MYSQL_PORT to the cluster IP address and port that are returned in the previous step.
  3. Compare evaluation results.
    1. In the left-side navigation pane of AI Developer Console, click Model Manage.
      Model evaluation list
    2. In the Job List section, you can click the name of an evaluation job to view the metrics.
      Evaluation job metrics
      You can also compare the metrics of different evaluation jobs.
      Compare the metrics of different evaluation jobs

Step 7: Deploy the model as a service

After a model is developed and evaluated, you can deploy the model as a service for your business. The following steps describe how to deploy the preceding model as an inference service named tf-serving. Arena supports various service architectures, such as Triton and Seldon. For more information, see Arena serve guide.

In this example, the model that is trained in Step 4: Train a model is used. The model is stored in the fashion-demo-pvc PVC that is used in Step 2: Prepare a dataset. To store the model in another type of storage, you must first create a PVC of that storage type.

  1. Run the following command to use Arena to deploy the TensorFlow model to TensorFlow Serving:
    arena serve tensorflow \
      --loglevel=debug \
      --namespace=demo-ns \
      --name=fashion-mnist \
      --model-name=1  \
      --gpus=1  \
      --image=tensorflow/serving:1.15.0-gpu \
      --data=fashion-demo-pvc:/data \
      --model-path=/data/saved_model/ \
      --version-policy=latest
  2. Run the following command to query the name of the inference service that you deployed:
    arena serve list -n demo-ns

    Expected output:

    NAME           TYPE        VERSION       DESIRED  AVAILABLE  ADDRESS         PORTS                   GPU
    fashion-mnist  Tensorflow  202111031203  1        1          172.16.XX.XX    GRPC:8500,RESTFUL:8501  1

    You can use the IP address and ports in the ADDRESS and PORTS columns to send requests to the inference service from within the cluster.

  3. Create a Jupyter notebook that is used to run a client to send requests to the tf-serving service over HTTP.
    In this example, the notebook that is created in Step 3: Develop a model is used.
    • Specify 172.16.XX.XX as the value of the server_ip field in the code that is used to initialize the notebook. 172.16.XX.XX is returned in the ADDRESS column in the previous step.
    • Specify 8501 as the value of the server_http_port field in the code that is used to initialize the notebook. Port 8501 is returned in the PORTS column in the previous step and is used to call the RESTful API.

    Example:

    import os
    import gzip
    import numpy as np
    # import matplotlib.pyplot as plt
    import random
    import requests
    import json
    
    server_ip = "172.16.XX.XX"
    server_http_port = 8501
    
    dataset_dir = "/root/data/"
    
    def load_data():
            files = [
                'train-labels-idx1-ubyte.gz',
                'train-images-idx3-ubyte.gz',
                't10k-labels-idx1-ubyte.gz',
                't10k-images-idx3-ubyte.gz'
            ]
    
            paths = []
            for fname in files:
                paths.append(os.path.join(dataset_dir, fname))
    
            with gzip.open(paths[0], 'rb') as labelpath:
                y_train = np.frombuffer(labelpath.read(), np.uint8, offset=8)
            with gzip.open(paths[1], 'rb') as imgpath:
                x_train = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
            with gzip.open(paths[2], 'rb') as labelpath:
                y_test = np.frombuffer(labelpath.read(), np.uint8, offset=8)
            with gzip.open(paths[3], 'rb') as imgpath:
                x_test = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
    
    
            return (x_train, y_train),(x_test, y_test)
    
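    def show(idx, title):
      # Import matplotlib here because the top-level import is commented out.
      import matplotlib.pyplot as plt
      plt.figure()
      plt.imshow(test_images[idx].reshape(28,28))
      plt.axis('off')
      plt.title('\n\n{}'.format(title), fontdict={'size': 16})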
    
    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                   'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
    (train_images, train_labels), (test_images, test_labels) = load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    
    # reshape for feeding into the model
    train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
    test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)
    
    print('\ntrain_images.shape: {}, of {}'.format(train_images.shape, train_images.dtype))
    print('test_images.shape: {}, of {}'.format(test_images.shape, test_images.dtype))
    
    rando = random.randint(0,len(test_images)-1)
    #show(rando, 'An Example Image: {}'.format(class_names[test_labels[rando]]))
    
    # !pip install -q requests
    
    # import requests
    # headers = {"content-type": "application/json"}
    # json_response = requests.post('http://localhost:8501/v1/models/fashion_model:predict', data=data, headers=headers)
    # predictions = json.loads(json_response.text)['predictions']
    
    # show(0, 'The model thought this was a {} (class {}), and it was actually a {} (class {})'.format(
    #   class_names[np.argmax(predictions[0])], np.argmax(predictions[0]), class_names[test_labels[0]], test_labels[0]))
    
    
    def request_model(data):
        headers = {"content-type": "application/json"}
        json_response = requests.post('http://{}:{}/v1/models/1:predict'.format(server_ip, server_http_port), data=data, headers=headers)
        print('=======response:', json_response, json_response.text)
        predictions = json.loads(json_response.text)['predictions']
    
        print('The model thought this was a {} (class {}), and it was actually a {} (class {})'.format(class_names[np.argmax(predictions[0])], np.argmax(predictions[0]), class_names[test_labels[0]], test_labels[0]))
        #show(0, 'The model thought this was a {} (class {}), and it was actually a {} (class {})'.format(
        #  class_names[np.argmax(predictions[0])], np.argmax(predictions[0]), class_names[test_labels[0]], test_labels[0]))
    
    # def request_model_version(data):
    #     headers = {"content-type": "application/json"}
    #     json_response = requests.post('http://{}:{}/v1/models/1/versions/1:predict'.format(server_ip, server_http_port), data=data, headers=headers)
    #     print('=======response:', json_response, json_response.text)
    
    #     predictions = json.loads(json_response.text)
    #     for i in range(0,3):
    #       show(i, 'The model thought this was a {} (class {}), and it was actually a {} (class {})'.format(
    #         class_names[np.argmax(predictions[i])], np.argmax(predictions[i]), class_names[test_labels[i]], test_labels[i]))
    
    data = json.dumps({"signature_name": "serving_default", "instances": test_images[0:3].tolist()})
    print('Data: {} ... {}'.format(data[:50], data[len(data)-52:]))
    #request_model_version(data)
    request_model(data)
    Click the Execute icon in the notebook. The following output is returned:
    train_images.shape: (60000, 28, 28, 1), of float64
    test_images.shape: (10000, 28, 28, 1), of float64
    Data: {"signature_name": "serving_default", "instances": ...  [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]]]}
    =======response: <Response [200]> {
        "predictions": [[7.42696e-07, 6.91237556e-09, 2.66364452e-07, 2.27735413e-07, 4.0373439e-07, 0.00490919966, 7.27086217e-06, 0.0316713452, 0.0010733594, 0.962337255], [0.00685342, 1.8516447e-08, 0.9266119, 2.42278338e-06, 0.0603800081, 4.01338771e-12, 0.00613868702, 4.26091073e-15, 1.35764185e-05, 3.38685469e-10], [1.09047969e-05, 0.999816835, 7.98738e-09, 0.000122893631, 4.85748023e-05, 1.50353979e-10, 3.57102294e-07, 1.89657579e-09, 4.4604468e-07, 9.23274524e-09]
        ]
    }
    The model thought this was a Ankle boot (class 9), and it was actually a Ankle boot (class 9)
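
The client above sends three images but reports only the first prediction. The following Python snippet is a minimal sketch of how the whole response could be decoded by using only the standard library. The helper names predict_url and decode_predictions are hypothetical. The scores are rounded from the response shown above. The label of the first image (9) is confirmed by the output above, whereas the labels 2 and 1 for the other two images are assumed for illustration.

```python
import json

CLASS_NAMES = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


def predict_url(server_ip, port, model_name, version=None):
    """Build the TensorFlow Serving REST predict URL, optionally for one version."""
    url = f"http://{server_ip}:{port}/v1/models/{model_name}"
    if version is not None:
        url += f"/versions/{version}"
    return url + ":predict"


def decode_predictions(response_text, labels):
    """Return (predicted class, actual class, is_correct) for each instance."""
    predictions = json.loads(response_text)["predictions"]
    results = []
    for scores, label in zip(predictions, labels):
        # Index of the highest score, equivalent to np.argmax(scores).
        predicted = max(range(len(scores)), key=scores.__getitem__)
        results.append((CLASS_NAMES[predicted], CLASS_NAMES[label], predicted == label))
    return results


# Scores rounded from the response shown above.
response_text = json.dumps({"predictions": [
    [7.4e-07, 6.9e-09, 2.7e-07, 2.3e-07, 4.0e-07, 0.0049, 7.3e-06, 0.0317, 0.0011, 0.9623],
    [0.0069, 1.9e-08, 0.9266, 2.4e-06, 0.0604, 4.0e-12, 0.0061, 4.3e-15, 1.4e-05, 3.4e-10],
    [1.1e-05, 0.9998, 8.0e-09, 0.00012, 4.9e-05, 1.5e-10, 3.6e-07, 1.9e-09, 4.5e-07, 9.2e-09],
]})
labels = [9, 2, 1]  # 9 is from the output above; 2 and 1 are assumed.

print(predict_url("172.16.XX.XX", 8501, "1"))
results = decode_predictions(response_text, labels)
for predicted, actual, correct in results:
    print(f"predicted={predicted}, actual={actual}, correct={correct}")
```

In the notebook, you can pass json_response.text and test_labels[0:3] to decode_predictions instead of the sample values.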

FAQ

  • How do I install commonly used software in the notebook console?
    To install commonly used software in the notebook console, run the following command:
    apt-get install ${Software name}
  • How do I resolve character set encoding errors?
    Modify the /etc/locale file based on the following content and then reopen the terminal.
    LC_CTYPE="da_DK.UTF-8"
    LC_NUMERIC="da_DK.UTF-8"
    LC_TIME="da_DK.UTF-8"
    LC_COLLATE="da_DK.UTF-8"
    LC_MONETARY="da_DK.UTF-8"
    LC_MESSAGES="da_DK.UTF-8"
    LC_PAPER="da_DK.UTF-8"
    LC_NAME="da_DK.UTF-8"
    LC_ADDRESS="da_DK.UTF-8"
    LC_TELEPHONE="da_DK.UTF-8"
    LC_MEASUREMENT="da_DK.UTF-8"
    LC_IDENTIFICATION="da_DK.UTF-8"
    LC_ALL=