Machine Learning Platform for AI (PAI) provides EasyVision, which is an enhanced algorithm framework for visual intelligence. EasyVision provides a variety of model training and prediction capabilities. You can use EasyVision to train and apply computer vision (CV) models for your CV applications. This topic describes how to use EasyVision in Data Science Workshop (DSW) of PAI.


A development environment with the following software versions is prepared:
  • Python 2.7, or Python 3.4 or later
  • TensorFlow 1.8 or later, or PAI-TensorFlow

Step 1: Prepare data

  1. Use one of the following methods to download the Pascal dataset:
    # Method 1: Use osscmd.
    osscmd downloadallobject oss://pai-vision-data-hz/data/voc0712_tfrecord/ data/voc0712_tfrecord
    # Method 2: Use ossutil. Before you download the dataset, set the host parameter in the configuration file to the endpoint of the region where the bucket resides.
    ossutil cp -r oss://pai-vision-data-hz/data/voc0712_tfrecord/ data/voc0712_tfrecord
  2. Download the ResNet50 pre-trained model.
    mkdir -p pretrained_models/
    ossutil cp -r oss://pai-vision-data-hz/pretrained_models/resnet_v1d_50/ pretrained_models/resnet_v1d_50
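
Before you start training, it can help to confirm that both downloads landed where the later steps expect them. The following is a minimal sketch and not part of EasyVision: the helper name find_missing_inputs is hypothetical, and the two relative paths are taken from the download commands above.

```python
import os

# Directories created by the download commands in Step 1.
EXPECTED_DIRS = [
    os.path.join("data", "voc0712_tfrecord"),
    os.path.join("pretrained_models", "resnet_v1d_50"),
]


def find_missing_inputs(root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty under root."""
    missing = []
    for rel in expected:
        path = os.path.join(root, rel)
        # A directory that exists but has no entries usually means the
        # download was interrupted, so it is reported as missing too.
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = find_missing_inputs()
    if missing:
        print("Missing or empty: %s" % ", ".join(missing))
    else:
        print("All inputs are in place.")
```

The check only looks at whether the directories exist and contain something. It does not validate the TFRecord files or the pre-trained model.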

Step 2: Start a training task

  • Single-machine mode
    import easy_vision
    During training, the model is evaluated once every 5,000 training iterations.
  • Multi-machine mode
    Use multiple servers to train the model. Make sure that each server has at least two GPUs. In multi-machine mode, you must start the following child processes:
    • ps: the parameter server.
    • master: the master that writes summaries, saves checkpoints, and periodically evaluates the model.
    • worker: the worker that processes specific data.
    Run the following script to start a training task.
    # -*- encoding: utf-8 -*-
    import multiprocessing
    import sys
    import os
    import easy_vision
    import json
    import logging
    import subprocess
    import time
    # Show INFO-level messages from this launcher script.
    logging.basicConfig(level=logging.INFO)
    # train config under distributed settings
    # Assumed name of the distributed sample config; replace it with the path
    # to your own pipeline config file if your installation differs.
    config = easy_vision.RFCN_DISTRIBUTE_SAMPLE_CONFIG
    # The configuration of the cluster.
    TF_CONFIG = {'cluster': {
                   'ps': ['localhost:12921'],
                   'master': ['localhost:12922'],
                   'worker': ['localhost:12923']
                 }}
    def job(task, gpu):
      task_name = task['type']
      # redirect python log and tf log to log_file_name
      # [logs/master.log, logs/worker.log, logs/ps.log]
      log_file_name = "logs/%s.log" % task_name
      TF_CONFIG['task'] = task
      os.environ['TF_CONFIG'] = json.dumps(TF_CONFIG)
      os.environ['CUDA_VISIBLE_DEVICES'] = gpu
      train_cmd = 'python -m easy_vision.python.train_eval --pipeline_config_path %s' % config
      logging.info('%s > %s 2>&1 ' % (train_cmd, log_file_name))
      with open(log_file_name, 'w') as lfile:
        return subprocess.Popen(train_cmd.split(' '), stdout=lfile, stderr=subprocess.STDOUT)
    if __name__ == '__main__':
      # The log files are written to logs/, so create the directory first.
      if not os.path.exists('logs'):
        os.makedirs('logs')
      procs = {}
      # start ps job on cpu
      task = {'type':'ps', 'index':0}
      procs['ps'] = job(task, '')
      # start master job on gpu 0
      task = {'type':'master', 'index':0}
      procs['master'] = job(task, '0')
      # start worker job on gpu 1
      task = {'type':'worker', 'index':0}
      procs['worker'] = job(task, '1')
      # number of tasks that must finish: the master and the worker
      num_worker = 2
      for k, proc in procs.items():
        logging.info('%s pid: %d' % (k, proc.pid))
      task_failed = None
      task_finish_cnt = 0
      task_has_finished = {k:False for k in procs.keys()}
      while True:
        for k, proc in procs.items():
          if proc.poll() is None:
            # The process is still running. Stop it if another task has failed.
            if task_failed is not None:
              logging.error('task %s failed, %s quit' % (task_failed, k))
              proc.terminate()
              if k != 'ps':
                task_has_finished[k] = True
                task_finish_cnt += 1
              logging.info('task_finish_cnt %d' % task_finish_cnt)
          else:
            if not task_has_finished[k]:
              # process quit by itself
              if k != 'ps':
                task_finish_cnt += 1
                task_has_finished[k] = True
              logging.info('task_finish_cnt %d' % task_finish_cnt)
              if proc.returncode != 0:
                logging.error('%s failed' % k)
                task_failed = k
              else:
                logging.info('%s run successfully' % k)
        if task_finish_cnt >= num_worker:
          break
        time.sleep(1)
      # The parameter server does not exit by itself, so stop whatever is left.
      for k, proc in procs.items():
        if proc.poll() is None:
          proc.terminate()
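
Each child process learns its role from the TF_CONFIG environment variable, which combines the shared cluster description with a per-process task entry. The sketch below shows how that value can be assembled for each role. build_tf_config is a hypothetical helper written for illustration, and the ports are the ones used in the script above.

```python
import json

# Cluster layout shared by all child processes (same ports as the script).
CLUSTER = {
    "ps": ["localhost:12921"],
    "master": ["localhost:12922"],
    "worker": ["localhost:12923"],
}


def build_tf_config(cluster, task_type, index=0):
    """Return the TF_CONFIG JSON string for one child process."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError("no %s task with index %d" % (task_type, index))
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": index},
    })


if __name__ == "__main__":
    for role in ("ps", "master", "worker"):
        print("%s: %s" % (role, build_tf_config(CLUSTER, role)))
```

Validating the task type and index before the process starts surfaces a mistyped role name immediately, rather than as a connection error later in training.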

Step 3: Use TensorBoard to monitor the training task

The checkpoints and event files of the model are saved in the pascal_resnet50_rfcn_model directory. Run the following command to start TensorBoard and view the loss and mean average precision (mAP) of the training:
tensorboard --port 6006 --logdir pascal_resnet50_rfcn_model  [ --host ]
In TensorBoard, you can view the following information:
  • Training loss
    TensorBoard provides the following metrics about the training loss:
    • loss: the total loss of the training.
    • loss/loss/rcnn_cls: the classification loss.
    • loss/loss/rcnn_reg: the regression loss.
    • loss/loss/regularization_loss: the regularization loss.
    • loss/loss/rpn_cls: the classification loss of region proposal network (RPN).
    • loss/loss/rpn_reg: the regression loss of RPN.
  • Test mAP
    PascalBoxes07 and PascalBoxes are used as metrics to calculate the test mAP. PascalBoxes07 is commonly used in studies.

Step 4: Test and evaluate the model

After the training task is completed, you can test and evaluate the trained model.
  • Use other datasets to test the model. Then, check the detection result of each image.
    import easy_vision
    detect_results = easy_vision.test(easy_vision.RFCN_SAMPLE_CONFIG)
    The detection result of each image is returned in detect_results in the format of [detection_boxes, box_probability, box_class]. detection_boxes and box_class indicate the location and category of each detected object, and box_probability indicates the confidence level of the detection.
  • Evaluate the trained model.
    import easy_vision
    eval_metrics = easy_vision.evaluate(easy_vision.RFCN_SAMPLE_CONFIG)
    eval_metrics contains the evaluation metrics: PascalBoxes07, PascalBoxes, global_step, and the loss metrics loss, loss/loss/rcnn_cls, loss/loss/rcnn_reg, loss/loss/rpn_cls, loss/loss/rpn_reg, and loss/loss/total_loss. The following examples show the metrics:
    • PascalBoxes07 Metric
      PascalBoxes07_PerformanceByCategory/AP@0.5IOU/aeroplane = 0.74028647
      PascalBoxes07_PerformanceByCategory/AP@0.5IOU/bicycle = 0.77216494
      PascalBoxes07_PerformanceByCategory/AP@0.5IOU/train = 0.771075
      PascalBoxes07_PerformanceByCategory/AP@0.5IOU/tvmonitor = 0.70221454
      PascalBoxes07_Precision/mAP@0.5IOU = 0.6975172
    • PascalBoxes Metric
      PascalBoxes_PerformanceByCategory/AP@0.5IOU/aeroplane = 0.7697732
      PascalBoxes_PerformanceByCategory/AP@0.5IOU/bicycle = 0.80088705
      PascalBoxes_PerformanceByCategory/AP@0.5IOU/train = 0.8002225
      PascalBoxes_PerformanceByCategory/AP@0.5IOU/tvmonitor = 0.72775906
      PascalBoxes_Precision/mAP@0.5IOU = 0.7182514
    • global_step and loss
      global_step = 75000
      loss = 0.51076376
      loss/loss/rcnn_cls = 0.23392382
      loss/loss/rcnn_reg = 0.12589474
      loss/loss/rpn_cls = 0.13748208
      loss/loss/rpn_reg = 0.013463326
      loss/loss/total_loss = 0.51076376
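
The two return values described above can be post-processed with plain Python. The following sketch rests on two assumptions that the text does not confirm: that the three entries of a detection result are sequences aligned by index, and that eval_metrics behaves like a dict that maps metric names to floats. Both helper names are hypothetical.

```python
def filter_detections(result, min_score=0.5):
    """Keep detections whose confidence is at least min_score.

    result follows the [detection_boxes, box_probability, box_class] layout;
    the three sequences are assumed to be aligned by index.
    """
    boxes, scores, classes = result
    return [
        (box, score, cls)
        for box, score, cls in zip(boxes, scores, classes)
        if score >= min_score
    ]


def per_category_ap(metrics, prefix="PascalBoxes07"):
    """Extract {category: AP} from metric names such as
    'PascalBoxes07_PerformanceByCategory/AP@0.5IOU/train'."""
    head = prefix + "_PerformanceByCategory/AP@0.5IOU/"
    return {
        name[len(head):]: value
        for name, value in metrics.items()
        if name.startswith(head)
    }


if __name__ == "__main__":
    # Made-up detection result for a single image (illustrative values only).
    sample = [
        [[10, 20, 110, 220], [30, 40, 60, 80]],
        [0.92, 0.31],
        ["train", "tvmonitor"],
    ]
    print(filter_detections(sample))

    # Two of the metric values listed above.
    metrics = {
        "PascalBoxes07_PerformanceByCategory/AP@0.5IOU/train": 0.771075,
        "PascalBoxes07_Precision/mAP@0.5IOU": 0.6975172,
    }
    print(per_category_ap(metrics))
```

Because the prefix is matched together with the trailing underscore, passing prefix="PascalBoxes" selects only the PascalBoxes entries and does not pick up the PascalBoxes07 ones.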

Step 5: Export the model

Run the following script to export the model as a SavedModel file:
import easy_vision
easy_vision.export(export_dir, pipeline_config_path, checkpoint_path)
After you run the preceding script, a model directory is created in the export_dir directory. The name of the model directory contains the UNIX timestamp that indicates the time when the directory is created. All checkpoints of the model are exported to a SavedModel file in the model directory.
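Because each export creates a new timestamped directory, a later step usually needs the most recent one. The following is a minimal sketch, assuming that the directory name consists only of the timestamp digits (the text above says only that the name contains the timestamp). latest_export_dir is a hypothetical helper and not part of EasyVision.

```python
import os


def latest_export_dir(export_dir):
    """Return the newest timestamped subdirectory of export_dir, or None.

    Assumes each export lives in a subdirectory whose name is the UNIX
    timestamp written as digits only; other entries are ignored.
    """
    candidates = [
        name for name in os.listdir(export_dir)
        if name.isdigit() and os.path.isdir(os.path.join(export_dir, name))
    ]
    if not candidates:
        return None
    # Compare numerically so that timestamps of different lengths sort correctly.
    return os.path.join(export_dir, max(candidates, key=int))
```

If the directory names in your environment carry a prefix or suffix around the timestamp, adjust the name.isdigit() filter accordingly.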

Step 6: Evaluate the SavedModel file

Run the following script to evaluate the exported SavedModel file. All metrics of the model are contained in the evaluation result file and logs.
from easy_vision.python.main import predictor_evaluate
predictor_evaluate(predictor_eval_config)
In the preceding code, predictor_eval_config specifies the .proto file that is used for the evaluation. For more information, see Protocol Documentation. You can also use the following files for evaluation:

Step 7: Deploy the model as a service

Save the SavedModel file in Object Storage Service (OSS) and use the file to deploy a service in Elastic Algorithm Service (EAS). For more information, see Deploy models.
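
The upload to OSS can be scripted with the same ossutil cp -r command that Step 1 uses for downloads. The following is a sketch under stated assumptions: both helper names are hypothetical, the OSS path is a placeholder for your own bucket, and ossutil must already be installed and configured.

```python
import subprocess


def build_upload_command(saved_model_dir, oss_uri):
    """Return the ossutil command that copies a directory to OSS."""
    if not oss_uri.startswith("oss://"):
        raise ValueError("oss_uri must start with oss://")
    # `ossutil cp -r` copies a directory recursively, as in Step 1.
    return ["ossutil", "cp", "-r", saved_model_dir, oss_uri]


def upload(saved_model_dir, oss_uri):
    """Run the upload and raise CalledProcessError if ossutil fails."""
    subprocess.check_call(build_upload_command(saved_model_dir, oss_uri))
```

For example, upload("export_dir/1565000000", "oss://your-bucket/models/rfcn/") would copy one exported model directory. Both paths in this call are placeholders.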