Machine Learning Platform for AI (PAI) provides EasyVision, which is an enhanced algorithm framework for visual intelligence. EasyVision provides a variety of features for model training and prediction. You can use EasyVision to train and apply computer vision models for your computer vision applications. This topic describes how to use EasyVision in Data Science Workshop (DSW) to detect objects.
Prerequisites
- Python 2.7, or Python 3.4 or later
- TensorFlow 1.8 or later, or PAI-TensorFlow
Note If you use a DSW instance, we recommend that you select a TensorFlow 1.12 image and an instance type with more than 16 GiB of memory.
- ossutil is downloaded and installed. For more information, see Download and installation.
Notice After you download ossutil, you must set the endpoint parameter to oss-cn-zhangjiakou.aliyuncs.com in the configuration file.
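The version requirements above can be checked before you install EasyVision. The following is a minimal sketch, not part of EasyVision: the helper names version_tuple, python_ok, and tensorflow_ok are hypothetical, and the check assumes that TensorFlow version strings start with dot-separated numbers such as 1.12.0.

```python
import re
import sys


def version_tuple(version):
    """Convert a version string such as '1.12.0-rc1' to (1, 12, 0)."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first part that does not start with a digit
        parts.append(int(match.group()))
    return tuple(parts)


def python_ok(info=None):
    """Return True for Python 2.7, or Python 3.4 or later."""
    major_minor = tuple((info or sys.version_info)[:2])
    return major_minor == (2, 7) or major_minor >= (3, 4)


def tensorflow_ok(version):
    """Return True for TensorFlow 1.8 or later."""
    return version_tuple(version) >= (1, 8)


print('Python OK: %s' % python_ok())
print('TensorFlow 1.12.0 OK: %s' % tensorflow_ok('1.12.0'))
```

If TensorFlow is already installed, you can pass tf.__version__ to tensorflow_ok to check the installed version.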
Step 1: Prepare data
Step 2: Start a training task in the current directory
- Single-machine mode
import easy_vision
easy_vision.train_and_evaluate(easy_vision.RFCN_SAMPLE_CONFIG)
In the training process, the model is evaluated every 5,000 rounds of training.
- Multi-machine mode
You can use multiple servers to train the model. Make sure that each server has at least two GPUs. In multi-machine mode, you must start the following child processes:
- ps: the parameter server.
- master: the master that writes summaries, saves checkpoints, and periodically evaluates the model.
- worker: the worker that processes specific data.
# -*- encoding:utf-8 -*-
import json
import logging
import os
import subprocess
import time

import easy_vision

# Show the INFO-level messages of this launcher script on the console.
logging.basicConfig(level=logging.INFO)

# Training configuration for distributed settings.
config = easy_vision.RFCN_DISTRIBUTE_SAMPLE_CONFIG

# The configuration of the cluster.
TF_CONFIG = {
    'cluster': {
        'ps': ['localhost:12921'],
        'master': ['localhost:12922'],
        'worker': ['localhost:12923'],
    }
}


def job(task, gpu):
    """Start one child process for the specified task on the specified GPU."""
    task_name = task['type']
    # Redirect the Python log and the TensorFlow log to log_file_name:
    # logs/master.log, logs/worker.log, or logs/ps.log.
    log_file_name = 'logs/%s.log' % task_name
    TF_CONFIG['task'] = task
    os.environ['TF_CONFIG'] = json.dumps(TF_CONFIG)
    os.environ['CUDA_VISIBLE_DEVICES'] = gpu
    train_cmd = 'python -m easy_vision.python.train_eval --pipeline_config_path %s' % config
    logging.info('%s > %s 2>&1 ' % (train_cmd, log_file_name))
    with open(log_file_name, 'w') as lfile:
        return subprocess.Popen(train_cmd.split(' '), stdout=lfile, stderr=subprocess.STDOUT)


if __name__ == '__main__':
    # Create the log directory if it does not exist.
    if not os.path.isdir('logs'):
        os.makedirs('logs')

    procs = {}
    # Start the ps job on the CPU.
    task = {'type': 'ps', 'index': 0}
    procs['ps'] = job(task, '')
    # Start the master job on GPU 0.
    task = {'type': 'master', 'index': 0}
    procs['master'] = job(task, '0')
    # Start the worker job on GPU 1.
    task = {'type': 'worker', 'index': 0}
    procs['worker'] = job(task, '1')

    # The number of non-ps tasks (master and worker) that must finish.
    num_worker = 2
    for k, proc in procs.items():
        logging.info('%s pid: %d' % (k, proc.pid))

    task_failed = None
    task_finish_cnt = 0
    task_has_finished = {k: False for k in procs.keys()}
    while True:
        for k, proc in procs.items():
            if proc.poll() is None:
                # The process is still running. Stop it if another task failed.
                if task_failed is not None and not task_has_finished[k]:
                    logging.error('task %s failed, %s quit' % (task_failed, k))
                    proc.terminate()
                    task_has_finished[k] = True
                    if k != 'ps':
                        task_finish_cnt += 1
                    logging.info('task_finish_cnt %d' % task_finish_cnt)
            elif not task_has_finished[k]:
                # The process quit by itself.
                task_has_finished[k] = True
                if k != 'ps':
                    task_finish_cnt += 1
                logging.info('task_finish_cnt %d' % task_finish_cnt)
                if proc.returncode != 0:
                    logging.error('%s failed' % k)
                    task_failed = k
                else:
                    logging.info('%s run successfully' % k)

        if task_finish_cnt >= num_worker:
            break
        time.sleep(1)

    # The ps process may still be running after the master and worker finish. Stop it.
    if procs['ps'].poll() is None:
        procs['ps'].terminate()
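Each child process in the preceding script receives its role through the TF_CONFIG environment variable, which combines the shared cluster definition with the task of that process. The following sketch shows the JSON value for each role. The helper name make_tf_config is hypothetical and is used only for illustration. The cluster addresses are the ones used in the preceding script.

```python
import json

# The same cluster definition as in the launcher script.
CLUSTER = {
    'ps': ['localhost:12921'],
    'master': ['localhost:12922'],
    'worker': ['localhost:12923'],
}


def make_tf_config(cluster, task_type, index=0):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    if task_type not in cluster:
        raise ValueError('unknown task type: %s' % task_type)
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError('index %d is out of range for %s' % (index, task_type))
    tf_config = {'cluster': cluster, 'task': {'type': task_type, 'index': index}}
    return json.dumps(tf_config, sort_keys=True)


# Print the value that each child process would see in os.environ['TF_CONFIG'].
for task_type in ('ps', 'master', 'worker'):
    print(make_tf_config(CLUSTER, task_type))
```

Only the task entry differs between processes. The cluster entry must be identical for all processes so that they can locate each other.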
Step 3: Use TensorBoard to monitor the training task
- Run the following command to obtain the logon link of TensorBoard. Then, open TensorBoard in a browser.
tensorboard --port 6006 --logdir pascal_resnet50_rfcn_model [ --host 0.0.0.0 ]
Notice You must run the command in Linux. Before you run the command, switch to the directory in which pascal_resnet50_rfcn_model resides, or replace the path that follows --logdir with the actual path of pascal_resnet50_rfcn_model. Otherwise, the command fails.
In TensorBoard, you can view the following information:
- Training loss
TensorBoard provides the following metrics about the training loss:
- loss: the total loss of the training.
- loss/loss/rcnn_cls: the classification loss.
- loss/loss/rcnn_reg: the regression loss.
- loss/loss/regularization_loss: the regularization loss.
- loss/loss/rpn_cls: the classification loss of the region proposal network (RPN).
- loss/loss/rpn_reg: the regression loss of RPN.
- Test mAP
PascalBoxes07 and PascalBoxes are used as metrics to calculate the test mAP, as shown in the preceding figure. PascalBoxes07 is commonly used in studies.
- On the TensorBoard page, view information such as the loss and mAP of the training based on the instructions shown in the following figure.
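The Notice in this step states that the tensorboard command fails if the path after --logdir is wrong. The following sketch checks the path before it builds the command. The helper name tensorboard_command is hypothetical. The flags are the ones shown in the command in this step.

```python
import os


def tensorboard_command(logdir, port=6006, host=None):
    """Build the tensorboard command after checking that logdir exists."""
    if not os.path.isdir(logdir):
        raise IOError('log directory not found: %s' % logdir)
    cmd = ['tensorboard', '--port', str(port), '--logdir', logdir]
    if host is not None:
        # --host is optional, for example 0.0.0.0 to listen on all interfaces.
        cmd += ['--host', host]
    return cmd


# Example: the current directory always exists, so this call succeeds.
print(' '.join(tensorboard_command('.', host='0.0.0.0')))
```

To start TensorBoard from Python, you can pass the returned list to subprocess.call, with pascal_resnet50_rfcn_model or its actual path as logdir.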
Step 4: Test and evaluate the model
- Use other datasets to test the model. Then, check the detection result of each image.
import easy_vision
test_filelist = 'path/to/filelist.txt'  # each line is an image file path
detect_results = easy_vision.predict(easy_vision.RFCN_SAMPLE_CONFIG, test_filelist=test_filelist)
The detection result of each image is returned in the detect_results variable in the format of [detection_boxes, box_probability, box_class]. In this format, detection_boxes and box_class indicate the location and category of each detected object, and box_probability indicates the confidence level of the detection result.
- Evaluate the trained model.
import easy_vision
eval_metrics = easy_vision.evaluate(easy_vision.RFCN_SAMPLE_CONFIG)
The eval_metrics variable contains the evaluation metrics, including PascalBoxes07, PascalBoxes, global_step, and the following loss metrics: loss, loss/loss/rcnn_cls, loss/loss/rcnn_reg, loss/loss/rpn_cls, loss/loss/rpn_reg, and loss/loss/total_loss. The following examples show the metrics:
- PascalBoxes07 Metric
PascalBoxes07_PerformanceByCategory/AP@0.5IOU/aeroplane = 0.74028647
PascalBoxes07_PerformanceByCategory/AP@0.5IOU/bicycle = 0.77216494
......
PascalBoxes07_PerformanceByCategory/AP@0.5IOU/train = 0.771075
PascalBoxes07_PerformanceByCategory/AP@0.5IOU/tvmonitor = 0.70221454
PascalBoxes07_Precision/mAP@0.5IOU = 0.6975172
- PascalBoxes Metric
PascalBoxes_PerformanceByCategory/AP@0.5IOU/aeroplane = 0.7697732
PascalBoxes_PerformanceByCategory/AP@0.5IOU/bicycle = 0.80088705
......
PascalBoxes_PerformanceByCategory/AP@0.5IOU/train = 0.8002225
PascalBoxes_PerformanceByCategory/AP@0.5IOU/tvmonitor = 0.72775906
PascalBoxes_Precision/mAP@0.5IOU = 0.7182514
- global_step and loss
global_step = 75000
loss = 0.51076376
loss/loss/rcnn_cls = 0.23392382
loss/loss/rcnn_reg = 0.12589474
loss/loss/rpn_cls = 0.13748208
loss/loss/rpn_reg = 0.013463326
loss/loss/total_loss = 0.51076376
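In the sample output above, the four component losses add up to loss/loss/total_loss within rounding error. The following sketch reproduces that arithmetic. It assumes that eval_metrics behaves like a dictionary keyed by metric name, which this topic does not state. The values are copied from the sample output, and the helper name loss_gap is hypothetical.

```python
# Values copied from the sample output. The dictionary layout is an assumption.
eval_metrics = {
    'loss': 0.51076376,
    'loss/loss/rcnn_cls': 0.23392382,
    'loss/loss/rcnn_reg': 0.12589474,
    'loss/loss/rpn_cls': 0.13748208,
    'loss/loss/rpn_reg': 0.013463326,
    'loss/loss/total_loss': 0.51076376,
}

COMPONENTS = (
    'loss/loss/rcnn_cls',
    'loss/loss/rcnn_reg',
    'loss/loss/rpn_cls',
    'loss/loss/rpn_reg',
)


def loss_gap(metrics):
    """Return |sum of component losses - total_loss|."""
    component_sum = sum(metrics[name] for name in COMPONENTS)
    return abs(component_sum - metrics['loss/loss/total_loss'])


print('components match total: %s' % (loss_gap(eval_metrics) < 1e-6))
```

A large gap in your own results would suggest that another term, such as the regularization loss, contributes to the total.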
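The detection result format [detection_boxes, box_probability, box_class] described in this step can be filtered by confidence level. The following is a minimal sketch with made-up sample data. The helper name keep_confident, the 0.5 threshold, the box coordinates, and the class labels are all hypothetical. This topic does not specify the coordinate order of a box or whether classes are names or IDs.

```python
def keep_confident(detection, min_score=0.5):
    """Keep the objects of one image whose confidence level is at least min_score."""
    boxes, scores, classes = detection
    return [
        (box, score, cls)
        for box, score, cls in zip(boxes, scores, classes)
        if score >= min_score
    ]


# Hypothetical result for one image: two detected objects.
sample = [
    [[10, 20, 110, 220], [30, 40, 90, 100]],  # detection_boxes
    [0.92, 0.31],                             # box_probability
    ['dog', 'bicycle'],                       # box_class
]

for box, score, cls in keep_confident(sample):
    print('%s %.2f %s' % (cls, score, box))
```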
Step 5: Export the model
import easy_vision
easy_vision.export(export_dir, pipeline_config_path, checkpoint_path)
After you run the preceding code, a model directory is created in the export_dir directory. The name of the model directory contains the UNIX timestamp that indicates the time when the directory is created. All checkpoints of the model are exported to a SavedModel file in the model directory.
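Because each export creates a directory whose name contains a UNIX timestamp, the most recent export can be located by comparing those timestamps. The following sketch assumes that the timestamp is the first run of nine or more digits in the directory name. The helper name latest_export is hypothetical and is not part of EasyVision.

```python
import os
import re


def latest_export(export_dir):
    """Return the path of the subdirectory with the largest timestamp, or None."""
    best_name, best_ts = None, -1
    for name in os.listdir(export_dir):
        match = re.search(r'\d{9,}', name)
        if match is None or not os.path.isdir(os.path.join(export_dir, name)):
            continue  # skip files and directories without a timestamp
        timestamp = int(match.group())
        if timestamp > best_ts:
            best_name, best_ts = name, timestamp
    if best_name is None:
        return None
    return os.path.join(export_dir, best_name)
```

For example, latest_export(export_dir) returns the directory to use in the following steps if you exported the model more than once.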
Step 6: Evaluate the SavedModel file
from easy_vision.python.main import predictor_evaluate
predictor_evaluate(predictor_eval_config)
In the preceding code, predictor_eval_config specifies the .proto file that is used for the evaluation. For more information, see Protocol Documentation. You can also use the following files for evaluation:
- detector_eval.config for object detection
- text_detector_eval.config for text detection
- text_recognizer_eval.config for text recognition
- text_spotter_eval.config for end-to-end text recognition
Step 7: Deploy the model as a service
Save the SavedModel file in Object Storage Service (OSS) and use the file to deploy a service in Elastic Algorithm Service (EAS). For more information, see Create a Service.
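One way to save the SavedModel directory in OSS is the recursive copy command of ossutil, which is listed in the prerequisites. The following sketch only builds the command. The helper name ossutil_upload_command, the bucket name examplebucket, and the local path are hypothetical. The executable may be named differently on your system, for example ossutil64.

```python
def ossutil_upload_command(local_dir, bucket, prefix='', ossutil='ossutil'):
    """Build an 'ossutil cp' command that uploads local_dir recursively."""
    prefix = prefix.strip('/')
    dest = 'oss://%s/' % bucket
    if prefix:
        dest += prefix + '/'
    # -r copies the directory and everything in it.
    return [ossutil, 'cp', local_dir, dest, '-r']


# Hypothetical bucket and paths, for illustration only.
print(' '.join(ossutil_upload_command('export/1570000000', 'examplebucket', 'models/rfcn')))
```

You can run the returned list with subprocess.call after ossutil is configured with your credentials and endpoint.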