
Set up a TensorFlow continuous training pipeline using Alibaba Cloud Container Service

Posted by Dave on Jan 13, 2017 17:04
This article, the third in this series, introduces how to use Alibaba Cloud services to quickly build a TensorFlow delivery pipeline from training to serving.
With the rapid development of Google's open-source TensorFlow, machine learning is moving out of the lab and into everyday products. But how can we quickly turn machine learning results into products that serve the public? Take TensorFlow as an example. A typical delivery process looks like this: TensorFlow trains a model on the input data; after training is concluded and validation succeeds, the model is published to TensorFlow Serving to serve the public. If machine learning results could be productized the way cars roll off an assembly line, wouldn't that be exciting?
However, reality often falls short of the ideal. TensorFlow and TensorFlow Serving alone are far from a complete, usable machine learning production line. To make this process more efficient and automated, you need additional infrastructure support, such as:
1. Monitoring of the machine learning process, from the system level to the application level, including:
a. Overall computing resource status, especially GPU metrics: utilization, memory, and temperature
b. Resource usage of each individual machine learning application
c. Visualization of the machine learning process
2. Quick and efficient problem diagnosis
One-stop problem diagnosis through a management console with centralized logs
3. One-click recovery from faults
a. Rescheduling from the faulty node to an available node
b. Distributed storage of training checkpoints, so that a learning task can resume on another node at any time (see the sketch after this list)
4. Continuous improvement and release of models
a. Migrating models to production environments through distributed storage
b. Blue-green release
c. Model rollback
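To illustrate point 3b, here is a minimal Python sketch (using 1.x-era TensorFlow APIs; the checkpoint directory path is an assumption) of periodically saving checkpoints to a shared volume and resuming from the latest one when the task is rescheduled to another node:

import os
import tensorflow as tf

# Hypothetical checkpoint directory on the shared (e.g. OSS-mounted) volume.
CHECKPOINT_DIR = "/mnist_export/checkpoints"

# A trivial stand-in for real training state.
global_step = tf.Variable(0, name="global_step", trainable=False)
increment = tf.assign_add(global_step, 1)

saver = tf.train.Saver()
with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(CHECKPOINT_DIR)
    if ckpt:
        saver.restore(sess, ckpt)  # resume where the failed node left off
    else:
        sess.run(tf.global_variables_initializer())
    if not os.path.isdir(CHECKPOINT_DIR):
        os.makedirs(CHECKPOINT_DIR)
    for _ in range(100):
        step = sess.run(increment)
        if step % 10 == 0:
            # Persist state so any node can pick the task up again.
            saver.save(sess, CHECKPOINT_DIR + "/model", global_step=step)

Because all nodes mount the same distributed storage, the restarted container sees the same checkpoint files as the failed one.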
Next, we will demonstrate how to quickly build the whole pipeline, from learning a model to releasing it, using Alibaba Cloud Container Service. This scheme will be iterated and optimized continuously in the follow-up articles of this series, in the hope that the application delivery and O&M experience of Container Service can help data scientists focus on machine learning itself and deliver maximum value. Currently, our scheme runs on CPU machines; once HPC and Container Service are integrated, it will be easy to migrate to an HPC container cluster.
Establish a machine learning production line
Preparations
1. To enable Alibaba Cloud Container Service, see https://yq.aliyun.com/articles/3054
2. For detailed steps to create an OSS data volume, see https://yq.aliyun.com/articles/7581
3. For detailed steps to enable Alibaba Cloud Log Service, see https://yq.aliyun.com/articles/9068
With these services, we can start doing machine learning on Alibaba Cloud Container Service. The example we use is the "Hello world" of machine learning: MNIST.
MNIST is an entry-level computer vision dataset that contains images of handwritten digits:
[Figure: four sample MNIST images of handwritten digits]
It also contains a label for each image indicating which digit it shows. For example, the labels of the above four images are 5, 0, 4 and 1 respectively.
1. First, confirm that the OSS data volume mnist_model has been created and that the Minst_data folder exists on mnist_model, then download the needed training set and testing set data into it (a download sketch follows the table below):
File                         Content
train-images-idx3-ubyte.gz   Training set images (55,000 training images and 5,000 validation images)
train-labels-idx1-ubyte.gz   Digit labels for the training set images
t10k-images-idx3-ubyte.gz    Testing set images (10,000 images)
t10k-labels-idx1-ubyte.gz    Digit labels for the testing set images
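For convenience, here is a minimal Python sketch for downloading the four files (the source URL is the standard MNIST distribution site; the target path is an assumption based on the docker-compose template below) and spot-checking the first training labels:

import gzip
import os
import struct
import urllib.request

import numpy

SOURCE_URL = "http://yann.lecun.com/exdb/mnist/"
DATA_DIR = "/mnist_export/Minst_data"  # assumed OSS-mounted target folder
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

os.makedirs(DATA_DIR, exist_ok=True)
for name in FILES:
    target = os.path.join(DATA_DIR, name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(SOURCE_URL + name, target)

# Spot-check: parse the idx1 label file and print the first four labels.
with gzip.open(os.path.join(DATA_DIR, "train-labels-idx1-ubyte.gz"), "rb") as f:
    magic, count = struct.unpack(">II", f.read(8))
    labels = numpy.frombuffer(f.read(count), dtype=numpy.uint8)
print(labels[:4])  # expected: [5 0 4 1]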
2. Deploy the TensorFlow learning environment on Alibaba Cloud in one click using the following docker-compose template.
version: '2'
services:
  tensor:
    image: registry-vpc.cn-hangzhou.aliyuncs.com/cheyang/mnist-export
    command:
       - "python"
       - "/mnist_export.py"
       - "--training_iteration=${TRAIN_STEPS}"
       - "--export_version=${VERSION}"
       - "--work_dir=/mnist_export/Minst_data"
       - "/mnist_export/mnist_model"
    volumes:
       - mnist_model:/mnist_export
    labels:
       - aliyun.log_store_mnist=stdout
    environment:
      - CUDA_VISIBLE_DEVICES=-1

Note:
The label aliyun.log_store_mnist tells Container Service to collect logs into Alibaba Cloud Log Service; by default the logs are collected from stdout.
The volumes section mounts the OSS data volume provided by Container Service.
Since our test environment is set up inside a VPC, the Docker images are pulled from Alibaba Cloud's VPC registry.
When we create the application in Alibaba Cloud Container Service, a dialog box pops up asking for the model version (VERSION) and the number of training iterations (TRAIN_STEPS). In this example, we enter 1 as the model version and 100 as the training iteration count.
With aliyun.log_store_mnist, we can view the entire learning process in Alibaba Cloud's Log Service, facilitating problem analysis and diagnosis.
After a learning task is completed, you can log in to the server to view the learned model.
sh-4.2# cd /mnist/model/
sh-4.2# ls
00000001
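For context, the versioned directory 00000001 is produced by the export step inside mnist_export.py. A minimal sketch of that kind of export, based on the TensorFlow Serving basic tutorial of this era (the image's actual script may differ), looks like this:

import tensorflow as tf
from tensorflow.contrib.session_bundle import exporter

export_path = "/mnist_export/mnist_model"  # the shared, OSS-mounted path
export_version = 1                         # becomes .../00000001

# A trivial softmax regression model as a stand-in for the trained model.
x = tf.placeholder(tf.float32, [None, 784], name="images")
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, w) + b, name="scores")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop omitted ...
    saver = tf.train.Saver(sharded=True)
    model_exporter = exporter.Exporter(saver)
    model_exporter.init(
        sess.graph.as_graph_def(),
        named_graph_signatures={
            "inputs": exporter.generic_signature({"images": x}),
            "outputs": exporter.generic_signature({"scores": y})})
    # Writes a versioned directory (e.g. 00000001) under export_path,
    # which TensorFlow Serving loads from model_base_path.
    model_exporter.export(export_path, tf.constant(export_version), sess)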

3. Now we need to start a TensorFlow Serving instance to release the learned model into the production environment. The following docker-compose template does this.
version: '2'
services:
  serving:
    image: registry-vpc.cn-hangzhou.aliyuncs.com/denverdino/tensorflow-serving
    command:
       - "/serving/bazel-bin/tensorFlow_serving/model_servers/tensorFlow_model_server"
       - "--enable_batching"
       - "--port=9000"
       - "--model_name=mnist"
       - "--model_base_path=/mnist_model"
    volumes:
       - mnist_model:/mnist_model
    labels:
       - aliyun.log_store_serving=stdout
    ports:
      - "9000:9000"
    environment:
      - CUDA_VISIBLE_DEVICES=-1

Note:
Here, TensorFlow Serving and the TensorFlow learning task share the learned model through distributed storage (OSS in this case; it is also easy to switch to a NAS data volume).
4. You can view the TensorFlow Serving log in Log Service, where you will find that the Version 1 model has been loaded into TensorFlow Serving.


At this point, check the endpoint of this service. Here, the endpoint of TensorFlow Serving is 10.24.2.11:9000. Of course, we could also expose the service through SLB, which has already been described in detail in Establish a TensorFlow Serving cluster using Docker and Alibaba Cloud Container Service, so I will not repeat it here.
5. Deploy a testing client to validate TensorFlow Serving. Below is the docker-compose template for the testing client.
version: '2'
services:
  tensor:
    image: registry-vpc.cn-hangzhou.aliyuncs.com/denverdino/tensorflow-serving
    command:
      - "/serving/bazel-bin/tensorFlow_serving/example/mnist_client"
      - "--num_tests=${NUM_TESTS}"
      - "--server=${SERVER}"
      - "--concurrency=${CONCURRENCY}"


When creating the application, you need to input the test count NUM_TESTS, the TensorFlow Serving endpoint SERVER, and the access concurrency CONCURRENCY; here they are 1000, 10.24.2.11:9000 and 10 respectively.
After the application is created, you can see the running result below. The inference error rate at this point is 13.5%.
serving-client_tensor_1 | 2016-10-11T12:46:59.314358735Z D1011 12:46:59.314217522       5 ev_posix.c:101]             Using polling engine: poll
serving-client_tensor_1 | 2016-10-11T12:47:03.324604352Z ('Extracting', '/tmp/train-images-idx3-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T12:47:03.324652816Z ('Extracting', '/tmp/train-labels-idx1-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T12:47:03.324658399Z ('Extracting', '/tmp/t10k-images-idx3-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T12:47:03.324661869Z ('Extracting', '/tmp/t10k-labels-idx1-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T12:47:04.326217612Z ........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
serving-client_tensor_1 | 2016-10-11T12:47:04.326256766Z Inference error rate: 13.5%
serving-client_tensor_1 | 2016-10-11T12:47:04.326549709Z E1011 12:47:04.326484533      69 chttp2_transport.c:1810]    close_transport: {"created":"@1476190024.326451541","description":"FD shutdown","file":"src/core/lib/iomgr/ev_poll_posix.c","file_line":427}
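Under the hood, mnist_client sends gRPC Predict requests to the server. Below is a minimal sketch of such a request using the beta gRPC API of this era (the input is a dummy image, and the signature names are assumptions based on the export sketch above):

import numpy
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

host, port = "10.24.2.11", 9000  # the TensorFlow Serving endpoint above
channel = implementations.insecure_channel(host, port)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "mnist"
image = numpy.zeros(784, dtype=numpy.float32)  # dummy 28x28 image
request.inputs["images"].CopyFrom(
    tf.contrib.util.make_tensor_proto(image, shape=[1, 784]))
result = stub.Predict(request, 10.0)  # 10-second timeout
print(result.outputs["scores"])

mnist_client simply issues many such requests and compares the predicted digits against the test set labels to compute the error rate.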

6. To improve recognition performance, adjust the parameters and run the TensorFlow learning application again to initiate a new round of training. You can click Change Configuration directly on the Container Service page.
A dialog box will pop up asking for the model version and the training iteration count. In this example, we enter 2 as the model version and 2000 as the training iteration count.
After the learning is complete, check the model directory again and you will find a new model has been created.
sh-4.2# pwd
/mnist/model/
sh-4.2# ls
00000001  00000002

View the TensorFlow Serving log, and you will find the model has been updated to Version 2 (TensorFlow Serving watches model_base_path and automatically loads the newest version found there).

Re-run the testing client and you will find the error rate has been reduced to 8.5%, indicating that the new model's recognition performance has improved.
serving-client_tensor_1 | 2016-10-11T16:54:34.926822231Z D1011 16:54:34.926731204       5 ev_posix.c:101]             Using polling engine: poll
serving-client_tensor_1 | 2016-10-11T16:54:37.984891512Z ('Extracting', '/tmp/train-images-idx3-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T16:54:37.984925589Z ('Extracting', '/tmp/train-labels-idx1-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T16:54:37.984930097Z ('Extracting', '/tmp/t10k-images-idx3-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T16:54:37.984933659Z ('Extracting', '/tmp/t10k-labels-idx1-ubyte.gz')
serving-client_tensor_1 | 2016-10-11T16:54:39.038214498Z ........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
serving-client_tensor_1 | 2016-10-11T16:54:39.038254350Z Inference error rate: 8.5%
serving-client_tensor_1 | 2016-10-11T16:54:39.038533016Z E1011 16:54:39.038481361      68 chttp2_transport.c:1810]    close_transport: {"created":"@1476204879.038447737","description":"FD shutdown","file":"src/core/lib/iomgr/ev_poll_posix.c","file_line":427}

P.S.: Throughout the entire process, we did not log in to any host via SSH; all operations were performed on the Container Service management console.
Summary
This is just a beginning: TensorFlow and TensorFlow Serving are only one example of Alibaba Cloud Container Service's support for high-performance computing. In this article, we delivered the learned model to external services through OSS and implemented model iteration, while checking the containers' working logs in a one-stop manner using Log Service. This realizes the most basic form of continuous learning and improvement; a production environment requires stricter verification and release procedures, and we will introduce our methods and practices in future articles.
Alibaba Cloud Container Service will also work with the high-performance computing (HPC) team to provide machine learning solutions that integrate GPU acceleration and Docker cluster management on Alibaba Cloud, in a bid to improve machine learning efficiency in the cloud.