Data preparation and preprocessing play important roles in the deep learning and training process, and affect the speed and quality of model training.
TensorFlow supports the HDFS, integrates big data and deep learning, and completes the chain from data preparation to model training. The deep learning solution of Alibaba Cloud Container Service provides three distributed storage backends (Object Storage Service (OSS), NAS, and HDFS) to support TensorFlow.
This document describes how to convert data to the TFRecord format and store the generated TFRecord files to the HDFS. The HDFS of Alibaba Cloud Elastic MapReduce (E-MapReduce) is used in this example.
Why is TFRecord used
TFRecord is the unified standard data format defined in TensorFlow. It supports multithreading data read and uses the batch size and epoch parameters to control the size of a single batch and the number of iterations of the sample file during the training process. TFRecord can also make better use of the memory and easily perform data replication and movement. Therefore, it is the preferred option for TensorFlow to perform large-scale deep learning training.
Step 1 Create an E-MapReduce cluster
E-MapReduce is a big data processing system solution running on Alibaba Cloud platform. For more information, see E-MapReduce overview.
In this example, a cluster located in China South 1 (Shenzhen) is created, and Network Type is set to VPC.
Step 2 Create container clusters and integrate networks between the two clusters
- Log on to the Container Service console and create a GPU container cluster under the same Virtual Private Cloud (VPC).
- Log on to the ECS console and add nodes of the Container Service cluster to the security group corresponding
to the E-MapReduce cluster.
- Select the region in which the security group resides (China South 1 (Shenzhen) in
this example). Click Manage Instances at the right of the security group.
- Click Add an ECS Instance in the upper-right corner. Select a node in the container cluster and click OK.
- Select the region in which the security group resides (China South 1 (Shenzhen) in this example). Click Manage Instances at the right of the security group.
Step 3 Generate TFRecord data
In this example, the model training service provides a running environment to run convert_to_records.py, generate TFRecord data, and store the data in the HDFS.
- Log on to the Container Service console.
- Click in the left-side navigation pane.
- Click Launch in Training.
- Configure the settings for model training and click OK.
Configurations in this example are as follows:
- Framework: Select tensorflow:1.0.0.
- GPUs Per Worker: Enter 0.
- Data Source: Select No Data Source.
- Git URL: Enter https://code.aliyun.com/deeplearning/mnist-examples.git.
python neural_style.py --iterations 50000 --content /neural-style/examples/1-content.jpg --styles /neural-style/examples/1-style.jpg --output /neural-style/output.jpg
Then, the created application is displayed on the Application List page. Click the application name. Click the Logs tab and view the execution logs, which indicates the TFRecord files are stored in the HDFS.
You can log on to the E-MapReduce machine to check the generated TFRecord files.
# hdfs dfs -ls /mnist-tfrecord SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/apps/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar! /org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/apps/tez-0.8.4/lib/slf4j-log4j12-1.7.10.jar! /org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Found 3 items -rw-r--r-- 3 root hadoop 8910000 2017-05-23 19:34 /mnist-tfrecord/test.tfrecords -rw-r--r-- 3 root hadoop 49005000 2017-05-23 19:33 /mnist-tfrecord/train.tfrecords -rw-r--r-- 3 root hadoop 4455000 2017-05-23 19:33 /mnist-tfrecord/validation.tfrecords