
Platform for AI: Read OSS and MaxCompute data by using PAI-TensorFlow

Last Updated: Feb 29, 2024

PAI-TensorFlow allows you to read data from Object Storage Service (OSS) buckets and MaxCompute tables.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Read OSS data

Perform the following steps:

Upload data to OSS.

Before you use deep learning frameworks to process the data, you need to upload the data to an OSS bucket.

  1. Create an OSS bucket.

    Make sure that the OSS bucket resides in the same region as your GPU-accelerated compute cluster. You can upload data by using the Alibaba Cloud classic network without being charged for traffic transmission.

    Important

    Do not enable versioning for the OSS bucket.

  2. Create a directory, organize the directory structure, and then upload your data.

    You can perform these operations in the OSS console.
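
    For example, you can upload a local file by using the OSS Python SDK (oss2). The following minimal sketch assumes placeholder credentials, endpoint, bucket name, and object key:

    import oss2

    # A minimal sketch of uploading training data with the OSS Python SDK (oss2).
    # The credentials, endpoint, bucket name, and object key are placeholders.
    auth = oss2.Auth('<your-access-key-id>', '<your-access-key-secret>')
    bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'examplebucket')
    # Upload a local CSV file to a directory-like prefix in the bucket.
    bucket.put_object_from_file('data/csvtest.csv', 'csvtest.csv')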

Grant permissions on OSS

To read data from an OSS bucket by using Platform for AI (PAI), you need to assign the AliyunODPSPAIDefaultRole role to the account that you use. For more information, see Grant the permissions that are required to use Machine Learning Designer.

Authorize a RAM role

You can authorize a RAM role to allow PAI to access OSS. For more information, see the "Grant your RAM user or RAM role the permissions to access OSS" section in the Grant the permissions that are required to use Machine Learning Designer topic.


Use PAI-TensorFlow to read OSS data.

Connect the Read File Data component to the TensorFlow component.

The default role AliyunODPSPAIDefaultRole includes the following permissions:

  • oss:PutObject: uploads an object.

  • oss:GetObject: queries an object.

  • oss:ListObjects: lists the objects in a bucket.

  • oss:DeleteObjects: deletes objects.

How PAI-TensorFlow reads OSS data:

  • Inefficient I/O approaches

    You can run TensorFlow code on your on-premises machine or in the cloud in a distributed manner. The following list describes the differences between the two approaches:

    • Read data from your on-premises machine: The server directly obtains graphs from the client for computing.

    • Read data from the cloud: The server obtains graphs and distributes these graphs to workers for computing.


    Usage notes

    • Do not use built-in approaches of Python to read data from your on-premises machine.

      PAI supports the built-in I/O approaches of Python. To use these approaches, you must compress the data source and code into a package and upload the package to OSS. This approach writes the data to memory for computing and is inefficient. We recommend that you do not use it. The following sample code provides an example of how this approach works:

      import csv

      # Read the entire CSV file from the local file system into memory.
      with open('csvtest.csv') as f:
          for row in csv.reader(f):
              print(row)
    • Do not use third-party libraries to read data.

      You can read data by using third-party libraries, such as TFLearn and pandas. However, these libraries perform their I/O at the Python level inside their own packages, so data reading in PAI remains inefficient. An example follows.
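
      For example, the following minimal sketch uses pandas to read the same hypothetical CSV file. The entire file is loaded into memory before computing starts, which is why this approach is discouraged:

      import pandas as pd

      # Loads the whole CSV file into memory at once; inefficient in PAI.
      df = pd.read_csv('csvtest.csv')
      print(df.head())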

    • Do not perform preload operations to read data.

      You may find that GPUs are not significantly faster than on-premises CPUs. The possible cause is that I/O operations waste resources. A preload operation reads data to the memory, and then performs session operations, such as feeding, for computing. A preload operation wastes computing resources and cannot process large amounts of data due to the memory limit.

      For example, if a hard disk contains an image dataset, you must load the image dataset before computing starts. If the loading requires 0.1s and the computing requires 0.9s, the GPU is idle for 0.1s every second, which reduces efficiency.
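
      The following minimal sketch illustrates the preload pattern that this note discourages; the random array stands in for data loaded from disk, and all names are illustrative:

      import numpy as np
      import tensorflow as tf

      # Preload: the whole dataset is materialized in memory first.
      images = np.random.rand(1000, 784).astype(np.float32)
      x = tf.placeholder(tf.float32, shape=[None, 784])
      mean = tf.reduce_mean(x)

      with tf.Session() as sess:
          # Each step copies in-memory data into the graph through feed_dict,
          # so the GPU idles while the copy takes place.
          print(sess.run(mean, feed_dict={x: images}))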

  • Efficient I/O approaches

    An efficient I/O approach converts data to operations (ops) and calls the session.run method to read the data. A read thread loads the files from the source file system into a memory queue, and a compute thread directly retrieves the data from the memory queue for computing. This prevents computing resources from being idle and wasted.

    The following sample code shows how to read data by using ops:

    import argparse
    import os

    import tensorflow as tf

    FLAGS = None

    def main(_):
        # Path of the OSS object. The --buckets value is passed in by PAI.
        dirname = os.path.join(FLAGS.buckets, "csvtest.csv")
        # Convert the file list into an input queue and read it line by line.
        filename_queue = tf.train.string_input_producer([dirname])
        reader = tf.TextLineReader()
        key, value = reader.read(filename_queue)
        # Split each CSV line into five string fields.
        record_defaults = [[''], [''], [''], [''], ['']]
        d1, d2, d3, d4, d5 = tf.decode_csv(value, record_defaults, ',')
        # Replaces the deprecated tf.initialize_all_variables().
        init = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init)
            # Start the queue runners that feed the input pipeline.
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            for i in range(4):
                print(sess.run(d2))
            coord.request_stop()
            coord.join(threads)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--buckets', type=str, default='',
                            help='input data path')
        parser.add_argument('--checkpointDir', type=str, default='',
                            help='output model path')
        FLAGS, _ = parser.parse_known_args()
        tf.app.run(main=main)

    Description of parameters in the preceding code:

    • dirname: the path of the OSS object. The value of this parameter can be an array.

    • reader: PAI-TensorFlow provides APIs for different types of readers. You can select a reader based on your business requirements.

    • tf.train.string_input_producer: converts the list of objects into an input queue.

    • tf.decode_csv: provides an op to split data. You can use this op to obtain specific parameters in each row.

    • To retrieve data by using an op, you must call the tf.train.Coordinator() and tf.train.start_queue_runners(sess=sess,coord=coord) methods in a session.
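
    For example, the following minimal sketch passes multiple OSS objects to dirname as an array; the file names are hypothetical:

    # Pass several objects under the bucket as a list of paths.
    filenames = [os.path.join(FLAGS.buckets, name) for name in ('part-0.csv', 'part-1.csv')]
    filename_queue = tf.train.string_input_producer(filenames, shuffle=True)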

Read MaxCompute data

You can use the TensorFlow component in Machine Learning Designer to read data from and write data to MaxCompute.

The following procedure uses the iris sample dataset to describe how to read MaxCompute data.

Connect components.

Drag the components to the canvas and connect them.

Configure the Read MaxCompute Table component.

Drag the Read MaxCompute Table component to the canvas and click the component. On the Select Table tab in the right-side pane, enter the following table name in the Table Name field to obtain the data:

pai_online_project.iris_data

(Figures that show the obtained data and the data formats are omitted.)

Configure the TensorFlow component.

The TensorFlow component provides the following ports:

  • Port 1: the input port that connects to the OSS input

  • Port 2: the input port that connects to the MaxCompute input

  • Port 3: the input port that connects to the model input

  • Port 4: the output port that connects to the model output

  • Port 5: the output port that connects to the MaxCompute output

If both input and output are MaxCompute tables, you need to only connect the MaxCompute tables to Input port 2 and Output port 5.

To read data from and write data to MaxCompute tables, you need to create tables and configure data sources, code files, and output model paths.

  • Python Code Files: Specify an OSS path to store the Python code.

    Important

    Your OSS bucket and the current project must be in the same region.

  • Checkpoint Output Directory/Modeling Input Directory: Specify an OSS path to store models.

  • MaxCompute output table: To write data to a MaxCompute table, make sure that the output table exists and the name of the table is the same as the name specified in the code. In this example, enter iris_output.

  • SQL create table statement: If the output table specified in the code does not exist, you can enter an SQL statement in this field to create the table. In this example, enter create table iris_output(f1 DOUBLE,f2 DOUBLE,f3 DOUBLE,f4 DOUBLE,f5 STRING); to create a table.

PAI command

PAI -name tensorflow180_ext -project algo_public -Doutputs="odps://${The name of the current project}/tables/${The name of the output table}" -DossHost="${The host of OSS}" -Dtables="odps://${The name of the current project}/tables/${The name of the input table}" -DgpuRequired="${The number of GPUs}" -Darn="${The Alibaba Cloud Resource Name (ARN) of the RAM role to access OSS}" -Dscript="${The code file to be executed}";

Replace the ${} placeholders with actual values. An example follows.
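
For example, a command with sample values might look as follows. The project name, table names, OSS host, ARN, and script path are placeholders:

PAI -name tensorflow180_ext -project algo_public -Doutputs="odps://my_project/tables/iris_output" -DossHost="oss-cn-shanghai-internal.aliyuncs.com" -Dtables="odps://my_project/tables/iris_data" -DgpuRequired="1" -Darn="acs:ram::1234567890123456:role/aliyunodpspaidefaultrole" -Dscript="oss://examplebucket/code/iris_tf.py";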

Read data from and write data to MaxCompute tables.

We recommend that you call the TableRecordDataset method to read and write MaxCompute data. For more information about this method and examples, see TableRecordDataset.
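
The following minimal sketch shows how reading the iris table with this method might look. It assumes that PAI-TensorFlow exposes the method as tf.data.TableRecordDataset and that the record defaults match the five iris columns; the project and table names are placeholders:

import tensorflow as tf

# A minimal sketch, assuming PAI-TensorFlow exposes tf.data.TableRecordDataset.
# The table path and the record defaults (four floats and one string) are placeholders.
table = "odps://my_project/tables/iris_data"
dataset = tf.data.TableRecordDataset([table],
                                     record_defaults=(0.0, 0.0, 0.0, 0.0, ''))
dataset = dataset.batch(32)
f1, f2, f3, f4, label = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    # Read one batch of feature columns and labels from the table.
    print(sess.run([f1, label]))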