TensorFlow in Machine Learning Platform for AI (PAI) can read data stored in Object Storage Service (OSS) buckets and MaxCompute tables.
Read and write OSS data
Step | Description |
---|---|
Upload data to OSS. | Before you use deep learning frameworks to process your data, you must upload your data to an OSS bucket. |
Authorize the current account. | To read data from and write data to an OSS bucket by using PAI, you must assign the AliyunODPSPAIDefaultRole role to the current account. For more information, see Authorization. |
Authorize a RAM role. | You can authorize a RAM role so that PAI can assume this role to access OSS. Grant the role the permissions that are described in the following table. |
Use TensorFlow in PAI to read OSS data. | Connect the Read File Data component to the TensorFlow component. |
Permission | Description |
---|---|
oss:PutObject | Uploads an object. |
oss:GetObject | Queries an object. |
oss:ListObjects | Lists the objects in a bucket. |
oss:DeleteObjects | Deletes multiple objects. |
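For reference, the following minimal sketch shows what a RAM policy document that grants these permissions might look like. The bucket name examplebucket is a placeholder, and the policy that your role actually needs may be scoped differently.

```python
# Hypothetical RAM policy document that grants the four OSS permissions
# listed above. "examplebucket" is a placeholder bucket name.
import json

policy = {
    "Version": "1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "oss:PutObject",
                "oss:GetObject",
                "oss:ListObjects",
                "oss:DeleteObjects",
            ],
            # Cover both the bucket itself and the objects inside it.
            "Resource": [
                "acs:oss:*:*:examplebucket",
                "acs:oss:*:*:examplebucket/*",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```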
- Inefficient I/O approaches
You can run TensorFlow code on your computer or in the cloud in a distributed manner. The following list describes the differences between the two approaches:
- Read data from your computer: The server directly obtains graphs from the client for computing.
- Read data from the cloud: The server obtains graphs and distributes these graphs to workers for computing.
Usage notes
- Do not use the built-in I/O approaches of Python to read data from your computer.
PAI supports the built-in I/O approaches of Python. To use these approaches, you must compress the data source and code into a package and upload the package to OSS. This approach writes data to memory for computing and is inefficient. We recommend that you do not use it. The following code provides an example of how this approach works:
```python
import csv

csv_reader = csv.reader(open('csvtest.csv'))
for row in csv_reader:
    print(row)
```
- Do not use third-party libraries to read data.
Data can be read by using third-party libraries such as TFLearn and pandas. However, these libraries read data through their own Python wrappers, so data reading in PAI remains inefficient, as the sketch below illustrates.
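The following minimal sketch uses pandas.read_csv on the csvtest.csv file from the earlier example to show the discouraged pattern: the entire file is materialized in memory before any computation starts.

```python
# Discouraged: pandas loads the whole CSV into memory in one step, so I/O
# cannot overlap with computation. "csvtest.csv" is the sample file used
# elsewhere in this topic.
import pandas as pd

df = pd.read_csv('csvtest.csv')
for row in df.itertuples(index=False):
    print(row)
```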
- Do not use preload operations to read data.
You may find that GPUs are not significantly faster than local CPUs. A possible cause is that I/O operations waste resources. A preload operation first reads data into memory and then performs session operations, such as feeding, for computing. Preloading wastes computing resources and cannot process large amounts of data due to memory limits.
For example, if a hard disk contains an image dataset, you must load the data before the computing process starts. If loading takes 0.1s and computing takes 0.9s, the GPU is idle for 0.1s every second, which reduces efficiency. A sketch of this pattern follows.
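The following minimal sketch shows the preload pattern in TensorFlow 1.x terms: the data is read into memory first and then fed into the graph with feed_dict on each run call. It assumes that csvtest.csv contains only numeric values.

```python
# Discouraged preload pattern: the whole dataset sits in memory and is fed
# into the graph with feed_dict, so the device waits while data is copied.
# Assumes a purely numeric csvtest.csv.
import numpy as np
import tensorflow as tf

data = np.loadtxt('csvtest.csv', delimiter=',')  # preload: everything in memory

x = tf.placeholder(tf.float32, shape=data.shape)
total = tf.reduce_sum(x)

with tf.Session() as sess:
    # Every run call copies the preloaded array from Python into the graph.
    print(sess.run(total, feed_dict={x: data}))
```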
- Efficient I/O approaches
An efficient I/O approach converts data to operations (ops) and calls the session.run method to read the data. A read thread loads the images from the source file system into a memory queue, and a compute thread retrieves the data directly from the memory queue for computing. This prevents computing resources from sitting idle. The following sample code shows how to read data by using ops:
```python
import argparse
import os

import tensorflow as tf

FLAGS = None

def main(_):
    # Build the OSS object path from the --buckets argument that PAI passes in.
    dirname = os.path.join(FLAGS.buckets, "csvtest.csv")
    reader = tf.TextLineReader()
    # Convert the object into a filename queue that feeds the reader.
    filename_queue = tf.train.string_input_producer([dirname])
    key, value = reader.read(filename_queue)
    record_defaults = [[''], [''], [''], [''], ['']]
    # Split each CSV row into five column tensors.
    d1, d2, d3, d4, d5 = tf.decode_csv(value, record_defaults, ',')
    init = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init)
        # Start the queue runners that fill the filename queue.
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        for i in range(4):
            print(sess.run(d2))
        coord.request_stop()
        coord.join(threads)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--buckets', type=str, default='', help='input data path')
    parser.add_argument('--checkpointDir', type=str, default='', help='output model path')
    FLAGS, _ = parser.parse_known_args()
    tf.app.run(main=main)
```
Description of the parameters in the preceding code:
- dirname: The path of the OSS object. The value of this parameter can be an array. For an example, see the sketch after this list.
- reader: TensorFlow in PAI provides APIs for different types of readers. You can select a reader as needed.
- tf.train.string_input_producer: converts the object list into a filename queue.
- tf.decode_csv: provides an op to split data. You can use this op to obtain specific parameters in each row.
- To retrieve data by using an op, you must call the tf.train.Coordinator() and tf.train.start_queue_runners(sess=sess,coord=coord) methods in a session.
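As noted for dirname, you can pass an array of objects instead of a single path. The following minimal sketch assumes that the bucket contains multiple CSV objects and uses tf.gfile.Glob (available in TensorFlow 1.x) to build that array.

```python
# Sketch: build an array of OSS objects for string_input_producer.
# Assumes the --buckets path contains multiple CSV objects.
import argparse
import os

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--buckets', type=str, default='', help='input data path')
FLAGS, _ = parser.parse_known_args()

# tf.gfile.Glob expands the wildcard under the OSS path that PAI passes in.
filenames = tf.gfile.Glob(os.path.join(FLAGS.buckets, "*.csv"))
filename_queue = tf.train.string_input_producer(filenames)
```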
Read and write MaxCompute data
You can use the TensorFlow component of Machine Learning Studio to read data from and write data to MaxCompute.
Step | Description |
---|---|
Connect components. | Drag the components to the canvas and connect them. |
Configure the Read MaxCompute Table component. | Drag the Read MaxCompute Table component to the canvas and click the component. In the Table Name field on the Select Table tab in the right-side pane, enter the name of the table to read. |
Configure the TensorFlow component. | To read data from and write data to MaxCompute tables, you must create tables and configure data sources, code files, and output model paths in the PAI command. Replace the descriptions in braces {} with actual values. |
Read data from MaxCompute tables. | We recommend that you call the TableRecordDataset method to read and write MaxCompute data. For more information about this method and examples, see TableRecordDataset. A minimal sketch follows this table. |
Write data to MaxCompute tables. | See the TableRecordDataset reference mentioned in the preceding step; it also covers writing data. |
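The following minimal sketch shows what a read by using this method might look like. It assumes the paiio package that the TableRecordDataset reference describes; the project name, table name, columns, and record defaults are placeholders, and the exact parameters may differ from this sketch.

```python
# Minimal sketch, assuming the paiio package described in the
# TableRecordDataset reference. All names below are placeholders.
import tensorflow as tf
import paiio

table = "odps://my_project/tables/my_table"  # placeholder MaxCompute table path

# One record default per selected column; the defaults also fix column types.
dataset = paiio.TableRecordDataset(
    [table],
    record_defaults=(0.0, 0.0),
    selected_cols="col1,col2",
)
dataset = dataset.batch(128).repeat(1)
col1, col2 = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run([col1, col2]))
```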