TensorFlow in Machine Learning Platform for AI (PAI) can read data stored in Object Storage Service (OSS) buckets and MaxCompute tables.

Read and write OSS data

Perform the following steps:

Step 1: Upload data to OSS.

Before you use deep learning frameworks to process your data, you must upload your data to an OSS bucket.

  1. Create an OSS bucket.
    The OSS bucket must be in the same region as the GPU compute cluster. Your data is transmitted over the Alibaba Cloud classic network. You are not charged for traffic transmission.
    Notice: Do not enable versioning for the bucket.
  2. Create a directory, organize the directory structure, and then upload your data.

    You can perform these operations in the OSS console.

Step 2: Authorize the current account.

To read data from and write data to an OSS bucket by using PAI, you must assign the AliyunODPSPAIDefaultRole role to the current account. For more information, see Authorization.

Step 3: Authorize a RAM role.

You can authorize a RAM role so that PAI can assume this role to access OSS.

Perform the following steps to authorize a RAM role for PAI to access OSS:
  1. Log on to the PAI console. In the left-side navigation submenu of a Machine Learning Studio project, click Settings and select General.
  2. In the OSS Authorization section of the General page, select Authorize Machine Learning Platform for AI to access my OSS resources. For other parameters, use the default values.
  3. In the dialog box that appears, click Click here to authorize access in RAM.
  4. On the page that appears, click Confirm Authorization Policy.
  5. In the Field Settings section, click Refresh below OSS Data Path and view the RAM information that is automatically recorded for the Read File Data component.

Step 4: Use TensorFlow in PAI to read OSS data.

Connect the Read File Data component to the TensorFlow component.
The following table describes the permissions of the default role AliyunODPSPAIDefaultRole.
Permission          Description
oss:PutObject       Uploads an object.
oss:GetObject       Queries an object.
oss:ListObjects     Queries objects.
oss:DeleteObjects   Deletes objects.
Approaches for TensorFlow in PAI to read OSS data:
  • Inefficient I/O approaches
    You can run TensorFlow code on your computer or in the cloud in a distributed manner. The following list describes the differences between the two approaches:
    • Read data from your computer: The server directly obtains graphs from the client for computing.
    • Read data from the cloud: The server obtains graphs and distributes these graphs to workers for computing.
    Usage notes
    • Do not use built-in approaches of Python to read data from your computer.
      PAI supports the built-in I/O approaches of Python. To use these approaches, you must compress the data source and code into a package and upload this package to OSS. This approach reads data into memory for computing and is inefficient. We recommend that you do not use this approach. The following code provides an example on how this approach works:
      import csv

      # Read the entire file into client memory row by row; this is the
      # inefficient pattern described above.
      with open('csvtest.csv') as f:
          csv_reader = csv.reader(f)
          for row in csv_reader:
              print(row)
    • Do not use third-party libraries to read data.

      Third-party libraries, such as TFLearn and pandas, can read data. However, these libraries wrap the same built-in Python I/O, so data reading in PAI remains inefficient.
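      The following sketch shows this discouraged pattern with pandas (the file name is hypothetical); the entire file is still pulled into client memory before any computation starts:
      import pandas as pd

      # The whole CSV is materialized in client memory before computing begins.
      df = pd.read_csv('csvtest.csv')
      print(df.head())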

    • Do not perform preload operations to read data.

      You may find that GPUs are not significantly faster than local CPUs. The possible cause is that I/O operations waste resources. A preload operation first reads data into memory. Then, it performs session operations, such as feeding, for computing. A preload operation wastes computing resources and cannot process large amounts of data due to the memory limit.

      For example, if a hard disk contains an image dataset, you must load the data before the computing process starts. It requires 0.1s to load the image dataset and 0.9s for computing. The GPU is idle for 0.1s every second. This reduces efficiency.
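      A minimal sketch of this preload-and-feed pattern, assuming TensorFlow 1.x (the random array stands in for data loaded from disk):
      import numpy as np
      import tensorflow as tf

      # Preload: the whole dataset is materialized in host memory first.
      data = np.random.rand(10000, 4).astype(np.float32)

      x = tf.placeholder(tf.float32, shape=[None, 4])
      total = tf.reduce_sum(x)

      with tf.Session() as sess:
          # Feeding copies the preloaded array into the graph on every run
          # call, so the accelerator waits while the data is transferred.
          print(sess.run(total, feed_dict={x: data}))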
  • Efficient I/O approaches
    An efficient I/O approach converts data to operations (ops) and calls the session.run method to read the data. A read thread loads the images from the source file system into a memory queue, and a compute thread directly retrieves the data from the memory queue for computing. This prevents computing resources from sitting idle.
    The following sample code shows how to read data by using ops:
    import argparse
    import os

    import tensorflow as tf

    FLAGS = None

    def main(_):
        # Build the full OSS path of the input file from the --buckets argument.
        dirname = os.path.join(FLAGS.buckets, "csvtest.csv")
        # Queue-based pipeline: a reader thread fills a filename queue while
        # the compute thread consumes decoded records.
        reader = tf.TextLineReader()
        filename_queue = tf.train.string_input_producer([dirname])
        key, value = reader.read(filename_queue)
        # The defaults also define the type (string) of each of the five columns.
        record_defaults = [[''], [''], [''], [''], ['']]
        d1, d2, d3, d4, d5 = tf.decode_csv(value, record_defaults, ',')
        init = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init)
            # Start the queue-runner threads that populate the filename queue.
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            for i in range(4):
                print(sess.run(d2))
            coord.request_stop()
            coord.join(threads)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--buckets', type=str, default='',
                            help='input data path')
        parser.add_argument('--checkpointDir', type=str, default='',
                            help='output model path')
        FLAGS, _ = parser.parse_known_args()
        tf.app.run(main=main)
    Description of parameters in the preceding code:
    • dirname: The path of the OSS object. The value of this parameter can be an array.
    • reader: TensorFlow in PAI provides APIs for different types of readers. Select a reader based on your data format; for an example, see the sketch after this list.
    • tf.train.string_input_producer: converts the list of files into a queue.
    • tf.decode_csv: provides an op that splits a line of data. You can use this op to obtain specific fields in each row.
    • To retrieve data by using an op, you must call the tf.train.Coordinator() and tf.train.start_queue_runners(sess=sess,coord=coord) methods in a session.
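    The following sketch swaps in a different reader, tf.WholeFileReader, to read whole image files through the same queue mechanism. It assumes TensorFlow 1.x; the bucket path and JPEG file names are hypothetical placeholders:
    import os

    import tensorflow as tf

    # Hypothetical input: whole image files instead of CSV lines.
    filenames = [os.path.join("oss://mybucket/images", name)
                 for name in ["a.jpg", "b.jpg"]]
    filename_queue = tf.train.string_input_producer(filenames)
    # WholeFileReader returns one (filename, contents) pair per file.
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    image = tf.image.decode_jpeg(value, channels=3)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        print(sess.run(image).shape)
        coord.request_stop()
        coord.join(threads)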

Read and write MaxCompute data

You can use the TensorFlow component of Machine Learning Studio to read data from and write data to MaxCompute.

The following steps use the iris sample dataset to describe how to read MaxCompute data.

Step 1: Connect components.

Drag the components to the canvas and connect them.

Step 2: Configure the Read MaxCompute Table component.

Drag the Read MaxCompute Table component to the canvas and click the component. In the Table Name field on the Select Table tab in the right-side pane, enter pai_online_project.iris_data to obtain the data.

Step 3: Configure the TensorFlow component.

The TensorFlow component provides the following ports:
  • Input port 1: connects to the OSS input.
  • Input port 2: connects to the MaxCompute input.
  • Input port 3: connects to the model input.
  • Output port 4: connects to the model output.
  • Output port 5: connects to the MaxCompute output.
If both input and output are MaxCompute tables, you need only to connect the MaxCompute tables to Input port 2 and Output port 5.
To read data from and write data to MaxCompute tables, you must create tables and configure data sources, code files, and output model paths.
  • Python Code Files: Specify an OSS path to store the Python code.
    Notice Your OSS bucket and the current project must be in the same region.
  • Checkpoint Output Directory/Modeling Input Directory: Specify an OSS path to store models.
  • MaxCompute output table: To write data to a MaxCompute table, make sure that the output table exists and the name of the table is the same as that specified in the code. In this example, enter iris_output.
  • SQL create table statement: If the output table specified in the code does not exist, you can enter an SQL statement in this field to create the table. In this example, enter create table iris_output(f1 DOUBLE,f2 DOUBLE,f3 DOUBLE,f4 DOUBLE,f5 STRING); to create a table.
PAI command
PAI -name tensorflow180_ext -project algo_public -Doutputs="odps://${The name of the current project}/tables/${The name of the output table}" -DossHost="${The host of OSS}" -Dtables="odps://${The name of the current project}/tables/${The name of the input table}" -DgpuRequired="${The number of GPUs}" -Darn="${The Alibaba Cloud Resource Name (ARN) of the RAM role to access OSS}" -Dscript="${The code file to be executed}";
Replace the descriptions in ${} with actual values.
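A hypothetical filled-in example (the project, table, bucket, endpoint, and account names are placeholders, not values from this topic):
PAI -name tensorflow180_ext -project algo_public -Doutputs="odps://my_project/tables/iris_output" -DossHost="oss-cn-shanghai-internal.aliyuncs.com" -Dtables="odps://my_project/tables/iris_data" -DgpuRequired="1" -Darn="acs:ram::123456789012****:role/aliyunodpspaidefaultrole" -Dscript="oss://mybucket/code/iris_train.py";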

Step 4: Read data from MaxCompute tables.

We recommend that you call the TableRecordDataset method to read and write MaxCompute data. For more information about this method and examples, see TableRecordDataset.

Step 5: Write data to MaxCompute tables.

For examples of writing data to MaxCompute tables, see the TableRecordDataset topic referenced above.
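
A minimal sketch of reading the iris table with TableRecordDataset, assuming the PAI-TF extension API tf.data.TableRecordDataset and the table layout created above (f1 to f4 as DOUBLE, f5 as STRING); the project name is a placeholder, and you should verify the exact signature against the TableRecordDataset documentation:
import tensorflow as tf

# Assumption: tf.data.TableRecordDataset is available only in PAI-TF.
# The table path would typically come from the -Dtables argument.
table_path = "odps://my_project/tables/iris_data"  # hypothetical project name
# record_defaults sets the type and default value of each selected column.
dataset = tf.data.TableRecordDataset([table_path],
                                     record_defaults=(0.0, 0.0, 0.0, 0.0, ''),
                                     selected_cols="f1,f2,f3,f4,f5")
dataset = dataset.batch(32)
iterator = dataset.make_one_shot_iterator()
f1, f2, f3, f4, f5 = iterator.get_next()

with tf.Session() as sess:
    print(sess.run([f1, f5]))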