Platform For AI: Data conversion in PAI-TensorFlow

Last Updated: Feb 29, 2024

To improve the quality and efficiency of model training, convert your data to a format that meets the training requirements. This topic describes how to convert data in PAI-TensorFlow.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Python interface: trans_csv_id2sparse

Convert an array of CSV strings that mark valid positions into a sparse matrix.

trans_csv_id2sparse(records, max_id, id_as_value=True, field_delim=",")
  • Configure the parameters. The following table describes the parameters.

    Parameter    Required  Description
    records      Yes       The array of CSV strings that you want to convert. The value of this parameter is of the STRING type. The fields in each CSV string are separated by the delimiter that field_delim specifies.
    max_id       Yes       The maximum number of columns in the output sparse matrix. The value of this parameter is of the INT64 type. This parameter determines the number of columns in dense_shape. If an ID is greater than or equal to max_id, the system reports an error.
    id_as_value  No        Specifies whether the index is used as the value of the valid position in the sparse matrix. The value of this parameter is of the BOOL type. Default value: True. The values of valid positions are of the INT64 type. We recommend that you keep the default value unless you have a specific requirement.
    field_delim  No        The delimiter of the CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, a plus sign (+), a minus sign (-), a lowercase letter e, an uppercase letter E, a period (.), or a multi-byte character. If you use a space as the delimiter, consecutive spaces are treated as one delimiter.

  • Output: A sparse tensor that is converted from the index CSV strings. The values of the tensor are of the INT64 type.
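
The field_delim restrictions above can be checked locally before a job is submitted. The following is a minimal sketch in plain Python. The helpers is_valid_field_delim and split_fields are hypothetical names written for illustration; they are not part of PAI-TensorFlow and only mirror the rules stated in the table.

```python
def is_valid_field_delim(delim):
    """Check a delimiter against the documented field_delim restrictions."""
    # Reject empty, multi-character, and multi-byte delimiters.
    if len(delim) != 1 or len(delim.encode("utf-8")) != 1:
        return False
    # Reject digits and characters that can occur inside a number.
    return not (delim.isdigit() or delim in "+-eE.")


def split_fields(record, field_delim=","):
    """Split one CSV string; runs of spaces count as a single delimiter."""
    if field_delim == " ":
        return [field for field in record.split(" ") if field]
    return record.split(field_delim)


print(is_valid_field_delim(","))      # True
print(is_valid_field_delim("e"))      # False
print(split_fields("2   10 7", " "))  # ['2', '10', '7']
```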

Example: Convert a batch of strings that contain index data into a sparse tensor.

  • Input:

    ["2,10","7","0,8"]
  • Requirement:

    Set the number of matrix columns to 20. The value of each valid position is its index before the conversion.

  • Code:

    outsparse = tf.trans_csv_id2sparse(["2,10","7","0,8"], 20)
  • Result:

    SparseTensor(
    indices=[[0,2],[0,10],[1,7],[2,0],[2,8]],
    values=[2, 10, 7, 0, 8],
    dense_shape=[3,20])
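
To make the mapping from input strings to indices, values, and dense_shape explicit, the following plain-Python sketch reproduces the result above without TensorFlow. The function csv_id2sparse is a hypothetical stand-in, not the PAI-TensorFlow implementation. Its behavior for id_as_value=False (storing 1 at each valid position) is an assumption.

```python
def csv_id2sparse(records, max_id, id_as_value=True, field_delim=","):
    """Return (indices, values, dense_shape) for index CSV strings."""
    indices, values = [], []
    for row, record in enumerate(records):
        for field in record.split(field_delim):
            col = int(field)
            # Documented rule: an ID equal to or above max_id is an error.
            if col >= max_id:
                raise ValueError("ID %d is not less than max_id %d" % (col, max_id))
            indices.append([row, col])
            # Assumption: with id_as_value=False each valid position holds 1.
            values.append(col if id_as_value else 1)
    return indices, values, [len(records), max_id]


indices, values, dense_shape = csv_id2sparse(["2,10", "7", "0,8"], 20)
print(indices)      # [[0, 2], [0, 10], [1, 7], [2, 0], [2, 8]]
print(values)       # [2, 10, 7, 0, 8]
print(dense_shape)  # [3, 20]
```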

Python interface: trans_csv_kv2dense

Convert a collection of CSV strings that mark valid positions and their values into a dense matrix. The valid positions and their values are in the key-value pair format.

trans_csv_kv2dense(records, max_id, field_delim=",")
  • Configure the parameters. The following table describes the parameters.

    Parameter    Required  Description
    records      Yes       The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. The fields in each CSV string are separated by the delimiter that field_delim specifies. Each field is a key-value pair. The key and the value of each pair must be separated by a colon (:). Otherwise, the system reports an error.
    max_id       Yes       The maximum number of columns in the output dense matrix. The value of this parameter is of the INT64 type. If an ID is greater than or equal to max_id, the system reports an error.
    field_delim  No        The delimiter of the CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, a plus sign (+), a minus sign (-), a lowercase letter e, an uppercase letter E, a period (.), or a multi-byte character. If you use a space as the delimiter, consecutive spaces are treated as one delimiter.

  • Output: A dense matrix that is converted from the index CSV strings in the key-value pair format. By default, the output is of the FLOAT type. Positions without a value are filled with 0.0.

Example: Convert a batch of strings that contain key-value pairs in the Index:Value format to a dense matrix.

  • Input:

    ["1:0.1,2:0.2,4:0.4,10:1.0",
    "0:0.22,3:0.33,9:0.99",
    "2:0.24,7:0.84,8:0.96"]
  • Requirement:

    Set the number of columns to 12.

  • Code:

    outmatrix = tf.trans_csv_kv2dense(
    ["1:0.1,2:0.2,4:0.4,10:1.0",
     "0:0.22,3:0.33,9:0.99",
     "2:0.24,7:0.84,8:0.96" ] , 12)
  • Result:

    [[0.0, 0.1, 0.2, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    [0.22, 0.0, 0.0, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99, 0.0, 0.0]
    [0.0, 0.0, 0.24, 0.0, 0.0, 0.0, 0.0, 0.84, 0.96, 0.0, 0.0, 0.0]]
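
The key-value parsing rules can be illustrated with a short plain-Python sketch that produces the same matrix as the example above. The function csv_kv2dense is a hypothetical helper for illustration only. It follows the documented rules (a colon between key and value, IDs below max_id, zero fill) but is not the PAI-TensorFlow implementation.

```python
def csv_kv2dense(records, max_id, field_delim=","):
    """Build a zero-filled matrix from 'index:value' CSV strings."""
    matrix = []
    for record in records:
        row = [0.0] * max_id
        for pair in record.split(field_delim):
            key, sep, value = pair.partition(":")
            # Documented rule: key and value must be separated by a colon.
            if not sep:
                raise ValueError("missing ':' in pair %r" % pair)
            col = int(key)
            # Documented rule: an ID equal to or above max_id is an error.
            if col >= max_id:
                raise ValueError("ID %d is not less than max_id %d" % (col, max_id))
            row[col] = float(value)
        matrix.append(row)
    return matrix


matrix = csv_kv2dense(["1:0.1,2:0.2,4:0.4,10:1.0"], 12)
print(matrix[0])
# [0.0, 0.1, 0.2, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```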

Python interface: trans_csv_kv2sparse

Convert a collection of CSV strings that mark valid positions and their values into a sparse matrix. The valid positions and their values are in the key-value pair format.

trans_csv_kv2sparse(records, max_id, field_delim=",")
  • Configure the parameters. The following table describes the parameters.

    Parameter    Required  Description
    records      Yes       The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. The fields in each CSV string are separated by the delimiter that field_delim specifies. Each field is a key-value pair. The key and the value of each pair must be separated by a colon (:). Otherwise, the system reports an error.
    max_id       Yes       The maximum number of columns in the output sparse matrix. The value of this parameter is of the INT64 type. This parameter determines the number of columns in dense_shape. If an ID is greater than or equal to max_id, the system reports an error.
    field_delim  No        The delimiter of the CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, a plus sign (+), a minus sign (-), a lowercase letter e, an uppercase letter E, a period (.), or a multi-byte character. If you use a space as the delimiter, consecutive spaces are treated as one delimiter.

  • Output: A sparse matrix that is converted from index CSV strings in the key-value pair format. The default output is of the FLOAT type.

Example: Convert a batch of strings that contain key-value pairs in the Index:Value format to a sparse matrix.

  • Input:

    ["1:0.1,2:0.2,4:0.4,10:1.0",
    "0:0.22,3:0.33,9:0.99",
    "2:0.24,7:0.84,8:0.96"]
  • Requirement:

    Set the number of columns to 20 and generate a sparse tensor.

  • Code:

    outsparse = tf.trans_csv_kv2sparse(
    ["1:0.1,2:0.2,4:0.4,10:1.0",
     "0:0.22,3:0.33,9:0.99",
     "2:0.24,7:0.84,8:0.96" ] , 20)
  • Result:

    SparseTensor(
    indices=[[0,1],[0,2],[0,4],[0,10],[1,0],[1,3],[1,9],[2,2],[2,7],[2,8]],
    values=[0.1, 0.2, 0.4, 1.0, 0.22, 0.33, 0.99, 0.24, 0.84, 0.96],
    dense_shape=[3,20])
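
The three fields of the sparse result relate to each other as follows: each entry of indices is a [row, column] pair, and the entry of values at the same position is stored there. The plain-Python sketch below builds the triple and then expands it back into a dense matrix to show the relationship. Both csv_kv2sparse and sparse_to_dense are hypothetical helpers for illustration, not PAI-TensorFlow functions.

```python
def csv_kv2sparse(records, max_id, field_delim=","):
    """Return (indices, values, dense_shape) for 'index:value' CSV strings."""
    indices, values = [], []
    for row, record in enumerate(records):
        for pair in record.split(field_delim):
            key, _, value = pair.partition(":")
            col = int(key)
            if col >= max_id:
                raise ValueError("ID %d is not less than max_id %d" % (col, max_id))
            indices.append([row, col])
            values.append(float(value))
    return indices, values, [len(records), max_id]


def sparse_to_dense(indices, values, dense_shape):
    """Expand a sparse triple into a zero-filled list of rows."""
    rows, cols = dense_shape
    dense = [[0.0] * cols for _ in range(rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


records = ["1:0.1,2:0.2,4:0.4,10:1.0", "0:0.22,3:0.33,9:0.99", "2:0.24,7:0.84,8:0.96"]
indices, values, dense_shape = csv_kv2sparse(records, 20)
print(indices[7:])  # [[2, 2], [2, 7], [2, 8]]
print(dense_shape)  # [3, 20]
dense = sparse_to_dense(indices, values, dense_shape)
print(dense[2][2])  # 0.24
```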

Python interface: trans_csv_id2dense

Convert a collection of CSV strings that mark valid positions into a dense matrix.

trans_csv_id2dense(records, max_id, id_as_value=False, field_delim=",")
  • Configure the parameters. The following table describes the parameters.

    Parameter    Required  Description
    records      Yes       The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. The fields in each CSV string are separated by the delimiter that field_delim specifies.
    max_id       Yes       The maximum number of columns in the output dense matrix. The value of this parameter is of the INT64 type. If an ID is greater than or equal to max_id, the system reports an error.
    id_as_value  No        Specifies whether the ID, as an INT64 value, is used as the value of the valid position in the dense matrix. The value of this parameter is of the BOOL type. Default value: False. With the default value, each valid position is set to 1, as shown in the following example.
    field_delim  No        The delimiter of the CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, a plus sign (+), a minus sign (-), a lowercase letter e, an uppercase letter E, a period (.), or a multi-byte character. If you use a space as the delimiter, consecutive spaces are treated as one delimiter.

  • Output: A dense tensor that is converted from the index CSV strings. Positions without an ID are filled with 0.0.

Example: Convert a batch of strings that contain index data into a dense matrix.

  • Input:

    ["2,10","7","0,8"]
  • Requirement:

    Set the number of columns to 12 and the value of each valid position to 1.

  • Code:

    outmatrix = tf.trans_csv_id2dense(
    ["2,10","7","0,8"], 12)
  • Result:

    [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]
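
The effect of id_as_value is easier to see side by side. The plain-Python sketch below reproduces the result above for the default setting and shows one row with id_as_value=True. The function csv_id2dense is a hypothetical stand-in, and storing the ID itself when id_as_value=True is an assumption based on the parameter description.

```python
def csv_id2dense(records, max_id, id_as_value=False, field_delim=","):
    """Build a zero-filled matrix from index CSV strings."""
    matrix = []
    for record in records:
        row = [0] * max_id
        for field in record.split(field_delim):
            col = int(field)
            # Documented rule: an ID equal to or above max_id is an error.
            if col >= max_id:
                raise ValueError("ID %d is not less than max_id %d" % (col, max_id))
            # Assumption: id_as_value=True stores the ID, otherwise 1.
            row[col] = col if id_as_value else 1
        matrix.append(row)
    return matrix


records = ["2,10", "7", "0,8"]
print(csv_id2dense(records, 12)[0])
# [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(csv_id2dense(records, 12, id_as_value=True)[0])
# [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 10, 0]
```

Under this assumption, an ID of 0 stored as a value is indistinguishable from an empty position, which is one reason to keep the default for dense output.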

Python interface: trans_csv_to_dense

Convert a collection of CSV strings that contain numeric values into a dense matrix.

trans_csv_to_dense(records, max_id, field_delim=",")
  • Configure the parameters. The following table describes the parameters.

    Parameter    Required  Description
    records      Yes       The array of CSV strings that you want to convert. The value of this parameter is of the STRING type. The fields in each CSV string are separated by the delimiter that field_delim specifies.
    max_id       Yes       The maximum number of columns in the output dense matrix. The value of this parameter is of the INT64 type. If the number of columns in a CSV string is greater than or equal to the value of this parameter, the system reports an error.
    field_delim  No        The delimiter of the CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, a plus sign (+), a minus sign (-), a lowercase letter e, an uppercase letter E, a period (.), or a multi-byte character. If you use a space as the delimiter, consecutive spaces are treated as one delimiter.

  • Output: A dense matrix that is converted from the CSV strings of numeric values. By default, the output is of the FLOAT type. Positions without a value are filled with 0.0.

Example: Convert a batch of CSV strings that contain numeric values into a dense matrix.

  • Input:

    ["0.1,0.2,0.4,1.0",
    "0.22,0.33,0.99",
    "0.24,0.84,0.96"] 
  • Requirement:

    Set the number of columns to 6.

  • Code:

    outmatrix = tf.trans_csv_to_dense(
    ["0.1,0.2,0.4,1.0",
     "0.22,0.33,0.99",
     "0.24,0.84,0.96" ] , 6)
  • Result:

    [[0.1, 0.2, 0.4, 1.0, 0.0, 0.0]
    [0.22, 0.33, 0.99, 0.0, 0.0, 0.0]
    [0.24, 0.84, 0.96, 0.0, 0.0, 0.0]]
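
Unlike the index-based interfaces, this conversion keeps values in their original order and pads each row on the right. The plain-Python sketch below reproduces the result above. The function csv_to_dense is a hypothetical helper for illustration, not the PAI-TensorFlow implementation. It applies the column limit exactly as the parameter table states it.

```python
def csv_to_dense(records, max_id, field_delim=","):
    """Parse numeric CSV strings and right-pad each row with 0.0."""
    matrix = []
    for record in records:
        fields = [float(field) for field in record.split(field_delim)]
        # Rule as stated in the table: max_id or more columns is an error.
        if len(fields) >= max_id:
            raise ValueError("%d columns, limit is %d" % (len(fields), max_id))
        matrix.append(fields + [0.0] * (max_id - len(fields)))
    return matrix


matrix = csv_to_dense(["0.1,0.2,0.4,1.0", "0.22,0.33,0.99", "0.24,0.84,0.96"], 6)
print(matrix[0])  # [0.1, 0.2, 0.4, 1.0, 0.0, 0.0]
print(matrix[1])  # [0.22, 0.33, 0.99, 0.0, 0.0, 0.0]
```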

Sample code

The following sample code uses TensorFlow to read data from a table stored in MaxCompute. The table contains six columns. The first column contains IDs, the second column contains CSV data in the key-value pair format, and the last four columns contain CSV data in the index format. After the data is read, the code calls the TransCSV operations to convert the five CSV columns into one dense matrix and four sparse matrices for model training.

import tensorflow as tf


def read_table(filename_queue):
    batch_size = 128
    # Buffer up to eight batches in the reader and in the batching queue.
    capacity = 8 * batch_size
    # Each table row is read as one string whose columns are separated by ';'.
    reader = tf.TableRecordReader(csv_delimiter=';', num_threads=8, capacity=capacity)
    key, value = reader.read_up_to(filename_queue, batch_size)
    values = tf.train.batch([value], batch_size=batch_size, capacity=capacity, enqueue_many=True, num_threads=8)
    # One default per table column: a float followed by five CSV strings.
    record_defaults = [[1.0], [""], [""], [""], [""], [""]]
    # max_id values for col2 through col6.
    feature_size = [1322, 30185604, 43239874, 5758226, 41900998]
    col1, col2, col3, col4, col5, col6 = tf.decode_csv(values, record_defaults=record_defaults, field_delim=';')
    # col2 holds key-value pairs and is converted to a dense matrix.
    col2 = tf.trans_csv_kv2dense(col2, feature_size[0])
    # col3 through col6 hold indexes and are converted to sparse matrices.
    col3 = tf.trans_csv_id2sparse(col3, feature_size[1])
    col4 = tf.trans_csv_id2sparse(col4, feature_size[2])
    col5 = tf.trans_csv_id2sparse(col5, feature_size[3])
    col6 = tf.trans_csv_id2sparse(col6, feature_size[4])
    return [col1, col2, col3, col4, col5, col6]


if __name__ == '__main__':
    tf.app.flags.DEFINE_string("tables", "", "tables")
    tf.app.flags.DEFINE_integer("num_epochs", 1000, "number of epochs")
    FLAGS = tf.app.flags.FLAGS
    table_pattern = FLAGS.tables
    num_epochs = FLAGS.num_epochs
    # string_input_producer expects a list of strings. Splitting on commas
    # also covers several tables passed in one flag.
    filename_queue = tf.train.string_input_producer(table_pattern.split(","), num_epochs=num_epochs)
    train_data = read_table(filename_queue)
    init_global = tf.global_variables_initializer()
    init_local = tf.local_variables_initializer()
    with tf.Session() as sess:
        sess.run(init_global)
        sess.run(init_local)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        for i in range(1000):
            sess.run(train_data)
        coord.request_stop()
        coord.join(threads)
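
To check the expected row layout before running a job, the following plain-Python sketch splits one row the same way the sample code does: on ';' into six columns, then on ',' within each CSV column. The function split_row and the sample row are hypothetical and made up for illustration; real table rows depend on your data.

```python
def split_row(row, csv_delimiter=";"):
    """Split one table row into a float, a key-value dict, and four ID lists."""
    columns = row.split(csv_delimiter)
    if len(columns) != 6:
        raise ValueError("expected 6 columns, got %d" % len(columns))
    # The first column is parsed as a float, matching record_defaults.
    col1 = float(columns[0])
    # The second column holds 'index:value' pairs.
    pairs = (pair.split(":") for pair in columns[1].split(","))
    col2 = {int(key): float(value) for key, value in pairs}
    # The last four columns hold plain indexes.
    id_cols = [[int(i) for i in column.split(",")] for column in columns[2:]]
    return col1, col2, id_cols


# Hypothetical sample row, not taken from a real table.
col1, col2, id_cols = split_row("1.0;1:0.1,2:0.2;2,10;7;0,8;3")
print(col1)     # 1.0
print(col2)     # {1: 0.1, 2: 0.2}
print(id_cols)  # [[2, 10], [7], [0, 8], [3]]
```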