This topic introduces the data conversion methods in PAI-TensorFlow.

Python interface: trans_csv_id2sparse

Convert a collection of CSV strings that mark valid positions into a sparse matrix.
trans_csv_id2sparse(records, max_id, id_as_value=True, field_delim=",")
  • Configure the following parameters.
    Parameter Required Description
    records Yes The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. CSV strings are separated by delimiters.
    max_id Yes The maximum number of columns for a sparse matrix. The value of this parameter is of the INT64 type. This parameter specifies the output value of dense_shape. If the actual ID is greater than or equal to the value of dense_shape, the system reports an error.
    id_as_value No Specifies whether the index number is used as the value of the valid point in a sparse matrix. The value of this parameter is of the BOOL type. Default value: True. The value of the valid point is of the INT64 type. We recommend that you do not set this parameter to False unless otherwise specified.
    field_delim No The delimiter of CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, positive sign (+), negative sign (-), lowercase letter e, uppercase letter E, period (.), or multi-byte delimiter. If you use a space as a delimiter, consecutive spaces are considered as one delimiter.
  • Output: The sparse tensor that is converted from index CSV strings. The values of the tensor are of the INT64 type.
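The field_delim constraints can be read as a small validation rule. The following sketch is plain Python and is not part of PAI-TensorFlow. The helper names is_valid_field_delim and split_fields are hypothetical, and the sketch assumes that "multi-byte delimiter" means any delimiter that is longer than one byte.

```python
# Characters that are ruled out as delimiters, presumably because they can
# occur inside a numeric literal such as -1.5e+3.
FORBIDDEN_DELIM_CHARS = set("0123456789+-eE.")


def is_valid_field_delim(delim):
    """Return True if `delim` satisfies the documented field_delim rules."""
    # Assumption: a valid delimiter is exactly one single-byte character.
    if len(delim) != 1 or len(delim.encode("utf-8")) != 1:
        return False
    return delim not in FORBIDDEN_DELIM_CHARS


def split_fields(record, delim=","):
    """Split one record into fields; runs of spaces count as one delimiter."""
    if delim == " ":
        # Dropping empty pieces collapses consecutive spaces.
        return [field for field in record.split(" ") if field]
    return record.split(delim)
```

With these helpers, a semicolon or a space passes the check, whereas "e" or "." does not.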
Example: Convert a batch of strings that contain index data into a sparse tensor.
  • Input:
    ["2,10","7","0,8"]
  • Requirements:

    Set the number of matrix columns to 20. The value of each valid point is its index from the input string.

  • Code:
    outsparse = tf.trans_csv_id2sparse(["2,10","7","0,8"], 20)
  • Returned results:
    SparseTensor(
    indices=[[0,2],[0,10],[1,7],[2,0],[2,8]],
    values=[2, 10, 7, 0, 8],
    dense_shape=[3,20])
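To make the mapping from strings to indices, values, and dense_shape concrete, the following plain-Python sketch emulates the conversion described above. It is not the PAI-TensorFlow implementation. The function name csv_id2sparse is hypothetical, and only the default behavior (id_as_value=True) is modeled.

```python
def csv_id2sparse(records, max_id, field_delim=","):
    """Return (indices, values, dense_shape) for a batch of index CSV strings."""
    indices, values = [], []
    for row, record in enumerate(records):
        for field in record.split(field_delim):
            col = int(field)
            # An ID that is greater than or equal to max_id is an error.
            if not 0 <= col < max_id:
                raise ValueError("ID %d is out of range [0, %d)" % (col, max_id))
            indices.append([row, col])
            values.append(col)  # id_as_value=True: the ID is also the value
    return indices, values, [len(records), max_id]


indices, values, dense_shape = csv_id2sparse(["2,10", "7", "0,8"], 20)
```

Each input string becomes one row, and each ID in that string becomes one [row, ID] entry in indices.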

Python interface: trans_csv_kv2dense

Convert a collection of CSV strings that mark valid positions and their values into a dense matrix. The valid positions and their values are in the key-value pair format.
trans_csv_kv2dense(records, max_id, field_delim=",")
  • Configure the following parameters.
    Parameter Required Description
    records Yes The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. CSV strings are separated by delimiters. Each CSV string is in the key-value pair format. The key and the value of each pair must be separated by a colon (:). Otherwise, the system reports an error.
    max_id Yes The number of columns in the output dense matrix. The value of this parameter is of the INT64 type. If the actual ID value is greater than or equal to the number of columns, the system reports an error.
    field_delim No The delimiter of CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, positive sign (+), negative sign (-), lowercase letter e, uppercase letter E, period (.), or multi-byte delimiter. If you use a space as a delimiter, consecutive spaces are considered as one delimiter.
  • Output: A dense matrix that is converted from index CSV strings in the key-value pair format. The default output is of the FLOAT type. The blank space is filled with 0.0.
Example: Convert a batch of strings that contain key-value pairs in the Index:Value format to a dense matrix.
  • Input:
    ["1:0.1,2:0.2,4:0.4,10:1.0",
    "0:0.22,3:0.33,9:0.99",
    "2:0.24,7:0.84,8:0.96"]
  • Requirements:

    Set the column width to 12.

  • Code:
    outmatrix = tf.trans_csv_kv2dense(
    ["1:0.1,2:0.2,4:0.4,10:1.0",
     "0:0.22,3:0.33,9:0.99",
     "2:0.24,7:0.84,8:0.96" ] , 12)
  • Returned results:
    [[0.0, 0.1, 0.2, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    [0.22, 0.0, 0.0, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0, 0.99, 0.0, 0.0]
    [0.0, 0.0, 0.24, 0.0, 0.0, 0.0, 0.0, 0.84, 0.96, 0.0, 0.0, 0.0]]
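The key-value parsing rule can be sketched in plain Python as follows. This is an illustration of the behavior described above, not the PAI-TensorFlow implementation, and the function name csv_kv2dense is hypothetical.

```python
def csv_kv2dense(records, max_id, field_delim=","):
    """Return a list of rows, each with max_id float entries."""
    dense = []
    for record in records:
        row = [0.0] * max_id  # positions without a key stay 0.0
        for pair in record.split(field_delim):
            key, sep, value = pair.partition(":")
            if not sep:
                # The key and the value must be separated by a colon.
                raise ValueError("missing ':' in %r" % pair)
            col = int(key)
            if not 0 <= col < max_id:
                raise ValueError("ID %d is out of range [0, %d)" % (col, max_id))
            row[col] = float(value)
        dense.append(row)
    return dense


matrix = csv_kv2dense(
    ["1:0.1,2:0.2,4:0.4,10:1.0",
     "0:0.22,3:0.33,9:0.99",
     "2:0.24,7:0.84,8:0.96"], 12)
```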

Python interface: trans_csv_kv2sparse

Convert a collection of CSV strings that mark valid positions and their values into a sparse matrix. The valid positions and their values are in the key-value pair format.
trans_csv_kv2sparse(records, max_id, field_delim=",")
  • Configure the following parameters.
    Parameter Required Description
    records Yes The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. CSV strings are separated by delimiters. Each CSV string is in the key-value pair format. The key and the value of each pair must be separated by a colon (:). Otherwise, the system reports an error.
    max_id Yes The maximum number of columns for a sparse matrix. The value of this parameter is of the INT64 type. This parameter specifies the value of dense_shape in the output. If the actual ID is greater than or equal to the value of dense_shape, the system reports an error.
    field_delim No The delimiter of CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, positive sign (+), negative sign (-), lowercase letter e, uppercase letter E, period (.), or multi-byte delimiter. If you use a space as a delimiter, consecutive spaces are considered as one delimiter.
  • Output: A sparse matrix that is converted from index CSV strings in the key-value pair format. The default output is of the FLOAT type.
Example: Convert a batch of strings that contain key-value pairs in the Index:Value format to a sparse matrix.
  • Input:
    ["1:0.1,2:0.2,4:0.4,10:1.0",
    "0:0.22,3:0.33,9:0.99",
    "2:0.24,7:0.84,8:0.96" ]
  • Requirements:

    Set the width of the column to 20 to generate a sparse matrix tensor.

  • Code:
    outsparse = tf.trans_csv_kv2sparse(
    ["1:0.1,2:0.2,4:0.4,10:1.0",
     "0:0.22,3:0.33,9:0.99",
     "2:0.24,7:0.84,8:0.96" ] , 20)
  • Returned results:
    SparseTensor(
    indices=[[0,1],[0,2],[0,4],[0,10],[1,0],[1,3],[1,9],[2,2],[2,7],[2,8]],
    values=[0.1, 0.2, 0.4, 1.0, 0.22, 0.33, 0.99, 0.24, 0.84, 0.96],
    dense_shape=[3,20])
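The sparse result carries the same information as a dense matrix, only in coordinate form. The following plain-Python sketch emulates the conversion and then expands the result to check that claim. It is not the PAI-TensorFlow implementation, and the names csv_kv2sparse and densify are hypothetical. In the sample input, the third string yields the column indices 2, 7, and 8.

```python
def csv_kv2sparse(records, max_id, field_delim=","):
    """Return (indices, values, dense_shape) for key-value CSV strings."""
    indices, values = [], []
    for row, record in enumerate(records):
        for pair in record.split(field_delim):
            key, sep, value = pair.partition(":")
            if not sep:
                raise ValueError("missing ':' in %r" % pair)
            col = int(key)
            if not 0 <= col < max_id:
                raise ValueError("ID %d is out of range [0, %d)" % (col, max_id))
            indices.append([row, col])
            values.append(float(value))
    return indices, values, [len(records), max_id]


def densify(indices, values, dense_shape):
    """Expand a sparse triple into a list of rows filled with 0.0 elsewhere."""
    rows, cols = dense_shape
    dense = [[0.0] * cols for _ in range(rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


indices, values, dense_shape = csv_kv2sparse(
    ["1:0.1,2:0.2,4:0.4,10:1.0",
     "0:0.22,3:0.33,9:0.99",
     "2:0.24,7:0.84,8:0.96"], 20)
dense = densify(indices, values, dense_shape)
```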

Python interface: trans_csv_id2dense

Convert a collection of CSV strings that mark valid positions into a dense matrix.
trans_csv_id2dense(records, max_id, id_as_value=False, field_delim=",")
  • Configure the following parameters.
    Parameter Required Description
    records Yes The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. CSV strings are separated by delimiters.
    max_id Yes The number of columns in the output dense matrix. The value of this parameter is of the INT64 type. If the actual ID value is greater than or equal to the number of columns, the system reports an error.
    id_as_value No Specifies whether the index number, which is of the INT64 type, is used as the value of the valid point in the dense matrix. The value of this parameter is of the BOOL type. Default value: False.
    field_delim No The delimiter of CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, positive sign (+), negative sign (-), lowercase letter e, uppercase letter E, period (.), or multi-byte delimiter. If you use a space as a delimiter, consecutive spaces are considered as one delimiter.
  • Output: A dense tensor that is converted from index CSV strings. The blank space is filled with 0.0.
Example: Convert a batch of strings that contain index data into a dense matrix.
  • Input:
    ["2,10","7","0,8"]
  • Requirements:

    Set the column width to 12 and the value of each valid point to 1.

  • Code:
    outmatrix = tf.trans_csv_id2dense(
    ["2,10","7","0,8"], 12)
  • Returned results:
    [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]
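The following plain-Python sketch emulates the conversion above. It is not the PAI-TensorFlow implementation, and the function name csv_id2dense is hypothetical. The id_as_value=True branch is an assumption based on the parameter description: the ID itself is written at the valid position instead of 1.

```python
def csv_id2dense(records, max_id, id_as_value=False, field_delim=","):
    """Return a list of rows, each with max_id integer entries."""
    dense = []
    for record in records:
        row = [0] * max_id
        for field in record.split(field_delim):
            col = int(field)
            if not 0 <= col < max_id:
                raise ValueError("ID %d is out of range [0, %d)" % (col, max_id))
            # Assumption: True stores the ID, False (default) stores 1.
            row[col] = col if id_as_value else 1
        dense.append(row)
    return dense


marked = csv_id2dense(["2,10", "7", "0,8"], 12)
with_ids = csv_id2dense(["2,10", "7", "0,8"], 12, id_as_value=True)
```

Under this assumption, an ID of 0 is indistinguishable from an empty position when id_as_value is True, which is one reason to keep the default for dense output.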

Python interface: trans_csv_to_dense

Convert a collection of CSV strings that contain numeric values into a dense matrix.
trans_csv_to_dense(records, max_id, field_delim=",")
  • Configure the following parameters.
    Parameter Required Description
    records Yes The array of CSV strings that you want to parse. The value of this parameter is of the STRING type. CSV strings are separated by delimiters.
    max_id Yes The number of columns in the output dense matrix. The value of this parameter is of the INT64 type. If the number of columns in the CSV strings is greater than or equal to the value of this parameter, the system reports an error.
    field_delim No The delimiter of CSV data. The value of this parameter is of the STRING type. Default value: comma (,). The delimiter cannot be a digit, positive sign (+), negative sign (-), lowercase letter e, uppercase letter E, period (.), or multi-byte delimiter. If you use a space as a delimiter, consecutive spaces are considered as one delimiter.
  • Output: A dense tensor that is converted from numeric CSV strings. The blank space is filled with 0.0.
Example: Convert a batch of CSV strings that contain numeric values into a dense matrix.
  • Input:
    ["0.1,0.2,0.4,1.0",
    "0.22,0.33,0.99",
    "0.24,0.84,0.96" ]
  • Requirements:

    Set the column width to 6.

  • Code:
    outmatrix = tf.trans_csv_to_dense(
    ["0.1,0.2,0.4,1.0",
     "0.22,0.33,0.99",
     "0.24,0.84,0.96" ] , 6)
  • Returned results:
    [[0.1, 0.2, 0.4, 1.0, 0.0, 0.0]
    [0.22, 0.33, 0.99, 0.0, 0.0, 0.0]
    [0.24, 0.84, 0.96, 0.0, 0.0, 0.0]]
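The padding behavior can be sketched in plain Python as follows. This is an illustration only, not the PAI-TensorFlow implementation, and the function name csv_to_dense is hypothetical. The error condition follows the max_id description above literally.

```python
def csv_to_dense(records, max_id, field_delim=","):
    """Return a list of rows, each padded with 0.0 up to max_id entries."""
    dense = []
    for record in records:
        row = [float(field) for field in record.split(field_delim)]
        # Per the parameter description, a record whose column count is
        # greater than or equal to max_id is treated as an error.
        if len(row) >= max_id:
            raise ValueError("%d columns do not fit max_id=%d" % (len(row), max_id))
        dense.append(row + [0.0] * (max_id - len(row)))
    return dense


matrix = csv_to_dense(
    ["0.1,0.2,0.4,1.0",
     "0.22,0.33,0.99",
     "0.24,0.84,0.96"], 6)
```

Unlike the ID-based and key-value interfaces, the position of each value here is determined only by its order in the string.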

Code

The following code uses TensorFlow to read data from a table stored in MaxCompute. The table contains six columns. The first column contains IDs. The second column contains CSV data in the key-value pair format. The last four columns contain CSV data in the index format. After the data is read, the code calls the TransCSV operations to convert the CSV data in these five columns into one dense matrix and four sparse matrices for model training.
import tensorflow as tf
import numpy as np
def read_table(filename_queue):
    batch_size = 128
    reader = tf.TableRecordReader(csv_delimiter=';', num_threads=8, capacity=8*batch_size)
    key, value = reader.read_up_to(filename_queue, batch_size)
    values = tf.train.batch([value], batch_size=batch_size, capacity=8*batch_size, enqueue_many=True, num_threads=8)
    record_defaults = [[1.0], [""], [""], [""], [""], [""]]
    feature_size = [1322,30185604,43239874,5758226,41900998]
    col1, col2, col3, col4, col5, col6 = tf.decode_csv(values, record_defaults=record_defaults, field_delim=';')
    col2 = tf.trans_csv_kv2dense(col2, feature_size[0])
    col3 = tf.trans_csv_id2sparse(col3, feature_size[1])
    col4 = tf.trans_csv_id2sparse(col4, feature_size[2])
    col5 = tf.trans_csv_id2sparse(col5, feature_size[3])
    col6 = tf.trans_csv_id2sparse(col6, feature_size[4])
    return [col1, col2, col3, col4, col5, col6]
if __name__ == '__main__':
    tf.app.flags.DEFINE_string("tables", "", "tables")
    tf.app.flags.DEFINE_integer("num_epochs", 1000, "number of epochs")
    FLAGS = tf.app.flags.FLAGS
    table_pattern = FLAGS.tables
    num_epochs = FLAGS.num_epochs
    filename_queue = tf.train.string_input_producer([table_pattern], num_epochs=num_epochs)
    train_data = read_table(filename_queue)
    init_global = tf.global_variables_initializer()
    init_local = tf.local_variables_initializer()
    with tf.Session() as sess:
      sess.run(init_global)
      sess.run(init_local)
      coord = tf.train.Coordinator()
      threads = tf.train.start_queue_runners(sess=sess, coord=coord)
      for i in range(1000):
        sess.run(train_data)
      coord.request_stop()
      coord.join(threads)
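The code above relies on two levels of delimiters: a semicolon separates the six table columns, and a comma separates the fields inside each CSV column. The following plain-Python sketch shows that layout on a single row. The sample row and the helper name split_table_row are hypothetical and serve only as an illustration.

```python
def split_table_row(row, column_delim=";"):
    """Split one table row into its six column strings."""
    columns = row.split(column_delim)
    if len(columns) != 6:
        raise ValueError("expected 6 columns, got %d" % len(columns))
    return columns


# Hypothetical row: an ID, one key-value column, and four index columns.
sample_row = "1.0;1:0.1,2:0.2;2,10;7;0,8;3"
col1, col2, col3, col4, col5, col6 = split_table_row(sample_row)

# The second column holds index:value pairs, the others hold plain indexes.
kv_pairs = [pair.split(":") for pair in col2.split(",")]
ids = [int(field) for field in col3.split(",")]
```

Because the comma is already used inside the columns, the column delimiter must be a different character, which is why the reader and decode_csv are both configured with a semicolon.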