EmbeddingVariable saves memory resources in ultra-large-scale training while keeping model features lossless.

Background information

Embedding has become an effective way to handle word and ID features in deep learning. An embedding is a function mapping: it maps high-dimensional sparse features to low-dimensional dense vectors that can be trained end to end. In TensorFlow, variables are used to hold model or node state, and variables are implemented on top of the tensor data structure. A tensor is an abstract data type in TensorFlow that covers scalars, vectors, matrices, and higher-dimensional structures. Tensors are the data carriers exchanged among operators: any operator that accepts and produces tensors can take part in graph computing. Tensors use contiguous storage. Therefore, when you define a variable, you must specify its type and shape, and the shape cannot be modified afterward.

TensorFlow uses variables to implement the embedding mechanism, and [vocabulary_size, embedding_dimension] specifies the shape of an embedding variable. In scenarios with large-scale sparse features, this approach has the following disadvantages (a sketch of the conventional approach follows the list):
  • vocabulary_size is determined by the number of IDs. In online learning scenarios, the number of IDs keeps growing, so vocabulary_size is difficult to estimate.
  • In most cases, an ID is a string, and the number of IDs is large. Before embedding, you must hash each ID into the vocabulary_size range:
    • If vocabulary_size is excessively small, the probability of hash collisions increases, and different features may be mapped to the same embedding, which causes feature loss.
    • If vocabulary_size is excessively large, the variable stores embeddings that are never used, which wastes memory.
  • Large embedding variables increase the model size. Even if regularization is used to reduce the impact of the embeddings of some features on the whole model, the embeddings themselves cannot be removed from the model.
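
The following sketch illustrates the conventional static-variable approach that the preceding list describes. The vocabulary_size value and feature strings are hypothetical; with a fixed-shape variable, every string ID must first be hashed into the [0, vocabulary_size) range, so hash collisions and memory redundancy are unavoidable trade-offs.
import tensorflow as tf

# Conventional approach: a variable with the fixed shape [vocabulary_size, embedding_dim].
vocabulary_size = 1000   # hypothetical and hard to estimate in online learning
embedding_dim = 8
var = tf.get_variable("static_emb", shape=[vocabulary_size, embedding_dim],
                      initializer=tf.ones_initializer(tf.float32))

# String IDs must be hashed into [0, vocabulary_size); different IDs may collide.
ids = tf.constant(["aaaa", "bbbbb", "ccc"])
hashed = tf.string_to_hash_bucket_fast(ids, vocabulary_size)
emb = tf.nn.embedding_lookup(var, hashed)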

To address these issues, PAI-TensorFlow provides EmbeddingVariable. While keeping feature training lossless, EmbeddingVariable uses memory resources cost-effectively, which enables offline training on ultra-large-scale features and online serving of the resulting model. PAI-TensorFlow provides the EmbeddingVariable V3.1 and Feature_Column V3.3 APIs. We recommend Feature_Column because it automatically accelerates the feature identification process for strings.

EmbeddingVariable features

  • Dynamic embedding

    You only need to specify the embedding_dim parameter. PAI-TensorFlow then dynamically increases or decreases the dictionary size during training. This method is suitable for online learning and frees you from preprocessing data for the model.

  • Group lasso

    In most cases, an embedding variable produced by deep learning is large. If you deploy such a variable as an online service, it may overload the server. Group lasso-based embedding variables reduce the cost of model deployment.

  • EmbeddingVariable allows you to pass original feature values directly to the embedding lookup. This frees you from ID-mapping operations, such as hashing, so feature-lossless training can be achieved.
  • EmbeddingVariable supports graph inference, backpropagation, and the import and export of variables. During model training, the optimizer automatically updates embedding variables.

tf.get_embedding_variable

tf.get_embedding_variable returns an existing embedding variable or creates a new one (a minimal usage sketch follows the parameter list below). Definition:
get_embedding_variable(
    name,
    embedding_dim,
    key_dtype=dtypes.int64,
    value_dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    custom_getter=None,
    constraint=None,
    steps_to_live=None
)
  • name: the name of the embedding variable.
  • embedding_dim: the embedding dimension. Example: 8 or 64.
  • key_dtype: the type of the key in embedding lookup. Default value: int64.
  • value_dtype: the type of the embedding vector. Only the FLOAT type is supported.
  • initializer: the initial value of the embedding vector.
  • trainable: specifies whether the variable is added to the collection of GraphKeys.TRAINABLE_VARIABLES.
  • partitioner: the partition function.
  • steps_to_live: the feature expiration threshold, in global steps. The system removes a feature if it has not been updated for more than this number of global steps.
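
A minimal usage sketch based on the signature above. The variable name, dimension, and steps_to_live value are illustrative only:
import tensorflow as tf

# Create (or reuse) an embedding variable with 8-dimensional embeddings.
# Features that are not updated within 4000 global steps are removed (steps_to_live).
var = tf.get_embedding_variable("user_emb",
                                embedding_dim=8,
                                key_dtype=tf.int64,
                                initializer=tf.truncated_normal_initializer(stddev=0.01),
                                steps_to_live=4000)

# Look up embeddings for int64 feature IDs; unseen IDs are initialized on the fly.
emb = tf.nn.embedding_lookup(var, tf.cast([10, 20, 30], tf.int64))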

EmbeddingVariable

Structure of EmbeddingVariable:
class EmbeddingVariable(ResourceVariable):

  def total_count():
    # Returns the dynamic shape [rowCount, embedding_dim] of the embedding variable.
  def read_value():
    raise NotImplementedError("...")
  def assign():
    raise NotImplementedError("...")
  def assign_add():
    raise NotImplementedError("...")
  def assign_sub():
    raise NotImplementedError("...")
  • The sparse_read() method, which reads sparse data, is supported. If a queried key does not exist, the initial value generated by the initializer for that key is returned (see the sketch after this list).
  • The total_count() method, which counts the total number of entries in the embedding variable, is supported. This method returns the dynamic shape of the variable.
  • The read_value() method, which reads the full value of a variable, is not supported.
  • The methods that assign values to embedding variables, including assign(), assign_add(), and assign_sub(), are not supported.
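
A short sketch of the supported read methods, assuming that var is an unpartitioned embedding variable created with tf.get_embedding_variable (the variable name is hypothetical):
import tensorflow as tf

var = tf.get_embedding_variable("var_read",
                                embedding_dim=3,
                                initializer=tf.ones_initializer(tf.float32))

# sparse_read() returns embeddings for the given keys; unseen keys receive initial values.
emb = var.sparse_read(tf.cast([0, 1, 2], tf.int64))
# total_count() returns the dynamic shape [rowCount, embedding_dim].
shape = var.total_count()

# read_value(), assign(), assign_add(), and assign_sub() raise NotImplementedError
# because full reads and direct assignment are not supported.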

Use feature_column to build an embedding variable

tf.contrib.layers.sparse_column_with_embedding(column_name,
                                               dtype=tf.string,
                                               partition_num=None,
                                               steps_to_live=None,
                                               # The following parameters are supported only by the 140 Lite version of TensorFlow.
                                               steps_to_live_l2reg=None,
                                               l2reg_theta=None)

  • column_name: the name of the column.
  • dtype: the data type of the feature. Default value: tf.string.

Examples

  • Use the low-level tf.get_embedding_variable API to build a TensorFlow graph that contains an embedding variable
    #! /usr/bin/python
    import tensorflow as tf
    
    var = tf.get_embedding_variable("var_0",
                                    embedding_dim=3,
                                    initializer=tf.ones_initializer(tf.float32),
                                    partitioner=tf.fixed_size_partitioner(num_shards=4))
    
    shape = [var1.total_count() for var1 in var]
    
    emb = tf.nn.embedding_lookup(var, tf.cast([0,1,2,5,6,7], tf.int64))
    fun = tf.multiply(emb, 2.0, name='multiply')
    loss = tf.reduce_sum(fun, name='reduce_sum')
    opt = tf.train.FtrlOptimizer(0.1,
                                 l1_regularization_strength=2.0,
                                 l2_regularization_strength=0.00001)
    
    g_v = opt.compute_gradients(loss)
    train_op = opt.apply_gradients(g_v)
    
    init = tf.global_variables_initializer()
    
    sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    with tf.Session(config=sess_config) as sess:
      sess.run([init])
      print(sess.run([emb, train_op, loss]))
      print(sess.run([emb, train_op, loss]))
      print(sess.run([emb, train_op, loss]))
      print(sess.run([shape]))
  • Save an embedding variable as a checkpoint
    #! /usr/bin/python
    import tensorflow as tf
    
    var = tf.get_embedding_variable("var_0",
                                    embedding_dim=3,
                                    initializer=tf.ones_initializer(tf.float32),
                                    partitioner=tf.fixed_size_partitioner(num_shards=4))
    
    emb = tf.nn.embedding_lookup(var, tf.cast([0,1,2,5,6,7], tf.int64))
    
    init = tf.global_variables_initializer()
    saver = tf.train.Saver(sharded=True)
    print("GLOBAL_VARIABLES: ", tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES))
    print("SAVEABLE_OBJECTS: ", tf.get_collection(tf.GraphKeys.SAVEABLE_OBJECTS))
    
    checkpointDir = "/tmp/model_dir"
    sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    with tf.Session(config=sess_config) as sess:
      sess.run([init])
      print(sess.run([emb]))
    
      save_path = saver.save(sess, checkpointDir + "/model.ckpt", global_step=666)
      tf.train.write_graph(sess.graph_def, checkpointDir, 'train.pbtxt')
      print("save_path", save_path)
      print("list_variables", tf.contrib.framework.list_variables(checkpointDir))
  • Restore an embedding variable from a checkpoint
    #! /usr/bin/python
    import tensorflow as tf
    
    var = tf.get_embedding_variable("var_0",
                                    embedding_dim=3,
                                    initializer=tf.ones_initializer(tf.float32),
                                    partitioner=tf.fixed_size_partitioner(num_shards=4))
    
    emb = tf.nn.embedding_lookup(var, tf.cast([0,1,2,5,6,7], tf.int64))
    
    init = tf.global_variables_initializer()
    saver = tf.train.Saver(sharded=True)
    print("GLOBAL_VARIABLES: ", tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES))
    print("SAVEABLE_OBJECTS: ", tf.get_collection(tf.GraphKeys.SAVEABLE_OBJECTS))
    
    checkpointDir = "/tmp/model_dir"
    sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    with tf.Session(config=sess_config) as sess:
      print("list_variables", tf.contrib.framework.list_variables(checkpointDir))
      saver.restore(sess, checkpointDir + "/model.ckpt-666")
      print(sess.run([emb]))
  • Use feature_column to build a TensorFlow graph that contains an embedding variable
    import tensorflow as tf
    import os
    
    columns_list=[]
    columns_list.append(tf.contrib.layers.sparse_column_with_embedding(column_name="col_emb", dtype=tf.string))
    W = tf.contrib.layers.shared_embedding_columns(sparse_id_columns=columns_list,
            dimension=3,
            initializer=tf.ones_initializer(tf.float32),
            shared_embedding_name="xxxxx_shared")
    
    ids={}
    ids["col_emb"] = tf.SparseTensor(indices=[[0,0],[1,0],[2,0],[3,0],[4,0]], values=["aaaa","bbbbb","ccc","4nn","5b"], dense_shape=[5, 5])
    
    emb = tf.contrib.layers.input_from_feature_columns(columns_to_tensors=ids, feature_columns=W)
    
    fun = tf.multiply(emb, 2.0, name='multiply')
    loss = tf.reduce_sum(fun, name='reduce_sum')
    opt = tf.train.FtrlOptimizer(0.1, l1_regularization_strength=2.0, l2_regularization_strength=0.00001)
    g_v = opt.compute_gradients(loss)
    train_op = opt.apply_gradients(g_v)
    init = tf.global_variables_initializer()
    init_local = tf.local_variables_initializer()
    sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    with tf.Session(config=sess_config) as sess:
      sess.run(init)
      print("init global done")
      sess.run(init_local)
      print("init local done")
      print(sess.run([emb, train_op,loss]))
      print(sess.run([emb, train_op,loss]))
      print(sess.run([emb, train_op,loss]))
      print(sess.run([emb]))
    Note: EmbeddingVariable supports only the FTRL, Adagrad, Adam, and AdagradDecay optimizers.
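
    For example, to train the same graph with the Adagrad optimizer instead of FTRL, only the optimizer line changes; the learning rate below is illustrative:
    # Any of the supported optimizers updates embedding variables automatically.
    opt = tf.train.AdagradOptimizer(learning_rate=0.1)
    g_v = opt.compute_gradients(loss)
    train_op = opt.apply_gradients(g_v)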