This topic describes how to use the AdagradDecay optimizer to perform ultra-large-scale model training.

Background information

Ultra-large-scale model training typically uses more than 1 billion samples, and the number of samples keeps increasing. As a result, a training job can last for more than one month. To address this issue, PAI-TensorFlow provides the AdagradDecay optimizer.

Enable the AdagradDecay optimizer

To use the AdagradDecay optimizer for ultra-large-scale model training, define a tf.train.AdagradDecayOptimizer object. You can use the AdagradDecay optimizer in the same way as a native TensorFlow optimizer. The following code shows the definition of the optimizer:
class AdagradDecayOptimizer(optimizer.Optimizer):
  """Optimizer that implements the Adagrad algorithm with accumulator decay.

  Different from the original Adagrad algorithm, AdagradDecay performs decay
  at a given step with a given rate, so that the accumulator does not grow
  to infinity.
  """

  def __init__(self,
               learning_rate,
               global_step,
               initial_accumulator_value,
               accumulator_decay_step,
               accumulator_decay_rate,
               use_locking=False,
               name="AdagradDecay"):
    """Construct a new AdagradDecay optimizer.

    Args:
      learning_rate: A `Tensor` or a floating point value. The learning rate.
      global_step: Global step variable, used for calculating t%T.
      initial_accumulator_value: A floating point value. Starting and baseline
        value for the accumulators; must be positive. The accumulators will
        not be less than this value.
      accumulator_decay_step: When global_step reaches a multiple of
        accumulator_decay_step, the accumulator is decayed by
        accumulator_decay_rate: accumulator *= accumulator_decay_rate.
      accumulator_decay_rate: The decay rate described above.
      use_locking: If `True`, use locks for update operations.
      name: Optional name prefix for the operations created when applying
        gradients. Defaults to "AdagradDecay".

    Raises:
      ValueError: If `initial_accumulator_value`, `accumulator_decay_step`,
        or `accumulator_decay_rate` is invalid.
    """
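To illustrate the accumulator behavior described above, the following pure-Python sketch models one update of a single accumulator value. The function name and the exact order of operations (decay applied when the step is a multiple of accumulator_decay_step, then flooring at the initial value, then the standard Adagrad accumulation of the squared gradient) are assumptions for illustration and do not reproduce the optimizer's internal implementation:

def adagrad_decay_accumulator_update(accumulator,
                                     grad,
                                     global_step,
                                     initial_accumulator_value=0.1,
                                     accumulator_decay_step=3,
                                     accumulator_decay_rate=0.9):
    # Hypothetical sketch: decay fires when global_step reaches a
    # multiple of accumulator_decay_step.
    if global_step % accumulator_decay_step == 0:
        accumulator *= accumulator_decay_rate
        # The accumulator will not be less than its initial value.
        accumulator = max(accumulator, initial_accumulator_value)
    # Standard Adagrad accumulation of the squared gradient.
    accumulator += grad * grad
    return accumulator

Because the accumulator is periodically multiplied by a rate smaller than 1, it stays bounded over arbitrarily long training runs instead of growing without limit, which is what keeps the effective learning rate from vanishing. In a training script, you would construct the optimizer as, for example, opt = tf.train.AdagradDecayOptimizer(learning_rate, global_step, ...) and call opt.minimize(loss, global_step=global_step), as with a native TensorFlow optimizer.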