Platform for AI: FastNN

Last Updated:Apr 02, 2024

Fast Neural Network (FastNN) is a distributed neural network library based on the PAISoar framework. FastNN includes common neural networks, such as Inception, Residual Networks (ResNet), and Visual Geometry Group (VGG). More advanced models are planned for future releases. FastNN is integrated into the Machine Learning Designer module of Platform for AI (PAI), so you can use FastNN in the PAI console.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Prepare datasets

To make FastNN easy to try in the PAI console, the CIFAR-10, MNIST, and flowers datasets have been downloaded, converted into TFRecord files, and stored in Object Storage Service (OSS). You can access the datasets by using the Read Table or OSS Data Synchronization component of PAI. The following table describes the datasets and their OSS storage paths.

| Dataset | Number of classes in the dataset | Number of samples in the training dataset | Number of samples in the test dataset | Storage path in China (Beijing) | Storage path in China (Shanghai) |
| --- | --- | --- | --- | --- | --- |
| MNIST | 10 | 60000 | 10000 | oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/mnist/ | oss://pai-online.oss-cn-shanghai-internal.aliyuncs.com/fastnn-data/mnist/ |
| CIFAR-10 | 10 | 50000 | 10000 | oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/cifar10/ | oss://pai-online.oss-cn-shanghai-internal.aliyuncs.com/fastnn-data/cifar10/ |
| flowers | 5 | 3320 | 350 | oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/ | oss://pai-online.oss-cn-shanghai-internal.aliyuncs.com/fastnn-data/flowers/ |

FastNN can read data that is stored in TFRecord files. You can use the TFRecordDataset class to build dataset pipelines for model training, which reduces the time required for data preprocessing. FastNN does not support fine-grained data partitioning. To distribute data evenly among workers, we recommend that you apply the following rules:

  • Each TFRecord file contains an equal number of samples.

  • Each worker processes an equal number of TFRecord files.
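
The two rules above can be checked before a job is submitted. The following is a minimal sketch in plain Python. The helper name assign_files is hypothetical and is not part of FastNN, and the way FastNN itself maps files to workers may differ. The sketch only shows what "an equal number of TFRecord files per worker" means in practice.

```python
def assign_files(train_files, num_workers):
    """Split a list of TFRecord file names into equal-sized groups, one per worker.

    Raises ValueError if the files cannot be divided evenly, which is the
    situation the rules above are meant to avoid.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if len(train_files) % num_workers != 0:
        raise ValueError(
            "%d files cannot be split evenly among %d workers"
            % (len(train_files), num_workers))
    per_worker = len(train_files) // num_workers
    return [train_files[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]


files = ["0.tfrecord", "1.tfrecord", "2.tfrecord", "3.tfrecord"]
shards = assign_files(files, num_workers=2)
print(shards)
```

If the check fails, you can re-split the dataset into a number of TFRecord files that is a multiple of the worker count.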

If your dataset is stored in TFRecord files, you can download the FastNN code and build dataset pipelines based on the sample files in the datasets directory, such as cifar10.py, mnist.py, and flowers.py. In the following example, the CIFAR-10 dataset is used.

The features in the CIFAR-10 dataset are in the following format:

features={
        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
        'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
        'image/class/label': tf.FixedLenFeature(
          [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
}
  1. In the datasets directory, create a file named cifar10.py for data parsing and edit the file.

    """Provides data for the Cifar10 dataset.

    The dataset scripts used to create the dataset can be found at:
    datasets/download_and_covert_data/download_and_convert_cifar10.py
    """
    from __future__ import division
    from __future__ import print_function

    import tensorflow as tf


    # The parsing function is expected to be named 'parse_fn'.
    def parse_fn(example):
      """Parses one serialized example into an (image, label) pair."""
      with tf.device("/cpu:0"):
        features = tf.parse_single_example(
          example,
          features={
            'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
            'image/format': tf.FixedLenFeature((), tf.string, default_value='png'),
            'image/class/label': tf.FixedLenFeature(
              [], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
          }
        )
        # decode_jpeg also accepts PNG-encoded data, which is the default format here.
        image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
        label = features['image/class/label']
        return image, label
  2. In the datasets directory, open the dataset_factory.py file and register the new module in the datasets_map dictionary.

    from datasets import cifar10
    datasets_map = {
        'cifar10': cifar10,
    }
  3. When you run a training job, add dataset_name=cifar10 and train_files=cifar10_train.tfrecord in the command to use the CIFAR-10 dataset for model training.

Note

To read datasets in other formats, refer to the utils/dataset_utils.py file to build a dataset pipeline.

Prepare a hyperparameter file

FastNN supports the following types of hyperparameters:

  • Dataset hyperparameters: the basic attributes of training datasets. For example, the dataset_dir hyperparameter specifies the storage path of a training dataset.

  • Data preprocessing hyperparameters: data preprocessing functions and dataset pipeline parameters.

  • Model hyperparameters: the basic parameters for model training, including model_name and batch_size.

  • Learning rate hyperparameters: learning rate parameters and tuning parameters.

  • Optimizer hyperparameters: parameters related to the optimizer.

  • Log hyperparameters: parameters related to the output log.

  • Performance tuning hyperparameters: tuning parameters, such as mixed precision.

The following example shows the format of a hyperparameter file:

enable_paisoar=True
batch_size=128
use_fp16=True
dataset_name=flowers
dataset_dir=oss://pai-online-beijing.oss-cn-beijing-internal.aliyuncs.com/fastnn-data/flowers/
model_name=inception_resnet_v2
optimizer=sgd
num_classes=5
job_name=worker
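
To make the key=value format concrete, the following sketch parses such a file into a Python dictionary. It is an illustration only: the functions parse_hyperparameters and convert_value are hypothetical names, and FastNN's own parsing logic is not shown in this topic and may behave differently.

```python
def convert_value(value):
    """Convert a raw string to bool, int, or float when possible."""
    if value in ("True", "False"):
        return value == "True"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # Keep everything else, such as paths and names, as a string.


def parse_hyperparameters(text):
    """Parse lines of the form key=value into a dictionary."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: %r" % line)
        params[key.strip()] = convert_value(value.strip())
    return params


example = """
enable_paisoar=True
batch_size=128
use_fp16=True
dataset_name=flowers
model_name=inception_resnet_v2
optimizer=sgd
num_classes=5
"""
params = parse_hyperparameters(example)
print(params["batch_size"], params["use_fp16"], params["model_name"])
```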
  • Dataset hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | dataset_name | string | The name of the input dataset that you want to parse. Valid values: mock, cifar10, mnist, and flowers. For more information, see the dataset_factory.py file in the image_models/datasets directory. Default value: mock. |
    | dataset_dir | string | The absolute path of the input dataset. Default value: None. |
    | num_sample_per_epoch | integer | The total number of samples in the dataset. Adjust the learning rate based on the value of this parameter. |
    | num_classes | integer | The number of classes in the dataset. Default value: 100. |
    | train_files | string | The names of the files that contain all training data. Separate multiple names with commas (,). Example: 0.tfrecord,1.tfrecord. |

  • Data preprocessing hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | preprocessing_name | string | This parameter is used together with the model_name parameter to specify the name of the data preprocessing function. For information about the valid values, see the preprocessing_factory.py file in the image_models/preprocessing directory. Default value: None, which specifies that the data is not preprocessed. |
    | shuffle_buffer_size | integer | The size of the buffer pool for sample-based shuffles when a data pipeline is created. Default value: 1024. |
    | num_parallel_batches | integer | The number of batches that are processed in parallel. This value multiplied by the value of the batch_size parameter determines how many samples map_and_batch parses in parallel. Default value: 8. |
    | prefetch_buffer_size | integer | The number of data batches that are prefetched by the data pipeline. Default value: 32. |
    | num_preprocessing_threads | integer | The number of threads that are used by the data pipeline to prefetch data at the same time. Default value: 16. |
    | datasets_use_caching | bool | Specifies whether to cache compressed input data in memory. Default value: False, which specifies that caching is disabled. |

  • Model hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | task_type | string | The type of the task. Valid values: pretrain (model pre-training, which is the default value) and finetune (model fine-tuning). |
    | model_name | string | The name of the model that you want to train. Valid values include all models in the image_models/models directory. You can configure this parameter based on the models defined in the image_models/models/model_factory.py file. Default value: inception_resnet_v2. |
    | num_epochs | integer | The number of epochs, which is the number of passes over the training dataset. Default value: 100. |
    | weight_decay | float | The weight decay factor during model training. Default value: 0.00004. |
    | max_gradient_norm | float | The maximum global norm that is used for gradient clipping. Default value: None, which specifies that gradient clipping is not performed. |
    | batch_size | integer | The number of samples that a GPU processes in each iteration. Default value: 32. |
    | model_dir | string | The path of the checkpoint file that is used to reload the model. Default value: None, which specifies that model fine-tuning is not performed. |
    | ckpt_file_name | string | The name of the checkpoint file that is used to reload the model. Default value: None. |
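
The max_gradient_norm parameter refers to clipping by global norm. The following sketch shows the arithmetic on plain Python lists, assuming the usual definition: if the norm of all gradients taken together exceeds the limit, every gradient is scaled by the same factor. The function name clip_by_global_norm is used here for illustration and this is not FastNN's implementation.

```python
import math


def clip_by_global_norm(gradients, max_gradient_norm):
    """Scale all gradients so that their combined L2 norm is at most the limit.

    gradients is a list of lists of floats, one inner list per variable.
    """
    if max_gradient_norm is None:
        return gradients  # None means that clipping is disabled.
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_gradient_norm:
        return gradients
    scale = max_gradient_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients]


# Two variables whose gradients have a combined norm of 5.0.
clipped = clip_by_global_norm([[3.0], [4.0]], max_gradient_norm=1.0)
print(clipped)
```

Because all gradients share one scaling factor, the direction of the update is preserved and only its length changes.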

  • Learning rate hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | warmup_steps | integer | The number of warmup iterations, during which the learning rate rises to the specified learning rate. Default value: 0. |
    | warmup_scheme | string | The warmup scheme of the learning rate. Set the value to t2t (Tensor2Tensor). This value specifies the following scheme: The learning rate starts at 1/100 of the specified learning rate and follows an inverse exponential decay until it reaches the specified learning rate. |
    | decay_scheme | string | The decay scheme of the learning rate. Valid values: luong234 (start a four-step decay scheme after two-thirds of the total iterations are completed), luong5 (start a five-step decay scheme after half of the total iterations are completed), and luong10 (start a ten-step decay scheme after half of the total iterations are completed). In each scheme, each step reduces the learning rate by 1/2. |
    | learning_rate_decay_factor | float | The factor of learning rate decay. Default value: 0.94. |
    | learning_rate_decay_type | string | The type of learning rate decay. Valid values: fixed, exponential, and polynomial. Default value: exponential. |
    | learning_rate | float | The initial learning rate. Default value: 0.01. |
    | end_learning_rate | float | The minimum learning rate during decay. Default value: 0.0001. |
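
The warmup and decay schemes described in this table can be sketched in plain Python as follows. This follows only the descriptions above and is not FastNN's code. It assumes that the decay steps are spaced evenly over the remaining iterations and that the first halving happens one interval after decay starts. All function names are hypothetical.

```python
import math

# scheme name -> (numerator, denominator of the start fraction, number of decay steps)
DECAY_SCHEMES = {
    "luong234": (2, 3, 4),
    "luong5": (1, 2, 5),
    "luong10": (1, 2, 10),
}


def warmup_factor(step, warmup_steps):
    """t2t-style warmup: the factor starts at 1/100 and rises to 1."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 1.0
    per_step = math.exp(math.log(0.01) / warmup_steps)
    return per_step ** (warmup_steps - step)


def decay_factor(step, total_steps, decay_scheme):
    """Halve the learning rate at evenly spaced steps once decay has started."""
    numerator, denominator, decay_times = DECAY_SCHEMES[decay_scheme]
    start_step = total_steps * numerator // denominator
    if step < start_step:
        return 1.0
    steps_per_decay = max(1, (total_steps - start_step) // decay_times)
    num_decays = min(decay_times, (step - start_step) // steps_per_decay)
    return 0.5 ** num_decays


def learning_rate_at(step, base_lr, total_steps, warmup_steps=0, decay_scheme=None):
    """Combine warmup and decay into the learning rate for one step."""
    lr = base_lr * warmup_factor(step, warmup_steps)
    if decay_scheme is not None:
        lr *= decay_factor(step, total_steps, decay_scheme)
    return lr


for step in (0, 100, 800, 900, 1199):
    print(step, learning_rate_at(step, 0.01, 1200, warmup_steps=100,
                                 decay_scheme="luong234"))
```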

  • Optimizer hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | optimizer | string | The name of the optimizer. Valid values: adadelta, adagrad, adam, ftrl, momentum, sgd, rmsprop, and adamweightdecay. Default value: rmsprop. |
    | adadelta_rho | float | The decay factor of the Adadelta optimizer. Default value: 0.95. This parameter is valid only if you set the optimizer parameter to adadelta. |
    | adagrad_initial_accumulator_value | float | The initial value of the Adagrad accumulator. Default value: 0.1. This parameter is valid only if you set the optimizer parameter to adagrad. |
    | adam_beta1 | float | The exponential decay rate for the first-moment estimates. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to adam. |
    | adam_beta2 | float | The exponential decay rate for the second-moment estimates. Default value: 0.999. This parameter is valid only if you set the optimizer parameter to adam. |
    | opt_epsilon | float | The epsilon term of the optimizer. Default value: 1.0. This parameter is valid only if you set the optimizer parameter to adam. |
    | ftrl_learning_rate_power | float | The power (exponent) parameter of the learning rate. Default value: -0.5. This parameter is valid only if you set the optimizer parameter to ftrl. |
    | ftrl_initial_accumulator_value | float | The starting point of the FTRL accumulator. Default value: 0.1. This parameter is valid only if you set the optimizer parameter to ftrl. |
    | ftrl_l1 | float | The L1 regularization term of the FTRL optimizer. Default value: 0.0. This parameter is valid only if you set the optimizer parameter to ftrl. |
    | ftrl_l2 | float | The L2 regularization term of the FTRL optimizer. Default value: 0.0. This parameter is valid only if you set the optimizer parameter to ftrl. |
    | momentum | float | The momentum parameter of the Momentum optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to momentum. |
    | rmsprop_momentum | float | The momentum parameter of the RMSProp optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to rmsprop. |
    | rmsprop_decay | float | The decay factor of the RMSProp optimizer. Default value: 0.9. This parameter is valid only if you set the optimizer parameter to rmsprop. |
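
Because most optimizer hyperparameters take effect only for one optimizer, it can help to check a hyperparameter set for entries that have no effect. The following sketch encodes the "valid only if" statements from the table above. The mapping and the function ignored_optimizer_params are hypothetical helpers for illustration and are not part of FastNN.

```python
# Optimizer name -> hyperparameters that, per the table above, apply only to it.
OPTIMIZER_SPECIFIC_PARAMS = {
    "adadelta": {"adadelta_rho"},
    "adagrad": {"adagrad_initial_accumulator_value"},
    "adam": {"adam_beta1", "adam_beta2", "opt_epsilon"},
    "ftrl": {"ftrl_learning_rate_power", "ftrl_initial_accumulator_value",
             "ftrl_l1", "ftrl_l2"},
    "momentum": {"momentum"},
    "rmsprop": {"rmsprop_momentum", "rmsprop_decay"},
}


def ignored_optimizer_params(params):
    """Return the optimizer-specific parameters that the chosen optimizer ignores."""
    optimizer = params.get("optimizer", "rmsprop")  # rmsprop is the documented default.
    active = OPTIMIZER_SPECIFIC_PARAMS.get(optimizer, set())
    all_specific = set().union(*OPTIMIZER_SPECIFIC_PARAMS.values())
    return sorted(name for name in params
                  if name in all_specific and name not in active)


print(ignored_optimizer_params({"optimizer": "sgd", "momentum": 0.9, "batch_size": 32}))
```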

  • Log hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | stop_at_step | integer | The total number of training steps (iterations). Default value: 100. |
    | log_loss_every_n_iters | integer | The iterative frequency at which the loss information is printed. Default value: 10. |
    | profile_every_n_iters | integer | The iterative frequency at which the timeline is printed. Default value: 0. |
    | profile_at_task | integer | The index of the machine that generates the timeline. Default value: 0, which corresponds to the index of the chief worker. |
    | log_device_placement | bool | Specifies whether to print the device placement information. Default value: False. |
    | print_model_statistics | bool | Specifies whether to print the trainable variable information. Default value: False. |
    | hooks | string | The training hooks. Default value: StopAtStepHook,ProfilerHook,LoggingTensorHook,CheckpointSaverHook. |

  • Performance tuning hyperparameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | use_fp16 | bool | Specifies whether to perform half-precision (FP16) training. Default value: True. |
    | loss_scale | float | The scaling factor of the loss function during training. Default value: 1.0. |
    | enable_paisoar | bool | Specifies whether to use the PAISoar framework. Default value: True. |
    | protocol | string | The communication protocol that the cluster uses. Default value: grpc.rdma, which specifies that the cluster uses gRPC Remote Procedure Calls (gRPC) to improve data access efficiency. |
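
The use_fp16 and loss_scale parameters are related: half-precision numbers cannot represent very small gradients, and scaling the loss shifts them into the representable range. The following sketch demonstrates the effect with the Python standard library only. It assumes that loss_scale works like loss scaling in common mixed-precision training, where the loss is multiplied by the factor and the gradients are divided by it afterwards. The helper to_fp16 is hypothetical, and the sketch does not reproduce FastNN's implementation.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


gradient = 1e-8       # A small gradient value, kept in full precision here.
loss_scale = 1024.0   # Hypothetical scaling factor for illustration.

# Without scaling, the gradient underflows to zero in half precision.
unscaled = to_fp16(gradient)

# With scaling, the value survives the half-precision round trip and can be
# divided by the same factor afterwards in full precision.
scaled = to_fp16(gradient * loss_scale)
recovered = scaled / loss_scale

print(unscaled, recovered)
```

With the default loss_scale of 1.0, no scaling is applied, so very small gradients can be lost when use_fp16 is enabled.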

Develop a main file

If the existing FastNN models cannot meet your requirements, you can use the dataset, model, and preprocessing APIs for further development. Before development, make sure that you are familiar with the basic logic of a FastNN model. If you download FastNN code, you can view the basic logic of an image classification model in the train_image_classifiers.py entry file. Sample code:

# Excerpt: names that are not defined here, such as train_image_size, data_sources,
# optimizer, and global_step, are expected to be defined elsewhere in the entry file.
# Initialize the model by using the model_name parameter to create the network_fn function. The input parameter train_image_size may be returned.
network_fn = nets_factory.get_network_fn(
    FLAGS.model_name,
    num_classes=FLAGS.num_classes,
    weight_decay=FLAGS.weight_decay,
    is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Initialize the preprocessing_fn function by using the model_name or preprocessing_name parameter.
preprocessing_fn = preprocessing_factory.get_preprocessing(
    FLAGS.model_name or FLAGS.preprocessing_name,
    is_training=(FLAGS.task_type in ['pretrain', 'finetune']))
# Select the valid TFRecord format based on the dataset_name parameter and synchronously call the preprocessing_fn function to parse the dataset and obtain the dataset_iterator object.
# Any remaining arguments of this call are omitted in this excerpt.
dataset_iterator = dataset_factory.get_dataset_iterator(FLAGS.dataset_name,
                                                        train_image_size,
                                                        preprocessing_fn,
                                                        data_sources)
# Call the network_fn and dataset_iterator.get_next functions to define the loss_fn function that is used to calculate the loss.
def loss_fn():
  # Read the input batch on the CPU, then build the network on the default device.
  with tf.device('/cpu:0'):
    images, labels = dataset_iterator.get_next()
  logits, end_points = network_fn(images)
  loss = tf.losses.sparse_softmax_cross_entropy(
      labels=labels, logits=tf.cast(logits, tf.float32), weights=1.0)
  # Models with an auxiliary classifier add its loss with a weight of 0.4.
  if 'AuxLogits' in end_points:
    loss += tf.losses.sparse_softmax_cross_entropy(
        labels=labels, logits=tf.cast(end_points['AuxLogits'], tf.float32), weights=0.4)
  return loss
# Call the PAISoar API to encapsulate the native TensorFlow optimizer and the loss_fn function.
opt = paisoar.ReplicatedVarsOptimizer(optimizer, clip_norm=FLAGS.max_gradient_norm)
loss = opt.compute_loss(loss_fn, loss_scale=FLAGS.loss_scale)
# Define the training operation based on the opt and loss objects.
train_op = opt.minimize(loss, global_step=global_step)
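
The loss that loss_fn builds is the cross-entropy of the main logits plus 0.4 times the cross-entropy of the auxiliary logits, if the model provides them. The following plain Python sketch reproduces that arithmetic for a single sample so that the weighting is easy to verify by hand. It does not use TensorFlow, it does not average over a batch, and the function names are chosen for illustration.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, given raw logits and an integer class label."""
    max_logit = max(logits)  # Subtract the maximum for numerical stability.
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def total_loss(logits, label, aux_logits=None, aux_weight=0.4):
    """Main loss plus the weighted auxiliary loss, mirroring the structure of loss_fn."""
    loss = sparse_softmax_cross_entropy(logits, label)
    if aux_logits is not None:
        loss += aux_weight * sparse_softmax_cross_entropy(aux_logits, label)
    return loss


# Five classes with uniform logits: each cross-entropy term equals ln(5).
main_only = total_loss([0.0] * 5, label=2)
with_aux = total_loss([0.0] * 5, label=2, aux_logits=[0.0] * 5)
print(main_only, with_aux)
```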