Machine Learning Platform for AI: FAQ about TensorFlow

Last Updated: Dec 13, 2022

This topic provides answers to frequently asked questions about TensorFlow.

How do I enable deep learning?

Machine Learning Platform for AI (PAI) provides TensorFlow, Caffe, and MXNet components for deep learning. To enable deep learning, you must be granted permissions to use GPU resources and access Object Storage Service (OSS). For more information about how to grant the permissions on GPU resources, see Manage workspaces. For more information about how to grant the permissions for OSS access, see Grant the permissions that are required to use Machine Learning Designer.

How do I reference multiple Python files?

You can use multiple Python files to organize your training scripts. First, store the data preprocessing logic in one Python file. Then, define the model in another Python file. Finally, use one Python file as the entry that runs the entire training process. For example, suppose that one file defines a function, and a second file, which serves as the program entry, calls that function. In this case, you only need to add both files to a tar.gz package and upload the package. The following figure shows the upload parameters. [Figure: Multi-script reference]

  • Python Code Files: the tar.gz package.

  • Primary Python File: the program entry file.
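As a minimal sketch of the packaging step, the following Python snippet uses the standard tarfile module to bundle two scripts into a tar.gz package. The file names model_def.py and train_main.py, the package name scripts.tar.gz, and the build_package helper are hypothetical and are used for illustration only. The sketch also assumes that the package is unpacked into a single directory before the entry file runs.

```python
import os
import tarfile
import tempfile


def build_package(script_paths, package_path):
    """Bundle the given Python scripts into a tar.gz package.

    Each script is stored at the top level of the archive. Assumption:
    the package is unpacked into one directory, so the entry file can
    import the other files by module name.
    """
    with tarfile.open(package_path, "w:gz") as tar:
        for path in script_paths:
            tar.add(path, arcname=os.path.basename(path))
    return package_path


# Hypothetical example: one helper file and one entry file.
work_dir = tempfile.mkdtemp()
helper = os.path.join(work_dir, "model_def.py")
entry = os.path.join(work_dir, "train_main.py")
with open(helper, "w") as f:
    f.write("def hello():\n    return 'hello from model_def'\n")
with open(entry, "w") as f:
    f.write("import model_def\nprint(model_def.hello())\n")

package = build_package([helper, entry], os.path.join(work_dir, "scripts.tar.gz"))

# List the archive contents to confirm that both scripts were packaged.
with tarfile.open(package, "r:gz") as tar:
    packaged_names = sorted(tar.getnames())
print(packaged_names)
```

In this sketch, scripts.tar.gz would be specified for Python Code Files and train_main.py for Primary Python File.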

How do I upload data to OSS?

Before you upload data to OSS, you must create an OSS bucket to store the data of deep learning algorithms. We recommend that you create the OSS bucket in the same region as the GPU cluster that you use for deep learning. This way, you can transmit data in the classic network of Alibaba Cloud, where traffic generated by algorithms is free. After you create an OSS bucket, you can create folders, organize data directories, and upload data in the OSS console.

To upload data to OSS, you can call API operations or use SDKs. For more information, see Simple upload. In addition, OSS provides a large number of tools to facilitate data uploads and downloads. For more information, see OSS tools. We recommend that you use ossutil or osscmd to upload and download files.


When you use these tools to upload files, you must configure an AccessKey ID and an AccessKey secret. You can log on to the Alibaba Cloud Management console to create or view your AccessKey ID and AccessKey secret.
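If you prefer the SDK route, the following sketch shows one possible way to upload a local folder. It assumes that the OSS SDK for Python (the oss2 package) is installed. The helper names upload_directory and make_bucket, the environment variable names, and the folder names are hypothetical. The upload logic only relies on the put_object_from_file(key, filename) method of oss2.Bucket.

```python
import os


def upload_directory(bucket, local_dir, key_prefix):
    """Upload every file under local_dir to OSS and return the object keys.

    `bucket` is expected to behave like an oss2.Bucket, that is, to
    provide put_object_from_file(key, filename).
    """
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir)
            # OSS object keys use forward slashes as separators.
            key = key_prefix.rstrip("/") + "/" + relative.replace(os.sep, "/")
            bucket.put_object_from_file(key, local_path)
            uploaded.append(key)
    return uploaded


def make_bucket():
    """Create an oss2.Bucket from environment variables (hypothetical names)."""
    import oss2  # Assumption: installed with `pip install oss2`.

    auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"],
                     os.environ["OSS_ACCESS_KEY_SECRET"])
    return oss2.Bucket(auth, os.environ["OSS_ENDPOINT"], os.environ["OSS_BUCKET"])


# Usage (not executed here because it requires real credentials):
# keys = upload_directory(make_bucket(), "./train_images", "train_images")
# print("Uploaded %d objects" % len(keys))
```

Reading the AccessKey pair from environment variables keeps the credentials out of the script itself.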

How do I read data from OSS?

Standard Python file I/O cannot be used to read data from OSS. Code that calls functions such as the Python open() function and os.path.exists() to perform operations on files and folders in OSS cannot be executed. For example, code that calls scipy.misc.imread() or numpy.load() on an OSS path cannot be executed.

You can read data in PAI by using the following methods:

  • Use functions of the tf.gfile module to read images or text. The following sample functions are supported:

    tf.gfile.Copy(oldpath, newpath, overwrite=False) # Copies a file.
    tf.gfile.DeleteRecursively(dirname) # Recursively deletes a directory and all of its contents.
    tf.gfile.Exists(filename) # Checks whether a file exists.
    tf.gfile.FastGFile(name, mode='r') # Reads a file without thread locking.
    tf.gfile.GFile(name, mode='r') # Reads a file.
    tf.gfile.Glob(filename) # Returns the files that match the specified pattern.
    tf.gfile.IsDirectory(dirname) # Checks whether a path is a directory.
    tf.gfile.ListDirectory(dirname) # Lists the entries in a directory.
    tf.gfile.MakeDirs(dirname) # Creates a directory and any missing parent directories. If the directory already exists and is writable, the call succeeds.
    tf.gfile.MkDir(dirname) # Creates a directory. The parent directory must already exist.
    tf.gfile.Remove(filename) # Deletes a file.
    tf.gfile.Rename(oldname, newname, overwrite=False) # Renames a file.
    tf.gfile.Stat(dirname) # Returns file statistics for a path.
    tf.gfile.Walk(top, in_order=True) # Recursively traverses a directory tree.
  • Use the tf.gfile.Glob(), tf.gfile.FastGFile(), tf.WholeFileReader(), and tf.train.shuffle_batch() functions to batch read files. Before you batch read files, you must obtain the list of the files and create a batch.

When you create a deep learning experiment on the Machine Learning Studio page, you must specify parameters such as the code file that you want to read and its directory on the right side of the page. These parameters are passed to your script as command-line arguments in the -XXX form, where XXX represents a string, and you can read them by using functions of the tf.flags module.

import os
import tensorflow as tf
FLAGS = tf.flags.FLAGS
tf.flags.DEFINE_string('buckets', 'oss://{OSS Bucket}/', 'Folder of the training image file')
tf.flags.DEFINE_integer('batch_size', 15, 'Size of the batch')  # Defined as an integer because it is used in arithmetic.
files = tf.gfile.Glob(os.path.join(FLAGS.buckets,'*.jpg')) # Queries the paths of all JPG files in buckets.

We recommend that you use the following methods to batch read files based on the number of files:

  • Use the tf.gfile.FastGFile() function to batch read a small number of files.

    for path in files:
        file_content = tf.gfile.FastGFile(path, 'rb').read() # Remember to specify rb when you call this function. Otherwise, errors may occur. 
        image = tf.image.decode_jpeg(file_content, channels=3) # In this example, JPG images are used.
  • Use the tf.WholeFileReader() function to batch read a large number of files.

    reader = tf.WholeFileReader()  # Instantiates a reader.
    fileQueue = tf.train.string_input_producer(files)  # Creates a queue of file names for the reader to read.
    file_name, file_content = reader.read(fileQueue)  # Uses the reader to read a file from the queue.
    image_content = tf.image.decode_jpeg(file_content, channels=3)  # Decodes the file content into an image.
    label = XXX  # In this example, label processing operations are omitted.
    batch = tf.train.shuffle_batch([label, image_content], batch_size=FLAGS.batch_size, num_threads=4,
                                   capacity=1000 + 3 * FLAGS.batch_size, min_after_dequeue=1000)
    sess = tf.Session()  # Creates a session.
    tf.train.start_queue_runners(sess=sess)  # Starts the queue runners. If this call is omitted, the reading threads remain blocked.
    labels, images = sess.run(batch)  # Obtains the result.

    The following description explains the code:

    • tf.train.string_input_producer: converts the list of files to a queue. You must call tf.train.start_queue_runners to start the queue.

    • tf.train.shuffle_batch includes the following parameters:

      • batch_size: the number of pieces of data returned each time the batch is evaluated.

      • num_threads: the number of threads to run. The value is usually set to 4.

      • capacity: the maximum number of elements that the queue can hold, which is the pool from which data is randomly extracted. For example, assume that a dataset has 10,000 pieces of data. If you want to randomly extract data from a pool of 5,000 pieces of data for training, set capacity to 5000.

      • min_after_dequeue: the minimum number of elements that remain in the queue after a dequeue operation, which ensures that the data is sufficiently mixed. The value must be less than the value of capacity.
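To make the interaction between batch_size, capacity, and min_after_dequeue more concrete, the following pure-Python sketch simulates a shuffling buffer. It is a simplified, single-threaded model written for illustration and is not the TensorFlow implementation. The shuffle_batches function and its behavior at the end of the input (leftover elements that cannot fill a batch are dropped) are assumptions of this sketch.

```python
import random


def shuffle_batches(items, batch_size, capacity, min_after_dequeue, seed=0):
    """Yield shuffled batches from items by using a bounded buffer.

    Simplified model of a shuffling queue:
    - the buffer never holds more than `capacity` elements;
    - while input remains, a batch is dequeued only if at least
      `min_after_dequeue` elements stay in the buffer afterwards;
    - after the input is exhausted, the remaining elements are drained.
    """
    if capacity <= min_after_dequeue:
        raise ValueError("capacity must be greater than min_after_dequeue")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    exhausted = False
    while True:
        # Refill the buffer up to its capacity.
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        needed = batch_size if exhausted else batch_size + min_after_dequeue
        if len(buffer) < needed:
            if exhausted:
                return  # Not enough elements left for a full batch.
            raise ValueError("capacity is too small for batch_size + min_after_dequeue")
        batch = []
        for _ in range(batch_size):
            # Remove a uniformly random element from the buffer.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            batch.append(buffer.pop())
        yield batch


batches = list(shuffle_batches(range(100), batch_size=10,
                               capacity=40, min_after_dequeue=20))
print(len(batches), batches[0])
```

In this run, the first batch can only contain values from the first 40 items because the buffer holds at most capacity elements at a time. A larger capacity therefore mixes the data more thoroughly but uses more memory.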

How do I write data to OSS?

You can write data to OSS by using one of the following methods. In both examples, the generated file is stored at /model/example.txt.

  • Use the tf.gfile.FastGFile() function to write a file. The following code shows a sample function:

    tf.gfile.FastGFile(FLAGS.checkpointDir + 'example.txt', 'wb').write('hello world')
  • Use the tf.gfile.Copy() function to copy a file. The following code shows a sample function:

    tf.gfile.Copy('./example.txt', FLAGS.checkpointDir + 'example.txt')

Why does an OOM error occur?

The out-of-memory (OOM) error occurs because your memory usage reaches the maximum of 30 GB. We recommend that you use gfile functions to read data from OSS. For more information, see the "How do I read data from OSS" section of this topic.

What use cases of TensorFlow are available?

What is the role of model_average_iter_interval when two GPUs are configured?

If the model_average_iter_interval parameter is not set, the parallel Stochastic Gradient Descent (SGD) algorithm is used on the GPUs, and the gradient is updated in each iteration. If the model_average_iter_interval parameter is set to a value greater than 1, the model averaging method is used: each GPU trains for the specified number of iterations, and then the model parameters of the two GPUs are averaged. The model_average_iter_interval parameter specifies the number of training iterations between two averaging operations.
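The general idea of model averaging can be illustrated with a toy simulation in plain Python. The sketch below is not the PAI implementation. It assumes a one-parameter model with a squared-error loss, simulates two GPUs as two model replicas that train on separate data shards, and averages the replicas every average_interval iterations. All function names and values are hypothetical.

```python
def sgd_step(weight, samples, learning_rate):
    """Take one gradient step for the loss mean((weight - x)^2) over samples."""
    gradient = sum(2.0 * (weight - x) for x in samples) / len(samples)
    return weight - learning_rate * gradient


def train_with_model_averaging(shards, iterations, average_interval,
                               learning_rate=0.1):
    """Train one replica per shard and average the replicas periodically."""
    weights = [0.0 for _ in shards]  # One model replica per simulated GPU.
    for step in range(1, iterations + 1):
        # Each replica takes an independent step on its own data shard.
        weights = [sgd_step(w, shard, learning_rate)
                   for w, shard in zip(weights, shards)]
        if step % average_interval == 0:
            # Replace every replica with the average of all replicas.
            mean = sum(weights) / len(weights)
            weights = [mean for _ in weights]
    return weights


# Two simulated GPUs with different data shards; average every 5 iterations.
shards = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
final_weights = train_with_model_averaging(shards, iterations=100,
                                           average_interval=5)
print(final_weights)
```

Between two averaging steps the replicas drift toward the mean of their own shard. Each averaging step pulls them back to a common value, so a larger interval reduces synchronization at the cost of more drift between replicas.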