This topic provides answers to frequently asked questions about TensorFlow.

How do I enable machine learning?

Machine Learning Platform for AI (PAI) provides TensorFlow, Caffe, and MXNet components for machine learning. To enable machine learning, you must be granted permissions for graphics processing unit (GPU) resources and Object Storage Service (OSS) access. For more information, see Authorization.

How do I reference multiple Python files?

You can use multiple Python files to organize your training scripts. For example, store the data preprocessing logic in one Python file, define the model in another Python file, and use a third Python file as the entry that runs the entire training process. If the entry file calls a function that is defined in another file, you need only to compress all of the files into a .tar.gz package and upload the package. The following figure shows the process (figure: Multi-script reference), where:
  • Python Code Files: the .tar.gz package.
  • Primary Python File: the program entry file.
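As a rough sketch of the packaging step, the following snippet builds a .tar.gz package from two scripts by using the tarfile module of the Python standard library. The file names preprocess.py and train.py, the function names, and the package name project.tar.gz are hypothetical placeholders, and the flat archive layout is an assumption rather than a documented PAI requirement.

```python
import os
import tarfile
import tempfile


def build_package(script_paths, package_path):
    """Compress the given scripts into a .tar.gz package with flat member names."""
    with tarfile.open(package_path, "w:gz") as tar:
        for path in script_paths:
            # Store each file at the top level of the archive so that the
            # entry file can import its sibling module by name (assumed layout).
            tar.add(path, arcname=os.path.basename(path))
    return package_path


# Hypothetical two-file project: train.py (entry) imports a helper from preprocess.py.
workdir = tempfile.mkdtemp()
sources = {
    "preprocess.py": "def normalize(x):\n    return x / 255.0\n",
    "train.py": "from preprocess import normalize\nprint(normalize(255))\n",
}
script_paths = []
for name, body in sources.items():
    path = os.path.join(workdir, name)
    with open(path, "w") as f:
        f.write(body)
    script_paths.append(path)

package = build_package(script_paths, os.path.join(workdir, "project.tar.gz"))
with tarfile.open(package, "r:gz") as tar:
    member_names = sorted(tar.getnames())
print(member_names)
```

In this sketch, project.tar.gz would be specified as Python Code Files and train.py as Primary Python File.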

How do I upload data to OSS?

Before you upload data to OSS, you must create an OSS bucket to store the data of machine learning algorithms. We recommend that you create the OSS bucket in the same region as the GPU cluster that you use for machine learning. This way, data is transmitted over the classic network of Alibaba Cloud, and the traffic generated by algorithms is free of charge. After you create an OSS bucket, you can create folders, organize data directories, and upload data in the OSS console.

To upload data to OSS, you can call API operations or use SDKs. For more information, see Simple upload. In addition, OSS provides a variety of tools to facilitate data uploading and downloading. For more information, see OSS tools. We recommend that you use ossutil or osscmd to upload and download files.
Note: When you use these tools to upload files, you must configure an AccessKey ID and an AccessKey secret. You can log on to the Alibaba Cloud Management Console to create or view your AccessKey ID and AccessKey secret.

How do I read data from OSS?

Native Python file I/O cannot be used to read data from OSS. Code that calls functions such as open() and os.path.exists() to perform operations on files and folders cannot be executed. For example, code that invokes scipy.misc.imread() or numpy.load() cannot be executed.

You can read data in PAI by using the following methods:
  • Use functions of the tf.gfile module to read images or text. The following sample functions are supported:
    tf.gfile.Copy(oldpath, newpath, overwrite=False) # Copies a file.
    tf.gfile.DeleteRecursively(dirname) # Recursively deletes all files in a directory.
    tf.gfile.Exists(filename) # Checks whether a file exists.
    tf.gfile.FastGFile(name, mode='r') # Reads a file without thread locking.
    tf.gfile.GFile(name, mode='r') # Reads a file.
    tf.gfile.Glob(filename) # Queries all files in a directory. You can filter these files by pattern.
    tf.gfile.IsDirectory(dirname) # Checks whether an item is a directory.
    tf.gfile.ListDirectory(dirname) # Queries all files in a directory.
    tf.gfile.MakeDirs(dirname) # Creates a folder in a directory. If no parent directories exist, a parent directory is automatically created. If the folder you want to create already exists and is writable, a success response is returned.
    tf.gfile.MkDir(dirname) # Creates a folder in a directory.
    tf.gfile.Remove(filename) # Deletes a file.
    tf.gfile.Rename(oldname, newname, overwrite=False) # Renames a file.
    tf.gfile.Stat(dirname) # Queries statistical data about a directory.
    tf.gfile.Walk(top, in_order=True) # Queries the file tree of a directory.
  • Use the tf.gfile.Glob(), tf.gfile.FastGFile(), tf.WholeFileReader(), and tf.train.shuffle_batch() functions to batch read files. Before you batch read files, you must obtain the list of the files and create a batch.
When you create a machine learning experiment on the Machine Learning Studio page, you must specify parameters such as the code file that you want to read and its directory on the right side of the page. The parameters are passed in to your code in the -XXX form, where XXX represents a string, and can be received by using functions of the tf.flags module.
import os
import tensorflow as tf
FLAGS = tf.flags.FLAGS
tf.flags.DEFINE_string('buckets', 'oss://{OSS Bucket}/', 'Folder of the training image file')
tf.flags.DEFINE_integer('batch_size', 15, 'Size of the batch') # Defined as an integer because batch_size is used in arithmetic later.
files = tf.gfile.Glob(os.path.join(FLAGS.buckets, '*.jpg')) # Queries the paths of all JPG files in buckets.
We recommend that you use the following methods to batch read files based on the number of files:
  • Use the tf.gfile.FastGFile() function to batch read a small number of files.
    for path in files:
        file_content = tf.gfile.FastGFile(path, 'rb').read() # Remember to specify rb when you call this function. Otherwise, errors may occur.
        image = tf.image.decode_jpeg(file_content, channels=3) # In this example, JPG images are used.
  • Use the tf.WholeFileReader() function to batch read a large number of files.
    reader = tf.WholeFileReader()  # Instantiates a reader.
    fileQueue = tf.train.string_input_producer(files)  # Creates a queue for the reader to read.
    file_name, file_content = reader.read(fileQueue)  # Uses the reader to read a file from the queue.
    image_content = tf.image.decode_jpeg(file_content, channels=3)  # Decodes the file content into images.
    label = XXX  # In this example, label processing operations are omitted.
    batch = tf.train.shuffle_batch([label, image_content], batch_size=FLAGS.batch_size, num_threads=4,
                                   capacity=1000 + 3 * FLAGS.batch_size, min_after_dequeue=1000)
    sess = tf.Session()  # Creates a session.
    tf.train.start_queue_runners(sess=sess)  # Starts the queue runners. If this call is omitted, the reading threads remain blocked.
    labels, images = sess.run(batch)  # Obtains the results.
    The following description explains the code:
    • tf.train.string_input_producer: converts files into a queue. You must use tf.train.start_queue_runners to start the queue.
    • tf.train.shuffle_batch includes the following parameters:
      • batch_size: the amount of data to return each time after you run a batch task.
      • num_threads: the number of threads to run. The value is usually set to 4.
      • capacity: the number of pieces of data from which you want to randomly extract data. For example, assume that a dataset has 10,000 pieces of data. If you want to randomly extract data from 5,000 pieces of data to train, set capacity to 5000.
      • min_after_dequeue: the minimum number of elements that remain in the queue after a dequeue operation. The value must be less than the value of capacity.
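To make the interplay between batch_size, capacity, and min_after_dequeue more concrete, the following pure-Python sketch models a shuffle buffer in a single thread. It is a simplified illustration and not the TensorFlow implementation: the function name shuffle_batches, the seed parameter, and the burst-refill policy are assumptions made for this example.

```python
import random


def shuffle_batches(items, batch_size, capacity, min_after_dequeue, seed=0):
    """Yield shuffled batches from a bounded buffer (simplified, single-threaded model)."""
    if min_after_dequeue >= capacity:
        # In this model the buffer could never release an element otherwise.
        raise ValueError("min_after_dequeue must be smaller than capacity")
    rng = random.Random(seed)
    source = iter(items)
    buffer, batch = [], []
    exhausted = False
    while True:
        # Refill up to `capacity` whenever the buffer is at or below the minimum,
        # so that a dequeue always leaves at least `min_after_dequeue` elements
        # while input remains.
        if not exhausted and len(buffer) <= min_after_dequeue:
            while len(buffer) < capacity:
                try:
                    buffer.append(next(source))
                except StopIteration:
                    exhausted = True
                    break
        if not buffer:
            break
        # Dequeue one element chosen uniformly at random from the buffer.
        index = rng.randrange(len(buffer))
        buffer[index], buffer[-1] = buffer[-1], buffer[index]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A final batch smaller than batch_size is dropped in this sketch.


batches = list(shuffle_batches(range(10), batch_size=3, capacity=6, min_after_dequeue=3))
print(batches)
```

A larger min_after_dequeue means that each element is drawn from a larger pool, which mixes the data more thoroughly at the cost of more memory and a longer warm-up before the first batch is available.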

How do I write data to OSS?

You can write data to OSS by using one of the following methods. In both examples, the generated file is stored in /model/example.txt.
  • Use the tf.gfile.FastGFile() function to write a file. The following code shows the sample function.
    tf.gfile.FastGFile(FLAGS.checkpointDir + 'example.txt', 'wb').write('hello world')
  • Use the tf.gfile.Copy() function to copy a file. The following code shows the sample function.
    tf.gfile.Copy('./example.txt', FLAGS.checkpointDir + 'example.txt')
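The samples above build the target path by concatenating FLAGS.checkpointDir and the file name, which appears to assume that the directory value ends with a slash. As a minimal sketch, the target path can instead be built with posixpath.join from the Python standard library, which produces the same result with or without the trailing slash. The helper name oss_path and the bucket name example-bucket are hypothetical.

```python
import posixpath


def oss_path(directory, filename):
    """Join an OSS directory URI and a file name with exactly one slash between them."""
    return posixpath.join(directory, filename)


# Hypothetical checkpoint directory values, with and without a trailing slash.
with_slash = oss_path("oss://example-bucket/model/", "example.txt")
without_slash = oss_path("oss://example-bucket/model", "example.txt")
print(with_slash)
print(without_slash)
```

The file name must not start with a slash. Otherwise, posixpath.join discards the directory part.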

Why does an OOM error occur?

The out-of-memory (OOM) error occurs because your memory usage reaches the maximum of 30 GB. We recommend that you use gfile functions to read data from OSS. For more information, see the "How do I read data from OSS?" section of this topic.

What use cases of TensorFlow are available?

What is the role of model_average_iter_interval when two GPUs are configured?

If model_average_iter_interval is not set, the parallel stochastic gradient descent (SGD) algorithm is used on the GPUs, and the gradient is updated in each iteration. If model_average_iter_interval is greater than 1, the model averaging method is used: each GPU trains its model for the specified number of iterations, and then the parameters of the two models are averaged. The model_average_iter_interval parameter specifies the number of training iterations between two averaging operations.
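The model averaging idea can be illustrated with a toy example in pure Python. This is a sketch of the general technique under simplifying assumptions (a one-dimensional least-squares problem, two replicas that stand in for two GPUs), not the implementation that PAI uses. The function name train_with_model_averaging, the interval argument, and the data shards are hypothetical.

```python
def train_with_model_averaging(shards, interval, steps, lr=0.1):
    """Run per-replica SGD on (w - sample)^2 and average the replicas every `interval` steps."""
    weights = [0.0 for _ in shards]  # One model parameter per replica (GPU).
    for step in range(1, steps + 1):
        for r, shard in enumerate(shards):
            sample = shard[(step - 1) % len(shard)]
            grad = 2.0 * (weights[r] - sample)  # Gradient of (w - sample)^2 with respect to w.
            weights[r] -= lr * grad  # Independent local update on each replica.
        if step % interval == 0:
            # Synchronization point: replace every replica with the mean of all replicas.
            mean = sum(weights) / len(weights)
            weights = [mean] * len(weights)
    return weights


# Two replicas with different data. Between synchronization points the replicas
# drift toward their own targets (1.0 and 3.0), and averaging pulls them together.
final_weights = train_with_model_averaging([[1.0, 1.0], [3.0, 3.0]], interval=5, steps=100)
print(final_weights)
```

A larger interval reduces how often the replicas communicate, but lets them drift further apart between averaging steps.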