FAQ of deep learning

Last Updated: Feb 01, 2019


If the information on this page does not resolve your issue, see the Machine learning documentation or submit a ticket that includes your Logview.

How to enable deep learning?

The deep learning feature is currently in beta testing. Three deep learning frameworks are supported: TensorFlow, Caffe, and MXNet. To enable deep learning, log on to your machine learning platform console and enable GPU resources, as shown in the following figure.

After GPU resources have been enabled, the corresponding projects are allowed to access the public resource pool and dynamically use the underlying GPU resources. You also need to grant OSS permissions to the machine learning platform as follows.

How to reference multiple Python scripts?

You can use Python modules to organize your training scripts. You can write different pieces of your model in multiple Python files. For example, you can write data preprocessing code in one Python file and use another Python file as the entry file for your algorithm.

In this layout, the entry file for the algorithm calls functions that are defined in the other Python file. You can compress both files into a TAR.GZ file and upload this compressed file as follows:

  • Python Code File: Select the compressed TAR.GZ file.
  • Main Python File: Select the entry file for the algorithm.
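
The packaging step can be sketched with Python's standard tarfile module. This is only an illustration: the file names preprocess.py and train.py and the archive name scripts.tar.gz are hypothetical, and placing the files at the root of the archive is an assumption about the layout, not a documented requirement.

```python
import os
import tarfile
import tempfile


def package_scripts(script_paths, archive_path):
    """Pack the given Python files into a flat .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in script_paths:
            # arcname drops the local directory, so each file sits at the
            # archive root (an assumption about the expected layout).
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demonstration with hypothetical placeholder scripts in a temporary folder.
workdir = tempfile.mkdtemp()
paths = []
for name in ("preprocess.py", "train.py"):
    path = os.path.join(workdir, name)
    with open(path, "w") as handle:
        handle.write("# placeholder script\n")
    paths.append(path)

archive = package_scripts(paths, os.path.join(workdir, "scripts.tar.gz"))
print(list_archive(archive))
```

Listing the archive before uploading is a quick way to confirm that the entry file you plan to select as Main Python File is actually inside it.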

How to upload data to OSS?

When using deep learning to process your data, you must upload your data to OSS buckets. First, create OSS buckets. The deep learning GPU clusters are only available in the China (Shanghai) region. Therefore, you must choose this region when creating OSS buckets. Your data is then transmitted over the Alibaba Cloud classic network. No traffic fees are incurred when you run your algorithms. After OSS buckets have been created, you can log on to the OSS console to create folders and upload your data.

We recommend that you use the ossutil or osscmd command-line tool to upload or download files. Both tools support resumable transfers, so an interrupted upload or download can continue from where it stopped.

Note: You must specify your Access Key ID to use these tools. You can log on to the Access Key console to create or view Access Keys.

How to read OSS data?

Standard Python file operations cannot access OSS data. Code that calls open() or os.path.exists() cannot read OSS data, and scipy.misc.imread() and numpy.load() are not supported either.

You can read OSS data in the machine learning platform by using these approaches:

  • You can use tf.gfile functions to perform simple file operations.

    tf.gfile.Copy(oldpath, newpath, overwrite=False)  # Copy a file
    tf.gfile.DeleteRecursively(dirname)  # Recursively delete everything under the specified directory
    tf.gfile.Exists(filename)  # Check whether the specified file exists
    tf.gfile.FastGFile(name, mode='r')  # Open a file without thread locking
    tf.gfile.GFile(name, mode='r')  # Open a file for reading or writing
    tf.gfile.Glob(filename)  # List all files that match the specified pattern
    tf.gfile.IsDirectory(dirname)  # Check whether the specified dirname is a directory
    tf.gfile.ListDirectory(dirname)  # List all entries under the specified dirname
    tf.gfile.MakeDirs(dirname)  # Create a directory, along with any missing parent directories. Succeeds if the directory already exists and is writable.
    tf.gfile.MkDir(dirname)  # Create a directory named dirname. The parent directory must already exist.
    tf.gfile.Remove(filename)  # Delete the specified file
    tf.gfile.Rename(oldname, newname, overwrite=False)  # Rename a file or directory
    tf.gfile.Stat(dirname)  # Return file statistics for the specified path
    tf.gfile.Walk(top, in_order=True)  # Recursively walk the directory tree rooted at top

    For more information, see the tf.gfile module.

  • You can use tf.gfile.Glob, tf.gfile.FastGFile, tf.WholeFileReader(), and tf.train.shuffle_batch() to perform batch file operations. You must retrieve the file list before reading file data. To read multiple files, you must create a batch.

When creating deep learning experiments in the machine learning platform, you must specify parameters, such as the directory to read data from and the code files, on the right side of the page. These parameters are passed to your script as "--XXX" command-line arguments (XXX represents a string), which you can read through tf.flags.

  import os
  import tensorflow as tf

  FLAGS = tf.flags.FLAGS
  tf.flags.DEFINE_string('buckets', 'oss://{OSS Bucket}/', 'The folder where training image data is stored')
  tf.flags.DEFINE_integer('batch_size', 15, 'batch size')  # Defined as an integer because it is used in arithmetic
  files = tf.gfile.Glob(os.path.join(FLAGS.buckets, '*.jpg'))  # List the paths of all JPG files in the buckets

Use tf.gfile.FastGFile() to read a small number of files.

  for path in files:
      file_content = tf.gfile.FastGFile(path, 'rb').read()  # You must specify 'rb' when calling this function. Otherwise, errors may occur.
      image = tf.image.decode_jpeg(file_content, channels=3)  # In this example, we use JPG images.

Use tf.WholeFileReader() to read a large number of files.

  reader = tf.WholeFileReader()  # Create a reader object
  fileQueue = tf.train.string_input_producer(files)  # Create a queue for the reader to read
  file_name, file_content = reader.read(fileQueue)  # Use the reader to read a file from the queue
  image_content = tf.image.decode_jpeg(file_content, channels=3)  # Decode the file contents into images
  # Note: shuffle_batch requires fully defined tensor shapes, so resize image_content or set its shape before batching.
  label = XXX  # Label operations are omitted.
  batch = tf.train.shuffle_batch([label, image_content], batch_size=FLAGS.batch_size, num_threads=4,
                                 capacity=1000 + 3 * FLAGS.batch_size, min_after_dequeue=1000)
  sess = tf.Session()  # Create a Session object
  tf.train.start_queue_runners(sess=sess)  # Note: You must add this function to start the queue. Otherwise, the thread is blocked.
  labels, images = sess.run(batch)  # Obtain the results


  • tf.train.string_input_producer: converts the files into a queue. You must use tf.train.start_queue_runners to start the queue.

  • Parameters in tf.train.shuffle_batch are as follows:

    • batch_size: The batch size. This represents the number of data entries that are returned in each batch operation.

    • num_threads: The number of threads. This is usually set to 4.

    • capacity: The maximum number of elements held in the shuffle queue. Each batch is drawn at random from the elements currently in the queue. For example, assume that a dataset contains 10,000 files. Set capacity to 5,000 if you want to randomly select files from a maximum of 5,000 files.

    • min_after_dequeue: The minimum number of elements that remain in the queue after a dequeue operation, which keeps the elements well mixed. This must not exceed the specified capacity.
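
To make the roles of batch_size, capacity, and min_after_dequeue more concrete, here is a plain-Python simulation of a bounded shuffle buffer. It is a simplified sketch, not TensorFlow's queue implementation: the function name shuffle_batches is hypothetical, it makes a single pass over the input, it drops a final partial batch, and it requires capacity to be at least min_after_dequeue + batch_size so that a full buffer can always serve a batch.

```python
import random


def shuffle_batches(items, batch_size, capacity, min_after_dequeue, seed=0):
    """Yield shuffled batches drawn from a bounded buffer (illustration only)."""
    if capacity < min_after_dequeue + batch_size:
        raise ValueError("capacity must be at least min_after_dequeue + batch_size")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    exhausted = False
    while True:
        # Refill the buffer up to capacity while input remains.
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        if len(buffer) < batch_size:
            return  # Not enough elements left for a full batch.
        batch = []
        for _ in range(batch_size):
            # Remove one uniformly random element from the buffer.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            batch.append(buffer.pop())
        # While input remains, at least min_after_dequeue elements are still
        # buffered here, because the buffer was full before the batch was taken.
        yield batch


batches = list(shuffle_batches(range(100), batch_size=10,
                               capacity=50, min_after_dequeue=20))
print(len(batches), max(batches[0]) < 50)
```

In this sketch the first batch can only contain elements from the first 50 inputs, which mirrors the point above that capacity bounds how many files each random selection can draw from.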

How to write data to OSS?

  • Use tf.gfile.FastGFile() to write data to OSS

    tf.gfile.FastGFile(FLAGS.checkpointDir + 'example.txt', 'wb').write('hello world')

  • Use tf.gfile.Copy() to copy data to OSS

    tf.gfile.Copy('./example.txt', FLAGS.checkpointDir + 'example.txt')

You can write data to OSS by using the preceding functions. The files are stored under the following directory: “Output directory/model/example.txt”.
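
The calls above build the target path by string concatenation, which yields the expected path only if the directory value ends with a slash. As a minimal sketch, a small helper based on the standard posixpath module gives the same result either way. The helper name oss_join and the bucket name my-bucket are hypothetical, and whether the platform passes the directory with a trailing slash is not stated here.

```python
import posixpath


def oss_join(base, *parts):
    """Join an OSS directory URI and relative parts with single slashes."""
    return posixpath.join(base, *parts)


# Plain concatenation loses the separator when the slash is missing.
print("oss://my-bucket/model" + "example.txt")
# The helper gives the same path with or without the trailing slash.
print(oss_join("oss://my-bucket/model", "example.txt"))
print(oss_join("oss://my-bucket/model/", "example.txt"))
```

The relative parts should not start with a slash, because posixpath.join treats such a part as a new absolute path and discards the base.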

TensorFlow case studies

  1. How can I perform image classification using TensorFlow?

Other questions

  1. How do I view TensorFlow logs?

    For more information, see:

  2. What is the function of model_average_iter_interval when two GPUs are used?

    • If model_average_iter_interval is not set, the parallel Stochastic Gradient Descent (SGD) algorithm is used, and the gradient is updated in each iteration.

    • If model_average_iter_interval is larger than one, the model average is calculated at regular intervals. model_average_iter_interval specifies the interval.

    Using two GPUs speeds up training.
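
    The difference between the two modes can be illustrated with a toy example in plain Python. This is a conceptual sketch under simplifying assumptions (a single scalar weight, a quadratic loss, two simulated replicas), not the platform's implementation. The function name train_two_replicas and the parameter average_interval are hypothetical stand-ins for the behavior that model_average_iter_interval controls.

```python
def train_two_replicas(targets_a, targets_b, steps, lr=0.1, average_interval=None):
    """Toy data-parallel training of a scalar weight on two replicas.

    Each replica minimizes 0.5 * (w - target) ** 2 on its own data shard.
    average_interval=None: gradients are averaged in every iteration.
    average_interval=k (k > 1): replicas update independently and their
    weights are averaged every k iterations.
    """
    w_a = w_b = 0.0
    for step in range(1, steps + 1):
        grad_a = w_a - targets_a[(step - 1) % len(targets_a)]
        grad_b = w_b - targets_b[(step - 1) % len(targets_b)]
        if average_interval is None:
            # Both replicas apply the same averaged gradient each iteration.
            grad = (grad_a + grad_b) / 2.0
            w_a -= lr * grad
            w_b -= lr * grad
        else:
            # Local updates, followed by periodic averaging of the weights.
            w_a -= lr * grad_a
            w_b -= lr * grad_b
            if step % average_interval == 0:
                w_a = w_b = (w_a + w_b) / 2.0
    return w_a, w_b


print(train_two_replicas([1.0], [3.0], steps=200))
print(train_two_replicas([1.0], [3.0], steps=200, average_interval=5))
```

    In the second mode the replicas drift apart between averaging points and are pulled back together every average_interval iterations, which reduces how often the two GPUs need to synchronize.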