
Deep learning

Last Updated: Jul 16, 2019


Introduction to deep learning

Alibaba Cloud Machine Learning Platform for AI supports multiple deep learning frameworks and provides powerful GPU clusters that contain both M40 and P100 GPU nodes. You can use these frameworks and hardware resources to train your deep learning algorithms.

Supported frameworks currently include TensorFlow (versions 1.0, 1.1, and 1.2), MXNet 0.9.5, and Caffe RC3. TensorFlow and MXNet support Python. Caffe supports custom net files.

Before using deep learning frameworks, you must upload your data to Alibaba Cloud Object Storage Service (OSS). The algorithms can read data from specified OSS directories when running. Note that machine learning GPU clusters are currently only available in the China (Shanghai) region. If your algorithms only read OSS data from the China (Shanghai) region, no traffic fees are incurred.

Enable deep learning

Currently, the deep learning feature is in beta testing. Three deep learning frameworks are supported: TensorFlow, Caffe, and MXNet. To enable deep learning, log on to the Machine Learning Platform for AI console and enable GPU resources, as shown in the following figure:

After GPU resources have been enabled, the corresponding projects are allowed to access the public resource pool and dynamically use the underlying GPU resources.

Upload data to OSS

Before using deep learning to process your data, you must upload the data to OSS. First, create an OSS bucket. Because deep learning GPU clusters are only available in the China (Shanghai) region, we recommend that you create the bucket in this region. Your data is then transmitted over the Alibaba Cloud classic network, and no traffic fees are incurred when you run your algorithms. After the bucket has been created, you can log on to the OSS console to create folders and upload your data.

OSS supports multiple data upload methods. For more information, see: https://www.alibabacloud.com/help/doc-detail/31848.html.

OSS also provides many tools to help you make full use of this service. For more information, see: https://www.alibabacloud.com/help/doc-detail/44075.html.

We recommend that you use the ossutil or osscmd command-line tool to upload or download files. Both tools can resume interrupted transfers from breakpoints.

Note: You must specify your AccessKey ID to use these tools. You can log on to the AccessKey console to create or view your AccessKeys.

Read OSS buckets

Before reading data from OSS buckets, you must grant “AliyunODPSPAIDefaultRole” access to your DTplus account.

Note: The machine learning platform is based on MaxCompute. Accounts are shared between both platforms. The default role is granted to the MaxCompute account.

You must grant OSS read and write permissions in Settings. For more information, see RAM settings.

Settings

RAM settings

  1. Log on to Machine Learning Platform for AI console, click Settings in the left menu bar, and select General.

  2. Choose Authorize Machine Learning to access my OSS objects on the OSS Authorization tab.

  3. Click Click here to authorize access in RAM.

  4. Click Confirm Authorization Policy.

    Note: For more information about AliyunODPSPAIDefaultRole, visit the RAM console. The default role AliyunODPSPAIDefaultRole can perform the following actions:

    | Action | Description |
    | --- | --- |
    | oss:PutObject | Upload file or folder objects |
    | oss:GetObject | Obtain file or folder objects |
    | oss:ListObjects | View the list of files |
    | oss:DeleteObjects | Delete objects |
  5. Navigate to the machine learning platform, and click Refresh to load RAM information, as shown in the following figure.

  6. Link the Read OSS Bucket component with the corresponding deep learning component to grant OSS permissions.

TensorFlow

TensorFlow (TF) is an open source machine learning framework developed by Google. TF is simple and easy to use. Alibaba Cloud Machine Learning Platform for AI supports TensorFlow. You can write code in TF and dynamically adjust the GPU resources.

Parameters

  • Parameter settings

    • Python Code File: You can compress multiple code files into a tar.gz file.

    • Main Python File: Specify the main file in the compressed file. This step is optional.

    • Data Source Directory: Select the OSS data source.

    • Hyperparameters and Custom Parameters: You can write commands to specify hyperparameters and custom parameters. Using these parameters, you can try different learning rates and batch sizes in your models.

    • Output Directory: Specify the output directory of your model.

  • Tuning

    You can specify the number of GPUs based on the complexity of your tasks.

PAI commands

Not all parameters are needed for actual use. Specify your own parameters based on your needs. For more information about these parameters, see the following table.

    PAI -name tensorflow_ext
    -Dbuckets="oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_tensorflow/mnist/"
    -DgpuRequired="100" -Darn="acs:ram::166408185518****:role/aliyunodpspaidefaultrole"
    -Dscript="oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_tensorflow/mnist_ext.py";

The parameters are described in the following table:

| Parameter | Description | Format | Default |
| --- | --- | --- | --- |
| script | Required. The TF algorithm file. This file can be a regular Python file or a TAR.GZ file. | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_tensorflow/mnist_ext.py | N/A |
| entryFile | Optional. The entry file for the algorithm. This parameter is required if you specified a TAR.GZ file for the script parameter. | train.py | Null |
| buckets | Required. You can specify multiple OSS buckets by separating them with commas (,). Each bucket must end with a forward slash (/). | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_tensorflow/mnist/ | Null |
| arn | Required. The OSS role_arn. | N/A | Null |
| gpuRequired | Required. The number of GPUs that are used. | 200 | 100 |
| checkpointDir | Optional. The TF checkpoint directory. | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_tensorflow/mnist/ | Null |
| hyperParameters | Optional. The path of the hyperparameter file. | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_tensorflow/mnist/hyper_parameters.txt | Null |
  • script and entryFile are used to specify the algorithm scripts. If your algorithm is complex and consists of multiple files, you can compress these files into a TAR.GZ file and use entryFile to specify the entry file for the algorithm.

  • checkpointDir is used to specify the OSS directory that the algorithm writes to. This parameter is required if you must save models in TensorFlow.

  • buckets is used to specify the OSS directory from which the algorithm reads data. arn is required to use OSS.
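For instance, an algorithm that spans several files can be packaged before upload so that script points to the archive and entryFile names the main file. A minimal sketch (the file and directory names below are hypothetical):

```shell
# Hypothetical project layout: train.py is the entry file, model.py is a helper.
mkdir -p mnist_code
printf 'print("train")\n' > mnist_code/train.py
printf 'print("model")\n' > mnist_code/model.py

# Package the code; upload the archive to OSS, pass it as script,
# and pass "train.py" as entryFile.
tar -czf mnist_code.tar.gz -C mnist_code train.py model.py

# Inspect the archive contents.
tar -tzf mnist_code.tar.gz
```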

Example

MNIST is the official tutorial provided by TF. This tutorial trains a model that scans images of handwritten digits and predicts what digits they are.

  1. Upload Python code files and training datasets to OSS. In this case, a bucket named tfmnist is created in the China (Shanghai) region to store Python files and training datasets.

  2. Drag a Read OSS Bucket and a TensorFlow component to the canvas and link them as follows.

  3. Specify the parameters for the TensorFlow component as shown in the following figure.

  4. Click Run and wait for the process to finish.

  5. Right-click the TensorFlow component to view the runtime logs.

Read data in TensorFlow

Inefficient I/O approaches

The major difference between TensorFlow code that runs locally and on cloud servers is how data is read:

  • Reading local data: The server directly obtains graphs from the client.

  • Reading data on cloud servers: After obtaining graphs, the servers must distribute them to workers for computing. For more information, see the TensorFlow advanced article.

The following example reads data from a CSV file and demonstrates how to read data efficiently using TensorFlow. The CSV file is as follows:

    1,1,1,1,1
    2,2,2,2,2
    3,3,3,3,3

Note the following issues:

  • Do not use the built-in I/O methods in Python.

    The machine learning platform supports the built-in I/O methods in Python. To use these methods, you must compress the data source and code into a package and upload this package to OSS. This approach loads data into memory for computing and is inefficient. The example code is as follows.

      import csv
      csv_reader = csv.reader(open('csvtest.csv'))
      for row in csv_reader:
          print(row)
  • Do not use third-party methods for file I/O.

    Many developers use third-party methods for file I/O. For example, both TFLearn and pandas provide file I/O methods. However, many of these methods are based on Python I/O methods and are therefore still inefficient.

  • Do not use preload when reading files.

    You may find that the GPUs are not significantly faster than the CPUs when using the machine learning platform. The main cause is data I/O.

    To preload data into the memory, you may use feed methods to read the data first and then use session methods for computing. This approach wastes computing resources and does not support large volumes of data due to the memory limit.

    For example, assume a hard disk contains an image dataset that must be loaded into memory before the GPU or CPU can compute on it. This sounds simple but is not easy to do efficiently: the data must be loaded before each computing step starts. If loading one image takes 0.1s and computing on it takes 0.9s, the GPU sits idle for 0.1s out of every second, which greatly reduces efficiency.
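The idle-time arithmetic above can be checked directly. A short sketch, using the 0.1s and 0.9s figures assumed in the example:

```python
load_time = 0.1     # seconds to load one image (assumed figure from the example)
compute_time = 0.9  # seconds to compute on one image (assumed figure)

# Serial pipeline: the GPU must wait while each image loads.
step_time = load_time + compute_time
gpu_utilization = compute_time / step_time

print(gpu_utilization)  # 0.9, i.e. the GPU is idle 10% of the time
```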

Efficient I/O approach

The efficient I/O approach uses TensorFlow operators to process the data and uses the session.run method to pull the data. One read thread continuously loads images from the file system into a memory queue, and one compute thread directly retrieves the data for computing from the memory queue. No GPU is left idle due to the data I/O.
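The read-thread/compute-thread arrangement can be sketched with Python's standard library alone. This is a toy illustration of the queue pattern, not PAI or TensorFlow code:

```python
import queue
import threading

q = queue.Queue(maxsize=8)  # bounded memory queue between the two threads
results = []

def reader():
    # Read thread: continuously loads items (stand-ins for images) into the queue.
    for i in range(4):
        q.put(i)
    q.put(None)  # sentinel: no more data

def compute():
    # Compute thread: pulls items from the queue as soon as they arrive.
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)  # stand-in for the computing step

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=compute)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4, 6]
```

Because loading and computing overlap, the compute thread never waits for the whole dataset to be read first.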

The following code demonstrates how to read data using TensorFlow operators.

    import argparse
    import tensorflow as tf
    import os

    FLAGS = None

    def main(_):
        dirname = os.path.join(FLAGS.buckets, "csvtest.csv")
        reader = tf.TextLineReader()
        filename_queue = tf.train.string_input_producer([dirname])
        key, value = reader.read(filename_queue)
        record_defaults = [[''], [''], [''], [''], ['']]
        d1, d2, d3, d4, d5 = tf.decode_csv(value, record_defaults, ',')
        init = tf.initialize_all_variables()
        with tf.Session() as sess:
            sess.run(init)
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            for i in range(4):
                print(sess.run(d2))
            coord.request_stop()
            coord.join(threads)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--buckets', type=str, default='', help='input data path')
        parser.add_argument('--checkpointDir', type=str, default='', help='output model path')
        FLAGS, _ = parser.parse_known_args()
        tf.app.run(main=main)
  • dirname: The OSS file path. This parameter can be an array type.

  • reader: TF provides multiple reader APIs. You can select these APIs based on your needs.

  • tf.train.string_input_producer: Generates a queue from the file.

  • tf.decode_csv: Splits each line of the file and returns the specified fields.

  • To retrieve data using operators, you must call tf.train.Coordinator() and tf.train.start_queue_runners(sess=sess, coord=coord) in the session.

To test the code, enter the following three rows as input:

    1,1,1,1,1
    2,2,2,2,2
    3,3,3,3,3

Run the code four times to print the second field in each row. The result is shown in the following figure.

The result indicates that the data structure is a queue.

Others

The machine learning platform has released the notebook feature, which allows you to modify your code online and provides built-in support for multiple deep learning frameworks. To start using this feature, click https://www.alibabacloud.com/product/machine-learning.

Configure multiple workers and tasks in TensorFlow

The machine learning platform currently supports multiple workers and multiple tasks in TensorFlow. This feature is only available in the China (Beijing) region. Using this feature, you can perform large scale data training. Contact us for detailed billing information.

Concepts

  • Parameter Server (PS) node: Stores the parameters generated during the compute process. When multiple PS nodes have been created, these parameters are automatically sliced and stored in different PS nodes to facilitate the communication between worker nodes and PS nodes.

  • Worker node: A worker node is where the GPU resides.

  • Task node: In TensorFlow, data is sliced and distributed on different task nodes to perform parameter training.
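To make the PS/worker split concrete, a cluster with two PS nodes and two worker nodes can be described as a host map, with parameters sliced across the PS nodes. A minimal sketch (the host names, ports, and shard names are hypothetical; on the platform this mapping is generated for you):

```python
# Hypothetical cluster layout; on PAI the framework supplies these hosts.
cluster = {
    "ps": ["ps0.example:2222", "ps1.example:2222"],
    "worker": ["worker0.example:2222", "worker1.example:2222"],
}

# Round-robin sketch of how parameter shards could be spread across PS nodes.
shards = ["w1", "w2", "b1", "b2"]
placement = {s: cluster["ps"][i % len(cluster["ps"])] for i, s in enumerate(shards)}

print(placement["w1"])  # ps0.example:2222
print(placement["w2"])  # ps1.example:2222
```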

How to use multiple workers and tasks

To use multiple workers and tasks in TensorFlow, you do not need to worry about resource scheduling in the backend. You can create a distributed computing network using simple configurations. For more information about this service, see the following steps.

  1. Basic configurations.

    1. Download the mnist_cluster.tar.gz file and upload this file to OSS.

    2. Configure OSS access permissions.

    3. Drag a TensorFlow component and link it with a Read OSS Bucket. Select the path of mnist_cluster.tar.gz for Python Code File and enter mnist_cluster.py for Python Main Code, as shown in the following figure.

    4. Select Tuning to specify the other parameters.

    5. After completing the configuration, you have created a computing network with multiple workers and tasks, as shown in the following figure. PS represents the Parameter Server node, WORKER represents the compute node, and TASK represents the GPU.

  2. Code configurations.

    In traditional TF experiments, you must specify the port number of each computing node in the code, as shown in the following figure.

    When the number of compute nodes increases, the port configuration becomes more complex. In the machine learning platform, port configuration is much easier. You can use the following code to retrieve the port number of each compute node.

      ps_hosts = FLAGS.ps_hosts.split(",")          # Get ps_hosts ports from the framework layer.
      worker_hosts = FLAGS.worker_hosts.split(",")  # Get worker_hosts ports from the framework layer.
  3. View runtime logs.

    1. Right-click a TensorFlow component to view the logs. In the following example, two PS nodes and two WORKER nodes are allocated to this TensorFlow component.

    2. Click the blue link to view the running conditions of a worker node in logview.

Download sample code

https://www.alibabacloud.com/help/doc-detail/107974.html

Use hyperparameters in TensorFlow

You can specify hyperparameter files in the Hyperparameters and Custom Parameters field of the parameters setting page. For example:

    batch_size=10
    learning_rate=0.01

You can reference a hyperparameter as follows:

    import tensorflow as tf
    tf.app.flags.DEFINE_string("learning_rate", "", "learning_rate")
    tf.app.flags.DEFINE_string("batch_size", "", "batch size")
    FLAGS = tf.app.flags.FLAGS
    print("learning rate:" + FLAGS.learning_rate)
    print("batch size:" + FLAGS.batch_size)
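The tf.app.flags usage above suggests that each key=value line of the hyperparameter file reaches the script as a command-line flag. Plain argparse works the same way; the sketch below hand-supplies the argv list to simulate those flags:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.0)
parser.add_argument("--batch_size", type=int, default=0)

# Simulated command line; on the platform these flags would come
# from the hyperparameter file.
args, _ = parser.parse_known_args(["--learning_rate=0.01", "--batch_size=10"])
print(args.learning_rate, args.batch_size)  # 0.01 10
```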

Third-party libraries supported by TensorFlow

Third-party libraries supported by TensorFlow 1.0.0

  1. appdirs (1.4.3)
  2. backports-abc (0.5)
  3. backports.shutil-get-terminal-size (1.0.0)
  4. backports.ssl-match-hostname (3.5.0.1)
  5. bleach (2.0.0)
  6. boto (2.48.0)
  7. bz2file (0.98)
  8. certifi (2017.7.27.1)
  9. chardet (3.0.4)
  10. configparser (3.5.0)
  11. cycler (0.10.0)
  12. decorator (4.1.2)
  13. docutils (0.14)
  14. easygui (0.98.1)
  15. entrypoints (0.2.3)
  16. enum34 (1.1.6)
  17. funcsigs (1.0.2)
  18. functools32 (3.2.3.post2)
  19. gensim (2.3.0)
  20. h5py (2.7.0)
  21. html5lib (0.999999999)
  22. idna (2.6)
  23. iniparse (0.4)
  24. ipykernel (4.6.1)
  25. ipython (5.4.1)
  26. ipython-genutils (0.2.0)
  27. ipywidgets (7.0.0)
  28. Jinja2 (2.9.6)
  29. jsonschema (2.6.0)
  30. jupyter (1.0.0)
  31. jupyter-client (5.1.0)
  32. jupyter-console (5.1.0)
  33. jupyter-core (4.3.0)
  34. Keras (2.0.6)
  35. kitchen (1.1.1)
  36. langtable (0.0.31)
  37. MarkupSafe (1.0)
  38. matplotlib (2.0.2)
  39. mistune (0.7.4)
  40. mock (2.0.0)
  41. nbconvert (5.2.1)
  42. nbformat (4.4.0)
  43. networkx (1.11)
  44. nose (1.3.7)
  45. notebook (5.0.0)
  46. numpy (1.13.1)
  47. olefile (0.44)
  48. pandas (0.20.3)
  49. pandocfilters (1.4.2)
  50. pathlib2 (2.3.0)
  51. pbr (3.1.1)
  52. pexpect (4.2.1)
  53. pickleshare (0.7.4)
  54. Pillow (4.2.1)
  55. pip (9.0.1)
  56. prompt-toolkit (1.0.15)
  57. protobuf (3.1.0)
  58. ptyprocess (0.5.2)
  59. pycrypto (2.6.1)
  60. pycurl (7.19.0)
  61. Pygments (2.2.0)
  62. pygobject (3.14.0)
  63. pygpgme (0.3)
  64. pyliblzma (0.5.3)
  65. pyparsing (2.2.0)
  66. python-dateutil (2.6.1)
  67. pytz (2017.2)
  68. PyWavelets (0.5.2)
  69. pyxattr (0.5.1)
  70. PyYAML (3.12)
  71. pyzmq (16.0.2)
  72. qtconsole (4.3.1)
  73. requests (2.18.4)
  74. scandir (1.5)
  75. scikit-image (0.13.0)
  76. scikit-learn (0.19.0)
  77. scikit-sound (0.1.8)
  78. scikit-stack (3.0)
  79. scikit-surprise (1.0.3)
  80. scikit-tensor (0.1)
  81. scikit-video (0.1.2)
  82. scipy (0.19.1)
  83. setuptools (36.2.7)
  84. simplegeneric (0.8.1)
  85. singledispatch (3.4.0.3)
  86. six (1.10.0)
  87. slip (0.4.0)
  88. slip.dbus (0.4.0)
  89. smart-open (1.5.3)
  90. subprocess32 (3.2.7)
  91. tensorflow (1.0.0)
  92. terminado (0.6)
  93. testpath (0.3.1)
  94. tflearn (0.3.2)
  95. Theano (0.9.0)
  96. torch (0.1.12.post2)
  97. tornado (4.5.1)
  98. traitlets (4.3.2)
  99. urlgrabber (3.10)
  100. urllib3 (1.22)
  101. wcwidth (0.1.7)
  102. webencodings (0.5.1)
  103. wheel (0.29.0)
  104. widgetsnbextension (3.0.0)
  105. yum-langpacks (0.4.2)
  106. yum-metadata-parser (1.1.4)
  107. opencv-python (3.3.0.10)

Third-party libraries supported by TensorFlow 1.1.0

  1. appdirs (1.4.3)
  2. backports-abc (0.5)
  3. backports.shutil-get-terminal-size (1.0.0)
  4. backports.ssl-match-hostname (3.5.0.1)
  5. bleach (2.0.0)
  6. boto (2.48.0)
  7. bz2file (0.98)
  8. certifi (2017.7.27.1)
  9. chardet (3.0.4)
  10. configparser (3.5.0)
  11. cycler (0.10.0)
  12. decorator (4.1.2)
  13. docutils (0.14)
  14. easygui (0.98.1)
  15. entrypoints (0.2.3)
  16. enum34 (1.1.6)
  17. funcsigs (1.0.2)
  18. functools32 (3.2.3.post2)
  19. gensim (2.3.0)
  20. h5py (2.7.1)
  21. html5lib (0.999999999)
  22. idna (2.6)
  23. iniparse (0.4)
  24. ipykernel (4.6.1)
  25. ipython (5.4.1)
  26. ipython-genutils (0.2.0)
  27. ipywidgets (7.0.0)
  28. Jinja2 (2.9.6)
  29. jsonschema (2.6.0)
  30. jupyter (1.0.0)
  31. jupyter-client (5.1.0)
  32. jupyter-console (5.2.0)
  33. jupyter-core (4.3.0)
  34. jupyter-tensorboard (0.1.1)
  35. Keras (2.0.8)
  36. kitchen (1.1.1)
  37. langtable (0.0.31)
  38. MarkupSafe (1.0)
  39. matplotlib (2.0.2)
  40. mistune (0.7.4)
  41. mock (2.0.0)
  42. nbconvert (5.3.0)
  43. nbformat (4.4.0)
  44. networkx (1.11)
  45. nose (1.3.7)
  46. notebook (4.4.1)
  47. numpy (1.13.1)
  48. olefile (0.44)
  49. pandas (0.20.3)
  50. pandocfilters (1.4.2)
  51. pathlib2 (2.3.0)
  52. pbr (3.1.1)
  53. pexpect (4.2.1)
  54. pickleshare (0.7.4)
  55. Pillow (4.2.1)
  56. pip (9.0.1)
  57. prompt-toolkit (1.0.15)
  58. protobuf (3.1.0)
  59. ptyprocess (0.5.2)
  60. pycrypto (2.6.1)
  61. pycurl (7.19.0)
  62. Pygments (2.2.0)
  63. pygobject (3.14.0)
  64. pygpgme (0.3)
  65. pyliblzma (0.5.3)
  66. pyparsing (2.2.0)
  67. python-dateutil (2.6.1)
  68. pytz (2017.2)
  69. PyWavelets (0.5.2)
  70. pyxattr (0.5.1)
  71. PyYAML (3.12)
  72. pyzmq (16.0.2)
  73. qtconsole (4.3.1)
  74. requests (2.18.4)
  75. scandir (1.5)
  76. scikit-image (0.13.0)
  77. scikit-learn (0.19.0)
  78. scikit-sound (0.1.8)
  79. scikit-stack (3.0)
  80. scikit-surprise (1.0.3)
  81. scikit-tensor (0.1)
  82. scikit-video (0.1.2)
  83. scipy (0.19.1)
  84. setuptools (36.4.0)
  85. simplegeneric (0.8.1)
  86. singledispatch (3.4.0.3)
  87. six (1.10.0)
  88. slip (0.4.0)
  89. slip.dbus (0.4.0)
  90. smart-open (1.5.3)
  91. subprocess32 (3.2.7)
  92. tensorflow (1.1.0)
  93. terminado (0.6)
  94. testpath (0.3.1)
  95. tflearn (0.3.2)
  96. Theano (0.9.0)
  97. torch (0.1.12.post2)
  98. tornado (4.5.2)
  99. traitlets (4.3.2)
  100. urlgrabber (3.10)
  101. urllib3 (1.22)
  102. wcwidth (0.1.7)
  103. webencodings (0.5.1)
  104. Werkzeug (0.12.2)
  105. wheel (0.29.0)
  106. widgetsnbextension (3.0.2)
  107. yum-langpacks (0.4.2)
  108. yum-metadata-parser (1.1.4)
  109. opencv-python (3.3.0.10)

Third-party libraries supported by TensorFlow 1.2.1

  1. appdirs (1.4.3)
  2. backports-abc (0.5)
  3. backports.shutil-get-terminal-size (1.0.0)
  4. backports.ssl-match-hostname (3.5.0.1)
  5. backports.weakref (1.0rc1)
  6. bleach (1.5.0)
  7. boto (2.48.0)
  8. bz2file (0.98)
  9. certifi (2017.7.27.1)
  10. chardet (3.0.4)
  11. configparser (3.5.0)
  12. cycler (0.10.0)
  13. decorator (4.1.2)
  14. docutils (0.14)
  15. easygui (0.98.1)
  16. entrypoints (0.2.3)
  17. enum34 (1.1.6)
  18. funcsigs (1.0.2)
  19. functools32 (3.2.3.post2)
  20. gensim (2.3.0)
  21. h5py (2.7.1)
  22. html5lib (0.9999999)
  23. idna (2.6)
  24. iniparse (0.4)
  25. ipykernel (4.6.1)
  26. ipython (5.4.1)
  27. ipython-genutils (0.2.0)
  28. ipywidgets (7.0.0)
  29. Jinja2 (2.9.6)
  30. jsonschema (2.6.0)
  31. jupyter (1.0.0)
  32. jupyter-client (5.1.0)
  33. jupyter-console (5.2.0)
  34. jupyter-core (4.3.0)
  35. jupyter-tensorboard (0.1.1)
  36. Keras (2.0.8)
  37. kitchen (1.1.1)
  38. langtable (0.0.31)
  39. Markdown (2.6.9)
  40. MarkupSafe (1.0)
  41. matplotlib (2.0.2)
  42. mistune (0.7.4)
  43. mock (2.0.0)
  44. nbconvert (5.3.0)
  45. nbformat (4.4.0)
  46. networkx (1.11)
  47. nose (1.3.7)
  48. notebook (4.4.1)
  49. numpy (1.13.1)
  50. olefile (0.44)
  51. pandas (0.20.3)
  52. pandocfilters (1.4.2)
  53. pathlib2 (2.3.0)
  54. pbr (3.1.1)
  55. pexpect (4.2.1)
  56. pickleshare (0.7.4)
  57. Pillow (4.2.1)
  58. pip (9.0.1)
  59. prompt-toolkit (1.0.15)
  60. protobuf (3.1.0)
  61. ptyprocess (0.5.2)
  62. pycrypto (2.6.1)
  63. pycurl (7.19.0)
  64. Pygments (2.2.0)
  65. pygobject (3.14.0)
  66. pygpgme (0.3)
  67. pyliblzma (0.5.3)
  68. pyparsing (2.2.0)
  69. python-dateutil (2.6.1)
  70. pytz (2017.2)
  71. PyWavelets (0.5.2)
  72. pyxattr (0.5.1)
  73. PyYAML (3.12)
  74. pyzmq (16.0.2)
  75. qtconsole (4.3.1)
  76. requests (2.18.4)
  77. scandir (1.5)
  78. scikit-image (0.13.0)
  79. scikit-learn (0.19.0)
  80. scikit-sound (0.1.8)
  81. scikit-stack (3.0)
  82. scikit-surprise (1.0.3)
  83. scikit-tensor (0.1)
  84. scikit-video (0.1.2)
  85. scipy (0.19.1)
  86. setuptools (36.4.0)
  87. simplegeneric (0.8.1)
  88. singledispatch (3.4.0.3)
  89. six (1.10.0)
  90. slip (0.4.0)
  91. slip.dbus (0.4.0)
  92. smart-open (1.5.3)
  93. subprocess32 (3.2.7)
  94. tensorflow (1.2.1)
  95. terminado (0.6)
  96. testpath (0.3.1)
  97. tflearn (0.3.2)
  98. Theano (0.9.0)
  99. torch (0.1.12.post2)
  100. tornado (4.5.2)
  101. traitlets (4.3.2)
  102. urlgrabber (3.10)
  103. urllib3 (1.22)
  104. wcwidth (0.1.7)
  105. webencodings (0.5.1)
  106. Werkzeug (0.12.2)
  107. wheel (0.29.0)
  108. widgetsnbextension (3.0.2)
  109. yum-langpacks (0.4.2)
  110. yum-metadata-parser (1.1.4)
  111. opencv-python (3.3.0.10)

MXNet

MXNet is a deep learning framework that supports both imperative and symbolic programming. It can be distributed on either CPU or GPU clusters. MXNet allows you to write more efficient applications than CXXNet, a distributed deep learning framework that was inspired by Minerva.

Parameters

  • Parameter settings

    • Python Code File: You can compress multiple code files into a tar.gz file.

    • Main Python File: Specify the main file in the compressed file. This step is optional.

    • Data Source Directory: Select the OSS data source.

    • Hyperparameters and Custom Parameters: You can write commands to specify hyperparameters and custom parameters. Using these parameters, you can test different learning rates and batch sizes in your models.

    • Output Directory: Specify the output directory of your model.

  • Tuning

    You can specify the number of GPUs based on the complexity of your tasks.

PAI commands

Not all parameters are needed for actual use. Specify your own parameters based on your needs. For more information about these parameters, see the following table.

    pai -name mxnet_ext
    -Dscript="oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/mxnet-ext-code/mxnet_cifar10_demo.tar.gz"
    -DentryFile="train_cifar10.py"
    -Dbuckets="oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com"
    -DcheckpointDir="oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/mxnet-ext-model/"
    -DhyperParameters="oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/mxnet-ext-code/hyperparam.txt.single"
    -Darn="acs:ram::1664081855183111:role/role-for-pai";

The parameters are described in the following table:

| Parameter | Description | Format | Default |
| --- | --- | --- | --- |
| script | Required. The MXNet algorithm file. This file can be a regular Python file or a TAR.GZ file. | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_mxnet/mnist_ext.py | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/smoke_mxnet/mnist_ext.py |
| entryFile | Optional. The entry file for the algorithm. This parameter is required if you specified a TAR.GZ file for the script parameter. | train.py | Null |
| buckets | Required. You can specify multiple OSS buckets by separating them with commas (,). Each bucket must end with a forward slash (/). | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com | Null |
| hyperParameters | Optional. The path of the hyperparameter file. | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/mxnet-ext-code/ | Null |
| gpuRequired | Required. The number of GPUs that are used. | 200 | 100 |
| checkpointDir | Optional. The checkpoint directory. | oss://imagenet.oss-cn-shanghai-internal.aliyuncs.com/mxnet-ext-code/ | Null |

Example

The CIFAR-10 dataset contains 60,000 32x32 color images in 10 different categories. This dataset is commonly used to train machine learning algorithms to recognize objects and sort them into airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, or trucks. For more information, see: https://www.cs.toronto.edu/~kriz/cifar.html

  1. Upload Python code files and training datasets to OSS. In this case, a bucket named tfmnist is created in the China (Shanghai) region to store Python files and training datasets.

  2. Drag a Read OSS Bucket and an MXNet component to the canvas and link them as follows.

  3. Specify the parameters for the MXNet component as shown in the following figure.

    • Select a TAR.GZ file for the Python Code File.
    • Specify the entry file for the algorithm as the Main Python File.
    • Select a TXT file to specify hyperparameters and custom parameters.
    • The checkpoint directory is the output directory of the model.
  4. Click Run and wait for the process to finish.

  5. Right-click the MXNet component to view the runtime logs.

  6. The model is generated under the checkpoint directory as follows.

Format conversion

Currently, custom data file formats are not supported in PAI Caffe. You must convert your training data files to the required format by using the format conversion component.

  • Link the Input to a Read OSS Bucket component.

  • Parameters

    • OSS input directory: the OSS training data in a file_list file, for example, bucket.hz.aliyun.com/train_img/train_file_list.txt. The format of file_list is as follows:

          bucket/ilsvrc12_val/ILSVRC2012_val_00029021.JPEG 817
          bucket/ilsvrc12_val/ILSVRC2012_val_00021046.JPEG 913
          bucket/ilsvrc12_val/ILSVRC2012_val_00041166.JPEG 486
          bucket/ilsvrc12_val/ILSVRC2012_val_00029527.JPEG 327
          bucket/ilsvrc12_val/ILSVRC2012_val_00042825.JPEG 138
    • OSS output directory, for example, bucket_name.oss-cn-hangzhou-zmf.aliyuncs.com/ilsvrc12_val_convert. The converted data_file_list.txt file and corresponding data files are stored in this directory. The format of data_file_list is as follows:

          bucket/ilsvrc12_val_convert/train_data_00_01
          bucket/ilsvrc12_val_convert/train_data_00_02

    • Encoding type: Optional. You can select jpg, png, or raw.

    • Shuffle: The default setting is true.

    • File prefix: The default value is data.

    • resize_height: The default value is 256.

    • resize_width: The default value is 256.

    • isGray: The default setting is false.

    • Generate the image mean file: The default setting is false.
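Each file_list line is an OSS object path followed by a space and an integer class label. A small stdlib sketch of parsing one such line (the helper name is ours, not part of PAI):

```python
def parse_file_list_line(line):
    """Split a 'path label' file_list line into (path, int label)."""
    path, label = line.rsplit(" ", 1)  # split on the last space only
    return path, int(label)

path, label = parse_file_list_line(
    "bucket/ilsvrc12_val/ILSVRC2012_val_00029021.JPEG 817"
)
print(path, label)  # bucket/ilsvrc12_val/ILSVRC2012_val_00029021.JPEG 817
```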

PAI commands

    pai -name convert_image_oss2oss
    -Darn=acs:ram::1607128916545079:role/test-1
    -DossImageList=bucket_name.oss-cn-hangzhou-zmf.aliyuncs.com/image_list.txt
    -DossOutputDir=bucket_name.oss-cn-hangzhou-zmf.aliyuncs.com/your/dir
    -DencodeType=jpg
    -Dshuffle=true
    -DdataFilePrefix=train
    -DresizeHeight=256
    -DresizeWidth=256
    -DisGray=false
    -DimageMeanFile=false

Parameters

Parameters

| Parameter | Description | Format | Default |
| --- | --- | --- | --- |
| ossHost | The OSS host address. | For example, "oss-test.aliyun-inc.com". | Optional. The default value is "oss-cn-hangzhou-zmf.aliyuncs.com". |
| arn | The ARN of the default role of the OSS bucket. | For example, "acs:ram::XXXXXXXXXXXXXXXX:role/ossaccessroleforodps". The 16-digit number in the middle represents the RoleArn. | Required |
| ossImageList | The image file list. | For example, "bucket_name/image_list.txt". | Required |
| ossOutputDir | The OSS output directory. | For example, "bucket_name/your/dir". | Required |
| encodeType | The encoding type. | For example, jpg, png, or raw. | Optional. The default setting is jpg. |
| shuffle | Whether to shuffle the data. | Boolean | Optional. The default setting is true. |
| dataFilePrefix | The prefix of the data file. | String, for example, train or val. | Required |
| resizeHeight | The image resize height. | Int | Optional. The default value is 256. |
| resizeWidth | The image resize width. | Int | Optional. The default value is 256. |
| isGray | Whether the image is grayscale. | Boolean | Optional. The default setting is false. |
| imageMeanFile | Whether to generate the image mean file. | Boolean | Optional. The default setting is false. |

Caffe

Caffe is a lightweight, scalable, and fast deep learning framework developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license. Caffe's official website is http://caffe.berkeleyvision.org/.

Parameters

  • First, configure OSS access permissions.

  • You only need to specify the OSS path of the solver.prototxt file. Note the following parameters in this file:

    • net: “bucket.hz.aliyun.com/alexnet/train_val.prototxt”. The OSS path of the net file.
    • type: “ParallelSGD”. The type is denoted as a string value.
    • model_average_iter_interval: 1. The synchronization frequency. A value of 1 means that the nodes synchronize after each iteration.
    • snapshot_prefix: “bucket/snapshot/alexnet_train”. The OSS output directory of the model.
      net: "bucket/alexnet/train_val.prototxt"
      test_iter: 1000
      test_interval: 1000
      base_lr: 0.01
      lr_policy: "step"
      gamma: 0.1
      stepsize: 100000
      display: 20
      max_iter: 450000
      momentum: 0.9
      weight_decay: 0.0005
      snapshot: 10000
      snapshot_prefix: "bucket/snapshot/alexnet_train"
      solver_mode: GPU
      type: "ParallelSGD"
      model_average_iter_interval: 1
  • Select BinaryDataLayer for the datalayer in train_val, as shown in the following example:

      layer {
        name: "data"
        type: "BinaryData"
        top: "data"
        top: "label"
        include {
          phase: TRAIN
        }
        transform_param {
          mirror: true
          crop_size: 227
          mean_file: "bucket/imagenet_mean.binaryproto"
        }
        binary_data_param {
          source: "bucket/ilsvrc12_train_binary/data_file_list.txt"
          batch_size: 256
          num_threads: 10
        }
      }
      layer {
        name: "data"
        type: "BinaryData"
        top: "data"
        top: "label"
        include {
          phase: TEST
        }
        transform_param {
          mirror: false
          crop_size: 227
          mean_file: "bucket/imagenet_mean.binaryproto"
        }
        binary_data_param {
          source: "bucket/ilsvrc12_val_binary/data_file_list.txt"
          batch_size: 50
          num_threads: 10
        }
      }

    The new data layer is named BinaryData. You can also use transform_param to perform image data conversion operations. The parameters are consistent with the native parameters in Caffe.

binary_data_param specifies the parameter settings for the data layer, including the following parameters.

  • source: the data source. The path of the data source is the same as the path that is specified in filelist. It starts with a bucket name and does not contain oss://.

  • num_threads: the number of concurrent threads for reading OSS data. The default value is 10. You can modify this parameter based on your needs.

PAI commands

    pai -name pluto_train_oss
    -DossHost=oss-cn-hangzhou-zmf.aliyuncs.com
    -Darn=acs:ram::1607128916545079:role/test-1
    -DsolverPrototxtFile=bucket_name.oss-cn-hangzhou-zmf.aliyuncs.com/solver.prototxt
    -DgpuRequired=1

Parameters

| Parameter | Description | Format | Default |
| --- | --- | --- | --- |
| ossHost | The OSS host address. | For example, "oss-test.aliyun-inc.com". | Optional. The default value is "oss-cn-hangzhou-zmf.aliyuncs.com". |
| arn | The ARN of the default role of the OSS bucket. | For example, "acs:ram::XXXXXXXXXXXXXXXX:role/ossaccessroleforodps". The 16-digit number in the middle represents the RoleArn. | Required |
| solverPrototxtFile | The solver file. | The OSS path of the solver file, which starts with a bucket name. | Required |
| gpuRequired | The number of GPUs. | Int | Optional. The default value is 1. |

Example

The following example trains a model with MNIST data using Caffe.

  1. Prepare the data source.

    Download Caffe data in the Deep learning example resources section and extract the data. Upload the data to OSS as follows.

  2. Implement the experiment.

    Drag a Caffe component and link it with a Read OSS Bucket component as follows.

    Set Solver OSS Path to the path of mnist_solver_dnn_binary.prototxt. Click Run.

  3. View logs.

    Right-click a Caffe component to view logs.

    Click a logview link > ODPS Tasks > VlinuxTask > StdErr to view the logs generated during the training.