This topic uses the TensorFlow deep learning framework as an example to describe how to deploy an NVIDIA GPU Cloud (NGC) environment on GPU-accelerated instances.

Prerequisites

Before you set up the TensorFlow deep learning environment, complete the following preparations:
  • An Alibaba Cloud account is created and the real-name verification is completed. For more information, see Account management FAQ and Real-name registration FAQ.
  • An NGC account is created from the NGC website.
  • The NGC API key is obtained from the NGC website and saved locally. The NGC API key will be verified when you log on to the NGC container environment.

Background information

As a deep learning ecosystem from NVIDIA, NGC gives developers free access to a deep learning software stack and is well suited for creating a deep learning development environment.

At present, NGC has been fully deployed in members of the gn5 instance family. Moreover, Alibaba Cloud Marketplace provides NGC container images optimized for NVIDIA Pascal GPUs. Developers can deploy NGC container images from Alibaba Cloud Marketplace to quickly build container environments and access optimized deep learning frameworks while reducing the time spent on product development and business deployment. Other benefits include pre-installation of the development environment, support for optimized algorithm frameworks, and continuous updates.

The NGC website provides images of different versions of the current mainstream deep learning frameworks such as Caffe, Caffe2, CNTK, MXNet, TensorFlow, Theano, and Torch. You can select the desired image to deploy the environment.

The following instance families can be deployed with an NGC environment:
  • gn4, gn5, gn5i, gn6v, gn6i, and gn6e
  • ebmgn5i, ebmgn6i, ebmgn6v, and ebmgn6e
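
As a rough illustration of the list above, the following Python sketch checks whether an instance type name belongs to one of the listed families. It assumes that instance type names follow the pattern ecs.<family>-<spec>.<size> or ecs.<family>.<size> (for example, ecs.gn5-c4g1.xlarge). The function names and the parsing rule are illustrative assumptions and are not part of any Alibaba Cloud API.

```python
# Families listed in this topic as able to run an NGC environment.
NGC_FAMILIES = {
    "gn4", "gn5", "gn5i", "gn6v", "gn6i", "gn6e",
    "ebmgn5i", "ebmgn6i", "ebmgn6v", "ebmgn6e",
}


def instance_family(instance_type):
    """Extract the family from a name such as 'ecs.gn5-c4g1.xlarge'.

    Assumption: names look like 'ecs.<family>[-<spec>].<size>'.
    """
    parts = instance_type.split(".")
    if len(parts) != 3 or parts[0] != "ecs":
        raise ValueError("unexpected instance type format: %r" % instance_type)
    # Drop the optional '-<spec>' suffix to keep only the family name.
    return parts[1].split("-")[0]


def supports_ngc(instance_type):
    """Return True if the instance type belongs to a listed family."""
    return instance_family(instance_type) in NGC_FAMILIES
```

For example, supports_ngc("ecs.gn5-c4g1.xlarge") returns True under these assumptions, whereas a non-GPU type such as "ecs.g6.large" returns False.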

The following example shows how to create an instance with GPU capabilities and deploy an NGC environment on the instance. A gn5 instance is used in this example.

Procedure

  1. Create a gn5 instance. For more information, see Create an instance by using the provided wizard.
    When you configure parameters, take note of the following items:
    • Region: Only China (Qingdao), China (Beijing), China (Hohhot), China (Hangzhou), China (Shanghai), China (Shenzhen), China (Hong Kong), Singapore, Australia (Sydney), US (Silicon Valley), US (Virginia), and Germany (Frankfurt) are available.
    • Instance Type: Select a gn5 instance type.
    • Image: Click Marketplace Image, find NVIDIA GPU Cloud Virtual Machine Image in the dialog box that appears, and then click Continue.
    • Public IP Address: Select Assign Public IP Address.
      Note: If you do not select Assign Public IP Address, you can bind an Elastic IP address to the instance after it is created.
    • Security Group: Select a security group. Access to TCP port 22 must be allowed in the security group. If your instance needs to support HTTPS or DIGITS 6, access to TCP port 443 (for HTTPS) or TCP port 5000 (for DIGITS 6) must be allowed.

    After the ECS instance is created, log on to the ECS console and note down the public IP address of the instance.

  2. Connect to the ECS instance.
    Use the connection method that matches the logon credential that you selected during instance creation, such as a password or an SSH key pair.
  3. Enter the NGC API key that you obtained from the NGC website, and then press the Enter key to log on to the NGC container environment.
  4. Run the nvidia-smi command.
    You can view information about the current GPU, including the GPU model and the driver version, as shown in the following figure.
  5. Complete the following steps to build the TensorFlow deep learning framework.
    1. Log on to the NGC website, go to the TensorFlow image page, and then obtain the docker pull command.
    2. Download the TensorFlow image.
      docker pull nvcr.io/nvidia/tensorflow:18.03-py3
    3. View the downloaded image.
      docker image ls
    4. Run the container to deploy the TensorFlow development environment.
      nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:18.03-py3
  6. Use one of the following methods to test TensorFlow:
    • Simple test of TensorFlow.
      $ python
      >>> import tensorflow as tf
      >>> hello = tf.constant('Hello, TensorFlow!')
      >>> sess = tf.Session()
      >>> sess.run(hello)
      If TensorFlow loads the GPU correctly, the result is as shown in the following figure.
    • Download the TensorFlow model and test TensorFlow.
      git clone https://github.com/tensorflow/models.git
      cd models/tutorials/image/alexnet
      python alexnet_benchmark.py --batch_size 128 --num_batches 100
      The running status is as shown in the following figure.
  7. Save the changes made to the TensorFlow image. Otherwise, the changes will be lost the next time you log on.
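
One common way to keep the changes mentioned in the last step is the docker commit command, which stores the current state of a container as a new image. The Python sketch below only builds and optionally runs that command. The container ID and the target image name my-tensorflow:custom are hypothetical placeholders, and this is one possible approach rather than the method prescribed by this topic. Because the container in this example is started with --rm, it is removed when it exits, so the commit would have to be run from a second session on the ECS instance while the container is still running.

```python
import subprocess


def commit_command(container_id, new_image):
    """Build the `docker commit` command that saves a container as an image."""
    if not container_id or not new_image:
        raise ValueError("container_id and new_image must be non-empty")
    return ["docker", "commit", container_id, new_image]


def commit_container(container_id, new_image):
    """Run `docker commit` on the host (not inside the container).

    Requires Docker to be installed; raises CalledProcessError on failure.
    """
    subprocess.run(commit_command(container_id, new_image), check=True)


# Hypothetical usage: look up the real container ID with `docker ps` first.
# commit_container("<container_id>", "my-tensorflow:custom")
```

After a successful commit, the new image would appear in the output of docker image ls and could be started in place of the original image.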