As a deep learning ecosystem from NVIDIA, NVIDIA GPU CLOUD (NGC) allows developers to access the deep learning software stack free of charge and is fit for creating a deep learning development environment.

At present, NGC has been fully deployed in the gn5 instances. Moreover, the image market also provides NGC container images optimized for NVIDIA Pascal GPU. By deploying NGC container images from the image market, developers can build an NGC container environment conveniently, and access optimized deep learning frameworks instantly, thus reducing the product development and business deployment time considerably. Other benefits include pre-installation of the development environment, support for optimized algorithm frameworks, and continuous updates.

The NGC website provides images of different versions of the current mainstream deep learning frameworks (such as Caffe, Caffe2, CNTK, MxNet, TensorFlow, Theano, and Torch). You can select the desired image to build the environment. By taking the TensorFlow deep learning framework for example, this article describes how to build an NGC environment on gn5 instances.

Before building a TensorFlow environment, you must do the following:

  • Sign up with Alibaba Cloud and finish real-name registration.
  • Log on to the NGC website and create your NGC account.
  • Log on to the NGC website, get the NGC API Key and save it locally. The NGC API Key will be verified when you log on to the NGC container environment.

Procedure

  1. Create a gn5 instance by referring to create an ECS instance. Pay attention to the following configurations:
    • Region: Only China (Qingdao), China (Beijing), China (Hohhot), China (Hangzhou), China (Shanghai), China (Shenzhen), China (Hong Kong), Singapore, Australia (Sydney), US (Silicon Valley), US (Virginia), and Germany (Frankfurt) are available.
    • Instance: Select a gn5 instance type.
    • Image: Select Marketplace Image. In the displayed dialog box, search for NVIDIA GPU Cloud VM Image, and then click Continue.
    • Network Billing Method: Select Assign Public IP.
      Note If you do not assign a public IP address here, you can bind an EIP address after the instance is created successfully.
    • Security Group: Select a security group. Access to TCP port 22 must be allowed in the security group. If your instance needs to support HTTPS or DIGITS 6, access to TCP port 443 (for HTTPS) or TCP port 5000 (for DIGITS 6) must be allowed.

    After the ECS instance is created successfully, log on to the ECS console and note down the public IP address of the instance.

  2. Connect to the ECS instance: Based on the logon credentials selected during instance creation, you can connect to an ECS instance by using a password or connect to an ECS instance by using an SSH key pair.
  3. Enter the NGC API Key obtained from the NGC website, and then press the Enter key to log on to the NGC container environment.


  4. Run nvidia-smi. You can view the information about the current GPU, including the GPU model, the driver version, and more, as shown below.


  5. Follow the steps below to build the TensorFlow environment:
    1. Log on to the NGC website, go to the TensorFlow image page, and then get the docker pull command.


    2. Download the TensorFlow image.
      docker pull nvcr.io/nvidia/tensorflow:18.03-py3
    3. View the downloaded image.
      docker image ls
    4. Run the container to deploy the TensorFlow development environment.
      nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:18.03-py3


  6. Test TensorFlow by using one of the following methods:
    • Simple test of TensorFlow.
      $python       
      >>> import tensorflow as tf
      >>> hello = tf.constant('Hello, TensorFlow!')
      >>> sess = tf.Session()
      >>> sess.run(hello)  
      If TensorFlow loads the GPU device correctly, the result is as shown below.

    • Download the TensorFlow model and test TensorFlow.
      git clone https://github.com/tensorflow/models.git
      cd models/tutorials/image/alexnet
      python alexnet_benchmark.py --batch_size 128 --num_batches 100
      The running status is as shown below.

  7. Save the changes made to the TensorFlow image. Otherwise, the configuration will be lost the next time you log on.