How to deploy an NGC environment on a GPU-accelerated instance - Elastic GPU Service

This topic describes how to deploy an NVIDIA GPU Cloud (NGC) environment on a GPU-accelerated instance. In this example, a TensorFlow deep learning framework is used.

Prerequisites

An NGC account is created from the NGC website.
The NGC API key is obtained from the NGC website and saved to your computer.
Note
When you log on to an NGC container environment, the system verifies your NGC API key.
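The logon prompt handles key verification for you. If you ever need to authenticate Docker against the NGC registry by hand, the following is a minimal sketch, assuming the registry host is nvcr.io and that the key is passed as the password for the literal user name `$oauthtoken`. The helper names `docker_login_command` and `login_to_ngc` are hypothetical and not part of any NGC tool.

```python
import subprocess

NGC_REGISTRY = "nvcr.io"
# Literal user name that the NGC registry expects when the password is an API key.
NGC_USERNAME = "$oauthtoken"


def docker_login_command(registry=NGC_REGISTRY):
    """Build a docker login command that reads the password from stdin."""
    return ["docker", "login", "--username", NGC_USERNAME,
            "--password-stdin", registry]


def login_to_ngc(api_key, registry=NGC_REGISTRY):
    """Run docker login; the key is piped in so it does not appear in argv."""
    return subprocess.run(
        docker_login_command(registry),
        input=api_key.encode(),
        check=True,
    )
```

For example, `login_to_ngc(os.environ["NGC_API_KEY"])` would read the key from an environment variable; the variable name `NGC_API_KEY` is an assumption made for this sketch.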

Background information

NGC is a deep learning ecosystem that is developed by NVIDIA. NGC allows developers to access software stacks for free and use the stacks to build development environments for deep learning.
Alibaba Cloud provides instances of the gn5 instance family that are configured with NGC. Alibaba Cloud Marketplace also provides NGC container images that are optimized for NVIDIA Pascal GPUs. With these images, developers can quickly deploy NGC container environments, access optimized deep learning frameworks, and pre-install development environments, which shortens the time needed to develop and deploy services. The NGC container images also support optimized algorithm frameworks and real-time updates.
The NGC website provides various image versions for mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. You can select an image based on your business requirements to deploy an environment.
You can deploy an NGC environment on an instance that belongs to one of the following instance families:
- gn4, gn5, gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s
- ebmgn5i, ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, and ebmgn7e
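The family list above can be turned into a quick pre-check before you create an instance. This is a minimal sketch, assuming instance type names follow the pattern `ecs.<family>[-<suffix>].<size>` (for example `ecs.gn5-c4g1.xlarge`); the function names are hypothetical.

```python
# Instance families listed in this topic as able to host an NGC environment.
SUPPORTED_FAMILIES = {
    "gn4", "gn5", "gn5i", "gn6v", "gn6i", "gn6e", "gn7i", "gn7e", "gn7s",
    "ebmgn5i", "ebmgn6i", "ebmgn6v", "ebmgn6e", "ebmgn7i", "ebmgn7e",
}


def instance_family(instance_type):
    """Extract the family from a type name such as 'ecs.gn5-c4g1.xlarge'."""
    parts = instance_type.split(".")
    if len(parts) < 3 or parts[0] != "ecs":
        raise ValueError(f"unexpected instance type: {instance_type!r}")
    # The family is the part of the second segment before any '-' suffix.
    return parts[1].split("-")[0]


def supports_ngc(instance_type):
    """Return True if the instance type belongs to a listed family."""
    return instance_family(instance_type) in SUPPORTED_FAMILIES
```

The exact-match lookup keeps similar names apart: `gn5` and `gn5i` are treated as different families, as they are in the list.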

Procedure

The following section describes how to create a GPU-accelerated instance and deploy an NGC environment on the instance. In this example, a GPU-accelerated instance of the gn5 instance family is used.

Create a GPU-accelerated instance of the gn5 instance family.

For more information, see Create an instance on the Custom Launch tab. When you configure parameters for the instance, take note of the following items:

Parameter	Description
Region	Select one of the following regions: China (Qingdao), China (Beijing), China (Hohhot), China (Hangzhou), China (Shanghai), China (Shenzhen), China (Hong Kong), Singapore, US (Silicon Valley), US (Virginia), and Germany (Frankfurt).
Instance	Select an instance of the gn5 instance family.
Image	Click Marketplace Image and click Select from Alibaba Cloud Marketplace (including operating system). In the dialog box that appears, find NVIDIA GPU Cloud Virtual Machine Image. Click Use.
Public IP Address	Select Assign Public IPv4 Address. Note: If you do not select Assign Public IPv4 Address, you can bind an elastic IP address (EIP) to the instance after the instance is created. For more information, see Associate or disassociate an EIP.
Security Group	Select a security group. You must enable TCP port 22 of the security group. If you need your instance to support HTTPS or Deep Learning GPU Training System (DIGITS) 6, you must enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
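
After the instance is running, you may want to confirm from your own computer that the ports named in the table are reachable. The following is a minimal sketch using only the Python standard library; the function names are hypothetical. A port reported as closed can mean either that the security group blocks it or that no service is listening on it yet.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_instance_ports(host, ports=(22, 443, 5000)):
    """Check SSH (22), HTTPS (443), and DIGITS 6 (5000) on the given host."""
    return {port: tcp_port_open(host, port) for port in ports}
```

Call `check_instance_ports` with the public IP address of your instance to get a dictionary that maps each port to a reachability flag.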

After the GPU-accelerated instance is created, log on to the ECS console to obtain the public IP address of the instance.
Connect to the GPU-accelerated instance.
Connect by using the logon credential that you selected when you created the instance:
- Connect to the GPU-accelerated instance by using a password. For more information, see Connect to a Linux instance by using a username and password.
- Connect to the GPU-accelerated instance by using an SSH key pair. For more information, see Connect to a Linux instance by using an SSH key pair.
Enter the NGC API key that you obtained from the NGC website, and then press the Enter key to log on to the NGC container environment.
Run the nvidia-smi command.
You can view information about the GPU that the instance uses, such as the GPU model and the driver version. The following figure shows the information about the GPU.
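
If you want the GPU model and driver version in a script rather than in the table that nvidia-smi prints, the tool's CSV query mode can be parsed. This is a minimal sketch that relies on the `--query-gpu` and `--format=csv,noheader` options of nvidia-smi; the function names are hypothetical.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: "<name>, <driver_version>".
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver = [field.strip() for field in line.split(",", 1)]
        gpus.append({"name": name, "driver_version": driver})
    return gpus


def query_gpus():
    """Run nvidia-smi on this machine and return the parsed GPU list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

The parser is kept separate from the subprocess call so that it can be checked against a saved output line without a GPU present.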
Build a TensorFlow deep learning framework.
1. Log on to the NGC website. In the left-side navigation pane, click Containers.
2. On the Containers page, enter TensorFlow in the search box and click the TensorFlow card.
3. On the TensorFlow page, click Copy Image Path to copy the path of the TensorFlow image version that you want to use, and then run docker pull with that path to download the image.
  For example, to download the tensorflow:18.03 image, copy the path nvcr.io/nvidia/tensorflow:18.03-py3 from the page and run docker pull nvcr.io/nvidia/tensorflow:18.03-py3.
4. View the downloaded image.
```
docker image ls
```
5. Run the container to deploy the TensorFlow development environment.
```
nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:18.03-py3
```
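The image path copied from the NGC website drives both the download and the run steps above. The following is a minimal sketch of how a script might validate such a path and derive the two commands from it; the helper names are hypothetical, and the sketch only builds the argument lists without running them.

```python
def parse_image_path(image_path):
    """Split 'registry/repository:tag' into its three parts."""
    repository, _, tag = image_path.rpartition(":")
    # A missing tag, or a ':' that belongs to a registry port, is rejected.
    if not repository or "/" in tag:
        raise ValueError(f"image path has no tag: {image_path!r}")
    registry, _, name = repository.partition("/")
    return {"registry": registry, "repository": name, "tag": tag}


def pull_command(image_path):
    """Arguments for downloading the image."""
    return ["docker", "pull", image_path]


def run_command(image_path):
    """Arguments for starting an interactive, auto-removed container."""
    return ["nvidia-docker", "run", "--rm", "-it", image_path]
```

Either list can be passed to `subprocess.run` on the instance once the path has been checked.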
Run the following commands to test TensorFlow:
```
python
```
```
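# TensorFlow 1.x API, which is what the 18.03 image ships.
import tensorflow as tf
# Build a constant string tensor in the default graph.
hello = tf.constant('Hello, TensorFlow!')
# Creating the session initializes the available devices, including the GPU.
sess = tf.Session()
# Evaluate the tensor; the interactive prompt echoes the returned value.
sess.run(hello)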
```
If TensorFlow loads the GPU correctly, the result is as shown in the following figure.
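
To check GPU visibility without reading the session log, you can list the devices that TensorFlow detects. This is a minimal sketch, assuming it runs inside the TensorFlow container; it uses `device_lib.list_local_devices()` from `tensorflow.python.client`, and the two wrapper function names are hypothetical.

```python
def gpu_device_names(devices):
    """Filter (name, device_type) pairs down to the names of GPU devices."""
    return [name for name, device_type in devices if device_type == "GPU"]


def list_tensorflow_gpus():
    """Return the names of the GPUs that TensorFlow can see."""
    # Imported lazily so the module still loads where TensorFlow is absent.
    from tensorflow.python.client import device_lib
    devices = [(d.name, d.device_type) for d in device_lib.list_local_devices()]
    return gpu_device_names(devices)
```

An empty list from `list_tensorflow_gpus()` suggests that TensorFlow is running on the CPU only.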
Run the following command to save the settings that you configured for the TensorFlow image.
```
# Run the docker ps command to view the CONTAINER_ID.
docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:18.03-py3
```
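Important
If you do not commit the container, the changes are lost the next time you log on. Because the container in this example is started with --rm, it is removed when it exits, so run docker commit from a second session while the container is still running.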
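
Looking up the CONTAINER_ID can also be scripted. The following is a minimal sketch that parses the output of `docker ps --format "{{.ID}} {{.Image}}"` and builds the matching commit command; the helper names are hypothetical, and the sketch does not run Docker itself.

```python
# docker ps with a Go template that prints "<container id> <image>" per line.
PS_COMMAND = ["docker", "ps", "--format", "{{.ID}} {{.Image}}"]


def container_ids_for_image(ps_output, image_path):
    """Return the IDs of running containers that were started from image_path."""
    ids = []
    for line in ps_output.splitlines():
        container_id, _, image = line.strip().partition(" ")
        if image == image_path:
            ids.append(container_id)
    return ids


def commit_command(container_id, image_path, message="commit docker"):
    """Arguments for saving the container state back to the image tag."""
    return ["docker", "commit", "-m", message, container_id, image_path]
```

If more than one container matches, choose the ID explicitly instead of committing the first entry.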