This topic describes how to deploy an NVIDIA GPU Cloud (NGC) environment on a GPU-accelerated instance. The example uses the TensorFlow deep learning framework.
NGC is a deep learning ecosystem that is developed by NVIDIA. NGC allows developers to access software stacks for free and use the stacks to build development environments for deep learning.
Alibaba Cloud provides instances of the gn5 instance family that are configured with NGC. Alibaba Cloud also provides NGC container images that are optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. With these images, developers can quickly deploy NGC container environments, access optimized deep learning frameworks, and preinstall development environments, which shortens the time needed to develop and deploy services. The NGC container images also support optimized algorithm frameworks and real-time updates.
The NGC website provides various image versions for mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. You can select an image based on your business requirements to deploy an environment.
NGC environments can be deployed on instances of the following instance families:
- gn4, gn5, gn5i, gn6v, gn6i, and gn6e
- ebmgn5i, ebmgn6i, ebmgn6v, and ebmgn6e
The following section describes how to create a GPU-accelerated instance and deploy an NGC environment on the instance. In this example, a GPU-accelerated instance of the gn5 instance family is used.
- Create a GPU-accelerated instance of the gn5 instance family. For more information, see Create an instance by using the wizard. When you configure parameters for the instance, take note of the following items:
- Region: Select only one of the following regions: China (Qingdao), China (Beijing), China (Hohhot), China (Hangzhou), China (Shanghai), China (Shenzhen), China (Hong Kong), Singapore (Singapore), Australia (Sydney), US (Silicon Valley), US (Virginia), and Germany (Frankfurt).
- Instance: Select an instance of the gn5 instance family.
- Image: Click Marketplace Image and click Select from Alibaba Cloud Marketplace (including operating system). In the dialog box that appears, find NVIDIA GPU Cloud Virtual Machine Image and click Use.
- Public bandwidth: Select Assign Public IPv4 Address.
Note: If you do not select Assign Public IPv4 Address, you must associate an elastic IP address (EIP) with the instance after the instance is created.
- Security Group: Select a security group that allows inbound traffic on TCP port 22, which is used for SSH logon. If the instance must support HTTPS or Deep Learning GPU Training System (DIGITS) 6, also allow TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
After the GPU-accelerated instance is created, log on to the ECS console to obtain the public IP address of the instance.
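Once the public IP address is known, the security group rules can be spot-checked from a local machine. The following is a minimal sketch, not part of the official procedure. The function names are illustrative, and it assumes that a service is already listening on each port: a port is reported as unreachable both when the security group blocks it and when nothing is listening on it.

```python
import socket

# Ports named in the security group step: SSH, HTTPS, and DIGITS 6.
PORTS = {22: "SSH", 443: "HTTPS", 5000: "DIGITS 6"}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports=PORTS) -> dict:
    """Map each port number to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, `check_ports("203.0.113.10")` returns a dictionary keyed by port number. The address shown here is a documentation placeholder; replace it with the public IP address of your instance.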
- Connect to the GPU-accelerated instance. Use the logon credential that you selected when you created the instance:
- Connect to the GPU-accelerated instance by using a password. For more information, see Connect to a Linux instance by using a password.
- Connect to the GPU-accelerated instance by using an SSH key pair. For more information, see Connect to a Linux instance by using an SSH key pair.
- Follow the on-screen instructions to enter the NGC API Key that you obtained from the NGC website. Then, press the Enter key to log on to the NGC container environment.
- Run the nvidia-smi command. You can view information about the GPU that the instance uses, such as the GPU model and the driver version. The following figure shows the information about the GPU.
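The same details can also be read from a script. The sketch below assumes that nvidia-smi is on the PATH and uses its `--query-gpu` and `--format=csv,noheader` options. The helper names and the choice of fields are illustrative, not part of the original procedure.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; the available names are listed by
# `nvidia-smi --help-query-gpu`.
FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text: str, fields=FIELDS) -> list:
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        # Skip lines that do not have exactly one value per field.
        if len(values) == len(fields):
            gpus.append(dict(zip(fields, values)))
    return gpus


def query_gpus() -> list:
    """Run nvidia-smi and return parsed rows, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

An empty list means that nvidia-smi was not found or returned an error, which usually points to a driver problem rather than a problem with the script.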
- Build a TensorFlow deep learning framework.
- Log on to the NGC website. In the left-side navigation pane, click Containers.
- On the Containers page, enter TensorFlow in the search box and click the TensorFlow card.
- On the TensorFlow page, click Copy Image Path to copy the path of the TensorFlow image version that you want to download. For example, for the tensorflow:18.03 image, the path shown on the page is nvcr.io/nvidia/tensorflow:18.03-py3.
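The path is a standard Docker image reference of the form registry/namespace/name:tag, and it is the argument that `docker pull` takes to download the image. The sketch below splits the example path into its parts and builds the pull command. The helper names are illustrative, and the parser assumes a fully qualified path that includes the registry host.

```python
IMAGE = "nvcr.io/nvidia/tensorflow:18.03-py3"  # example path from this topic


def split_image_path(path: str) -> dict:
    """Split 'registry/namespace/name:tag' into registry, name, and tag."""
    repository, _, tag = path.rpartition(":")
    if not repository or "/" in tag:
        # No tag given; a colon, if present, belonged to a registry port.
        # Docker falls back to the 'latest' tag in that case.
        repository, tag = path, "latest"
    registry, _, name = repository.partition("/")
    return {"registry": registry, "name": name, "tag": tag}


def pull_command(path: str) -> list:
    """Build the argument list for downloading the image with Docker."""
    return ["docker", "pull", path]


if __name__ == "__main__":
    print(split_image_path(IMAGE))
    print(" ".join(pull_command(IMAGE)))
```

Running the printed command on the instance downloads the image. The argument list can also be passed to `subprocess.run` if the download is scripted.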
- View the downloaded image.
docker image ls
- Run the container to deploy the TensorFlow development environment.
nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:18.03-py3
- Use one of the following methods to test TensorFlow:
- Test TensorFlow in a simple manner.
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
sess.run(hello)
If TensorFlow loads the GPU as expected, the returned information is similar to the command output in the following figure.
- Download the TensorFlow model and test TensorFlow.
git clone https://github.com/tensorflow/models.git
cd models/tutorials/image/alexnet
python alexnet_benchmark.py --batch_size 128 --num_batches 100
The following figure shows an example of the running status of TensorFlow.
- Save the changes that you made to the TensorFlow image. If you do not save them, the changes are lost the next time you log on to the instance.
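One common way to save such changes is `docker commit`, which writes the current state of a container to an image. The sketch below only builds the command. The container ID and the target image name are hypothetical placeholders, and the helper name is illustrative. Because the container in this example is started with `--rm`, it is removed when it exits, so the commit must be run from a second session while the container is still running.

```python
import shlex


def commit_command(container_id: str, image: str,
                   message: str = "save TensorFlow changes") -> list:
    """Build a `docker commit` argument list for a running container."""
    return ["docker", "commit", "-m", message, container_id, image]


if __name__ == "__main__":
    # Both values below are placeholders: take the container ID from the
    # output of `docker ps`, and choose your own image name and tag.
    cmd = commit_command("<container_id>", "my-tensorflow:18.03-py3")
    print(shlex.join(cmd))
```

Starting later containers from the committed image instead of the original one keeps the saved changes.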