Experience RAPIDS Data Science Acceleration in Alibaba Cloud Container Service

This article describes how to use RAPIDS, a GPU-accelerated library launched by NVIDIA, to perform an accelerated image search on Alibaba Cloud Container Service.

By Biran


Algorithms, data, and computing power are the three major components of AI. Without computing power, AI cannot be implemented. The computing power of the NVIDIA GPU makes it the top choice for accelerating AI model training, but it is expensive. The solution composed of Alibaba Cloud Container Service, Elastic Container Instance (ECI), and Arena provides an easy, effective, and cost-effective way to take advantage of the computing power of NVIDIA GPU.

Deep learning is the first thing that comes to mind when the NVIDIA GPU is mentioned, whereas traditional machine learning and data analysis methods rarely use GPUs. However, NVIDIA has an excellent project called RAPIDS, a GPU-accelerated library for data science and machine learning. For more information about RAPIDS, visit the official RAPIDS website. This project aims to bring GPU acceleration to traditional algorithms while providing the same operations and user experience as Pandas and scikit-learn. RAPIDS has three modules: cuDF is equivalent to Pandas, cuML is equivalent to scikit-learn, and cuGraph is used to process graph data. Because RAPIDS is highly compatible with these libraries, you can combine it with deep learning frameworks: use cuDF to accelerate data processing on a GPU, and then use a deep learning framework such as TensorFlow or PyTorch to perform the modeling tasks.


This article shows how to use TensorFlow and RAPIDS to perform search by image on Alibaba Cloud Container Service and how to use ECI to apply for GPU resources. GPU resources are prepared in seconds and released when no longer needed. This means that there is no need to prepare GPU instances in advance or to deal with the Kubernetes infrastructure. Arena commands are all you need to build and run a GPU-enabled RAPIDS environment and to manage the GPU infrastructure.


Step 1: Prepare the Cluster

To begin with, prepare a managed Kubernetes cluster. In a managed Kubernetes cluster, the control plane nodes run on Alibaba Cloud resources, and Alibaba Cloud bears their operations and maintenance (O&M) costs. A virtual Kubelet node is also created for this cluster.

If you have already created a Kubernetes cluster in Container Service, select a managed Kubernetes cluster.

Because the system component containers need to run somewhere, the cluster must contain at least one worker node.

1) For more information on installing virtual nodes, see Virtual Nodes Documentation.

2) Configure virtual-kubelet-autoscaler. If the GPU resources in the cluster are insufficient, virtual-kubelet-autoscaler scales out by launching ECI container groups that use GPUs.

Step 2: Run Arena to Create the RAPIDS Service

1) Install Arena using the commands below.

$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.3.0-b556a36-linux-amd64.tar.gz
$ tar -xvf arena*.tar.gz
$ cd arena-installer
$ ./install.sh

2) Now, run an Arena command to view the cluster's GPU resources. As the following output shows, this cluster has one real node, which has no GPU resources. In addition, there is a virtual node, which does not physically exist and therefore is not billed. This node provides a large, elastically scalable pool of GPU resources (reported as 1000 GPUs in the output).

$ arena top node
NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-shanghai.  <none>  ready   0           0
virtual-kubelet      agent   ready   1000        0
Allocated/Total GPUs In Cluster:
0/1000 (0%)

3) Before submitting the RAPIDS task, complete the necessary preparation to accelerate the creation process and simplify access operations.

3.1 Set the access method to LoadBalancer. Note that this method is used only for the sake of simplicity. In a production environment, we recommend that you do not allow access from external IP addresses.

$ find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"

3.2 Next, accelerate the startup speed using the step below.

3.2.1 GPU container images are usually very large. For example, the RAPIDS container image used in this experiment is 14.7 GB in size. With an image of this size, startup generally takes about 10 minutes. However, the image caching function can reduce this time to about 20 seconds.

$ docker images | grep rapids
registry.cn-shanghai.aliyuncs.com/tensorflow-samples/rapids-samples                0.8.2-cuda10.0-runtime-ubuntu16.04   4597a0334d41        12 days ago         14.7GB

3.2.2 In serverless Kubernetes, simply create an ImageCache custom resource (an instance of the ImageCache CRD) to use the image caching function.

$ cat > imagecache.yaml << EOF
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: imagecache-rapids
spec:
  images:
  - registry.cn-shanghai.aliyuncs.com/tensorflow-samples/rapids-samples:0.8.2-cuda10.0-runtime-ubuntu16.04
EOF

$ kubectl create -f imagecache.yaml

3.2.3 Wait a moment after submission, and then check the status of the ImageCache. The CACHEID is used as the snapshot ID when you submit the task in a later step.

$ kubectl get imagecache
NAME                AGE    CACHEID                    PHASE   PROGRESS
imagecache-rapids   3d9h   imc-uf6dxdji7txxxxx        Ready   100%

4) Submit the RAPIDS development environment as shown below.

$ arena serve custom \
     --name=rapids \
     --selector=type=virtual-kubelet \
     --toleration=all \
     --annotation=k8s.aliyun.com/eci-image-snapshot-id=imc-uf6dxdji7txxxxx \
     --annotation=k8s.aliyun.com/eci-instance-type=ecs.gn5i-c8g1.2xlarge \
     --gpus=1 \
     -e=PASSWORD=mypassw0rd \
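     --restful-port=80 \
     --image=registry.cn-shanghai.aliyuncs.com/tensorflow-samples/rapids-samples:0.8.2-cuda10.0-runtime-ubuntu16.04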
configmap/rapids-201912011815-custom-serving created
configmap/rapids-201912011815-custom-serving labeled
service/rapids-201912011815 created
deployment.extensions/rapids-201912011815-custom-serving created

Let's take a quick look at the options used in the preceding command.

  • --selector=type=virtual-kubelet: It indicates that the pod is started by using a virtual node.
  • --annotation=k8s.aliyun.com/eci-instance-type=ecs.gn5i-c8g1.2xlarge: It specifies the ECI container group type.
  • ecs.gn5i-c8g1.2xlarge: It refers to the Alibaba Cloud instance type equipped with an NVIDIA P4 GPU. For more information about the detailed specifications, see the relevant document.
  • --annotation=k8s.aliyun.com/eci-image-snapshot-id=imc-uf6dxdji7txxxxx: It specifies the image cache to use, identified by the CACHEID from Step 3.2.3.
  • -e=PASSWORD=mypassw0rd: It sets the PASSWORD environment variable, which is the password used to access the RAPIDS notebook.
  • --gpus=1: It indicates the number of GPUs applied for.
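
The way these options fit together can be sketched in Python. This is only an illustration: build_arena_command and its parameter names are hypothetical helpers invented for this example, not part of Arena, and the sketch covers only the options described in the list above. Any other flags would have to be passed through the hypothetical extra_args parameter.

```python
import shlex


def build_arena_command(name, instance_type, snapshot_id, gpus=1,
                        password=None, port=80, extra_args=()):
    """Assemble an `arena serve custom` argument list (hypothetical helper)."""
    args = [
        "arena", "serve", "custom",
        f"--name={name}",
        # Start the pod on the virtual node.
        "--selector=type=virtual-kubelet",
        "--toleration=all",
        # Reuse the cached image identified by its CACHEID.
        f"--annotation=k8s.aliyun.com/eci-image-snapshot-id={snapshot_id}",
        # ECI container group type.
        f"--annotation=k8s.aliyun.com/eci-instance-type={instance_type}",
        # Number of GPUs applied for.
        f"--gpus={gpus}",
    ]
    if password is not None:
        # Password for the RAPIDS notebook.
        args.append(f"-e=PASSWORD={password}")
    args.append(f"--restful-port={port}")
    args.extend(extra_args)
    return args


command = build_arena_command(
    name="rapids",
    instance_type="ecs.gn5i-c8g1.2xlarge",
    snapshot_id="imc-uf6dxdji7txxxxx",
    gpus=1,
    password="mypassw0rd",
)
# Print a shell-safe version of the command for review before running it.
print(shlex.join(command))
```

Keeping the arguments in a list makes it easy to change the instance type or GPU count without editing a long shell line by hand.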

5) Now, view the access address, which is a combination of ENDPOINT_ADDRESS and PORTS. The output also shows that the task switches to the Running state within 32 seconds.

$ arena serve list
rapids  CUSTOM  201911181827  1        1      restful:80

$ arena serve get rapids
NAME:             rapids
NAMESPACE:        default
VERSION:          201912011815
DESIRED:          1
AVAILABLE:        1
ENDPOINT PORTS:   restful:80
AGE:              32s

INSTANCE                                           STATUS   AGE  READY  RESTARTS  NODE
rapids-201912011815-custom-serving-6b54d5cd-swcwz  Running  32s  1/1    0         N/A

6) Check the GPU usage of the cluster again. Note that the GPU resources are already being used.

$ arena top node
NAME                       IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-shanghai.  <none>  ready   0           0
virtual-kubelet      agent   ready   1000        1
Allocated/Total GPUs In Cluster:
1/1000 (0%)

7) To query the pods that use this GPU, append "-d" to the original command to view specific pod names.

$ arena top node -d

NAME:       cn-shanghai.
ROLE:       <none>

Total GPUs In Node cn-shanghai.      0
Allocated GPUs In Node cn-shanghai.  0 (0%)

NAME:       virtual-kubelet
ROLE:       agent

NAMESPACE  NAME                                                GPU REQUESTS
default    rapids-201912011815-custom-serving-6b54d5cd-swcwz  1

Total GPUs In Node virtual-kubelet:      1000
Allocated GPUs In Node virtual-kubelet:  1 (0%)

Allocated/Total GPUs In Cluster:  1/1000 (0%)

8) Use the access address and port from Step 5 to open the notebook in a local browser. Enter http://{ENDPOINT ADDRESS}:{ENDPOINT PORT}.

Note: We recommend using the Chrome browser.

9) Enter the logon password set in the preceding command and then click on the Log in button. In this example, the password is mypassw0rd.


Step 3: Perform the Image Search Demo

1) Go to the cuml directory where the demo is located.
2) Double-click the cuml_knn.ipynb file.
3) Click the start icon.

Note: Click once to run one cell. Click until the demo is fully executed. For more detailed instructions, refer to the following section on Demo execution process.


Demo Execution Process

The image search demo consists of the following three steps. The results compare the performance of GPU-accelerated RAPIDS cuML KNN with that of scikit-learn KNN, which uses only CPUs.

1) Process the Dataset

1.1) Download and Decompress the Dataset

This demo uses the STL-10 dataset, which contains 100,000 unlabeled images with dimensions of 96 x 96 x 3. You can use other datasets to extract image features, but make sure that all images in a dataset are the same size.

Use the download_and_extract(data_dir) method to download and decompress the STL-10 dataset. In the RAPIDS image, the dataset has already been downloaded to the ./data directory, so calling the download_and_extract() method only decompresses it.


1.2) Read the Images.

Data decompressed from the dataset is binary. Use the read_all_images(path_to_data) method to load and convert the data to the NHWC format (batch, height, width, and channels). This format allows TensorFlow to extract image features.
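
To make the layout conversion concrete, here is a minimal pure-Python sketch for a single image. It assumes the published STL-10 binary layout (one channel at a time, each channel stored in column-major order). The to_hwc helper is hypothetical and is not the demo's read_all_images method, which is not shown in this article and would more likely use NumPy reshape and transpose operations.

```python
def to_hwc(flat, height, width, channels=3):
    """Convert one channel-first, column-major buffer to nested [h][w][c] lists."""
    if len(flat) != height * width * channels:
        raise ValueError("buffer size does not match the image dimensions")
    plane = height * width  # number of values in one channel
    return [
        [
            # Column-major: values of one column are contiguous, so the
            # offset inside a channel is w * height + h.
            [flat[c * plane + w * height + h] for c in range(channels)]
            for w in range(width)
        ]
        for h in range(height)
    ]


# Tiny synthetic "image": 2 rows x 3 columns x 3 channels.
height, width, channels = 2, 3, 3
flat = bytes(range(height * width * channels))
image = to_hwc(flat, height, width, channels)
print(image[0][0])  # [0, 6, 12]

# Size of one real STL-10 image and of the whole unlabeled set, in bytes.
bytes_per_image = 96 * 96 * 3
total_bytes = 100_000 * bytes_per_image
print(bytes_per_image, total_bytes)  # 27648 2764800000
```

Stacking the converted images along a leading batch axis gives the NHWC shape (N, 96, 96, 3).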


1.3) Display an Image.

Use the show_image(image) method to display a random image from the dataset.


1.4) Split the Dataset.

Split the dataset into two parts at a 9:1 ratio. One part is used to create image index libraries, and the other is used to search for images.
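
A 9:1 split can be sketched as follows. The split_dataset helper, the shuffle, and the fixed seed are assumptions made for illustration. The article does not say whether the demo shuffles the images before splitting.

```python
import random


def split_dataset(items, index_parts=9, query_parts=1, seed=0):
    """Split items into an index set and a query set at index_parts:query_parts."""
    shuffled = list(items)
    # Assumption: shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * index_parts // (index_parts + query_parts)
    return shuffled[:cut], shuffled[cut:]


# Image IDs stand in for the 100,000 STL-10 images.
index_set, query_set = split_dataset(range(100_000))
print(len(index_set), len(query_set))  # 90000 10000
```

The larger part feeds the image index library, and the smaller part supplies the query images.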


2) Extract Image Features

Use TensorFlow and Keras to extract image features. Use the ResNet50 (notop) model, which is pre-trained on the ImageNet dataset.

2.1) Set TensorFlow Parameters.

By default, TensorFlow allocates all available GPU memory. Because cuML also needs GPU memory, reserve some for it. Use either of the following methods to set the GPU memory parameters:

  • Method 1: Allocate the memory according to operational requirements.
config.gpu_options.allow_growth = True
  • Method 2: Set a ratio to determine the amount of memory that is used by TensorFlow.

This demo uses method 2 and sets the ratio to 0.3. This indicates that TensorFlow may use 30% of the GPU memory. Adjust the ratio as needed.

config.gpu_options.per_process_gpu_memory_fraction = 0.3


2.2 Download the Pre-trained Model ResNet50 (notop)

TensorFlow connects to the public network to download the model, which is about 91 MB in size. The model is saved to the /root/.keras/models/ directory.

Parameter     Description
weights       Valid values:
              - None: Initializes the weights to random values.
              - imagenet: Initializes the weights to those of the model pre-trained on ImageNet.
              In this demo, this parameter is set to imagenet.
include_top   Valid values:
              - True: Includes the last fully connected layer of the ResNet50 network.
              - False: Excludes the last fully connected layer of the ResNet50 network.
              This demo uses ResNet50 to extract image features rather than to classify images. Therefore, set this parameter to False.
input_shape   Optional. Specifies the input shape of the images. It takes effect only when the include_top parameter is set to False.
              The images must have three input channels, and their width and height must be at least 32 pixels. In this demo, this parameter is set to (96, 96, 3).
pooling       The pooling mode applied to the output when the include_top parameter is set to False. Valid values:
              - None: Outputs a 4D tensor.
              - avg: Applies global average pooling and outputs a 2D tensor.
              - max: Applies global max pooling and outputs a 2D tensor.
              In this demo, this parameter is set to max.
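
The effect of the pooling parameter can be illustrated without TensorFlow. The sketch below is a toy, pure-Python version of global max pooling and is not the Keras implementation. The global_max_pool helper and the sample values are invented for this example. It shows how a 4D batch of feature maps (N, H, W, C) collapses to a 2D array (N, C).

```python
def global_max_pool(feature_maps):
    """Reduce nested [N][H][W][C] lists to [N][C] by taking the max over H and W."""
    pooled = []
    for fmap in feature_maps:
        channels = len(fmap[0][0])
        pooled.append([
            max(pixel[c] for row in fmap for pixel in row)
            for c in range(channels)
        ])
    return pooled


# Two tiny 2 x 2 feature maps with 3 channels each.
batch = [
    [[[1, 5, 0], [2, 4, 9]],
     [[3, 3, 1], [0, 8, 2]]],
    [[[7, 0, 0], [1, 1, 1]],
     [[2, 6, 3], [4, 2, 5]]],
]
features = global_max_pool(batch)
print(features)  # [[3, 8, 9], [7, 6, 5]]
```

For a 96 x 96 input, ResNet50 without its top layer should produce a 3 x 3 x 2048 feature map per image (the network downsamples by a factor of 32), so pooling yields one 2048-dimensional feature vector per image.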


Now, run the model.summary() method to view the network structure of your model.


2.3 Extract Image Features

Call the model.predict() method on the split datasets to extract image features.


3) Search for Similar Images

3.1 Use cuML KNN to Search for Similar Images.

Set K to 3 (k=3) to search for the three most similar images. Adjust the value of K as needed. Use the knn_cuml.fit() method to build the index, and use the knn_cuml.kneighbors() method to search for neighbors.
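
To show what the fit and kneighbors calls compute, here is a brute-force, CPU-only sketch in plain Python (3.8 or later, for math.dist). It is not cuML or scikit-learn code. The BruteForceKNN class is a hypothetical stand-in that mimics the fit/kneighbors pattern, and the Euclidean metric is an assumption.

```python
import math


class BruteForceKNN:
    """Toy nearest-neighbor search that mimics the fit/kneighbors pattern."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self._index = []

    def fit(self, vectors):
        # "Building the index" here just stores a copy of the feature vectors.
        self._index = [list(v) for v in vectors]
        return self

    def kneighbors(self, queries):
        distances, indices = [], []
        for query in queries:
            # Score every indexed vector, then keep the K closest ones.
            scored = sorted(
                (math.dist(query, vec), i) for i, vec in enumerate(self._index)
            )[: self.n_neighbors]
            distances.append([d for d, _ in scored])
            indices.append([i for _, i in scored])
        return distances, indices


index_vectors = [[0, 0], [1, 0], [0, 2], [5, 5]]
query_vectors = [[0, 0.1], [4, 4]]

knn = BruteForceKNN(n_neighbors=3).fit(index_vectors)
distances, indices = knn.kneighbors(query_vectors)
print(indices)  # [[0, 1, 2], [3, 2, 1]]
```

Each query row holds K distances in ascending order and the K matching index positions, so both outputs have the shape (number of queries, K).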


cuML KNN takes 791 milliseconds to retrieve the vectors.

3.2 Use scikit-learn KNN to Search for Similar Images

Set K to 3 (n_neighbors=3), and specify n_jobs=-1 to use all the CPUs to search for the nearest neighbors.

Note: The ecs.gn5i-c8g1.2xlarge instance type is configured with 8 vCPUs.


scikit-learn KNN takes 7 minutes and 34 seconds to retrieve the vectors.

Next, compare cuML KNN with scikit-learn KNN, starting with vector retrieval speed. GPU-accelerated cuML KNN takes only 791 milliseconds, whereas CPU-based scikit-learn KNN takes 7 minutes and 34 seconds (454 seconds). cuML KNN is therefore roughly 570 times faster than scikit-learn KNN.
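
The speedup follows directly from the two timings reported above. The short calculation below simply divides one by the other. The variable names are illustrative.

```python
gpu_seconds = 0.791          # cuML KNN: 791 milliseconds
cpu_seconds = 7 * 60 + 34    # scikit-learn KNN: 7 minutes and 34 seconds

speedup = cpu_seconds / gpu_seconds
print(f"{cpu_seconds} s / {gpu_seconds} s = {speedup:.0f}x")  # 454 s / 0.791 s = 574x
```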

Check whether the search results of cuML KNN and scikit-learn KNN are the same. Compare the following output arrays:

  • Distance: The K smallest distance values. K indicates the number of distance values. This demo searches 10,000 images. The value of K is 3. Therefore, distance.shape=(10000,3).
  • Indices: The corresponding image indices. indices.shape=(10000, 3).

The dataset used in this demo contains identical images. Identical images may have different indices. Therefore, you must use distances instead of indices to compare the results. A calculation deviation may exist. If the deviation of the three smallest distance values for the 10,000 images is smaller than 1 in both methods, the results are considered to be identical.
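
The distance-based check can be sketched as follows. The results_match helper and the sample values are hypothetical, and the tolerance of 1 comes from the rule stated above. The demo's own comparison code is not shown in this article and may differ.

```python
def results_match(dist_a, dist_b, tolerance=1.0):
    """Return True if every pair of corresponding distances differs by less than tolerance."""
    if len(dist_a) != len(dist_b):
        return False
    for row_a, row_b in zip(dist_a, dist_b):
        if len(row_a) != len(row_b):
            return False
        if any(abs(a - b) >= tolerance for a, b in zip(row_a, row_b)):
            return False
    return True


# Made-up distances for two query images with K = 3.
gpu_distances = [[0.0, 12.4, 15.1], [3.2, 9.8, 10.0]]
cpu_distances = [[0.0, 12.5, 15.0], [3.2, 9.8, 10.3]]
other_distances = [[0.0, 12.5, 15.0], [3.2, 9.8, 14.0]]

print(results_match(gpu_distances, cpu_distances))    # True
print(results_match(gpu_distances, other_distances))  # False
```

Comparing distances rather than indices means that two identical images with different indices do not cause a false mismatch.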


Image Search Results

This demo selects five random images from the 10,000 images and uses them to search for similar images. The search results are displayed in five rows and four columns.

The first column displays the original five images. The second, third, and fourth columns display similar images. From the second column to the fourth column, the similarity between the original and similar images decreases. The title of each similar image is the calculated distance. A larger value indicates a lower similarity.


Step 4: Clean Up

Execute the following commands to clean up.

$ arena serve delete rapids
service "rapids-201912011815" deleted
deployment.extensions "rapids-201912011815-custom-serving" deleted
configmap "rapids-201912011815-custom-serving" deleted
INFO[0000] The Serving job rapids with version 201912011815 has been deleted successfully


This article describes how to use Arena and Alibaba Cloud Serverless Kubernetes to try out the RAPIDS accelerated data science solution quickly, easily, and at a low cost.
