This topic describes how to create a Kubernetes cluster that uses NPU resources.

Prerequisites

You have activated Container Service and Resource Access Management (RAM).

Background information

Compared with CPUs, the main advantage of NPUs is fast data processing in complex algorithmic models. Unlike the von Neumann architecture used by CPUs, NPUs adopt a data-driven parallel computing architecture, which drastically increases computing power and reduces power consumption when processing data streams. This makes NPUs well suited to processing massive amounts of video and image data. Compared with CPUs, NPUs deliver 100 to 1,000 times higher processing speeds at much lower power consumption.

You can create Kubernetes clusters in Container Service and use Hanguang NPUs to run compute-intensive tasks such as machine learning and image processing. This allows you to quickly deploy workloads and dynamically scale resources to meet actual business needs.

Note For more information about Hanguang NPU, visit this website.

To demonstrate how to use NPU resources, this topic creates a Kubernetes cluster and adds worker nodes of the ecs.ebman1.26xlarge instance type.

Container Service performs the following operations to create a Kubernetes cluster:
  • Creates ECS instances, configures a public key to enable SSH logon from master nodes to other nodes, and configures the Kubernetes cluster through CloudInit.
  • Creates a security group that allows access to the VPC network over ICMP.
  • If you do not specify an existing VPC network, a new VPC network and VSwitch are created, and SNAT rules are created for the VSwitch.
  • Creates VPC routing rules.
  • Creates a NAT gateway and an Elastic IP address.
  • Creates a RAM user and grants it permissions to query, create, and delete ECS instances, permissions to add and delete cloud disks, and all permissions on SLB, CloudMonitor, VPC, Log Service, and NAS. The Kubernetes cluster dynamically creates SLB instances, cloud disks, and VPC routing rules based on your settings.
  • Creates an internal SLB instance and opens port 6443.
  • Creates a public SLB instance and opens ports 6443 and 8443 on it. If you choose to enable SSH logon when you create the cluster, port 22 is also opened. Otherwise, port 22 remains closed.

Limits

  • SLB instances that are created along with the cluster only support the pay-as-you-go billing method.
  • Kubernetes clusters only support VPC networks.
  • By default, each account has quotas on the number of cloud resources that can be created. You cannot create clusters if a quota is exceeded. Make sure that you have sufficient quotas before you create a cluster.

    To request a quota increase, submit a ticket.

    • You can create up to 5 clusters across all regions under an account. A cluster can contain up to 40 nodes. To create more clusters or nodes, submit a ticket.
      Note In a Kubernetes cluster, up to 48 route entries can be created per VPC. This means that a cluster can contain up to 48 nodes. To add more nodes, first submit a ticket to increase the quota of route entries.
    • You can create up to 100 security groups under an account.
    • You can create up to 60 pay-as-you-go SLB instances under an account.
    • You can create up to 20 Elastic IP addresses under an account.
  • The limits on ECS instances are as follows:
    • Only CentOS operating systems are supported.
    • The pay-as-you-go and subscription billing methods are supported.
    Note After an ECS instance is created, you can change its billing method from pay-as-you-go to subscription in the console. For more information, see Switch the billing method from pay-as-you-go to subscription.

Create a Kubernetes cluster with NPU resources

  1. Log on to the Container Service console.
  2. In the left-side navigation pane, choose Clusters > Clusters. In the upper-right corner, click Create Kubernetes Cluster.
  3. On the Select Cluster Template page that appears, select Heterogeneous Computing Cluster in the Other Clusters section and click Create. The Dedicated Kubernetes page appears.
    This example creates a dedicated heterogeneous computing cluster. You can also select Heterogeneous Computing Cluster in the Managed Clusters section to create a managed cluster.
    Note To create an NPU cluster, select ECS instance types with NPU capabilities to create worker nodes. For more information about other parameters, see Create an ACK cluster.
  4. Configure worker nodes. This example uses NPU nodes as worker nodes and selects the ecs.ebman1.26xlarge instance type.
    • To create new instances, you need to specify the instance family, instance type, and the number of worker nodes. In this example, two NPU nodes are created and the instance type is ecs.ebman1.26xlarge.
    • To add existing instances, you must create ECS instances with NPU capabilities in the selected region in advance. For more information, see Instance families.
  5. Set the other parameters and click Create Cluster to start the deployment.
    After the cluster is created, click Clusters > Nodes to go to the Nodes page.

    Select the target cluster from the Clusters drop-down list. Find the newly created node and click More > Details in the Actions column to view the NPU devices that are attached to the node.
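
Note You can also verify the NPU resources from the command line after you connect to the cluster. The following is a minimal sketch that uses standard kubectl commands; <node-name> is a placeholder for the name of an NPU node, and it is assumed that the node reports the aliyun.com/npu extended resource that is used later in this topic.
kubectl get nodes
kubectl describe node <node-name> | grep aliyun.com/npu
If the NPU devices are registered, aliyun.com/npu should appear in the Capacity and Allocatable sections of the output.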

Set up a private image pull secret

To use the Docker images with NPU capabilities that Alibaba Cloud provides, contact our sales staff to obtain an authorized Docker registry account. Then, set up a private image pull secret in the cluster so that the required Docker images can be pulled.

  1. In the left-side navigation pane, choose Clusters > Clusters. The Clusters page appears.
  2. Select the target cluster and click More > Open Cloud Shell in the Actions column.
    After you connect to the cluster, the connection output is displayed in Cloud Shell.
  3. Run the following command to create a docker-registry secret (a way to verify the secret is shown after this procedure):
    kubectl create secret \
    docker-registry regsecret \
    --docker-server=registry.cn-shanghai.aliyuncs.com \
    --docker-username=<your_username> \
    --docker-password=<your_password>
    Note
    • regsecret: The name of the secret. You can enter a custom name.
    • --docker-server: The address of the Docker registry.
    • --docker-username: The username of the Docker registry account.
    • --docker-password: The password of the Docker registry account.
  4. Reference the secret in the pod configuration file so that the private image with NPU capabilities can be pulled.
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-npu
    spec:
      containers:
      - name: <container name>
        image: registry.cn-shanghai.aliyuncs.com/hgai/<the Docker image with NPU capabilities>
      imagePullSecrets:
      - name: <secret name>
    Note
    • imagePullSecrets specifies the secret that is required to pull the image.
    • The secret name must be the same as the name of the secret created in step 3, which is regsecret in this example.
    • The Docker registry address in image must be the same as the one specified in --docker-server.
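
Note Before you reference the secret in a pod, you can check that it was created. This is a minimal verification sketch that uses standard kubectl commands; regsecret is the secret name from the preceding procedure.
kubectl get secret regsecret
kubectl get secret regsecret --output=yaml
The second command prints the secret definition; the registry credentials in it are base64-encoded.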

Use NPU resources

If a pod needs to use NPU resources, set the aliyun.com/npu parameter in resources.limits.

apiVersion: v1
kind: Pod
metadata:
  name: <pod name>
spec:
  containers:
    - name: <container name>
      image: <image name>
      resources:
        limits:
          aliyun.com/npu: <the number of requested NPU devices>
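
Because aliyun.com/npu is exposed as a Kubernetes extended resource, the scheduler places the pod only on a node that has enough allocatable NPU devices, and the pod stays in the Pending state if no such node is available. As a sketch, you can use standard kubectl commands to check where the pod runs and what it requested; <pod name> is the name that you set in metadata.name.
kubectl get pod <pod name> -o wide
kubectl describe pod <pod name>
In the output of the second command, the Limits section shows the number of requested NPU devices, and the Events section reports scheduling errors if no node can satisfy the request.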

Run TensorFlow in an NPU environment

You can use NPU resources to train TensorFlow models. The following example starts a pod to perform model training with NPU resources.

  1. Connect to the target cluster. For more information, see Use kubectl on Cloud Shell to manage Kubernetes clusters.
    Run the following command in Cloud Shell to generate the test-pod.yaml configuration file:
    cat > test-pod.yaml <<- EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: test-npu-pod
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: regsecret
      containers:
        - name: resnet50-npu
          image: registry.cn-shanghai.aliyuncs.com/hgai/tensorflow:v1_resnet50-tensorflow1.9.0-toolchain1.0.2-centos7.6
          resources:
            limits:
              aliyun.com/npu: 1 # request one NPU device
    EOF
  2. Run the following command to create a pod:
    kubectl apply -f test-pod.yaml
  3. Run the following command to query the pod status:
    kubectl get po test-npu-pod
    Note If the pod status is Error, run the kubectl logs test-npu-pod command to check the pod logs and troubleshoot the error.
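    Note Instead of querying the pod status repeatedly, you can watch the pod or stream the training logs. These commands use standard kubectl options and are shown here only as a convenience:
    kubectl get po test-npu-pod -w
    kubectl logs -f test-npu-pod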

Result

Wait a while and query the pod status again.
kubectl get po test-npu-pod
If the pod status is Completed, check the pod logs again.
kubectl logs test-npu-pod
If the following output is displayed, it indicates that the training is completed.
2019-10-30 12:10:50.389452: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
100%|##########| 98/98 [00:26<00:00,  3.67it/s]
resnet_v1_50, result =  {'top_5': 0.9244321584701538, 'top_1': 0.7480267286300659}
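
After you confirm the result, you can optionally delete the test pod to release the NPU device. The following command assumes the test-pod.yaml file created in the preceding steps; running kubectl delete pod test-npu-pod has the same effect.
kubectl delete -f test-pod.yaml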