This topic describes how to create a dedicated GPU cluster for heterogeneous computing in the Container Service for Kubernetes (ACK) console.

Background information

When you create a cluster in the ACK console, ACK performs the following operations:

  • Create ECS instances, configure a public key to enable Secure Shell (SSH) logon from master nodes to other nodes, and then configure the ACK cluster through CloudInit.
  • Create a security group that allows access to a VPC network through Internet Control Message Protocol (ICMP).
  • If you do not specify an existing VPC network, create a VPC network and a VSwitch, and create SNAT rules for the VSwitch.
  • Add route entries to a VPC network.
  • Create a NAT gateway and an Elastic IP address.
  • Create a RAM user and an AccessKey pair. Grant the following permissions to the RAM user: permissions to query, create, and delete ECS instances, permissions to add and delete cloud disks, and all permissions on SLB, Cloud Monitor, VPC, Log Service, and Network Attached Storage (NAS). The ACK cluster automatically creates SLB instances, cloud disks, and VPC routing entries based on your configurations.
  • Create an internal SLB instance and open port 6443.
  • Create a public Server Load Balancer (SLB) instance and open ports 6443, 8443, and 22. Port 22 is opened only if you enable SSH logon when you create the cluster.

Limits

  • SLB instances that are created along with the cluster support only the pay-as-you-go billing method.
  • Kubernetes clusters support only Virtual Private Cloud (VPC) networks.
  • By default, each account has quotas on the number of cloud resources that can be created. You cannot create a cluster if a quota would be exceeded. Make sure that your quotas are sufficient before you create a cluster.

    To request a quota increase, submit a ticket.

    • You can create up to 50 clusters across all regions under an account. A cluster can contain up to 100 nodes. To create more clusters or nodes, submit a ticket.
      Notice In a Kubernetes cluster, you can create up to 48 route entries per VPC by default. This means that a VPC-connected cluster can contain up to 48 route entries. To create more route entries, submit a ticket.
    • You can create up to 100 security groups under an account.
    • You can create up to 60 pay-as-you-go SLB instances under an account.
    • You can create up to 20 Elastic IP addresses under an account.
  • The limits on ECS instances are as follows:

    The pay-as-you-go and subscription billing methods are supported.

    Note After an ECS instance is created, you can change its billing method from pay-as-you-go to subscription in the ECS console. For more information, see Switch the billing method from pay-as-you-go to subscription.
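
The default quotas listed above can be checked against a deployment plan before you open the console. The following Python sketch is illustrative only: the dictionary keys, the exceeded_quotas function, and the sample plan are hypothetical names, and the limits are the defaults stated in this topic, which may differ for your account.

```python
# Default per-account quotas as stated in the Limits section of this topic.
# The key names are hypothetical labels chosen for this sketch.
DEFAULT_QUOTAS = {
    "clusters": 50,
    "nodes_per_cluster": 100,
    "route_entries_per_vpc": 48,
    "security_groups": 100,
    "payg_slb_instances": 60,
    "elastic_ip_addresses": 20,
}


def exceeded_quotas(planned, quotas=DEFAULT_QUOTAS):
    """Return {resource: (planned, limit)} for each planned count above its limit."""
    return {
        name: (count, quotas[name])
        for name, count in planned.items()
        if name in quotas and count > quotas[name]
    }


# Hypothetical plan: 120 nodes exceeds the default limit of 100 nodes per cluster.
plan = {"clusters": 3, "nodes_per_cluster": 120, "elastic_ip_addresses": 20}
print(exceeded_quotas(plan))  # {'nodes_per_cluster': (120, 100)}
```

An empty result means that the plan fits within the default quotas. Otherwise, submit a ticket for each listed resource before you create the cluster.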

Procedure

  1. Log on to the ACK console.
  2. In the left-side navigation pane, choose Clusters > Clusters.
  3. Click Create Kubernetes Cluster in the upper-right corner of the page. In the Select Cluster Template dialog box that appears, click Create on the Dedicated Cluster for Heterogeneous Computing card. The Dedicated Kubernetes tab appears by default.
  4. Set the cluster parameters.
    1. Configure basic settings.
      Parameter Description
      Cluster Name Enter a name for the cluster.
      Note The name must be 1 to 63 characters in length, and can contain digits, Chinese characters, letters, and hyphens (-).
      Region Select the region where you want to create the cluster.
      Resource Group Place the pointer over Account's all Resources at the top of the page, and select the resource group to which the cluster belongs. The name of the selected resource group appears in the Resource Group field.
      Kubernetes Version Select a Kubernetes version.
      Container Runtime Dedicated Kubernetes clusters support only Docker.
      VPC Set the Virtual Private Cloud (VPC) network for the cluster.
      Note Kubernetes clusters support only VPC networks. You can select a VPC network from the drop-down list. If no VPC network is available, you can click Create VPC to create one. For more information, see Create a VPC network.
      VSwitch Set the VSwitch.

      Select one to three VSwitches. We recommend that you select VSwitches in different zones. If no VSwitch is available, click Create VSwitch to create one. For more information, see Create a VSwitch.

      Network Plug-in Select a network plug-in. Flannel and Terway are available. For more information, see Flannel and Terway.
      • Flannel: a simple and stable Container Network Interface (CNI) plug-in developed by the Kubernetes community. Flannel offers a few simple features and does not support standard Kubernetes network policies.
      • Terway: a network plug-in developed by Alibaba Cloud Container Service. Terway allows you to assign Alibaba Cloud Elastic Network Interfaces (ENIs) to containers. It also allows you to customize network policies of Kubernetes to control intercommunication among containers, and implement bandwidth throttling on individual containers.
        Note The number of pods supported by a node depends on the number of ENIs that are attached to the node and the number of secondary IP addresses provided by these ENIs.
      Pod CIDR Block If you set Network Plug-in to Flannel, the Pod CIDR Block parameter is available.

      The CIDR block specified by Pod CIDR Block cannot overlap with the CIDR blocks that are used by the VPC network or existing clusters in the VPC network. After you create the cluster, you cannot modify the pod CIDR block. In addition, the service CIDR block cannot overlap with the pod CIDR block. For more information, see Plan Kubernetes CIDR blocks under a VPC.

      Terway Mode If you set Network Plug-in to Terway, the Terway Mode parameter is available.
      When you set Terway Mode, you can select or clear Assign One ENI to Each Pod.
      • If you select this check box, an ENI will be assigned to each pod.
      • If you clear this check box, an ENI will be shared among multiple pods. A secondary IP address of the ENI will be assigned to each pod.
      Note This feature is only available to users in the whitelist. If you are not in the whitelist, submit a ticket.
      Pod VSwitch If you set Network Plug-in to Terway, the Pod VSwitch parameter is available. Pod VSwitch specifies VSwitches for pods. The ENIs that are assigned to pods must be in the same zone as the nodes that host the pods. For each VSwitch that has been assigned to nodes, select a VSwitch for pods in the same zone. Pod VSwitches will assign IP addresses to pods when the cluster is started. We recommend that you select VSwitches whose prefix length is no greater than 19 bits. This ensures that the number of pods is sufficient.
      Service CIDR Set the Service CIDR parameter. The CIDR block specified by Service CIDR cannot overlap with the CIDR blocks that are used by the VPC network or existing clusters in the VPC network. After you create the cluster, you cannot modify the service CIDR block. In addition, the service CIDR block cannot overlap with the pod CIDR block. For more information, see Plan Kubernetes CIDR blocks under a VPC.
      IP Addresses per Node If you set Network Plug-in to Flannel, the IP Addresses per Node parameter is available.
      Note IP Addresses per Node specifies the maximum number of IP addresses that can be assigned to each node. We recommend that you use the default value.
      Configure SNAT Specify whether to configure source network address translation (SNAT) rules for the VPC network.
      • If the specified VPC network has a network address translation (NAT) gateway, Container Service uses this NAT gateway.
      • Otherwise, the system automatically creates a NAT gateway. If you do not want the system to create a NAT gateway, clear Configure SNAT for VPC. In this case, you must manually create a NAT gateway and configure SNAT rules to enable Internet access to the VPC network. Otherwise, the cluster cannot be created.
      Public Access Select or clear Expose API Server with EIP.
      The Kubernetes API server provides multiple HTTP-based RESTful APIs, which can be used to create, modify, query, watch, or delete resource objects such as pods and services.
      • If you select this check box, an Elastic IP address (EIP) is created and associated with the internal SLB instance. Port 6443 used by the API Server is opened on master nodes. You can connect to and manage the cluster by using kubeconfig over the Internet.
      • If you clear this check box, no EIP is created. You can only connect to and manage the cluster by using kubeconfig within the VPC network.
      SSH Logon

      Before you enable SSH, you must select Expose API Server with EIP.

      • If you select this check box, you can access the cluster by using SSH.
      • If you clear this check box, you cannot access the cluster by using SSH or kubectl. If you want to access an ECS instance in the cluster through SSH, you must manually bind an EIP to the instance and configure security group rules to open SSH port 22. For more information, see Use SSH to connect to a cluster.
      RDS Whitelist Set the Relational Database Service (RDS) whitelist. Add the IP addresses of cluster nodes to the RDS whitelist.
      Security Group

      Create Basic Security Group, Create Advanced Security Group, and Select Existing Security Group are available. For more information, see Overview.
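
The CIDR rules described for Pod CIDR Block, Service CIDR, and Pod VSwitch can be checked offline before you fill in the form. The following Python sketch uses the standard ipaddress module. The function names and the sample CIDR blocks are hypothetical, and the address count does not subtract any addresses that the VPC may reserve.

```python
import ipaddress


def overlapping_pairs(cidrs):
    """Return (name, name) pairs, sorted by name, whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


def vswitch_addresses(cidr):
    """Total number of addresses in a VSwitch CIDR block (reserved ones not subtracted)."""
    return ipaddress.ip_network(cidr).num_addresses


# Hypothetical plan: the VPC, pod, and service CIDR blocks must not overlap.
plan = {"vpc": "192.168.0.0/16", "pod": "172.20.0.0/16", "service": "172.21.0.0/20"}
print(overlapping_pairs(plan))              # []
print(vswitch_addresses("192.168.0.0/19"))  # 8192
```

A prefix length of 19 bits leaves 13 host bits, which is why a pod VSwitch of that size offers 8,192 addresses.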

    2. Configure advanced settings.
      Parameter Description
      Kube-proxy Mode

      iptables and IPVS are supported.

      • iptables is a mature and stable mode that uses iptables rules to configure service discovery and load balancing. Its performance is average and depends heavily on the cluster size. This mode is suitable for clusters that run a small number of services.
      • IPVS provides high performance and uses IP Virtual Server (IPVS) to configure service discovery and load balancing. This mode is suitable for clusters that run a large number of services. We recommend that you use this mode in scenarios where high load balancing performance is required.
      Labels

      Attach labels to the cluster. Enter the key and value, and click Add.

      Note
      • The key field is required and the value field is optional.
      • Keys are not case-sensitive. A key must be 1 to 64 characters in length and cannot start with aliyun, http://, or https://.
      • Values are not case-sensitive. A value must be 1 to 128 characters in length and cannot start with http:// or https://.
      • The keys of labels attached to the same resource must be unique. If you add a label with a used key, the label overwrites the one that uses the same key.
      • You can attach up to 20 labels to each resource. If you attach more than 20 labels to a resource, all labels become invalid. You must detach unused labels for the remaining labels to take effect.
      Custom Image

      If you select a custom image, the default image will be replaced.

      Cluster Domain Set the cluster domain.
      Note The default cluster domain is cluster.local. Custom domains are supported. A domain consists of two parts. Each part must be 1 to 63 characters in length and can contain only letters and digits.
      Custom Certificate SANs

      Specify the Subject Alternative Names (SANs) to include in the API server certificate. Separate multiple IP addresses or domains with commas (,).

      Service Account Token Volume Projection

      Enable Service Account Token Volume Projection to enhance security when you use service accounts. For more information, see Deploy service account token volume projection.

      Cluster CA If you select Custom Cluster CA, upload a CA certificate for the Kubernetes cluster to protect data transmission between the server and client.
      Deletion Protection Specify whether to enable Deletion Protection. If you select this check box, the cluster cannot be deleted in the console or through API operations. This prevents accidental deletion.
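
The label rules listed under Labels can be validated before you enter the labels in the console. The following Python sketch is a rough check, not the console's own validation: the label_errors function and its messages are hypothetical, and treating the forbidden prefixes and duplicate keys as case-insensitive is an assumption based on the statement that keys and values are not case-sensitive.

```python
FORBIDDEN_KEY_PREFIXES = ("aliyun", "http://", "https://")
FORBIDDEN_VALUE_PREFIXES = ("http://", "https://")
MAX_LABELS = 20


def label_errors(labels):
    """Return a list of rule violations for a list of (key, value) pairs."""
    errors = []
    if len(labels) > MAX_LABELS:
        errors.append(f"{len(labels)} labels exceed the limit of {MAX_LABELS}")
    seen = set()
    for key, value in labels:
        folded = key.lower()  # assumption: comparisons ignore case
        if not 1 <= len(key) <= 64:
            errors.append(f"key {key!r}: length must be 1 to 64 characters")
        if folded.startswith(FORBIDDEN_KEY_PREFIXES):
            errors.append(f"key {key!r}: starts with a forbidden prefix")
        if folded in seen:
            errors.append(f"key {key!r}: duplicate key, the later label overwrites the earlier one")
        seen.add(folded)
        if value:  # the value field is optional
            if len(value) > 128:
                errors.append(f"value of {key!r}: length must be 1 to 128 characters")
            if value.lower().startswith(FORBIDDEN_VALUE_PREFIXES):
                errors.append(f"value of {key!r}: starts with a forbidden prefix")
    return errors


print(label_errors([("env", "prod"), ("team", "")]))  # []
print(label_errors([("aliyun-owner", "ops")]))        # one forbidden-prefix error
```
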
  5. Click Next:Master Configurations to configure master nodes.
    Parameter Description
    Billing Method These billing methods are supported: Pay-As-You-Go and Subscription.
    Duration If you select Subscription, you must set the duration of the subscription.
    Auto Renewal If you select Subscription, you must specify whether to enable Auto Renewal.
    Master Node Quantity Set the number of master nodes. You can create three or five master nodes.
    Instance Type Select an instance type for each master node. For more information, see Instance families.
    Note If no instance type is available, you can change VSwitches on the Cluster Configurations wizard page.
    System Disk By default, system disks are mounted to master nodes. ESSDs, SSDs, and ultra disks are supported.
    Note You can select Enable Backup to back up disk data.
  6. Click Next:Worker Configurations to configure worker nodes.
    1. Set worker instances.
      • If you select Create Instance, you must set the parameters listed in the following table.
        Parameter Description
        Instance Type Choose Heterogeneous Computing > Compute Optimized Type with GPU to show a list of available instance types, and select one or more required instance types from the list. For more information, see Instance families.
        Note If no instance type is available, you can change VSwitches on the Cluster Configurations wizard page.
        Selected Types The specified instance types appear here.
        Quantity Set the number of worker nodes.
        System Disk ESSDs, SSDs, and ultra disks are supported.
        Note You can select Enable Backup to back up disk data.
        Mount Data Disk

        ESSDs, SSDs, and ultra disks are supported. You can enable disk encryption and backup when you add data disks.

        Operating System CentOS 7.7 and AliyunLinux 2.1903 are supported.
        Logon Type
        • Key Pair:

          If no key pair is available, you can click create a key pair to create one in the ECS console. For more information, see Create an SSH key pair. After the key pair is created, set it as the credentials to log on to the cluster.

        • Password:
          • Password: Set the logon password.
          • Confirm Password: Enter the password again.
        Key Pair
      • If you select Add Existing Instance, you must select ECS instances that are deployed in the specified region. Then, set the Operating System, Logon Type, and Key Pair parameters in the same way as for Create Instance.
    2. Configure advanced settings.
      Parameter Description
      Node Protection Specify whether to enable node protection.
      Note This check box is selected by default. When it is selected, cluster nodes cannot be deleted in the console or through API operations. This prevents accidental deletion.
      User Data For more information, see Prepare user data.
      Custom Node Name Specify whether to enable Custom Node Name.
      A node name consists of a prefix, an IP substring, and a suffix.
      • Both the prefix and suffix can contain one or more parts that are separated with periods (.). These parts can contain lowercase letters, digits, and hyphens (-), and must start and end with a lowercase letter or digit.
      • The IP substring length specifies the number of digits at the end of the returned node IP address. Valid values: 5 to 12.

      For example, if the node IP address is 192.168.0.55, the prefix is aliyun.com, the IP substring length is 5, and the suffix is test, the node name will be aliyun.com00055test.

      Node Port Range Set the node port range. The default port range is 30000 to 32767.
      CPU Policy Set the CPU policy.
      • None: specifies that the default CPU affinity policy is used. This option is selected by default.
      • Static: allows pods with certain resource characteristics to be granted enhanced CPU affinity and exclusivity on the node.
      Taints Add taints to worker nodes in the cluster.
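
The Custom Node Name example above (192.168.0.55 with an IP substring length of 5 yields 00055) can be reproduced with the following Python sketch. The padding rule is an assumption inferred from that example: each octet is padded to three digits, the octets are concatenated, and the last digits are kept. The node_name function is a hypothetical helper, not an ACK API.

```python
def node_name(ip, prefix, suffix, ip_substring_length):
    """Build a node name from a prefix, the trailing digits of the IP address, and a suffix."""
    if not 5 <= ip_substring_length <= 12:
        raise ValueError("IP substring length must be 5 to 12")
    # Assumption: each octet is zero-padded to three digits before the digits are taken,
    # so 192.168.0.55 becomes 192168000055.
    digits = "".join(f"{int(octet):03d}" for octet in ip.split("."))
    return f"{prefix}{digits[-ip_substring_length:]}{suffix}"


print(node_name("192.168.0.55", "aliyun.com", "test", 5))  # aliyun.com00055test
```

Under this assumption, the upper bound of 12 corresponds to the full padded address of four three-digit octets.
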
  7. Click Next:Component Configuration to configure components.
    Parameter Description
    Ingress Specify whether to install Ingress controllers. By default, Install Ingress Controllers is selected. For more information, see Support for Ingress.
    Volume Plug-in Select a storage plug-in. Flexvolume and CSI are supported. Kubernetes clusters can be automatically bound to Alibaba Cloud disks, Network Attached Storage (NAS) file systems, and Object Storage Service (OSS) buckets through pods. For more information, see Storage management - Flexvolume and Storage management - CSI.
    Monitoring Agents

    Specify whether to install one or more monitoring agents. You can install the CloudMonitor agent on ECS nodes. This allows you to view the monitoring information about these nodes in the CloudMonitor console. If you are in the whitelist, you can select Enable Prometheus Monitoring.

    Log Service

    Specify whether to enable Log Service. You can select an existing project or create a project.

    If you select Enable Log Service, the Log Service plug-in is automatically installed in the cluster. When you create an application, you can get started with Log Service through a few simple steps. For more information, see Use Log Service to collect Kubernetes logs.

    After you select Enable Log Service, you can select Create Ingress Dashboard to create Ingress dashboards in the Log Service console, or select Install node-problem-detector and Create Event Center to create an event center in the Log Service console.

    Workflow Engine Specify whether to use Alibaba Cloud Genomics Service (AGS).
    • If you select this check box, the system automatically installs the AGS workflow plug-in when it creates the cluster.
    • If you clear this check box, you must manually install the AGS workflow plug-in. For more information, see Introduction to AGS CLI.
  8. Click Next:Confirm Order.
  9. Select Terms of Service and click Create Cluster.
    Note It takes approximately 10 minutes for the system to create a Kubernetes cluster that consists of multiple nodes.

What to do next

After the cluster is created, go to the Clusters page, find the new cluster, and click its name or click Manage in the Actions column. The Cluster Information page appears by default. In the left-side navigation pane, click Nodes. On the Nodes page, find the worker node that you configured when you created the cluster, and choose More > Details in the Actions column to view the GPU devices associated with the node.
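
The same check can be made from the command line. The following Python sketch parses the JSON printed by kubectl get nodes -o json and reports the allocatable GPU count of each node. It assumes that GPUs are advertised under the extended resource name nvidia.com/gpu, which is what the NVIDIA device plugin uses; confirm the resource name on your own nodes. The gpu_counts function and the sample data are hypothetical.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"  # assumption: resource name used by the NVIDIA device plugin


def gpu_counts(node_list_json):
    """Map node name to allocatable GPU count from `kubectl get nodes -o json` output."""
    nodes = json.loads(node_list_json)["items"]
    return {
        node["metadata"]["name"]: int(
            node["status"].get("allocatable", {}).get(GPU_RESOURCE, 0)
        )
        for node in nodes
    }


# Hypothetical sample that mimics the structure of the kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
})
print(gpu_counts(sample))  # {'gpu-node-1': 1, 'cpu-node-1': 0}
```

A worker node that reports 0 has no GPU resource registered, which may indicate that the instance type has no GPU or that the device plugin is not running on the node.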