Platform For AI: Lingjun resource quotas

Last Updated: Apr 22, 2024

Alibaba Cloud Machine Learning Platform for AI (PAI) provides Lingjun resources for AI development and training. You can create resource quotas for purchased Lingjun resources to perform high-performance AI training and computing. This topic describes how to create, manage, and use resource quotas.

Prerequisites

Create a resource quota

You can create a resource quota to allocate the resources in the resource pool. To create a resource quota, perform the following steps:

  1. Log on to the PAI console. In the left-side navigation pane, choose AI Computing Resources > Resource Quota.

  2. On the Intelligent Computing Lingjun resources tab, click Add Resource Quota.

  3. On the Add Resource Quota page, configure the parameters and click Submit.

    Name: The name of the resource quota.

    Scheduling Policy: The scheduling policy. Select an appropriate scheduling policy to improve the utilization of computing resources. Valid values:

    • Intelligent

    • Balance

    • Round Robin

    • FIFO

    Associate Workspace: The workspace with which the resource quota is associated.

    Description: The description that is used to distinguish different resource quotas.

    Source Type: The type of source of resources to be allocated to the resource quota. Valid values:

    • Dedicated Resource Group: Allocate resources from a dedicated resource group to the resource quota.

    • Existing Resource Quota: Allocate resources from an existing resource quota to the resource quota.

    Source: The source of resources to be allocated to the resource quota. Select a dedicated resource group or an existing resource quota from the Source drop-down list.

    Specifications/Resources: Click Add. In the panel that appears, specify the specifications and node quantity for resources that you want to allocate from a dedicated resource group or an existing resource quota.

    VPC, vSwitch, Security Group: Select a VPC, a vSwitch, and a security group from the drop-down lists.

    Note

    If your Lingjun resources need to access the Internet, you must configure an Internet NAT gateway for the selected VPC and associate an elastic IP address (EIP) with the Internet NAT gateway. We recommend that you select the VPC that you want to use to access the Internet. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet.
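
The parameters on the Add Resource Quota page can be written down as a small record and checked before you fill in the form. The following Python sketch is illustrative only: QuotaRequest and validate are hypothetical names and are not part of any PAI SDK or API. The valid values are copied from the parameter descriptions in this topic. Treating every field except Associate Workspace and Description as required is an assumption, and all values in the usage example are placeholders.

```python
from dataclasses import dataclass, field

# Valid values copied from the Scheduling Policy and Source Type parameters.
SCHEDULING_POLICIES = ("Intelligent", "Balance", "Round Robin", "FIFO")
SOURCE_TYPES = ("Dedicated Resource Group", "Existing Resource Quota")


@dataclass
class QuotaRequest:
    """Hypothetical record of the Add Resource Quota form; not a PAI API type."""
    name: str
    scheduling_policy: str
    source_type: str
    source: str
    # Maps a specification name to the number of nodes requested.
    resources: dict = field(default_factory=dict)
    vpc: str = ""
    vswitch: str = ""
    security_group: str = ""
    workspace: str = ""      # assumed optional
    description: str = ""    # assumed optional


def validate(req: QuotaRequest) -> list:
    """Return a list of problems; an empty list means nothing is obviously missing."""
    problems = []
    if not req.name:
        problems.append("Name is empty")
    if req.scheduling_policy not in SCHEDULING_POLICIES:
        problems.append(f"Unknown scheduling policy: {req.scheduling_policy!r}")
    if req.source_type not in SOURCE_TYPES:
        problems.append(f"Unknown source type: {req.source_type!r}")
    if not req.source:
        problems.append("Source is empty")
    if not req.resources:
        problems.append("No specifications added")
    for spec, nodes in req.resources.items():
        if nodes < 1:
            problems.append(f"Node quantity for {spec!r} must be at least 1")
    network = (("VPC", req.vpc), ("vSwitch", req.vswitch),
               ("Security Group", req.security_group))
    for label, value in network:
        if not value:
            problems.append(f"{label} is not selected")
    return problems


# Usage with placeholder values; none of these identifiers are real resources.
request = QuotaRequest(
    name="example-quota",
    scheduling_policy="FIFO",
    source_type="Dedicated Resource Group",
    source="example-resource-group",
    resources={"example-spec": 2},
    vpc="example-vpc",
    vswitch="example-vswitch",
    security_group="example-security-group",
)
print(validate(request))
```

An empty list from validate means only that no field is missing and that the selected values appear in this topic. The console may apply further checks that this sketch does not model.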

Manage resource quotas

After you create a resource quota, you can click the name of the resource quota to view the basic information and resource usage and manage the resource quota. You can also increase or decrease the resource quota limit or create child-level resource quotas to optimize the allocation of resources. For more information, see Manage resource quotas.
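
To reason about how child-level resource quotas relate to their parent, it can help to model the hierarchy as a tree of node counts. The following Python sketch is a simplified mental model, not PAI behavior or a PAI API: the Quota class is hypothetical, and the rules that child quotas cannot exceed the unassigned nodes of the parent and that a quota cannot shrink below what its children hold are assumptions made for illustration.

```python
class Quota:
    """Hypothetical in-memory model of a resource quota; not a PAI API type."""

    def __init__(self, name, nodes, parent=None):
        self.name = name
        self.nodes = nodes      # node count assigned to this quota
        self.parent = parent
        self.children = []

    def allocated(self):
        """Nodes already handed to child-level quotas."""
        return sum(child.nodes for child in self.children)

    def available(self):
        """Nodes still unassigned at this level."""
        return self.nodes - self.allocated()

    def create_child(self, name, nodes):
        """Carve a child-level quota out of this quota's unassigned nodes."""
        if nodes < 1:
            raise ValueError("a child quota needs at least one node")
        if nodes > self.available():
            raise ValueError(
                f"{self.name} has only {self.available()} unassigned nodes")
        child = Quota(name, nodes, parent=self)
        self.children.append(child)
        return child

    def resize(self, nodes):
        """Increase or decrease the limit, respecting children and the parent."""
        if nodes < self.allocated():
            raise ValueError("cannot shrink below what child quotas already hold")
        if self.parent is not None:
            # Nodes this quota could hold: its own plus the parent's spare ones.
            headroom = self.parent.available() + self.nodes
            if nodes > headroom:
                raise ValueError("parent quota does not have enough unassigned nodes")
        self.nodes = nodes


# Usage with placeholder names and sizes.
root = Quota("root-quota", 8)
team_a = root.create_child("team-a", 5)
print(root.available())   # 3 nodes remain unassigned on the root quota
team_a.resize(6)
print(root.available())   # 2 nodes remain after the child grows by one
```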

Use a resource quota

  • Associate a resource quota with a workspace

    Before you use a resource quota to perform AI development and training jobs, you must associate the resource quota with a workspace. For more information, see Overview.

  • Use a resource quota that is associated with a workspace for AI development and training

    • Select an image.

      Submitting a Deep Learning Containers (DLC) training job by using a resource quota for Lingjun resources involves the integration of hardware and software, such as servers, networks, drivers, and training frameworks. Therefore, we recommend that you use an official PAI image or build an image based on an official PAI image.

      Note

      If you use a custom image, you may need to update the drivers, frameworks, and software to appropriate versions to make full use of the high-performance Lingjun resources.

      Image name: deepspeed-training:23.06-gpu-py310-cu121-ubuntu22.04

      • Framework: PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, Nemo 1.19.0

      • Model: GPU

      • CUDA: 12.1 (cu121)

      • Operating system: Ubuntu 22.04

      • Supported region: China (Ulanqab)

      • Programming language and version: Python 3.10

      Image name: megatron-training:23.06-gpu-py310-cu121-ubuntu22.04

      • Framework: PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, Nemo 1.19.0

      • Model: GPU

      • CUDA: 12.1 (cu121)

      • Operating system: Ubuntu 22.04

      • Supported region: China (Ulanqab)

      • Programming language and version: Python 3.10

      Image name: nemo-training:23.06-gpu-py310-cu121-ubuntu22.04

      • Framework: PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, Nemo 1.19.0

      • Model: GPU

      • CUDA: 12.1 (cu121)

      • Operating system: Ubuntu 22.04

      • Supported region: China (Ulanqab)

      • Programming language and version: Python 3.10

    • Submit a DLC training job by using a resource quota for Lingjun resources. For more information, see Submit training jobs.

    • Create a Data Science Workshop (DSW) instance based on Lingjun resources. For more information, see Create a DSW instance.

    • Deploy services by using Elastic Algorithm Service (EAS). For more information, see Model service deployment by using the PAI console.
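
The official image names listed in the Select an image step follow a regular pattern, so the Python version, CUDA version, and operating system can be read from the tag. The following Python sketch parses that pattern. It is an illustration only: parse_image is a hypothetical helper, and the pattern is inferred solely from the three image names in this topic, so other PAI images may not match it. The sketch also assumes a single-digit Python major version and a single-digit CUDA minor version.

```python
import re

# Layout inferred from names such as
# deepspeed-training:23.06-gpu-py310-cu121-ubuntu22.04 (an assumption).
TAG_PATTERN = re.compile(
    r"^(?P<name>[a-z0-9-]+):(?P<release>\d+\.\d+)-(?P<device>[a-z]+)"
    r"-py(?P<py>\d+)-cu(?P<cuda>\d+)-(?P<os>[a-z]+[\d.]+)$"
)


def parse_image(image: str) -> dict:
    """Split an image name of the form shown above into its components."""
    match = TAG_PATTERN.match(image)
    if match is None:
        raise ValueError(f"image name does not follow the expected pattern: {image!r}")
    py = match["py"]
    cuda = match["cuda"]
    return {
        "name": match["name"],
        "release": match["release"],
        "device": match["device"],
        "python": f"{py[0]}.{py[1:]}",        # py310 -> 3.10
        "cuda": f"{cuda[:-1]}.{cuda[-1]}",    # cu121 -> 12.1
        "os": match["os"],
    }


info = parse_image("nemo-training:23.06-gpu-py310-cu121-ubuntu22.04")
print(info["python"], info["cuda"], info["os"])   # 3.10 12.1 ubuntu22.04
```

A check like this can be useful when you build a custom image on top of an official one and want to confirm that the base image matches the CUDA and Python versions that your training code expects.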