In scenarios in which graphics computing is not required, such as deep learning and AI, we recommend that you use a GPU-accelerated compute-optimized instance configured with a GPU driver. This topic describes how to create a Linux GPU-accelerated compute-optimized instance configured with a GPU driver that supports automatic installation.

Background information

Alibaba Cloud allows you to configure GPU drivers that support automatic installation only when you create specific GPU-accelerated instances and use Linux public images. The instances must belong to GPU-accelerated compute-optimized families, such as the GPU-accelerated compute-optimized instance family and the GPU-accelerated compute-optimized Elastic Compute Service (ECS) Bare Metal Instance family. You cannot configure GPU drivers that support automatic installation in the following scenarios:

The installation methods and driver types may vary based on individual use cases. For more information, see Installation guideline for NVIDIA drivers.

Preparations

  1. Create an Alibaba Cloud account and complete account information.
  2. Go to the Custom Launch tab of the instance buy page in the ECS console.

Procedure

Step 1: Complete the settings in the Basic Configurations step

In the Basic Configurations step, you can configure the basic parameters and resources that are required to purchase an instance. The basic parameters include the billing method, region, and zone. The basic resources include the instance type, image, and storage. After you complete the settings in the Basic Configurations step, click Next.

  1. Select a billing method.
    The billing method determines how the billing and charging rules are applied to an instance. The billing method also determines how the status of the resources on the instance is changed.
    Billing method | Description | References
    Subscription | You pay for resources before you use them. | Subscription
    Pay-as-you-go | You pay for resources after you use them. The billing cycles of pay-as-you-go instances are accurate to the second. You can purchase and release instances based on your business requirements.
    Note We recommend that you use this billing method together with savings plans to reduce costs.
    Preemptible Instance | You pay for resources after you use them. The price of a preemptible instance is lower than the price of a pay-as-you-go instance. However, the system may release preemptible instances due to fluctuations in the market price or insufficient resources of instance types. | Preemptible instances
  2. Select a region and a zone.
    Select a region that is close to your geographical location to reduce latency. After an instance is created, the region and the zone of the instance cannot be changed. For more information, see Regions and zones.
  3. Select an instance type and configure the relevant settings.
    1. Set the Architecture parameter to Heterogeneous Computing and set the Category parameter to Compute Optimized Type with GPU. Alternatively, set the Architecture parameter to ECS Bare Metal Instance and set the Category parameter to GPU Type. Then, select an instance type.
      Note
      • The available instance types vary based on the selected region. To view the instance types that can be used in each region, go to the ECS Instance Types Available for Each Region page.
      • You may have specific requirements on the settings. For example, you may want to attach multiple elastic network interfaces (ENIs), enhanced SSDs (ESSDs), or local disks to the instance. You must make sure that the selected instance type meets the requirements. For information about the features, scenarios, and specifications of instance types, see Overview of instance families.
      • If you want to purchase an instance for a specific scenario, click the Scenario-based Selection tab to view the instance types that are recommended for different scenarios. For example, you can set the Business Scenario parameter to AI Machine Learning to view the recommended GPU-accelerated instance types.
    2. Check whether the value of the Selected Instance Type parameter is the same as the selected instance type.
    3. If you set Billing Method to Preemptible Instance, configure the Use Duration and Maximum Price for Instance Type parameters.
      Use Duration specifies the protection period of a preemptible instance. After the protection period ends, the instance may be released due to insufficient resources or a lower bid than the market price. The following table describes the valid values of the Use Duration parameter.
      Value | Description
      One Hour | After the preemptible instance is created, it enters a 1-hour protection period during which it cannot be automatically released.
      None | The preemptible instance is created without a protection period. Preemptible instances without a protection period are lower-cost than preemptible instances with a protection period.
      The following table describes the valid values of the Maximum Price for Instance Type parameter.
      Value | Description
      Use Automatic Bid | The real-time market price of the instance type is automatically used. The price can be up to but cannot exceed the pay-as-you-go price of the instance type. Automatic bidding can prevent the preemptible instance from being released due to lower bids than the market price, but cannot prevent the instance from being released due to insufficient resources.
      Set Maximum Price | You must specify a maximum price. If the real-time market price exceeds your specified maximum price or if available resources are insufficient, the preemptible instance is released.
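The release rules in the two tables above can be summarized in a small shell sketch. The function name `preemptible_outcome` and its arguments are hypothetical and exist only for illustration. The sketch ignores the protection period and only models the rules as this topic states them.

```shell
# Hypothetical helper: decide whether a preemptible instance would be released.
# Arguments: market price, maximum price (or "auto" for Use Automatic Bid),
# and whether resources are sufficient ("yes" or "no").
preemptible_outcome() {
    market_price=$1
    max_price=$2
    resources_ok=$3

    # Insufficient resources can cause a release under either bidding mode.
    if [ "$resources_ok" != "yes" ]; then
        echo "released"
        return 0
    fi

    # With automatic bidding, the bid follows the market price, so the price
    # alone does not trigger a release.
    if [ "$max_price" = "auto" ]; then
        echo "kept"
        return 0
    fi

    # awk handles the decimal comparison; POSIX sh arithmetic is integer-only.
    if awk -v m="$market_price" -v x="$max_price" 'BEGIN { exit !(m > x) }'; then
        echo "released"
    else
        echo "kept"
    fi
}
```

For example, `preemptible_outcome 0.50 0.40 yes` reports a release because the market price exceeds the maximum price.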
    4. Specify the number of instances to create.
      You can create a maximum of 100 instances at a time by using the wizard. In addition, the number of instances within your account cannot exceed your instance quota. The instance quota is displayed on the buy page. For more information, see View and increase instance quotas.
  4. Select an image.
    1. In the Image section, click Public Image and select the Linux distribution and version that you want to use.
    2. Select Auto-install GPU Driver, and determine whether to select AIACC-Training and AIACC-Inference based on your business requirements. Then, select the versions of the CUDA library, GPU driver, and cuDNN library that you want to use.
      Note If you select an instance of the sccgn7ex GPU-accelerated compute-optimized Super Computing Cluster (SCC) instance family, you can determine whether to install a remote direct memory access (RDMA) software stack that supports automatic installation based on your business requirements.
      The following information describes GPU drivers, RDMA software stacks, AIACC-Training, and AIACC-Inference:
      • GPU drivers are used to drive physical GPUs and can work efficiently when used together with the CUDA and cuDNN libraries. If you select Auto-install GPU Driver, a CUDA library and a cuDNN library are installed when you install the GPU driver. You can select Auto-install GPU Driver only when you use specific Linux public images. The following table lists the image versions and the instance families supported for GPU drivers of different versions.
        Note For a new business system, we recommend that you use the latest versions of the GPU driver, CUDA library, and cuDNN library.
        CUDA library version | GPU driver version | cuDNN library version | Supported Alibaba Cloud public image version | Supported instance family
        11.4.1 | 470.82.01 | 8.2.4 | Alibaba Cloud Linux 2 and Alibaba Cloud Linux 3; Ubuntu 20.04, Ubuntu 18.04, and Ubuntu 16.04; CentOS 8.x and CentOS 7.x; Debian 10.10 (supported only for the sccgn7ex instance family) | gn7i, gn7e, gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn7, ebmgn7i, ebmgn7e, ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i; sccgn7ex
        11.2.2 | 460.91.03 | 8.1.1 | Alibaba Cloud Linux 2 and Alibaba Cloud Linux 3; Ubuntu 20.04, Ubuntu 18.04, and Ubuntu 16.04; CentOS 8.x and CentOS 7.x | gn7, gn7i, gn7e, gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn7, ebmgn7i, ebmgn7e, ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
        11.0.2 | 460.91.03 | 8.1.1, 8.0.4 | Alibaba Cloud Linux 2; Ubuntu 20.04, Ubuntu 18.04, and Ubuntu 16.04; CentOS 8.x and CentOS 7.x | gn7, gn7e, gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn7, ebmgn7e, ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
        10.2.89 | 460.91.03 | 8.1.1, 8.0.4, 7.6.5 | Alibaba Cloud Linux 2; Ubuntu 18.04 and Ubuntu 16.04; CentOS 8.x and CentOS 7.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
        10.1.168 | 450.80.02, 440.64.00 | 8.0.4, 7.6.5, 7.5.0 | Ubuntu 18.04 and Ubuntu 16.04; CentOS 7.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
        10.0.130 | 450.80.02, 440.64.00 | 7.6.5, 7.5.0, 7.4.2, 7.3.1 | Ubuntu 18.04 and Ubuntu 16.04; CentOS 7.x | gn6v, gn6i, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6i, ebmgn6e, and ebmgn5i
        9.2.148 | 450.80.02, 440.64.00, 390.116 | 7.6.5, 7.5.0, 7.4.2, 7.3.1, 7.1.4 | Ubuntu 16.04; CentOS 7.x | gn6v, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6e, and ebmgn5i
        9.0.176 | 450.80.02, 440.64.00, 390.116 | 7.6.5, 7.5.0, 7.4.2, 7.3.1, 7.1.4, 7.0.5 | Ubuntu 16.04; CentOS 7.x; SUSE 12sp2 | gn6v, gn6e, gn5, and gn5i; ebmgn6v, ebmgn6e, and ebmgn5i
        8.0.61 | 450.80.02, 440.64.00, 390.116 | 7.1.3, 7.0.5 | Ubuntu 16.04; CentOS 7.x | gn5 and gn5i; ebmgn5i
        Note If you want to change the OS of an instance after the instance is created, make sure that GPU drivers can be automatically installed when you use the selected image.
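Before you commit to a version combination, it can help to check a CUDA library version and a GPU driver version against the table above. The following is a minimal sketch of such a lookup. The function names `drivers_for_cuda` and `is_supported_pair` are hypothetical, and the pairs are transcribed from the table in this topic, so the table remains the authoritative source.

```shell
# Hypothetical helper: list the GPU driver versions that the table in this
# topic pairs with a given CUDA library version.
drivers_for_cuda() {
    case "$1" in
        11.4.1)                  echo "470.82.01" ;;
        11.2.2|11.0.2|10.2.89)   echo "460.91.03" ;;
        10.1.168|10.0.130)       echo "450.80.02 440.64.00" ;;
        9.2.148|9.0.176|8.0.61)  echo "450.80.02 440.64.00 390.116" ;;
        *) return 1 ;;           # CUDA version not listed in the table
    esac
}

# Succeeds only if the driver version is listed for the CUDA version.
is_supported_pair() {
    cuda=$1
    driver=$2
    supported=$(drivers_for_cuda "$cuda") || return 1
    for candidate in $supported; do
        [ "$candidate" = "$driver" ] && return 0
    done
    return 1
}
```

The sketch checks only the CUDA and driver columns. The cuDNN library version, image version, and instance family still need to be checked against the table.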
      • RDMA software stack

        To further optimize the network performance of GPU-accelerated instances that use the SHENLONG architecture, Alibaba Cloud provides GPU-accelerated compute-optimized SCC instance families, which are named sccgn instance families. sccgn instances provide superior computing power and network communication. RDMA software stacks can be automatically installed for the sccgn7ex instance family. This way, you can use the GPUDirect RDMA feature with ease. For more information, see sccgn instance family.

      • AIACC-Training is an AI accelerator that is developed by Alibaba Cloud. AIACC-Training can significantly improve training performance for mainstream AI computing frameworks, such as TensorFlow, PyTorch, MXNet, and Caffe. For more information, see AIACC-Training.
        Note CentOS 6, SUSE Linux, and Alibaba Cloud Linux do not support AIACC-Training.
      • AIACC-Inference is an AI accelerator that is developed by Alibaba Cloud. AIACC-Inference can significantly improve inference performance for mainstream AI computing frameworks, such as TensorFlow, and the frameworks that can be converted to the Open Neural Network Exchange (ONNX) format. For more information, see AIACC-Inference.
        Note CentOS 6, SUSE Linux, and Alibaba Cloud Linux do not support AIACC-Inference.
  5. Complete storage settings.
    ECS instances provide storage capabilities based on the system disks, data disks, and Apsara File Storage NAS file systems that are attached to the instances. ECS provides cloud and local disks to meet the storage requirements of different scenarios.
    Cloud disks, including ESSDs, standard SSDs, and ultra disks, can be used as system disks or data disks. For more information, see Disks.
    Note A cloud disk that is created along with an instance uses the same billing method as the instance.
    Local disks can be used only as data disks. For instance families equipped with local disks (such as instance families with local SSDs and big data instance families), the information of the local disks is displayed. For more information, see Local disks.
    Note You cannot attach local disks to instances yourself.
    1. Configure the system disk.
      System disks are used to install operating systems. The default capacity of a system disk is 40 GiB. However, the actual minimum capacity is related to the image. The following table describes the capacity ranges of system disks for different images.
      Image | System disk capacity range (GiB)
      Linux (excluding CoreOS and Red Hat) | [max{20, Image size}, 500]
      FreeBSD | [max{30, Image size}, 500]
      CoreOS | [max{30, Image size}, 500]
      Red Hat | [max{40, Image size}, 500]
      Windows | [max{40, Image size}, 500]
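The capacity ranges above follow one rule: the lower bound is the larger of a per-image floor and the image size, and the upper bound is 500 GiB. The following sketch expresses that rule in shell. The function name `system_disk_range` and the category keywords are hypothetical labels chosen for illustration.

```shell
# Hypothetical helper: print the minimum and maximum system disk size in GiB
# for an image category and image size, following the table in this topic.
system_disk_range() {
    image_category=$1   # linux, freebsd, coreos, redhat, or windows
    image_size_gib=$2   # integer size of the image in GiB

    case "$image_category" in
        linux)           floor=20 ;;
        freebsd|coreos)  floor=30 ;;
        redhat|windows)  floor=40 ;;
        *) echo "unknown image category: $image_category" >&2; return 1 ;;
    esac

    # The lower bound is max{floor, image size}; the upper bound is 500 GiB.
    if [ "$image_size_gib" -gt "$floor" ]; then
        minimum=$image_size_gib
    else
        minimum=$floor
    fi
    echo "$minimum 500"
}
```

For a Linux image of 15 GiB, `system_disk_range linux 15` prints `20 500`.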
    2. Optional: Add data disks.
      You can create data disks from scratch or from snapshots. A snapshot is a point-in-time backup of a disk. You can import data in a quick manner by creating a disk from a snapshot. When you add a data disk, you can encrypt the disk to meet data security and regulatory compliance requirements. For more information about data encryption, see Encryption overview.
      Note A limited number of data disks can be attached to a single instance. For more information, see the "Elastic Block Storage (EBS) limits" section of Limits.
    3. Optional: Add NAS file systems.
      If you have a large amount of data to share among multiple instances, we recommend that you use NAS file systems to reduce costs in data transmission and synchronization.

      Select an existing NAS file system or click Create a file system to create a NAS file system in the NAS console. For more information, see Create a General-purpose NAS file system in the NAS console. After a NAS file system is created, go back to the ECS instance creation wizard and click the refresh icon to query the NAS file system list. For information about how to mount NAS file systems, see Mount NAS file systems when you purchase an ECS instance.

  6. Optional: Configure the snapshot service.
    You can use automatic snapshot policies to periodically create snapshots to back up disk data and prevent risks such as accidental data deletion.

    Select an existing automatic snapshot policy or click Create Automatic Snapshot Policy to create an automatic snapshot policy on the Snapshots page. For more information, see Create an automatic snapshot policy. After an automatic snapshot policy is created, go back to the ECS instance creation wizard and click the refresh icon to query the automatic snapshot policy list.

Step 2: Complete the settings in the Networking step

In the Networking step, you can configure parameters to allow instances to access the Internet and other Alibaba Cloud resources. This ensures the security of your instances. After you complete the settings in the Networking step, click Next.

  1. Specify parameters in the Network Type and Public IP Address sections.
    Parameter | Description | References
    Network Type | Select VPC.

    A virtual private cloud (VPC) is a logically isolated virtual network in Alibaba Cloud. You have full control over VPCs that belong to you. For example, you can specify a CIDR block and configure route tables and gateways for the VPC.

    If you do not want to use a custom VPC or vSwitch in the specified region when you create an instance, you can skip this operation. Then, the system creates a default VPC and a default vSwitch.
    Note You can skip this operation only if no available VPCs exist in the region where the instance is deployed.

    Select an existing VPC and vSwitch. You can also click go to the VPC console to create a VPC and a vSwitch in the VPC console. After the VPC and the vSwitch are created, go back to the ECS instance creation wizard and click the refresh icon to view the VPC and the vSwitch that you created.

    Public IP Address | If you select an image of Windows 2008 R2 or earlier in the Basic Configurations step, you can select Assign Public IPv4 Address, or you can associate an elastic IP address (EIP) with the instance after the instance is created. This way, you can connect to the instance over other protocols such as the Remote Desktop Protocol (RDP) built into Windows, PC over IP (PCoIP), and XenDesktop HDX 3D. Otherwise, you cannot connect to the instance from a Virtual Network Console (VNC) client after the GPU driver is installed. A persistent black screen or startup interface appears when you attempt to connect to the instance.
    Note RDP does not support some applications such as DirectX and OpenGL applications. If you want to use these applications, you must manually install the VNC service and client.
    To assign a public IP address, perform the following operations:
    1. Select Assign Public IPv4 Address.
    2. Specify the Bandwidth Billing parameter.
      • Pay-By-Bandwidth: You are charged based on the specified bandwidth. This billing method is suitable for the scenarios that require stable network bandwidth.
      • Pay-By-Traffic: You are charged based on the traffic that you use. You can configure a peak bandwidth value to avoid excessive fees due to sudden traffic spikes. This billing method is suitable for scenarios that require highly variable bandwidth, such as the scenarios where traffic is low in most cases but spikes occasionally occur.
    3. Set Bandwidth or Peak Bandwidth based on your requirements.
    For more information, see What is EIP?
  2. Select security groups.
    A security group is a virtual firewall that is used to control the inbound and outbound traffic of instances in the security group. For more information, see Overview.

    If you do not want to configure security group-related parameters when you create an instance, you can skip the step. The system creates a default security group. The default security group allows inbound traffic over SSH port 22, Remote Desktop Protocol (RDP) port 3389, and Internet Control Message Protocol (ICMP). You can modify the security group configurations after the security group is created.

    1. To create a security group, click create a security group.
      For more information about how to configure a security group, see Create a security group.
    2. Click Reselect Security Group.
    3. In the Select Security Group dialog box, select one or more security groups and click Select.
  3. Configure ENIs.
    ENIs are classified into primary ENIs and secondary ENIs. Primary ENIs cannot be unbound from instances. They cannot be created or released independently of the instances to which they are bound. Secondary ENIs can be bound to or unbound from instances to allow traffic to be switched between instances. To create a secondary ENI when you create an instance, click the add-nic icon and select a vSwitch to which to connect the secondary ENI.
    Note You can bind only one secondary ENI when you create an instance. Alternatively, you can create secondary ENIs and bind them to an instance after the instance is created. For more information about the number of ENIs that can be bound to an instance of each instance type, see Overview of instance families.

Step 3: Complete the settings in the System Configurations step

In the System Configurations step, you can configure how the GPU-accelerated instance is displayed in the ECS console and in the OS, and how the instance is used. For example, you can configure the Logon Credentials, Host, and User Data parameters. After you complete the settings in the System Configurations step, click Next.

  1. Configure logon credentials.

    We recommend that you set the Logon Credentials parameter to Key Pair or Password. If you set the Logon Credentials parameter to Set Later, you must bind an SSH key pair or reset the password before you connect to the instance from a management terminal, and then restart the instance for the logon credentials to take effect. If you restart the instance while the GPU driver is being installed, the installation fails.

  2. Specify the instance name and description that you want to display in the ECS console. Specify the hostname that can be obtained from within the operating system. Configure whether to append incremental suffixes to the instance name and hostname.
    If you want to create multiple instances, you can set sequential instance names and hostnames to facilitate management. For more information about how to configure sequential instance names and hostnames, see Batch configure sequential names or hostnames for multiple instances.
  3. Configure advanced settings.
    1. Select an instance Resource Access Management (RAM) role.
      An ECS instance can assume an instance RAM role to obtain the permissions of the role. Then, the instance can securely make API requests to specific Alibaba Cloud services and manage specific Alibaba Cloud resources based on the Security Token Service (STS) temporary credentials of the role.

      Select an existing instance RAM role or click Create Instance RAM Role to create an instance RAM role in the RAM console. After an instance RAM role is created, go back to the ECS instance creation wizard and click the refresh icon to query the instance RAM role list. For more information, see Attach an instance RAM role.

    2. Select an instance metadata access mode.
      ECS instance metadata includes instance information in Alibaba Cloud. You can view the metadata of running instances and configure or manage the instances based on their metadata. For more information, see View instance metadata.
    3. Configure user data.
      User data can be run as scripts on instance startup to automate instance configurations, or can be passed to instances as regular data. For more information, see Manage the user data of Linux instances and Manage the user data of Windows instances.
      If you select Auto-install GPU Driver, Auto-install RDMA Software Stack, AIACC-Training, and AIACC-Inference in the Basic Configurations step, an automatic installation script appears in the lower part of the Advanced section. You can select Auto-install RDMA Software Stack only when you use an instance of the sccgn7ex instance family. The first time the instance is started after the instance is created, cloud-init runs the script.
      Note You can also customize an automatic installation script and import the script so that a GPU driver, an RDMA software stack, AIACC-Training, and AIACC-Inference can be automatically installed. For more information, see Configure an automatic installation script.

Step 4: (Optional) Complete the settings in the Grouping (Optional) step

In the Grouping (Optional) step, you can configure parameters such as Tags and Resource Group for easy search and management. After you complete the settings in the Grouping (Optional) step, click Next.

  1. Add tags.
    Each tag consists of a key and a value. You can add tags to resources that have identical characteristics, such as resources that belong to the same organization and resources that serve the same purpose. You can use tags to search for and manage resources in an efficient manner. For more information, see Overview.

    Select an existing tag, or enter a key and a value to create a tag.

  2. Select a resource group from the Resource Group drop-down list.
    Resource groups allow you to manage resources across regions or across services based on your business requirements and manage the permissions of resource groups. For more information, see Resource groups.

    Select an existing resource group, or click click here to create a resource group on the Resource Group page. After a resource group is created, go back to the ECS instance creation wizard and click the refresh icon to query the resource group list. For more information, see Create a resource group.

  3. Select a deployment set.
    Deployment sets support the high availability strategy. After you apply the high availability strategy to a deployment set, all the instances in the deployment set are distributed across different physical servers to ensure business availability and implement underlying disaster recovery.

    Select an existing deployment set or click manage the deployment set to create a deployment set. After a deployment set is created, go back to the ECS instance creation wizard and click the refresh icon to query the deployment set list. For more information, see Create a deployment set.

  4. Select a dedicated host.
    A dedicated host is a cloud host whose physical resources are exclusively reserved for a single tenant. Dedicated hosts meet strict security compliance requirements and support bring your own license (BYOL) when you migrate services to Alibaba Cloud.

    Select an existing dedicated host or click create a DDH to create a dedicated host. After the dedicated host is created, go back to the ECS instance creation wizard and click the refresh icon to query the dedicated host list. For more information, see Create a dedicated host.

  5. Select a private pool.
    After an elasticity assurance or a capacity reservation is created, the system generates a private pool to reserve resources for a specific number of instances that have specific attributes. During the validity period of the elasticity assurance or capacity reservation, you always have access to the resources reserved in the private pool when you want to create instances. For more information, see Overview.
    Note Only pay-as-you-go instances can be created from the resources reserved by elasticity assurances or capacity reservations.
    Private pool | Description
    Open | The capacity in open private pools takes priority over the capacity in the public pool. If no capacity is available in private pools, the system attempts to use the capacity in the public pool.
    None | The capacity in private pools is not used.
    Targeted | The capacity in a specified or open private pool is used to create instances. If no capacity is available in the specified private pool, the instances cannot be created.

Step 5: Complete the settings in the Preview step

Before the instance is created, make sure that all selected settings, such as the usage duration, meet your business requirements.

  1. Check the selected settings.
    To modify the settings in a step, click the edit icon to go to the relevant step. You can generate a template based on the selected settings. Then, you can use the template to create instances that have similar settings. The following table describes the buttons that you can use to generate the template.
    Button | Description | References
    Save as Launch Template | Saves the settings as a launch template. You can use the launch template to create instances without the need to configure the settings again. | Create an instance by using a launch template
    View Open API | Generates the API workflow and the SDK sample code for your reference.
    Save as ROS Template | Saves the settings as a Resource Orchestration Service (ROS) template. You can create stacks from the template to deliver resources in an efficient manner. | Create a stack
  2. Configure the usage duration of the instance.
    • Pay-as-you-go instance or preemptible instance: Specify an automatic release time for the instance. You can also manually release the instance or specify an automatic release time for the instance after the instance is created. For more information, see Release an instance.
    • Subscription instance: Specify the usage duration and specify whether to enable auto-renewal. You can also manually renew the instance or enable auto-renewal for the instance after the instance is created. For more information, see Renewal overview.
  3. Read ECS Terms of Service and Product Terms of Service. If you agree to them, select ECS Terms of Service and Product Terms of Service.
  4. In the lower part of the page, view the total fees of the instance, confirm the order, and then follow on-screen instructions to complete the payment.
    If you select Auto-install GPU Driver, the system installs the GPU driver after the instance is created. The installation takes about 10 to 20 minutes and varies based on the internal bandwidth and the number of vCPUs provided by different instance types. You can connect to the instance to view the installation process. You can view the installation logs in the /root/auto_install/auto_install.log file after the GPU driver is installed. The following table describes what is displayed during the installation process.
    Installation process | Display effect
    In progress | The installation progress bar appears.
    Installed | The installation result ALL INSTALL OK appears.
    Failed | The installation result INSTALL FAIL appears.
    Important When the GPU driver is being installed, the GPU is unavailable. You cannot perform operations or install other GPU-related software on the instance. This prevents an installation failure and ensures instance availability.
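To check the installation state from a script instead of watching the terminal, one option is to search the installation log for the result strings described above. The following is a minimal sketch. The function name `install_status` is hypothetical. It is also an assumption that the log file contains the same ALL INSTALL OK and INSTALL FAIL strings that are displayed on screen, and that a log with neither string means the installation is still in progress.

```shell
# Hypothetical helper: report the installation state based on the log text.
# The default log path and the two result strings come from this topic.
# Assumption: a readable log with neither string means "still in progress".
install_status() {
    log_file=${1:-/root/auto_install/auto_install.log}

    if [ ! -r "$log_file" ]; then
        echo "unknown"
        return 0
    fi
    if grep -q "INSTALL FAIL" "$log_file"; then
        echo "failed"
    elif grep -q "ALL INSTALL OK" "$log_file"; then
        echo "installed"
    else
        echo "in-progress"
    fi
}
```

Run as the root user on the instance, `install_status` with no argument reads the default log path. Other GPU-related work on the instance should wait until the function reports `installed`.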

Configure an automatic installation script

You can use the automatic installation script in the following scenarios:
  • You do not want to select Auto-install GPU Driver, Auto-install RDMA Software Stack, AIACC-Training, or AIACC-Inference in the Basic Configurations step, and you want to enter an automatic installation script in the System Configurations step.
  • You want to call the RunInstances operation to create a GPU-accelerated instance. In this case, you must upload an automatic installation script by specifying the UserData parameter.
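For the RunInstances scenario, the script is typically passed as a Base64-encoded string. That requirement is an assumption here and should be confirmed against the ECS API reference for the UserData parameter. The following sketch shows one way to produce a single-line Base64 string from a script file. The function name `encode_user_data` is hypothetical.

```shell
# Hypothetical helper: print the Base64 encoding of a script file on one line.
# Assumption: the UserData parameter expects Base64-encoded content.
encode_user_data() {
    # tr strips the line wraps that some base64 implementations insert.
    base64 < "$1" | tr -d '\n'
}
```

The resulting string could then be supplied as the value of the UserData parameter in the API call.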

To configure an automatic installation script and use the script to install a GPU driver when you create the instance, perform the following operations:

  1. Customize an automatic installation script.
    The automatic installation script contains the following content:
    #!/bin/sh

    # Versions and options to install. Fill in the values before use.
    IS_INSTALL_RDMA=""
    IS_INSTALL_AIACC_TRAIN=""
    IS_INSTALL_AIACC_INFERENCE=""
    DRIVER_VERSION=""
    CUDA_VERSION=""
    CUDNN_VERSION=""
    IS_INSTALL_RAPIDS="FALSE"

    INSTALL_DIR="/root/auto_install"

    # The driver and CUDA library are installed from .run packages.
    auto_install_script="auto_install.sh"

    # Build the download URL from the source address in the instance metadata.
    script_download_url=$(curl http://100.100.100.200/latest/meta-data/source-address | head -1)"/opsx/ecs/linux/binary/script/${auto_install_script}"
    echo "$script_download_url"

    # Create the working directory (-p tolerates an existing directory).
    mkdir -p "$INSTALL_DIR" && cd "$INSTALL_DIR"

    # Download the installer and run it with the selected versions and options.
    wget -t 10 --timeout=10 "$script_download_url" && sh "${INSTALL_DIR}/${auto_install_script}" $DRIVER_VERSION $CUDA_VERSION $CUDNN_VERSION $IS_INSTALL_AIACC_TRAIN $IS_INSTALL_AIACC_INFERENCE $IS_INSTALL_RDMA $IS_INSTALL_RAPIDS
    Note The automatic installation script uses the .run installation package to install modules, such as GPU drivers.
    You must add the following parameters to the script based on your business requirements.
    • Specify the versions of the GPU driver, CUDA library, and cuDNN library based on the selected instance family and image version. For more information, see Image versions and instance families supported for GPU drivers. Sample code:
      DRIVER_VERSION="470.82.01"
      CUDA_VERSION="11.4.1"
      CUDNN_VERSION="8.2.4"
    • Specify whether to install an RDMA software stack.
      Note You can install RDMA software stacks only when you use instances that belong to the sccgn7ex instance family.
      If you want to install an RDMA software stack, set the IS_INSTALL_RDMA parameter to TRUE. If you do not want to install an RDMA software stack, set the IS_INSTALL_RDMA parameter to FALSE. Sample code:
      IS_INSTALL_RDMA="TRUE"
    • Specify whether to install AIACC-Training and AIACC-Inference.
      • If you want to install AIACC-Training, set the IS_INSTALL_AIACC_TRAIN parameter to TRUE. If you do not want to install AIACC-Training, set the IS_INSTALL_AIACC_TRAIN parameter to FALSE.
      • If you want to install AIACC-Inference, set the IS_INSTALL_AIACC_INFERENCE parameter to TRUE. If you do not want to install AIACC-Inference, set the IS_INSTALL_AIACC_INFERENCE parameter to FALSE.
      Sample code:
      IS_INSTALL_AIACC_TRAIN="TRUE"
      IS_INSTALL_AIACC_INFERENCE="FALSE"
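Because a typo in one of these values only surfaces after the instance starts, a quick check before you paste the script can help. The following sketch validates the three TRUE or FALSE switches and the sccgn7ex restriction on RDMA software stacks described above. The function name `check_install_flags` and its argument order are hypothetical.

```shell
# Hypothetical pre-flight check for the option values described in this topic.
# Arguments: instance family, then the RDMA, AIACC-Training, and
# AIACC-Inference switches.
check_install_flags() {
    instance_family=$1
    rdma=$2
    aiacc_train=$3
    aiacc_inference=$4

    # Each switch must be exactly TRUE or FALSE.
    for flag in "$rdma" "$aiacc_train" "$aiacc_inference"; do
        case "$flag" in
            TRUE|FALSE) ;;
            *) echo "flag must be TRUE or FALSE: '$flag'" >&2; return 1 ;;
        esac
    done

    # RDMA software stacks are limited to the sccgn7ex instance family.
    if [ "$rdma" = "TRUE" ] && [ "$instance_family" != "sccgn7ex" ]; then
        echo "RDMA software stacks require the sccgn7ex instance family" >&2
        return 1
    fi
    return 0
}
```

For example, `check_install_flags gn7i TRUE FALSE FALSE` fails because gn7i is not the sccgn7ex instance family.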
  2. After the script is customized, enter the script in the field below User Data in the Advanced section in the System Configurations step.

    After the instance is started, the system installs the GPU driver, CUDA library, and cuDNN library. The system also determines whether to install the RDMA software stack, AIACC-Training, and AIACC-Inference based on the script that you entered. After the installation, the system restarts the instance for the GPU driver to run.

    Note The GPU driver in persistence mode is more stable. When you use the automatic installation script, the system enables the persistence mode for the GPU driver in Linux on instance startup. This ensures that the persistence mode is enabled for the GPU driver after the instance is restarted.
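To confirm that persistence mode is enabled after the restart, the persistence_mode field reported by nvidia-smi can be inspected. The following sketch assumes that nvidia-smi is on the PATH after the driver is installed. The function name `all_gpus_persistent` is hypothetical. The function reads one value per GPU from standard input so that it can also be tried without a GPU.

```shell
# Hypothetical helper: read one persistence-mode value per line (one line per
# GPU) from stdin. Succeeds only if at least one line is present and every
# line says "Enabled".
all_gpus_persistent() {
    awk '{ n++; if ($1 != "Enabled") bad = 1 } END { exit !(n > 0 && !bad) }'
}

# Usage on an instance with the driver installed (not executed here):
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_gpus_persistent \
#       && echo "persistence mode is enabled on all GPUs"
```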