FAQ

Last Updated: Sep 07, 2021

This topic provides answers to some frequently asked questions (FAQ) about Elastic High Performance Computing (E-HPC).

Why am I unable to create clusters in some regions?

In some cases, you cannot create a cluster in a region or zone where E-HPC is supported. This can happen for the following reasons:

  • Apsara File Storage NAS (NAS) file systems cannot be created or are insufficient in the region. In this case, you cannot mount a NAS file system on your cluster.

  • The region or zone does not have Elastic Compute Service (ECS) instance types that match the compute nodes in your E-HPC cluster, or ECS instances are insufficient.

We recommend that you select another region to create a cluster.

Can I use the ECS console to manage nodes?

No, you cannot use the ECS console to manage nodes.

Each node is an ECS instance. However, the E-HPC console provides additional deployment processes, including but not limited to the following processes:

  • E-HPC allows you to specify the number of cluster nodes for different instance types. In this case, E-HPC creates ECS instances in batches for various types of nodes.

  • After the ECS instance that corresponds to each node is created, E-HPC deploys the management system.

  • E-HPC pre-installs the selected software and dependencies on your ECS instances by using the management system.

  • E-HPC configures a job scheduler on a management node.

All the preceding processes depend on E-HPC. If you use the ECS console to manage nodes, exceptions may occur in the clusters or nodes. In addition, the cluster resources may be unavailable. Therefore, you cannot use the ECS console to manage nodes.

How do nodes communicate with ECS instances over the internal network?

If your node and the ECS instance that you purchase are in the same virtual private cloud (VPC), they can communicate with each other in the VPC.

Why am I unable to log on to a cluster by using SSH?

An SSH logon failure can have several causes. Troubleshoot the issue by performing the following checks:
  1. Check whether the username or password is valid.

  2. Check whether the on-premises network of the client or carrier network is connected.

  3. Check whether the security group rule of the logon node allows access to the required ports, for example, the default port 22 of SSH.

  4. Run the iptables -nvL --line-number command to check whether a firewall is enabled or firewall rules are configured on the logon node.

If the problem persists, you can use Virtual Network Computing (VNC) to connect to the cluster. For more information, see FAQ about connecting to ECS instances.
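The security group and firewall checks above can be complemented by a reachability test from the client. The following Python sketch is one possible way to do this, not an E-HPC tool: the function name tcp_port_open and its parameters are hypothetical, and it only reports whether a TCP connection to the SSH port can be opened.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A False result only shows that the port is unreachable from this client.
    It does not say whether a security group rule, a firewall rule, or the
    network path is responsible.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, calling tcp_port_open with the public IP address of your logon node and port 22 returns False if port 22 cannot be reached from your client. A True result means that the remaining causes, such as an invalid username or password, are more likely.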

Why do I fail to set the number of scale-out compute nodes for a cluster that uses Slurm when I configure the auto scaling policy?

By default, a cluster that uses Slurm has eight dummy nodes. If the current cluster has five compute nodes, a job can use up to 13 nodes. If you need to use more nodes to run the job, you must manually add compute nodes or increase the number of dummy nodes. To increase the number of dummy nodes, perform the following steps:
  1. Log on to the cluster as a root user.

    For more information, see Log on to a cluster.

  2. Add the dummynodexxx file to the /opt/slurm/<slurm_version>/nodes directory.

    For example, to run a job that requires 18 nodes, you need to add 10 dummy nodes. You can name the new dummy nodes dummynode8 through dummynode17.

    Note

    <slurm_version> is the version of Slurm in your cluster.

  3. In the /opt/slurm/<slurm_version>/etc/slurm.conf file, find PartitionName and add dummy node information.

    The following code shows how to add dummy node information:

    PartitionName=comp Nodes=dummynode0,dummynode1,dummynode2,dummynode3,dummynode4,dummynode5,dummynode6,dummynode7,dummynode8,dummynode9,dummynode10,dummynode11,dummynode12,dummynode13,dummynode14,dummynode15,dummynode16,dummynode17,compute000 Default=YES MaxTime=INFINITE State=UP
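
The node arithmetic and the PartitionName line above can be sketched in Python. This is an illustrative helper under stated assumptions, not part of E-HPC or Slurm: the function names max_job_nodes and partition_line are hypothetical, and the partition name comp and the node name compute000 are taken from the example above.

```python
from typing import List


def max_job_nodes(compute_nodes: int, dummy_nodes: int = 8) -> int:
    """Upper bound on the nodes a job can use: compute nodes plus dummy nodes."""
    return compute_nodes + dummy_nodes


def partition_line(dummy_count: int, compute_nodes: List[str],
                   partition: str = "comp") -> str:
    """Build a PartitionName line that lists dummynode0..dummynode<dummy_count - 1>.

    The trailing options (Default, MaxTime, State) are copied from the
    example above. Adjust them to match your own slurm.conf file.
    """
    dummies = [f"dummynode{i}" for i in range(dummy_count)]
    nodes = ",".join(dummies + list(compute_nodes))
    return (f"PartitionName={partition} Nodes={nodes} "
            "Default=YES MaxTime=INFINITE State=UP")


# Five compute nodes plus the eight default dummy nodes.
print(max_job_nodes(5))
# The line for 18 dummy nodes and one compute node.
print(partition_line(18, ["compute000"]))
```

Generating the line instead of typing it reduces the risk of skipping or repeating a dummy node name in a long list.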

How do I perform real-name verification to purchase Alibaba Cloud services in mainland China?

If you want to purchase and use Alibaba Cloud services in mainland China, you must complete real-name verification. Then, you can use existing resources, purchase resources, and renew resources. If you select a region in mainland China when you purchase Alibaba Cloud services, the system checks whether you have completed the real-name verification. If you have not completed the real-name verification, an error message appears on the buy page. For more information, see Real-name Registration FAQs.

How do I configure a remote mount directory for a NAS file system?

When you create a cluster, you can specify a mount target and remote mount directory for a NAS file system. For example, your cluster has the following settings:

ClusterId=ehpc-mrZSoWf****      # The ID of the cluster.
VolumeMountpoint=045324a6dd-m****.cn-hangzhou.nas.aliyuncs.com # The mount target of the NAS file system.
RemotePath=/          # The remote directory. In this example, the root directory of the NAS file system.

To mount the NAS file system on the nodes when you create a cluster, perform the following steps.

Note

You can specify a remote mount directory based on your needs. Before you mount a NAS file system, you must create a mount target and mount directory.

  1. Create two subdirectories in the root directory.

    /ehpc-mrZSoWf****/opt
    /ehpc-mrZSoWf****/home
  2. When or after you create a cluster, set the remote directory based on your needs.

    The following examples show remote directories that you can set and the directory on which each one is mounted:

    /     #Mount the NAS file system on the /ehpcdata directory.
    /ehpc-mrZSoWf****/home    #Mount the NAS file system on the /home directory.
    /ehpc-mrZSoWf****/opt    #Mount the NAS file system on the /opt directory.
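
The mapping between remote directories and mount points in the preceding example can be written as a small Python sketch. The function name remote_directories is hypothetical, and the mapping only restates the three cases listed above. It does not call any E-HPC or NAS API.

```python
def remote_directories(cluster_id: str) -> dict:
    """Map each remote NAS directory to the directory it is mounted on.

    The layout follows the example above: per-cluster home and opt
    subdirectories under the NAS root directory.
    """
    return {
        "/": "/ehpcdata",
        f"/{cluster_id}/home": "/home",
        f"/{cluster_id}/opt": "/opt",
    }


# The cluster ID is the masked ID from the example, used as a placeholder.
for remote, local in remote_directories("ehpc-mrZSoWf****").items():
    print(f"{remote} -> {local}")
```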

What is the maximum number of clusters that I can create in a region?

You can create up to three clusters in a region. To increase the quota, submit a ticket.

What is the maximum number of nodes that I can create in a cluster?

You can create up to 500 nodes in a cluster or add 500 compute nodes at a time. To increase the quota, submit a ticket.

Why am I unable to select a custom image when I use a cluster?

When you create or scale out a cluster, or configure an auto scaling policy, you may be unable to select a custom image due to the following reasons:

  • Your Alibaba Cloud account does not have a custom image in the current region. For more information, see Overview.

  • The operating system of the custom image is not supported by E-HPC. The following table lists the operating systems supported by E-HPC.

    Operating system    Version
    CentOS              • CentOS_6.9_64
                        • CentOS_7.2_64
                        • CentOS_7.3_64
                        • CentOS_7.4_64
                        • CentOS_7.5_64
                        • CentOS_7.6_64
                        • CentOS_8.0_64
    Windows Server      • Windows Server 2019 Data Center Edition 64bit Chinese Edition
                        • Windows Server 2019 Data Center Edition 64bit English Edition
                        • Windows Server 2016 Data Center Edition 64bit Chinese Edition
                        • Windows Server 2016 Data Center Edition 64bit English Edition
                        • Windows Server 2012 R2 Data Center Edition 64bit Chinese Edition
                        • Windows Server 2012 R2 Data Center Edition 64bit English Edition
                        • Windows Server 2008 R2 Enterprise 64bit Chinese Edition
                        • Windows Server 2008 R2 Enterprise 64bit English Edition

  • The custom image is not based on an image provided by Alibaba Cloud. E-HPC supports only custom images that are modified from images provided by Alibaba Cloud.

  • When you configure the auto scaling policy, the image specified in the global settings and that specified in the queue settings must be the same.

Why do I fail to scale out a cluster by using a custom image?

When you scale out a cluster, you can select a custom image. However, the scale-out may fail. Take note of the following limits:

  • You cannot modify the yum source configurations of the operating system in the custom image.

  • The mount directory of the custom image cannot be the /home directory or /opt directory.

  • You must keep the group whose group ID (GID) is 1000 in the custom image.
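
Two of the limits above can be checked before you create a custom image. The following Python sketch is a rough pre-check under stated assumptions, not an E-HPC validation tool: the function names has_gid and conflicting_mounts are hypothetical, the group data is assumed to be in the standard /etc/group format, and the list of mount points is something you supply yourself.

```python
def has_gid(group_file_text: str, gid: int = 1000) -> bool:
    """Return True if /etc/group-style text contains a group with the given GID."""
    for line in group_file_text.splitlines():
        # /etc/group format: name:password:GID:member_list
        fields = line.strip().split(":")
        if len(fields) >= 3 and fields[2] == str(gid):
            return True
    return False


def conflicting_mounts(mount_points, reserved=("/home", "/opt")):
    """Return the mount points that are /home or /opt (trailing slashes ignored)."""
    return [m for m in mount_points if m.rstrip("/") in reserved]
```

On the instance that you use to build the image, you might pass the content of /etc/group to has_gid and the mount points that you configured to conflicting_mounts. A False result from has_gid or a non-empty list from conflicting_mounts points to one of the limits above.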

What is role authorization?

Resource Access Management (RAM) provides a service-linked role named AliyunServiceRoleForEHPC for E-HPC. This role is used to authorize E-HPC to access associated cloud resources. E-HPC can assume the AliyunServiceRoleForEHPC role to access ECS, VPC, and NAS.

If the AliyunServiceRoleForEHPC role has not been attached to your account, you must first complete the role authorization. For more information, see Manage a service-linked role.

Why am I unable to log on to the console to view cluster information as a RAM user?

If the RAM user is not granted the AliyunEHPCReadOnlyAccess permission, the "Switch to RAM for authorization" message appears. Grant the AliyunEHPCReadOnlyAccess permission to the RAM user. The RAM user can then view cluster information.

To create a cluster, user, or job, you must grant the AliyunEHPCFullAccess and AliyunNASFullAccess permissions to the RAM user. For more information, see Grant permissions to a RAM user.
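
The permission requirements above can be summarized as a lookup. This Python sketch only restates the policy names from this topic. The action labels view and create and the function name missing_policies are hypothetical, and the sketch does not query RAM.

```python
# Policies that this topic names for each kind of operation.
REQUIRED_POLICIES = {
    "view": {"AliyunEHPCReadOnlyAccess"},
    "create": {"AliyunEHPCFullAccess", "AliyunNASFullAccess"},
}


def missing_policies(action: str, attached) -> list:
    """Return the policies required for the action that the RAM user lacks.

    The comparison is literal. This sketch does not assume that a full
    access policy also satisfies the read-only requirement.
    """
    return sorted(REQUIRED_POLICIES[action] - set(attached))


print(missing_policies("create", ["AliyunEHPCFullAccess"]))
```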