Elastic High Performance Computing: FAQ

Last Updated: Jan 29, 2023

This topic provides answers to some frequently asked questions (FAQ) about Elastic High Performance Computing (E-HPC).

Why am I unable to create clusters in some regions?

In some cases, you cannot create an E-HPC cluster in a region or zone where E-HPC is supported due to the following reasons:

  • In the current region, Apsara File Storage NAS (NAS) file systems cannot be created or no NAS file system is available. In this case, you cannot mount a NAS file system on your cluster.

  • The region or zone does not have Elastic Compute Service (ECS) instance types that match the compute nodes in your E-HPC cluster, or ECS instances of the required types are in insufficient supply.

We recommend that you select another region to create an E-HPC cluster.

Can I use the ECS console to manage nodes?

No, you cannot use the ECS console to manage nodes.

Each node is an ECS instance. However, the E-HPC console provides additional deployment processes, including but not limited to the following processes:

  • E-HPC allows you to specify the number of cluster nodes for each instance type. E-HPC then creates ECS instances in batches for the various node types.

  • After the ECS instance that corresponds to each node is created, E-HPC deploys the management system.

  • E-HPC pre-installs the selected software and dependencies on your ECS instances by using the management system.

  • E-HPC configures a job scheduler on a management node.

All the preceding processes depend on E-HPC. If you use the ECS console to manage nodes, exceptions may occur in the clusters or nodes. In addition, the cluster resources may be unavailable. Therefore, you cannot use the ECS console to manage nodes.

How do nodes communicate with ECS instances over the internal network?

If your node and the ECS instance that you purchase are in the same virtual private cloud (VPC), they can communicate with each other in the VPC.

Why am I unable to log on to an E-HPC cluster by using SSH?

You may be unable to log on to an E-HPC cluster by using SSH due to various reasons. Troubleshoot the issue based on the actual situation.

  1. Check whether the username and password are valid.

  2. Check whether the on-premises network of the client or the carrier network is connected.

  3. Check whether the security group rule of the logon node allows access to the required ports, for example, the default SSH port 22.

  4. Run the iptables -nvL --line-numbers command to check whether a firewall is enabled or firewall rules are configured on the logon node, as shown in the sketch below.
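
The following commands are a minimal troubleshooting sketch for steps 3 and 4. The <logon_node_ip> placeholder and the firewalld service name are examples for illustration; adapt them to your client and cluster.

ssh -v root@<logon_node_ip>      # Verbose output shows at which stage the connection stalls.
iptables -nvL --line-numbers     # List firewall rules with rule numbers on the logon node.
systemctl status firewalld       # Check whether a firewalld service is running on the logon node.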

If the issue persists, you can use Virtual Network Computing (VNC) to connect to the cluster. For more information, see FAQ about connecting to ECS instances.

Why does it take so long to log on to an E-HPC cluster that uses NIS by using SSH?

Issue

It takes a long time to log on to a node by using SSH or to switch between nodes. Occasionally, the logon fails. In addition, sshd cannot be restarted, and the error message Failed to activate service 'org.freedesktop.systemd1': timed out is returned.

Cause

This is caused by a systemd bug that may occur when NIS is used.

Solution

  1. Log on to the node as a root user.

  2. Check the content of the /etc/nsswitch.conf file:

    cat /etc/nsswitch.conf

    If [NOTFOUND=return] is not appended to the passwd, shadow, and group entries, as shown in the following example, proceed with the following steps:

    passwd:     files sss nis
    shadow:     files sss nis
    group:      files sss nis
  3. Optional. Upgrade glibc:

    yum update glibc
  4. Update the nsswitch configuration file.

    1. Open the nsswitch.conf file:

      vim /etc/nsswitch.conf
    2. Modify the following content in the nsswitch.conf file and save the file:

      passwd:     files sss nis [NOTFOUND=return]
      shadow:     files sss nis [NOTFOUND=return]
      group:      files sss nis [NOTFOUND=return]
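
If you prefer to apply the change non-interactively, the following sed command is a minimal sketch. It assumes that the passwd, shadow, and group entries end with nis, as shown above, and it keeps a backup of the original file as /etc/nsswitch.conf.bak.

# Append [NOTFOUND=return] to the passwd, shadow, and group entries.
sed -i.bak -E 's/^((passwd|shadow|group):.*nis)[[:space:]]*$/\1 [NOTFOUND=return]/' /etc/nsswitch.conf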

Why am I unable to set the number of scale-out compute nodes for an E-HPC cluster that uses Slurm when I configure the auto scaling policy?

By default, an E-HPC cluster that uses Slurm has eight dummy nodes. If the current cluster has five compute nodes, a job can use a maximum of 13 nodes. If you need to use more nodes to run the job, you must manually add compute nodes or increase the number of dummy nodes. To increase the number of dummy nodes, perform the following steps:

  1. Log on to the cluster as a root user.

    For more information, see Log on to an E-HPC cluster.

  2. Add the dummynodexxx file to the /opt/slurm/<slurm_version>/nodes directory.

    For example, to run a job that requires 18 nodes, you need to add 10 dummy nodes. You can use dummynode8 to dummynode17 to indicate the dummy nodes that you want to add.

    Note

    <slurm_version> is the version of Slurm in your cluster.

  3. In the /opt/slurm/<slurm_version>/etc/slurm.conf file, find PartitionName and add dummy node information.

    The following code shows how to add dummy node information (a shell sketch of steps 2 and 3 is provided after this procedure):

    PartitionName=comp Nodes=dummynode0,dummynode1,dummynode2,dummynode3,dummynode4,dummynode5,dummynode6,dummynode7,dummynode8,dummynode9,dummynode10,dummynode11,dummynode12,dummynode13,dummynode14,dummynode15,dummynode16,dummynode17,compute000 Default=YES MaxTime=INFINITE State=UP
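
The following shell sketch combines steps 2 and 3 for this example. The assumption that empty dummynodexxx files are sufficient is not stated in this topic, and scontrol reconfigure is a standard Slurm command for reloading the configuration; verify both against your cluster before you use the sketch.

# Create the dummy node files (assumption: empty files are sufficient).
for i in $(seq 8 17); do
    touch /opt/slurm/<slurm_version>/nodes/dummynode${i}
done
# After PartitionName in slurm.conf has been updated, reload the scheduler configuration.
scontrol reconfigure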

How do I perform real-name verification to purchase Alibaba Cloud services in the Chinese mainland?

If you want to purchase and use Alibaba Cloud services in the Chinese mainland, you must complete real-name verification. Then, you can use existing resources, purchase resources, and renew resources. If you select a region in the Chinese mainland when you purchase Alibaba Cloud services, the system checks whether you have completed real-name verification. If you have not completed real-name verification, an error message appears on the buy page. For more information, see Real-name verification.

How do I configure a remote mount directory for a NAS file system?

When you create an E-HPC cluster, you can specify a mount target and remote mount directory for a NAS file system. For example, your cluster has the following settings:

ClusterId=ehpc-mrZSoWf****      # The ID of the cluster.
VolumeMountpoint=045324a6dd-m****.cn-hangzhou.nas.aliyuncs.com # The mount target of the NAS file system.
RemotePath=/          # The root directory of the remote directory.

To mount the NAS file system on the nodes when you create an E-HPC cluster, perform the following steps.

Note

You can specify a remote mount directory based on your needs. Before you mount a NAS file system, you must create a mount target and mount directory.

  1. Create two subdirectories in the root directory (a mount-and-mkdir sketch is provided after these steps):

    /ehpc-mrZSoWf****/opt
    /ehpc-mrZSoWf****/home
  2. When or after you create an E-HPC cluster, set the remote directory based on your needs.

    The following examples show the local directory on which the NAS file system is mounted for each remote directory setting. For more information, see Create an E-HPC cluster by using the wizard and Manage storage resources.

    /     # Mount the NAS file system on the /ehpcdata directory.
    /ehpc-mrZSoWf****/home    # Mount the NAS file system on the /home directory.
    /ehpc-mrZSoWf****/opt    # Mount the NAS file system on the /opt directory.
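
The following commands are a minimal sketch of how the two subdirectories in step 1 can be created from an ECS instance that can reach the mount target. The NFS mount options are the commonly used options for NAS NFSv3 file systems and are an assumption here; /mnt/nas-tmp is a temporary directory used only for illustration.

mkdir -p /mnt/nas-tmp
# Temporarily mount the root directory of the NAS file system.
mount -t nfs -o vers=3,nolock,proto=tcp,noresvport 045324a6dd-m****.cn-hangzhou.nas.aliyuncs.com:/ /mnt/nas-tmp
# Create the subdirectories that are used as remote mount directories for the cluster.
mkdir -p /mnt/nas-tmp/ehpc-mrZSoWf****/opt /mnt/nas-tmp/ehpc-mrZSoWf****/home
umount /mnt/nas-tmp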

How do I manually install business software in an E-HPC cluster?

E-HPC clusters use Apsara File Storage NAS to share data between compute nodes. Therefore, you can manually install business software by using one of the following methods:

  • Install business software in the /opt directory. In this case, all cluster users can access and use the business software.

  • Install business software in the home directory of a cluster user. Generally, only the cluster user can access and use the business software.
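
As a minimal sketch, the following commands install a hypothetical package named mysolver under the shared /opt directory so that all cluster users can run it. The package name, archive path, and directory layout are examples only.

# Unpack a hypothetical software package into the shared /opt directory.
tar -xzf /tmp/mysolver-1.0.tar.gz -C /opt
# Each cluster user can then add the binaries to the PATH, for example in ~/.bashrc.
echo 'export PATH=/opt/mysolver-1.0/bin:$PATH' >> ~/.bashrc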

Important

When you install some software, you must also install drivers or specific runtime environments, such as GPU drivers and YUM packages, on each compute node. After you install software on a compute node, you can use the custom image that is created based on the compute node to add more compute nodes. This way, the software can be automatically installed on all compute nodes.

What is the maximum number of clusters that I can create?

You can create a maximum of three clusters in a region. To increase the quota, submit a ticket.

What is the maximum number of nodes that I can create in an E-HPC cluster?

You can create a maximum of 500 nodes in an E-HPC cluster or add 500 compute nodes at a time. To increase the quota, submit a ticket.

What types of images are supported?

E-HPC supports public images, custom images, shared images, Marketplace images, and community images. The image types that you can select depend on the specified region and whether the current Alibaba Cloud account has available image resources.

Why am I unable to select a custom image?

When you create or scale out an E-HPC cluster, or configure an auto scaling policy, you may be unable to select a custom image due to the following reasons:

  • Your Alibaba Cloud account does not have a custom image in the current region. For more information, see Overview.

  • The operating system of the custom image is not supported by E-HPC.

  • The custom image is not modified from an image provided by Alibaba Cloud. E-HPC allows you to modify only images provided by Alibaba Cloud.

  • When you configure the auto scaling policy, you must specify the same image in the global settings and the queue settings.

Why am I unable to scale out or create an E-HPC cluster by using a custom image?

When you scale out or create an E-HPC cluster by using a custom image, the cluster may fail to be scaled out or created due to the following limits:

  • You cannot modify the yum source configurations of the operating system in the custom image.

  • In the custom image, the /home and /opt directories cannot be mount points or symbolic links.

  • If the fstab file in the /etc directory of the custom image contains the mounting information about a file system, make sure that the cluster can access the file system or resides in the same VPC as the file system. Otherwise, you must delete the mounting information from the fstab file before you scale out or create the cluster. A quick check is sketched after this list.

  • The custom image must retain the group whose group ID (GID) is 1000.

  • The size of the system disk must be greater than or equal to that of the custom image.
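
The following commands are a minimal sketch for checking the fstab and GID-1000 limits on a running instance of the custom image before you use the image. They only inspect the system and make no changes.

# List non-comment fstab entries; remove entries for file systems that the cluster cannot reach.
grep -vE '^[[:space:]]*(#|$)' /etc/fstab
# Confirm that a group with GID 1000 still exists in the image.
getent group 1000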

What is role-based authorization?

Resource Access Management (RAM) provides a service-linked role named AliyunServiceRoleForEHPC for E-HPC. This role is used to authorize E-HPC to access associated cloud resources. E-HPC can assume the AliyunServiceRoleForEHPC role to access ECS, VPC, and NAS.

If the AliyunServiceRoleForEHPC role has not been attached to your account, you must first complete role authorization. For more information, see Manage a service-linked role.

Why am I unable to log on to the console to view cluster information as a RAM user?

If the RAM user is not granted the AliyunEHPCReadOnlyAccess permission, the Switch to RAM for authorization message appears. You must grant the AliyunEHPCReadOnlyAccess permission to the RAM user. Then, you can view cluster information as a RAM user.

To create an E-HPC cluster, user, or job, you must grant the AliyunEHPCFullAccess and AliyunNASFullAccess permissions to the RAM user. For more information, see Grant permissions to a RAM user.
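
If you manage permissions from the command line, the following Alibaba Cloud CLI call is a sketch of attaching the read-only policy to a RAM user. The RAM user name is a placeholder; verify the command against the current CLI and RAM API documentation before you use it.

# Attach the system policy AliyunEHPCReadOnlyAccess to a RAM user.
aliyun ram AttachPolicyToUser --PolicyType System --PolicyName AliyunEHPCReadOnlyAccess --UserName <ram_user_name>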