E-MapReduce: FAQ about cluster management

Last Updated: Mar 15, 2025

This topic provides answers to some frequently asked questions about cluster management.

Can I upgrade an EMR cluster?

No, it is not possible to upgrade an E-MapReduce (EMR) cluster or the services deployed within it. To obtain an upgraded EMR cluster or services, you must release the current cluster and create a new one.

What services do EMR clusters support?

The range of services you can add to an EMR cluster depends on the EMR version. For more information, see the referenced document.

Can I add the Zeppelin component in the console?

EMR does not support adding Zeppelin as a new service in the console. To use Zeppelin, install it on the ECS instance of any master node. Other components that cannot be added in the console can likewise be installed and managed directly on the underlying ECS instances. For detailed information on cluster scenarios and the services that can be integrated, see the referenced document.

Do EMR clusters support Oozie? What do I do if Oozie is not supported and I want to use the service?

DataLake clusters in EMR V5.8.0 or later, along with EMR V3.42.0 or later, do not include Oozie. If you require a workflow scheduling service, EMR Workflow is available as an alternative. For more information, see the referenced document.

Why does a high-availability EMR cluster have three master nodes?

A high-availability EMR cluster is more reliable with three master nodes than with two. Clusters with only two master nodes are no longer supported due to their reduced reliability.

To enhance reliability, EMR high-availability clusters distribute master nodes across distinct hardware, mitigating the risk of simultaneous failures.

How do I enable data disk encryption? What are the impacts after I enable data disk encryption?

You can enable data disk encryption in the Advanced Configuration section of the Basic Configuration step on the Create Cluster page. For more information, see the referenced document.

Important

Data disk encryption can only be enabled when creating a cluster; it is not possible to enable it for an existing cluster.

Once a data disk is encrypted, both the data in transit and the data at rest on the disk are encrypted. The data disk encryption feature is beneficial for businesses with security compliance requirements. It is transparent to applications at the operating system level of ECS instances and does not impact job execution.

How do I release a cluster that fails to be created?

Typically, an EMR cluster creation fails due to incorrect RDS configurations or because the selected ECS instance type is not available.

If any ECS instances were created but the cluster status indicates 'Startup Failed', you must visit the ECS console to release those ECS instances. Once released, the system will automatically release the EMR cluster.

Should the EMR deployment fail with the cluster status showing 'Abnormal End', the cluster will not hold any resources or charges. You can click Delete in the Actions column next to the desired cluster to remove it.

Can I add services to an EMR cluster after I create the cluster?

EMR allows the addition of certain services post-cluster creation that were not initially installed. For more information, see the referenced document.

Important
  • After adding services, you may need to manually adjust the configurations of some existing services and restart them for the changes to take effect. It is recommended to add services during off-peak hours.

  • The services that can be added to an existing cluster depend on the EMR version. The EMR console displays the specific services available for addition to your cluster.

Do I need to restart a service after I modify the configurations of the service?

When you change server-side configurations in an EMR cluster, such as those for Spark, Hive, or the Hadoop Distributed File System (HDFS), a service restart is required for the changes to take effect. However, if you alter client-side configurations, you can click Deploy Client Configuration to apply the changes without needing to restart the service. For more information on how to modify or add configuration items, see the referenced document.

What is rolling restart?

A rolling restart is a process where the system sequentially restarts each ECS instance. It ensures that the previous ECS instance has been restarted and all big data services running on it have been restored before proceeding to the next instance. Typically, restarting an ECS instance takes about five minutes.
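
The sequential behavior described above can be sketched as a simple loop. This is a minimal illustration, not the EMR implementation: `restart` and `is_recovered` are hypothetical callbacks that the caller supplies, and the polling interval and timeout are assumed values.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    instances: Iterable[str],
    restart: Callable[[str], None],
    is_recovered: Callable[[str], bool],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 900.0,
) -> List[str]:
    """Restart instances one at a time, waiting for each to recover.

    `restart` and `is_recovered` are hypothetical callbacks; they are not
    part of any EMR or ECS API.
    """
    done: List[str] = []
    for instance in instances:
        restart(instance)
        deadline = time.monotonic() + timeout_seconds
        # Do not touch the next instance until this one is fully restored.
        while not is_recovered(instance):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{instance} did not recover in time")
            time.sleep(poll_seconds)
        done.append(instance)
    return done
```

Because each instance is handled strictly after the previous one has recovered, the total duration grows linearly with the number of instances. At about five minutes per instance, a 12-node group would take roughly an hour.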

How do I associate a public IP address with an ECS instance after I create an EMR cluster?

You can apply for an EIP address and associate it with a VPC-type instance that has not been assigned a public IP address, enabling the ECS instance to be accessed via the Internet. For more information, see the referenced document.

In which scenarios do I need to turn on Add to Deployment Set?

Alibaba Cloud ECS offers the deployment set feature to manage the distribution of ECS instances. For clusters whose core nodes are ECS instances with local disks, it is advisable to enable 'Add to Deployment Set' to enhance data security. By doing so, you ensure that multiple ECS instances are not deployed on the same physical machine, safeguarding against data loss on local HDFS in the event of physical machine failure.

A deployment set can include up to 20 ECS instances. For more information, see the referenced document.

How do I configure the Add to Deployment Set parameter when I scale out an EMR cluster?

By default, the Add to Deployment Set feature is enabled for ECS instances with local disks and disabled for ECS instances with other types of disks. You can choose to enable or disable the Add to Deployment Set feature according to your business needs. For more information on enabling the Add to Deployment Set feature, see the referenced document.

How do I specify the disk size when I scale out an EMR cluster?

When scaling out an EMR cluster, the disk size for new nodes will match that of the existing nodes within the node group and cannot be changed. However, you can expand the node group's disk capacity to meet your business needs. For more information on disk expansion, see the referenced document.

Can I resize the disks of EMR clusters?

You can expand the data disks in EMR clusters, but you cannot reduce their size. System disks cannot be resized.

To expand the data disks of a node group, navigate to the Node Management tab of the desired cluster and click Expand Disk for the specific node group. For more information, see the referenced document.

Can I scale out or scale in an EMR cluster?

Yes, you can scale out or scale in an EMR cluster, but the procedures differ depending on the type of node:

  • Scale-out: You are permitted to add core and task nodes only. The new nodes will have the same configuration as the existing nodes in their respective node groups by default. It is essential to complete the payment for your order when scaling out; otherwise, the process will fail. For more information on scaling out a cluster, see the referenced document.

  • Scale-in: The rules for scaling in also depend on the node type. The applicable rules are as follows:

How do I modify the configurations of ECS instances in a node group?

To modify the configurations of ECS instances within a node group of a subscription EMR cluster, you can upgrade the node group's specifications. However, it is not possible to downgrade the specifications of a node group.

To upgrade the specifications of a node group, navigate to the Node Management tab of the desired subscription cluster. There, select either the master node group or core node group and click more > Upgrade Configuration to proceed with the upgrade. For more information, see the referenced document.

What do I do if the "The specified parameter AddNumber is not valid" error message appears when I scale out a cluster?

  • Problem Description: When attempting to scale out a cluster, the following error message is displayed: The specified parameter AddNumber is not valid. add instances number :xxx larger than deploymentSet availableAmount: xxx deploymentSetId: ds-uf6gwfou0a13kekupt14xxxx.

  • Cause: This error message suggests that the deployment set feature is active for your cluster, and the number of nodes in the node group has reached the deployment set's maximum limit. For more information about deployment sets, see the referenced document.

  • Solution: To increase the maximum number of nodes allowed in a deployment set for your account, please contact ECS technical support.
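
The check behind this error can be approximated before you submit a scale-out request. The sketch below assumes the default limit of 20 ECS instances per deployment set mentioned earlier in this topic; the function names are illustrative, and the real limit for your account may differ if it has been raised.

```python
def available_slots(current_nodes: int, deployment_set_limit: int = 20) -> int:
    """Return how many more instances fit in the deployment set."""
    return max(deployment_set_limit - current_nodes, 0)


def scale_out_fits(add_number: int, current_nodes: int,
                   deployment_set_limit: int = 20) -> bool:
    """True if adding `add_number` nodes stays within the deployment set limit."""
    return add_number <= available_slots(current_nodes, deployment_set_limit)


if __name__ == "__main__":
    # A node group with 18 nodes has room for 2 more under the assumed limit.
    print(available_slots(18))
    print(scale_out_fits(5, 18))
```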

How do I disable collection of service operational logs?

To stop EMR from collecting your data, you can turn off the service operational logs collection feature.

Important

Disabling service operational logs collection will limit the EMR cluster health check and technical support. Other cluster features will still be accessible. Exercise caution when proceeding.

Solution:

  1. Disable collection of service operational logs

    • When creating a cluster: In the Software Configuration step of the Create Cluster page, click Allow Collection Of Service Operational Logs.

    • After the cluster is created: On the Basic Information page of the target cluster, in the Software Information section, click Service Run Log Collection Status.

  2. Verify that service operational logs collection is disabled.

    Check for the presence of namenode-log information in the /usr/local/ilogtail/user_log_config.json file. If it is absent, the log collection is disabled.

    Note

    After you disable log collection, the change takes about 2 to 3 minutes to take effect.
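
The verification step can also be scripted. The following sketch only searches the file named above for the string namenode-log; it assumes that a plain substring match is sufficient and makes no assumption about the JSON structure of the file.

```python
from pathlib import Path

# Path taken from the verification step above.
CONFIG_PATH = "/usr/local/ilogtail/user_log_config.json"


def log_collection_disabled(config_path: str = CONFIG_PATH,
                            marker: str = "namenode-log") -> bool:
    """Return True if the marker string does not appear in the config file."""
    text = Path(config_path).read_text(encoding="utf-8", errors="replace")
    return marker not in text


if __name__ == "__main__":
    print("disabled" if log_collection_disabled() else "still enabled")
```

If the script still reports that collection is enabled, wait a few minutes and run it again, because the change is not applied immediately.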

What type of data do service operational logs contain?

Service operational logs exclusively capture information related to the activities of service components within a cluster. You can toggle log collection for all services with a single click. Disabling the collection of service operational logs will restrict the EMR cluster health check and the availability of after-sales technical support.

Important

By default, the collection of Service Operational Logs is enabled when you create a cluster. You should decide whether to disable this feature based on your business needs. For more information, see the referenced document.

Which types of clusters support EMR Doctor (health check feature in the EMR console)?

The health check feature is supported only by DataLake and Hadoop clusters. Once a cluster is created, you can access the health check feature through the Monitoring And Diagnostics > Health Check tab for the respective cluster in the EMR console.

If your Hadoop cluster does not have the health check feature, you must enable EMR Doctor. For more information, see the referenced document.

Does the installation or upgrade of EMR Doctor exert impacts on services in an EMR cluster and jobs that run on the cluster?

The installation or upgrade of EMR Doctor does not require restarting any services within an EMR cluster, nor does it impact existing jobs running on the cluster. Once EMR Doctor is installed, it automatically configures the necessary parameters for the cluster, eliminating the need for manual configuration.

When installing or upgrading EMR Doctor, EMR delivers configurations for services such as YARN, Spark, Tez, and Hive to the clusters. Before installing or upgrading EMR Doctor, it is advisable to check if any service configurations have been modified and saved but not yet delivered, and to assess the potential impact of delivering these configurations to the clusters.

What type of data does EMR Doctor collect?

EMR Doctor does not collect your actual data, nor does it scan your files or their content.

It collects only essential event data, including the start and end times, metrics, and counters associated with a job.

Am I charged for EMR Doctor?

You can use EMR Doctor at no cost.

What are the impacts of job data collection on job execution?

The storage metadata collection feature of EMR Doctor can dynamically adjust the resource collection volume based on user resource availability, ensuring that it does not consume excessive user resources.

EMR Doctor's job data collection utilizes Java probe technology without initiating a separate Java process for monitoring. Data is gathered asynchronously, ensuring that the main job process is not interrupted. In instances of high collection load, data is automatically discarded, and collection frequency can be modified through parameter configuration.

Below is a table displaying data from select TPC-DS tests.

SQL and engine | Collection time with EMR Doctor (average of 10 times) | Collection time without EMR Doctor (average of 10 times)
query7 (Spark) | 21.0 s | 21.2 s
query71 (Tez) | 50.8 s | 49.8 s
query19 (MapReduce) | 68.6 s | 68.2 s

Note

The TPC-DS implementation discussed here is based on the TPC-DS benchmark but is not intended for comparison with official TPC-DS benchmark results. The tests mentioned do not fulfill all the TPC-DS benchmark criteria.
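
To put the measurements above in perspective, the relative difference for each query can be computed as follows. This is a minimal sketch that uses only the three pairs of values listed in this topic; the function name is illustrative.

```python
# (with EMR Doctor, without EMR Doctor) in seconds, copied from the table.
RESULTS = {
    "query7 (Spark)": (21.0, 21.2),
    "query71 (Tez)": (50.8, 49.8),
    "query19 (MapReduce)": (68.6, 68.2),
}


def relative_overhead(with_doctor: float, without_doctor: float) -> float:
    """Percentage change in run time when EMR Doctor collection is on."""
    return (with_doctor - without_doctor) / without_doctor * 100.0


if __name__ == "__main__":
    for name, (with_doctor, without_doctor) in RESULTS.items():
        print(f"{name}: {relative_overhead(with_doctor, without_doctor):+.2f}%")
```

For these three queries the differences range from roughly -1% to +2%, which is small enough to be within normal run-to-run variation.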

When can I see the collection report?

Once EMR Doctor is installed or upgraded on an EMR cluster, the daily cluster report feature analyzes the jobs you plan to run and checks if the storage metadata collection feature is active. For this analysis to occur, the EMR cluster must have jobs present.

  • Computing jobs: After computing jobs in an EMR cluster are collected, you can view the latest reports on the following day. These reports provide an overall assessment of the cluster, reflecting the execution status of the jobs within it.

  • Storage analysis: The Collect Information About Storage Resources feature within EMR Doctor is turned off by default. You can activate this feature manually. Once activated, it collects data at 10:00 AM on the same day. Analysis of this data occurs the next morning, with reports generated based on the findings. If the feature is activated in the afternoon, the reports will be available on the third day.
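
The timing of the storage analysis report can be estimated with a small helper. This sketch assumes that activating the feature at or after 10:00 AM misses that day's collection, which matches the afternoon case described above; the exact cutoff is an assumption, and the function name is illustrative.

```python
from datetime import date, datetime, time, timedelta

COLLECTION_TIME = time(10, 0)  # daily collection time stated above


def storage_report_date(activated_at: datetime) -> date:
    """Estimate the earliest date on which a storage report is available."""
    # Assumption: activation at or after 10:00 misses that day's collection.
    if activated_at.time() < COLLECTION_TIME:
        collection_day = activated_at.date()
    else:
        collection_day = activated_at.date() + timedelta(days=1)
    # Analysis runs the morning after the data is collected.
    return collection_day + timedelta(days=1)


if __name__ == "__main__":
    print(storage_report_date(datetime(2025, 3, 15, 9, 0)))
    print(storage_report_date(datetime(2025, 3, 15, 15, 0)))
```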

Can specific values be provided for parameters?

The optimization suggestions from EMR Doctor are intended to guide you. For instance, we may advise reducing memory usage or altering garbage collection parameters, but we do not supply exact values for these parameters. EMR Doctor employs a sampling method to gather data, ensuring minimal disruption to your programs. It is your responsibility to fine-tune the suggested configurations and test their effectiveness.

What do I do if the "Insufficient ECS resources" error message appears when I scale out a cluster?

  • Problem Description: A cluster scale-out fails with the "Insufficient ECS resources_OutofStock" or "Insufficient ECS resources_OperationDenied.NoStock" error message.

  • Cause: The error indicates that the ECS instance type required for the node group's scale-out is currently unavailable.

  • Solution: You can wait for the desired ECS instance type to become available before scaling out the cluster, or you can create a new node group with a different ECS instance type to proceed with the scale-out. For more information, see the referenced document.

What do I do if the "Insufficient ECS resources" error message appears when I create a cluster?

  • Problem Description: The creation of a cluster or node group fails, displaying an "Insufficient ECS resources_OutofStock" or "Insufficient ECS resources_OperationDenied.NoStock" error message.

  • Cause: This error indicates that the selected ECS instance type for creating a cluster or node group is currently unavailable and does not fulfill your requirements.

  • Solution: Choose an alternative ECS instance type that has adequate resources available and satisfies your business needs for the creation of a cluster or node group.

How do I remove unnecessary services?

Once a service is deployed in a cluster and started, it cannot be removed from the console or through an API operation.

How do I log on to a node in a cluster?

Once you have created an E-MapReduce cluster, you can log on to the master node using the password you established during the cluster's creation. For instructions on logging on to other nodes, see the referenced document.

How do I view the vSwitch to which an instance belongs?

In Alibaba Cloud EMR on ECS, vSwitch information is associated with the node group and is not directly visible on the Basic Information page. To view the vSwitch details for an instance, navigate to the Node Management page and click the name of the node group that contains the instance. The vSwitch information is displayed there.

How do I resolve packet loss in a large-scale cluster?

  • Problem description: Clusters frequently experience packet loss. System logs may contain error messages like neighbour: arp_cache: neighbor table overflow!, indicating that the ARP cache table has reached its capacity limit and cannot effectively map MAC addresses to IP addresses, leading to reduced network performance.

  • Cause: Network instability and packet loss can occur in large-scale distributed systems, particularly when a single cluster has more than 1,000 servers and runs a version earlier than EMR V5.18.0 or earlier than EMR V3.52.0. You can tune system parameters to optimize ARP cache management.

    An ARP cache table stores MAC and IP address pairs. Related parameters include the following:

    • net.ipv4.neigh.default.gc_thresh1: The threshold below which the ARP cache table does not perform garbage collection. Default value: 128.

    • net.ipv4.neigh.default.gc_thresh2: The threshold above which the ARP cache table performs garbage collection within five seconds. Default value: 512.

    • net.ipv4.neigh.default.gc_thresh3: The maximum number of entries allowed in the ARP cache table. Default value: 1024.

    Note

    The default values for these parameters are low, which can lead to packet loss and network instability in clusters with more than 1,000 servers. You should adjust the default values according to your business needs.

  • Solution:

    1. Edit the /etc/sysctl.conf file to add the following content, which increases the ARP cache capacity limit and optimizes the maximum number of connections.

          net.ipv4.neigh.default.gc_thresh1 = 512
          net.ipv4.neigh.default.gc_thresh2 = 2048
          net.ipv4.neigh.default.gc_thresh3 = 10240
          net.nf_conntrack_max = 524288
    2. Execute the sudo sysctl -p command to apply the new settings.

      Note

      If you receive the error message sysctl: cannot stat /proc/sys/net/nf_conntrack_max: No such file or directory when you run sudo sysctl -p, first run sudo modprobe nf_conntrack to load the required module. Then run sudo sysctl -p again to apply the configuration.
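
Before or after editing the file, you can check which of the settings above are missing or still too low. The sketch below parses sysctl.conf-style text and compares it with the values from step 1. It inspects only the file content, not the values currently loaded in the kernel, and the function names are illustrative.

```python
# Values recommended in step 1 of the solution above.
RECOMMENDED = {
    "net.ipv4.neigh.default.gc_thresh1": 512,
    "net.ipv4.neigh.default.gc_thresh2": 2048,
    "net.ipv4.neigh.default.gc_thresh3": 10240,
    "net.nf_conntrack_max": 524288,
}


def parse_sysctl(text: str) -> dict:
    """Parse `key = value` lines, ignoring blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            values[key.strip()] = int(value.strip())
        except ValueError:
            continue  # non-integer settings are not relevant here
    return values


def settings_below_recommended(text: str) -> dict:
    """Map each missing or too-low key to (current value, recommended value)."""
    current = parse_sysctl(text)
    return {
        key: (current.get(key), wanted)
        for key, wanted in RECOMMENDED.items()
        if current.get(key) is None or current[key] < wanted
    }


if __name__ == "__main__":
    with open("/etc/sysctl.conf", encoding="utf-8") as f:
        for key, (have, want) in settings_below_recommended(f.read()).items():
            print(f"{key}: current={have}, recommended={want}")
```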

What do I do if I receive the SystemMaintenance.Redeploy system event?

Receiving a system maintenance instance redeployment (SystemMaintenance.Redeploy) system event for a local disk instance indicates a potential software or hardware threat in the underlying host of your ECS instance within the cluster node. This threat could lead to the redeployment of your ECS instance. To prevent data loss, avoid clicking Redeploy in the ECS console.

Solution:

  1. Determine which node the event is affecting by reviewing the event details.

  2. Add a new node to the same node group as the affected node. For more information, see the referenced document.

  3. Remove the faulty node.

    • For subscription clusters, remove nodes from either a core node group or a task node group. For more information, see the referenced document.

      Note

      When you unsubscribe from a subscription ECS instance, ECS calculates and displays the refund amount. If you have questions, submit a ticket and choose Elastic Compute Service under the Product Classification section.

    • For pay-as-you-go clusters, remove nodes from a task node group. For more information, see the referenced document.

How do I add an EMR cluster ID tag to the disks of ECS instances in an EMR cluster by default?

To automatically add an EMR cluster ID tag to the disks of ECS instances within an EMR cluster, you can enable the associated resource tagging feature in Resource Management. Once enabled, any disk attached to an ECS instance will inherit the instance's tags and update accordingly if the instance's tags change.

Procedure:

  1. Log on to the Tag Console.

  2. In the left-side navigation pane, choose Tag > > Associated Resource Tag Settings.

  3. Read the feature description and select the checkbox to create a service-linked role.

    By enabling the Associated Resource Tagging feature, the system creates the AliyunServiceRoleForTag service-linked role, which is necessary for performing tag-related operations on associated resources. For more information, see the referenced document.

  4. Click Enable And Set Rules.

  5. Configure an associated resource tagging rule.

    For a resource type that supports the Associated Resource Tagging feature, you can either select All Tag Keys or choose "Specify some tag keys" to determine which tag keys the associated resource should inherit.

  6. Click Confirm.

For detailed instructions on configuring tags, see the referenced document.

Error: IdempotentParameterMismatch

  • Problem description: You may encounter the following error message when attempting to release a cluster or update its configurations.

    The request uses the same client token as a previous, but non-identical request. Do not reuse a client token with different requests, unless the requests are identical.
  • Cause: The same client token has been used for multiple requests that are not identical.

  • Solution: Verify whether your operation is already underway. If it is, you do not need to resubmit the operation. If not, refresh the console page, and the EMR console will automatically generate a new client token.
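
If you call the API from your own code rather than the console, the same rule applies: reuse a client token only when you retry an identical request. The following sketch shows one way to enforce that on the client side. The class, the action names, and the parameters are hypothetical and are not part of any EMR SDK.

```python
import hashlib
import json
import uuid


class ClientTokens:
    """Hand out one client token per distinct request body."""

    def __init__(self) -> None:
        self._tokens = {}

    def token_for(self, action: str, params: dict) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        body = json.dumps({"action": action, "params": params}, sort_keys=True)
        fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if fingerprint not in self._tokens:
            # A new, non-identical request always gets a fresh token.
            self._tokens[fingerprint] = str(uuid.uuid4())
        return self._tokens[fingerprint]


if __name__ == "__main__":
    tokens = ClientTokens()
    first = tokens.token_for("release", {"cluster": "c-example"})
    retry = tokens.token_for("release", {"cluster": "c-example"})
    other = tokens.token_for("update", {"cluster": "c-example"})
    print(first == retry, first == other)
```

Deriving the lookup key from the full request body means a retry of the same operation stays idempotent, while any change to the parameters produces a new token.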

Error: QuotaExceeded.PrivateIpAddress

  • Problem description: You may encounter the following error message when attempting to create or expand a cluster.

    [QuotaExceeded.PrivateIpAddress] The specified VSwitch "vsw-xxxx" does not have enough IP addresses.
  • Cause: This error message signifies that the selected vSwitch does not have enough available IP addresses to fulfill the requirements for creating or scaling out a cluster.

  • Solution: To resolve this issue, create a node group and choose a vSwitch with a sufficient number of IP addresses for the creation or expansion of your cluster.
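
When you choose a vSwitch, a rough capacity estimate can be derived from its CIDR block. The sketch below assumes that each node consumes one private IP address and that four addresses per vSwitch are reserved by the system. Both are assumptions that you should confirm against the VPC documentation, and the function names are illustrative.

```python
import ipaddress


def usable_ips(cidr: str, reserved: int = 4) -> int:
    """Addresses in the CIDR block minus the assumed system-reserved ones."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - reserved, 0)


def vswitch_fits(cidr: str, in_use: int, nodes_to_add: int,
                 reserved: int = 4) -> bool:
    """True if the vSwitch can hold the existing and the new nodes."""
    return in_use + nodes_to_add <= usable_ips(cidr, reserved)


if __name__ == "__main__":
    print(usable_ips("192.168.0.0/24"))
    print(vswitch_fits("192.168.0.0/28", in_use=10, nodes_to_add=5))
```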

Error: LostProxy

  • Problem description: The "taihao-proxy disconnect" error message occurs when creating, scaling out, or updating the configurations of a cluster.

  • Cause: This indicates that the EMR management proxy on the cluster node is disconnected.

  • Solution:

    1. Check the cluster status and resolve any node issues.

      • If multiple nodes are disconnected, examine the CPU and memory metrics.

        • If CPU or memory usage is high, the cluster may be overloaded. Consider upgrading the cluster's specifications or adding nodes to alleviate the load.

        • If CPU or memory usage is low, verify the security group configurations to ensure proper network communication.

      • If only a few nodes are disconnected, check the nodes' load and whether CPU or memory usage is at 100%. If the load is high, investigate and terminate any abnormal processes consuming resources. If no abnormal processes are detected, consider these solutions:

        • For the master node, inspect processes that use excessive CPU resources. You may need to upgrade the master node or add a MASTER-EXTEND node to reduce the load.

        • For non-master nodes, if an ECS instance is overloaded or unresponsive, consider removing the node or adding a new one.

          Execute the following command to restart the service on the node:

          service taihao-proxy restart
    2. After completing the above checks and actions, try creating, scaling out, or updating the service configurations of the cluster again.

What do I do if the "Insufficient account balance" error message appears when I create, scale out, or upgrade a cluster?

  • Problem Description: You encounter the following error message when creating, scaling out, or upgrading a cluster:

    InvalidAccountStatus.NotEnoughBalance Message: Your account does not have enough balance to order postpaid product.
  • Cause: The balance in your account is insufficient.

  • Solution: Verify that your account balance exceeds the fees for the necessary resources. Once you have confirmed that your account balance is sufficient, try the operation again.

What do I do if the "QuotaExceed.DiskCapacity" error message appears when I scale out a cluster or disk?

  • Problem description: You may see the following error message when attempting to scale out a cluster or disk.

    [QuotaExceed.DiskCapacity] The used capacity of disk type has exceeded the quota in the zone,  quota check fail.
  • Cause: This error message signifies that you have reached the disk quota.

  • Solution: Check the quota for the disk type in the Quota Center and request an increase if necessary.

What do I do if the "QuotaExceed.ElasticQuota" error message appears when I create or scale out a cluster?

  • Problem description: You may see the following error message when creating or scaling out a cluster.

    QuotaExceed.ElasticQuota Message: The number of the specified ECS instances has exceeded the quota of the specified instance type. 
  • Cause: This error occurs when the ECS instance quota has been reached.

  • Solution: Consider selecting a different instance type or reducing the number of instances before purchasing. Alternatively, you can visit the ECS console or Quota Center to request an increase in your quota.

What do I do if a bootstrap script fails to be executed?

To address a failed bootstrap script execution, review the execution logs in the operation history details:

  • If the logs include specific error messages, revise the bootstrap script according to these messages and retry the operation.

  • Should the logs mention the exitCode without specific error messages, enhance the script with additional execution logs to aid in debugging, then attempt the operation once more.

  • In cases where a task times out or the logs show no output, verify the following:

    • Ensure that the user has read permissions for the OSS bucket containing the bootstrap script.

    • Examine the network settings of the ECS instance to confirm accessibility to the OSS internal same-region endpoint, then reattempt the operation.
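
The network check in the last item can be run from the affected ECS instance with a short script. This is a minimal sketch: the endpoint pattern oss-<region>-internal.aliyuncs.com and the use of port 443 are assumptions that you should verify against the OSS endpoint list for your region, and the function names are illustrative.

```python
import socket


def oss_internal_endpoint(region_id: str) -> str:
    """Build the assumed internal OSS endpoint for a region ID."""
    return f"oss-{region_id}-internal.aliyuncs.com"


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    endpoint = oss_internal_endpoint("cn-hangzhou")  # example region ID
    print(endpoint, "reachable" if can_reach(endpoint) else "unreachable")
```

A failed connection points to a network or security group problem rather than a permission problem, so check the instance's network settings before you revisit the bucket permissions.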