E-MapReduce: FAQ about data development

Last Updated: Dec 20, 2023

This topic provides answers to some frequently asked questions about data development.

Why does job submission fail when a project has too many environment variables or the values of environment variables are too long?

  • Problem description: The following error is reported when I submit a job:

    Message: FailedReason:FailedReason:[[JOB_ENGINE][JOB_ENGINE_START_JOB_FAILED/ERR-200001] Failed to execute job: [FJ-xxxx]].
  • Cause: The data development system limits the number of environment variables in a single project and the length of each environment variable value. If a project contains too many environment variables or the value of an environment variable exceeds 1024 characters, you cannot submit jobs.

  • Solution: If you need to configure a large number of environment variables, distribute them across multiple projects to reduce the number of environment variables in each project, and shorten the values of the environment variables.

Why is the status of a Shell job inconsistent with the status of the YARN application that is started by the Shell job?

  • Problem description: I select the Shell job type on the Data Platform tab and write a Shell job that starts a YARN application, for example, hive -f xxx.sql. Before the YARN application ends, I terminate the Shell job. The Shell job enters the KILLED state, but the YARN application continues to run until it ends on its own.

  • Cause: When you terminate a Shell job, the system sends a termination signal to the Shell process. If the driver of the YARN application does not have a parent-child relationship with the Shell process, the YARN application does not terminate when the Shell process is terminated. The same issue occurs when a Shell job starts Hive, Sqoop, or spark-submit in cluster mode.

  • Solution: We recommend that you do not use Shell jobs to develop Hive, Spark, or Sqoop jobs. Instead, use the native job types, such as Hive, Spark, and Sqoop jobs. These job types have an association mechanism that keeps the status of the job driver consistent with the status of the YARN application. If a YARN application is left running after you terminate a Shell job, you can stop it manually, as shown in the example below.
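
    For example, the following sketch stops a leftover YARN application from a cluster node that has the YARN client installed. The application ID is a placeholder; replace it with the ID shown in the YARN web UI or in the command output.

      # List running YARN applications to find the one left behind by the Shell job.
      yarn application -list -appStates RUNNING
      # Terminate the application by its ID (the ID below is a placeholder).
      yarn application -kill application_1234567890123_0001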

What are the differences between jobs and workflows?

  • Job

    When you create a job on E-MapReduce (EMR), you configure only the JAR package, data input and output addresses, and some runtime parameters. These configurations determine how the job runs. After you complete the configurations and specify a job name, the job is created.

  • Workflow

    A workflow associates a job with a cluster.

    • You can use a workflow to run a sequence of jobs.

    • You can specify an existing cluster for a workflow to run jobs. If you do not specify a cluster, a temporary cluster is automatically created for the workflow.

    • You can also schedule the execution of your workflow. After all jobs of the workflow are completed, the temporary cluster is automatically released.

    • You can view the status and logs of each execution in the records of each workflow.

How do I view job execution records?

After you submit a job, you can view the details of the job in the Data Platform module of the EMR console or on the YARN web UI.

  • Use the Data Platform module of the EMR console

    This method is suitable for jobs that are created and submitted in the EMR console.

    1. After you run the job, view the operational logs of the job on the Log tab.

    2. Click the Records tab to view the execution records of the job instance.

    3. Find the record whose details you want to view and click Details in the Action column. On the Scheduling Center page, view the information about the job instance, job submission logs, and YARN containers.

  • Use the YARN web UI

    This method is suitable for jobs that are created and submitted in the EMR console or CLI.

    1. Enable port 8443.

    2. On the EMR on ECS page, find the cluster and click the name of the cluster. On the page that appears, click the Access Links and Ports tab.

    3. On the Access Links and Ports tab, click the link for YARN UI.

      To access the YARN web UI by using your Knox account, you must obtain the username and password of the Knox account.

    4. In the Hadoop console, click the ID of the job to view the details of the job. You can also query the same information from the command line, as shown in the example below.
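
    The following is a minimal sketch that assumes you can log on to a cluster node and that YARN log aggregation is enabled. The application ID is a placeholder.

      # List applications, including finished ones, to find the application ID of the job.
      yarn application -list -appStates ALL
      # Print the aggregated container logs of the application.
      yarn logs -applicationId application_1234567890123_0001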

How do I view logs on Object Storage Service (OSS)?

  1. Log on to the EMR console and click the Data Platform tab. Find the project, and click Workflows in the Action column. In the left-side navigation pane of the page that appears, click the workflow whose logs you want to view. Click the Records tab in the lower part of the page.

  2. On the Records tab, find the workflow instance that you want to view, and click Details in the Action column. On the Job Instance Info tab of the page that appears, obtain the ID of the cluster where the job is run.

  3. Go to the oss://mybucket/emr/spark directory and find the folder named after the cluster ID.

  4. Go to the oss://mybucket/emr/spark/clusterID/jobs directory, which contains folders named after job IDs. Each folder stores the operational logs of a job. You can also browse these directories from the command line, as shown in the example below.
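
  The following sketch assumes that the OSS connector is configured on the cluster. The bucket name, cluster ID, and job ID are placeholders.

    # List the job log folders of a cluster.
    hadoop fs -ls oss://mybucket/emr/spark/C-1234567890ABCDEF/jobs/
    # Download the logs of a specific job for local inspection.
    hadoop fs -get oss://mybucket/emr/spark/C-1234567890ABCDEF/jobs/FJ-XXXX ./job-logs/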

What do I do if the following error is reported when I read data from or write data to MaxCompute tables: java.lang.RuntimeException: Parse response failed: '<!DOCTYPE html>…'?

  • Cause: The MaxCompute Tunnel endpoint may be invalid.

  • Solution: Enter a valid MaxCompute Tunnel endpoint. For more information, see Endpoints.

What do I do if a TPS conflict occurs when multiple consumer IDs consume the same topic?

Cause: The topic may have been created during public preview or in another environment. As a result, consumption data is inconsistent among multiple consumer groups.

Can I view the job logs stored in core nodes?

Yes, you can view the job logs stored in core nodes on the YARN web UI. For more information, see Use the YARN web UI.

Why is a Spark Streaming job still in the Running state in the EMR console after the job has been stopped?

  • Cause: EMR cannot effectively monitor the status of Spark Streaming jobs that run in yarn-client mode.

  • Solution: Change the running mode of the job to yarn-cluster.

How do I fix the error "Error: Could not find or load main class"?

Check whether the path protocol header of the JAR file for the job is ossref. If it is not, change it to ossref.
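
For example, the following sketch shows the two path formats. The bucket name and path are placeholders.

  # oss:// references the object directly and can trigger the error:
  #   oss://mybucket/path/to/job.jar
  # ossref:// instructs EMR to download the JAR to the cluster before the job is submitted:
  #   ossref://mybucket/path/to/job.jar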

How do I include local shared libraries in a MapReduce job?

Log on to the EMR console, go to the Configure tab of the YARN service page, and then modify parameters on the mapred-site tab based on the following information:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m -Djava.library.path=/usr/local/share/</value>
</property>
<property>
    <name>mapreduce.admin.user.env</name>
    <value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:/usr/local/lib</value>
</property>

How do I specify the file path of an OSS data source for a MapReduce or Spark job?

You can specify the input and output data sources of a job in the oss://[accessKeyId:accessKeySecret@]bucket[.endpoint]/object/path format, which is similar to hdfs:// URLs.

You can access OSS data with or without an AccessKey pair:

  • (Recommended) EMR provides MetaService, which allows you to access OSS data without an AccessKey pair. You can specify a path in the oss://bucket/object/path format.

  • (Not recommended) You can configure the AccessKey ID, AccessKey secret, and endpoint parameters on the Configuration object for a MapReduce job or the SparkConf object for a Spark job. You can also directly specify the AccessKey ID, AccessKey secret, and endpoint in a Uniform Resource Identifier (URI). Both approaches are shown in the sketch after this list.
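
The following spark-submit sketch illustrates both approaches. The class name, JAR file, bucket, and paths are placeholders, and the fs.oss.* property names are those used by the Hadoop OSS connector and may differ depending on the connector version on your cluster.

  # Recommended: rely on MetaService, so no AccessKey pair appears in the command.
  spark-submit --class com.example.WordCount my-job.jar \
    oss://mybucket/input/ oss://mybucket/output/

  # Not recommended: pass the AccessKey pair and endpoint explicitly as Hadoop properties.
  spark-submit \
    --conf spark.hadoop.fs.oss.accessKeyId=<yourAccessKeyId> \
    --conf spark.hadoop.fs.oss.accessKeySecret=<yourAccessKeySecret> \
    --conf spark.hadoop.fs.oss.endpoint=oss-cn-hangzhou-internal.aliyuncs.com \
    --class com.example.WordCount my-job.jar \
    oss://mybucket/input/ oss://mybucket/output/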

How do I view the logs of a service deployed in an EMR cluster?

Log on to the master node of the cluster. View the logs of the service in the /mnt/disk1/log directory.
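
For example, the following sketch lists the log directories and follows a log file. The service subdirectory and file names vary by service and cluster version.

  # Show the log directories of the services deployed on the node.
  ls /mnt/disk1/log/
  # Follow the latest log files of a service, for example HDFS.
  tail -f /mnt/disk1/log/hadoop-hdfs/*.log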

What do I do if the following error is reported: "No space left on device"?

  • Problem description:

    • A master or core node has insufficient storage space, which causes a job submission failure.

    • If a disk is full, an exception may occur in the local Hive metadatabase (for example, the MySQL server), or a Hive metastore connection error may occur.

  • Solution: Free up enough disk space on the master node, including space on the system disk and in HDFS. The commands below can help you locate what is consuming space.
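
    The following commands are a sketch; the paths are examples and may differ on your cluster.

      # Check the usage of the local disks on the node.
      df -h
      # Find the largest directories on a data disk.
      du -sh /mnt/disk1/* | sort -rh | head -n 20
      # Check the overall HDFS capacity and usage.
      hdfs dfs -df -h /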

What do I do if the error "ConnectTimeoutException" or "ConnectionException" is reported when I access OSS or Log Service?

  • Cause: The OSS endpoint is a public endpoint, but your EMR core node does not have a public IP address. As a result, you cannot access OSS. The error is reported when you access Log Service for the same reason.

  • Solution:

    • Change the OSS endpoint to an internal endpoint.

    • You can also use MetaService provided by EMR to access OSS or Log Service. If you use this method, you do not need to specify an endpoint.

    For example, the statement select * from tbl limit 10 can be executed successfully, but select count(1) from tbl fails. In this case, change the OSS endpoint to an internal endpoint:

    alter table tbl set location "oss://bucket.oss-cn-hangzhou-internal.aliyuncs.com/xxx";
    alter table tbl partition (pt = 'xxxx-xx-xx') set location "oss://bucket.oss-cn-hangzhou-internal.aliyuncs.com/xxx";

How do I clear the log data of a completed job?

  • Problem description: The HDFS space of a cluster is full. A large volume of data is stored in the /spark-history directory.

  • Solution:

    1. Go to the Configure tab for the Spark service. Check whether the spark_history_fs_cleaner_enabled parameter is specified in the Service Configuration section.

      • If it is, change its value to true to periodically clear the logs of completed jobs.

      • If it is not, click the spark-defaults tab. Click Custom Configuration in the upper-right corner. In the dialog box that appears, add the spark_history_fs_cleaner_enabled parameter and set it to true.

    2. Select Restart All Components from the Actions drop-down list in the upper-right corner.

    3. In the Cluster Activities dialog box, specify Description and click OK.

    4. In the Confirm message, click OK.
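
    Before or after you enable the cleaner, you can check how much space the /spark-history directory occupies. A minimal sketch:

      # Show the total size of the Spark event logs stored in HDFS.
      hdfs dfs -du -h /spark-history
      # List the event log files in the directory; the cleaner removes old ones automatically once it is enabled.
      hdfs dfs -ls /spark-history | head -n 20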

Why does AppMaster take a long time to start a task?

  • Cause: If the number of tasks or Spark executors is large, AppMaster may take a long time to start a task. The running duration of a single task is short, but the overhead for scheduling jobs is large.

  • Solution:

    • Use CombinedInputFormat to reduce the number of tasks.

    • Increase the block size (dfs.blocksize) of the data generated by upstream jobs.

    • Set mapreduce.input.fileinputformat.split.maxsize to a larger value.

    • For a Spark job, log on to the EMR console and go to the Configure tab of the Spark service page. Then, set spark.executor.instances to a smaller value to reduce the number of executors, or set spark.default.parallelism to a smaller value to reduce the number of parallel tasks. A command-line sketch of these settings follows this list.
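
    The following sketch shows how such settings can be passed on the command line. The JAR files, class names, paths, and values are placeholders, and the -D option for the MapReduce job assumes that the main class parses generic options with ToolRunner.

      # MapReduce: enlarge the input split size (256 MB here) so that fewer map tasks are created.
      hadoop jar my-job.jar com.example.MyJob \
        -D mapreduce.input.fileinputformat.split.maxsize=268435456 \
        oss://mybucket/input/ oss://mybucket/output/

      # Spark: reduce the number of executors and the default parallelism.
      spark-submit \
        --conf spark.executor.instances=20 \
        --conf spark.default.parallelism=200 \
        --class com.example.MyApp my-app.jar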

Does EMR support real-time computing?

EMR provides three types of real-time computing services: Spark Streaming, Storm, and Flink.

How do I pass job parameters to a script?

You can run a script in a Hive job and use the -hivevar option to pass job parameters to the script.

  1. Prepare a script.

    In a script, you can reference a variable in the format of ${varname}, for example, ${rating}. Example:

    • Script name: hivesql.hive

    • Path of the script in OSS: oss://bucket_name/path/to/hivesql.hive

    • Script content:

      use default;
      drop table demo;
      create table demo (userid int, username string, rating int);
      insert into demo values(100,"john",3),(200,"tom",4);
      select * from demo where rating=${rating};
  2. Go to the Data Platform tab.

    1. Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.

    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.

    3. Click the Data Platform tab.

  3. In the Projects section of the page that appears, find the project that you want to edit and click Edit Job in the Actions column.

  4. Create a Hive job.

    1. In the Edit Job pane on the left, right-click the folder on which you want to perform operations and select Create Job.

    2. In the Create Job dialog box, specify Name and Description, and select Hive from the Job Type drop-down list.

    3. Click OK.

  5. Edit job content.

    1. Click Job Settings in the upper-right corner of the page. On the Basic Settings tab, specify Key and Value in the Configuration Parameters section. Set Key to a variable specified in the script, for example, rating.

    2. Enter the following code in the Content field of the job to use the -hivevar option to pass the parameters configured in the job to the variables in the script:

      -hivevar rating=${rating} -f ossref://bucket_name/path/to/hivesql.hive
  6. Run the job.

    After the job is complete, view the result of the job in the job logs. The same parameter can also be passed on the Hive CLI, as shown in the example below.
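
    For reference, the same parameter can be passed directly on the Hive CLI of a cluster node, assuming that the script has been copied to the local file system:

      # rating=3 is substituted for the ${rating} variable referenced in the script.
      hive -hivevar rating=3 -f hivesql.hive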

How do I enable the HDFS balancer in an EMR cluster and optimize the performance of the HDFS balancer?

  1. Log on to a node of your cluster.

  2. Run the following commands to switch to the hdfs user and run the HDFS balancer:

    su hdfs
    /usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 10
  3. Check the running status of the HDFS balancer:

    • Method 1

      less /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log
    • Method 2

      tailf /var/log/hadoop-hdfs/hadoop-hdfs-balancer-emr-header-xx.cluster-xxx.log
    Note

    If the command output includes Successfully, the HDFS balancer is running.

    The following parameters control the HDFS balancer:

    • Threshold

      Default value: 10%. This value ensures that the disk usage on each DataNode differs from the overall usage in the cluster by no more than 10%.

      If the overall usage of the cluster is high, set this parameter to a smaller value. If a large number of new nodes are added to the cluster, you can set this parameter to a larger value to move data from the high-usage nodes to the low-usage nodes.

    • dfs.datanode.balance.max.concurrent.moves

      Default value: 5. The maximum number of concurrent block moves that are allowed on a DataNode. Set this parameter based on the number of disks. We recommend that you use 4 × the number of disks as the upper limit for a DataNode.

      For example, if a DataNode has 28 disks, set this parameter to 28 on the HDFS balancer and to 112 on the DataNode. Adjust the value based on the cluster load: increase the value when the cluster load is low and decrease the value when the cluster load is high.

      Note: After you change this parameter on a DataNode, restart the DataNode for the change to take effect.

    • dfs.balancer.dispatcherThreads

      Default value: 200. The number of dispatcher threads that the HDFS balancer uses to determine which blocks need to be moved. Before the HDFS balancer moves a specific amount of data between two DataNodes, it repeatedly retrieves block lists until the required amount of data is scheduled.

    • dfs.balancer.rpc.per.sec

      Default value: 20. The number of remote procedure calls (RPCs) that dispatcher threads send per second.

      Before the HDFS balancer moves data between two DataNodes, the dispatcher threads repeatedly send the getBlocks() RPC to the NameNode, which puts a heavy load on the NameNode. To avoid this issue and balance the cluster load, set this parameter to limit the number of RPCs sent per second. For example, decrease the value to 10 or 5 for a cluster with a high load to reduce the impact on the cluster.

    • dfs.balancer.getBlocks.size

      The total size of the blocks retrieved by each getBlocks() RPC. Before the HDFS balancer moves data between two DataNodes, it repeatedly retrieves block lists until the required amount of data is scheduled. By default, each block list contains 2 GB of blocks. When the NameNode processes a getBlocks() RPC, the NameNode is locked. If an RPC queries a large number of blocks, the NameNode is locked for a long period of time, which slows down data writes. Set this parameter based on the NameNode load.

    • dfs.balancer.moverThreads

      Default value: 1000. Each block move requires a thread. This parameter limits the total number of concurrent block moves.

    • dfs.namenode.balancer.request.standby

      Default value: false. Specifies whether the HDFS balancer queries the blocks to be moved from the standby NameNode. When a NameNode processes a getBlocks() RPC, the NameNode is locked. If an RPC queries a large number of blocks, the NameNode is locked for a long period of time, which slows down data writes. In a high-availability (HA) cluster, you can set this parameter to true so that the HDFS balancer sends getBlocks() RPCs only to the standby NameNode.

    • dfs.balancer.getBlocks.min-block-size

      Default value: 10485760 (10 MB). The minimum size of the blocks that a getBlocks() RPC queries. Blocks smaller than this size are skipped, which improves query efficiency.

    • dfs.balancer.max-iteration-time

      Default value: 1200000. Unit: milliseconds. The maximum duration of each iteration in which blocks are moved between two DataNodes. When an iteration exceeds this duration, the HDFS balancer moves on to the next iteration.

    • dfs.balancer.block-move.timeout

      Default value: 0. Unit: milliseconds. When the HDFS balancer moves blocks, an iteration may last a long time because some block moves are still in progress. You can set this parameter to a non-zero value to avoid this issue.

    The following parameters are configured on DataNodes:

    • dfs.datanode.balance.bandwidthPerSec

      The bandwidth that each DataNode can use to balance the workloads of the cluster. We recommend that you set the bandwidth to 100 MB/s. You can also run the hdfs dfsadmin -setBalancerBandwidth command to adjust the bandwidth without restarting the DataNodes. For example, increase the bandwidth when the cluster load is low and decrease the bandwidth when the cluster load is high.

    • dfs.datanode.balance.max.concurrent.moves

      The maximum number of concurrent threads that a DataNode uses to move blocks for the HDFS balancer.
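
    For example, the bandwidth described above can be adjusted on a running cluster with the following command, a sketch that assumes you run it as the hdfs user. The value is in bytes per second; 104857600 corresponds to the recommended 100 MB/s.

      # Apply the new balancing bandwidth to all DataNodes without restarting them.
      hdfs dfsadmin -setBalancerBandwidth 104857600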

What do I do if the Custom Configuration button is not displayed for a service on the EMR console?

  1. Log on to the master node of the cluster.

  2. Go to the following configuration template directory:

    cd /var/lib/ecm-agent/cache/ecm/service/HUE/4.4.0.3.1/package/templates/

    Use the HUE service as an example.

    • HUE is the name of the service directory.

    • 4.4.0.3.1 is the Hue version.

    • hue.ini is the configuration file.

  3. Run the following command to add the required custom configuration:

    vim hue.ini

    If the configuration item already exists, change its value as required.

  4. In the EMR console, restart the service for the configuration to take effect.

What do I do if a job stays in the SUBMITTING state for a long period of time?

In most cases, this problem occurs because a component in the EMRFLOW service is stopped. You must start the component in the EMR console.

  1. Go to the EMRFLOW page.

    1. Go to the page for a service that is deployed in your cluster, replace the service name at the end of the URL in the address bar with EMRFLOW, and then press Enter.

      Note

      In this example, you are redirected from the HDFS page to the EMRFLOW page.

    2. Click the Component Deployment tab.

  2. Start the component that is in the STOPPED state.

    1. On the Component Deployment tab, find the component that is in the STOPPED state and click Start in the Actions column.

    2. In the Cluster Activities dialog box, specify Description and click OK.

    3. In the Confirm message, click OK.

  3. Check whether the component is started.

    1. Click History in the upper-right corner.

    2. In the Activity History dialog box, click Start EMRFLOW FlowAgentDaemon in the Activity column.

    3. Click emr-header-1 in the Instance Name column.

    4. Click START_FlowAgentDaemon_ON_emr-header-1 in the Task Name column.

    5. View the information displayed in the Task Log section. If the task log indicates that the task is successful, the component is started.

      Note

      If an error is reported after the component is started, fix the error based on the logs. If a permission-related error is reported, log on to the cluster in SSH mode and run the sudo chown flowagent:hadoop /mnt/disk1/log/flow-agent/* command to fix the error. Then, perform the preceding steps again to start the component that is in the STOPPED state.