DataWorks allows you to create E-MapReduce (EMR) JAR resources in the DataWorks console. You can upload a Java Archive (JAR) file that contains user-defined functions (UDFs) or open source MapReduce code as an EMR JAR resource and then reference the resource in compute nodes such as EMR MR nodes. This topic describes how to create an EMR JAR resource by uploading a file, commit the resource, and reference it in the code of a compute node.
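
For reference, the following Java code is a minimal sketch of a UDF of the kind that can be compiled and packaged into such a JAR file. The package and class names are hypothetical and are used only for illustration; the sketch assumes the org.apache.hadoop.hive.ql.exec.UDF API that Hive provides on EMR clusters.

  package com.example.udf; // Hypothetical package name used only for illustration.

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  /** A minimal Hive UDF that converts a string to lowercase. */
  public final class ToLowercase extends UDF {
      public Text evaluate(final Text input) {
          if (input == null) {
              return null; // Preserve the NULL semantics that Hive expects.
          }
          return new Text(input.toString().toLowerCase());
      }
  }

After you compile a class such as this one into a JAR file, you can upload the file as an EMR JAR resource by following the procedure in this topic.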

Prerequisites

  • An EMR cluster is created. The inbound rules of the security group to which the cluster belongs include the following rules:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is bound to the required workspace. The EMR option is displayed only after you bind an EMR compute engine instance to the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • If you integrate Hive with Ranger in EMR, you need to modify whitelist configurations and restart Hive before you develop EMR nodes in DataWorks. Otherwise, the error message Cannot modify spark.yarn.queue at runtime or Cannot modify SKYNET_BIZDATE at runtime is returned when you run EMR nodes.
    1. You can modify the whitelist configurations by using custom parameters in EMR. Append the required key-value pairs to the value of a custom parameter. The following sample code uses the custom parameter for the Hive component as an example:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note In the code, ALISA.* and SKYNET.* are configurations specific to DataWorks.
    2. After the whitelist configurations are modified, restart the Hive service to make the configurations take effect. For more information about how to restart a service, see Restart a service.

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and choose EMR > Resource > EMR JAR.
    Alternatively, you can find the required workflow, right-click the workflow name, and then choose Create > EMR > Resource > EMR JAR.
  3. In the Create Resource dialog box, configure the following parameters.
    • Resource Name: the name of the resource that you want to create. The resource name must have the suffix .jar.
    • Location: the folder in which the resource is stored. The default value is the path of the current folder. You can modify the path.
    • Resource Type: the type of the resource. Set the value to EMR JAR.
    • Engine Instance: the E-MapReduce compute engine instance to which the resource belongs. Select an instance from the drop-down list.
    • Storage path: the storage path of the resource. Valid values: OSS and HDFS.
      • If you select OSS, you must click Authorize next to OSS to authorize DataWorks and EMR to access Object Storage Service (OSS). Then, select a folder.
      • If you select HDFS, enter a storage path.
    • File: the file that you want to upload. You can click Upload, select a file from your on-premises machine, and then click Open.
  4. Click Create.
  5. Click the Save and Submit icons in the top toolbar to save and commit the resource to the development environment.

What to do next

After you create an EMR JAR resource, you can reference the resource in the code of compute nodes such as an EMR MR node. For more information, see Create an EMR MR node.
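
For reference, the following Java code is a minimal sketch of the kind of MapReduce code that an uploaded JAR file can contain and that an EMR MR node runs. It is based on the standard Hadoop word count example; the package and class names are hypothetical. In the node code, you typically reference the resource by its file name and pass the fully qualified name of the main class, together with the input and output paths, as arguments.

  package com.example.emr; // Hypothetical package name used only for illustration.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  /** A minimal word count job of the kind that is packaged into an EMR JAR resource. */
  public class WordCount {

      /** Splits each input line into words and emits (word, 1) pairs. */
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer tokens = new StringTokenizer(value.toString());
              while (tokens.hasMoreTokens()) {
                  word.set(tokens.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      /** Sums the counts that are emitted for each word. */
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          // The EMR MR node passes the input and output paths as arguments.
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }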