DataWorks: Create and use an EMR JAR resource

Last Updated: Aug 11, 2023

DataWorks allows you to create E-MapReduce (EMR) Java Archive (JAR) resources in the DataWorks console. You can upload a JAR file that contains user-defined functions (UDFs) or open source MapReduce code as an EMR JAR resource, and then reference the resource in compute nodes such as an EMR MR node. This topic describes how to create an EMR JAR resource by uploading a file, commit the resource, and reference it in compute nodes.

Prerequisites

  • An Alibaba Cloud EMR cluster is created, and an inbound rule that contains the following content is added to the security group to which the cluster belongs.
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is associated with your workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Create and manage workspaces.
  • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR Hive nodes in DataWorks. Otherwise, the error message "Cannot modify spark.yarn.queue at runtime" or "Cannot modify SKYNET_BIZDATE at runtime" is returned when you run EMR Hive nodes.
    1. You can modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for Hive components is used. The following code provides an example:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note In the preceding code, ALISA.* and SKYNET.* are specific to DataWorks.
    2. After the whitelist configurations are modified, you must restart the Hive service to make the configurations take effect. For more information, see Restart a service.
  • An exclusive resource group for scheduling is created, and the resource group is associated with the virtual private cloud (VPC) where the EMR cluster resides. For more information, see Create and use an exclusive resource group for scheduling.
    Note You can use only exclusive resource groups for scheduling to run EMR Hive nodes.
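
The whitelist value in the Hive and Ranger prerequisite above is a set of regular expressions joined by `|`. The following Python sketch shows one way to check locally which parameter names such a pattern would cover. It assumes that Hive compares the whole parameter name against the pattern (full-match semantics), and the function name `is_whitelisted` is hypothetical, not part of Hive, EMR, or DataWorks.

```python
import re

# Pattern copied from the example custom parameter above.
WHITELIST_APPEND = "tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*"


def is_whitelisted(param_name: str) -> bool:
    """Return True if the whole parameter name matches the whitelist pattern.

    Assumption: the name must match the pattern in full, not just a prefix
    or a substring of it.
    """
    return re.fullmatch(WHITELIST_APPEND, param_name) is not None


if __name__ == "__main__":
    # The two parameters named in the error messages, plus one that the
    # appended pattern does not cover.
    for name in ("spark.yarn.queue", "SKYNET_BIZDATE", "hive.exec.parallel"):
        print(name, is_whitelisted(name))
```

Under that assumption, both parameters named in the error messages are covered by the appended pattern, which is consistent with the errors disappearing after the whitelist is modified and Hive is restarted.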

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and choose EMR > Resource > EMR JAR.
    You can also find the workflow in which you want to create an EMR JAR resource, right-click the workflow name, and then choose Create > EMR > Resource > EMR JAR.
  3. In the Create Resource dialog box, set the parameters.
    [Figure: Create Resource dialog box]
    Parameters and descriptions:
    • Resource Name: The name of the resource that you want to create. The resource name must have the suffix .jar.
    • Location: The folder for storing the resource. The default value is the path of the current folder. You can modify the path based on your business requirements.
    • File Type: The type of the resource. Set the value to EMR JAR.
    • Engine Instance: The EMR compute engine instance to which the resource belongs. Select an instance from the drop-down list.
    • Storage path: The storage path of the resource. Valid values: OSS and HDFS.
      • If you select OSS, you must click Authorize next to OSS to authorize DataWorks and EMR to access Object Storage Service (OSS). Then, select a folder.
      • If you select HDFS, enter a storage path.
    • File: The file that you want to upload. You can click Upload, select a file from your on-premises machine, and then click Open.
  4. Click Create.
  5. Click the Save and Submit icons in the top toolbar to save and commit the resource to the development environment.
    Note
    • You must select a resource group for scheduling when you commit the EMR JAR resource. We recommend that you use an exclusive resource group for scheduling. If no exclusive resource groups for scheduling are available, you can purchase and configure one. For more information, see Create and use an exclusive resource group for scheduling.
    • When you commit the resource by using an exclusive resource group for scheduling, you can view the logs generated during the commit process in the log area of the page. After the resource is committed, messages are displayed in the log area and on the page to inform you of the result.
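
The constraints on the parameters in step 3 can be checked before you open the dialog box. The following Python sketch is only an illustration of those rules: the function `check_resource_params` and its arguments are hypothetical and do not correspond to any DataWorks API.

```python
from typing import List, Optional

# Valid values for the storage path type, as listed in step 3.
VALID_STORAGE_TYPES = ("OSS", "HDFS")


def check_resource_params(resource_name: str,
                          storage_type: str,
                          storage_path: Optional[str] = None) -> List[str]:
    """Return a list of problems found; an empty list means no rule is violated."""
    problems = []
    # The resource name must have the suffix .jar.
    if not resource_name.endswith(".jar"):
        problems.append("Resource Name must have the suffix .jar")
    # The storage type must be OSS or HDFS.
    if storage_type not in VALID_STORAGE_TYPES:
        problems.append("Storage path must be OSS or HDFS")
    # OSS needs a selected folder and HDFS needs an entered path.
    elif not storage_path:
        problems.append("A folder (OSS) or a storage path (HDFS) is required")
    return problems


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(check_resource_params("wordcount.jar", "HDFS", "/tmp/resources"))
    print(check_resource_params("wordcount.zip", "HDFS"))
```

The check covers only the rules stated in this topic. It does not verify that the OSS authorization has been granted or that the path exists.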

What to do next

After you create an EMR JAR resource, you can reference the resource in the code of compute nodes such as an EMR MR node. For more information, see Create an EMR MR node. The following figure shows how to reference the resource.

[Figure: Reference the resource]