You can create an E-MapReduce (EMR) MR node to split a large dataset across multiple map tasks that run in parallel. This topic describes how to create an EMR MR node, edit the node code, and run the node.
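
The parallel map tasks mentioned above can be pictured with a small word-count sketch. This is plain Python that only imitates the map, shuffle, and reduce phases on one machine; it is not the EMR or Hadoop API, and the function names (`map_task`, `shuffle`, `reduce_task`, `word_count`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce phase: combine the values collected for one key."""
    return key, sum(values)


def word_count(splits, max_workers=4):
    # Each split is handled by its own map task; the tasks are independent,
    # which is what allows them to run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        mapped = list(pool.map(map_task, splits))
    return dict(reduce_task(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Two splits stand in for two blocks of a large input file.
    splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
    print(word_count(splits))
```

On a real cluster the splits are processed on different machines, but the division of work follows the same pattern: independent map tasks first, then a grouping step, then one reduce call per key.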

Prerequisites

  • An EMR cluster is created. The security group to which the cluster belongs includes an inbound rule with the following settings:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is bound to the required workspace. The EMR option is displayed only after you bind an EMR compute engine instance to the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • Open source code is uploaded as an EMR JAR resource if you want to reference the open source code in your EMR MR node. For more information, see Create an EMR JAR resource.
  • User-defined functions (UDFs) are uploaded as EMR JAR resources and are registered if you want to reference the UDFs in your EMR MR node. For more information, see Create an E-MapReduce function.
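
The security group rule in the first prerequisite can be read as a simple predicate on incoming connections. The sketch below models the rule as a small data class and checks whether a given source address and port would match it. The class `InboundRule` and the function `rule_allows` are hypothetical helpers, not part of any Alibaba Cloud SDK, and the "Custom TCP" protocol type is modelled here simply as `"TCP"`.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    action: str
    protocol: str
    port_range: str   # "start/end", the notation used in the prerequisite list
    source_cidr: str  # the authorization object


# The values listed in the prerequisites.
REQUIRED_RULE = InboundRule(
    action="Allow",
    protocol="TCP",
    port_range="8898/8898",
    source_cidr="100.104.0.0/16",
)


def rule_allows(rule, source_ip, port, protocol="TCP"):
    """Return True if the rule explicitly allows this inbound connection."""
    start, end = (int(part) for part in rule.port_range.split("/"))
    return (
        rule.action == "Allow"
        and rule.protocol == protocol
        and start <= port <= end
        and ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source_cidr)
    )


if __name__ == "__main__":
    print(rule_allows(REQUIRED_RULE, "100.104.12.7", 8898))
    print(rule_allows(REQUIRED_RULE, "100.105.0.1", 8898))
```

A check like this only describes what the rule matches; the rule itself is still configured on the security group of the cluster.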

Create an EMR MR node

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the page that appears, move the pointer over the Create icon and choose EMR > EMR MR.
    Alternatively, you can click the related workflow in the left-side navigation pane, right-click EMR, and then choose Create > EMR MR.
  3. In the Create Node dialog box, set the Node Name and Location parameters.
    Note: The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Commit.
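
The naming rule in step 3 can be checked before you open the dialog box, for example when node names are generated by a script. The following is a minimal sketch that assumes "letters" means ASCII letters; the console may accept a wider character set. The names `NODE_NAME_PATTERN` and `is_valid_node_name` are hypothetical.

```python
import re

# 1 to 128 characters drawn from letters, digits, underscores, and periods.
# Assumption: "letters" is limited to A-Z and a-z.
NODE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.]{1,128}")


def is_valid_node_name(name: str) -> bool:
    """Return True if the whole name satisfies the documented naming rule."""
    return NODE_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["emr_mr.wordcount_01", "my node", ""]:
        print(repr(candidate), is_valid_node_name(candidate))
```

`fullmatch` is used instead of `match` so that a name with a valid prefix followed by a disallowed character, such as a space, is rejected.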

Develop the node

  1. Optional: Reference a UDF or open source code.
    Before you reference a UDF or open source code, perform the required preparations. For more information, see Create an EMR JAR resource and Create an E-MapReduce function. After you commit the UDF or open source code, you can reference it in the node code. In the following example, open source code is referenced.
    1. In the left-side navigation pane, click Resource in the EMR folder, right-click the resource that needs to be referenced, and select Insert Resource Path.
      Insert Resource Path
    2. The code resource is referenced if the message shown in the following figure appears on the configuration tab of the EMR MR node.
      Resource referenced
  2. Save and commit the node.
    Notice: You must set the Rerun and Parent Nodes parameters before you can commit the node.
    1. Click the Save icon in the toolbar to save the node.
    2. Click the Commit icon in the toolbar.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node. For more information, see Deploy nodes.

Test the node

Test the node. For more information, see View auto triggered nodes.