
E-MapReduce:Submit jobs through Alibaba Cloud DataWorks

Last Updated: Oct 15, 2025

Alibaba Cloud DataWorks supports creating Hive, Spark SQL, Spark, and other nodes on E-MapReduce (EMR) to configure and schedule task workflows. It also provides metadata management and data quality monitoring and alerting features to help you efficiently develop and govern data. This topic describes how to submit jobs through Alibaba Cloud DataWorks.

Supported cluster types

DataWorks currently supports registering the following cluster types:

  • DataLake cluster (new data lake)

  • Custom cluster

  • Hadoop cluster (old data lake)

Important
  • You can use EMR Hadoop clusters of the following versions in DataWorks:

    EMR-3.38.2, EMR-3.38.3, EMR-4.9.0, EMR-5.6.0, EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-5.2.1, EMR-5.4.3

  • Hadoop clusters (old data lake) are no longer recommended. You must migrate to DataLake clusters as soon as possible. For more information, see Migrate Hadoop clusters to DataLake clusters.

Limits

  • Task type: You cannot run EMR Flink tasks in the DataWorks console.

  • Task running: You can use a serverless resource group (recommended) or an old-version exclusive resource group for scheduling to run an EMR task.

  • Task governance:

    • Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster is of V3.43.1, V5.9.1, or a minor version later than V3.43.1 or V5.9.1, you can view the table-level lineages and field-level lineages of the preceding nodes that are created based on the cluster.

      Note

      For Spark-based EMR nodes, if the EMR cluster is of V5.8.0, V3.42.0, or a minor version later than V5.8.0 or V3.42.0, the Spark-based EMR nodes can be used to view table-level and field-level lineages. If the EMR cluster is of a minor version earlier than V5.8.0 or V3.42.0, only the Spark-based EMR nodes that use Spark 2.x can be used to view table-level lineages.

    • If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. EMR-HOOK can be configured for EMR Hive and EMR Spark SQL services. For more information, see Use the Hive extension feature to record data lineage and historical access information and Use the Spark SQL extension feature to record data lineage and historical access information.

  • Supported regions: EMR Serverless Spark is available in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Singapore, Germany (Frankfurt), and US (Silicon Valley) regions.

  • For an EMR cluster for which Kerberos authentication is enabled, you must add inbound rules of UDP ports to the security group of the EMR cluster for the CIDR block of the vSwitch with which a resource group is associated.

    Note

    To add an inbound rule, perform the following operations: Log on to the EMR console and go to the Basic Information tab of your EMR cluster. In the Security section of the Basic Information tab, click the icon to the right of the Cluster Security Group parameter. On the Security Group Details tab of the Security Groups page, click the Inbound tab in the Access Rule section, and then click Add Rule. Set the Protocol Type parameter to Custom UDP, set the Port Range parameter to the value specified in the /etc/krb5.conf file of your EMR cluster, and set the Authorization Object parameter to the CIDR block of the vSwitch with which the resource group is associated.
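The Port Range value comes from the Kerberos settings of the cluster. A typical /etc/krb5.conf fragment looks like the following; the realm name and hostname are placeholders, and 88 is only the common default KDC port, so check the actual file on your cluster:

```ini
[realms]
    EMR.EXAMPLE.COM = {
        kdc = emr-header-1.cluster:88
    }
```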

Prerequisites

  • The following permissions have been granted.

    Only RAM users or RAM roles with the following identities can register EMR clusters. For operation details, see Grant permissions to RAM users.

    • Alibaba Cloud account.

    • RAM user or RAM role that has both the DataWorks workspace administrator role and the AliyunEMRFullAccess policy.

    • RAM user or RAM role that has both the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies.

  • The corresponding type of EMR cluster has been purchased. In this example, the region of the EMR cluster is China (Shanghai).

    For more information about the cluster types that DataWorks supports registering, see Supported cluster types.

Precautions

  • If you want to isolate EMR data in the development environment from EMR data in the production environment by using a workspace in standard mode, you must register different EMR clusters in the development and production environments of the workspace. In addition, the metadata of the two EMR clusters must be stored separately.

  • You can register an EMR cluster to multiple workspaces within the same Alibaba Cloud account, but you cannot register an EMR cluster to workspaces across Alibaba Cloud accounts. For example, if you register an EMR cluster to a workspace within the current Alibaba Cloud account, you cannot register the cluster to a workspace in another Alibaba Cloud account.

  • If a DataWorks resource group and an EMR cluster are deployed in the same virtual private cloud (VPC) and use the same vSwitch but the resource group cannot connect to the EMR cluster as expected, check the security group rules of the EMR cluster. Add the CIDR block of the vSwitch and inbound rules for the ports of common open source components to the security group rules so that you can use the DataWorks resource group to access the EMR cluster as expected. For more information, see Manage security groups.

Prepare a DataWorks environment

Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Prepare an environment.

Step 1: Create a workspace

If a workspace exists in the China (Shanghai) region, skip this step and use the existing workspace.

  1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region.

  2. In the left-side navigation pane, click Workspace. On the Workspaces page, click Create Workspace to create a workspace in standard mode. For more information, see Create a workspace. For a workspace in standard mode, the development environment is isolated from the production environment.

Step 2: Create a serverless resource group

This tutorial requires a serverless resource group for data synchronization and scheduling. Therefore, you need to purchase and configure a serverless resource group.

  1. Purchase a serverless resource group.

    1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.

    2. On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of serverless resource groups.

      Note

      If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console to create one. For more information about VPCs and vSwitches, see What is VPC?

  2. Associate the serverless resource group with the DataWorks workspace.

    You can use the serverless resource group that you purchased in subsequent operations only after you associate the serverless resource group with a workspace.

    Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the serverless resource group that you purchased, and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace with which you want to associate and click Associate in the Actions column.

  3. Enable the serverless resource group to access the Internet.

    The test data used in this tutorial must be obtained over the Internet. By default, the serverless resource group cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the serverless resource group is associated and configure an EIP for the VPC to establish a network connection between the VPC and the network environment of the test data. This way, you can use the serverless resource group to access the test data.

    1. Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.

    2. Click Create Internet NAT Gateway and configure the parameters. The following list describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described.

      • Region: Select China (Shanghai).

      • VPC and Associate vSwitch: Select the VPC and vSwitch with which the resource group is associated.

        To view the VPC and vSwitch with which the resource group is associated, perform the following operations: Log on to the DataWorks console. In the top navigation bar, select the region in which you activate DataWorks. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is VPC?

      • Access Mode: Select SNAT-enabled Mode.

      • EIP: Select Purchase EIP.

      • Service-linked Role: If this is the first time you create a NAT gateway, click Create Service-linked Role to create a service-linked role.

    3. Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Activate Now.

For more information about how to create and use a serverless resource group, see Use serverless resource groups.

Step 3: Register the EMR cluster to DataWorks and initialize the resource group

You can use the EMR cluster in DataWorks only if you register the cluster to DataWorks.

  1. Go to the Register EMR Cluster page.

    1. Go to the SettingCenter page.

      Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.

  2. Register the EMR cluster to DataWorks.

    On the Register EMR Cluster page, configure cluster information. The following list describes the key parameters.

    • Alibaba Cloud Account to Which Cluster Belongs: Set it to Current Alibaba Cloud Account.

    • Cluster Type: Select Data Lake.

    • Default Access Identity: Set it to Cluster Account: hadoop.

    • Pass Proxy User Information: Set it to Pass.

  3. Initialize the resource group.

    1. Go to the Cluster Management page in SettingCenter. Find the EMR cluster that is registered to DataWorks and click Initialize Resource Group in the section that displays the information of the EMR cluster.

    2. In the Initialize Resource Group dialog box, find the desired resource group and click Initialize.

    3. After the initialization is complete, click OK.

    Important

    You must make sure that the initialization of the resource group is successful. Otherwise, tasks that use the resource group may fail. If the initialization of the resource group fails, you can view the failure cause and perform a network connectivity diagnosis as prompted.

For more information about how to register an EMR cluster, see DataStudio (old version): Associate an EMR computing resource.

Submit EMR jobs

Submit EMR Hive jobs

Step 1: Create an EMR Hive node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Hive node.

    1. Find the desired workflow, right-click the name of the workflow, and then choose Create Node > EMR > EMR Hive.

      Note

      Alternatively, you can move the pointer over the Create icon and choose Create Node > EMR > EMR Hive.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm. The configuration tab of the EMR Hive node appears.

      Note

      The node name can contain only letters, digits, underscores (_), and periods (.).
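The naming rule above can be expressed as a simple check. The following validator is an illustrative sketch; the function name and regex are assumptions, not part of DataWorks:

```python
import re

# Hypothetical validator for the node-name rule stated above:
# only letters, digits, underscores (_), and periods (.) are allowed.
NODE_NAME = re.compile(r"[A-Za-z0-9_.]+")

def is_valid_node_name(name: str) -> bool:
    return NODE_NAME.fullmatch(name) is not None

print(is_valid_node_name("emr_hive_demo.v1"))  # True
print(is_valid_node_name("emr hive"))          # False
```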

Step 2: Develop an EMR Hive task

You can develop a Hive task on the configuration tab of the EMR Hive node.

Develop SQL code

In the SQL editor, develop node code. You can define variables in the ${Variable} format in the node code and configure the scheduling parameters that are assigned to the variables as values in the Scheduling Parameter section of the Properties tab. This way, the values of the scheduling parameters are dynamically replaced in the node code when the node is scheduled to run. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. Sample code:

show tables;
select '${var}'; -- You can assign a specific scheduling parameter to the var variable.
select * from userinfo;
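The ${var} substitution shown above happens before the SQL is executed. As a rough illustration only (a hypothetical render helper, not DataWorks code), the replacement behaves like this:

```python
import re

def render(sql: str, params: dict) -> str:
    # Hypothetical stand-in for the substitution DataWorks performs at run time:
    # each ${name} placeholder is replaced with the scheduled parameter value.
    return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], sql)

print(render("select '${var}';", {"var": "20240101"}))
# -> select '20240101';
```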

Run the Hive task

  1. In the toolbar, click the Run with Parameters icon. In the Parameters dialog box, select the desired resource group from the Resource Group Name drop-down list and click Run.

    Note
    • If you want to access a computing resource over the Internet or a virtual private cloud (VPC), use the resource group for scheduling that is connected to the computing resource. For more information, see Network connectivity solutions.

    • If you want to change the resource group in subsequent operations, you can click the Run with Parameters icon to change the resource group in the Parameters dialog box.

    • If you use an EMR Hive node to query data, a maximum of 10,000 data records can be returned, and the total size of the returned data records cannot exceed 10 MB.

  2. Click the Save icon in the top toolbar to save the SQL statements.

  3. Optional. Perform smoke testing.

    You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.

Note

If you want to modify the queue to which jobs are committed, see Configure advanced parameters.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

Submit EMR Spark SQL jobs

Step 1: Create an EMR Spark SQL node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Spark SQL node.

    1. Find the desired workflow, right-click the workflow name, and then choose Create Node > EMR > EMR Spark SQL.

      Note

      Alternatively, you can move the pointer over the Create icon and choose Create Node > EMR > EMR Spark SQL.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm to go to the EMR Spark SQL node configuration tab.

      Note

      The node name can contain only letters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Spark SQL task

You can perform the following operations to develop an EMR Spark SQL task on the configuration tab of the EMR Spark SQL node:

Develop SQL code

In the SQL editor, develop node code. You can define variables in the ${Variable} format in the node code and configure the scheduling parameters that are assigned to the variables as values in the Scheduling Parameter section of the Properties tab. This way, the values of the scheduling parameters are dynamically replaced in the node code when the node is scheduled to run. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. Sample code:

SHOW TABLES;
-- Define a variable named var in the ${var} format. If you assign the ${yyyymmdd} parameter to the variable as a value, you can create a table whose name is suffixed with the data timestamp.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
) PARTITIONED BY (
  dt STRING
); -- You can assign a specific scheduling parameter to the var variable.
Note
  • The size of SQL statements for the node cannot exceed 130 KB.

  • If multiple EMR data sources are associated with DataStudio in your workspace, you must select one from the data sources based on your business requirements. If only one EMR data source is associated with DataStudio in your workspace, you do not need to select a data source.
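When ${yyyymmdd} is assigned to the var variable, the table name userinfo_new_${var} resolves to a date-suffixed name. The following sketch illustrates that resolution with a hypothetical resolve_table_name helper; it is not DataWorks code, and the convention that the data timestamp is the day before the run is an assumption:

```python
from datetime import date

def resolve_table_name(template: str, bizdate: date) -> str:
    # Hypothetical helper: DataWorks resolves ${yyyymmdd} to the data timestamp
    # (commonly the day before the scheduled run); here the date is passed in directly.
    return template.replace("${var}", bizdate.strftime("%Y%m%d"))

print(resolve_table_name("userinfo_new_${var}", date(2024, 1, 2)))
# -> userinfo_new_20240102
```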

Execute SQL statements

  1. Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select a created resource group for scheduling and click Run.

    Note
    • If you want to access a data source over the Internet or a virtual private cloud (VPC), you must use the resource group for scheduling that is connected to the data source. For more information, see Network connectivity solutions.

    • If you want to change the resource group in subsequent operations, you can click the Run with Parameters icon to change the resource group in the Parameters dialog box.

    • If you use an EMR Spark SQL node to query data, a maximum of 10,000 data records can be returned, and the total size of the returned data records cannot exceed 10 MB.

  2. Click the Save icon in the top toolbar to save the SQL statements.

  3. Optional. Perform smoke testing.

    You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.

Note

If you want to modify the queue to which jobs are committed, see Configure advanced parameters.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

Submit EMR Spark jobs

Step 1: Create an EMR Spark node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Spark node.

    1. Find the desired workflow, right-click the workflow name, and then choose Create Node > EMR > EMR Spark.

      Note

      Alternatively, you can move the pointer over the Create icon and choose Create Node > EMR > EMR Spark.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm. The configuration tab of the EMR Spark node appears.

      Note

      The node name can contain only letters, digits, underscores (_), and periods (.).

Step 2: Develop a Spark task

You can use one of the following methods based on your business requirements to develop a Spark task on the configuration tab of the EMR Spark node:

Method 1: Upload and reference an EMR JAR resource

DataWorks allows you to upload a resource from your on-premises machine to DataStudio and then reference the resource in node code. You must obtain the JAR package that is generated after the code of a Spark task is compiled, and store the package in EMR. The storage method varies based on the size of the JAR package:

  • If the JAR package is less than 200 MB in size, you can upload it to the DataWorks console as an EMR JAR resource and commit the resource.

  • If the JAR package is 200 MB or larger, store it in HDFS of EMR. For a Spark cluster that is created on the EMR on ACK page or an EMR Serverless Spark cluster, you cannot upload resources to HDFS.

A JAR package is less than 200 MB in size
  1. Create an EMR JAR resource.

    You can upload the JAR package from your on-premises machine to the DataWorks console as an EMR JAR resource. This way, you can manage the JAR package in the DataWorks console in a visualized manner. After you create an EMR JAR resource, you must commit the resource. For more information, see Create and use an EMR resource.


    Note

    The first time you create an EMR JAR resource, if you want the JAR package to be stored in OSS after it is uploaded, you must perform authorization as prompted.

  2. Reference the EMR JAR resource.

    1. Double-click the name of the created EMR Spark node to go to the configuration tab of the node.

    2. Find the desired EMR JAR resource under Resource in the EMR folder, right-click the resource name, and then select Insert Resource Path.

    3. Resource reference code is automatically added to the configuration tab of the EMR Spark node. Sample code:

      ##@resource_reference{"spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar"}
      spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar

      If the automatic addition of the preceding code is successful, the resource is referenced. spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar is the name of the JAR package that you uploaded.

    4. Rewrite the code of the EMR Spark node and add the spark-submit command. The following sample code is only for reference.

      Note

      You cannot add comments when you write code for an EMR Spark node. If you add comments, an error is reported when you run the EMR Spark node. You can refer to the following sample code to rewrite the code of an EMR Spark node.

      ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
      spark-submit --class org.apache.spark.examples.SparkPi --master yarn  spark-examples_2.11-2.4.0.jar 100

      Components:

      • org.apache.spark.examples.SparkPi: the main class of the task in the compiled JAR package.

      • spark-examples_2.11-2.4.0.jar: the name of the JAR package that you uploaded.

      • You can keep the settings of other parameters unchanged. You can also run the following command to view the help documentation for using the spark-submit command and modify the spark-submit command based on your business requirements.

        Note
        • If you want to use a parameter that is simplified by running the spark-submit command, such as --executor-memory 2G, in an EMR Spark node, you need to add the parameter to the code of the EMR Spark node.

        • You can use Spark nodes on YARN to submit jobs only if your nodes are in cluster mode.

        • If you commit a node by using spark-submit, we recommend that you set deploy-mode to cluster rather than client.

        spark-submit --help


A JAR package is greater than or equal to 200 MB in size
  1. Store the JAR package in HDFS of EMR.

    You cannot upload the JAR package from your on-premises machine to the DataWorks console as a DataWorks resource. We recommend that you store the JAR package in HDFS of EMR and record the storage path of the JAR package. This way, you can reference the JAR package in this path when you use DataWorks to schedule Spark tasks.

  2. Reference the JAR package.

    You can reference the JAR package by specifying the storage path of the JAR package in the code of an EMR Spark node.

    1. Double-click the name of the created EMR Spark node to go to the configuration tab of the node.

    2. Write the spark-submit command. Example:

      spark-submit --master yarn
      --deploy-mode cluster
      --name SparkPi
      --driver-memory 4G
      --driver-cores 1
      --num-executors 5
      --executor-memory 4G
      --executor-cores 1
      --class org.apache.spark.examples.JavaSparkPi
      hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100

      Parameter description:

      • hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar: the storage path of the JAR package in HDFS.

      • org.apache.spark.examples.JavaSparkPi: the main class of the task in the compiled JAR package.

      • Other parameters are configured in the EMR cluster that is used. You can modify the parameters based on your business requirements. You can also run the following command to view the help documentation for using the spark-submit command and modify the spark-submit command based on your business requirements.

        Important
        • If you want to use a parameter that is simplified by running the spark-submit command, such as --executor-memory 2G, in an EMR Spark node, you need to add the parameter to the code of the EMR Spark node.

        • You can use Spark nodes on YARN to submit jobs only if your nodes are in cluster mode.

        • If you commit a node by using spark-submit, we recommend that you set deploy-mode to cluster rather than client.

        spark-submit --help


Method 2: Reference an OSS resource

The current node can reference an OSS resource by using the OSS REF method. When you run a task on the node, DataWorks automatically loads the OSS resource specified in the node code. This method is commonly used in scenarios in which JAR dependencies are required in EMR tasks or EMR tasks need to depend on scripts.

  1. Develop a JAR package.

    1. Prepare code dependencies.

      You can access the EMR cluster and view the required code dependencies in the /usr/lib/emr/spark-current/jars/ path of the master node. The following example uses Spark 3.4.2. Open an existing IntelliJ IDEA project, and add the Project Object Model (POM) dependencies and the required plug-ins.

      Add POM dependencies
      <dependencies>
              <dependency>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-core_2.12</artifactId>
                  <version>3.4.2</version>
              </dependency>
              <!-- Apache Spark SQL -->
              <dependency>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-sql_2.12</artifactId>
                  <version>3.4.2</version>
              </dependency>
      </dependencies>
      Reference plug-ins
      <build>
              <sourceDirectory>src/main/scala</sourceDirectory>
              <testSourceDirectory>src/test/scala</testSourceDirectory>
              <plugins>
                  <plugin>
                      <groupId>org.apache.maven.plugins</groupId>
                      <artifactId>maven-compiler-plugin</artifactId>
                      <version>3.7.0</version>
                      <configuration>
                          <source>1.8</source>
                          <target>1.8</target>
                      </configuration>
                  </plugin>
                  <plugin>
                      <artifactId>maven-assembly-plugin</artifactId>
                      <configuration>
                          <descriptorRefs>
                              <descriptorRef>jar-with-dependencies</descriptorRef>
                          </descriptorRefs>
                      </configuration>
                      <executions>
                          <execution>
                              <id>make-assembly</id>
                              <phase>package</phase>
                              <goals>
                                  <goal>single</goal>
                              </goals>
                          </execution>
                      </executions>
                  </plugin>
                  <plugin>
                      <groupId>net.alchim31.maven</groupId>
                      <artifactId>scala-maven-plugin</artifactId>
                      <version>3.2.2</version>
                      <configuration>
                          <recompileMode>incremental</recompileMode>
                      </configuration>
                      <executions>
                          <execution>
                              <goals>
                                  <goal>compile</goal>
                                  <goal>testCompile</goal>
                              </goals>
                              <configuration>
                                  <args>
                                      <arg>-dependencyfile</arg>
                                      <arg>${project.build.directory}/.scala_dependencies</arg>
                                  </args>
                              </configuration>
                          </execution>
                      </executions>
                  </plugin>
              </plugins>
          </build>
    2. Write code. Sample code:

      package com.aliyun.emr.example.spark
      
      import org.apache.spark.sql.SparkSession
      
      object SparkMaxComputeDemo {
        def main(args: Array[String]): Unit = {
          // Create a Spark session.
          val spark = SparkSession.builder()
            .appName("HelloDataWorks")
            .getOrCreate()
      
          // Display the Spark version.
          println(s"Spark version: ${spark.version}")
        }
      }
    3. Package the code into a JAR file.

      After you write and save the preceding code, package the code into a JAR file. In this example, a file named SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar is generated.
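Assuming the Maven project shown above (the artifact name SparkWorkOSS-1.0-SNAPSHOT is illustrative and depends on your project's artifactId and version), one way to build the JAR file is:

```shell
# Build the project; the maven-assembly-plugin configured in the pom.xml
# produces the jar-with-dependencies artifact during the package phase.
mvn clean package

# With the assumed artifactId and version, the output file is:
# target/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
```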

  2. Upload the JAR file.

    1. Log on to the OSS console. In the top navigation bar, select a desired region. Then, in the left-side navigation pane, click Buckets.

    2. On the Buckets page, find the desired bucket and click the bucket name to go to the Objects page.

      In this example, the onaliyun-bucket-2 bucket is used.

    3. On the Objects page, click Create Directory to create a directory that is used to store the JAR file.

      In the Create Directory panel, set Directory Name to emr/jars and click OK.

    4. Upload the JAR file to the created directory.

      Go to the created directory. Click Upload Object. In the Files to Upload section, click Select Files and add the SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar file. Then, click Upload Object.

  3. Reference the JAR file.

    1. Write code that is used to reference the JAR file.

      On the configuration tab of the EMR Spark node, write code that is used to reference the JAR file.

      spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo --master yarn ossref://onaliyun-bucket-2/emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar

      Parameter description:

      • class: The full name of the main class to be executed.

      • master: The running mode of the Spark application.

      • ossref file path: The path of the JAR file, in the format ossref://{endpoint}/{bucket}/{object}.

        • endpoint: the endpoint of OSS. If the endpoint is left empty, only a resource in an OSS bucket that resides in the same region as the current EMR cluster can be referenced.

        • bucket: a container that is used to store objects in OSS. Each bucket has a unique name. You can log on to the OSS console to view all buckets within the current logon account.

        • object: the name or path of a file stored in the bucket.
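For illustration, an ossref path that specifies the endpoint explicitly might look like the following sketch. The endpoint shown is a hypothetical example; use the endpoint of the region where your bucket resides.

```shell
# ossref path with an explicit OSS endpoint (hypothetical endpoint value):
spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo --master yarn \
  ossref://oss-cn-hangzhou-internal.aliyuncs.com/onaliyun-bucket-2/emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
```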

    2. Run a task on the EMR Spark node.

      After you write the code, click the Run icon in the top toolbar and select a created serverless resource group to run a task on the EMR Spark node. After the task finishes running, record the application ID that is displayed in the console, such as application_1730367929285_xxxx.

    3. View results.

      Create an EMR Shell node and run the yarn logs -applicationId application_1730367929285_xxxx command on the EMR Shell node to view the running results.


(Optional) Configure advanced parameters

You can configure Spark-specific parameters on the Advanced Settings tab of the configuration tab of the current node. For more information about how to configure the parameters, see Spark Configuration. The following table describes the advanced parameters that can be configured for different types of EMR clusters.

DataLake cluster or custom cluster: created on the EMR on ECS page

• queue: The scheduling queue to which jobs are committed. Default value: default.

  If you have configured a workspace-level YARN queue when you register an EMR cluster to a DataWorks workspace, the following configurations apply:

    • If you select Yes for Global Settings Task Precedence, the YARN queue that is configured when you register the EMR cluster is used to run Spark tasks.

    • If you do not select Yes for Global Settings Task Precedence, the YARN queue that is configured for the EMR Spark node is used to run Spark tasks.

  For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue.

• priority: The priority. Default value: 1.

• FLOW_SKIP_SQL_ANALYZE: The manner in which SQL statements are executed. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false (default): Only one SQL statement is executed at a time.

  Note: This parameter is available only for testing in the development environment of a DataWorks workspace.

• Others:

    • You can also add a custom Spark parameter for the EMR Spark node on the Advanced Settings tab, such as spark.eventLog.enabled : false. When you commit the code of the EMR Spark node, DataWorks adds the custom parameter to the code in the --conf key=value format.

    • You can also configure global Spark parameters. For more information, see Configure global Spark parameters.

      Note: If you want to enable Ranger permission control, you must add spark.hadoop.fs.oss.authorization.method=ranger when you configure global Spark parameters to allow Ranger permission control to take effect.
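As a sketch of the --conf translation described above: a custom advanced parameter such as spark.eventLog.enabled : false would be carried into the submission roughly as follows. This is an illustration, not the literal command that DataWorks generates.

```shell
# The custom parameter "spark.eventLog.enabled : false" is appended in --conf key=value form:
spark-submit --conf spark.eventLog.enabled=false \
  --class com.aliyun.emr.example.spark.SparkMaxComputeDemo --master yarn \
  ossref://onaliyun-bucket-2/emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
```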

Hadoop cluster: created on the EMR on ECS page

• queue: The scheduling queue to which jobs are committed. Default value: default.

  If you have configured a workspace-level YARN queue when you register an EMR cluster to a DataWorks workspace, the following configurations apply:

    • If you select Yes for Global Settings Task Precedence, the YARN queue that is configured when you register the EMR cluster is used to run Spark tasks.

    • If you do not select Yes for Global Settings Task Precedence, the YARN queue that is configured for the EMR Spark node is used to run Spark tasks.

  For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue.

• priority: The priority. Default value: 1.

• FLOW_SKIP_SQL_ANALYZE: The manner in which SQL statements are executed. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false: Only one SQL statement is executed at a time.

  Note: This parameter is available only for testing in the development environment of a DataWorks workspace.

• USE_GATEWAY: Specifies whether to use a gateway cluster to commit jobs on the current node. Valid values:

    • true: Use a gateway cluster to commit jobs.

    • false: Do not use a gateway cluster to commit jobs. Jobs are automatically committed to the master node.

  Note: If the EMR cluster to which the node belongs is not associated with a gateway cluster but the USE_GATEWAY parameter is set to true, jobs may fail to be committed.

• Others:

    • You can also add a custom Spark parameter for the EMR Spark node on the Advanced Settings tab, such as spark.eventLog.enabled : false. When you commit the code of the EMR Spark node, DataWorks adds the custom parameter to the code in the --conf key=value format.

    • You can also configure global Spark parameters. For more information, see Configure global Spark parameters.

      Note: If you want to enable Ranger permission control, you must add spark.hadoop.fs.oss.authorization.method=ranger when you configure global Spark parameters to allow Ranger permission control to take effect.

Spark cluster: created on the EMR on ACK page

• queue: This parameter is not supported.

• priority: This parameter is not supported.

• FLOW_SKIP_SQL_ANALYZE: The manner in which SQL statements are executed. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false: Only one SQL statement is executed at a time.

  Note: This parameter is available only for testing in the development environment of a DataWorks workspace.

• Others:

    • You can also add a custom Spark parameter for the EMR Spark node on the Advanced Settings tab, such as spark.eventLog.enabled : false. When you commit the code of the EMR Spark node, DataWorks adds the custom parameter to the code in the --conf key=value format.

    • You can also configure global Spark parameters. For more information, see Configure global Spark parameters.

EMR Serverless Spark cluster

For more information about parameter settings, see the Step 3: Submit a Spark task section of the "Use the spark-submit CLI to submit a Spark job" topic.

• queue: The scheduling queue to which jobs are committed. Default value: dev_queue.

• priority: The priority. Default value: 1.

• FLOW_SKIP_SQL_ANALYZE: The manner in which SQL statements are executed. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false: Only one SQL statement is executed at a time.

  Note: This parameter is available only for testing in the development environment of a DataWorks workspace.

• SERVERLESS_RELEASE_VERSION: The version of the Spark engine. By default, the value specified by the Default Engine Version parameter on the Register EMR Cluster page is used. To go to the Register EMR Cluster page, go to SettingCenter, click Cluster Management in the left-side navigation pane, click Register Cluster, and then select E-MapReduce in the Select Cluster Type dialog box. You can configure this parameter to specify different engine versions for different types of tasks.

• SERVERLESS_QUEUE_NAME: The resource queue. By default, the value specified by the Default Resource Queue parameter on the Register EMR Cluster page is used. You can add queues to meet resource isolation and management requirements. For more information, see Manage resource queues.

• Others:

    • You can also add a custom Spark parameter for the EMR Spark node on the Advanced Settings tab, such as spark.eventLog.enabled : false. When you commit the code of the EMR Spark node, DataWorks adds the custom parameter to the code in the --conf key=value format.

    • You can also configure global Spark parameters. For more information, see Configure global Spark parameters.

Execute SQL statements

  1. Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select a created resource group for scheduling and click Run.

    Note
    • If you want to access a computing source over the Internet or a virtual private cloud (VPC), use the resource group for scheduling that is connected to the computing source. For more information, see Network connectivity solutions.

    • If you want to change the resource group in subsequent operations, you can click the Run with Parameters icon to change the resource group in the Parameters dialog box.

    • If you use an EMR Spark node to query data, a maximum of 10,000 data records can be returned, and the total size of the returned data records cannot exceed 10 MB.

  2. Click the Save icon in the top toolbar to save the SQL statements.

  3. (Optional) Perform smoke testing.

    You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, based on your business requirements, determine whether to review the task code after you commit the task.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

What to do next

After a task is deployed, it is automatically added to Operation Center. You can view the task running status in Operation Center or manually trigger the task to run. For more information, see Operation Center.

FAQ

  • After I prepare the DataWorks environment and submit an EMR Hive job, the java.net.ConnectException: Connection timed out (Connection timed out) error occurs.

    • Check whether the EMR cluster and the DataWorks environment are configured as required in the documentation, and confirm that the DataWorks resource group and the EMR cluster are associated with the same VPC and vSwitch.

    • Check the security group rules of the EMR cluster to ensure that port 10000 of the ECS instance is open. For more information, see Manage security groups. When you submit jobs of other components in DataWorks, you need to open the corresponding ECS ports. For more information, see Commonly used ports of open source components.

References

  • If your task needs to be periodically scheduled to run, you need to define the scheduling-related properties of the task, including the scheduling cycle, scheduling dependencies, and scheduling parameters. For more information, see Node scheduling configuration.

  • If your task requires complex string processing or mathematical operations, you can create user-defined functions in DataWorks. For more information, see Create an EMR function.